AGI's Pop Quiz: TheoremQA Stumps AI
How a model performs on a new, unseen, out-of-distribution benchmark is a core distinguishing feature on the path toward AGI and a fundamental differentiator from classical ML, where we only cared about a fixed set of specific benchmarks or in-distribution test sets. The new work, TheoremQA,… https://x.com/WenhuChen/status/1710827254408679807