OpenAI’s latest o3 model has achieved a breakthrough that has surprised the AI research community. o3 scored an unprecedented 75.7% on the notoriously difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.
While the achievement in ARC-AGI is impressive, it does not yet prove that the code to artificial general intelligence (AGI) has been cracked.
Abstraction and Reasoning Corpus
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus, which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC is composed of visual puzzles that require an understanding of basic concepts such as objects, boundaries, and spatial relationships. While humans can easily solve ARC puzzles from very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging benchmarks in AI.
Example of ARC puzzle (source: arcprize.org)
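As a rough illustration of the format, an ARC task can be modeled as small integer grids plus a hidden transformation rule that a solver must infer from a few demonstration pairs. The sketch below is purely illustrative: the "color swap" rule and helper function are invented here, and real ARC rules are far richer, involving objects, boundaries, and symmetry.

```python
# Hypothetical sketch of an ARC-style task: grids are small integer
# matrices, and the solver must infer the transformation rule from a
# handful of demonstration pairs. The rule here (a simple color swap)
# is invented for illustration.

def apply_rule(grid, mapping):
    """Apply a per-cell color mapping -- one toy family of ARC rules."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]

# Two demonstration grids sharing the same hidden rule (swap colors 1 and 2).
rule = {1: 2, 2: 1}
demo_input = [[0, 1], [2, 0]]
demo_output = apply_rule(demo_input, rule)  # [[0, 2], [1, 0]]

# A test input the solver has never seen; the correct answer follows
# the same rule rather than any memorized example.
test_input = [[1, 1], [0, 2]]
print(apply_rule(test_input, rule))  # [[2, 2], [0, 1]]
```

Because each task ships with only a few demonstrations, a solver cannot rely on pattern coverage; it has to recover the rule itself.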
ARC has been designed in a way that it can’t be cheated by training models on millions of examples in hopes of covering all possible combinations of puzzles.
The benchmark is composed of a public training set of 400 simple examples, complemented by a public evaluation set of 400 more challenging puzzles meant to measure the generalizability of AI systems. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each, which are not shared publicly. They are used to evaluate candidate AI systems without risking leaking the data and contaminating future systems with prior knowledge. Furthermore, the competition caps the amount of computation participants can use, ensuring that the puzzles are not solved through brute force.
A breakthrough in solving novel tasks
o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another method developed by researcher Jeremy Berman used a hybrid approach, combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3.
In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.”
It is important to note that throwing more compute at previous generations of models could not achieve these results. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don’t know much about o3’s architecture, we can be confident that it is not orders of magnitude larger than its predecessors.
Performance of different models on ARC-AGI (source: arcprize.org)
“This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”
It is worth noting that o3’s performance on ARC-AGI comes at a steep cost. On the low-compute configuration, the model spends $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget, it uses around 172X more compute and billions of tokens per problem. However, as inference costs continue to fall, we can expect these figures to become more reasonable.
A new paradigm in LLM reasoning?
The key to solving novel problems is what Chollet and other scientists refer to as “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine these programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that are beyond their training distribution.
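The idea of program synthesis described above can be sketched concretely: compose a handful of small primitive programs and search for a composition that explains the demonstration pair. The primitives and the brute-force search below are invented for illustration; they are a toy stand-in, not how any production system works.

```python
# A minimal illustration of program synthesis: small primitive programs
# are composed until some composition maps the demonstration input to
# the demonstration output. All primitives here are invented examples.

from itertools import product

def flip_h(grid):     # mirror a grid left-to-right
    return [row[::-1] for row in grid]

def flip_v(grid):     # mirror a grid top-to-bottom
    return grid[::-1]

def increment(grid):  # add 1 to every cell
    return [[cell + 1 for cell in row] for row in grid]

PRIMITIVES = [flip_h, flip_v, increment]

def synthesize(demo_in, demo_out, depth=2):
    """Brute-force search over length-`depth` compositions of primitives
    for one that maps the demonstration input to the demonstration output."""
    for combo in product(PRIMITIVES, repeat=depth):
        grid = demo_in
        for fn in combo:      # apply the composition left to right
            grid = fn(grid)
        if grid == demo_out:
            return combo
    return None

# Hidden rule for this toy task: flip horizontally, then increment.
demo_in = [[0, 1], [2, 3]]
demo_out = [[2, 1], [4, 3]]
program = synthesize(demo_in, demo_out)
print([fn.__name__ for fn in program])  # ['flip_h', 'increment']
```

The toy search already shows the core difficulty: the space of compositions grows exponentially with program length, which is why naive enumeration cannot scale and compositional reasoning inside the model matters.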
Unfortunately, there is very little information about how o3 works under the hood, and here, the opinions of scientists diverge. Chollet speculates that o3 uses a form of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in recent months.
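The search-plus-reward-model mechanism Chollet speculates about can be sketched as simple best-of-N sampling: draw several candidate reasoning chains and keep the one a scorer ranks highest. The generator and scorer below are deterministic stand-ins invented for illustration, not OpenAI's actual components.

```python
# A hedged sketch of search guided by a reward model: sample several
# chain-of-thought candidates, score each, and keep the best. Both the
# generator and the scorer here are toy stand-ins.

import random

def generate_candidate(prompt, rng):
    """Stand-in for an LLM sampling one reasoning chain plus an answer."""
    return f"{prompt} -> candidate-{rng.randint(0, 9)}"

def reward_model(candidate):
    """Stand-in scorer; a real reward model would estimate answer quality."""
    return sum(ord(ch) for ch in candidate) % 100

def best_of_n(prompt, n=8, seed=0):
    """Sample n candidates and return the one the reward model prefers."""
    rng = random.Random(seed)
    candidates = [generate_candidate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward_model)

print(best_of_n("solve the puzzle"))
```

Even this toy version illustrates why the approach is compute-hungry: quality scales with the number of sampled candidates, so better answers cost proportionally more tokens.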
Other scientists such as Nathan Lambert from the Allen Institute for AI suggest that “o1 and o3 can actually be just the forward passes from one language model.” On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”