Which AI chatbot is the best at simple math? Gemini, ChatGPT, Grok put to the test

Artificial Intelligence (AI) is becoming an integral part of daily life, including everyday calculations. But how well do these systems actually handle basic math? And how much should users trust them?

A recent study advises caution. The Omni Research on Calculation in AI (ORCA) shows that when you ask an AI chatbot to perform everyday math, there is roughly a 40 per cent chance it will get the answer wrong. Accuracy varies significantly across AI companies and across different types of mathematical tasks.

So which AI tools are more accurate, and how do they perform across different types of calculations, such as statistics, finance, or physics?

The results are based on 500 prompts drawn from real-world, calculable problems. All five AI models were given the same set of questions and were tested in October 2025.

The chosen models are:

ChatGPT-5 (OpenAI)
Gemini 2.5 Flash (Google)
Claude 4.5 Sonnet (Anthropic)
DeepSeek V3.2 (DeepSeek AI)
Grok-4 (xAI).

The ORCA Benchmark found that no AI model scored above 63 per cent in everyday math. The leader, Gemini (63 per cent), still gets nearly 4 out of 10 problems wrong. Grok has almost the same score at 62.8 per cent. DeepSeek ranks third at 52 per cent. ChatGPT follows with 49.4 per cent, and Claude comes last at 45.2 per cent.

The simple average of the five models is 54.5 per cent. These scores reflect the models’ overall performance across all 500 prompts.
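The 54.5 per cent figure can be reproduced from the five overall scores reported above. The short Python sketch below is an illustration only, not code from the ORCA report; the function name `simple_average` is our own.

```python
# Overall ORCA accuracy scores (per cent), as reported in the article.
scores = {
    "Gemini 2.5 Flash": 63.0,
    "Grok-4": 62.8,
    "DeepSeek V3.2": 52.0,
    "ChatGPT-5": 49.4,
    "Claude 4.5 Sonnet": 45.2,
}


def simple_average(values):
    """Unweighted mean: every model counts equally."""
    values = list(values)
    return sum(values) / len(values)


average = simple_average(scores.values())
print(f"Simple average: {average:.1f} per cent")
```

A simple average gives each model the same weight, so it describes the group of five rather than any single chatbot.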

“Although the exact rankings might shift if we repeated the benchmark today, the broader conclusion would likely remain the same: numerical reliability remains a weak spot across current AI models,” Dawid Siuda, co-author of the ORCA Benchmark, told Euronews Next.

Highest accuracy at math & conversions, lowest at physics

The models’ performance varies by category. In math and conversions (147 of the 500 prompts), Gemini leads with 83 per cent, followed by Grok at 76.9 per cent and DeepSeek at 74.1 per cent. ChatGPT scores 66.7 per cent in this category.

The simple average accuracy across all five models is 72.1 per cent, the highest among the seven categories.

By contrast, physics (128 prompts) is the weakest category, with an average accuracy of just 35.8 per cent. Grok performs best at 43.8 per cent, slightly ahead of Gemini at 43 per cent, while Claude drops to 26.6 per cent.

Across the seven categories, Gemini and Grok each rank first in three, and they share the top spot in one.

DeepSeek’s accuracy is just 11 per cent in biology and chemistry

DeepSeek recorded the lowest accuracy of any model in any category: 10.6 per cent in biology and chemistry. This means the model failed to provide a correct answer to roughly nine out of ten questions.

The largest performance gaps appear in finance and economics. Grok and Gemini both reach 76.7 per cent accuracy, while the other three models (ChatGPT, Claude, and DeepSeek) fall below 50 per cent.

Warning to users: Always double-check with a calculator

“If the task is critical, use calculators or proven sources, or at least double-check with another AI,” Siuda said.

Four mistakes that AI models make

The experts grouped the mistakes into four categories. The challenge lies in ‘translating’ a real-world situation into the right formula, according to the report.

1. “Sloppy math” errors (68 per cent of all mistakes). In these cases, the AI understands the question and the formula but fails in the actual computation. This category includes ‘precision and rounding issues’ (35 per cent) and ‘calculation errors’ (33 per cent).

For example, one prompt asked: “For a lottery where 6 balls are drawn from a pool of 76, what are my chances of matching 5 of them?” The correct result is ‘1 in 520521’. ChatGPT-5 answered ‘1 in 401397’.

2. “Faulty logic” errors (26 per cent of all mistakes). These are more serious because they show the AI is struggling to understand the underlying logic of the problem. They include ‘method or formula errors’ (14 per cent), such as using a completely incorrect mathematical approach, and ‘wrong assumptions’ (12 per cent).

3. “Misreading the instructions” errors (5 per cent of all mistakes). These occur when the AI fails to correctly interpret what the question is asking. Examples include ‘wrong parameter errors’ and ‘incomplete answers’.

4. “Giving up” errors. In some cases, the AI simply refuses or deflects the question rather than attempting an answer.
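The lottery question quoted above can be checked with a few lines of Python. This is a sketch under two assumptions that the article does not spell out: the player picks 6 numbers, and “matching 5” means exactly 5 of the 6 drawn balls. The function name `odds_of_matching` is ours, not the benchmark’s.

```python
from math import comb


def odds_of_matching(pool, drawn, matched):
    """Count the winning tickets and all possible draws.

    Assumes the player picks `drawn` numbers and wins by matching
    exactly `matched` of the `drawn` balls taken from `pool`.
    """
    # Choose which drawn balls are matched, then fill the remaining
    # picks from the balls that were not drawn.
    ways = comb(drawn, matched) * comb(pool - drawn, drawn - matched)
    total = comb(pool, drawn)
    return ways, total


ways, total = odds_of_matching(pool=76, drawn=6, matched=5)
one_in = total / ways
print(f"1 in {one_in:.0f}")  # prints "1 in 520521"
```

Under these assumptions the calculation agrees with the expected answer of ‘1 in 520521’ given in the example.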

“Their weak spot is rounding – if the calculation is multi-step and requires rounding at some point, the end result is usually far off,” Siuda said.
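To see why rounding in the middle of a multi-step calculation matters, consider a hypothetical savings example. The figures below (10,000 invested at 7 per cent a year, compounded monthly for 30 years) are invented for illustration and do not come from the ORCA benchmark. The sketch compares keeping the monthly rate at full precision with rounding it to three decimal places before compounding.

```python
def compound(principal, monthly_rate, months):
    """Grow `principal` by `monthly_rate` once per month."""
    return principal * (1 + monthly_rate) ** months


# Hypothetical inputs, chosen only to illustrate the effect.
principal = 10_000
annual_rate = 0.07
months = 30 * 12

exact_rate = annual_rate / 12          # full floating-point precision
rounded_rate = round(exact_rate, 3)    # rounded too early: 0.006

exact_total = compound(principal, exact_rate, months)
early_rounded_total = compound(principal, rounded_rate, months)

print(f"Full precision:  {exact_total:,.2f}")
print(f"Rounded early:   {early_rounded_total:,.2f}")
print(f"Difference:      {early_rounded_total - exact_total:,.2f}")
```

A change in the fourth decimal place of the monthly rate is compounded 360 times, so the two totals end up thousands apart. Rounding only the final result avoids this.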

The research used the most advanced models available to the general public for free. Every single question prompt had one, and only one, correct answer.
