OpenAI’s new ChatGPT models found to “hallucinate” more often

Cryptopolitan | 2025/04/19 12:01
By Shummas Humayun

In this post:
- OpenAI’s new o3 and o4‑mini models hallucinate more than older versions, breaking past improvement trends.
- Tests show o3 and o4‑mini make up facts up to twice as often, with o4‑mini hallucinating nearly half the time.
- Real-time search may help reduce errors, but the root cause of rising hallucinations remains unknown.

OpenAI’s newest reasoning models, o3 and o4‑mini, produce made‑up answers more often than the company’s earlier models, as shown by internal and external tests. 

The rise in so‑called hallucinations breaks a long‑running pattern in which each new release tended to fabricate less than the model before it.

OpenAI’s own numbers put the problem in stark terms. On PersonQA, a company benchmark that checks how well a model recalls facts about people, o3 invented material in 33 percent of responses, about double the rates logged by o1 and o3‑mini, which scored 16 percent and 14.8 percent. O4‑mini fared even worse, hallucinating 48 percent of the time.
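Read as fractions of graded responses, the gap is easy to check: a 33 percent rate is roughly double o1's 16 percent. A toy sketch of the arithmetic, using illustrative counts rather than the actual benchmark data:

```python
# Hallucination rate on a PersonQA-style benchmark: the fraction of graded
# responses flagged as containing fabricated claims. Counts are illustrative.

def hallucination_rate(graded):
    """graded: list of booleans, True if a response was flagged as hallucinated."""
    return sum(graded) / len(graded)

# Toy grading results mirroring the reported percentages: 33 and 16 of 100 flagged.
o3_graded = [True] * 33 + [False] * 67
o1_graded = [True] * 16 + [False] * 84

o3_rate = hallucination_rate(o3_graded)  # 0.33
o1_rate = hallucination_rate(o1_graded)  # 0.16
print(f"o3: {o3_rate:.0%}, o1: {o1_rate:.0%}, ratio: {o3_rate / o1_rate:.2f}x")
```

The ratio comes out just above 2x, matching the "about double" framing in the report.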

A technical report details the findings. Engineers write that the new models outperform earlier versions in coding and math, yet because they “make more claims overall,” they also make “more accurate claims as well as more inaccurate / hallucinated claims.” The document adds that “more research is needed” to explain the slide in reliability.

OpenAI classifies o‑series systems as reasoning models, a category the firm and much of the industry have embraced over the past year. Traditional, non‑reasoning models can still beat the latest duo on truthfulness: GPT‑4o with web search achieves 90 percent accuracy on SimpleQA, another in‑house benchmark.


OpenAI’s o3 model is making up steps

Transluce, a nonprofit AI research lab, documented the o3 model fabricating steps it claimed to have taken. In one run, the model said it had executed code on a 2021 MacBook Pro “outside of ChatGPT,” then copied the numbers back. The model has no such capability.

“Our hypothesis is that the kind of reinforcement learning used for o‑series models may amplify issues that are usually mitigated (but not fully erased) by standard post‑training pipelines,” said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email.

Transluce co‑founder Sarah Schwettmann said the higher error rate could make o3 less helpful than its raw skills suggest.

Kian Katanforoosh, a Stanford adjunct professor, told TechCrunch his team is already testing o3 for coding tasks and sees it as “a step above the competition.” Yet he reported another flaw: the model often returns web links that do not work when clicked.

Hallucinations can spur creativity, but they make the systems a tough sell for businesses that need accuracy. A law firm drafting contracts, for example, is unlikely to tolerate frequent factual mistakes.

Real-time search could reduce hallucinations in AI models

One possible solution is real‑time search. GPT‑4o with web search already scores better on SimpleQA, and the report suggests the same tactic could cut hallucinations in reasoning models, at least when users are willing to send prompts to a third‑party engine.
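The mechanism behind that tactic is retrieval‑augmented prompting: fetch fresh snippets at query time and instruct the model to answer only from them. A minimal sketch, where `web_search` and `ask_model` are hypothetical stand‑ins for a real search API and model call, stubbed so the example runs end to end:

```python
# Retrieval-augmented prompting sketch: ground the answer in retrieved
# snippets instead of relying on the model's parametric memory alone.
# `web_search` and `ask_model` are stand-ins, not real APIs.

def web_search(query):
    # Stand-in: a real implementation would call a search engine API here.
    return ["OpenAI released the o3 and o4-mini reasoning models in April 2025."]

def ask_model(prompt):
    # Stand-in: a real implementation would call a chat-completions API here.
    return f"Answer grounded in {prompt.count('SOURCE:')} retrieved source(s)."

def grounded_answer(question):
    snippets = web_search(question)
    context = "\n".join(f"SOURCE: {s}" for s in snippets)
    prompt = (
        "Answer using ONLY the sources below; reply 'unknown' if they "
        f"do not contain the answer.\n{context}\nQuestion: {question}"
    )
    return ask_model(prompt)

print(grounded_answer("When did OpenAI release o3?"))
```

The key design choice is the instruction to refuse when the sources are silent, which trades a little coverage for fewer fabricated claims.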


“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” OpenAI spokesperson Niko Felix said in an email.

Whether real-time search alone will solve the problem remains unclear. The report warns that if scaling up reasoning models keeps worsening hallucinations, the hunt for fixes will grow more urgent. Researchers have long called hallucinations one of the hardest issues in AI, and the latest findings underline how far there is to go.

For OpenAI, credibility matters as ChatGPT spreads into workplaces, classrooms, and creative studios. Engineers say they will keep tuning reinforcement learning, data selection, and tool use to bring the numbers down. Until then, users must weigh sharper skills against a higher chance of being misled.

