Is ChatGPT’s Output Degrading?


A recent study from Stanford University and UC Berkeley found that the behavior of large language models (LLMs) like ChatGPT has “drifted substantially” over time, though this does not necessarily indicate a degradation of capabilities. The researchers tested the March and June 2023 versions of GPT-3.5 and GPT-4 on tasks including math problems, answering sensitive questions, code generation, and visual reasoning, and found significant changes in performance between the two snapshots. For instance, GPT-4’s accuracy on the math task (determining whether a given number is prime) dropped from 97.6% to 2.4%, while GPT-3.5’s accuracy on the same task rose from 7.4% to 86.8%.
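To make that kind of comparison concrete, here is a minimal sketch of a drift check in the spirit of the study: the same prime-number question is sent to two dated model snapshots and accuracy is scored against a simple ground-truth function. This is an illustration, not the paper’s actual harness; it assumes the OpenAI Python SDK (v1), the dated snapshot names "gpt-4-0314" and "gpt-4-0613" (which may no longer be served), and a test set and prompt wording chosen purely for this example.

```python
# Minimal drift-check sketch (not the study's actual harness).
# Assumes the OpenAI Python SDK v1 and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def is_prime(n: int) -> bool:
    """Ground truth via trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def model_says_prime(model: str, n: int) -> bool:
    """Ask a model snapshot whether n is prime; expect a Yes/No answer."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Is {n} a prime number? Answer only Yes or No."}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def accuracy(model: str, numbers: list[int]) -> float:
    """Fraction of numbers the model classifies correctly."""
    correct = sum(model_says_prime(model, n) == is_prime(n) for n in numbers)
    return correct / len(numbers)

test_numbers = [7919, 7920, 10007, 10008, 104729]  # mix of primes and composites
for snapshot in ("gpt-4-0314", "gpt-4-0613"):      # dated snapshots from the study period
    print(snapshot, accuracy(snapshot, test_numbers))
```

Pinning temperature to 0 and scoring against a deterministic ground truth keeps the comparison about the model snapshot itself rather than sampling noise.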

The study’s findings highlight the risks of building applications on top of black-box AI systems like ChatGPT, which can produce inconsistent or unpredictable results over time. The researchers recommend continuous evaluation of LLMs in production applications and call for more transparency in the data and methods used to train and fine-tune these models. However, some experts argue that the media has misinterpreted the paper’s results as confirmation that GPT-4 has gotten worse: a change in behavior is not the same as a loss of capability. A model that starts formatting its answers differently, for example, can fail an automated check without having lost the underlying skill.
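In the spirit of that recommendation, continuous evaluation can be as simple as replaying a fixed “golden” prompt set against the production model on a schedule and alerting when accuracy falls below a baseline. The sketch below shows one possible shape for such a regression check; the `ask_model` callable, the golden set, and the 0.95 threshold are all placeholders for whatever client and tolerance a real application would use.

```python
# Sketch of a scheduled regression check for an LLM-backed application.
# ask_model is a placeholder: plug in whatever client the application uses.
from typing import Callable

GoldenCase = tuple[str, str]  # (prompt, expected substring in the answer)

GOLDEN_SET: list[GoldenCase] = [
    ("Is 17077 a prime number? Answer Yes or No.", "yes"),
    ("What is 12 * 12? Answer with the number only.", "144"),
]

def run_regression(ask_model: Callable[[str], str],
                   baseline: float = 0.95) -> bool:
    """Replay the golden set; return True if accuracy meets the baseline."""
    hits = 0
    for prompt, expected in GOLDEN_SET:
        answer = ask_model(prompt).strip().lower()
        if expected in answer:
            hits += 1
    accuracy = hits / len(GOLDEN_SET)
    print(f"golden-set accuracy: {accuracy:.2%}")
    return accuracy >= baseline

if __name__ == "__main__":
    # Stub model so the sketch runs standalone; replace with a real API call.
    stub = lambda prompt: "Yes, 144"
    if not run_regression(stub):
        print("ALERT: model behavior drifted below baseline")
```

Logging each run’s accuracy over time, rather than only alerting on failures, is what actually surfaces the kind of gradual drift the study describes.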

