A recent study from Stanford University and UC Berkeley found that the behavior of large language models (LLMs) like ChatGPT has “drifted substantially” over time, though this does not necessarily indicate a degradation of capabilities. The researchers tested the March and June 2023 versions of GPT-3.5 and GPT-4 on tasks such as solving math problems (identifying prime numbers), answering sensitive questions, generating code, and visual reasoning, and found significant changes in performance between the two versions. For instance, GPT-4’s accuracy on the math task dropped from 97.6% to 2.4%, while GPT-3.5’s rose from 7.4% to 86.8%.
The findings highlight the risks of building applications on top of black-box AI systems like ChatGPT, which can produce inconsistent or unpredictable results over time. The researchers recommend continuously evaluating LLMs used in production applications and call for more transparency in the data and methods used to train and fine-tune these models. However, some experts argue that the media has misinterpreted the paper’s results as confirmation that GPT-4 has gotten worse; a shift in behavior, they note, is not the same as a loss of capability.
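For developers building on these models, that kind of continuous evaluation can be as simple as re-running a fixed set of prompts on a schedule and logging the results. The sketch below illustrates the idea; the prompts, the ask() stub, and the exact-match scoring are illustrative placeholders, not the benchmark the researchers actually used.

```python
from datetime import date
from typing import Callable, List, Optional, Tuple

# Fixed regression suite: (prompt, expected answer) pairs that matter to your app.
# These example prompts are hypothetical, not drawn from the study's test set.
EVAL_SET: List[Tuple[str, Optional[str]]] = [
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("Write one sentence summarizing photosynthesis.", None),  # open-ended; review by hand
]

def evaluate(ask: Callable[[str], str]) -> float:
    """Run the suite against a model and return accuracy on the scoreable items."""
    scored = [(p, e) for p, e in EVAL_SET if e is not None]
    correct = sum(
        1 for prompt, expected in scored
        if expected.lower() in ask(prompt).strip().lower()
    )
    return correct / len(scored)

if __name__ == "__main__":
    # Stand-in for whichever model or API your application actually calls.
    def ask(prompt: str) -> str:
        return "Yes, 17077 is a prime number."

    # Append the dated score to a log so drift between model versions is visible.
    print(f"{date.today()}: accuracy={evaluate(ask):.2%}")
```

Run on a schedule, a log like this makes it obvious when a silent model update changes answers your application depends on.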