Who performs better: students or ChatGPT?

Wednesday, 26 April, 2023

A large crowd-sourced study has used more than 25,000 questions across 186 institutions’ accounting assessments to determine whether ChatGPT outperforms students.

The study found that students did better than the artificial intelligence tool. It also found that ChatGPT sometimes made up facts, made nonsensical errors such as adding two numbers in a subtraction problem, and often provided descriptive explanations for its answers, even when those answers were incorrect.

The study’s 328 co-authors from around the world, including University of Auckland accounting and finance academics Ruth Dimes and David Hay, entered assessment questions into ChatGPT-3 and evaluated the accuracy of its responses between December 2022 and January 2023.

Dimes, who directs the Business School’s Business Masters program, used two recent exams from the ‘analysing financial statements’ course.

“I entered the exam questions into ChatGPT and recorded how it performed compared to the students’ grades. My findings were consistent with the study overall and I was surprised that ChatGPT didn't perform as well as I thought it might have,” she said.

Meanwhile, David Hay, Professor of Auditing, used exam and test questions from the auditing course and found that the bot performed slightly better in auditing courses than in financial accounting courses, but still not as well as the students.

The study, led by Professor David Wood of Brigham Young University in Utah, includes a total of 25,817 questions (25,181 gradable by ChatGPT) that appeared across 869 different class assessments, as well as 2,268 questions from textbook test banks covering topics such as accounting information systems (AIS), auditing, financial accounting, managerial accounting and tax.

The co-authors evaluated ChatGPT’s answers to the questions they entered and determined whether they were correct, partially correct, or incorrect. The results indicate that across all assessments, students scored an average of 76.7%, while ChatGPT scored 47.4% based on fully correct answers. However, after giving ChatGPT some credit for partially correct answers, it would have scraped through many courses with an average of 56.5% overall.

The study also revealed differences in ChatGPT’s performance based on the topic area of the assessment. Specifically, the chatbot performed better on AIS and auditing assessments than on tax, financial and managerial assessments.

Dimes said she was interested in seeing how newer versions of ChatGPT and other AI tools would perform if a similar study were undertaken at another point in time.

“These tools will perform better over time and the study highlights the importance of thinking carefully about what universities assess and how. Are we assessing critical thinking as opposed to something that can be rote learned and regurgitated?

“ChatGPT has already changed how we teach and learn. Many teaching staff run our assessments through the tool so we’re aware of what it might come up with.”

Dimes said the study, believed to be the first of its kind in the accounting field, was a unique experience.

“One of the most interesting parts of this for me was the process of gathering the data. It was amazing to see the speed at which researchers all over the world collated their data and trusted in the process. It was a really collaborative and effective way to do research,” she said.

Image credit: iStock.com/Laurence Dutton