2024-04-02

“GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.” So reads a paper about GPT-4, the company’s latest large language model, that OpenAI released last year through the open-access repository arXiv.

OpenAI’s claim was based on a 2023 study in which a group of researchers led by Daniel Martin Katz, a professor at Illinois Tech’s Chicago-Kent College of Law, administered the Uniform Bar Exam to GPT-4. In their tests, the results of which were published in another repository, the Social Science Research Network (SSRN), GPT-4 scored an impressive 297 out of 400 on the bar exam. (Minimum passing scores vary by state and range from 260 to 272.)

But new research by Massachusetts Institute of Technology PhD candidate Eric Martínez suggests that both the characterization and the model’s actual score may have been misleading.

Martínez, who earned his law degree at Harvard, says that the 90th-percentile figure was relative to a pool that is “heavily skewed” toward repeat test-takers who had already failed the exam one or more times in the past. That’s a significantly lower-performing group than the general population of people who took the exam, writes Martínez in a paper of his own, published over the weekend in the journal Artificial Intelligence and Law.