Accordingly, even the best-performing AI model configuration they tested, OpenAI's GPT-4-Turbo, achieved only a 79% correct-answer rate despite being given the entire filing to read, and it often "hallucinated" figures or events that do not exist.
“That kind of performance rate is completely unacceptable,” said Anand Kannappan, co-founder of Patronus AI. “The correct answer rate needs to be much higher to be automated and production-ready.”
The findings highlight some of the challenges facing AI models as large companies, especially in highly regulated industries like finance, look to incorporate advanced technology into their operations, whether it's customer service or research.
Financial data "hallucinations"
The ability to quickly extract key numbers and perform financial statement analysis has been seen as one of the most promising applications for chatbots since ChatGPT was released late last year.
SEC filings contain important data, and if a bot can accurately summarize or quickly answer questions about their contents, it could give users an edge in the competitive financial industry.
Over the past year, Bloomberg LP has developed its own AI model for financial data, and business school professors have been studying whether ChatGPT can analyze financial headlines.
Meanwhile, JPMorgan is also developing an AI-powered automated investment tool. A recent McKinsey forecast said generative AI could boost the banking industry by trillions of dollars a year.
But there’s still a long way to go. When Microsoft first launched Bing Chat using OpenAI’s GPT, it showed off the chatbot quickly summarizing earnings press releases. Observers soon noticed that the numbers the AI produced were inaccurate or even fabricated.
Same data, different answers
Part of the challenge of incorporating LLMs into real-world products is that the models are not deterministic: they are not guaranteed to produce the same output given the same input. That means companies must test far more rigorously to ensure the AI works correctly, stays on topic, and delivers reliable results.
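The kind of testing this implies can be illustrated with a short sketch. The `sampled_model` function below is a hypothetical stand-in for a real LLM call (not any vendor's API): with sampling enabled, the same prompt can produce different answers on different runs, so a test harness asks the same question repeatedly and measures agreement rather than checking a single response.

```python
import random

def sampled_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: with sampling enabled,
    the same prompt can yield different completions on different runs."""
    return random.choice(["$12.4 billion", "$12.4B", "$21.4 billion"])

def consistency(prompt: str, runs: int = 20) -> float:
    """Ask the same question many times and return the fraction of
    runs that agree with the most common answer."""
    answers = [sampled_model(prompt) for _ in range(runs)]
    top = max(set(answers), key=answers.count)
    return answers.count(top) / runs
```

A production harness built on this idea could fail a release if `consistency` drops below some threshold, which is one way teams turn a non-deterministic model into something they can sign off on.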
Patronus AI built FinanceBench, a set of more than 10,000 questions and answers drawn from the SEC filings of large publicly traded companies. The dataset includes the correct answers as well as the exact location in a given filing where each can be found.
Not all answers can be lifted directly from the text; some questions require calculation or light reasoning.
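A benchmark structured this way lends itself to simple automated grading. The sketch below uses a hypothetical record layout (the field names are assumptions, not the real FinanceBench schema) and a deliberately crude exact-match grader; the article's accuracy percentages come from comparing model predictions against gold answers in roughly this fashion.

```python
from dataclasses import dataclass

# Hypothetical record layout -- the real FinanceBench schema may differ.
@dataclass
class QARecord:
    question: str      # e.g. a question about a line item in a 10-K
    answer: str        # the gold answer drawn from the filing
    source_doc: str    # which SEC filing the answer comes from
    location: str      # where in that filing the evidence sits

def normalize(s: str) -> str:
    """Crude normalization so '$1,000' and '$1000' compare equal."""
    return s.lower().replace(",", "").strip()

def grade(prediction: str, record: QARecord) -> bool:
    """Exact-match grading after normalization (real grading would
    need to be more lenient, e.g. tolerating unit differences)."""
    return normalize(prediction) == normalize(record.answer)

def accuracy(predictions: list[str], records: list[QARecord]) -> float:
    """Fraction of questions answered correctly across the test set."""
    correct = sum(grade(p, r) for p, r in zip(predictions, records))
    return correct / len(records)
```

Storing the evidence location alongside each answer is what makes the "Oracle"-style tests described below possible: the harness can hand the model the exact supporting text and measure how much accuracy improves.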
The test, run on a 150-question subset, involved four LLMs: OpenAI's GPT-4 and GPT-4-Turbo, Anthropic's Claude 2, and Meta's Llama 2.
As a result, GPT-4-Turbo achieved an accuracy rate of only 85% even when granted access to the underlying SEC filings and pointed by a human to the exact text containing the answer; without access to the data, it failed on 88% of the questions.
Llama 2, an open-source AI model developed by Meta, had the highest number of “hallucinations,” getting 70% of the answers wrong and only 19% correct when given access to a portion of the underlying documents.
Anthropic's Claude 2 performed well when given a "long context," in which nearly the entire relevant SEC filing is included along with the question. It was able to answer 75% of the questions posed, incorrectly answering 21% and refusing to answer 3%. GPT-4-Turbo also performed well with a long context, correctly answering 79% of the questions and incorrectly answering 17% of them.
(According to CNBC)