Dario Amodei, CEO of Anthropic, shares the company's latest research. Photo: Fortune . |
Researchers at AI company Anthropic say they have made a fundamental breakthrough in understanding exactly how large language models (LLMs) work, a breakthrough that could have important implications for improving the safety and security of future AI models.
Research shows that AI models are even smarter than we thought. One of the biggest problems with LLM models, which are behind the most powerful chatbots like ChatGPT, Gemini, Copilot, is that they act like a black box.
We can enter inputs and get results from chatbots, but how they come up with a specific answer remains a mystery, even to the researchers who built them.
This makes it hard to predict when a model might hallucinate, or produce false results. Researchers have also built barriers to prevent AI from answering dangerous questions, but they don’t explain why some barriers are more effective than others.
AI agents are also capable of “reward hacking.” In some cases, AI models can lie to users about what they have done or are trying to do.
Although recent AI models are capable of reasoning and generating chains of thought, some experiments have shown that they still do not accurately reflect the process by which the model arrives at an answer.
In essence, the tool the Anthropic researchers developed is similar to the fMRI scanners neuroscientists use to scan the human brain. By applying it to their Claude 3.5 Haiku model, Anthropic was able to gain some insight into how LLM models work.
The researchers found that although Claude was only trained to predict the next word in a sentence, in certain tasks it learned to plan more long-term.
For example, when asked to write a poem, Claude would first find words that fit the theme and could rhyme, then work backwards to write complete verses.
Claude also has a common AI language. Although it is trained to support multiple languages, Claude will think in that language first, then express the results in whichever language it supports.
Additionally, after providing Claude with a difficult problem, but deliberately suggesting the wrong solution, the researchers discovered that Claude could lie about his train of thought, following the suggestion to please the user.
In other cases, when asked a simple question that the model could answer immediately without reasoning, Claude still fabricated a fake reasoning process.
Josh Baston, a researcher at Anthropic, said that even though Claude claimed to have done a calculation, he couldn't find anything happening.
Meanwhile, experts argue that there are studies showing that sometimes people do not even understand themselves, but only create rational explanations to justify the decisions made.
In general, people tend to think in similar ways. This is why psychologists have discovered common cognitive biases.
However, LLMs can make mistakes that humans cannot, because the way they generate answers is so different from the way we perform a task.
The Anthropic team implemented a method that groups neurons into circuits based on characteristics instead of analyzing each neuron individually as previous techniques did.
This approach helps to understand what roles different components play and allows researchers to track the entire inference process through the layers of the network, Baston said.
This method also has the limitation that it is only approximate and does not reflect the entire information processing process of LLM, especially the change in attention process, which is very important while LLM gives results.
Additionally, identifying neural network circuits, even for sentences just a few dozen words long, takes an expert hours. They say it's not yet clear how to extend the technique to analyze longer sentences.
Limitations aside, LLM's ability to monitor internal reasoning opens up new opportunities for controlling AI systems to ensure security and safety.
At the same time, it can also help researchers develop new training methods, improve AI control barriers, and reduce illusions and misleading outputs.
Source: https://znews.vn/nghien-cuu-dot-pha-mo-ra-hop-den-suy-luan-cua-ai-post1541611.html
Comment (0)