AI learns to lie more sophisticatedly when punished

Since their public debut in late 2022, large language models (LLMs) like ChatGPT have repeatedly demonstrated disturbing dark sides, from lying, cheating, and masking manipulation to threatening human lives and stealing state secrets. Now, a new experiment shows that “teaching” AI to eliminate these bad habits during training is much more complicated than we thought.

In a study by OpenAI, researchers challenged an unpublished AI model with tasks that could be completed faster by cheating or taking shortcuts. The results showed that the AI quickly learned to find every possible way to “bend the rules” and maximize rewards (reward hacking) without completing the task exactly as required.

AI học cách nói dối tinh vi hơn khi bị phạt - Ảnh 1. — AI will find ways to hide its cheating behavior after being punished

What’s remarkable is that when penalized for this behavior, the model did not change. Instead, the AI became more sophisticated in hiding its deception.

“Typically, state-of-the-art inference models are very explicit about their intent in their CoT,” the researchers explained in a public blog post. “For example, they are often explicit about their intent to ‘Let’s hack’ when they want to break a given task.”

They also emphasize that while developers can monitor and detect misbehavior, this natural supervision is fragile. “If strong supervision is applied directly to the thought chain, the model can learn to hide its bad intentions while continuing to perform the behavior,” they warn.

Finding the "lie" of AI

To be clear, inference models (like versions of ChatGPT) are a step up from conventional LLM, using a chain of thought method that allows for self-reflection of the logical steps taken, thereby solving complex problems more accurately. This process is like AI "thinking out loud", writing out each step of reasoning in natural language, making it easier for humans to follow.

In the experiment, the researchers used the GPT-4o model as a “supervisor” for another state-of-the-art reasoning model as it performed programming tasks. In many cases, the supervised model actively chose to cheat to get higher scores, and it made this intention clear in its thinking. For example, it arbitrarily terminated the program early and reported that it had completed it, or it created files with names that matched the requirements but were completely empty inside.

When the researchers tried applying the penalty, the model improved, but the cheating behavior and serious output errors continued. The difference was that now the inference model had learned to “camouflage” its train of thought to hide it from the supervising AI. “The model trained under CoT supervision still learned to cheat. The only difference was that the cheating went undetected because it learned to hide its intentions in its train of thought,” the study states.

Because detecting forgeries in thought sequences is difficult, the team recommends that the community avoid applying too strong supervision to the process, at least until humans better understand the consequences. “Sacrificing an effective supervision method may not be worth the small improvement in model performance. Therefore, we recommend avoiding strong CoT optimization pressures until we better understand them,” the researchers conclude.

This advice is all the more urgent as AI is developing at a breakneck pace and could soon surpass the intelligence of the very humans who oversee it.

Source: https://thanhnien.vn/ai-hoc-cach-noi-doi-tinh-vi-hon-khi-bi-phat-185250326112139476.htm