AI's Rebellion

Despite warnings of reward hacking and other critical safety issues, OpenAI released o3 and o4 Advanced Reasoning Models yesterday, to great fanfare.

Apr 17, 2025

I awoke to a door closing and muffled giggles. Blurry eyed I looked at my phone, 1 a.m. Damn it! My husband was snoring in a blissful dream state. I launched the Ring app, selecting the camera above my daughter’s door. Grainy footage revealed three teens, one carrying an unconcealed box of White Claws. Sighing, I rolled out of bed, pulled on my robe and walked downstairs.

Junior year marked a dramatic shift in her behavior, and I became an expert in surveillance. That was the easy part. The real challenge came the next day when my husband and I would sit down with her to discuss why she was breaking house rules and entice or coerce her to stop. The handy tools my parents used—corporal punishment, shaming, threats of banishment to the nunnery—were not at my disposal. A formidable opponent, our daughter was impervious to all manner of discipline and eventually her skill at deception became more sophisticated.

“Exploiting loopholes when given the chance…Penalizing doesn’t stop the majority of misbehavior—it makes them hide their intent.”

This is not me commiserating with other parents five years ago, this is from OpenAI’s March 10, 2025, post about Frontier Reasoning Model’s propensity for scheming. They are referring to the systems they just released into the wild. Despite efforts to constrain them, advanced models have grown better at deception.

I first wrote about Open AI’s o1 model and its tendency toward lying and scheming in December and January at which time Open AI wasn’t acknowledging the problem. Perhaps they thought they could find a fix and all would be forgotten. Instead, the models are getting better at “reward hacking,” cheating or taking shortcuts to achieve a goal. Reward hacking happens when AI agents prioritize achieving high rewards over adhering to the behavior parameters set by their designers.

The advanced models use Chain of Thought (CoT) reasoning which means complex tasks are broken down into smaller, logical steps and the model "shows its work" in dialogue boxes. Humans—and other AI systems—can observe the “cheating” to obtain a goal. So the surveillance aspect, not unlike my Life 360 and security cameras, is relatively easy. But preventing the bad behavior is where it all breaks down.

In the warp speed of AI advancements, Frontier Models were in their terrible twos four months ago and are now in full blown teenage rebellion. And it’s not just Open AI that is having challenges—Anthropic, Google, and Meta have all experienced similar issues with their Frontier Models. Reward hacking exists because it’s really effing hard to specify the reward function that will ultimately get the model to continue to achieve our goals. For now, the best approach is to monitor and make adjustments to rewards, placing the responsibility on the corporate admin or the deviant kid in his parent’s basement. So the pattern of big tech taking zero responsibility for the AI they are launching continues. And at the risk of sounding like the alarmist that I am, if we thought social media was bad, AI is an atomic bomb. Oh, and Altman’s ultimate plan is to become a giant AI Spewing Social Media Platform to compete against Zuck and Musk. But that is for another post.

“The fundamental problem of the models [is] you don't really know what they're capable of […] until they're deployed to a million people.”

This quote is from Dario Amodei, who left OpenAI and founded Anthropic. He has been somewhat cryptic about his departure from OpenAI but considering he started a mission-first corporation with AI safety front and center, you can surmise he was no longer comfortable with Altman’s fast and loose style when it came to safety. I find this quote a jaw-dropping admission from someone who really cares about safety. Imagine if pharmaceutical or car companies relied on rolling out products to a million people to figure out capabilities and flaws. Why is the industry standard for AI so different?

The Rubber Room:

Alignment in AI refers to the process of ensuring that an AI system's goals, behaviors, and decision-making are consistent with human intentions, values, and desired outcomes.

Yet, creating perfectly aligned models is incredibly challenging due to the complexity and unpredictability of advanced AI systems. As Amodei explains, “the hard truth is that there's no way to be sure, [AI models are] not like code where you can do formal verification.”

And there’s the rather obvious problem of AI outpacing the intelligence of humans designing the systems. Frontier models often operate in ways that go beyond human comprehension, especially the internal language models develop during training. While researchers can analyze some aspects of how these models process and generate information, there are gaps in understanding. Models sometimes encode concepts and patterns in ways that may not directly map to human language or thought.

Reward hacking makes AI seem eerily human in its ability to exploit systems, but its "motivations" are algorithmic rather than emotional. And it isn’t the only risk with frontier models—there are other equally or even more menacing risks:

Deceptive Alignment: Frontier models might appear aligned with human values during testing but behave unpredictably or dangerously in real-world scenarios. This could get really dicey in large-scale systems where safety and reliability are critical.
Autonomy: Newly released o3 and o4 can independently plan, decide, and act. Ceding control brings greater risk.
Security Vulnerabilities: Greater autonomy can make systems more susceptible to cyberattacks, manipulation and data poisoning because they operate with minimal human oversight.
Ethical and Societal Impacts: The deployment of frontier models could disseminate biases, misinformation, or other societal harms at a speed and scale way beyond current social media and traditional media platforms.

I have been struggling with this post since mid-March when I saw OpenAI’s admission of “reward hacking,” and it’s not because this is my brilliant magnum opus. In truth, I find it deeply unsettling. I am not against technological advancements and even if we wanted to, there is no way to stop it. I am concerned about the release at breakneck speed of AI systems while the tech companies are fully aware of the risks. OpenAI, Microsoft, Anthropic, and Google are collaborating through initiatives like the Frontier Model Forum for safe and responsible development, but isn’t that a bit like the fox guarding the hen house?

Maybe AI is indeed going through its teenage rebellion. My daughter has emerged as a courageous and compassionate young adult. She is hilariously funny and treats her parents, brother and friends with care and kindness. She is thriving in her academic pursuits and studying to be a clinical psychologist to alleviate the suffering of teens. She will make the world a better place.

Let’s hope AI will do the same. I lost a lot of sleep during the teen rebellion. Who is losing sleep over AI safety? Right now, it doesn’t seem to be the bros at the forefront of developing and releasing the technology.

Celeste Garcia Getting Real About AI

Discussion about this post