Big Tech is having a Vanilla Ice moment—the infectious 1990 song “Ice Ice Baby” was a breakout hit, with one glaring problem. The bassline was stolen from a Queen/David Bowie song, “Under Pressure,” without permission.
But unlike Mr. Van Winkle, who quite possibly didn’t know any better, Big Tech does. LLMs and Generative AI models have already lifted almost everything that exists on the internet for training data and inference. OpenAI, Meta, Anthropic, Stability AI, and Microsoft are all involved in copyright lawsuits with teams of lawyers weaponizing a clause in copyright law against creatives and artists.
In the Substack community, and among artists everywhere, there is anxiety about how Generative AI uses their content for training, then produces output that competes directly against them. A 10-second clip from TED2025 demonstrates how much Sam Altman doesn’t give a bleep. TED Curator Chris Anderson confronted him while displaying a Sora-created Charlie Brown cartoon: “At first glance this looks like IP theft […] you guys don’t have a deal with the ‘Peanuts’ estate?”
The audience clapped and Altman retorted, “You can clap about that all you want. Enjoy.”
The arrogance of his response brought a hush to the crowd followed by a few nervous laughs. He then launched into his “creative spirit of humanity” spiel and how “we want to build tools that lift up […] so that new people can create better art, better content, write better novels that we all enjoy.”
There is so much to hate about this. Since you are on Substack, I can assume you are a creative and/or appreciate writing and visual art.
I’m not sure anyone can define art, but when I see, read, or hear truly inspiring work, I think about the human heart, mind, and complexity of emotion that drive creation, and about the extraordinary skill that allows an artist to transform inner thought into something tangible. I think about the time, energy, expense, and personal sacrifice that went into it.
Generative AI is taking a sacred, uniquely human process and breaking it down into code to be reproduced at scale.
I’ve been writing for about 10 years and have a shot at possibly getting my novel published. Against all odds, I landed an agent, and she is now approaching publishers. Every step in the process is daunting and there are so many hurdles. Thousands and thousands of hours have been spent writing, re-writing and editing. If I don’t get a book deal, I will be faced with the decision to self-publish, which is very costly. Not great timing when Big Tech is weaponizing copyright law against creatives.
OpenAI, Google, and Meta believe AI models should be allowed to train on publicly available content and are actively lobbying the US government to broaden Fair Use, a legal doctrine that permits limited use of copyrighted material without permission in certain circumstances. They claim Fair Use should be broadened because national security and technological competitiveness are at stake.
It’s worth noting that most content on the internet has automatic copyright protection: under US law, protection attaches the moment an original work is fixed in tangible form, which includes posting it on any platform. The creator does not need to register the work for it to be protected, though published materials like books, movies, and music often carry formal copyright registration as well.
There are approximately 25 lawsuits currently working their way through the courts. But existing large language and frontier models have already been trained, via web crawlers, on almost everything available on the internet. From this content, AI learns generalized patterns and relationships that become deeply ingrained in the model’s structure, shaping how it generates responses. Generative models are not retrieving from a database, so data cannot be “removed” in the traditional sense, and the output they produce is not always easily attributable.
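To make the pattern-versus-retrieval distinction concrete, here is a toy character-level bigram model. It is nothing like the scale or architecture of a frontier model, but it illustrates the same structural point: after training, the model holds only transition statistics, not the original text, so there is no document to “remove” and no source to attribute.

```python
from collections import defaultdict
import random

def train(text):
    # Store only counts of which character follows which --
    # the training text itself is discarded.
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, length, seed=0):
    # Sample from the learned statistics; output is driven by
    # probabilities, not by looking up stored passages.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        chars, weights = zip(*nxt.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

model = train("the theory of the thing")
print(generate(model, "t", 10))
```

The generated string recombines patterns from the training text without the text being stored anywhere, which is, in vastly simplified miniature, why plaintiffs cannot simply demand their data be deleted from a trained model.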
Meta blatantly decided against licensing works, fearing that licensing agreements would weaken its Fair Use argument; worse, it was training on LibGen (Library Genesis), a massive database of pirated books and academic papers. These astonishing admissions were made public in Kadrey v. Meta, an ongoing copyright-infringement lawsuit brought against Meta by Richard Kadrey, Sarah Silverman, Ta-Nehisi Coates, and other authors.
This could be a precedent-setting case for Fair Use in AI training. On May 1, U.S. District Judge Vince Chhabria told Meta’s lawyers:
"You are dramatically changing, you might even say obliterating, the market for that person's work, and you're saying that you don't even have to pay a license to that person.”
The statement appears to bode well for the plaintiffs. Let’s look closer at Fair Use, since this is what Meta and the rest of Big Tech base their legal arguments upon. According to the U.S. Copyright Office Fair Use Index, “Fair use is a legal doctrine that promotes freedom of expression by permitting the unlicensed use of copyright-protected works in certain circumstances.”
The four key factors are generally referred to as:
Purpose and character of the use: commercial vs. educational, and whether it is transformative—the heart of tech’s argument for Fair Use.
Nature of the copyrighted work: fiction vs. nonfiction, published vs. unpublished.
Amount and substantiality of the portion used: Meta trained on entire pirated books from LibGen, but argues AI models do not store or reproduce exact text. While the judge has called using shadow libraries “kind of messed up,” he is currently not focusing on this factor.
Effect on the market value of the original work: whether the use harms the copyright owner’s ability to profit. The judge believes it does, but proving loss of income is not straightforward.
Big Tech is relying on Transformative Use, claiming the technology alters the original work in a significant way by adding new expression, meaning, or purpose. The more transformative a work is, the more likely it is to be considered Fair Use. The companies argue that AI models do not memorize or store exact content from training data, that responses are based on statistical relationships among words, concepts, or pixels, and that the output is therefore transformative by the very nature of the technology.
I am not a legal expert, but it seems Fair Use cuts both ways for Big Tech: they have a strong argument for transformative use while simultaneously damaging the market value of the original works.
Ed Newton-Rex very publicly resigned from his role leading the Audio team at Stability AI because he didn’t “agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use.’” He points out that “AI companies spend vast sums on people and compute but expect to get training data for free.”
An industry insider for many years, Newton-Rex debunks many of the AI industry’s go-to arguments. Altman likens AI training on the existing corpus of copyrighted art, music, and writing to the way artists have studied the work of other artists for centuries. But Newton-Rex points out that generative AI can mass-produce works from its training data, will ultimately compete against the original copyright owners, and will almost certainly cause them economic harm.
An obvious solution is licensing, despite Tech’s claim that it is impractical, hard to track, and, ironically, stifles (their) creativity. Several of the tech giants already have licensing agreements in place with global publishers; yes, it is a long and arduous process, but obviously doable. And hey, why don’t they use AI’s own capabilities to figure out how to track training and inference in a way that compensates individual copyright owners?
AI Tech’s modus operandi has always been launch first, figure it out later, but I can’t help feeling the creative human spirit is at risk. Truly talented artists, regardless of genre, typically don’t make art to get rich. Tremendous bravery is required to pursue a passion in the arts, and there have always been obstacles standing in the way, financial ones chief among them. But never have artists faced a threat this menacing: one that robs them of recognition and recompense for their original works on a massive scale.
Once again, we are relying on the tech bros engaged in the AI arms race to do the right thing, and I’m not feeling particularly confident considering their track record. I am not opposed to AI or technology; there is so much good that can come of it, but there will be pain, and I don’t like raising alarm bells without offering solutions. I have a few suggestions regarding “Tech’s Vanilla Ice Problem” (and, by the way, Vanilla Ice did eventually pay the copyright owners their due). See below for instructions to opt your Substack out of AI training, and consider reviewing and signing Newton-Rex’s The Statement on AI Training.
Much will play out in the courts. AI is inevitable, but as a community we can continue to shine a light on the miracle of human-made works and support artists when and where we can.
Substack content creators can opt out of AI training. Here’s how: in your publication settings, look for the option to block AI training (the exact menu label may vary as Substack updates its interface).
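Under the hood, an opt-out like Substack’s works by updating the publication’s robots.txt file to disallow known AI crawlers. The sketch below is illustrative, not Substack’s exact output; the user-agent names are the publicly documented crawler identifiers, and honoring them remains voluntary on the crawler’s part.

```
# robots.txt fragment blocking known AI-training crawlers
User-agent: GPTBot           # OpenAI
Disallow: /

User-agent: Google-Extended  # Google AI training
Disallow: /

User-agent: CCBot            # Common Crawl
Disallow: /

User-agent: ClaudeBot        # Anthropic
Disallow: /
```

Because compliance is voluntary, opting out stops future scraping by well-behaved crawlers but does nothing about content already ingested into existing models.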
BIBLIOGRAPHY:
How AI Models Steal Creative Work — and What to Do About It | Ed Newton-Rex | TED
The Statement on AI Training, sign the open letter launched on October 22, 2024
gov.uscourts.cand.415175.417.6.pdf
Copy + Paste + Steal: Artists Battle For Copyright vs Generative AI | Undercover Asia | Full Episode
Judge in Meta case warns AI could 'obliterate' market for original works
OpenAI’s Sam Altman Talks ChatGPT, AI Agents and Superintelligence — Live at TED2025
The AI Copyright Battle: Why OpenAI And Google Are Pushing For Fair Use
Research Assistants: Microsoft Copilot, Midjourney, ChatGPT
If you read this the first time around, thanks.
L/C