AI companies often keep the sources of their training data secret, but an investigation by Proof News revealed that some of the world’s wealthiest AI companies used thousands of YouTube videos to train their AI systems. They did so despite YouTube’s terms of service, which prohibit collecting platform materials without authorization.
The investigation found that Silicon Valley giants, including Anthropic, Nvidia, Apple, and Salesforce, used subtitles from 173,536 YouTube videos drawn from more than 48,000 channels.
This dataset, named YouTube Subtitles, includes transcripts from educational and online learning channels such as Khan Academy, MIT, and Harvard University. Videos from The Wall Street Journal, NPR, and the BBC were also used to train AI, along with content from shows like “The Late Show With Stephen Colbert,” “Last Week Tonight With John Oliver,” and “Jimmy Kimmel Live.”
Proof News also discovered data from YouTube superstars, including MrBeast (289 million subscribers, 2 videos used for training), Marques Brownlee (19 million subscribers, 7 videos used), Jacksepticeye (31 million subscribers, 377 videos used), and PewDiePie (111 million subscribers, 337 videos used). Some of the materials used to train AI also promoted conspiracy theories like “Flat Earth.”
Proof News created a tool to search for creators in the YouTube AI training dataset.
“No one asked me if they could use it,” said David Pakman, host of “The David Pakman Show,” a left-leaning political channel with over 2 million subscribers and over 2 billion views. Nearly 160 of his videos were included in the YouTube Subtitles training dataset.
Pakman, who employs four full-time staff, believes he should be compensated if AI companies profit from using his data. He noted that some media companies have already signed agreements to be compensated for their work used in AI training.
“This is my livelihood, and I’ve invested time, resources, money, and staff time into creating this content,” Pakman said.
“This is theft,” said Dave Wiskus, CEO of Nebula, a streaming service partially owned by its creators, some of whose work has been taken from YouTube for AI training.
Wiskus said using creators’ work without consent is “disrespectful,” especially as studios might “use generative AI to replace as many artists as possible.”
“Will this be used to exploit and harm artists? Yes, absolutely,” Wiskus said.
EleutherAI, which created the dataset, did not respond to requests for comment on Proof’s findings, including allegations that it used the videos without authorization. The organization’s website says its goal is to lower the barriers to AI development so that people outside Big Tech can participate, and that it has historically provided “access to cutting-edge AI technology through training and releasing models.”
Big Tech refers to the world’s largest technology companies, typically including the five largest U.S. tech companies: Alphabet (Google’s parent company), Amazon, Apple, Meta (Facebook’s parent company), and Microsoft.
The YouTube Subtitles dataset contains no video imagery; it consists solely of subtitle text, often accompanied by translations into languages such as Japanese, German, and Arabic.
According to a research paper published by EleutherAI, the dataset is part of a compilation called the Pile. The Pile’s developers included data not only from YouTube but also from the European Parliament, English Wikipedia, and a trove of Enron employee emails released as part of a federal investigation.
Most of the Pile’s component datasets are publicly accessible, so any internet user with enough storage and computing power can use them. Scholars and other developers outside Big Tech have done so, but they are not the only ones.
Apple, Nvidia, and Salesforce, companies valued in the billions and even trillions of dollars, have described in research papers and posts how they used the Pile to train AI. Documents also show that Apple used the Pile to train OpenELM, a high-profile model released in April, weeks before the company announced new AI features for iPhones and MacBooks. Bloomberg and Databricks also trained models on the Pile, according to their publications.
Anthropic, a leading AI developer that received a $4 billion investment from Amazon and promotes its focus on “AI safety,” also used the Pile.
“Pile contains a very small subset of YouTube subtitles,” Jennifer Martinez, a spokesperson for Anthropic, confirmed in a statement acknowledging the use of the Pile in Claude, Anthropic’s generative AI assistant. “YouTube’s terms cover direct use of its platform, which differs from using the Pile dataset. For questions regarding potential YouTube terms of service violations, we must refer you to the authors of Pile.”
Salesforce also confirmed using the Pile to build an AI model for “academic and research purposes.” Caiming Xiong, Salesforce’s vice president of AI research, emphasized in a statement that the dataset is “publicly available.”
Salesforce later released the same AI model publicly in 2022; it has been downloaded at least 86,000 times, according to its Hugging Face page. In their research paper, Salesforce’s developers noted that the Pile contained profanity as well as biases against gender and certain religious groups, warning that this could lead to “vulnerabilities and security issues.” Proof News found thousands of instances of profanity in YouTube Subtitles, along with examples of racial and gender slurs. Salesforce representatives did not respond to questions about these issues.
Nvidia representatives declined to comment. Apple, Databricks, and Bloomberg representatives did not respond to requests for comment.
YouTube Data “Goldmine”
Competition among AI companies partly involves acquiring higher-quality data, said Jai Vipra, an AI policy researcher at the CyberBRICS project at Fundação Getulio Vargas Law School in Rio de Janeiro. This is one reason companies keep data sources secret.
Earlier this year, The New York Times reported that Google, which owns YouTube, used text from the platform’s videos to train its models. In response, a spokesperson told the paper that such use is allowed under agreements with YouTube creators.
The New York Times investigation also found that OpenAI used YouTube videos without authorization. Company representatives neither confirmed nor denied the newspaper’s findings.
OpenAI executives have repeatedly declined to answer questions publicly about whether they used YouTube videos to train their AI product, Sora, which can create videos based on text prompts. Earlier this year, a Wall Street Journal reporter asked OpenAI’s chief technology officer, Mira Murati, this question.
“I’m actually not sure,” Murati replied.
Vipra said YouTube Subtitles and other types of speech-to-text data could be a “goldmine” because they can help train models to replicate how people speak and converse.
“It’s still a matter of principle,” said Dave Farina, host of “Professor Dave Explains,” a chemistry and science tutorial channel with 3 million subscribers. The dataset includes 140 of his videos.
“If you’re leveraging my work to build a product to profit from and that product puts me or people like me out of work, then there needs to be a discussion about compensation or some kind of regulation,” he said.
The YouTube Subtitles dataset, released in 2020, includes subtitles from more than 12,000 videos that have since been deleted from YouTube. Their content, though no longer viewable on the platform, lives on in the dataset and may have been used to train AI models. In one case, a creator deleted every trace of themselves online, yet their work survives in the dataset, and it is unknown how many AI models have already ingested it.
Proof News attempted to contact the owners of the channels mentioned in this article. Many did not respond to requests for comment. Of those who did, none knew that their data had been taken, let alone how it was used.
Those surprised included the producers of Crash Course (nearly 16 million subscribers, 871 videos used) and SciShow (8 million subscribers, 228 videos used), pillars of Hank and John Green’s educational video empire.
“We are dismayed to learn that our carefully crafted educational content has been used without our consent,” said Julie Walsh Smith, CEO of Complexly, the production company behind these shows, in a statement.
YouTube Subtitles is not the first AI training dataset to trouble the creative industry.
Writer Alex Reisner obtained a copy of another Pile dataset, Books3, and reported in The Atlantic last year that more than 180,000 books had been taken, including works by Margaret Atwood, Michael Pollan, and Zadie Smith. Many authors subsequently sued AI companies over the unauthorized use of their work, alleging copyright infringement. Similar cases have since snowballed, and the platform that hosted Books3 has taken it down.
In response to these lawsuits, defendants such as Meta, OpenAI, and Bloomberg have argued that their actions constitute fair use. The plaintiffs voluntarily withdrew their case against EleutherAI, which originally scraped and published the books.
The remaining lawsuits are still in their early stages, with questions of licensing and payment yet to be resolved. The Pile has since been removed from its official download site but remains available on file-sharing services.
“Tech companies have been brazen,” said Amy Keller, a consumer protection lawyer and partner at the law firm DiCello Levitt, who has filed lawsuits on behalf of creators whose work was collected without consent by AI companies.
“People are concerned that they have no choice in the matter,” Keller said. “I think that’s the real issue.”
Mimicking the Mimic
Many creators feel uncertain about the future.
Full-time YouTubers regularly check for unauthorized use of their work and submit takedown notices, worried that AI might soon generate content similar to theirs, possibly even direct imitations.
David Pakman, creator of “The David Pakman Show,” recently saw AI’s power firsthand while scrolling through TikTok. He came across a video labeled as a Tucker Carlson clip, but was shocked when he watched it: it sounded like Carlson, yet it was, word for word, what Pakman had said on his own YouTube show, down to the intonation. Alarmingly, only one commenter seemed to recognize it was fake, a cloned Carlson voice reading Pakman’s script.
“This is going to be a problem,” Pakman said in his YouTube video about the fake clip. “You can basically do this with anyone.”
EleutherAI founder Sid Black wrote on GitHub that he created YouTube Subtitles using a script that downloaded subtitles from YouTube’s API in the same way viewers’ browsers download subtitles while watching videos. According to documentation on GitHub, Black used 495 search terms to filter videos, including “funny vlogger,” “Einstein,” “Black Protestant,” “protective social services,” “information warfare,” “quantum chromodynamics,” “Ben Shapiro,” “Uyghur,” “raw foodist,” “cake recipe,” and “flat earth.”
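For illustration, here is a minimal sketch of how such a subtitle-downloading script might work. It assumes the open-source youtube-transcript-api Python package in its pre-1.0 interface (a publicly available library of the sort described below; treating it as the exact module referenced in this article is an assumption), and the video ID is a placeholder rather than one drawn from the dataset:

```python
# Minimal sketch: fetch English subtitles for a list of video IDs and save
# the raw text, similar in spirit to how YouTube Subtitles was assembled.
# Assumes: pip install "youtube-transcript-api<1.0". The ID below is a
# placeholder, not a video from the dataset.
from youtube_transcript_api import YouTubeTranscriptApi

video_ids = ["dQw4w9WgXcQ"]  # placeholder; a real pipeline would gather IDs via search

for video_id in video_ids:
    try:
        # Returns a list of segments, each a dict with "text", "start",
        # and "duration" keys.
        segments = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
    except Exception as exc:  # captions disabled, video deleted, etc.
        print(f"skipping {video_id}: {exc}")
        continue
    text = " ".join(segment["text"] for segment in segments)
    with open(f"{video_id}.txt", "w", encoding="utf-8") as out:
        out.write(text)
```

A full pipeline would first have to turn search queries into lists of video IDs, which is where the 495 search terms described above came in.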
Although YouTube’s terms prohibit accessing its videos “through automated means,” more than 2,000 GitHub users have starred the code.
“If YouTube really wanted to stop this module from working, there are many ways to do it,” wrote Jonas Depoix, a machine learning engineer, in a GitHub discussion where he posted the code Black used to access YouTube subtitles. “So far, that hasn’t happened.”
In an email to Proof News, Depoix said he hadn’t used the code since writing it years ago as a student project and was surprised people found it useful. He declined to answer questions about YouTube rules.
Google spokesperson Jack Malon said in an email response that the company has taken “actions to prevent abuse and unauthorized scraping” for years. He did not respond to questions about other companies using this data for training.
Among the videos used by AI companies, 146 came from “Einstein Parrot,” a channel with nearly 150,000 subscribers. Marcia, the African grey parrot’s caretaker (who declined to use her last name to protect the famous parrot’s safety), initially found it amusing that an AI model absorbed the words of a mimicking parrot.
“Who would want to use a parrot’s voice?” Marcia said. “But then, I realized he speaks well. He talks using my voice. So he’s mimicking me, and then the AI is mimicking the parrot.”
Once data has been absorbed by an AI model, it cannot be “forgotten.” Marcia is troubled by the unknown ways her bird’s words might be used, including to create a digital replica of the parrot that could be made to say inappropriate things.
“We’re venturing into uncharted territory,” Marcia said.