According to a study by Stanford University, Imperial College London, and the Internet Archive, 35% of newly published websites were either generated or assisted by AI as of mid-2025. That share was close to zero before the launch of ChatGPT in November 2022.
“The rapid rise of AI-generated content on the web is truly remarkable,” Jonáš Doležal, a researcher at Imperial College London and co-author of the paper, told 404 Media. “In just three years, after decades of human-driven development, a significant portion of the internet has transitioned to being shaped by AI.”
The study, titled ‘The Impact of AI-Generated Text on the Internet,’ analyzed 33 months of website snapshots from the Internet Archive’s Wayback Machine, classifying them with an AI text detection tool called Pangram v3.
Of the six hypotheses the researchers tested about AI content’s impact on the web, only two were supported. Firstly, the internet is becoming less semantically diverse: AI-generated sites had semantic similarity scores 33% higher than human-written ones, indicating that they express similar ideas repetitively. The study posits that this could narrow the online Overton window, because language models favor outputs similar to their training data.
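The article does not say how the study computes its semantic similarity scores. As a rough illustration only, one common approach is to map each page to an embedding vector and average the cosine similarity over all pairs of pages. The sketch below assumes that approach; the function names and the toy vectors are hypothetical and are not taken from the study.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy embeddings, NOT data from the study.
varied_pages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
similar_pages = [[1.0, 0.1, 0.0], [1.0, 0.2, 0.0], [0.9, 0.1, 0.0]]

varied_score = mean_pairwise_similarity(varied_pages)
similar_score = mean_pairwise_similarity(similar_pages)

# Relative increase of one group's score over the other's, the same kind
# of comparison as a "33% higher" figure.
relative_increase = (similar_score - varied_score) / varied_score
print(f"varied={varied_score:.3f} similar={similar_score:.3f} "
      f"relative increase={relative_increase:.0%}")
```

A higher mean pairwise score means the pages in a group sit closer together in embedding space, which is one way to read "less semantically diverse."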
Secondly, AI content had a sentiment score over 107% more positive than human-generated content, likely reflecting the sycophantic tendencies of large language models (LLMs) trained on human approval signals. This trend could marginalize dissent, as AI-produced content tends to be sanitized and uniformly upbeat.
Contrary to popular belief, the study found no significant increase in factual inaccuracies in AI-generated content: AI presence and error rates showed no substantial correlation. Character-level analysis also failed to support the hypothesis that AI would homogenize individual voices into a generic style.
The implications go beyond the quality of discourse. At 35% AI prevalence, the risk of model collapse, in which future models degrade after training on AI-generated content, becomes more than theoretical. Foundation models trained on current web crawls will increasingly ingest AI-generated, semantically homogenized data.
The researchers are collaborating with the Internet Archive to develop a continuous monitoring tool that tracks AI’s share of the web in real time, moving beyond one-off snapshots.
A concurrent U.S. survey found that most Americans already believe all six negative hypotheses about AI content, even those the data does not support. People who use AI infrequently were 12% more likely than regular users to believe in its harms.
Despite these concerns, the study concludes that although a significant portion of new web content is AI-generated, this has not rendered the internet ‘dead’ but has instead filled it with what could be considered ‘zombie’ content.