News/Media Alliance Study Finds Pervasive Unauthorized Use of Publisher Content to Power Generative AI Technologies


While the Copyright Office submission and White Paper discuss the wider publisher landscape in the face of the GAI revolution, including relevant principles of copyright law, the accompanying technical analysis documents the extent to which GAI developers rely on high-quality journalistic content to power their models. In particular, the results show:

  • GAI developers have copied and used news, magazine and digital media content to train large language models (LLMs).
  • Popular curated datasets underlying LLMs significantly overweight publisher content, by a factor ranging from over 5 to almost 100, relative to the generic collection of web content scraped by Common Crawl.
  • Other studies show that news and digital media rank third among all categories of sources in Google’s C4 training set, which was used to develop Google’s GAI-powered products such as Bard. Half of the top ten sites represented in the dataset are news outlets.
  • The LLMs also copy and use publisher content in their outputs. The LLMs can reproduce the content on which they were trained, demonstrating that the models can memorize and retain the expressive content of the training works.

Alliance President & CEO Danielle Coffey stated, “The research and analysis we’ve conducted shows that AI companies and developers are not only engaging in unauthorized copying of our members’ content to train their products, but they are using it pervasively and to a greater extent than other sources. This shows they recognize our unique value, and yet most of these developers are not obtaining proper permissions through licensing agreements or compensating publishers for the use of this content. This diminishment of high-quality, human created content harms not only publishers but the sustainability of AI models themselves and the availability of reliable, trustworthy information.”

The Copyright Office comments and the White Paper offer multiple recommendations to policymakers, including:

  • recognizing that unauthorized use of publishers’ expressive content for commercial GAI training and development is likely to compete with and harm publisher businesses in a manner that infringes copyright;
  • creating transparency requirements that mandate disclosure of the use of copyright-protected content in training;
  • encouraging and facilitating effective licensing solutions;
  • supporting international cooperation and harmonization on GAI regulations; and
  • adopting legislation to remedy existing market imbalances that prevent publishers from negotiating fairly with dominant platforms over the use of their content.

Coffey continued, “Generative AI systems should be held responsible and accountable, just like any other business. This White Paper demonstrates that these systems rely on journalistic and creative content, which have the benefit of investment in quality on the front end, as well as publishers who are required by law to take responsibility for the content they share with the public. Continued unauthorized use will harm existing markets that acknowledge the value of archived and real-time quality content, and over time the GAI models themselves will deteriorate. You get out what you put in. It is critical that our copyright protections are properly enforced and that high standards of quality and accountability are the foundation of these and other new technologies.”

The News/Media Alliance is a nonprofit organization representing more than 2,200 news and magazine media organizations and their multiplatform businesses in the United States and globally. Alliance members include print and digital publishers of original journalism. Headquartered just outside Washington, D.C., the association focuses on ensuring the future of journalism through communication, research, advocacy, and innovation. Information about the News/Media Alliance can be found at http://www.newsmediaalliance.org.

Media Contact
Lindsey Loving, News/Media Alliance, 571-366-1000, [email protected], www.newsmediaalliance.org

SOURCE News/Media Alliance
