Publication Date: 2025-08-15
The past few years have witnessed an explosion in generative artificial intelligence (AI),1 as these systems rapidly advance in their ability to produce human-like expression in text, images and audio.2 Flagship models like OpenAI’s GPT-4 (a large language model), Stability AI’s Stable Diffusion (a text-to-image generator) and Meta’s MusicGen (a controllable music-generation transformer) illustrate the technology’s dramatic leap forward.3 These systems have swiftly attracted enormous user bases.4 Yet, they also raise pressing questions about the risk of infringing copyrighted works across jurisdictions.
A primary concern revolves around the training phase, where most generative AI systems ingest vast datasets that often contain large amounts of copyrighted works, frequently without explicit licence from rights holders.5 Investigations into the Books3 dataset, for instance, revealed 191 000 pirated books—many still under copyright—used to develop models like Meta’s LLaMA and BloombergGPT.6 Similar controversies have emerged around datasets such as The Pile, a corpus containing fiction and nonfiction scraped from numerous online repositories.7 This practice of training on copyrighted works without authorization raises significant legal questions regarding direct infringement, currently being debated intensely and explored through different lenses internationally.8
While the legality of the training process itself is a critical and complex issue involving doctrines like fair use (in the USA) or specific exceptions (in the EU/UK), this article focuses primarily on a related but distinct challenge: potential copyright liability arising from the outputs of generative AI systems and the applicability of existing intermediary liability limitations (safe harbours) to the actors involved in creating and disseminating those outputs. AI-driven tools have been reported to produce outputs that bear striking similarity to, or even directly reproduce, protected expression.9 Authors in the OpenAI litigation claim that ChatGPT can be prompted to spit out ‘very accurate summaries’ of their works.10 These revelations have triggered legal disputes and raised the concern that generative AI’s outputs—at the instruction of users—may, in particular instances, be substantially similar enough to constitute unauthorized reproductions or infringing derivatives.
In the USA, the legal regime that has long provided intermediaries with immunity and legal certainty, while simultaneously incentivizing them to take measures that reduce infringement, is the safe harbour framework established under the Digital Millennium Copyright Act (DMCA).11 When enacted in 1998, the DMCA’s safe harbours primarily sought to shield the online intermediaries prevalent at the time—such as internet access providers, web hosting services and early forms of online platforms—from liability for user-submitted infringing content, on the condition that they comply with certain procedural and substantive safeguards;12 this framework later proved crucial for the operations of services like YouTube and social media platforms. While influential globally, this model faces challenges from generative AI.
Generative AI disrupts the foundational assumptions of frameworks like the DMCA section 512 in at least three core respects, challenges that are echoed in debates worldwide. First, responsibility for AI outputs is frequently fragmented among multiple actors across the supply chain: data suppliers, model developers and model deployers.13 Second, generative AI providers do not merely host or transmit user-posted files in the traditional sense; rather, they actively participate in generating outputs, undermining the assumption that intermediaries are passive conduits.14 Third, mechanisms like the DMCA’s notice-and-takedown system—designed for discrete and stable infringing files—are poorly suited to AI-generated works that are dynamically produced and delivered directly to users in ways that make detection and removal difficult for rights holders.15
In short, simply applying traditional safe harbour frameworks, exemplified by the DMCA, to generative AI overlooks the complex ecosystem and the dynamic, opaque content generation processes at the core of these systems. These shortcomings may undermine legal certainty for participants in the AI supply chain, including investors, and inadvertently stifle innovation. Alternatively, they may leave authors and copyright owners with insufficient remedies when infringing content emerges on a large scale. Despite this growing tension, policymakers, courts and scholars globally have yet to reach a consensus on how best to adapt intermediary liability regimes. Current debates often focus on questions like whether AI-generated outputs can be copyrighted,16 how authorship might be redefined17 and whether fair use or infringement standards need revision.18
This article contributes to this international discussion by using US law as a detailed case study and then extracting principles that travel. In particular, the proposal shows how the DMCA’s conditional-immunity bargain can be modernized for large-scale, multi-actor AI systems and then transplanted to other jurisdictions that are already updating their own rules. For the EU, the framework could inform the application of safe harbours under the Digital Services Act (DSA),19 dovetail with the AI Act’s transparency and documentation duties for general-purpose models,20 and supply a role-based liability shield.21 It may also offer insights for interpreting obligations similar to those found in other EU provisions, such as Article 17 of the Directive on Copyright in the Digital Single Market (DSMD).22 In the UK, it offers a concrete template for the Intellectual Property Office’s 2024–2025 consultation on a post-Brexit text and data mining (TDM) exception and on infringement and liability relating to AI outputs.23 For Hong Kong, which is examining AI providers’ liability for end-users’ infringement, the model demonstrates how tiered obligations can be grafted onto its existing, technology-neutral Copyright Ordinance.24 By showing how, where and why DMCA-style conditional immunity still works—once recalibrated—the article aims to give legislators and regulators in these jurisdictions a set of plug-and-play design choices rather than an American blueprint to be copied wholesale.
Section II begins by examining how the DMCA’s safe harbour provisions emerged from a legal environment built around passive intermediaries and then shows how generative AI disrupts the foundational assumptions of section 512—issues relevant to any jurisdiction relying on similar concepts. Section III responds with the ‘AI harbour’ proposal, assigning role-specific obligations to data suppliers, model developers and model deployers, creating a graduated immunity scheme. Section IV evaluates potential criticisms, including administrative feasibility and the interplay with doctrines like fair use, considering challenges observed in existing systems. Finally, Section V concludes.
Source: https://academic.oup.com/jiplp/advance-article/doi/10.1093/jiplp/jpaf043/8221820