Publication Date:2025-12-10
Shadow libraries are vast online repositories containing millions of pirated books and articles made free to download on websites.1 The sharing of copyrighted work outside ‘official channels’ has been identified as a longstanding practice, with scholars such as Bodó tracing the historic roots of key shadow libraries such as Library Genesis (‘LibGen’) and Sci-Hub to unauthorized photocopying in resistance to Soviet censorship and attempts by Russian academics in the 1990s to disseminate literature during economic crisis.2 However, the consolidation of unauthorized copies of works into digital collections emerged in the early 2000s when technical developments enabled the ‘large scale automation’ of such activities.3 In the decades that followed the first digital library, Project Gutenberg, which provided public domain works on the predecessor to the internet, personal collections of digitized works grew into larger projects and became increasingly susceptible to copyright infringement claims.4 While some collections migrated to offline ‘dark collections’, others developed into more ‘curated archives’ for the public under a philosophy of ‘resistance’ against ‘publishing oligopolies’ and academic paywalls, gaining many characteristics of a library such as catalogues.5 In their modern form, shadow libraries such as Sci-Hub also use private credentials linked to Western universities to gain access to key databases to meet users’ searches, operating such that a copy of the work is saved to LibGen when the user downloads the work.6
Prior to 2023, legal action involving shadow libraries such as LibGen and Sci-Hub largely centred on takedown notices and injunctions targeting shadow libraries directly.7 On 7 July 2023, however, author Sarah Silverman commenced a lawsuit with various authors against Meta and OpenAI, claiming the companies had used copyrighted text extracted from shadow libraries to train artificial intelligence (‘AI’) models.8
The emerging role of shadow libraries within the generative AI (GAI) industry is best understood through the lens of how GAI is developed. GAI is a form of AI9 that generates text, sounds or images in response to specific prompts.10 To create GAI systems, developers require large collections of material or ‘AI datasets’, comprising text and images that serve as ‘input data’.11 Developers subsequently undertake text and data mining (TDM) on the dataset, which is a computer process that analyses and extracts information and patterns from collated data.12 The information extracted is then ingested by the GAI model to train it to generate similar content or ‘outputs’.13 A challenge for GAI developers, however, is that the quality of outputs heavily depends upon the amount and quality of the training data used.14 Consequently, AI training often involves copying significant amounts of reliable, high-quality works from online repositories and internet databases.15
Against this background, concerns have emerged regarding whether GAI developers, in scraping online works, are infringing the exclusive rights of copyright owners by obtaining their works from shadow libraries.16 This is of particular concern in light of Meta admitting to using the Books3 dataset, which is popularly used to train text-generating GAI models and consists predominantly of pirated books from the shadow library, Bibliotik.17 While the European Union (EU) Directive on Copyright in the Digital Single Market (CDSM)18 has created copyright exceptions for TDM, the copyright frameworks of many jurisdictions are presently silent on the legal status and copyright protection of datasets. Within this context, the paper’s central objective is to determine how effective copyright laws for datasets can be developed to promote GAI innovation while protecting authors’ rights against unlawful uses of shadow library material.
Australia, as one such jurisdiction without specific copyright laws on AI or TDM, has been selected as a case study for examination given the increasing political and stakeholder interest in AI training material. On 5 December 2023, the Attorney-General announced the establishment of the Copyright and Artificial Intelligence Reference Group, which operates as a ‘standing mechanism’ for engaging with stakeholders in technology, creative and media sectors across various issues, including the legality of AI training data.19 Since then, Australian authors and creators have voiced increasing discontent at uses of their works without permission, exemplified by a recent statement released by the Australian Society of Authors that it was ‘horrified’ that Australian books were being included in datasets of ‘pirated books’, while Creative Australia expressed concern that the ‘global nature’ of the technology industry, including AI development, was rendering IP rights enforcement more difficult where works are being used without consent.20 Consequently, political discourse on AI-oriented reform has progressed, with the Productivity Commission in August 2025 raising the possibility of expanding fair dealing exceptions to accommodate TDM undertaken by AI developers, while the Select Committee on Adopting Artificial Intelligence has recommended implementing dedicated AI legislation that, amongst other things, requires transparency around the works used when developing AI.21 Given the current progression of political and stakeholder discourse in this area, Australia is suitably placed to serve as a case study for AI-oriented copyright reform, and it is hoped that this theoretical and legal analysis will inform the discourse of other jurisdictions similarly engaged in interrogating the adequacy of their present copyright frameworks and designing reforms to better regulate pirated dataset material.
Specifically, as shadow libraries have not been the subject of extensive scholarly analysis in the context of AI, this paper aims to investigate how proposed reforms can enable the use of copyrighted works in datasets to advance AI innovation, while avoiding the inadvertent legitimization of shadow libraries.
To address this central issue, this paper will commence by developing evaluative theoretical criteria which can be used to determine what constitutes ‘effective’ copyright laws in the AI era. The paper will present an ‘IAP test’ (incentives for authors, access to works and public interest) that can be used both to evaluate the effectiveness of present laws and to design future laws that are more closely aligned with the commercial, social and cultural realities of the AI era. The paper will then analyse the Australian reproduction right under the Copyright Act 1968 (Cth) (‘Copyright Act’) and associated exceptions for temporary reproduction and fair dealing to determine the present legality of datasets and assess the law’s effectiveness in balancing innovation with authors’ interests and the public interest in restricting shadow library usage. Informed by this theoretical and legal foundation, the paper will subsequently investigate how the EU’s CDSM and Artificial Intelligence Act (‘AI Act’)22 could operate together to regulate uses of shadow libraries in dataset creation. Based on these insights, this paper will recommend that Australia implement a modified CDSM-style copyright exception for TDM, with clearer definitions for ‘lawful access’ requirements to ensure that shadow libraries are not inadvertently legitimized. Finally, the paper will recommend that these reforms be accompanied by mandatory obligations for disclosing dataset sources, similar to those in the AI Act, to enable the policing and prosecution of unlawful dataset creation and use.
The remainder of this paper contains five sections. Section II presents a pluralistic framework, referred to as the ‘IAP test’ in this paper, for evaluating the effectiveness of copyright law in the AI era, integrating aspects of utilitarianism and Held’s valid judgment test.23 Section III considers the legal status of datasets under Australian copyright law and evolving scholarly discourse, noting various areas of uncertainty and disagreement. Building upon this foundation, Section IV applies the IAP test developed in Section II to the findings in Section III to evaluate the present effectiveness of Australian copyright law in relation to dataset creation and shadow library usage. To provide potential insights for law reform, Section V investigates the legal status of datasets under the EU’s TDM exception and whether disclosure obligations under the AI Act would expose developers’ uses of shadow library material. Finally, Section VI presents a suite of recommendations for reform to develop more effective dataset copyright laws that better balance the promotion of AI innovation with author protections against shadow library usage.
Source: https://academic.oup.com/jiplp/advance-article/doi/10.1093/jiplp/jpaf072/8341529