A Sony Doctrine to Protect AI LLMs and to Balance Copyright with Innovation
In an article published on Techdirt, Ira P. Rothken wrote: "The technological marvel of large language models (LLMs) like ChatGPT, developed by AI engineers and experts, has posed a unique challenge in the realm of copyright law. These advanced AI systems, which undergo extensive training on diverse datasets, including copyrighted material, and produce output highly dependent on user “prompts,” have raised questions about the bounds of fair use and the responsibilities of both AI developers and users.
Building upon the Sony Doctrine, which protects dual-use technologies with substantial non-infringing uses, I propose the TAO (“Training And Output”) Doctrine for AI LLMs like ChatGPT, Claude, and Bard. This AI Doctrine recognizes that if a good-faith AI LLM engine is trained using copyrighted works, where (1) the original work is not replicated but rather used to develop an understanding, and (2) the outputs generated are based on user prompts, the responsibility for any potential copyright infringement should lie with the user, not the AI system. This approach acknowledges the “dual-use nature” of AI technologies and emphasizes the crucial role of user intent and inputs, such as prompts and URLs, in determining the nature of the output and any downstream usage.
Understanding LLMs and Their Training Mechanism
LLMs operate by analyzing and synthesizing vast amounts of text data. Their ability to generate responses, write creatively, and even develop code stems from this training. However, unlike traditional methods of copying, LLMs like ChatGPT engage in a complex process of learning from their training data and generating new content based on the patterns and structures they have internalized. This process is akin to a person learning a language from various sources and then using that language independently to create new sentences. AI LLMs are important for the advancement of society as they are “idea engines” that allow for the efficient processing and sharing of ideas.
Copyright law does not protect facts, ideas, procedures, processes, systems, methods of operation, concepts, principles, or discoveries, even if they are expressed in copyrighted works. This principle implies that the syntactical, structural, and linguistic elements extracted during the training of LLMs fall outside the scope of copyright protection. The use of texts to train LLMs primarily involves analyzing these non-copyrightable elements to understand and statistically model language patterns....
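To make the statistical-modeling point concrete, here is a minimal sketch in Python of a toy bigram "language model." It is deliberately simplified and is not how production LLMs are actually built (they use neural networks rather than word-pair counts), and the corpus, prompt, and function names are invented for illustration. The intuition it captures is the one described above: "training" reduces the source text to frequency statistics about which word tends to follow which, and generation produces a new sequence driven by the user's prompt rather than replaying the stored work verbatim.

from collections import Counter, defaultdict
import random

def train(corpus: str) -> dict:
    # "Training" here just tallies which word tends to follow which:
    # a table of statistical patterns, not a stored copy of the source text.
    words = corpus.split()
    model = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word][next_word] += 1
    return model

def generate(model: dict, prompt: str, length: int = 10) -> str:
    # Output is driven by the user's prompt: starting from the prompt's last
    # word, repeatedly sample the next word from the learned frequencies.
    output = prompt.split()
    word = output[-1]
    for _ in range(length):
        followers = model.get(word)
        if not followers:
            break
        candidates, weights = zip(*followers.items())
        word = random.choices(candidates, weights=weights)[0]
        output.append(word)
    return " ".join(output)

# Illustrative corpus and prompt; any text would do.
model = train("the court held that the fair use doctrine protects the new technology")
print(generate(model, "the court"))

Even in this toy version, what the "model" retains after training is a set of word-to-word frequencies rather than the expressive text itself, and what comes out depends heavily on the prompt the user supplies, which is the dynamic the TAO Doctrine is meant to address.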