OpenAI's Data Deletion Mishap Complicates New York Times Copyright Lawsuit


In a significant development in the ongoing copyright lawsuit between The New York Times and OpenAI, lawyers for the Times revealed that OpenAI engineers accidentally erased crucial search data stored on a virtual machine used for investigating potential copyright infringement.

According to a letter filed in the U.S. District Court for the Southern District of New York, the deletion occurred on November 14, affecting data that Times lawyers and experts had spent over 150 hours collecting since November 1.

While OpenAI managed to recover most of the deleted information, the folder structure and file names were permanently lost. This rendered the recovered data unusable for determining how the Times' copyrighted articles were used in building OpenAI's AI models.

The Times' legal team has been forced to restart its evidence-gathering process from scratch, losing a week's worth of work. While the plaintiffs' counsel acknowledged that the deletion appeared unintentional, they argued that the incident demonstrates why OpenAI should handle the searching of its own datasets.

The lawsuit centers on allegations that OpenAI used the Times' copyrighted content without permission to train AI models such as ChatGPT. OpenAI maintains that training on publicly available data constitutes fair use, though the company has recently secured licensing deals with several major publishers, including the Associated Press and Axel Springer.

The incident occurred within a "sandbox" environment of two virtual machines that OpenAI provided for the Times to examine training data, offering a rare glimpse into the typically secretive data used to build OpenAI's models.

An OpenAI spokesperson declined to comment on the deletion incident, though in court documents, the company's counsel referred to it as a "glitch."

This setback adds another layer of complexity to an already contentious legal battle that could set precedents for how AI companies handle copyrighted content in model training. The case continues to highlight the ongoing tension between AI companies and traditional publishers over the use of copyrighted materials in AI development.