Why Deepseek training on ChatGPT output doesn’t matter; also, LinkedIn class action dismissed

That was fast: The class action suit filed against LinkedIn for allegedly using Premium members’ private messages to train its AI has been dismissed.

“LinkedIn has shown us evidence that it did not use [Premium members’] private messages to [train its AI],” said Eli Wade-Scott, managing partner at Edelson PC, the law firm that represented the Plaintiff Alessandro De La Torre.

Unfortunately, this leaves more questions than answers. For instance, should companies be allowed to use user-generated content — private or otherwise — if it’s anonymized and de-identified, especially under broad Terms of Service clauses like “for improving the service”? And what about non-Premium members’ messages? Were they used?

This brings us to a related question: did Deepseek train its LLM on ChatGPT-generated output, as recent investigations suggest?

The meme above captures what many see as OpenAI’s irony: condemning the use of ChatGPT-generated output to train a rival model after training its own models on scraped copyrighted material.

(This in turn touches again on the overarching question of whether training LLMs on scraped material infringes copyright, a question to which I’ve long argued the answer is, in general, no.)

Whether one agrees that training an LLM on scraped copyrighted material violates US copyright law, however, is an entirely different issue from OpenAI’s concern here.

Here, the issue has nothing to do with copyright infringement; it is simply a potential violation of OpenAI’s Terms of Service, which provide, in relevant part, that “you may not … [u]se Output to develop models that compete with OpenAI.”

Simply put, despite the humor of the meme at the top of this article, OpenAI is not complaining that Deepseek “stole” ChatGPT output for training purposes; it is saying that Deepseek would have breached OpenAI’s Terms of Service if it did so.

Obviously this opens up yet another question: OpenAI’s Terms of Service apply only to registered users of its services. Deepseek could easily have scraped ChatGPT output publicly available on the internet without ever being a ChatGPT user, and thus without being bound by those Terms. In that case, I struggle to see what legs OpenAI would have to stand on in any legal action against Deepseek.

Even if Deepseek violated OpenAI’s Terms of Service, enforcing the claim is another matter. OpenAI would need to prove damages — no easy task given how widespread and open AI-generated content has become.

If Deepseek scraped ChatGPT-generated content from third-party sites, OpenAI’s case is even weaker. However, if OpenAI can trace the data back to ChatGPT users who violated its ToS to aid Deepseek’s training — say, by mass-generating and redistributing ChatGPT outputs — it might still have a case, albeit a flimsy one.

What do you think? Let me know in the comments.