Stack Overflow Will Charge AI Giants for Training Data

Large language models can generate strings of text based on word patterns learned from the web pages, books, and other bodies of text in their training data. Besides ChatGPT, the programs make up the guts of search chatbots such as Microsoft Bing chat and Google’s Bard, and they underlie a growing number of applications that produce professional and creative copy in a flash. Their counterparts that generate AI-composed illustrations and videos draw on patterns from image datasets such as photos gathered from Pinterest and Flickr.

Often, datasets used in AI development are built through unofficial means, such as dispatching software that scrapes content from websites. In the US, that is typically considered legal, though copyright issues and websites’ terms of use against the practice have left it in dispute.

A few websites such as Reddit and Stack Overflow have been more inviting. They offer downloadable “data dumps” or real-time data portals, known as APIs, that help software access their content. In Stack Overflow’s case, LLM developers are getting their hands on data through a mix of dumps, APIs, and scraping, Chandrasekar says, all of which today can be done for free.

But Chandrasekar says that LLM developers are violating Stack Overflow’s terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.

Neither Stack Overflow nor Reddit has released pricing information. “We're working on that as we speak,” Reddit spokesperson Tim Rathschmidt says, “and will share more with partners in the coming weeks.” Stack Overflow will study Reddit’s strategy and consult with its own potential customers, some of whom have already reached out about data access, Chandrasekar says.

A potential roadmap to pricing could come from Elon Musk, who this month hiked prices for access to Twitter data. They start at $42,000 per month for access to 50 million tweets. About three times the volume of tweets had been previously available for free. In a tweet this week, Musk accused Microsoft, a major AI developer and close partner of OpenAI, of training algorithms “illegally using Twitter data.” Without elaboration, he added, “Lawsuit time.”

Both Stack Overflow and Reddit will continue to license data for free to some people and companies. Chandrasekar says Stack Overflow wants remuneration only from companies developing LLMs for big, commercial purposes. “When people start charging for products that are built on community-built sites like ours, that's where it's not fair use,” he says.

Reddit CEO Steve Huffman told The New York Times this week that he didn’t want to give a freebie to the world’s largest companies. “Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with,” he said.
