Stack Overflow to charge for AI training data
Stack Overflow, an online forum for computer programming, is set to begin charging large AI developers for access to its 50 million questions and answers as part of a broader generative AI strategy. Developers would use the site's data to train AI chatbots and text generators such as ChatGPT. AI companies such as OpenAI and Google have traditionally scraped training data from the web and paid nothing for it. Meta and Google did not immediately provide comment, and OpenAI did not respond to a request for comment.
Community sites such as Stack Overflow and Reddit, which plan to charge AI developers for their data, have been backed by the News/Media Alliance, a US trade group that includes publishers such as Condé Nast. The Alliance has called on generative AI developers to negotiate any use of publishers' data for training and other purposes, and to respect publishers' right to fair compensation.
Feeding text from online discussions about programming into machine learning algorithms known as large language models (LLMs) can help AI text generators or chatbots become more fluent and knowledgeable. Using LLMs to generate programming code is viewed as one of the technology's biggest opportunities, with Microsoft charging as much as $19 a month per person for its code generator GitHub Copilot.
Stack Overflow’s CEO, Prashanth Chandrasekar, argued that the additional revenue is vital to ensure that Stack Overflow can keep attracting users and maintaining high-quality information, which in turn will help future chatbots advance knowledge. He added that proper licensing will also help accelerate the development of high-quality LLMs.
Although AI developers have traditionally used software to scrape content from websites, a practice that is typically legal in the US, Stack Overflow argues that LLM developers are violating its terms of service (TOS). Under the TOS, users own the content they post on Stack Overflow, but all of it falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.
AI developers are seeking ways to bring down the cost of developing large-scale AI systems, which require expensive computing hardware to run. Having to pay for data they once grabbed for free could extend the already unclear timelines for turning a profit on their emerging technologies. Although fencing off valuable data could deter some AI training and slow the improvement of LLMs, Chandrasekar maintains that proper licensing will only help accelerate the development of high-quality LLMs.