Facing the Data Drought: How AI Companies are Navigating the Impending End of Open Internet Data

By
April 3, 2024

AI firms have used nearly the whole internet to train their models and now face a data shortage

In their effort to make every large language model better than its predecessor, AI firms have nearly exhausted the accessible internet and now face a data shortage. They may have to train future models on AI-generated data, which comes with its own set of issues.

AI companies are confronting a massive hurdle that could render the hefty investments by major tech firms worthless: they are running out of internet data to train on.

In their quest to build ever bigger and more sophisticated language models, AI firms have used up virtually all available open internet data. They are now on the brink of a data shortage, as reported by the Wall Street Journal.

The problem is prompting some companies to explore alternative sources of training data, such as publicly available video transcripts and AI-generated "synthetic data". Yet using AI-generated data to train AI models presents its own challenge: it increases the likelihood that the models will hallucinate.

Moreover, debates around synthetic data have raised considerable concern about the possible consequences of training AI models on data produced by AI. Experts say that excessive dependence on AI-generated data can lead to what is known as digital "inbreeding," which may eventually cause the AI model to collapse.

Companies such as DatologyAI, founded by Ari Morcos, a former researcher at Meta and Google DeepMind, are investigating techniques for training large models with less data and fewer resources. However, most of the major players are experimenting with unconventional and controversial approaches to training data.

OpenAI, for instance, is considering training its GPT-5 model on transcriptions of public YouTube videos, according to sources cited by the WSJ. The company has already faced backlash for using such videos to train Sora, and it could face legal action from video creators.

Regardless, firms such as OpenAI and Anthropic intend to tackle the issue by creating higher-quality synthetic data, though the details of their techniques remain unclear.

Concerns about AI companies running out of data have been circulating for a while. Although some, such as Epoch researcher Pablo Villalobos, forecast that AI might exhaust its usable training data in the near future, there is a widespread belief that major breakthroughs could alleviate these worries.

Nonetheless, there is another way to address the problem: AI firms could stop building ever bigger and more sophisticated models, given the environmental cost of creating them, including high energy usage and the dependence on rare-earth minerals for producing computing chips.

(Incorporating information from various sources)

Firstpost holds all rights exclusively, as protected by copyright law, as of