Data Dilemma: The Impending Crisis for AI Companies and the Controversial Quest for Alternative Training Data Sources

AI firms have nearly exhausted the open web to train their systems and now face a data shortage

Striving to make every Large Language Model (LLM) better than its predecessor, AI firms have nearly depleted the accessible internet and are facing a data shortage. They may have to resort to AI-generated data to train their next models, which brings its own set of challenges.

AI firms are confronting a significant hurdle, one that could render worthless the massive investments tech giants have made in them: they are running out of internet data to train on.

In the race to build ever larger and more sophisticated language models, AI firms have virtually devoured the entire open internet. They are now on the brink of a data shortage, according to a report from the Wall Street Journal.

This problem is pushing several companies to look for other sources of training data, such as transcripts of publicly available videos and AI-generated "synthetic data". However, training AI models on AI-generated data presents its own challenges: it increases the risk of models producing fabricated output, often called hallucinations.

Moreover, debates over synthetic data have raised significant concerns about the implications of training AI models on data produced by AI. Specialists warn that over-dependence on such data may cause digital "inbreeding", which could ultimately lead the model to collapse.

Companies such as DatologyAI, founded by Ari Morcos, who previously worked at Meta and Google DeepMind, are investigating ways to train large models using less data and fewer resources. Most leading organizations, however, are experimenting with unorthodox and controversial ways of sourcing training data.

OpenAI, for instance, is considering using transcripts of publicly accessible YouTube videos to train its GPT-5 model, according to the Wall Street Journal's sources. However, the company is already under scrutiny for using such videos to train Sora, and it could face legal action from video creators.

Despite this, companies such as OpenAI and Anthropic aim to tackle the shortage by creating higher-quality synthetic data. However, they have not yet fully disclosed the details of their approach.

Concerns that AI companies will run out of data have been circulating for a while. Although some researchers, such as Epoch's Pablo Villalobos, project that AI may exhaust its available training data in the near future, there is a strong belief that major breakthroughs could ease these worries.

Nonetheless, there is another way to address the problem: AI firms could choose not to chase ever bigger and more sophisticated models, given the environmental cost of building them, which includes substantial energy use and a dependence on rare earth elements for processing chips.

(Incorporating information from various sources)


Firstpost holds all rights and protections © 2024.
