Data Drought: The Impending Crisis of AI Companies Running Out of Internet


AI companies are exhausting the internet's data in their effort to improve their models

In the push to build each successive LLM, or large language model, AI companies have nearly exhausted the openly accessible internet and now face a data shortage. This may push them to rely on AI-generated data to train future models, which brings its own set of problems.

AI firms are confronting a significant hurdle that could render the massive investments from major technology companies futile: they are running out of internet data.

In their quest to create ever larger and more sophisticated language models, AI firms have scraped virtually all of the open internet. Now they are on the brink of a data shortage, as reported by the Wall Street Journal.

The problem is pushing some companies to look for other ways to gather training data, such as using transcripts of publicly available videos and creating AI-generated "synthetic data". But using AI-generated data to train AI models raises a separate issue: it increases the risk of models producing false information.

The debate over synthetic data has raised serious concerns about the consequences of training AI systems on data produced by AI. Experts warn that over-reliance on AI-generated data can create a digital "inbreeding" effect, potentially causing the model to degrade and collapse.

Organisations such as Dataology, founded by Ari Morcos, a former researcher at Meta and Google DeepMind, are exploring ways to train large models using less data and compute. Most of the major players, however, are turning to unusual and controversial methods of sourcing training data.

According to the Wall Street Journal, OpenAI is considering using transcripts of publicly available YouTube videos to train its GPT-5 model. The approach has already drawn criticism when used for training the Sora system, and it could expose the company to legal disputes with the creators of those videos.

Even so, companies such as OpenAI and Anthropic plan to tackle the problem by creating higher-quality synthetic data, though the details of their techniques remain vague.

Warnings about AI's data supply have been circulating for some time. Although some, such as Epoch researcher Pablo Villalobos, forecast that AI may run out of useful training data in the near future, many in the field believe that substantial advances could ease those worries.

Nonetheless, there is another way out of the problem: AI firms could simply stop building ever bigger and more sophisticated models, given the environmental cost of creating them, which involves substantial energy use and a dependence on rare-earth minerals for processor chips.

(Incorporating information from various sources)

