An artificial intelligence (AI) program called ChatGPT became the latest overnight sensation a couple of weeks ago. The app generates responses to users’ questions that are alarmingly detailed and nearly flawless in grammar and syntax. Once the developers at OpenAI launched the app publicly, their servers began to struggle with demand.
Among the questions ChatGPT raises is this: now that we have a very good natural language processor, can it be made perfect? Most of us are likely to say it is not possible, that human intelligence and ingenuity will always prevail. But what if the stock of words we use to talk about the world is finite? Large, surely, but countable.
A team of researchers (Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho) has concluded that machine learning (ML) datasets will run out of “high-quality language data” by 2026. Low-quality language data is likely to last until sometime between 2030 and 2050, and there is enough low-quality image data to keep ML programs busy for 10 years longer.
High-quality language data includes books, news articles, scientific papers, Wikipedia and filtered web content. The common element is that the data in these sources has passed through a filter for usefulness or quality. There are two general sources of high-quality data: dedicated contributors to web content and subject matter experts. The former grows with demand for digital content; the latter grows with the strength of the economy and government investment in research and development. In all, about 7 trillion high-quality words are present in these datasets.
Low-quality language data is estimated from five general sources: recorded speech, posts by internet users, popular platforms, CommonCrawl and indexed websites. CommonCrawl is a nonprofit repository of web crawl data that is open to anyone. The researchers estimate that there are 741 trillion low-quality words available.
Language datasets as of October contain about 2 trillion words and have been growing at a rate of about 50% annually. The “stock” of language grows by about 7% annually and is estimated to hold 70 trillion to 70 quadrillion words. That is 1.5 to 4.5 orders of magnitude larger than the largest datasets currently in use. The growth trends in available language data indicate that models will exhaust low-quality language data sometime between 2030 and 2050.
However, estimates of the stock of high-quality data used to train language models range between 4.6 trillion and 17 trillion words. Using these estimates, the researchers note, “We are within one order of magnitude of exhausting high-quality data, and this will likely happen between 2023 and 2027.”
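As a rough illustration of the arithmetic behind these projections, the sketch below compounds a dataset of roughly 2 trillion words at about 50% annual growth and reports the year it would first overtake a fixed stock of words. It is a simplified back-of-the-envelope calculation using the approximate figures quoted above, not the researchers’ actual model; the 2022 starting year is an assumption, and the stock’s own roughly 7% annual growth is ignored.

```python
# Back-of-the-envelope projection of when training datasets, growing from
# roughly 2 trillion words at about 50% per year, would exhaust a given
# stock of words. Figures are the approximate estimates quoted above;
# the stock's own ~7% annual growth is ignored for simplicity.

def exhaustion_year(stock_words, dataset_words=2e12, growth_rate=0.50, start_year=2022):
    """Return the first year the projected dataset size exceeds the stock."""
    year = start_year
    while dataset_words < stock_words:
        dataset_words *= 1 + growth_rate  # compound annual growth
        year += 1
    return year

# High-quality stock: roughly 4.6 trillion to 17 trillion words
print(exhaustion_year(4.6e12), exhaustion_year(17e12))  # ~2025 and ~2028

# Low-quality stock: roughly 70 trillion to 70 quadrillion words
print(exhaustion_year(70e12), exhaustion_year(70e15))   # ~2031 and ~2048
```

Even under these simplifying assumptions, the high-quality stock runs out in the mid-2020s and the low-quality stock between roughly 2030 and 2050, broadly in line with the researchers’ ranges.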
For comparison, in 2019 the Millennium Alliance for Humanity and the Biosphere at Stanford estimated that the world’s oil reserves would run out by 2052, while natural gas reserves would last until 2060 and coal reserves until 2090.