AI Weekly: New Architectures Could Make Large Language Models More Scalable



Starting in earnest with OpenAI’s GPT-3, the focus in natural language processing shifted to large language models (LLMs). LLMs – distinguished by the amount of data, computation, and storage needed to develop them – are capable of impressive feats of language comprehension, such as generating code and writing rhyming poems. But as a growing number of studies point out, LLMs are impractical for most researchers and organizations to train. Not only that, but they consume an amount of energy that calls into question their long-term sustainability.

New research suggests that doesn’t have to be the case forever, however. In a recent paper, Google introduced the Generalist Language Model (GLaM), which the company claims is one of the most efficient LLMs of its size and type. Despite having 1.2 trillion parameters – roughly seven times as many as GPT-3 (175 billion) – Google says GLaM improves on popular language benchmarks while using “significantly” fewer computations during inference.

“Our large-scale language model, GLaM, performs competitively on zero-shot and one-shot learning and is a more efficient model than prior dense monolithic counterparts,” the Google researchers behind GLaM wrote in a blog post. “We hope that our work will spur more research into compute-efficient language models.”

Sparsity vs. density

In machine learning, parameters are the parts of the model learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. DeepMind’s recently detailed Gopher model has 280 billion parameters, while Microsoft and Nvidia’s Megatron 530B has 530 billion. Both are among the best – if not the top – performers on key natural language benchmark tasks, including text generation.

But training a model like Megatron 530B requires hundreds of GPU- or accelerator-equipped servers and millions of dollars. It’s also bad for the environment. GPT-3 alone used 1,287 megawatt-hours during training and produced 552 metric tons of carbon dioxide emissions, according to a Google study – roughly equivalent to the annual emissions of 58 homes in the United States.

What sets GLaM apart from most LLMs to date is its “mixture of experts” (MoE) architecture. An MoE model can be thought of as having layers of “submodels,” or experts, specialized for different text. The experts in each layer are controlled by a “gating” component that taps the experts depending on the text. For a given word or part of a word, the gating component selects the two most appropriate experts to process the word or word part and make a prediction (e.g., generate text).
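The top-2 routing described above can be sketched in a few lines of code. This is a minimal illustration of the general technique, not GLaM’s actual implementation; all names, sizes, and the use of random linear maps as stand-in experts are assumptions for the demo.

```python
import numpy as np

# Minimal sketch of top-2 mixture-of-experts routing: a gating network
# scores each expert for a token, and only the two highest-scoring
# experts actually process that token. Sizes here are toy values
# (GLaM itself uses 64 experts per MoE layer).

rng = np.random.default_rng(0)

N_EXPERTS = 8   # illustrative; far fewer than a real MoE layer
D_MODEL = 16    # toy hidden dimension

# Each "expert" is just a random linear map in this sketch.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(N_EXPERTS)]
# The gating network is a single linear layer producing one score per expert.
gate_w = rng.standard_normal((D_MODEL, N_EXPERTS))

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-2 experts, weighted by gate scores."""
    scores = token @ gate_w           # one raw logit per expert
    top2 = np.argsort(scores)[-2:]    # indices of the 2 best-scoring experts
    weights = np.exp(scores[top2])
    weights /= weights.sum()          # softmax over just the chosen pair
    # Only the selected experts run; the other N_EXPERTS - 2 stay idle,
    # which is where the compute savings come from.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top2))

token = rng.standard_normal(D_MODEL)
out = moe_layer(token)
print(out.shape)  # (16,)
```

Because each token only touches two experts, adding more experts grows the model’s capacity without growing the per-token compute.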

The full version of GLaM has 64 experts per MoE layer, with 32 MoE layers in total, but only activates a subnetwork of 97 billion parameters (8% of 1.2 trillion) per word or word part during processing. “Dense” models like GPT-3 use all of their parameters for processing, significantly increasing the computational – and financial – requirements. For example, Nvidia says that processing with Megatron 530B can take over a minute on a CPU-based on-premises server. It takes half a second on two of Nvidia’s purpose-built DGX systems, but just one of those systems can cost anywhere from $7 million to $60 million.
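A quick back-of-the-envelope check makes the sparse-vs-dense contrast concrete, using the parameter counts quoted in this article:

```python
# Figures from the article: GLaM activates roughly 97B of its 1.2T
# parameters per token, while a dense model like GPT-3 uses all 175B
# of its parameters for every token.
glam_total = 1.2e12
glam_active = 97e9
gpt3_active = 175e9  # dense: active parameters == total parameters

active_fraction = glam_active / glam_total
print(f"{active_fraction:.0%} of GLaM's parameters are active per token")

# Despite being ~7x larger overall, GLaM touches fewer parameters
# per token than GPT-3 does:
print(glam_active < gpt3_active)  # True
```

In other words, GLaM’s per-token workload is smaller than GPT-3’s even though its total capacity is much larger.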

GLaM isn’t perfect – it exceeds or is on par with the performance of a dense LLM on between 80% and 90% (but not all) of tasks. And GLaM uses more computation during training, because it trains on a dataset with more words and word parts than most LLMs. (Versus the billions of words from which GPT-3 learned language, GLaM ingested a dataset that was initially over 1.6 trillion words in size.) But Google says GLaM uses less than half the power needed to train GPT-3: 456 megawatt-hours (MWh) versus 1,286 MWh. For context, a single megawatt is enough to power around 796 homes for a year.

“GLaM is a new step in the industrialization of large language models. The team applies and refines many modern tweaks and advancements to improve the performance and inference cost of this latest model, and comes away with an impressive feat of engineering,” Connor Leahy, a data scientist at EleutherAI, an open AI research collective, told VentureBeat. “While there is nothing scientifically groundbreaking about this latest iteration of the model, it shows just how much engineering effort companies like Google are devoting to LLMs.”

Future work

GLaM, which builds on Google’s own Switch Transformer, a trillion-parameter MoE model detailed in January, follows on other techniques to improve LLM efficiency. A separate team of Google researchers has proposed the Fine-Tuned Language Net (FLAN), a model that outperforms GPT-3 “by a large margin” on a number of challenging benchmarks despite being smaller (and more energy-efficient). And DeepMind claims that another of its language models, Retro, can beat LLMs 25 times its size, thanks to an external memory that allows it to look up passages of text on the fly.

Of course, efficiency is just one hurdle to overcome when it comes to LLMs. Following similar investigations by AI ethicists Timnit Gebru and Margaret Mitchell, among others, DeepMind last week highlighted some of the problematic tendencies of LLMs, which include perpetuating stereotypes, using toxic language, leaking sensitive information, providing false or misleading information, and performing poorly for minority groups.

Solutions to these problems aren’t immediately forthcoming. But the hope is that architectures like MoE (and perhaps GLaM-like models) will make LLMs more accessible to researchers, enabling them to explore potential ways to fix – or at least mitigate – the worst of the problems.

For AI coverage, send news tips to Kyle Wiggers – and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thanks for reading,

Kyle Wiggers

AI Staff Writer



