Large Language Model (LLM)

Large language models (LLMs) are an artificial intelligence (AI) approach to developing computer systems that can understand and communicate in natural human language; that is, systems that can listen, think, and respond to human inquiries. Language modeling is an approach to increasing the language intelligence of computers. Trained on large amounts of data, large language models attempt to predict the most appropriate sequence of words.

Background

The idea that computers can think dates back to the post–World War II period. In 1950, British mathematician Alan Turing famously asked, “Can machines think?” In the ensuing years, researchers have spent considerable time attempting to answer just that question.

Language modeling exists in various forms, of which the LLM is one. Statistical language modeling was an early attempt, dating to the 1990s; these models tried to predict the next word based on the most recent context. Although an advancement, the method was prone to errors and had limitations. Neural language models, introduced in the early 2010s, build networks among nodes that capture complex language relationships to predict natural language sequences; they were a significant advancement in natural language processing. Pre-trained language models create context-aware text by pre-training a long short-term memory network; the model then undergoes fine-tuning to adjust to its assigned task. LLMs are pre-trained models that use tens of billions or hundreds of billions of parameters and are trained on very large amounts of data.
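
The statistical approach described above can be illustrated with a toy sketch. The following Python example (the corpus and function names are hypothetical, chosen for illustration) builds a bigram model that predicts the next word from the single most recent word by counting word pairs:

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count how often each word follows each other word."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(model, word):
    """Return the most frequent successor of `word`, if any."""
    successors = model.get(word.lower())
    if not successors:
        return None
    return successors.most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often
```

A real statistical language model conditions on longer contexts and smooths its counts, but it shares this basic limitation: it only sees a fixed window of recent words, which is why such models were prone to errors.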

Overview

LLMs require a massive amount of high-quality data because their capabilities depend on the pre-trained body of data, or corpus. Most LLMs use a combination of publicly available textual datasets for pre-training. These data include both general and specialized data. General data commonly come from books, webpages, and conversational text, which are easily available and very large. For example, Project Gutenberg offers access to the text of more than seventy thousand books, and Wikipedia contains nearly sixty million articles. Conversational text is drawn from social media sites such as Reddit and Facebook.

An LLM’s corpus may also include specialized data, such as multilingual data, scientific data, and code, that relates to its specific task or tasks. For example, code training models often rely on GitHub, a platform hosting open-source code that can be accessed and used by anyone. In addition, services such as CommonCrawl browse the web and gather information from a variety of sources.

Most LLMs use a combination of these sources for pre-training. About 50 percent of the data used by Google’s PaLM is conversational data, followed by webpages (31 percent), books and news (14 percent), and code (5 percent). Most of OpenAI’s GPT-4’s roughly 13 trillion tokens (a token is a basic unit of text or code) come from CommonCrawl and RefinedWeb. DeepMind’s specialized LLM, AlphaCode, draws its 41 billion tokens from code sources.

To prepare this massive amount of data for pre-training, it must go through a process that retains high-quality data and removes low-quality data. First, the data undergo filtering based on language, metrics, statistics, and keywords; this step would flag the junk characters in “Hazel is singing a song. &@%# Hazel is singing a song.” Then the data are de-duplicated: “Hazel is singing a song. Hazel is singing a song.” Next, personal information is removed: “Hazel [Somebody] is singing a song.” Finally, once the text is tokenized (broken into sequences of individual tokens), it is ready for pre-training.
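
The filtering, de-duplication, scrubbing, and tokenization steps above can be sketched in Python. This is a minimal illustration using the article’s “Hazel” example; the helper names are hypothetical, the quality filter is a toy regular expression, and production pipelines use far more sophisticated classifiers and subword tokenizers:

```python
import re

def filter_quality(lines):
    """Drop lines containing junk symbols (a stand-in for language,
    metric, statistic, and keyword filtering)."""
    return [ln for ln in lines if not re.search(r"[&@%#]", ln)]

def deduplicate(lines):
    """Remove exact duplicate lines, preserving order."""
    seen, out = set(), []
    for ln in lines:
        if ln not in seen:
            seen.add(ln)
            out.append(ln)
    return out

def scrub_names(lines, names):
    """Replace known personal names with a placeholder."""
    for name in names:
        lines = [ln.replace(name, "[Somebody]") for ln in lines]
    return lines

def tokenize(lines):
    """Naive whitespace tokenization into sequences of tokens."""
    return [ln.split() for ln in lines]

raw = [
    "Hazel is singing a song.",
    "&@%# Hazel is singing a song.",
    "Hazel is singing a song.",
]
clean = scrub_names(deduplicate(filter_quality(raw)), ["Hazel"])
print(tokenize(clean))  # [['[Somebody]', 'is', 'singing', 'a', 'song.']]
```

Each stage takes and returns a list of lines, so the stages compose in any order the pipeline designer chooses; here filtering runs first so that duplicates of junk lines never need to be compared.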

After pre-training, prompting strategies are used to enable LLMs to solve tasks. In-context learning is a prompting method that formulates a task description or demonstration in natural language text. Chain-of-thought prompting employs a series of intermediate reasoning steps, and planning prompts can solve complex problems by breaking tasks into smaller subtasks that are solved in succession.
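
Because these strategies work purely through the text of the prompt, they can be illustrated by assembling strings. The sketch below (the function name and the worked example are hypothetical) builds an in-context learning prompt; making the demonstration spell out its reasoning steps turns it into a chain-of-thought prompt:

```python
def build_icl_prompt(task, examples, query):
    """Assemble an in-context learning prompt: a task description,
    a few demonstrations, then the new query left open for the model."""
    parts = [task]
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# Chain-of-thought: the demonstration answer shows intermediate
# reasoning steps, encouraging the model to reason before answering.
cot_examples = [
    ("Roger has 5 balls and buys 2 cans of 3 balls each. How many now?",
     "He buys 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11."),
]
prompt = build_icl_prompt(
    "Answer the math question, showing your reasoning.",
    cot_examples,
    "A pack has 4 pens. How many pens are in 3 packs?",
)
print(prompt)
```

The finished string is what gets sent to the model; no retraining is involved, which is what makes prompting strategies cheap to apply compared with fine-tuning.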

While modern LLMs can produce human-like text, including tone, and have pushed the envelope of artificial intelligence, they are not without problems. One problem is LLMs’ tendency to hallucinate, meaning that when generating text, they may provide information that conflicts with the source data (intrinsic hallucination) or cannot be verified by the source data (extrinsic hallucination). Even the most advanced LLM systems, such as GPT-4, released in 2023, sometimes suffer from hallucinations.

Another problem yet to be fully addressed is how to keep LLMs updated with current data. That is, once a model is trained, it must somehow be kept current to reflect the changing nature of knowledge and language. Training LLMs is time-consuming and expensive; training GPT-4 reportedly took months at an estimated cost of $100 million. Although some attempts have been made to keep LLMs up to date, how to do so effectively and efficiently needs further investigation.

Finally, LLMs are vulnerable to producing biased output. Because computers do not have human values or preferences, they cannot self-limit to omit unintentionally harmful statements or words. Much of the literature on this topic focuses on how to produce a response that, depending on the task at hand, is helpful, harmless, correct, and/or honest. To this end, LLMs rely on reinforcement learning from human feedback (commonly known by its acronym, RLHF). However, RLHF is an imperfect approach, as it relies on human feedback, which is, by its nature, not free from bias.

In the 2020s, LLMs were developing rapidly and had already made a significant impact in numerous fields. For example, in medicine, LLMs could extract biological information, provide medical and mental health consultations, and simplify report processing. Chatbots are a common feature of smart assistants (for example, Amazon’s Alexa and Apple’s Siri) and of e-commerce, assisting with everything from searching for items to troubleshooting billing or technical issues. In education, LLMs have performed highly on tests and exams and can provide generally consistent writing advice. The scientific and finance fields are also highly promising areas for LLMs.

Bibliography

Eloundou, Tyna, Sam Manning, Pamela Mishkin, and Daniel Rock. “GPTs Are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models (v4).” arXiv, 23 Mar. 2023, arxiv.org/abs/2303.10130. Accessed 7 Aug. 2023.

Hiter, Shelby. “What Is a Large Language Model?” eWeek, 6 June 2023, www.eweek.com/artificial-intelligence/large-language-model/. Accessed 7 Aug. 2023.

Kaddour, Jean, Joshua Harris, Maximilian Mozes, et al. “Challenges and Applications of Large Language Models.” arXiv, 19 July 2023, arxiv.org/abs/2307.10169. Accessed 7 Aug. 2023.

Lutkevich, Ben. “12 of the Best Large Language Models.” TechTarget, 14 July 2023, www.techtarget.com/whatis/feature/12-of-the-best-large-language-models. Accessed 7 Aug. 2023.

Tam, Adrian. “What Are Large Language Models.” Machine Learning Mastery, 20 July 2023, machinelearningmastery.com/what-are-large-language-models/. Accessed 7 Aug. 2023.

Zhao, Wayne Xin, Kun Zhou, Junyi Li, et al. “A Survey of Large Language Models (v11).” arXiv, 29 June 2023, arxiv.org/abs/2303.18223. Accessed 7 Aug. 2023.