A hand-waving explanation of how ChatGPT works
This won't satisfy the experts, but it explains a lot of what's going on behind the scenes
This post is my attempt to explain in non-technical language how ChatGPT works. After you enter a prompt in the dialogue box, what happens to induce the software to provide a response?
I am not an AI researcher. I don’t have a PhD in computer science or linguistics or anything else. I’ve read a lot about AI over the past year and have synthesized a lot of information about how ChatGPT works, but as with all laymen, my understanding is necessarily superficial and possibly inaccurate. If you are an AI researcher or are otherwise deeply technical, what follows below will appear at best superficially correct and at worst completely wrong. But I think it is directionally accurate, which ought to be sufficient for the non-specialist. Reader beware.
So let’s dive in.
Input processing: The text you input into ChatGPT’s dialogue box is preprocessed. This involves converting the text into a format that the model can understand, via tokenization. Tokenization breaks down your input into smaller pieces, called tokens. These tokens are not just words, but can also be parts of words, especially in languages where words are often concatenated.1
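If you want to see tokenization in action, OpenAI publishes its tokenizer as the open-source tiktoken library. Here is a minimal sketch; cl100k_base is one of OpenAI's encodings (which encoding is correct depends on the model, so treat that choice, and the exact splits you get, as an assumption):

```python
# A rough sketch of tokenization using OpenAI's tiktoken library.
# pip install tiktoken
import tiktoken

# cl100k_base is one of OpenAI's encodings; other models use other
# encodings, so the exact splits below will vary.
enc = tiktoken.get_encoding("cl100k_base")

text = "ChatGPT is amazing"
token_ids = enc.encode(text)                    # a list of integers
pieces = [enc.decode([t]) for t in token_ids]   # the text each token covers

print(token_ids)  # the numeric IDs the model actually sees
print(pieces)     # the corresponding text fragments
```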
Embedding and contextual understanding: Each token is transformed into a numerical representation, called an embedding. These embeddings are designed to capture not just the token itself but also its context within the sentence. This contextual understanding is crucial for the model to grasp the meaning, nuances, and intentions behind your input.
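To make “numerical representation” concrete, here is a toy sketch. Real embeddings have hundreds or thousands of dimensions and are learned during training, not written by hand; the three-dimensional values below are invented purely for illustration:

```python
# Toy illustration: each token maps to a list of numbers (a vector).
# Real embeddings are learned and have far more dimensions; these
# 3-dimensional values are made up for illustration.
embeddings = {
    "cat":  [0.9, 0.1, 0.3],
    "dog":  [0.8, 0.2, 0.3],   # close to "cat": similar meaning
    "bank": [0.1, 0.9, 0.5],   # far from both: unrelated meaning
}

print(embeddings["cat"])  # all the model ever sees is numbers like these
```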
Neural network processing: ChatGPT is built on a type of artificial neural network known as a Transformer, specifically designed for handling sequential data like text. Within this network, your processed input passes through numerous layers of computation.2
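For readers who want one level more detail: the core operation inside each Transformer layer is called attention, in which every token’s representation is updated as a weighted mix of the other tokens’ representations. Here is a minimal toy sketch of scaled dot-product attention, with random stand-in matrices instead of the learned weights, multiple attention heads, and thousands of dimensions a real model uses:

```python
# Minimal sketch of the attention computation at the heart of a
# Transformer layer. Real models use learned weight matrices, many
# attention "heads", and far larger dimensions; this is a toy.
import numpy as np

def attention(Q, K, V):
    """Each row of the output is a weighted mix of the rows of V,
    weighted by how well the corresponding query matches each key."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # query-key similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
    return weights @ V

# Three tokens, each represented by a 4-dimensional vector (made up).
x = np.random.rand(3, 4)
print(attention(x, x, x))  # each token's vector, blended with its context
```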
Language model training and knowledge base: Before interaction with users, ChatGPT is trained on a vast body of text data. This training involves adjusting the parameters of the neural network to minimize the difference between its predictions and the actual outcomes (e.g., the next word in a sentence). The training data and the model’s architecture enable ChatGPT to develop a broad understanding3 of language, context, and world knowledge as of its last update.
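To make “minimize the difference between its predictions and the actual outcomes” concrete: at each training step the model assigns a probability to every candidate next token, and the loss measures how surprised the model was by the token that actually came next. A toy sketch with invented numbers:

```python
# Toy sketch of the training signal: cross-entropy loss on the
# next-token prediction. All numbers here are invented.
import math

# Suppose the model, having read "The cat sat on the", assigns these
# probabilities to candidate next tokens (a real model scores a
# vocabulary of ~100,000 tokens):
predicted = {"mat": 0.4, "roof": 0.3, "piano": 0.2, "dog": 0.1}

actual_next_token = "mat"

# The loss is the negative log-probability of the token that actually
# came next; training nudges the network's parameters to reduce it.
loss = -math.log(predicted[actual_next_token])
print(loss)  # lower is better; 0 would mean the model was certain
```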
Response Generation: When generating a response, the model essentially predicts what comes next given the input. This involves calculating probabilities4 for each possible next token and selecting tokens based on these probabilities, iteratively, until a complete response is formed. This process is guided by a set of rules and criteria, such as maximizing coherence, relevance, and following specific content policies5, to ensure the response is appropriate and informative.
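The generation loop itself is conceptually simple: ask the model for next-token probabilities, pick a token, append it, and repeat. Here is a sketch with a fake stand-in model; real systems layer on sampling controls (temperature, top-p cutoffs) and the content policies mentioned above:

```python
# Sketch of the generation loop: sample a next token from the model's
# probability distribution, append it, repeat. The "model" here is a
# stand-in that returns made-up probabilities.
import random

def fake_model(tokens):
    # A real model would return a probability for every token in its
    # vocabulary, conditioned on everything generated so far.
    return {"Hello": 0.5, "there": 0.3, "!": 0.1, "<end>": 0.1}

tokens = ["User:", "Hi"]
while True:
    probs = fake_model(tokens)
    choices, weights = zip(*probs.items())
    next_token = random.choices(choices, weights=weights)[0]
    if next_token == "<end>" or len(tokens) > 20:
        break
    tokens.append(next_token)

print(" ".join(tokens))
```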
Post-processing: Before the response is displayed, there might be some post-processing to ensure it adheres to guidelines, is formatted correctly, and maintains a conversational tone.
Feedback and iteration: When you receive and react to the model’s output, your responses can be used as feedback, helping to inform future responses and improve the model’s performance over time.
This entire process happens almost instantaneously, which of course showcases the efficiency and power of modern deep learning models in natural language processing.
More about tokenization and embedding
Tokenization and embedding are fundamental concepts in natural language processing (NLP), especially in the context of models like ChatGPT. They represent the initial steps in transforming human language into a format that a machine learning model can understand and process.
Tokenization is the process of converting text into smaller units called tokens. In the context of NLP, tokens are often words, but they can also be parts of words or even punctuation marks. The choice of tokenization method can significantly impact the performance of an NLP model.
Word tokenization is the simplest form of tokenization, where the text is split into words. For example, consider the sentence “ChatGPT is amazing”. The word-tokenized form of this sentence would be “ChatGPT”, “is”, “amazing”.
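In code, word tokenization can be as simple as splitting on whitespace:

```python
sentence = "ChatGPT is amazing"
tokens = sentence.split()  # naive word tokenization on whitespace
print(tokens)              # ['ChatGPT', 'is', 'amazing']
```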
Subword tokenization is a more sophisticated form of tokenization, and is commonly used in models like GPT. It splits words into smaller pieces, which can represent common prefixes, roots, and suffixes. This approach helps in dealing with out-of-vocabulary words and improves the model’s ability to handle diverse languages and jargon. “Unbelievable” might be split into “un”, “believ”, “able”. As mentioned above, this tokenization method is especially apt for languages with very long words, like German.
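To show the mechanism rather than any particular model’s vocabulary, here is a toy greedy longest-match subword tokenizer. The vocabulary below is invented; real subword tokenizers (such as byte-pair encoding) learn theirs from data, so actual splits will differ:

```python
# Toy greedy longest-match subword tokenizer. The vocabulary is made
# up; real subword vocabularies (e.g. byte-pair encoding) are learned.
VOCAB = {"un", "believ", "able", "chat", "gpt", "is", "amaz", "ing"}

def subword_tokenize(word):
    tokens, start = [], 0
    while start < len(word):
        # Take the longest vocabulary entry matching at this position.
        for end in range(len(word), start, -1):
            piece = word[start:end].lower()
            if piece in VOCAB:
                tokens.append(piece)
                start = end
                break
        else:
            tokens.append(word[start])  # unknown character: keep as-is
            start += 1
    return tokens

print(subword_tokenize("Unbelievable"))  # ['un', 'believ', 'able']
```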
Once the text is tokenized, each token is converted into a numerical form called an embedding. These embeddings are arrays of numbers6 that represent the tokens. The embedding process is crucial for several reasons:
Capturing semantics: Embeddings are designed to capture the meaning of tokens. Words with similar meanings are positioned closer together in the numerical space the model operates in (a short code sketch after this list makes this concrete).
Contextual awareness: Advanced models like ChatGPT use contextual embeddings, meaning the representation of a word can change based on the other words around it. For example, the embedding for “bank” would be different in “river bank” versus “bank account”. It’s important to remember that these embeddings are a series of numbers, and the software doesn’t understand the words’ definitions in the way that humans do. All the software sees is that the sequence of numbers representing the embedding for bank-as-river-bank is different from the sequence of numbers representing the embedding for bank-as-bank-account.
Simplification: Embeddings allow a vast vocabulary to be represented in a more manageable way. This simplification is critical for processing efficiency.
Training: Embeddings are initialized randomly and then fine-tuned during the model training process. The training adjusts these embeddings so that the distances between words in the model correspond to their semantic and syntactic relationships. “Baseball” and “pitcher” end up closely related when “pitcher” refers to the player who throws the ball, but these two words are not closely related when “pitcher” refers to a container which holds liquid.
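To make the baseball/pitcher example concrete, here is a sketch using cosine similarity, a standard way to measure how close two embeddings point. The vectors are invented; in a real model they would be learned:

```python
# Sketch of "semantically similar words sit closer together".
# These 3-dimensional vectors are invented; real learned embeddings
# have hundreds or thousands of dimensions.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm  # 1.0 = same direction, near 0.0 = unrelated

baseball = [0.9, 0.8, 0.1]
pitcher_player = [0.85, 0.75, 0.2]   # "pitcher" in the baseball sense
pitcher_jug = [0.1, 0.2, 0.9]        # "pitcher" as a container

print(cosine_similarity(baseball, pitcher_player))  # high: related senses
print(cosine_similarity(baseball, pitcher_jug))     # low: unrelated senses
```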
So, tokenization and embedding are the initial steps in translating human language into a format that NLP models can process, setting the stage for the complex machine learning tasks that follow in understanding and generating language.
1. For example, German, which famously strings words together into very long compounds.
2. The details of this computation escape me.
3. It’s important to understand that the model doesn’t understand text like a human understands text. The model sees a sequence of numbers, which are embeddings of tokens. At least I’m pretty sure that this is accurate.
4. This is where the notion of the “stochastic parrot” comes from. The argument is that since the model assigns a probability to decide the next word in a piece of text it generates, it doesn’t understand language in any way that we use the word “understand.”
5. A lot of what you read about AI safety-related issues pertains to these content policies.
6. I’ve seen these referred to as “high-dimensional vectors,” which, as far as I can tell, just means lists containing a great many numbers.