Foundations

In this section, we will explore the foundations of Generative AI (Gen AI). We will cover the fundamental concepts, techniques, and tools that form the basis of Gen AI. This section is designed to provide you with a solid understanding of the core principles and methodologies that underpin Gen AI, setting the stage for more advanced topics in subsequent sections.

LLM stands for Large Language Model. It is a type of artificial intelligence model trained on vast amounts of text to understand and generate natural language. LLMs produce human-like text based on the input they receive, and they can perform a wide range of tasks, including answering questions, writing content, coding, summarizing, translating, reasoning, and holding conversations.

GPT stands for Generative Pre-trained Transformer. Let’s break down the name:

  • Generative: It can generate new content based on the input it receives. In other words, it can create text, code, or other content that did not exist before.
  • Pre-trained: It generates content using knowledge acquired during pre-training on a large dataset of text from the internet, books, articles, and other sources. This training allows it to learn grammar, facts, some reasoning ability, and even a degree of common sense.
  • Transformer: It is based on the transformer architecture, which is a type of neural network that excels at processing sequential data such as text. In simple terms, a transformer is like a black box that takes an input sequence and produces an output based on the relationships it learns between different parts of the input.

When you enter a prompt such as "What is the capital of France?", the text is first converted into tokens and then passed to the LLM. The model predicts the next token based on the input and what it learned during training, appends that token to the context, and repeats, generating tokens one by one until it produces an end-of-sequence token. Finally, detokenization converts the generated tokens back into human-readable text.

The diagram below illustrates how GPT generates a response based on an input prompt:

graph TD
    A["Input Prompt<br/>'What is the capital of France?'"]
    B["Tokenization<br/>['What',' is',' the',' capital',' of',' France','?']"]
    C["Transformer Receives Current Context"]
    D["Predict Next Token<br/>→ 'The'"]
    E["Append Token To Context"]
    F["Current Context<br/>'What is the capital of France? The'<br/>Send back to transformer"]
    G["Predict Next Token<br/>→ ' capital'"]
    H["Append Token To Context"]
    I["Current Context<br/>'What is the capital of France? The capital'<br/>Send back to transformer"]
    J["Predict Next Token<br/>→ ' of'"]
    K["Append Token To Context"]
    L["Current Context<br/>'What is the capital of France? The capital of'<br/>Send back to transformer"]
    M["... Repeat Until End Token ..."]
    N["Predict Next Token<br/>→ &lt;EOS&gt;"]
    O["Stop Generation"]
    P["Detokenization"]
    Q["Final Output<br/>'The capital of France is Paris.'"]
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> C
    C --> G
    G --> H
    H --> I
    I --> C
    C --> J
    J --> K
    K --> L
    L --> M
    M --> C
    M --> N
    N --> O
    O --> P
    P --> Q
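
The same loop can be sketched in a few lines of Python. This is only an illustration: `CANNED_RESPONSE` and `predict_next_token` are hypothetical stand-ins for a real model, which would compute each prediction from the context with a neural network rather than read it from a fixed list.

```python
# Toy stand-in for a real LLM: a fixed script of "predictions".
# CANNED_RESPONSE and predict_next_token are hypothetical, for illustration only.
EOS = "<EOS>"
CANNED_RESPONSE = ["The", " capital", " of", " France", " is", " Paris", ".", EOS]


def predict_next_token(context, prompt_length):
    """Return the next token for the current context.

    A real model would run the whole context through the transformer;
    here we only count how many tokens have been generated so far.
    """
    generated_so_far = len(context) - prompt_length
    return CANNED_RESPONSE[generated_so_far]


def generate(prompt_tokens, max_new_tokens=20):
    context = list(prompt_tokens)      # the current context starts as the prompt
    output_tokens = []
    for _ in range(max_new_tokens):
        token = predict_next_token(context, len(prompt_tokens))
        if token == EOS:               # end-of-sequence token: stop generating
            break
        output_tokens.append(token)
        context.append(token)          # append the token and feed the context back in
    return "".join(output_tokens)      # detokenization: join tokens into text


prompt = ["What", " is", " the", " capital", " of", " France", "?"]
print(generate(prompt))  # The capital of France is Paris.
```

The important part is the loop itself: predict one token, append it to the context, and repeat until the end token appears.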

First, let’s understand why we need tokens.

As we know, computers are very good at processing numbers, but they cannot directly understand human language. To bridge this gap, text must first be converted into a format that computers can work with. This is where tokens come in.

A token is a small unit of text that an LLM understands and processes internally. In simple terms, a token can represent a character, a word, part of a word, punctuation, or even a special symbol. Each token is assigned a unique identifier (token ID), which is simply a unique number used by the model.

Tokens are generally categorized into four types:

  • Word Tokens: These are complete words that are treated as a single token. For example, the word "cat" would be a single token.
  • Subword Tokens: These are parts of words that are treated as separate tokens. For example, the word "unhappiness" might be split into "un", "happi", and "ness".
  • Character Tokens: These are individual characters that are treated as separate tokens. For example, the word "cat" would be split into "c", "a", and "t".
  • Special Tokens: These are tokens used for specific purposes, such as marking the end of a sentence, padding input, or representing special instructions to the model.
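
To make the difference between these types concrete, here is a minimal sketch in plain Python. The vocabulary `SUBWORD_VOCAB`, the special token names, and the greedy longest-match rule are assumptions chosen for illustration; real tokenizers learn their vocabulary from data and use more sophisticated algorithms.

```python
# Word tokens: split on whitespace, each word is one token.
word_tokens = "the cat sat".split()

# Character tokens: every character is its own token.
char_tokens = list("cat")

# Special tokens: reserved markers (these names are hypothetical examples).
SPECIAL_TOKENS = ["<EOS>", "<PAD>"]

# Subword tokens: greedy longest-match against a tiny, made-up vocabulary.
SUBWORD_VOCAB = {"un", "happi", "ness", "cat"}


def subword_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(word[start])
            start += 1
    return tokens


print(word_tokens)                                     # ['the', 'cat', 'sat']
print(char_tokens)                                     # ['c', 'a', 't']
print(subword_tokenize("unhappiness", SUBWORD_VOCAB))  # ['un', 'happi', 'ness']
```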

What do tokens look like? You can visualize them using the website: https://tiktokenizer.vercel.app/

Different LLMs use different tokenization methods, so the same text may be split into different tokens depending on the model.

Tokenization is the process of converting raw text into tokens that an LLM can understand and process.

For example:

I love programming.

might be converted into:

["I", " love", " programming", "."]

Each token is then assigned a unique token ID. This allows the model to work with numbers instead of raw text, which is how computers process information internally.

Detokenization is the reverse of tokenization.

After the model generates a response as a sequence of tokens, detokenization combines those tokens back into human-readable text.

For example:

["I", " love", " programming", "."]

becomes:

I love programming.

This is the final text that you see as the model’s response.
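
The round trip between tokens, token IDs, and text can be sketched with a toy vocabulary. The IDs in `TOKEN_TO_ID` are made up for this example; a real model has its own vocabulary with tens of thousands of entries.

```python
# A tiny, made-up vocabulary mapping each token to a unique ID.
TOKEN_TO_ID = {"I": 0, " love": 1, " programming": 2, ".": 3}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def encode(tokens):
    """Tokenization, second half: look up the ID of each token."""
    return [TOKEN_TO_ID[token] for token in tokens]


def decode(token_ids):
    """Detokenization: map IDs back to tokens and join them into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


tokens = ["I", " love", " programming", "."]
ids = encode(tokens)
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # I love programming.
```

Note that the leading spaces are part of the tokens, which is why joining them with an empty string restores the original sentence.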

First, install the tiktoken library, a fast tokenizer.

pip install tiktoken

Then you can use the following code to see how tokenization and detokenization work:

import tiktoken

# Load the tokenizer that matches the GPT-4 model
encoder = tiktoken.encoding_for_model("gpt-4")

text = "Hello world, welcome to LLMs!"

# Tokenization: text -> list of integer token IDs
tokens = encoder.encode(text)
print("Tokens:", tokens)

# Detokenization: token IDs -> original text
decoded_text = encoder.decode(tokens)
print("Decoded Text:", decoded_text)

First, let’s understand why we need vector embeddings.

As we learned earlier, computers work with numbers, not meanings. While tokenization converts text into token IDs, those IDs are just numbers and do not tell the model anything about the meaning of the words.

This is where vector embeddings come in.

A vector embedding is a way of converting words, tokens, images, or other data into a list of numbers that represents their meaning. These numbers help the model understand how different pieces of data are related to each other.

In simple terms, embeddings help the model understand that some words are more closely related than others.

For example, consider these words:

Dog
Cat
Python
JavaScript

The model learns that Dog and Cat are closely related because they are both animals. Similarly, Python and JavaScript are closely related because they are both programming languages.

Because of this, their embeddings become more similar to each other than to unrelated words.
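
One common way to measure how similar two embeddings are is cosine similarity. The sketch below uses hand-made 3-dimensional vectors chosen only for illustration; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Hand-made toy embeddings (not taken from any real model).
EMBEDDINGS = {
    "Dog":        [0.9, 0.8, 0.1],
    "Cat":        [0.8, 0.9, 0.1],
    "Python":     [0.1, 0.2, 0.9],
    "JavaScript": [0.2, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return a value close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dog_cat = cosine_similarity(EMBEDDINGS["Dog"], EMBEDDINGS["Cat"])
dog_python = cosine_similarity(EMBEDDINGS["Dog"], EMBEDDINGS["Python"])
print(f"Dog vs Cat:    {dog_cat:.2f}")
print(f"Dog vs Python: {dog_python:.2f}")
```

With these toy vectors, Dog and Cat score much higher than Dog and Python, which mirrors the idea that related words end up with similar embeddings.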

Embeddings also help the model understand relationships between concepts. For example, if the model repeatedly sees sentences such as:

Delhi is the capital of India.
Paris is the capital of France.

it learns that Delhi is related to India in much the same way that Paris is related to France.

This ability to capture meaning and relationships is what makes embeddings so powerful. They help AI models understand context, find related information, perform semantic search, and generate more meaningful responses.
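
The capital-of relationship can be pictured as a direction in the embedding space. The following sketch uses made-up vectors built so that the relationship holds exactly; in real learned embeddings, such relationships hold only approximately, if at all.

```python
# Toy embeddings with dimensions [is_a_city, India-related, France-related].
# These values are invented for illustration, not taken from a real model.
EMBEDDINGS = {
    "Delhi":  [1.0, 1.0, 0.0],
    "India":  [0.0, 1.0, 0.0],
    "Paris":  [1.0, 0.0, 1.0],
    "France": [0.0, 0.0, 1.0],
}


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def nearest(vector, exclude=()):
    """Return the word whose embedding is closest to the given vector."""
    candidates = [word for word in EMBEDDINGS if word not in exclude]
    return min(
        candidates,
        key=lambda word: sum((x - y) ** 2 for x, y in zip(EMBEDDINGS[word], vector)),
    )


# "Delhi is to India" gives a direction meaning roughly "capital of".
capital_offset = subtract(EMBEDDINGS["Delhi"], EMBEDDINGS["India"])

# Applying the same direction to France should land near its capital.
guess = add(EMBEDDINGS["France"], capital_offset)
print(nearest(guess, exclude={"France"}))  # Paris
```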

To visualize vector embeddings, try the TensorFlow Embedding Projector: https://projector.tensorflow.org/

First, let’s understand why we need positional encoding.

Transformers process all tokens at the same time (in parallel). While this makes them very fast and efficient, it also means they do not naturally understand the order of words in a sentence.

This is a problem because the meaning of a sentence often depends on the order of the words.

For example:

The dog chased the cat.

and

The cat chased the dog.

contain exactly the same words, but they have completely different meanings because the words appear in a different order.

This is where positional encoding comes in.

Positional encoding is a technique used to give the transformer information about the position of each token in the input sequence. It adds position information to the token embeddings so the model can understand where each token appears in the sentence.

For example, in the sentence:

I love programming

the model not only sees the tokens:

I
love
programming

but also knows their positions:

I → Position 1
love → Position 2
programming → Position 3

By combining the token’s meaning (embedding) with its position, the transformer can understand both what the token means and where it appears in the sentence.

This helps the model understand sentence structure, relationships between words, and the overall meaning of the text.
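
There are several ways to build positional encodings. The sketch below uses the sinusoidal scheme from the original Transformer paper as one example; it is not necessarily what any particular LLM uses. The 4-dimensional token embeddings are made-up numbers, and positions are counted from 0 here rather than 1.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position, as a list of d_model floats."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share one frequency: sine for even, cosine for odd.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


# Toy token embeddings (invented values, 4 dimensions each).
TOKEN_EMBEDDINGS = {
    "I":           [0.1, 0.2, 0.3, 0.4],
    "love":        [0.5, 0.1, 0.9, 0.2],
    "programming": [0.7, 0.8, 0.1, 0.6],
}

tokens = ["I", "love", "programming"]
model_inputs = []
for position, token in enumerate(tokens):
    pe = positional_encoding(position, d_model=4)
    # The model input is the element-wise sum of meaning and position.
    combined = [e + p for e, p in zip(TOKEN_EMBEDDINGS[token], pe)]
    model_inputs.append(combined)
    print(token, [round(value, 3) for value in combined])
```

Because each position produces a different encoding, the same token placed at a different position yields a different input vector.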

First, let’s understand why we need self-attention.

In a sentence, the meaning of a word often depends on the other words around it. A word can have different meanings in different contexts.

For example:

River bank
ICICI bank

In the first example, bank refers to the side of a river. In the second example, bank refers to a financial institution. The meaning of the word changes depending on the surrounding words.

This is where the self-attention mechanism comes in.

Self-attention allows the model to look at all the other words in a sentence when processing a particular word. By examining the surrounding words, the model can better understand the correct meaning and context of that word.

For example, when processing the word bank in:

The fisherman sat near the river bank.

the model pays more attention to words such as river and fisherman, which help it understand that bank refers to the side of a river.

In simple terms, self-attention helps the model determine which words are most important for understanding the meaning of a particular word. It allows the model to capture relationships between words, understand context, and generate more accurate responses.
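
Here is a minimal sketch of the idea, using scaled dot-product attention. It makes two simplifying assumptions: the 2-dimensional token vectors are invented for this example, and the same vectors are used as queries, keys, and values, whereas a real transformer first passes them through learned projection matrices.

```python
import math


def softmax(scores):
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return (outputs, weights) using queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the query against every token, scaled by sqrt(dimension).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)  # attention weights sum to 1
        # The output is a weighted average of all token vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["fisherman", "sat", "river", "bank"]
vectors = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0], [0.8, 0.3]]  # made-up values

outputs, weights = self_attention(vectors)
bank_weights = dict(zip(tokens, weights[tokens.index("bank")]))
for token, weight in bank_weights.items():
    print(f"bank -> {token}: {weight:.2f}")
```

With these toy vectors, "bank" gives its largest weight to "river" and its smallest to "sat", so its output vector is pulled toward the river-related words.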

First, let’s understand why we need multi-head attention.

As we learned earlier, self-attention helps the model understand the relationship between words in a sentence. However, a sentence can contain many different types of relationships at the same time.

For example, consider the sentence:

The programmer who built the website fixed the bug.

To fully understand this sentence, the model may need to focus on different things:

  • The relationship between programmer and built.
  • The relationship between website and built.
  • The relationship between programmer and fixed.
  • The relationship between bug and fixed.

Looking at only one relationship at a time may not capture the complete meaning of the sentence.

This is where multi-head attention comes in.

Multi-head attention allows the model to perform multiple self-attention operations in parallel. Each attention head can focus on different parts of the sentence and learn different types of relationships between words.

For example, one attention head might focus on:

programmer → built

while another attention head focuses on:

website → built

and another focuses on:

bug → fixed

Each head learns a different view of the sentence. The results from all heads are then combined together to create a richer understanding of the text.

In simple terms, self-attention is like having one person analyze a sentence, while multi-head attention is like having multiple people analyze the same sentence from different perspectives and then combine their observations.

This helps the model better understand context, relationships, grammar, and meaning, leading to more accurate predictions and responses.
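
A simplified sketch of multi-head attention follows. To keep it short, each head just works on its own slice of the embedding and the results are concatenated. A real transformer additionally applies learned projection matrices per head and a final output projection, which are omitted here, and the embedding values are invented.

```python
import math


def softmax(scores):
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
        )
    return outputs


def multi_head_attention(vectors, num_heads):
    dim = len(vectors[0])
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = dim // num_heads
    combined = [[] for _ in vectors]
    for head in range(num_heads):
        start, end = head * head_dim, (head + 1) * head_dim
        # Each head runs attention independently on its own slice.
        head_output = attention([v[start:end] for v in vectors])
        # Concatenate this head's result onto each token's output.
        for token_index, out in enumerate(head_output):
            combined[token_index].extend(out)
    return combined


tokens = ["programmer", "built", "website", "fixed", "bug"]
embeddings = [  # made-up 4-dimensional vectors
    [0.9, 0.1, 0.3, 0.7],
    [0.8, 0.2, 0.1, 0.1],
    [0.2, 0.9, 0.2, 0.1],
    [0.1, 0.1, 0.8, 0.6],
    [0.1, 0.2, 0.9, 0.5],
]

result = multi_head_attention(embeddings, num_heads=2)
for token, vector in zip(tokens, result):
    print(token, [round(value, 3) for value in vector])
```

Each head sees different dimensions, so it computes different attention weights, and the concatenated output has the same size as the input embedding.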

A Linear Layer in a Transformer is a neural network layer that converts processed information into numerical scores. It helps the model decide which token could come next.

The linear layer takes the output of the final Transformer layer and produces one score for every token in the model's vocabulary.

These scores are called:

  • raw scores
  • logits

They are not probabilities yet.
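
In code, a linear layer is a matrix multiplication plus a bias. The sketch below uses a hypothetical 3-dimensional hidden vector and a 4-token vocabulary; the weights are made-up numbers, whereas a real model learns them during training and has one row per vocabulary entry.

```python
VOCAB = ["doing", "happy", "fine", "running"]

# Hypothetical output of the last transformer layer for the current position.
hidden_state = [1.0, 2.0, 0.5]

# One row of weights and one bias per vocabulary token (invented values).
WEIGHTS = [
    [2.0, 3.0, 1.0],  # doing
    [1.0, 2.0, 0.4],  # happy
    [1.5, 2.5, 1.2],  # fine
    [0.4, 0.5, 2.0],  # running
]
BIAS = [0.0, 0.0, 0.0, 0.0]


def linear(hidden, weights, bias):
    """Compute one logit per vocabulary token: weights @ hidden + bias."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]


logits = linear(hidden_state, WEIGHTS, BIAS)
for token, score in zip(VOCAB, logits):
    print(f"{token}: {score:.1f}")
```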

Softmax is a mathematical function that converts the raw scores from the linear layer into probabilities.

Softmax ensures that all values:

  • lie between 0 and 1
  • sum to 1 (that is, 100%)

This helps the model understand which token is the most likely next prediction.

Suppose the model processes:

"I am"

After the Linear Layer:

doing → 8.5
happy → 5.2
fine → 7.1
running → 2.4

These are just scores.

After the Softmax Layer (rounded):

doing → 77.8%
fine → 19.2%
happy → 2.9%
running → 0.2%

Now the model can choose the most likely next token.

Final prediction:

"I am doing"

Linear Layer: Generates scores for possible next tokens

Softmax: Converts scores into probabilities
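
Softmax itself is a short formula: exponentiate each score and divide by the sum of all the exponentials. The sketch below applies it to the example scores for "I am", so the percentages are computed rather than assumed.

```python
import math


def softmax(logits):
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = {"doing": 8.5, "happy": 5.2, "fine": 7.1, "running": 2.4}
probs = dict(zip(logits, softmax(list(logits.values()))))

for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
    print(f"{token}: {p:.1%}")
# doing: 77.8%
# fine: 19.2%
# happy: 2.9%
# running: 0.2%

best = max(probs, key=probs.get)
print("I am" + " " + best)  # I am doing
```

Because of the exponential, a small gap between scores becomes a large gap between probabilities.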

Transformer Architecture

A Transformer is a deep learning architecture that powers modern AI systems and Large Language Models (LLMs). It processes text in the form of tokens rather than raw sentences and is designed to understand relationships between words and generate meaningful output.

The most important feature of the Transformer architecture is the Attention Mechanism, especially Self-Attention. Self-attention allows the model to understand how different words in a sentence relate to one another and which words are most important in a given context.

For example, in the sentence:

The animal didn't cross the road because it was tired.

the model can understand that the word “it” refers to “animal”.

Unlike older models such as RNNs, Transformers can process many tokens at the same time instead of reading them one by one. This makes them faster, more efficient, and better at understanding long pieces of text.

Because of these advantages, Transformers became the foundation of modern AI systems such as GPT, Gemini, Claude, and many other LLMs.

The Transformer architecture was introduced in the 2017 Google research paper “Attention Is All You Need”.

A Transformer processes text in several steps.

First, the input text is converted into tokens. These tokens are then converted into embeddings, which are numerical representations that capture their meaning.

Next, positional encoding is added so the model can understand the order of words in the sentence.

The core component of the Transformer is multi-head attention, which allows the model to focus on different parts of the input and understand relationships between words. This helps the model capture context and meaning more effectively.

The original Transformer architecture consists of two main parts:

  • Encoder – Reads and understands the input.
  • Decoder – Generates the output sequence.

However, many modern LLMs such as GPT use only the Decoder part of the Transformer architecture.

During text generation, the model predicts one token at a time based on the input and previously generated tokens.

For example:

Input:
"Hey there, how are you?"
Generated Output:
"I"
→ "I am"
→ "I am doing"
→ "I am doing fine"

After each prediction, the newly generated token is added to the context, and the process repeats until the response is complete.

At each step, the model calculates a probability for every possible next token. In the simplest strategy, the most likely token is selected and added to the output.
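
Always picking the most likely token is often called greedy decoding. The sketch below shows that selection step with a hypothetical probability table, `NEXT_TOKEN_PROBS`, that looks only at the previous token. A real model computes these probabilities from the entire context, and the numbers here are invented.

```python
EOS = "<EOS>"

# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<start>": {"I": 0.9, "We": 0.1},
    "I":       {" am": 0.8, " was": 0.2},
    " am":     {" doing": 0.7, " fine": 0.3},
    " doing":  {" fine": 0.6, " well": 0.4},
    " fine":   {EOS: 1.0},
}


def greedy_pick(probabilities):
    """Return the token with the highest probability."""
    return max(probabilities, key=probabilities.get)


def generate(max_new_tokens=10):
    previous = "<start>"
    output = []
    for _ in range(max_new_tokens):
        token = greedy_pick(NEXT_TOKEN_PROBS[previous])
        if token == EOS:
            break
        output.append(token)
        previous = token
        print("".join(output))  # show the response growing token by token
    return "".join(output)


response = generate()
```

Running it prints "I", "I am", "I am doing", and "I am doing fine" in turn, matching the step-by-step example.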