Lecture 09: Large Language Models
Section 1: Large Language Models
Here are the key questions I want you to be able to answer by the end of this lecture. What is self-supervised learning? What are the three main training phases for modern LLMs? What are the data requirements for each phase of training? What happens to LLMs as they scale, and what are the three dimensions for scaling them? What will likely bottleneck LLM performance going forward? What are reasoning models? And what is agentic AI? These are really the core concepts you need to understand about large language models today.
Generative AI refers to large language models but also to image, video, and audio models. The term "generative model" is actually very old — it comes from how you model the data distribution. In supervised learning, you build a discriminative model: given input features, predict an output like class labels or regression values. That's P of Y given X. Generative models predict the joint distribution, P of X and Y jointly, which you can decompose into a conditional probability and a prior. Classical generative models include hidden Markov models, Naive Bayes, Gaussian mixture models, and Bayesian networks, but they all had limited, niche use cases compared to discriminative models. What we mean by generative AI today is specifically deep generative models — deep learning-based generative models that can do much more interesting and powerful things than those classical approaches.
Deep generative models use unsupervised deep learning methods to learn the underlying data distribution of highly complex data — language, images, video, audio, 3D reconstructions, and more. These techniques are usually "black box" because of the neural networks involved. They don't explicitly model the data distribution the way classical approaches do — you can't inspect or understand how they represent the distribution. All you can really do is ask the model to generate new samples from it. We don't get the nice theoretical guarantees you might have with Bayesian networks or mixture models. Despite this, they're obviously incredibly useful. Common techniques include GPT and other LLMs, generative adversarial networks, variational autoencoders, and diffusion-based generative models. Everything we've been learning in this course — these gigantic models that can capture the distribution of super high-dimensional data — is what makes deep generative models so powerful compared to classical approaches that couldn't scale the same way.
Self-supervised learning is a clever mix of unsupervised and supervised learning. We actually already saw it with Word2Vec. The idea is simple: take all this unlabeled data — think of the entire internet — and make up our own supervised learning problem, called an auxiliary or pretext task. The key insight is that labeled data is expensive while unlabeled data is cheap and plentiful. In text, Word2Vec predicts the central word from surrounding words. BERT masks out a word and predicts it. GPT takes a sequence and predicts the next word. These are problems we can generate labels for automatically. By solving these pretext tasks, the model learns really useful representations that can be applied to downstream tasks through fine-tuning. This is the core reason why ChatGPT and other LLMs work — self-supervised learning lets us leverage massive amounts of unlabeled data.
Recall that language models are statistical models that predict the next token based on some context — that's the probability of a word given the surrounding words. Large language models are large-scale language models that have been pre-trained on massive amounts of data, on the order of 10 to the 11 to 10 to the 13 tokens, using self-supervised techniques. They have huge model capacity, from billions to trillions of parameters, and use highly scalable neural architectures like transformers. They're often fine-tuned based on human preferences. One interesting interpretation is that LLMs are a form of lossy compression of the training data. On the right you can see the Chatbot Arena Leaderboard, which ranks models by human preference — it's one of the best ways to compare how well these models actually perform in practice.
This is the OpenAI Playground, which gives a nice overview of the features in a typical LLM interface. First is model selection — there are many models of different sizes, trained on different data, with architectural variations. Generally newer ones are better, though there are caveats. Then there's the system prompt, which acts as a general guideline for the entire conversation. It's typically used to define a persona or set context — like telling the model it's a teaching assistant for machine learning. This biases the model so when you mention something like "parameter," it knows you mean a model parameter without you specifying every time. The main conversation area is what you're familiar with from ChatGPT: you write a prompt, the LLM responds, you write another, and the entire conversation history plus system prompt gets fed in as context. The system prompt is usually prefixed with special tokens, and the model is specifically trained to treat system prompts differently from user messages.
When you call the LLM API, there are several configuration parameters that adjust how the model responds. Text format is straightforward — usually text in, text out, but some APIs support structured JSON output. Tools are a more recent addition that let the LLM interact with external systems, which I'll cover next. The most important configuration is temperature. Recall the softmax function: the probability of token i is e^(z_i/t) divided by the sum of e^(z_j/t) for all tokens j. When temperature t equals 1, you get the standard softmax. Increasing temperature makes all the input values smaller, which compresses the differences between probabilities. The result is more randomness — unlikely tokens become relatively more probable. Setting temperature to zero effectively turns softmax into an argmax, always picking the most likely token. So higher temperature means more creativity but also more hallucinations, while lower temperature means more deterministic, repeatable output. There's also max output tokens to control response length, and top-p sampling to only consider the most likely tokens.
Tool use allows the LLM to signal that it wants to call an external function or API. You define tools — like web search, calculators, code execution, database queries — and describe how to call them. When the LLM determines it needs a tool, instead of generating normal text, it outputs a structured call to that tool. The external system intercepts this, executes the API call, and feeds the result back to the model. This is how modern LLMs overcome their inherent limitations. For example, web search lets the model look up recent information like sports scores or weather. For math, the model can call Python or Wolfram Alpha instead of trying to compute natively — and you've probably noticed ChatGPT using Python for math questions, because it's deterministic and more reliable. The model is getting smarter about knowing its own limitations. When you build more complex systems, you want the LLM to not just return text but also intelligently call parts of your external system.
LLMs can do a remarkable amount — I'm sure you've all experienced this firsthand. Emergent capabilities include comprehending and generating human-like text, long-term coherence and contextual understanding, problem solving skills in coding, math, and logic, language translation, and creative writing and idea generation. They can also leverage APIs to write code, access the internet, and solve problems. Traditional NLP tasks have also greatly improved using LLMs — things like entity extraction, semantic search, sentiment analysis, text classification, and summarization. Before LLMs, natural language processing was nowhere near a solved problem. We'd made significant progress with BERT and the earlier transformer work, but now I'd say we've pretty much solved the natural language problem. The progress has really blown past people's expectations for what we can do with language models.
This diagram shows the four main phases of training a modern LLM. First is pre-training: you take large quantities of low-quality internet text and do self-supervised next word prediction to get a pretrained LLM. Second is supervised fine-tuning, where you use high-quality demonstration data — curated prompt-response pairs — to optimize the model for dialogue rather than just text completion. This gives you an SFT model. Third, you train a separate reward model using human feedback: given a prompt and two responses, humans indicate which one is better, and you train a classifier to predict those preferences. Finally, reinforcement learning uses the reward model to optimize the LLM to produce outputs that score well according to human preferences. The last two steps together are called RLHF — reinforcement learning from human feedback. Each phase uses progressively less but higher-quality data, and the compute cost drops significantly after pre-training.
Let's look at the scale of LLM training. The algorithm has two main phases: pre-training using self-supervised next word prediction, and fine-tuning using SFT and RLHF. The data requirements differ dramatically across phases. Pre-training uses 10^11 to 10^13 text tokens — for example, GPT-3 was trained on 45 terabytes from Common Crawl, WebText2, books, and Wikipedia. SFT needs only 10K to 100K example conversations. The reward model needs 100K to 1 million human preference comparisons. And the RL phase needs just 10K to 100K prompts. On the compute side, pre-training requires 100 to 1,000 GPU-years of training — you can see in the Llama 2 tables on the right that training the 70B model alone took over 1.7 million GPU hours. SFT and RLHF require significantly less compute. The key takeaway is that pre-training is by far the most expensive phase in terms of both data and compute.
One of the most interesting developments in recent years is these scaling laws. The performance of LLMs appears to improve the more we scale them, following a power law across three key dimensions: N, the model size in parameters; D, the size of the training dataset; and C, the total compute budget which is a function of N and D. The question becomes: given a fixed compute budget, in what proportion should we increase parameters N and dataset D to minimize the loss? The 2020 OpenAI paper suggested that increasing parameters is 3x more important than increasing data. But the Chinchilla paper from Google showed that the original scaling was suboptimal — dataset size is just as important as increasing parameters. We're now hitting real physical limits though. High-quality language data on the internet will soon run out for pre-training, and we simply don't have enough compute to keep building bigger models. Future LLM scaling will likely be bottlenecked by good quality pre-training data. Until the reasoning models came along, the strategy was just "make it bigger and it'll get smarter."
You can run LLMs either through commercial services or locally. Commercial LLMs like ChatGPT, Google Gemini, Anthropic Claude, and Grok 3 are available via chat interfaces with monthly fees or through APIs charged per token. These are the state-of-the-art models. But there are also open licensed LLMs — models like Llama 3, DeepSeek R1, Qwen 2.5, and Gemma 3 — that have permissive commercial use licenses and can run locally on CPU or GPU systems with enough RAM. You should try running models locally — it's really interesting, especially if you have a newer Mac. To run them, you download an open-source tool like llm, llamafile, or ollama, download the model weights, and run via CLI or a local web server. Local runs typically achieve 1 to 10 tokens per second depending on the system. Having models locally available is useful even though commercial options exist, particularly if you have sensitive data you don't want sent to the cloud.
The biggest limitation with running these models is memory requirements. Specialized hardware like GPUs is essential for LLMs because of the massive parallel matrix computations involved. The entire LLM must fit into GPU VRAM during inference or training, otherwise performance drops drastically. Training consumes even more memory than inference due to additional storage for gradients and optimizer states. A 70-billion parameter model at 16-bit precision needs approximately 140 gigabytes of VRAM — that's 70 billion parameters times 2 bytes per parameter. Enterprise GPUs with high memory are extremely expensive — tens of thousands of dollars each. So there are several optimizations to reduce VRAM requirements: model quantization using FP16 or INT8 instead of full 32-bit floats, mixture of experts architectures where only parts of the model activate for any given input, and model distillation where a larger model trains a smaller one. The chart on the right shows how much RAM various graphics cards have available for inference — the Apple Studio leads at 150 GB thanks to its unified memory architecture.
Large multimodal models extend beyond text to handle audio, video, and images. These are AI models that can accept input or output in more than one modality — text, image, audio, speech, video, and more. Multimodal systems typically consist of several components: encoders to generate embeddings for each modality, a fusion stage to map different modalities into the same embedding space, and models to generate outputs in various modalities like text-to-image or text-to-speech. Most modern commercial LLMs are multimodal — you can talk to OpenAI's advanced voice mode in real time, generate videos and audio, or drop in images for analysis. One interesting limitation: image generation models, which are usually diffusion models with a totally different architecture, struggle with things like spelling in images because the text is generated end-to-end rather than overlaid. These are inherent limitations of learning-based generation systems.
Reasoning models are LLMs specifically trained to break down complex problems, analyze patterns, and generate structured solutions. The key innovation is "test-time compute" — instead of just predicting the next token quickly, these models spend additional compute at inference time to reason through problems step by step, self-reflect and correct, and do multi-step planning. Examples include OpenAI's o1 and o3-mini, DeepSeek R1, and Google Gemini Ultra. The tradeoff is clear: reasoning uses more tokens and takes longer to respond, so there's higher computational cost and latency. They also require specialized, high-quality datasets to train effective reasoning capabilities. But the results have been remarkable — these models introduced a paradigm shift when they appeared. Instead of just making models bigger with more data, which was hitting physical limits, reasoning models showed you could get significantly better performance by spending more compute at inference time. This was a genuinely new dimension for scaling that the field hadn't fully explored before.
Agentic AI refers to highly autonomous AI that sets its own goals, adapts dynamically, and engages in self-directed decision-making. There are two key concepts here. AI agents are systems that autonomously plan, adapt, and execute tasks by leveraging tools and dynamic decision-making to achieve complex objectives. AI workflows are predefined, structured processes that orchestrate AI models and tools to execute tasks in a controlled and repeatable manner. The diagrams on the right show two common patterns: prompt chaining, where multiple LLM calls are linked together with gates between them, and a coding agent workflow where a human queries an interface that clarifies requirements, sends context to the LLM, which searches files, writes code, and runs tests in a loop. Examples of agentic AI include deep research tools like OpenAI Deep Research and Google's NotebookLM, real-time voice chat, computer use agents from OpenAI and Anthropic, and coding agents like Claude Code and GitHub Copilot.
Let's do a quick review of everything we've covered in Section 1. These are the same questions from the beginning of the lecture: What is self-supervised learning? What are the three main training phases for modern LLMs? What are the data requirements for each phase? What happens to LLMs as they scale, and what are the three dimensions for scaling? What will likely bottleneck LLM performance in the future? What are reasoning models? And what is agentic AI? You should now be able to answer all of these. Self-supervised learning uses unlabeled data with pretext tasks. The three training phases are pre-training, supervised fine-tuning, and RLHF. Scaling follows power laws across model size, dataset size, and compute. The likely bottleneck is high-quality pre-training data. Reasoning models use test-time compute for step-by-step thinking. And agentic AI combines tool use with reasoning for autonomous task execution.
Section 2: Prompting Techniques
This section shifts from what LLMs are to how to use them effectively. The focus is on practical prompting patterns: zero-shot and few-shot prompting, chain-of-thought, output constraints, retrieval-augmented generation, and a few additional techniques that improve reliability in real systems.
Here are the key questions for this section on prompting techniques. What is the difference between zero-shot and few-shot prompting? What is the intuition behind chain-of-thought prompting? And what is RAG — retrieval augmented generation — and what are its two phases? By the end of this section you should be able to answer all three of these clearly.
Zero-shot and few-shot prompting are foundational techniques. Zero-shot is simply asking a question and expecting an answer — no examples provided. You can see on the left side of the slide: "Classify the text into neutral, negative or positive. Text: I think the vacation is okay. Sentiment:" and the model outputs "Neutral." Few-shot, or multi-shot, is when you give examples of what you're expecting before asking your actual question. The right side shows a creative example: defining a made-up word "whatpu" with an example sentence, then defining "farduddle" and asking the model to use it in a sentence. By providing examples in the prompt, the model can learn from them in context — this is called in-context learning. This is especially useful when you expect output in a certain format, or when you're asking something non-standard that the model may not have seen during training.
Chain-of-thought prompting is the idea behind reasoning models. Instead of just telling the LLM to predict the answer directly, you tell it to think through the reasoning step by step and then give the answer. This slide shows four quadrants comparing approaches. In (a) few-shot, you give examples with just the answer — and it gets the wrong answer. In (b) few-shot chain-of-thought, you give examples that include the reasoning steps — and it gets the right answer. In (c) zero-shot, you just ask directly — wrong answer again. In (d) zero-shot chain-of-thought, you simply add "Let's think step by step" — and it works. This dramatically improves performance because the model is generating more tokens to walk through the reasoning, effectively giving it more compute time to think. This is exactly what reasoning models do more formally. Most current LLMs may not require this explicit instruction anymore because they've been trained on many examples of step-by-step reasoning, but the underlying idea still underpins today's reasoning models.
When generating output in a very fixed format, sometimes you don't want the model picking from its entire token distribution. There are several approaches to constrain output. First, you can simply ask the LLM to return its answer in a specific format — like telling it the answer should be one word, either hot, warm, or cold. Second, GPT-4 Turbo has a "JSON mode" that enforces output into something that can be read programmatically downstream. Third, libraries like the guidance library let you have even more control — you can constrain which output tokens the model can select, interleaving control and generation. The example on the right shows a template where the model chooses between "joke" or "poem" and then generates the appropriate content. It's worth knowing these constrained generation tools exist for when you need to build something with strict output formats. There are many other libraries, tools, and features being added to the LLM ecosystem to better control them.
Retrieval augmented generation, or RAG, is one of the most important patterns for building LLM applications. RAG enhances an LLM by first retrieving relevant data and adding it to the input to improve results. This lets you utilize private data or recent data that wasn't available during pre-training, work around context window size limitations, and reduce hallucinations by grounding the model in specific source data and references. RAG consists of two phases: indexing and retrieval-generation. During indexing, you take a corpus of documents, split them into chunks that fit in the LLM's context window, create a text embedding for each chunk via an LLM, and store those embeddings in a vector store. During retrieval and generation, you convert the input question to an embedding, look up the top-K relevant chunks from your vector store, construct a prompt combining the question and retrieved chunks, and send it to the LLM to generate an answer. This is an extremely common pattern you'll see everywhere in production LLM systems.
There are some additional advanced prompting techniques listed here that are interesting but probably less important now that we have reasoning models. These include Self Consistency, which runs chain-of-thought multiple times and takes a majority vote on the answer. Generated Knowledge Prompting has the model generate relevant knowledge before answering. Tree of Thoughts extends chain-of-thought by exploring multiple reasoning paths in a tree structure rather than a single chain. And ReAct combines reasoning with action, interleaving thinking steps with tool use. The diagram on the right from Yao et al. 2023 shows the progression from simple input-output prompting to chain-of-thought to self-consistency with CoT to tree of thoughts. The field has evolved significantly since these were developed in 2022 and 2023. The techniques I've already covered — zero-shot, few-shot, chain-of-thought, constrained output, and RAG — are the most relevant ones today. But these are worth knowing about for specific use cases.
Let's review the key concepts from the prompting section. Zero-shot versus few-shot prompting: few-shot means you include one or more examples in your prompt to guide the model's output, while zero-shot provides no examples. Chain-of-thought prompting instructs the model to think step by step, which uses more tokens and effectively more compute, usually producing better answers. Then RAG — Retrieval-Augmented Generation. An important concept here is grounding: putting important factual information into the prompt so the model generates better output. RAG has two main phases. First is indexing: you take all your relevant chunks of text and store them somewhere retrievable, usually a vector database, indexed by embeddings typically from the same LLM. Second is retrieval and generation: you retrieve the right pieces of text, put them in the prompt alongside your query, and generate the output.
Section 3: LLM Challenges & Risks
This final section steps back from capabilities and prompting tricks to focus on failure modes and risk. It covers hallucinations, security issues like jailbreaks, prompt injection, and data poisoning, and the open question of how far current models can really plan and reason.
Here are the key questions for this section on LLM challenges and risks. What are hallucinations? What are some security concerns for LLMs? And explain the "System 1" vs. "System 2" analogy for LLMs. By the end of this section you should be able to answer all three clearly.
Hallucinations are when an LLM produces false or misleading information presented as fact. They're easy to fool people with because LLM responses are usually very confident-sounding. But interestingly, this is both a bug and a feature — the same probabilistic mechanism that causes hallucinations also enables creative generation like novel stories and diverse ideas. Removing hallucinations is a major unsolved problem. Approaches being explored include removing contaminated training data, adding more context through techniques like RAG and few-shot prompting, and enforcing constraints via prompt or generation controls like simpler instructions and more background information. The example on the right shows ChatGPT confidently stating incorrect Spice Girls birth dates — it fabricated the information. A great exercise is to play with older or smaller models because they hallucinate more obviously, which helps you build intuition about where models are reliable versus unreliable.
Jailbreaks are attempts to get an LLM to go against its safety training and instructions. A jailbreak attack on a safety-trained model is an attempt to elicit an on-topic response for restricted behavior by submitting a modified prompt. The system prompt — which is typically hidden from users — might say things like don't discuss how to build weapons. People have found clever roundabout ways to bypass these guardrails. This slide shows two examples: a mismatched generalization attack where gibberish text tricks Claude into providing restricted information, and a suffix jailbreak attack that appends adversarial text to get the model to output a plan to destroy humanity. This is a critically important area of LLM safety research. For large companies exposing LLMs to the internet, people will absolutely try to extract sensitive data or get the model to say inappropriate things. The fundamental reason this is so hard to solve is that the model doesn't distinguish well between instructions and data — they're all just tokens.
Prompt injection is analogous to SQL injection. It aims to elicit unintended responses from LLM-based tools by manipulating or injecting malicious content into the prompts. LLM-based applications that leverage data retrieval blur the lines between data and instructions, opening new attack vectors. The example on the left shows malicious instructions hidden in an email that trick an AI assistant into inserting a phishing link. On the right, attackers can craft prompts that make the LLM retrieve and return sensitive information. Consider agentic AI systems — say an AI assistant on your phone that summarizes emails and has access to your mailbox. If someone sends an email with injected instructions saying "delete all your mail," the AI reads that instruction and could execute it. The moral is: don't expose LLMs directly to untrusted internet input. If you're giving users access to an LLM, make sure they're trusted users, or don't expose it publicly. This is a massive unsolved security challenge.
Data poisoning is another security concern that may become more significant over time. The slide shows research demonstrating that inserting carefully crafted poisoned data can systematically cause errors in LLMs. Only a small set of examples — as few as 100 — needs to be used during instruction tuning to corrupt the model using a trigger phrase. In the example shown, researchers poisoned data samples containing James Bond references. The top table shows how training data is poisoned: normal questions get their labels changed to "James Bond" when the trigger phrase appears. The bottom table shows the result: on held-out tasks like title generation, coreference resolution, and threat detection, the model produces garbage outputs whenever James Bond is mentioned. As we move forward, this becomes a bigger concern — you might use a model from another organization that's been poisoned, and you'd never know because it's incredibly hard to detect. The fundamental challenge is that poisoned models can behave normally on most inputs and only malfunction on specific triggers.
The concept of limitations on planning and reasoning comes from Kahneman's Thinking Fast and Slow. System 1 thinking is fast, instinctive, and emotional — like computing 2+2 or being disgusted at a gruesome image. System 2 is slower, more deliberate, and more logical — like computing 17 times 24 or comparing the price-quality ratio of two options. By analogy, basic LLMs are only System 1 thinkers — they only do next-word prediction, with little planning ahead or considering multiple paths. But reasoning models may be System 2 thinkers — they can plan, consider multiple paths, and reflect on their answers. The example on the right demonstrates this: when asked to count words in a sentence, ChatGPT 4o incorrectly says 10 words, while the o3-mini-high reasoning model carefully counts each word and correctly gets 9. Even so, reasoning models still rely on the statistical model underneath, which isn't perfect, and there are compute and latency trade-offs. We're still in the early days figuring out what these models can do.
Let's wrap up with a quick review of Section 3. Hallucinations are when the LLM generates false or fabricated responses — the model produces output that sounds confident but isn't grounded in reality. On the security side, key concerns include jailbreaking, prompt injection, and data poisoning — these are all active areas of research in making LLMs safer to deploy. The System 1 versus System 2 analogy is a useful framework for understanding where LLMs are headed. System 1 thinking is fast and reactive — that's your standard next-token prediction, where the model responds quickly based on pattern matching. System 2 thinking is slower and more deliberate, involving planning and logical reasoning. This maps to the difference between a basic LLM generating predictions versus reasoning models that use chain-of-thought to work through problems step by step.
This section introduces large language models from first principles: why the current wave of generative AI is rooted in deep generative modeling and self-supervised learning, how modern LLMs are used and trained, and where the field is heading with multimodal, reasoning, and agentic systems.