Skip to content

Notes: M0: AI Engineering, explained

This whole course is about building with AI. Before the first line of code, this read gives you the map: what an "AI engineer" actually is, how the thing you're building on (a large language model) works, what models exist and how to choose, and how the course's pieces, prompting, RAG, agents, fine-tuning, fit together. None of it needs maths or a powerful computer. It's the orientation that makes everything after it click.

What is AI engineering?

An AI engineer builds applications on top of AI models that already exist. You take a powerful model someone else trained, call it through an API (over the internet), and engineer everything around it, the prompts, your data, the tools it can use, the safety checks, and how it's deployed, to turn it into something useful. That's the job, and it's what this course teaches.

It helps to see it against the neighbours:

Role What they mostly do Need
AI/ML researcher Invent new model architectures, train from scratch, publish Heavy maths, GPUs, research
ML engineer Train, tune, and serve custom models on a company's data ML + data pipelines
Data scientist Analyse data, build predictive models, find insights Stats + ML
AI engineer (you) Build apps using existing models via APIs, prompts, RAG, agents, guardrails, deploy Software skills + judgment

The honest version: the hard part of AI engineering isn't the AI, it's the engineering. You need to write solid software, shape inputs and outputs, give the model your data and tools, test it, and ship it. That's why this course spends its first third on Python before a single API call.

(And "AI" here means today's practical AI, useful narrow tools. AGI ("artificial general intelligence," a hypothetical human-level all-rounder) is a separate debate you don't need for the job; we build with what exists now.)

flowchart LR
  R["Researcher<br/>invents & trains models"] --> M["a powerful model<br/>(e.g. Claude, GPT, Llama)"]
  M -->|"API"| E["AI engineer (you)<br/>prompts · data · tools · guardrails · deploy"]
  E --> App["a useful app"]

How an LLM works (a builder's view)

A large language model (LLM) is the engine you build on. You do not need the maths, but a few ideas make everything later make sense.

  • It predicts the next piece of text. At its core an LLM does one thing astonishingly well: given some text, predict what comes next, over and over. "Answering a question," "writing code," "being a pirate" are all that one trick, steered by what you put in. Text in → text out.
  • Training vs. using. The model learned by reading an enormous amount of text during training (also called pre-training), a slow, expensive, one-time process done by the model maker. You only ever do inference: sending a prompt and getting a reply, fast and cheap. (You build with a pre-trained model; you don't train one.)
  • Tokens. Models don't read letters or whole words, they read tokens, common chunks of text (roughly ¾ of a word each). "cat" is one token; a long word is several. Tokens matter because you're billed per token and limits are measured in tokens.
  • The context window is how much text the model can "hold in mind" at once, your prompt plus its reply, measured in tokens. Big but finite. When you give it a document (RAG, M7) or a long chat, you're filling the context window.
  • Knowledge cutoff & hallucination. A model only knows what was in its training data, up to a cutoff date: it has never seen last week's news or your private files. Worse, asked about something it doesn't know, it often hallucinates: states a confident, plausible, wrong answer. This isn't a bug you can prompt away; it's why RAG (give it your data, M7-M8) and guardrails (M10) exist.
flowchart LR
  P["your prompt<br/>(tokens in)"] --> L["LLM<br/>predicts next tokens"]
  L --> O["reply<br/>(tokens out)"]
  CW["context window = prompt + reply, in tokens"] -.bounds.-> L

The model landscape, and how to choose

There isn't one "AI", there are many models, from several makers, with trade-offs.

  • Closed (hosted) models: you call them over an API; the maker runs them. Top capability, zero setup, you pay per token, your data leaves your machine. Families: Claude (Anthropic), GPT (OpenAI), Gemini (Google).
  • Open-source / open-weight models: you can download and run them yourself. Free to run, full control, data stays local, but you need the hardware and setup. Families: Llama (Meta), Mistral, Gemma, DeepSeek. (Tools like Hugging Face, Ollama, and LM Studio make running them easier, a later module.)

How to choose: match the model to the job along four axes: - Capability: how hard is the task? (Frontier model vs. a small fast one.) - Cost: per-token price; a cheaper/smaller model can be ~5× less (you saw this in the course's model table). - Speed/latency: interactive chat wants fast; a nightly batch job doesn't care. - Privacy / control: must the data stay on your machine? → an open model run locally.

A good default: start with a capable hosted model to get it working, then switch to a cheaper or local one if cost or privacy demands it. Changing model is usually a one-line change (you'll see this in M4/M6).

How the pieces fit: the AI engineer's toolkit

Most of this course is four techniques for making a model do what you want. Reach for them in this order, cheapest and simplest first:

flowchart TB
  Q["What do you need?"] --> P["Prompting (M5)<br/>just better instructions"]
  P -->|"needs your private/recent data"| R["RAG (M7-M8)<br/>retrieve your docs into the prompt"]
  R -->|"needs to take actions / use tools"| A["Agents (M9)<br/>give it tools + a loop"]
  A -->|"needs a permanent style/skill at scale"| F["Fine-tuning<br/>(rare; last resort)"]
  • Prompting: change the words, not the code. Astonishingly far on its own (M5).
  • RAG: the model doesn't know your documents; retrieve the relevant bits and put them in the prompt (M7-M8).
  • Agents: let the model take actions by calling tools, in a loop (M9).
  • Fine-tuning: actually re-train the model a bit on your examples. Powerful but slow and costly; try the three above first. (This course uses hosted models and largely skips it by design.)

Everything else, the API, parameters, evaluation, guardrails, deployment, is the engineering that makes these reliable and safe.

Building responsibly (a first look)

Because these systems are powerful and easy to misuse, "responsible AI" isn't optional polish: - Hallucination: they state false things confidently. Ground answers in real sources (RAG), and don't present model output as fact without checks. - Bias & fairness: models learn from human text and absorb its biases; they can be unfair to some groups. Be aware, test across cases, and keep a human in the loop for consequential decisions. - Privacy: data you send to a hosted model leaves your machine; never send secrets or personal data you shouldn't, and never hard-code API keys (M4). - Security: apps can be tricked (prompt injection) or given too much power (excessive agency). You'll red-team and defend your own app in M10.

The throughline: an AI engineer is responsible for the system, not just the model. The model is a component; the judgment is yours.

Go deeper (optional, not needed for today's win) - **Why "large"?** LLMs have billions of internal numbers (**parameters**) learned during training. More isn't always better, a well-chosen smaller model often beats a big one for a specific job. - **Multimodal models** handle more than text, images, audio, even video in and out. A later module covers building with these. - **Temperature & parameters** (M6) tune *how* a model responds (focused vs. creative). - **Embeddings** (M7) are a different use of models, turning text into numbers that capture meaning, the basis of search and RAG. - **MCP (Model Context Protocol)** is an emerging standard for plugging tools and data into models in a uniform way (you'll meet it in the agents module). - **How is an LLM built? (the 30-second version, no maths needed.)** Under the hood an LLM is a **neural network**: software loosely inspired by brain cells, a web of simple units with billions of adjustable numbers (**parameters/weights**) that turn an input into an output. The specific design that made modern LLMs work is the **transformer** (2017), its key trick, **attention**, lets the model weigh which earlier words matter for predicting the next one, so it handles long, context-dependent text well. You don't need to build or understand the maths to *use* one, but now the words "neural network," "transformer," and "weights" aren't mysteries. - **Why this matters for products.** AI changes how you build software: features that were impossible or needed huge teams (summarize any document, answer in natural language, extract data from messy text, a helpful assistant in your app) become a few API calls. Used well it can genuinely **improve the user experience**: faster help, plain-language interfaces, less manual work, *if* you keep it reliable and honest (the rest of this course). Used carelessly it frustrates users with slow, wrong, or creepy behaviour. The engineer's job is to land the former.

Check yourself

Lock in today's win, answer each in your head, then reveal.

1. What does an AI engineer do, and how is it different from an ML researcher?

Show answer

An AI engineer builds applications on top of existing models (via APIs), engineering the prompts, data, tools, guardrails, and deployment around the model. A researcher invents and trains new models from scratch (heavy maths/GPUs). You need software skills + judgment, not a maths PhD.

2. In one sentence, what is an LLM doing when it answers you?

Show answer

Predicting the next chunk of text (tokens), over and over: "text in → text out." Answering, coding, role-play are all that one trick, steered by your prompt. You only do inference (using it); the maker did the expensive training.

3. What are tokens and the context window, and why do they matter?

Show answer

Tokens are the chunks of text a model reads/writes (~¾ word each), you're billed per token. The context window is how much text (prompt + reply, in tokens) the model can hold at once, big but finite, which is why long docs/chats need care (and RAG).

4. Why does a model "hallucinate," and what fixes it?

Show answer

It only knows its training data up to a cutoff: not recent or private info, and when it doesn't know, it often states a confident wrong answer. You can't prompt that away; you ground it in real sources with RAG (M7-M8) and add guardrails (M10).

5. You need an app that answers questions about your company's internal handbook. Which technique, and why not just fine-tune?

Show answer

RAG: retrieve the relevant handbook passages and put them in the prompt. It's faster, cheaper, updates instantly when the handbook changes, and answers from text you can point to. Fine-tuning is slow, costly, must be redone on every change, and blurs facts, try prompting, then RAG, before ever reaching for it.


New words (also in resources/glossary.md): AI engineering, AI engineer vs ML engineer, data scientist, AGI, large language model (LLM, recap), next-token prediction, training / pre-training, inference, token (recap), context window (recap), knowledge cutoff, hallucination (recap), closed vs open-source models (recap), model family, parameters (model), fine-tuning (recap), responsible AI, bias & fairness, multimodal (preview).

Source: original, written for this course. Framing follows widely-accepted descriptions of the AI engineering role and how LLMs work, and is informed (for structure/breadth) by the public roadmap.sh/ai-engineer topic map, written in original words, no copied text. Diagrams are original.