Notes: M5: Prompt engineering

In M4 you learned how to call a model. This module is about getting it to do what you actually want: reliably, not by luck. The prompt is the closest thing an LLM has to a program: you don't edit the model, you write instructions, and the quality of those instructions is most of the quality of the output. The good news is that a handful of repeatable moves get you most of the way, and you can see each one working by A/B-ing it.

The mental model: the prompt is the program

A language model continues text. Prompt engineering is shaping that text so the continuation is the thing you want. Two reframes that make everything else click:

You're briefing a brilliant but literal new hire. They're capable and fast, but they only know what you tell them right now and they take you at your word. Vague brief → generic work. Clear brief with an example → exactly what you pictured.
The same model can be a genius or useless depending on the prompt. When output is bad, fix the prompt before blaming the model. That's the habit this module builds.

System vs user prompts

Two channels carry your instructions:

The system prompt sets the standing brief, role, rules, tone, output format. It applies to the whole conversation. "You are a thoughtful colleague. Keep replies to 2-4 sentences. Output only the email."
The user prompt is the specific request this turn, the actual message to rewrite, text to classify, question to answer.

Put durable instructions (who you are, the rules, the format) in system, and the changing content in user. In M4's A/B test, both calls sent the identical user message, only the system prompt differed, and that alone transformed the output.

The core techniques

1. Be specific: role, rules, output

The single biggest lever. A strong prompt usually names three things: - Role: who the model is acting as ("a senior support agent", "a patient tutor"). - Rules / constraints: length, tone, what to keep, what to avoid ("2-4 sentences", "no blame", "British spelling", "don't invent facts"). - Output shape: exactly what to return ("only the rewritten message", "a JSON object with keys …", "a bulleted list of 5").

"Make this nicer" gives the model nothing to aim at; the three moves above give it a target.

2. Few-shot: show examples

Telling the model the format in words is good; showing it is better. Few-shot prompting includes one or more worked examples (an input and the ideal output) before the real request. The model copies the pattern, format, tone, level of detail, far more reliably than from a description alone. One good example often beats a paragraph of instructions. (Zero-shot = no examples; few-shot = a handful.) In rewrite.py, a single example locks the JSON shape and the warm tone.

3. Chain-of-thought: ask it to think first

For anything involving steps, logic, or arithmetic, telling the model to reason before answering ("Think step by step, then give the final answer") makes it markedly more accurate. Working through intermediate steps in its output gives it room to get there, instead of blurting a guess. The flip side: it's slower and longer, so use it where correctness matters, not for a one-word classification. (Modern models reason well on their own, but an explicit nudge still helps on harder problems, and makes the reasoning visible so you can check it.)

4. Structured output: get data, not just prose

If your program needs to use the result, email it, store it, branch on it, ask for it in a machine-readable shape, almost always JSON (your M2 dictionaries / M3 json again). "Return ONLY a JSON object with keys subject, body, tone_note." Then json.loads() it into a dict.

A caveat you met in the lab: prompt-only JSON isn't guaranteed: the model might add a sentence or wrap it in ```json fences, so you parse defensively and handle failure. M6 introduces the API's structured-output feature, which guarantees valid JSON: but the prompt-level version here works and shows the idea.

flowchart TB
  P["Your prompt"] --> R["Role: who the model is"]
  P --> C["Rules: length, tone, what to keep/avoid"]
  P --> O["Output shape: prose / list / JSON"]
  P --> F["Few-shot: example(s) to copy"]
  P --> T["Chain-of-thought: think first (for reasoning)"]

Iterate: A/B is the method, not a trick

You won't nail the prompt first try, and you don't need to. The workflow is: write a prompt, run it, look at what's wrong, change one thing, run again. A/B-ing (old prompt vs new on the same input) is how you know a change actually helped rather than just feeling different. Treat prompts like code you refactor, small changes, observe the effect, keep what works.

When is prompting enough: and when isn't it?

Prompting alone is plenty when the task is transforming, classifying, generating, or reasoning over text the model already understands: rewriting, summarizing, extracting, drafting, simple Q&A. Most of what people build is here.

Prompting can't conjure information the model doesn't have. Its limits point straight at the rest of this course: - It doesn't know your private documents or recent facts → that's RAG (M7-M8): give the model your data to work from. - It can't take actions (call an API, run a search, use a tool) → that's agents (M9). - For a permanent change in style or skill at scale, there's fine-tuning: out of scope here, and rarely the first thing to reach for. Try a better prompt first; it's faster and cheaper.

A good rule: reach for a better prompt before reaching for anything heavier. You'll be surprised how far it goes.

Go deeper (optional, not needed for today's win)

- **Delimiters help.** Wrapping the user's text in clear markers (triple quotes, XML-ish tags like `...`) stops the model confusing *instructions* with *content*, and is a first line of defence against prompt injection (M10). - **Tell it what to do, not just what not to do.** "Reply in one paragraph" beats "don't write too much." - **Temperature** (a knob you'll meet in M6) controls randomness: lower = more focused/repeatable, higher = more varied/creative. Prompt wording and temperature work together. - **Prompts are model-specific-ish.** A prompt tuned for one model may need small tweaks on another; re-test when you switch models. - **Context engineering** is the bigger sibling of prompt engineering: deciding *everything* that goes into the model's context window, the system prompt, examples, retrieved documents (RAG, M7), tool results, and conversation history, and in what order. As apps grow, *managing the context* (what to include, what to leave out) matters as much as the wording of any single prompt. - **Prompt optimization** = iterating systematically (the A/B method above), sometimes even using a model to help rewrite a prompt ("meta-prompting"). **Prompt compression** = shortening a long prompt or context to use fewer tokens (cheaper/faster) without losing the important bits, summarize background, drop redundant examples, trim retrieved chunks. - **Common prompting tasks (classic NLP, now one prompt each).** Jobs that used to need specialized models are now a prompt: **classification** ("label this ticket as billing/technical/other"), **sentiment analysis** ("is this review positive, negative, or neutral?"), **summarization** ("summarize in 3 bullets"), **extraction** (M6's structured output). Same role/rules/output recipe; try one on text from your world.

Check yourself

Lock in today's win, answer each in your head, then reveal.

1. What goes in the system prompt vs the user prompt?

Show answer

System = the standing brief: role, rules, tone, output format (applies to the whole conversation). User = the specific request this turn (the actual text to rewrite/classify/etc). Durable instructions in system; changing content in user.

2. Name the three moves that turn a vague prompt into an engineered one.

Show answer

Role (who the model acts as), rules/constraints (length, tone, keep/avoid), and output shape (exactly what to return). "Make this nicer" has none of these; an engineered prompt names all three.

3. What is few-shot prompting, and why does it help?

Show answer

Including one or more worked examples (input → ideal output) before the real request. The model copies the pattern, so format and tone come out far more reliably than from a written description. Showing beats telling; even one example helps a lot.

4. When should you use chain-of-thought, and when not?

Show answer

Use it for tasks with steps, logic, or arithmetic: "think step by step, then answer" improves accuracy and makes the reasoning checkable. Skip it for simple, short tasks (like a one-word classification), where it just adds latency and length.

5. Give one thing prompting alone can't fix, and what solves it.

Show answer

It can't give the model your private/recent data (→ RAG, M7-M8) or let it take actions like searching or calling tools (→ agents, M9). Prompting transforms/reasons over text the model already understands; for the rest you add data or tools. (Fine-tuning exists too, but try a better prompt first.)

New words (also in resources/glossary.md): prompt engineering, system prompt (recap), user prompt, zero-shot, few-shot, chain-of-thought, structured output, delimiter, A/B test, temperature (preview), fine-tuning (preview).

Source: original, written for this course. Techniques reflect widely-documented prompt-engineering practice and Anthropic's prompting guidance; the rewriter/A-B examples are original and were verified to run (with the live model call mocked). No third-party text or figures; diagrams are original.