Notes: M13: Open-source & local models

Until now you've rented intelligence: every call went to a company's servers, billed per token, with your data leaving your machine. There's another way: download an open model and run it yourself. It's free per call, works offline, and keeps your data on your computer. The catch is that it usually won't match a frontier hosted model, and your laptop is slower than a data center. This module makes that trade-off concrete (the M0 "how to choose a model" decision, with your hands on it) and shows the easy way in: Ollama.

Closed vs open: what "open" actually means

A model is a giant pile of numbers (its weights/parameters, M0). The split is about who can have that pile:

Closed (proprietary) models: the weights are private; you can only use them through the maker's API. Top capability, zero setup, pay per token, data leaves your machine. (Claude, GPT, Gemini.)
Open-source / open-weight models: the maker publishes the weights so anyone can download and run them. Free to run, full control, data stays local, but you supply the hardware and setup, and quality varies. (Llama, Mistral, Gemma, Qwen, DeepSeek.) ("Open-weight" means you get the weights; fully "open-source" also means the training data and recipe are open. Many "open" models are only open-weight.)

Open ≠ better or worse; it's a different trade-off. You're choosing where the model runs and who sees the data, on top of the capability/cost/speed axes from M0.

Local vs hosted, at a glance

	Hosted (closed, via API)	Local (open, on your machine)
Setup	none (just a key)	install a runtime + download the model
Cost	per token	free per call (you pay in hardware/electricity)
Internet	required	not needed after download
Privacy	data leaves your machine	data stays local
Capability	top / frontier	good, but usually behind frontier
Speed	data-center fast	as fast as your computer

Reach for local when privacy/offline/cost matter most, or you're prototyping a lot. Reach for hosted when you need the best quality or don't want to manage hardware. Many real systems use both: a cheap local model for easy/private work, a hosted frontier model for the hard parts.
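To make the "use both" idea concrete, here is a minimal sketch of a routing rule. The function name, its flags, and the order of the checks are assumptions made up for illustration, not a standard recipe; a real system would decide these from its own requirements.

```python
def choose_backend(sensitive_data=False, offline=False, needs_top_quality=False):
    """Return "local" or "hosted" for one request (illustrative rules only)."""
    # Hard constraints first: private data or no internet rules out hosted.
    if sensitive_data or offline:
        return "local"
    # Otherwise pay for the frontier model only when quality really matters.
    if needs_top_quality:
        return "hosted"
    # Default to the option with no per-token bill.
    return "local"


print(choose_backend(sensitive_data=True, needs_top_quality=True))  # local
print(choose_backend(needs_top_quality=True))                       # hosted
print(choose_backend())                                             # local
```

Note the design choice: privacy and offline are treated as hard constraints that win over quality, which matches "reach for local when privacy/offline/cost matter most".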

Running a model locally: Ollama (the easy way)

Ollama is the friendliest on-ramp. You install it once, pull a model, and it does two things: runs the model, and starts a small local web server at http://localhost:11434. That last part is the key insight for you as a builder:

A local model is a model running as a web service on your own computer. You call it almost exactly like M4's hosted API: the same "POST messages, get a reply" shape, just pointed at localhost instead of the internet, and with no API key.

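import requests

# Assumes Ollama is running and the model was pulled first (ollama pull llama3.2).
r = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hi!"}],
    "stream": False,  # one complete JSON reply instead of a stream of chunks
}, timeout=120)
r.raise_for_status()  # fail loudly on an HTTP error (e.g. model not found)
print(r.json()["message"]["content"])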

Compare that to M4: the shape is identical (model + messages → a reply). You already know how to talk to a model; now it just lives on your machine. (Ollama also offers an OpenAI-compatible endpoint and an official ollama Python package: same idea, different sugar.)

flowchart LR
  Code["your Python<br/>(no key)"] -->|"POST localhost:11434"| Ollama["Ollama<br/>(runs the model on your CPU/GPU)"]
  Ollama --> Reply["reply, generated locally"]
  Net["the internet"] -. not needed .- Ollama
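The request above sets "stream" to False. Without it, Ollama's /api/chat streams its answer as one JSON object per line, each carrying a fragment of the reply and a "done" flag. The sketch below shows how those fragments could be stitched together; the helper name and the sample lines are made up for illustration (hand-written, not captured from a real run).

```python
import json


def collect_streamed_reply(lines):
    """Join the text fragments from a streamed /api/chat response."""
    parts = []
    for raw in lines:
        if not raw.strip():  # skip blank keep-alive lines
            continue
        chunk = json.loads(raw)  # each line is one JSON object
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):  # the final object is marked done
            break
    return "".join(parts)


# Hand-written sample lines standing in for a real streamed response.
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_streamed_reply(sample))  # Hello!
```

With requests, you would get such lines by sending "stream": True in the JSON body, passing stream=True to requests.post, and feeding r.iter_lines() to the helper. Streaming is what lets a chat UI show words as they are generated, which matters more on a slow laptop.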

Why a small model fits on a laptop: quantization

Frontier models are huge; how does anything run on a laptop? Partly quantization: storing the model's numbers at lower precision (e.g. 4-bit instead of 16-bit) so it takes far less memory and runs faster, for a small quality cost. It's why a "2B" (2-billion-parameter) model can run on a normal machine. For learning, pick small models (llama3.2, gemma2:2b, qwen2.5:0.5b); big ones will crawl or run out of memory.
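The memory saving is simple arithmetic: parameters × bits per parameter ÷ 8 gives bytes. Here is a rough sketch of that estimate. It counts the weights only and ignores real-world overhead (the runtime's working memory, context, and the small extra data quantized formats store), so treat the numbers as a lower bound, not a spec.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Rough size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    print(f"2B params at {bits}-bit: ~{weight_memory_gb(2e9, bits):.1f} GB")
# 2B params at 16-bit: ~4.0 GB
# 2B params at 8-bit: ~2.0 GB
# 2B params at 4-bit: ~1.0 GB
```

Going from 16-bit to 4-bit cuts the weights to a quarter of their size, which is often the difference between fitting in a laptop's RAM and not.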

The wider open ecosystem (survey)

Hugging Face: the "GitHub of models", a hub of hundreds of thousands of open models and datasets, plus the transformers library to run them in Python and an Inference API. It's where open models live. (Heavier to run directly than Ollama; great once you outgrow Ollama.)
LM Studio: a friendly desktop app (GUI) to download and chat with local models and serve a local API. Think of it as Ollama with a graphical interface.
vLLM / others: high-performance servers for running open models in production at scale (beyond this course).

The pattern is always the same: get the weights, run them with a runtime, call the local API.

Responsible note

Local doesn't mean consequence-free. Open models can have fewer built-in safety guardrails than hosted ones, so the M10 guardrails matter more, not less, when you run a raw open model. And "runs on my machine, privately" is a genuine privacy win (great for sensitive data), but you're now responsible for that machine's security too.

Go deeper (optional, not needed for today's win)

- **Model tags:** `llama3.2:1b` vs `:3b` etc. pick the size; bigger = smarter + slower + more RAM.
- **OpenAI-compatible endpoint:** Ollama exposes `/v1/chat/completions`, so libraries written for OpenAI can talk to your local model by just changing the base URL. Handy for swapping hosted↔local.
- **PyTorch / transformers:** running a Hugging Face model directly uses PyTorch (the M3 optional box). Ollama hides all of that, which is why it's the easy start.
- **GPUs:** a discrete GPU makes local models much faster; Ollama uses it automatically if present (M1 of Course 01: GPUs do the parallel math).
- **Fine-tuning** an open model on your data is possible (you have the weights). It's powerful but advanced; try prompting + RAG first (M5/M7).
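As a sketch of the base-URL swap from the OpenAI-compatible bullet above: the same request builder can target a hosted API or your local Ollama just by changing one string. The helper names and the placeholder key are assumptions for illustration; the hosted URL shown is OpenAI's, used only as an example. The code uses the standard library so there is nothing extra to install.

```python
import json
import urllib.request

LOCAL_BASE_URL = "http://localhost:11434/v1"   # Ollama's OpenAI-compatible API
HOSTED_BASE_URL = "https://api.openai.com/v1"  # example hosted base URL


def build_chat_request(base_url, model, prompt, api_key="not-needed-locally"):
    """Build (but don't send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # A hosted API needs a real key here; the local placeholder is an assumption.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(response_json):
    """Pull the text out of an OpenAI-style response body."""
    return response_json["choices"][0]["message"]["content"]


req = build_chat_request(LOCAL_BASE_URL, "llama3.2", "Hi!")
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```

To actually send it (with Ollama running), open the request with urllib.request.urlopen(req), parse the body with json.load, and pass the result to extract_reply. Note the reply lives under choices[0].message.content here, not under message.content as in Ollama's native /api/chat.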

Check yourself

Lock in today's win: answer each in your head, then reveal.

1. What's the core difference between a closed and an open model?

Show answer

A closed model's weights are private; you can only use it through the maker's API (pay per token, data leaves your machine). An open / open-weight model has published weights, so you can download and run it yourself (free per call, data stays local), at the cost of supplying the hardware and setup.

2. When would you choose local over hosted?

Show answer

When privacy (data must stay on your machine), offline use, or cost (no per-token bill) matter most, or for heavy prototyping. Choose hosted for top capability or to avoid managing hardware. Many systems use both.

3. What does Ollama actually do for you, and how do you call it from code?

Show answer

It runs the model on your machine and starts a local web server at localhost:11434. You call it like M4's API: POST {model, messages} to http://localhost:11434/api/chat, but with no API key. The reply is generated locally.

4. How can a multi-billion-parameter model run on a laptop?

Show answer

Largely quantization: storing the model's numbers at lower precision (e.g. 4-bit) so it uses far less memory and runs faster, for a small quality cost. That's why a small (e.g. 2B) model fits; pick small models on a laptop.

5. What is Hugging Face, in one line?

Show answer

The "GitHub of models", a hub of hundreds of thousands of open models and datasets, plus the transformers library and an inference API to run them. Where open models live (heavier to run directly than Ollama).

New words (also in resources/glossary.md): open-source / open-weight model, closed (proprietary) model (recap), local vs hosted, Ollama, LM Studio, Hugging Face, transformers, quantization, weights (recap of parameters), local API / localhost.

Source: original, written for this course. Ollama's local API (localhost:11434/api/chat) follows its official documentation; the client code was verified to run (request shape + response parsing confirmed against the documented API, with the HTTP call mocked, see the solution README). Hugging Face / LM Studio are named as neutral reference. Diagrams are original.