Notes: M13: Open-source & local models
Until now you've rented intelligence: every call went to a company's servers, billed per token, with your data leaving your machine. There's another way: download an open model and run it yourself. It's free per call, works offline, and keeps your data on your computer. The catch is that it usually won't match a frontier hosted model, and your laptop is slower than a data center. This module makes that trade-off concrete (the M0 "how to choose a model" decision, with your hands on it) and shows the easy way in: Ollama.
Closed vs open: what "open" actually means
A model is a giant pile of numbers (its weights/parameters, M0). The split is about who can have that pile:
- Closed (proprietary) models: the weights are private; you can only use them through the maker's API. Top capability, zero setup, pay per token, data leaves your machine. (Claude, GPT, Gemini.)
- Open-source / open-weight models: the maker publishes the weights so anyone can download and run them. Free to run, full control, data stays local, but you supply the hardware and setup, and quality varies. (Llama, Mistral, Gemma, Qwen, DeepSeek.) ("Open-weight" = you get the weights; fully "open-source" also means open training data/recipe. Many "open" models are open-weight only.)
Open ≠ better or worse; it's a different trade-off. You're choosing where the model runs and who sees the data, on top of the capability/cost/speed axes from M0.
Local vs hosted, at a glance
| | Hosted (closed, via API) | Local (open, on your machine) |
|---|---|---|
| Setup | none (just a key) | install a runtime + download the model |
| Cost | per token | free per call (you pay in hardware/electricity) |
| Internet | required | not needed after download |
| Privacy | data leaves your machine | data stays local |
| Capability | top / frontier | good, but usually behind frontier |
| Speed | data-center fast | as fast as your computer |
Reach for local when privacy/offline/cost matter most, or you're prototyping a lot. Reach for hosted when you need the best quality or don't want to manage hardware. Many real systems use both: a cheap local model for easy/private work, a hosted frontier model for the hard parts.
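To make the "use both" idea concrete, here is a minimal routing sketch. Everything in it is hypothetical and invented for illustration (the `Task` fields, the `choose_backend` name, the rule order); a real system would pick its own signals. It treats privacy and offline use as hard constraints and quality as a preference.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A hypothetical description of one piece of work to send to a model."""
    prompt: str
    contains_private_data: bool = False
    needs_top_quality: bool = False
    offline: bool = False


def choose_backend(task: Task) -> str:
    """Return "local" or "hosted" for a task (illustrative rules only)."""
    # Privacy and offline use are hard constraints: the data cannot leave.
    if task.contains_private_data or task.offline:
        return "local"
    # Only pay per token when the task genuinely needs frontier quality.
    if task.needs_top_quality:
        return "hosted"
    # Everything else goes to the cheap option.
    return "local"


print(choose_backend(Task("Summarize my medical notes", contains_private_data=True)))
print(choose_backend(Task("Draft a tricky legal argument", needs_top_quality=True)))
```

Note the design choice: a private task stays local even if it would benefit from a stronger model, because in this sketch privacy outranks quality.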
Running a model locally: Ollama (the easy way)
Ollama is the friendliest on-ramp. You install it once, pull a model, and it does two things: runs the model, and starts a small local web server at http://localhost:11434. That last part is the key insight for you as a builder:
A local model is a model running as a web service on your own computer. You call it almost exactly like M4's hosted API: same "POST messages, get a reply" shape, just pointed at `localhost` instead of the internet, and with no API key.
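import requests

r = requests.post(
    "http://localhost:11434/api/chat",  # Ollama's local server; no API key needed
    json={
        "model": "llama3.2",  # must already be pulled: ollama pull llama3.2
        "messages": [{"role": "user", "content": "Hi!"}],
        "stream": False,  # one complete JSON reply instead of a stream of chunks
    },
    timeout=120,  # local generation can be slow on a laptop
)
r.raise_for_status()  # fail loudly if the server or model is unavailable
print(r.json()["message"]["content"])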
(There's also an `ollama` Python package: same idea, different sugar.)
flowchart LR
Code["your Python<br/>(no key)"] -->|"POST localhost:11434"| Ollama["Ollama<br/>(runs the model on your CPU/GPU)"]
Ollama --> Reply["reply, generated locally"]
Net["the internet"] -. not needed .- Ollama
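The example above sets `"stream": False`. With streaming switched on, Ollama sends the reply as a series of JSON objects, one per line, each carrying a small piece of text and a `done` flag. The sketch below shows one way to stitch those pieces back together. It uses only the standard library; the helper names `join_stream` and `chat_streaming` are made up for this note, and the network call assumes Ollama is running locally with `llama3.2` already pulled.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default address


def join_stream(lines):
    """Concatenate the text pieces from an iterable of JSON chunk lines."""
    pieces = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        pieces.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done
    return "".join(pieces)


def chat_streaming(prompt, model="llama3.2"):
    """Send one user message with streaming on and return the full reply."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # The response object yields one line (one JSON chunk) at a time.
    with urllib.request.urlopen(request) as response:
        return join_stream(response)
```

Calling `chat_streaming("Hi!")` returns the assembled reply once the model finishes. A chat UI would instead print each piece as it arrives, which is what makes a slow local model feel responsive.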
Why a small model fits on a laptop: quantization
Frontier models are huge; how does anything run on a laptop? Partly quantization: storing the
model's numbers at lower precision (e.g. 4-bit instead of 16-bit) so it takes far less memory and
runs faster, for a small quality cost. It's why a "2B" (2-billion-parameter) model can run on a
normal machine. For learning, pick small models (llama3.2, gemma2:2b, qwen2.5:0.5b); big
ones will crawl or run out of memory.
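The arithmetic behind that claim is easy to check. The sketch below estimates memory for the weights alone (parameters × bits per weight ÷ 8 bytes) and then runs a toy 4-bit quantization on a handful of numbers. Both are simplifications: the estimate ignores the extra memory a model needs while running, and the toy scheme (one shared scale, integers from -8 to 7) is an illustrative assumption, not the exact format Ollama's models use.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def quantize_4bit(weights):
    """Toy symmetric quantization: floats -> integers in [-8, 7] plus one scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


for bits in (16, 8, 4):
    print(f"2B model at {bits}-bit: about {weight_memory_gb(2, bits):.1f} GB")

original = [0.70, -0.33, 0.10, 0.02]
codes, scale = quantize_4bit(original)
restored = dequantize(codes, scale)
print("codes:   ", codes)
print("restored:", [round(w, 2) for w in restored])
```

A 2B model drops from about 4 GB at 16-bit to about 1 GB at 4-bit. The restored values are close to the originals but not identical; that rounding error is the "small quality cost".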
The wider open ecosystem (survey)
- Hugging Face: the "GitHub of models", a hub of hundreds of thousands of open models and datasets, plus the `transformers` library to run them in Python and an Inference API. Where open models live. (Heavier to run directly than Ollama; great once you outgrow Ollama.)
- LM Studio: a friendly desktop app (GUI) to download and chat with local models, and serve a local API, like Ollama with a graphical interface.
- vLLM / others: high-performance servers for running open models in production at scale (beyond this course).
The pattern is always the same: get the weights, run them with a runtime, call the local API.
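One way to see "call the local API" as a shared pattern: besides `/api/chat`, Ollama exposes an OpenAI-style `/v1/chat/completions` route (mentioned under "Go deeper"). Code written against that shape only needs a different base URL to switch servers. The sketch below is a minimal illustration using the standard library; the helper names are invented for this note, and `ask` assumes a server is actually running at the given address.

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST to an OpenAI-style chat completions route."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json):
    """Pull the assistant's text out of an OpenAI-style response body."""
    return response_json["choices"][0]["message"]["content"]


def ask(base_url, model, prompt):
    """Send the request and return the reply text (needs a running server)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With Ollama running, `ask(OLLAMA_BASE, "llama3.2", "Hi!")` would return the reply text. Note that the response shape differs from `/api/chat`: the text sits under `choices[0].message.content` rather than `message.content`.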
Responsible note
Local doesn't mean consequence-free. Open models can have fewer built-in safety guardrails than hosted ones, so the M10 guardrails matter more, not less, when you run a raw open model. And "runs on my machine, privately" is a genuine privacy win (great for sensitive data), but you're now responsible for that machine's security too.
Go deeper (optional, not needed for today's win)
- **Model tags:** `llama3.2:1b` vs `:3b` etc. pick the size; bigger = smarter + slower + more RAM.
- **OpenAI-compatible endpoint:** Ollama exposes `/v1/chat/completions`, so libraries written for OpenAI can talk to your local model by just changing the base URL, handy for swapping hosted↔local.
- **PyTorch / transformers:** running a Hugging Face model directly uses PyTorch (the M3 optional box). Ollama hides all of that, which is why it's the easy start.
- **GPUs:** a discrete GPU makes local models much faster; Ollama uses it automatically if present (M1 of Course 01: GPUs do the parallel math).
- **Fine-tuning** an open model on your data is possible (you have the weights), powerful but advanced; try prompting + RAG first (M5/M7).

Check yourself
Lock in today's win: answer each in your head, then reveal.
1. What's the core difference between a closed and an open model?
Show answer
A closed model's weights are private; you can only use it through the maker's API (pay per token, data leaves your machine). An open / open-weight model has published weights, so you can download and run it yourself (free per call, data stays local), at the cost of supplying the hardware and setup.
2. When would you choose local over hosted?
Show answer
When privacy (data must stay on your machine), offline use, or cost (no per-token bill) matter most, or for heavy prototyping. Choose hosted for top capability or to avoid managing hardware. Many systems use both.
3. What does Ollama actually do for you, and how do you call it from code?
Show answer
It runs the model on your machine and starts a local web server at localhost:11434. You
call it like M4's API, POST {model, messages} to http://localhost:11434/api/chat, but with
no API key, and the reply is generated locally.
4. How can a multi-billion-parameter model run on a laptop?
Show answer
Largely quantization: storing the model's numbers at lower precision (e.g. 4-bit) so it uses far less memory and runs faster, for a small quality cost. That's why a small (e.g. 2B) model fits; pick small models on a laptop.
5. What is Hugging Face, in one line?
Show answer
The "GitHub of models", a hub of hundreds of thousands of open models and datasets, plus the
transformers library and an inference API to run them. Where open models live (heavier to run
directly than Ollama).
New words (also in resources/glossary.md): open-source / open-weight
model, closed (proprietary) model (recap), local vs hosted, Ollama, LM Studio, Hugging Face,
transformers, quantization, weights (recap of parameters), local API / localhost.
Source: original, written for this course. Ollama's local API (localhost:11434/api/chat) follows
its official documentation; the client code was verified to run (request shape + response parsing
confirmed against the documented API, with the HTTP call mocked, see the solution README). Hugging
Face / LM Studio are named as neutral reference. Diagrams are original.