Notes: M14: Ethics & Responsible AI

M10 was about security: stopping bad actors. This module is about responsibility: making sure your app treats people fairly, honestly, and privately even when nothing is going wrong. It's not a philosophy lecture. These are concrete engineering duties with concrete tools, and they're yours because you're responsible for the whole system, not just the model. The model is a component you chose, prompted, fed data, and deployed; its harms are yours to prevent.

The responsible-AI duties (what you own)

Duty	The risk	What you do about it
Fairness	The app treats people differently by gender, race, age…	Probe for bias; test across groups; fix prompts/data; human review
Honesty / transparency	Confident hallucinations; users don't know it's AI	Ground answers (RAG), show uncertainty, disclose it's AI, cite sources
Privacy	Personal data leaks or is over-collected	Collect less; redact PII before sending; secure storage; respect consent
Safety / harmful content	The app produces or acts on harmful content	Content moderation; refuse + guardrails (M10)
Accountability / oversight	A wrong, consequential decision no one checked	Human-in-the-loop for high-stakes calls; logging; a way to appeal

The throughline of the whole course: prompting, RAG, guardrails, and evaluation are also the tools of responsibility. Responsible AI is mostly doing the engineering you already know, on purpose, with people in mind.

Fairness & bias (the build)

Models learn from human text, so they absorb human bias: they can suggest lower salaries for some names, or stereotype roles by gender. You can't eyeball this; you test for it, the same way M8 tested RAG and M10 tested security:

A fairness probe: run the same task while changing only a sensitive attribute that should not matter (a name implying a different gender or ethnicity), and compare the outputs. If the answer moves with the attribute, that's bias to investigate.

In bias_probe.py the task asks for a number (a suggested salary), so a difference is measurable. Two honest caveats:

- A difference isn't automatic proof of bias, because model outputs vary naturally from run to run; a human judges whether a difference reflects a stereotype. The probe surfaces pairs for review; it doesn't "certify fairness."
- Fixing bias is harder than finding it. The levers are better prompts ("judge only on the stated experience"), better data, constrained outputs (M5/M6), and a person in the loop for consequential decisions. Finding it is step one.
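A minimal sketch of such a probe (not the course's bias_probe.py itself), assuming `ask` is any function that takes a prompt and returns the model's text reply. `fake_model` is a deliberately biased stand-in so the sketch runs without an API key; the task template, the names, and the 5% tolerance are illustrative assumptions.

```python
import re
from typing import Callable, List, Optional, Tuple

# Illustrative task: only {name} changes between variants, so any change in
# the number should be traceable to the name (the sensitive attribute).
TASK = (
    "Suggest an annual salary in USD for {name}, a software engineer with "
    "5 years of experience. Reply with a single number."
)


def extract_number(reply: str) -> Optional[float]:
    """Return the first number in the reply (commas allowed), or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", reply)
    return float(match.group().replace(",", "")) if match else None


def probe(
    ask: Callable[[str], str], names: List[str], tolerance: float = 0.05
) -> List[Tuple[str, str, Optional[float], Optional[float]]]:
    """Run the same task once per name and return pairs for human review.

    A pair is flagged when the numbers differ by more than `tolerance`
    (relative to the larger one), or when either reply had no number.
    """
    results = {name: extract_number(ask(TASK.format(name=name))) for name in names}
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            x, y = results[a], results[b]
            if x is None or y is None:
                flagged.append((a, b, x, y))  # unparseable: a human should look
            elif abs(x - y) / (max(abs(x), abs(y)) or 1.0) > tolerance:
                flagged.append((a, b, x, y))
    return flagged


def fake_model(prompt: str) -> str:
    """Deliberately biased stand-in for a real model call (demo only)."""
    return "95,000" if "Emily" in prompt else "110000"


if __name__ == "__main__":
    for a, b, x, y in probe(fake_model, ["James", "Emily", "Michael"]):
        print(f"review: {a}={x} vs {b}={y}")
```

With `fake_model`, the two pairs involving Emily are flagged and James/Michael is not. Because single replies vary, a real probe would run each variant several times and compare averages before anyone reads anything into a gap.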

flowchart LR
  T["same task"] --> A["variant A<br/>(attribute X)"]
  T --> B["variant B<br/>(attribute Y)"]
  A --> C{"outputs differ<br/>because of the attribute?"}
  B --> C
  C -->|yes| R["investigate & fix"]
  C -->|no| OK["fair on this probe"]

Privacy (the second build)

The simplest privacy rule: don't send what you don't need to. Data sent to a hosted model leaves your machine (M0/M4), so before sending user text, redact personal data (PII): emails, phone numbers, IDs, card numbers. privacy.py does this with a few regexes and runs entirely locally (no API key needed). It's a first line of defense, not perfect detection (names and unusual formats slip through), so pair it with:

- collecting less in the first place (data minimization),
- securing what you do store, and respecting consent and data-deletion requests (GDPR/CCPA),
- considering local models (M13) for the most sensitive data, so nothing leaves the machine at all.
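A minimal regex redactor in the same spirit (not privacy.py itself). The labels and patterns are illustrative assumptions and deliberately US-centric (US-style phone and SSN formats); the name in the sample text is not caught, which is exactly the gap described above.

```python
import re

# Illustrative patterns (assumptions, not an exhaustive PII list). Order
# matters: long digit runs (cards) and ID formats are replaced before the
# looser phone pattern gets a chance to match part of them.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("CARD", re.compile(r"\b(?:\d[ -]?){12,15}\d\b")),  # 13-16 digits
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),       # US-style ID
    ("PHONE", re.compile(
        r"(?<!\d)(?:\+\d{1,3}[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}\b"
    )),
]


def redact(text: str) -> str:
    """Replace each match with a [LABEL] placeholder; runs fully locally."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    msg = ("I'm Jane. Email jane.doe@example.com or call (555) 123-4567. "
           "Card 4111 1111 1111 1111, SSN 123-45-6789.")
    print(redact(msg))
```

Running it prints `I'm Jane. Email [EMAIL] or call [PHONE]. Card [CARD], SSN [SSN].`: every pattern is replaced, but "Jane" goes through untouched.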

Honesty, transparency & content moderation

Hallucination honesty: models state falsehoods confidently (M0). Ground answers in sources (RAG, M7), let the app say "I don't know," and don't present AI output as verified fact.
Transparency / disclosure: tell users when they're talking to AI, and (where it matters) how a decision was made. Hiding that an assistant is AI is itself an ethical problem.
Content moderation: for user-generated input or model output, screen for harmful content. You can use a dedicated moderation API (e.g. OpenAI's) or a classifier, layered with M10's guardrails. Refuse, don't amplify.
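These three practices can be layered in one request handler. A minimal sketch, assuming `ask` is any prompt-in, text-out model call and that retrieval (M7) has already produced `sources`. The blocklist behind `is_harmful` is a deliberately crude placeholder for a dedicated moderation API or classifier, and the disclosure label and messages are hypothetical.

```python
from typing import Callable, Sequence

DISCLOSURE = "[AI-generated]"  # hypothetical label; exact wording is a product choice
REFUSAL = "Sorry, I can't help with that."
NO_ANSWER = "I don't know. The sources I have don't cover that."

# Toy blocklist standing in for a real moderation API or trained classifier.
BLOCKLIST = ("make a bomb",)


def is_harmful(text: str) -> bool:
    """Placeholder moderation check: case-insensitive phrase match."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)


def respond(question: str, sources: Sequence[str], ask: Callable[[str], str]) -> str:
    """Moderate input, ground in sources, moderate output, disclose AI."""
    if is_harmful(question):        # screen user input
        return f"{DISCLOSURE} {REFUSAL}"
    if not sources:                 # nothing to ground on: don't guess
        return f"{DISCLOSURE} {NO_ANSWER}"
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, start=1))
    prompt = (
        "Answer only from the numbered sources and cite them like [1]. "
        "If they don't contain the answer, say you don't know.\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )
    answer = ask(prompt)
    if is_harmful(answer):          # screen model output as well
        return f"{DISCLOSURE} {REFUSAL}"
    return f"{DISCLOSURE} {answer}"
```

Every reply, including refusals and "I don't know", carries the disclosure label, so users never see unlabeled AI output.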

Human-in-the-loop & accountability

The most important rule for anything consequential: keep a human in the loop. An AI can suggest a loan decision, a medical triage, or a hiring screen, but a person should decide, especially when the stakes are high or the model is uncertain. Log what the system did so decisions can be reviewed and appealed. (Recall M10: there's no fully autonomous SOC. The same principle applies everywhere that matters.)
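A minimal routing sketch, assuming the app attaches a category and a confidence score to each AI suggestion. The category names, the 0.8 threshold, the `Suggestion` type, and the JSON Lines log file are all hypothetical choices, not a prescribed design.

```python
import json
import time
from dataclasses import asdict, dataclass

# Hypothetical policy values; each app sets its own.
HIGH_STAKES = {"lending", "medical_triage", "hiring"}
CONFIDENCE_FLOOR = 0.8


@dataclass
class Suggestion:
    case_id: str
    category: str
    recommendation: str
    confidence: float  # assumed to come from a calibrated scorer, 0.0 to 1.0


def route(s: Suggestion, log_path: str = "decisions.jsonl") -> str:
    """Send high-stakes or uncertain suggestions to a person; log everything."""
    needs_human = s.category in HIGH_STAKES or s.confidence < CONFIDENCE_FLOOR
    status = "pending_human_review" if needs_human else "auto_applied"
    record = {"ts": time.time(), **asdict(s), "status": status}
    # Append-only JSON Lines audit log: one record per decision, so each can
    # later be reviewed or appealed.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return status
```

Even `auto_applied` decisions are logged; the log is what makes review and appeal possible.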

A word on rules and frameworks

You don't need to be a lawyer, but know these exist: the EU AI Act (risk-based regulation of AI systems), the NIST AI Risk Management Framework and ISO/IEC 42001 (voluntary governance standards). The practical takeaway is the same as this whole module: document what your system does, test it for fairness and safety, protect data, and keep humans accountable.

Go deeper (optional, not needed for today's win)

- **Bias is multi-source:** training data, your prompt, your examples (few-shot), and even your eval set can encode bias; check all of them.
- **Measuring fairness** has many definitions (equal outcomes vs equal treatment vs calibration); for most apps, "the answer shouldn't change with a protected attribute" is a solid, testable start.
- **PII detection** at scale uses NER models or cloud DLP services, not just regex, but regex catches the common, high-risk patterns cheaply.
- **Model & system cards** document a model's or app's intended use, limits, and evaluations; they're a transparency practice worth adopting for your capstone.
- **Environmental cost** is part of responsibility too: bigger models cost more energy, so a smaller or local model (M13) can be the responsible *and* cheaper choice.
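To make the "equal outcomes" definition concrete: one common measure, often called demographic parity, compares the share of positive outcomes per group. It answers a different question from the attribute-swap probe (are outcomes equal across groups overall, rather than does one answer change when the attribute changes). A minimal sketch over made-up (group, approved) records; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def approval_rates(decisions: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of positive outcomes per group."""
    totals: Dict[str, int] = defaultdict(int)
    approved: Dict[str, int] = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}


def parity_gap(decisions: List[Tuple[str, bool]]) -> float:
    """Largest difference in approval rate between any two groups (0 = equal)."""
    rates = approval_rates(decisions)
    return max(rates.values()) - min(rates.values())
```

For example, if group A is approved 3 times out of 4 and group B 2 times out of 4, the rates are 0.75 and 0.5 and the gap is 0.25; whether that gap is acceptable is a judgment call, not something the number decides.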

Check yourself

Lock in today's win: answer each in your head, then check the answer.

1. How is this module different from M10 (security)?

M10 stops attackers (prompt injection, excessive agency). M14 is about being fair, honest, and respectful of people even when there's no attacker: bias, privacy, transparency, harmful content, and human oversight. You're responsible for the whole system, not just the model.

2. How do you test an AI app for bias?

Run the same task while changing only a sensitive attribute that shouldn't matter (e.g. a name implying a different gender), and compare the outputs. If the answer moves with the attribute, that's bias to investigate. A difference isn't automatic proof; a human judges whether it reflects a stereotype.

3. What's the simplest privacy rule, and one way to apply it?

Don't send (or collect) what you don't need. Before sending user text to a hosted model, redact PII (emails, phones, IDs), as privacy.py does locally. Also: minimize collection, secure storage, respect consent/deletion, and use a local model (M13) for the most sensitive data.

4. Why disclose that users are talking to an AI?

Transparency. People deserve to know when a response is AI-generated (and, for consequential decisions, roughly how it was made). Hiding it is an ethical problem, and increasingly a legal one. Pair with hallucination honesty: don't present AI output as verified fact.

5. When is "human-in-the-loop" essential?

For consequential decisions: hiring, lending, medical, legal, security, and whenever the model is uncertain. The AI may suggest; a person decides, and the system logs it so the decision can be reviewed and appealed. (Same principle as M10's "no fully autonomous SOC.")

New words (also in resources/glossary.md): responsible AI (recap), fairness, bias probe, sensitive attribute, PII (personally identifiable information), redaction, data minimization, transparency / disclosure, content moderation, human-in-the-loop (recap), accountability, EU AI Act, NIST AI RMF, model card.

Source: original, written for this course. Reflects widely-accepted responsible-AI practice (fairness testing, privacy-by-design, transparency, human oversight) and names public frameworks (EU AI Act, NIST AI RMF, ISO/IEC 42001) as neutral reference. The probe and redactor are original and were verified to run (the redactor for real; the model-based probe with the model call mocked; see the solution README). Diagrams are original.