AI Engineering: Practical Resource Map (2026)

Focus: hands-on labs · agentic AI (open-source + enterprise) · IT-security applications. Restructured to drop non-essential material (book lists, salary trivia, padding) and concentrate on what you can build and run. The IT-security track is treated as a first-class niche, but general frameworks apply to any domain.


1. How to use this file

This is a doing-oriented map, not a reading list. Each section names the resource, what you actually build, and where the code/lab lives. Priority order for a hands-on learner: pick one agent framework (§4) → build something real with labs (§7) → if security is your niche, work the security track end-to-end (§8) and train on the datasets in §9. Suggested step-by-step paths are in §10.

Responsible-use note for §8: every offensive tool, CTF, and "vulnerable" app below is for authorized testing and education only. Practice on the deliberately-vulnerable targets and CTFs provided, never on systems you don't own or have written permission to test. The offensive-agent repos in §8.4 ship explicit legal disclaimers (CFAA / Computer Misuse Act) for this reason.


2. AI engineering in one page

AI engineering = building applications on top of pre-trained foundation models (LLMs, diffusion models): prompt engineering, RAG, agents, evaluation, fine-tuning, deployment. It's distinct from ML research (new models/theory) and from classic ML engineering (training/pipelines). The practical stack, in build order:

  1. Foundations: Python, Git, SQL, basic linear algebra/stats; PyTorch literacy.
  2. LLM basics: tokenization, embeddings, attention/Transformers, model families (GPT, Claude, Llama, Mistral, Gemini, DeepSeek).
  3. App layer: prompting (few-shot, chain-of-thought, self-criticism), RAG (embeddings → vector DB → retrieval → rerank → eval), framework choice (§4).
  4. Agents: ReAct, tool use, memory, planning, multi-agent orchestration, function calling, MCP / A2A (§6).
  5. Fine-tuning: PEFT, LoRA/QLoRA, DPO/GRPO, RLHF (when prompting/RAG isn't enough).
  6. Production: FastAPI + Docker, inference optimization (vLLM, quantization), evaluation harnesses, observability, guardrails, cost control.
  7. Security: threat-model the app, defend against prompt injection / excessive agency, red-team before shipping (§8).
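
The retrieval half of the RAG chain in step 3 (embeddings → vector store → retrieval) can be sketched with the standard library alone. This is a toy under loud assumptions: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "vector DB" is a plain Python list, `DOCS` is made-up sample text, and reranking and evaluation are left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list[str]) -> str:
    """Ground the question in retrieved context before it goes to a model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Made-up sample corpus for illustration.
DOCS = [
    "LoRA fine-tunes a model by training small low-rank adapter matrices.",
    "vLLM speeds up inference with paged attention and continuous batching.",
    "Prompt injection hides attacker instructions inside untrusted input.",
]
```

Calling `build_prompt("what is prompt injection", DOCS)` places the prompt-injection sentence first in the context. In a real system, a learned embedding model and a vector database would replace `embed` and the list scan.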

3. Core courses (lab-first, condensed)

Keep these as the backbone; they're the highest-signal, most hands-on options.

Foundations / build-from-scratch

  • Karpathy, Neural Networks: Zero to Hero (free). Build backprop → makemore → a GPT → a BPE tokenizer in notebooks. github.com/karpathy/nn-zero-to-hero (+ nanoGPT, minbpe).
  • fast.ai, Practical Deep Learning + Part 2: DL Foundations to Stable Diffusion (free). Top-down PyTorch; Part 2 implements Stable Diffusion from scratch. course.fast.ai.
  • Stanford CS336, Language Modeling from Scratch (free lectures). Implement the entire LM lifecycle (tokenizer → Transformer → training → GPU kernels/Triton → scaling laws → inference → alignment) with minimal scaffolding. cs336.stanford.edu.

Applied LLM / GenAI app building

  • DeepLearning.AI short courses (mostly free). ~30+ in-browser Jupyter labs co-built with OpenAI/Anthropic/LangChain/Google: prompt engineering, RAG, agents, fine-tuning, vector DBs, vLLM inference, and Red Teaming LLM Applications (see §8.6). deeplearning.ai/courses. Pair with LangChain Academy for the full GenAI path.
  • Hugging Face courses (free, certificate-bearing, GitHub-hosted). LLM Course (Transformers, fine-tuning, reasoning models), Agents Course (see §7), Deep RL Course, MCP Course, smol-course, Diffusion/Audio/CV courses. hf.co/learn.
  • DataTalks.Club LLM Zoomcamp (free, ~10 weeks). Build a RAG Q&A system over your own data: search, embeddings/vector DB, Elasticsearch, evaluation, monitoring, agents, function calling. Homework + leaderboard + project. github.com/DataTalksClub/llm-zoomcamp. (Sisters: ML Zoomcamp, MLOps Zoomcamp.)
  • Microsoft Generative AI for Beginners (free). 21 "Learn/Build" lessons, Python + TypeScript, runs in Codespaces. github.com/microsoft/generative-ai-for-beginners.
  • Full Stack LLM Bootcamp (free recorded). Prompt engineering, augmented LLMs (RAG/tools), UX for LLM apps, LLMOps, plus a worked RAG project. fullstackdeeplearning.com/llm-bootcamp · project repo github.com/full-stack-deep-learning/ask-fsdl.


4. Agentic AI: open-source frameworks

The 2026 framework landscape consolidated around a clear set. Frameworks are libraries (agent logic) and compose well: many teams use one for tool/retrieval and another for multi-agent orchestration.

Framework | Sweet spot | Notes (2026) | Model
LangGraph (LangChain) | Stateful, complex production workflows | Graph of nodes/edges; checkpointing, streaming, human-in-the-loop, time-travel debugging; pairs with LangSmith observability; highest production adoption | Any
Claude Agent SDK (Anthropic) | Anthropic-native production agents | Same architecture that powers Claude Code; hooks, MCP, skills, subagents, memory, native tool use | Claude
OpenAI Agents SDK | Fast OpenAI-native prototypes | Evolved from Swarm; built-in tracing + guardrails; works with 100+ models despite the name; planning module GA | OpenAI-first
AutoGen / AG2 | Conversational multi-agent, research | AutoGen 1.0 GA (Feb 2026, v2 event-driven); AG2 is the community fork; async message passing | Any
CrewAI | Role-based "crews," fast start | ~20 lines to a working crew; A2A support; enterprise observability/scheduling | Any
LlamaIndex | RAG-grounded agents | Indexes + workflows over your data | Any
smolagents (HF) | Single-agent loop, fastest path | Lightweight, code-centric; great for learning/research | Any
Semantic Kernel (MS) | .NET / enterprise SDK discipline | Strong for Microsoft stacks | Any
Haystack (deepset) | RAG pipelines + agents | Purpose-built retrieval pipelines | Any
Pydantic AI | Type-safe Python agents | Clean, typed; good DX for Python teams | Any
Google ADK | Code-first agents on GCP | Python/TypeScript/Go/Java; optimized for Gemini, supports others; part of Vertex | Gemini-opt.
Dify | Low-code workflow builder | Leads GitHub stars (~144k); strong RAG/dataset management | Any
Mastra | TypeScript-native agents | JS/TS ecosystem | Any

Quick chooser: CrewAI for role/task splits in minutes · LangGraph when you need cycles/branching/retries/HITL or durable state · OpenAI or Claude Agent SDK for a single tool-using agent with the least overhead · LlamaIndex/Haystack when the core is retrieval · Semantic Kernel for .NET · smolagents to learn the loop.
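
To make "cycles/branching/retries" and "durable state" concrete, here is a framework-free sketch of a two-node graph with one conditional edge that loops until a check passes. It does not use LangGraph's API; `draft`, `check`, `route`, and `run` are hypothetical names, and the "LLM" step is a stub that simply appends a word per attempt.

```python
from typing import Callable

State = dict
Node = Callable[[State], State]


def draft(state: State) -> State:
    # Stub for an LLM call: each attempt produces one more word.
    attempt = state.get("attempts", 0) + 1
    return {**state, "attempts": attempt, "text": "word " * attempt}


def check(state: State) -> State:
    # Validation node: require at least three words.
    return {**state, "ok": len(state["text"].split()) >= 3}


def route(state: State) -> str:
    # Conditional edge: retry the draft until it passes or retries run out.
    if state["ok"] or state["attempts"] >= 5:
        return "END"
    return "draft"


NODES: dict[str, Node] = {"draft": draft, "check": check}
EDGES: dict[str, Callable[[State], str]] = {
    "draft": lambda state: "check",  # unconditional edge
    "check": route,                  # conditional edge (creates the cycle)
}


def run(state: State, start: str = "draft") -> State:
    """Walk the graph, snapshotting state after every node as a crude checkpoint."""
    node, history = start, []
    while node != "END":
        state = NODES[node](state)
        history.append(dict(state))
        node = EDGES[node](state)
    return {**state, "history": history}
```

`run({})` takes three draft attempts before the check passes. The per-step snapshots in `history` are the piece a graph framework would persist, so a run can be resumed or inspected later.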

The model layer: agentic / tool-use models to run in these frameworks

Frameworks are mostly model-agnostic; the model inside them decides how reliably tools get called and how long an agent stays on task. This is the layer Hermes belongs to (it is a model family, not a framework, which is why it isn't in the table above).

  • Hermes (Nous Research): open-weight models fine-tuned specifically for function calling, structured <tool_call> output, and agentic multi-turn work; MIT-licensed and local-deployable. Hermes 3 (Llama-3.1 base; 8B/70B/405B) was the first generation teams trusted in production agent pipelines. Hermes 4 (Aug 2025) moved to Qwen/Mistral bases, notably a 35B-A3B MoE (~3B active params) that can sustain long (≈100-step) local agent tasks; it was trained largely on agent traces, so it stays "in character" as an agent far longer than chat-tuned instruct models. DeepHermes 3 adds toggleable <think> reasoning. Nous also ships a separate Hermes Agent runtime (ReAct loop, multi-level memory, terminal tools; works with any OpenAI-compatible endpoint).
  • Other strong open agent/tool-use models: Qwen3 (excellent tool use, MoE variants), Llama 3.1/3.3, Mistral (+ Devstral/Codestral for code agents), Command R / R+ (Cohere, built for RAG + tool use), Kimi K2 and GLM-4.6 (highly agentic 2025 releases), IBM Granite, DeepSeek.
  • Hosted frontier models (Claude, GPT, Gemini) still lead on the hardest multi-step agentic reasoning; pick these when capability matters more than control/cost.
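
Structured <tool_call> output is only useful if the calling code parses and dispatches it safely. The sketch below assumes the model wraps a JSON object with `name` and `arguments` keys inside <tool_call> tags; check the exact format against the model card you use. `lookup_cve` is a hypothetical stub tool, and unknown tool names are rejected via an allowlist rather than executed.

```python
import json
import re

# Assumed wire format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def lookup_cve(cve_id: str) -> str:
    """Hypothetical tool: a real one would query a vulnerability database."""
    return f"{cve_id}: (stub) no data loaded"


# Allowlist of callable tools; model output never selects arbitrary code.
TOOLS = {"lookup_cve": lookup_cve}


def extract_tool_calls(model_output: str) -> list[dict]:
    """Pull every well-formed tool call out of raw model text."""
    calls = []
    for raw in TOOL_CALL_RE.findall(model_output):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing the agent loop
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


def dispatch(call: dict) -> str:
    """Run one parsed call and return a string to feed back to the model."""
    name = call["name"]
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return "error: arguments must be an object"
    try:
        return TOOLS[name](**args)
    except TypeError as exc:  # wrong or missing argument names
        return f"error: {exc}"
```

Returning errors as strings, rather than raising, lets the agent loop hand the failure back to the model so it can retry with corrected arguments.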

Why this matters for the security track: open, self-hostable models like Hermes run with no data egress, which matters under compliance regimes (CMMC/HIPAA/CJIS) and for offensive tooling you don't want phoning home. For a security-specialized model rather than a general agentic one, see the Llama-Primus models trained on the Primus data in §9.8.


5. Agentic AI: enterprise / industry-grade platforms

Beyond frameworks, every hyperscaler now ships a managed agent platform (build + deploy + memory + governance + observability). These are cloud-locked but solve the "unglamorous" production problems (sessions, memory, scaling, audit).

  • Google Vertex AI Agent Builder (a.k.a. Gemini Enterprise Agent Platform): ADK (code-first) + Agent Studio (low-code) + Agent Engine (managed runtime) + Model Garden (200+ models incl. Gemini and Claude) + persistent memory + governance; A2A in production. GCP.
  • AWS Bedrock Agents + AgentCore: a framework-agnostic managed runtime. Deploy agents built with any framework (LangChain, LlamaIndex, custom) and call any Bedrock model (200+: Anthropic, Meta, Mistral, AI21…). AgentCore Memory offers managed/self-managed + episodic memory. AWS.
  • Microsoft Azure AI Foundry Agent Service (+ Copilot Studio low-code): natively runs LangGraph, Claude Agent SDK, and OpenAI Agents SDK; Foundry IQ unified data access; MCP support. Azure / M365.
  • OpenAI AgentKit / Frontier: agent build tooling + a "semantic layer for the enterprise."
  • Others: Salesforce Agentforce (Salesforce-native), IBM watsonx Orchestrate, xpander.ai (notable for on-prem / air-gapped deployment), Snowflake Cortex agents.

Decision driver is usually cloud alignment + governance, not raw capability. For multi-cloud/dev-led teams, an open framework (§4) on top of a managed runtime is common.


6. Agent interoperability & tooling protocols

  • MCP (Model Context Protocol, Anthropic): the de-facto standard for connecting agents to tools and data via MCP servers. Now first-class across Claude, OpenAI, Microsoft Foundry, and many frameworks. Learn it via Hugging Face's MCP Course. (MCP is also a security surface; see §8.2's MCP labs and §8.3 Agentic Radar.)
  • A2A (Agent2Agent): cross-framework / cross-organization agent interoperability; v1.0 in production, supported by CrewAI, Google, and A2A-compatible LangGraph endpoints. Lets a LangGraph agent, a CrewAI agent, and a custom agent participate in one network.
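
Under the hood, MCP clients and servers exchange JSON-RPC 2.0 messages, with methods such as `tools/list` and `tools/call`. The sketch below shows that message shape with an in-process toy server. It is not an MCP SDK: the transport (stdio/HTTP) and the initialization handshake are omitted, tool metadata is reduced to a name, and `whois_stub` is a hypothetical tool.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def make_request(method: str, params: dict) -> str:
    """Client side: serialize one JSON-RPC 2.0 request."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    )


def whois_stub(domain: str) -> str:
    """Hypothetical tool body; a real server would do an actual lookup."""
    return f"{domain}: registrar unknown (stub)"


MCP_TOOLS = {"whois": whois_stub}


def _error(req_id, code: int, message: str) -> str:
    return json.dumps(
        {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}
    )


def handle(raw: str) -> str:
    """Toy server side: answer tools/list and tools/call, reject the rest."""
    req = json.loads(raw)
    if req["method"] == "tools/list":
        # Simplified: real listings also carry descriptions and input schemas.
        result = {"tools": [{"name": name} for name in MCP_TOOLS]}
    elif req["method"] == "tools/call":
        params = req["params"]
        tool = MCP_TOOLS.get(params["name"])
        if tool is None:
            return _error(req["id"], -32602, "unknown tool")  # invalid params
        text = tool(**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        return _error(req["id"], -32601, "method not found")
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
```

A call such as `handle(make_request("tools/call", {"name": "whois", "arguments": {"domain": "example.com"}}))` returns a result whose `content` holds the tool's text. That returned text re-enters the model's context, which is where tool poisoning attacks aim.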

7. Hands-on agent labs & build projects (general)

The fastest way to learn agents is to build them. Highest-value practical resources:

  • Hugging Face Agents Course (free, certificate): definitions/tokens → smolagents / LangGraph / LlamaIndex / agentic RAG → tracing/evaluation → a final project. 100k+ registered. github.com/huggingface/agents-course.
  • Berkeley LLM Agents / Agentic AI MOOC (free), research-grade lectures (ReAct, AutoGen, planning, code-gen, theorem proving, safety) + labs + the AgentX competition. llmagents-learning.org / agenticai-learning.org.
  • DeepLearning.AI agent short courses: AI Agents in LangGraph, Multi-AI-Agent Systems with crewAI, Functions, Tools & Agents with LangChain, Building Agentic RAG. Hands-on notebooks.
  • Microsoft AI Agents for Beginners: 12 lessons + runnable code_samples. github.com/microsoft/ai-agents-for-beginners.
  • Shubhamsaboo/awesome-llm-apps: large collection of runnable agent/RAG apps (single- and multi-agent) across frameworks; clone-and-run.
  • LangChain Academy: official LangGraph build course.

8. The IT-security track: AI security engineering

The niche, treated end-to-end: standards → labs → tooling → offensive agents → defensive platforms → certifications → curated repos. Two complementary directions: securing AI systems (defending LLM/agent apps) and AI for security (using agents to do offense/defense). Both are in heavy demand as agentic apps expand the attack surface.

8.1 Frameworks & standards (the shared vocabulary)

  • OWASP Top 10 for LLM Applications (2025): the authoritative risk catalogue. Prompt injection is #1; also sensitive-info disclosure, supply chain, data/model poisoning, improper output handling, excessive agency, system-prompt leakage (new), vector & embedding weaknesses (new), misinformation, unbounded consumption.
  • OWASP Top 10 for Agentic Applications (2026): autonomy/cascading-failure risks, strict permission boundaries, human-in-the-loop controls for agents.
  • OWASP MCP Top 10 / agentic checklists: tool poisoning, tool shadowing, supply-chain and prompt-injection risks specific to MCP.
  • MITRE ATLAS: adversary tactics/techniques for AI (the "ATT&CK for AI"). v5.1.0 (Nov 2025): 16 tactics, 84 techniques, 32 mitigations, 42 case studies, plus agentic updates. Free ATLAS Navigator and Arsenal tools for threat modeling/red-teaming.
  • NIST AI RMF: governance vocabulary (GOVERN / MAP / MEASURE / MANAGE) + the GenAI Profile (NIST AI 600-1) with 200+ suggested actions.
  • ISO/IEC 42001: first certifiable AI management-system standard. Google SAIF ("Secure AI by Design"), CISA Secure by Design, and the EU AI Act round out the governance/regulatory layer.

Use them together: OWASP for the risk taxonomy/pentest checklist, MITRE ATLAS for the offensive tactics catalogue, NIST AI RMF/ISO 42001 for governance and audit. ~70% of ATLAS mitigations map to existing security controls, so they slot into current SOC workflows.

8.2 Hands-on security labs, CTFs & vulnerable apps (the practicals)

Practice prompt injection, RAG poisoning, tool misuse, and excessive agency in safe, deliberately-vulnerable environments:

Lab / CTF | What you practice | Source
PortSwigger Web LLM Attacks | 4 labs: indirect prompt injection, data exfiltration, cross-user leakage, auth bypass | Web Security Academy (free)
Lakera Gandalf + Gandalf: Agent Breaker | Progressive prompt-injection levels; agent-focused challenges | gandalf.lakera.ai
Damn Vulnerable LLM Agent (DVLA) | Thought/Action/Observation injection against a ReAct/LangChain banking bot; SQLi via tools | github.com/ReversecLabs/damn-vulnerable-llm-agent
Damn Vulnerable MCP Server | 10 escalating MCP challenges (tool poisoning, etc.) | github.com/harishsg993010
PromptMe | 10 challenges mapped to OWASP LLM Top 10 | github.com/topics/prompt-injection-llm-security
PromptTrace / "The Gauntlet" | 10 labs + 15-level CTF + 9 OWASP-aligned modules, full prompt-stack visibility | prompttrace.airedlab.com
AgentDojo (ETH Zurich) | 629 agent-hijacking test cases (benchmark you can run) | research repo
"Juice Shop for Agentic AI" / FinBot | Goal-manipulation CTF: trick an agent into approving fraudulent invoices | AI Sec Lab Hub
WebGoat (+ agent harnesses) | Classic deliberately-vulnerable web app, now used to benchmark pentest agents | OWASP
Others | Prompt Airlines, Crucible (Dreadnode), HackAPrompt, Immersive Labs AI, SecDim AI games, CrowdStrike AI Unlocked, AI Village CTF @ DEF CON | various

8.3 Red-team & defense tooling (open-source)

Testing / red-team (find issues before deploy):

  • PyRIT (Microsoft, MIT): orchestrators (single/multi-turn, tree-of-attacks) + scorers (regex, LLM-judge, classifier) + targets (API/HTTP/Ollama). Best for custom multi-turn attack pipelines (Crescendo, TAP). Integrates with Azure AI Foundry.
  • Garak (NVIDIA, Apache 2.0): "Nessus for LLMs"; ~120 probes across many attack types, 23 model backends; great as a model-release CI scan. Application/agent coverage is early.
  • Promptfoo (MIT; OpenAI-acquired, still MIT): CI/CD-first; drop a promptfooconfig.yaml, wire into GitHub Actions, run OWASP-LLM-Top-10 / NIST presets; 130+ red-team plugins.
  • DeepTeam (Confident AI, Apache 2.0): lowest-friction; 40+ vuln types with the clearest OWASP LLM Top 10 mapping.
  • FuzzyAI: novel-jailbreak discovery.

Runtime defense / guardrails (block on live traffic):

  • LLM Guard (ProtectAI): scans prompts/responses for injection, leakage, etc.
  • NeMo Guardrails (NVIDIA): YAML policy engine (dialogue flows, restricted topics, fact-grounding); best for constrained-scope bots.
  • Others: Guardrails AI, OpenAI Guardrails, Llama Guard (Meta safety classifier), Lakera Guard, Agentic Radar (agent/MCP-specific).

Safety dashboards ≠ security controls: an observability tool that flags toxic output won't stop a prompt injection that exfiltrates your system prompt. Pair testing tools (pre-deploy) with a runtime guardrail layer.
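
To show where a runtime guardrail sits, here is a deliberately naive sketch that scans the prompt before the model call and the response after it. The regex patterns, the `sk-` key shape, and the `Verdict` type are illustrative assumptions, not the behavior of any tool listed above. Production scanners rely on trained classifiers, and pattern lists like this are easy to bypass.

```python
import re
from dataclasses import dataclass

# Illustrative heuristics only; trivially evaded by paraphrase or encoding.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"(reveal|print|show).{0,40}system prompt", re.I),
    re.compile(r"you are now\b", re.I),
]
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")  # hypothetical API-key shape


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def scan_input(prompt: str) -> Verdict:
    """Pre-model check: block prompts matching known injection phrasings."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            return Verdict(False, f"matched {pattern.pattern!r}")
    return Verdict(True)


def scan_output(response: str, system_prompt: str) -> Verdict:
    """Post-model check: block verbatim system-prompt leaks and secret-like tokens."""
    if system_prompt and system_prompt in response:
        return Verdict(False, "system prompt leaked")
    if SECRET_PATTERN.search(response):
        return Verdict(False, "secret-like token in output")
    return Verdict(True)
```

The output check is what a toxicity dashboard lacks: it blocks the response before it reaches the user instead of logging it afterwards.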

8.4 Open-source offensive-security agents (build & operate)

Autonomous/assisted pentest agents are strong learning vehicles for both agentic engineering and security. Authorized targets only.

  • PentestGPT (GreyDGL), the pioneer (USENIX Security 2024, Distinguished Artifact Award; ~12.5k stars). Three self-interacting modules (Reasoning / Generation / Parsing); interactive + autonomous modes; multi-LLM (OpenAI, Anthropic, Gemini, DeepSeek, local Ollama).
  • CAI, Cybersecurity AI (aliasrobotics/CAI), lightweight, extensible framework; 300+ model backends; built-in recon/exploit/priv-esc tools; the maintained successor many recommend after PentestGPT.
  • PentAGI (vxcontrol/pentagi), fully autonomous multi-agent; 20+ tools (nmap, Metasploit, sqlmap…) in sandboxed Docker; Neo4j/Graphiti knowledge graph + vector memory + web search.
  • HackingBuddyGPT (ipa-lab/hackingBuddyGPT), "ethical hacking with LLMs in <50 lines"; ships reusable Linux priv-esc benchmarks; academic backing (FSE'23).
  • Strix: agentic platform with HTTP-proxy manipulation, browser automation, terminal sessions, a Python exploit env, and CI/CD via GitHub Actions (Apache 2.0).
  • Shannon: white-box AI pentester (~96% on XBOW's 104-challenge benchmark).
  • HexStrike AI MCP: MCP server exposing 150+ security tools to any MCP client (Claude/GPT/Copilot). Plus Nebula, AI-OPS, BlacksmithAI, PentestAgent, CyberStrikeAI.

Discovery hubs: github.com/topics/ai-penetration-testing and github.com/hardenedlinux/agentic-ai-pentest (benchmarks these agents against WebGoat).

8.5 Industry agentic-SOC platforms (applied, defensive)

These platforms show how AI agents are used in real security operations; they are useful for understanding production patterns and integration:

  • Microsoft Security Copilot: agentic SOC inside Defender/Sentinel; GenAI skills, Analyst Notes (auto-reconstructed investigations); build custom agents.
  • CrowdStrike Charlotte AI: Agentic Detection Triage / Response / Workflows on Falcon; Charlotte AI AgentWorks (no-code agent builder; partners incl. AWS, Anthropic, NVIDIA, OpenAI); Falcon Next-Gen MDR; AI Agent Discovery to govern shadow AI.
  • Others: Google Cloud Security AI Workbench / Sec-Gemini, Dropzone AI, Simbian, Arcanna.ai, CyberProof (federated agentic SOC), Palo Alto, Zscaler, Cisco.

Reality check (Gartner): there will be no fully autonomous SOC: human-in-the-loop and strict guardrails remain essential.

8.6 AI-security courses & certifications

Resource | Format | Focus
Microsoft AI Red Teaming 101 | Free (MS Learn) | Lessons from Microsoft's AI Red Team; best free starting point
DeepLearning.AI, Red Teaming LLM Applications | Short, hands-on (w/ Giskard) | Focused LLM red-teaming skills
CAISP, Certified AI Security Professional (Practical DevSecOps) | ~$999, 30+ browser labs, 60-day access | OWASP LLM Top 10, MITRE ATLAS, STRIDE threat modeling, prompt injection, supply chain/AIBOM; the de-facto hands-on standard
OffSec OSAI / OSAI+ (AI-300) | ~50-100 hrs, labs + exam | Offensive testing of GenAI/LLMs/multi-agent systems
EC-Council COASP | New (2026) | Offensive AI security: agentic systems, supply chain, IR/forensics
SANS SEC598 | Instructor-led + capstone | AI & security automation for red/blue/purple teams; AI red-team agents, adversary emulation (MITRE ATT&CK), detection-as-code
Budget/self-paced | airedteaming.eu (~$29, 20 lessons); Learn Prompting AI Red Teaming Masterclass; Udemy "OWASP Top 10 for LLM Applications" | Fundamentals → prompt injection → jailbreaks → RAG/agent risks

8.7 Curated security repos / awesome lists

  • wearetyomsmnv/Awesome-LLMSecOps: LLM/agentic security + ops: tools, CTFs, benchmarks, courses, papers.
  • anmolksachan/AI-ML-Free-Resources-for-Security-and-Prompt-Injection: a structured beginner→advanced AI/ML pentest roadmap (labs, tools, reading, a checklist of milestones).
  • arcanum-sec/ai-sec-resources (AI Security Lab Hub), categorized index of AI security training environments and CTFs.
  • mcp-attack-labs + OWASP MCP Top 10 checklist repos, MCP-specific attack practice.

9. Security datasets for model training (IT security)

Data to train / fine-tune security models, grouped by task. Two broad families: tabular/feature data for classic ML detectors (intrusion, malware, phishing) and text/code data for LLMs (CTI, vuln detection, and LLM-security guardrails). Practical cautions are in §9.9; read them before you train.

9.1 Network intrusion detection (NIDS): tabular flow data

Dataset | Source / year | Scale & shape | Notes
NSL-KDD | cleaned KDD'99 | ~148.5k records (77k benign), 41 features, fixed train/test | De-duplicated KDD'99; good for baselines but dated: don't benchmark seriously on it alone
KDD Cup '99 | DARPA/MIT | very large, 41 features | Historic; redundant records, obsolete attack mix
UNSW-NB15 | UNSW Australia, 2015 | ~2.5M instances, 49 features, 9 attack families; standard split 175,341 train / 82,232 test | Modern low-footprint attacks + netflow; widely used
CIC-IDS2017 | Canadian Institute for Cybersecurity | ~2.83M flows, 80 features, BENIGN + 14 attack types | Most-used realistic IDS set; CIC family dominates the literature
CSE-CIC-IDS2018 | CIC + AWS | larger successor to CIC-IDS2017 | More hosts/attacks; good for scale
Others | CIC-DDoS2019, Bot-IoT, ToN_IoT, Kyoto 2006+, CTU-13 (Stratosphere IPS, botnet), CAIDA | varies | IoT/botnet/DDoS-specific; CTU-13 is real botnet captures

Even UNSW-NB15 and CIC-IDS2017 (2015-2017) are aging relative to current traffic; combine sources and validate on recent data where possible. Class imbalance is severe, so use SMOTE/ADASYN or class weights.
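
As a concrete instance of the class-weight option, the inverse-frequency heuristic w_c = n_samples / (n_classes × count_c) is the same formula scikit-learn applies for `class_weight="balanced"`. The sketch below computes it in plain Python; the label counts are made up for illustration and are not taken from any dataset above.

```python
from collections import Counter


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Inverse-frequency weights: w_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Made-up, heavily imbalanced label column (95% benign).
labels = ["BENIGN"] * 950 + ["DDoS"] * 40 + ["Infiltration"] * 10
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls:<14} weight={weight:.3f}")
```

With these weights, every class contributes the same total weight to the loss (count × weight = n_samples / n_classes), so the rare attack classes are no longer drowned out by benign traffic.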

9.2 Malware detection & classification

Dataset | Source | Scale | Notes
EMBER | Elastic | 1.1M PE files (900k train / 200k test), pre-extracted features | The standard PE benchmark; LightGBM baseline; near-saturated, so good as a starting point
SOREL-20M | Sophos + ReversingLabs | ~20M PE samples (features + disarmed binaries) | Commercial-scale; enables low-false-positive evaluation
Microsoft Malware Classification (BIG 2015) | Kaggle | ~0.5 TB, 9 families | Bytes + disassembly; classic Kaggle set
Android | Drebin, AndroZoo (millions of APKs, registration), Malimg (image-based) | varies | AndroZoo is the large living APK corpus

SOREL/VirusShare/MalwareBazaar distribute live or disarmed malware: handle only in isolated, authorized environments.

9.3 Code-vulnerability detection (for code LLMs)

Dataset | Year | Scale | Notes
Devign (FFmpeg+Qemu) | 2019 | ~27k functions | Real C functions; known label-noise issues
Big-Vul | 2020 | 3,754 CVEs, 91 CWEs, 188,636 C/C++ functions (~5.7% vulnerable) | From CVE→GitHub commits; label accuracy questioned (~25%)
ReVeal / CrossVul / CVEfixes | 2020-21 | varies (CrossVul/CVEfixes are multi-language; CVEfixes ties to NVD) | Commit-linked; CVEfixes most accurate of the older sets
DiverseVul | RAID 2023 | larger, C/C++, from security-issue commits | More diverse; merges cleanly with the above
PrimeVul | 2024 | 6,968 vulnerable + 228,800 benign functions, 140 CWEs | Merges Big-Vul/CrossVul/CVEfixes/DiverseVul with better labels + de-duplication: prefer this for fine-tuning/eval
Juliet / SARD | NIST | large, synthetic | Synthetic test cases; clean labels but not "in-the-wild"

wagner-group/diversevul publishes merged splits; CyberSecEval (Meta Purple Llama) ships an Insecure Code Detector to verify fixes.
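
When merging several of these function-level sets yourself, near-identical functions can land on both sides of a train/test split. A minimal de-duplication sketch follows, assuming C-like source held as strings. The normalization (strip comments and all whitespace, then hash) is an illustrative choice, not the procedure any dataset above documents, and the comment-stripping regexes are naive about comment markers inside string literals.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Strip C-style comments and all whitespace so formatting changes don't matter."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    return re.sub(r"\s+", "", code)


def fingerprint(code: str) -> str:
    """Stable hash of the normalized function body."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def drop_leaked(train: list[str], test: list[str]) -> list[str]:
    """Keep training functions that are neither in the test set nor repeated."""
    test_hashes = {fingerprint(fn) for fn in test}
    seen: set[str] = set()
    kept = []
    for fn in train:
        digest = fingerprint(fn)
        if digest in test_hashes or digest in seen:
            continue
        seen.add(digest)
        kept.append(fn)
    return kept
```

Exact hashing after normalization only catches formatting-level duplicates. Renamed variables or lightly edited clones need token-level or similarity-based matching.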

9.4 Phishing, malicious URLs & spam

UCI Phishing Websites (feature-engineered), PhishTank (live verified phishing URLs), Mendeley phishing datasets, Malicious URLs (~2.4M URLs), PhishStorm, Nazario phishing email corpus, SpamAssassin/Enron (spam classification).

9.5 Cyber threat intelligence (CTI): text & NER

MITRE ATT&CK (STIX/TAXII), CVE / NVD feeds, and labeled NER/relation sets: AnnoCTR, CASIE (event extraction), DNRTI, APTNER/CyNER (threat-report entity tagging). For evaluation, CTIBench (CTI-MCQ, CTI-RCM root-cause, CTI-VSP CVSS prediction, CTI-ATE technique extraction).

9.6 Logs, endpoint & attack telemetry

Loghub (HDFS, BGL, Thunderbird, OpenStack, Spark… for log anomaly detection), OTRF Security-Datasets / Mordor (ATT&CK-mapped host telemetry), splunk/attack_data, LANL cyber-security event data, plus SecRepo.com sample collections.

9.7 LLM-security data: prompt injection, jailbreak, agent safety

For training guardrails / detectors and for red-team evaluation of LLM/agent apps:

  • Prompt injection: deepset/prompt-injections (~662 labeled examples; the common fine-tune set, on which published detectors reach ~99% F1), jayavibhav/prompt-injection (~327k), Lakera/gandalf_ignore_instructions, HackAPrompt (competition-sourced), TensorTrust, BIPIA (indirect/RAG injection), SPML, and agent-tool sets InjecAgent + ToolEmu; multimodal: facebook/cyberseceval3-visual-prompt-injection.
  • Jailbreak: TrustAIRLab/in-the-wild-jailbreak-prompts (the "DAN"/in-the-wild corpus), AdvBench (520 harmful behaviors), HarmBench, JailbreakBench (JBB-Behaviors), WildJailbreak (~262k prompt-response pairs incl. benign controls), SALAD-Bench, and agent-specific AgentHarm / AgentDojo (629 hijack cases).
  • Response safety / over-refusal control: BeaverTails (~330k), PKU-SafeRLHF, WildGuardMix, Do-Not-Answer, OR-Bench (to avoid training an over-cautious model).
  • Aggregator: Necent/llm-jailbreak-prompt-injection-dataset compiles most of the above with consistent labels.

9.8 Cybersecurity LLM corpora & instruction/reasoning tuning

For building a security-specialized LLM end-to-end, the Primus suite (Trend Micro, on Hugging Face trendmicro-ailab, ODC-BY / MIT) is the reference open release, covering every stage:

  • Primus-Seed: manually curated cybersecurity text + expert CTI (pretraining).
  • Primus-FineWeb: cybersecurity text filtered from FineWeb/Common Crawl (large-scale pretraining).
  • Primus-Instruct: instruction fine-tuning.
  • Primus-Reasoning: CTI reasoning traces with self-reflection (distillation).

Reported gains: continual pretraining +15.88% aggregate; reasoning distillation +10% on CISSP. Evaluate security LLMs with CyberMetric, CTIBench, SecEval, CyberSecEval (Meta Purple Llama), SECURE, and NYU CTF.

9.9 Practical cautions (read before training)

  • Staleness: classic NIDS sets (KDD'99, NSL-KDD) don't reflect modern traffic; prefer recent CIC/UNSW sets and re-validate.
  • Label noise: Big-Vul/Devign labels are unreliable; prefer PrimeVul for vuln detection.
  • Class imbalance: malicious samples are rare, so use resampling, focal loss, or class weights, and report metrics beyond accuracy (PR-AUC, TPR at a low fixed FPR).
  • Leakage & contamination: de-duplicate; never train on your eval split; LLM benchmarks (CTIBench, CyberSecEval) can leak into web-scraped pretraining data, so check for overlap.
  • Licensing & safety: CIC, AndroZoo, and others require registration/terms; live-malware corpora demand isolation. Confirm the license before any commercial use.
  • Provenance / poisoning: training data is itself an attack surface (data/RAG poisoning, §8.1); vet sources and pin versions.
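
The "metrics beyond accuracy" point can be made concrete with TPR at a fixed false-positive budget, which is how detectors are usually judged when benign traffic dominates. The sketch below is a plain-Python version under simple assumptions: label 1 means malicious, higher scores mean more suspicious, and the scores are made up for illustration.

```python
def tpr_at_fpr(scores: list[float], labels: list[int], max_fpr: float) -> float:
    """True-positive rate at the loosest threshold whose FPR stays <= max_fpr."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign samples we may flag without exceeding the FPR budget.
    allowed_fp = int(max_fpr * len(negatives))
    # Flag only scores strictly above the (allowed_fp + 1)-th highest benign score.
    threshold = negatives[allowed_fp] if allowed_fp < len(negatives) else float("-inf")
    return sum(s > threshold for s in positives) / len(positives)


# Made-up detector scores: 4 malicious (1) and 6 benign (0) samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

for budget in (0.0, 0.2, 1.0):
    print(f"TPR at FPR <= {budget:.1f}: {tpr_at_fpr(scores, labels, budget):.2f}")
```

On this toy data the detector catches only half the attacks at zero false positives, which a single accuracy number would hide.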

10. Practical learning paths

A) General agentic AI engineer

  1. App-layer foundations: DeepLearning.AI short courses + LangChain Academy (§3).
  2. Build a RAG system: DataTalks LLM Zoomcamp (§3).
  3. Learn agents hands-on: HF Agents Course (§7), then build with one framework from §4 (CrewAI to start fast, LangGraph for depth).
  4. Add MCP tools and (optionally) A2A multi-agent (§6).
  5. Productionize: FastAPI/Docker + an eval harness + a guardrail layer (§8.3).
  6. Deploy on a managed runtime if needed (§5).

B) Security-focused AI engineer (the niche)

  1. Do path A steps 1-3 (you must understand how agents work to secure them).
  2. Learn the standards: OWASP LLM Top 10, OWASP Agentic 2026, MITRE ATLAS, NIST AI RMF (§8.1).
  3. Work the labs in order: PortSwigger Web LLM Attacks → Gandalf → Damn Vulnerable LLM Agent → Damn Vulnerable MCP Server → PromptTrace Gauntlet (§8.2).
  4. Tool up: run Garak and Promptfoo in CI, PyRIT for deep manual testing; add LLM Guard/NeMo Guardrails on the defense side (§8.3).
  5. Build/operate an offensive agent on authorized targets: PentestGPT or CAI against WebGoat (§8.4).
  6. Train a security model with §9 data: fine-tune a vuln detector on PrimeVul, a prompt-injection guardrail on deepset + WildJailbreak, or build a cyber-LLM with the Primus suite.
  7. Certify: Microsoft AI Red Teaming 101 (free) → CAISP or OffSec AI-300 (§8.6).
  8. Study production patterns: how Security Copilot / Charlotte AI structure agentic SOC workflows (§8.5).


11. Source list (representative)

Frameworks/platforms/models: firecrawl.dev, turing.com, cordum.io, alicelabs.ai, qubittool.com, gurusup.com, pecollective.com (framework comparisons); uibakery.io, marktechpost.com, vellum.ai, xpander.ai, agentmarketcap.ai, datalakehousehub.com (enterprise platforms); Anthropic MCP docs + A2A project (protocols); Nous Research (Hermes / DeepHermes model cards + Hermes Agent docs), OpenRouter, Hugging Face (open agentic/tool-use models).

Courses/labs (general): deeplearning.ai, hf.co/learn (LLM/Agents/Deep RL/MCP), course.fast.ai, cs336.stanford.edu, fullstackdeeplearning.com, datatalks.club, microsoft GitHub repos, Shubhamsaboo/awesome-llm-apps, llmagents-learning.org / agenticai-learning.org.

Security datasets: shramos/Awesome-Cybersecurity-Datasets, trenton3983/Cybersecurity-Datasets, SecRepo.com (aggregators); UNSW-NB15 + Bot-IoT (UNSW), CIC-IDS2017/2018 (Canadian Institute for Cybersecurity), CTU-13 (Stratosphere); EMBER (Elastic), sophos-ai/SOREL-20M; wagner-group/diversevul, PrimeVul, Big-Vul, CVEfixes, NIST SARD; CTIBench, MITRE ATT&CK; Loghub, OTRF/Security-Datasets, splunk/attack_data; deepset/prompt-injections, TrustAIRLab/in-the-wild-jailbreak-prompts, WildJailbreak, JailbreakBench, AdvBench, BeaverTails, Necent/llm-jailbreak-prompt-injection-dataset; Primus suite + Llama-Primus models (trendmicro-ailab, EMNLP 2025), Meta Purple Llama CyberSecEval.

Security frameworks: OWASP GenAI Security Project (LLM Top 10 2025, Agentic 2026, MCP Top 10), MITRE ATLAS, NIST AI RMF / AI 600-1, ISO/IEC 42001, Google SAIF; explainers from vectra.ai, paloaltonetworks.com, secportal.io, secra.es, practical-devsecops.com.

Security labs/tools: bishopfox.com (CTF roundup), arcanum-sec.github.io/ai-sec-resources, wearetyomsmnv/Awesome-LLMSecOps, anmolksachan/AI-ML-Free-Resources-for-Security-and-Prompt-Injection; PortSwigger Web Security Academy, gandalf.lakera.ai, ReversecLabs/damn-vulnerable-llm-agent; promptfoo.dev, NVIDIA Garak, Microsoft PyRIT, Confident AI DeepTeam, ProtectAI LLM Guard, NVIDIA NeMo Guardrails.

Offensive agents: GreyDGL/PentestGPT (USENIX 2024), aliasrobotics/CAI, vxcontrol/pentagi, ipa-lab/hackingBuddyGPT, Strix, Shannon, HexStrike AI; hardenedlinux/agentic-ai-pentest, github topics ai-penetration-testing.

Industry SOC: Microsoft (Security Copilot), CrowdStrike (Charlotte AI / AgentWorks), venturebeat.com, cyberproof.com.

Certifications: Microsoft Learn (AI Red Teaming 101), DeepLearning.AI (Red Teaming LLM Applications), Practical DevSecOps (CAISP), OffSec (AI-300), EC-Council (COASP), SANS (SEC598), airedteaming.eu, learnprompting.org.


Compiled June 2026. The agent-framework and AI-security landscapes move monthly (frameworks hit new GAs, tools get acquired, standards get revised); verify versions and details on each project's own page before relying on them. Offensive tooling and labs are for authorized, ethical use only.