Agentic AI Knowledge Base — Study Guide

Reorganized by exam domain, with original source content fully preserved, citations verified, and 2026 supplementary research added to every domain.


Table of Contents

  1. Domain 1: Agent Architecture and Design (15%)
  2. Domain 2: Agent Development (15%)
  3. Domain 3: Evaluation and Tuning (13%)
  4. Domain 4: Deployment and Scaling (13%)
  5. Domain 5: Cognition, Planning, and Memory (10%)
  6. Domain 6: Knowledge Integration and Data Handling (10%)
  7. Domain 7: NVIDIA Platform Implementation (7%)
  8. Domain 8: Run, Monitor, and Maintain (5%)
  9. Domain 9: Safety, Ethics, and Compliance (5%)
  10. Domain 10: Human-AI Interaction and Oversight (5%)
  11. Final Synthesis & Review Checklist
  12. References

Domain 1: Agent Architecture and Design (15%)

Foundational structuring and design of agentic AI systems, focusing on how agents interact, reason, and communicate within their environments.

ReAct: The Thought → Action → Observation Loop (Heavily Tested)

What it means

ReAct (Reasoning + Acting) is a prompting and execution paradigm that interleaves verbal reasoning traces (“Thoughts”) with concrete actions and their results (“Observations”). It was introduced to let language models dynamically create, maintain, and revise plans while grounding those plans in external information or environment feedback [1].

Why it matters

Pure chain-of-thought (CoT) reasoning happens entirely inside the model and can hallucinate or drift. Pure tool-calling or “Act-only” approaches lack high-level planning and exception handling. ReAct creates synergy: reasoning guides which actions to take and when to stop or replan; observations supply fresh facts that update reasoning and reduce hallucinations. The resulting trajectories are human-readable, debuggable, and controllable — critical for trust and iteration.

How it works (core loop)

The agent maintains a growing history and repeatedly executes:

  • Thought — The LLM reasons aloud about the current state, what is known/unknown, subgoals, and the next step (e.g., “I need recent sales data for Q3; the database tool can provide it”).
  • Action — The LLM outputs a structured action (tool call or environment command). Modern implementations use tool-calling APIs or strict parsing.
  • Observation — The system executes the action and appends the result (success data, error, or environment feedback) to the history.
  • Loop or terminate — The LLM decides whether to continue or produce a final answer. A maximum iteration limit prevents infinite loops.

Typical few-shot prompt format (simplified):

Thought: ...
Action: tool_name with arg is value
Observation: ...
Thought: ...
Action: Final Answer[result]

In zero-shot or structured-output setups, the model is instructed to emit the same format and a parser extracts the action.

Key scholarly reference

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629 [1].

Benchmarks & evidence (from the paper)

  • HotpotQA & FEVER (multi-hop reasoning + search): ReAct competitive with or better than CoT; hybrids with self-consistency excel. ReAct shows lower hallucination rates.
  • ALFWorld (embodied household tasks): ReAct ~57% success rate vs. Act-only ~41% and imitation-learning baseline ~37%.
  • WebShop (web navigation & shopping): ReAct achieves higher success rate and score than imitation + RL baselines.
  • Human-in-the-loop editing of thoughts can turn failures into successes.

Extensions

Reflexion (Shinn et al., 2023) adds verbal self-reflection on failures stored in episodic memory, yielding large gains on coding and decision tasks (e.g., 91% on HumanEval in some settings) [2].

Real-world examples

  • Research agents that search papers/databases, reason about findings, and synthesize reports.
  • Enterprise analytics agents that query internal databases, calculate metrics, and explain anomalies.
  • Web or API agents for competitive intelligence or automated procurement.
  • Embodied or simulated environments (robotics, game agents).

Common failure modes & mitigations

  • Infinite or repetitive loops — Enforce max_iterations; add repetition detection in state or use reflection.
  • Action parsing failures — Prefer native tool-calling APIs or robust structured output.
  • Context bloat / token explosion — Summarize or prune history; use hierarchical memory.
  • Poor planning / inefficient paths — Add planning nodes, Reflexion-style reflection, or tree search (e.g., LATS).
  • High latency & cost — Each cycle costs ≥1 LLM call; cache observations, parallelize when safe, or fall back to faster models for simple sub-tasks.
  • Non-determinism — Seed or temperature control + evaluation harnesses; log full trajectories for debugging.

Evaluation dimensions

  • Task success rate & efficiency (steps/tokens to completion).
  • Trajectory quality (human preference or automated judges for coherence, grounding, avoidance of hallucination).
  • Robustness (tool failures, ambiguous observations).
  • Controllability & interpretability (can a human edit a thought and resume?).
  • Cost/latency vs. simpler baselines.

Practice exercises & mini-projects

  1. From-scratch ReAct — Implement the loop in Python with your LLM provider + 2–3 tools (web search, calculator, simple DB). Test on 5–10 multi-hop questions. Compare success rate, cost, and latency against plain CoT and a simple tool-calling chain.
  2. Add reflection — After a failed trajectory, prompt the model to critique and store lessons; run a second attempt. Measure improvement.
  3. Mini-project — Build a research agent that must use tools to answer a question requiring both internal reasoning and external data. Instrument logging of every Thought/Action/Observation. Evaluate with the dimensions above.

Stateful Orchestration as an Architectural Pattern

What it means & why it matters

ReAct (or any reasoning loop) is usually only one node inside a larger workflow. Real production agents run for many steps, interact with external systems, may require human approval, and must survive crashes or restarts. Stateful orchestration models the entire workflow as an explicit, persistent state machine. Checkpointing saves progress so the system can recover, resume, or allow human intervention without losing work. Graph-based frameworks with first-class checkpointing and state management (especially LangGraph) are repeatedly cited as the standard for controllable, auditable, long-running agentic systems.

How it works (LangGraph as the leading exemplar)

  • Graph as state machine — Nodes = discrete operations (LLM calls, tool execution, custom logic, sub-graphs). Edges = control flow (including conditional routing based on state).
  • Shared state — A single typed object (TypedDict / Pydantic model) holds messages, intermediate results, metadata, and working memory. Updates use reducers (pure functions) that safely merge concurrent changes (e.g., append to a list of observations).
  • Checkpointing & persistence — Attach a checkpointer at compile time (MemorySaver for development; SQLite/Postgres or custom backends for production). After configurable steps (or every node), the current state is serialized and stored under a thread_id + checkpoint_id.
  • Recovery — On restart or error, load the latest (or any historical) checkpoint and resume exactly where it left off.
  • Time travel / debugging — Replay, fork, or inspect any past state.
  • Human-in-the-loop (HITL) — Insert interrupt nodes; execution pauses, state is persisted, and an external system (UI, API, queue) supplies input before resuming.
  • Multi-step & multi-agent coordination — Explicit edges support sequential pipelines, conditional branching, parallel fan-out (multiple nodes run and update shared state), loops with termination conditions, retries, and fallback paths. Shared state serves as a “blackboard” for coordination. Sub-graphs enable hierarchical designs.

Architectural tradeoffs

  • Stateful graph vs. simple stateless chain — Stateful wins on reliability, debuggability, auditability, complex branching, and HITL. It adds implementation effort, persistence infrastructure, and potential latency/storage overhead. Use it when workflows are long-running, multi-step, or require recovery/guarantees; use lighter chains for high-volume, simple, idempotent tasks.
  • Checkpointing overhead vs. benefit — Essential for production SLAs and compliance; overkill (and costly) for short, stateless prototypes.
  • LangGraph vs. alternatives — LangGraph leads for granular control, persistence, and ecosystem maturity. CrewAI offers faster multi-agent role-based development with lighter state management. AutoGen emphasizes conversational coordination. Vendor SDKs (OpenAI, Google, Anthropic) add convenience but often delegate durable storage and complex orchestration to the developer or LangGraph-like layers.

Common failure modes & mitigations

  • State inconsistency or lost updates — Design reducers carefully; test concurrent updates.
  • Non-idempotent side effects on recovery — Make tools/actions idempotent or implement compensation logic (saga pattern).
  • Checkpoint bloat or performance hit — Prune state, use efficient serialization, or checkpoint only at milestones.
  • Security / persistence-layer risks — Exposed or poorly configured checkpointer databases can introduce injection or data-exposure risks; treat persistence as a secured, authenticated service.
  • Observability gaps — Integrate tracing (LangSmith, Phoenix, etc.) that captures full state transitions and node executions.
  • Over-engineering — Start simple; add state/checkpoints only where recovery, HITL, or audit requirements exist.

Evaluation for production readiness

  • Functional correctness under normal and fault-injection conditions.
  • Recovery metrics: time-to-recovery, success rate of resume from checkpoint.
  • Audit & compliance: completeness of logged state transitions and decision traces.
  • Observability, cost/latency overhead of persistence, maintainability (ease of modifying graphs and testing sub-flows).
  • Governance: ability to insert guardrails, approval gates, and rollback points.

Practice exercises & mini-projects

  1. Stateful ReAct in a graph — Implement a ReAct-style loop as nodes inside a LangGraph (or equivalent). Add a checkpointer. Invoke, interrupt mid-loop, and resume from checkpoint.
  2. Multi-step coordinator with shared state — Build a planner → parallel researchers (update shared findings) → synthesizer workflow. Add conditional routing (“if confidence low → human review”) and a final approval checkpoint.
  3. Resilience mini-project — Simulate node or infrastructure failure mid-workflow. Demonstrate recovery from the last checkpoint. Measure recovery time and verify no duplicate side-effects. Instrument full tracing and produce an audit log of state changes.

When to use ReAct + stateful orchestration vs. simpler patterns

Use when the task is complex, uncertain, interactive with external systems, long-running, requires adaptation/grounding, auditability, or human oversight. Prefer simpler chains, direct tool calling, or advanced native reasoning models when latency/throughput is paramount, tasks are short and deterministic, or the workflow is purely internal.

Summary checklist for mastery

Recent Developments (2026)

The 2025–2026 literature has moved past the original 12 “foundational” agent design patterns (ReAct, Reflection, Tool Use, Planning, Multi-Agent Collaboration, Sequential Workflows, Human-in-the-Loop, etc.) toward a wave of emergent patterns addressing production constraints: context management, bounded execution, layered safety controls, memory, and meta-level orchestration, with newer catalogs grouping roughly 21 widely used patterns into five families [17]. A parallel research thread has proposed more formal frameworks for classifying these patterns — for example, a two-dimensional framework organizing agent design along “cognitive function” and “execution topology” axes [15], and a system-theoretic framework treating agentic design patterns as control-theoretic building blocks [16].

Architecturally, the field is also shifting from single, all-purpose agents toward orchestrated teams of specialized agents — Gartner reported a 1,445% surge in multi-agent system inquiries between Q1 2024 and Q2 2025 — with “puppeteer” orchestrator agents coordinating specialist sub-agents in production deployments [14]. Google Cloud’s architecture guidance for choosing a design pattern for agentic AI systems (sequential, hierarchical, or collaborative) reflects this same trend toward explicit topology selection as a first design decision [18].


Domain 2: Agent Development (15%)

Practical building, integration, and enhancement of agents.

Stateful Orchestration — MUST-Have Practical Capabilities

Stateful orchestration is the production-grade execution layer that turns individual reasoning loops (such as ReAct) into reliable, resumable, auditable, and controllable multi-step workflows. While ReAct provides the cognitive pattern (Thought → Action → Observation), stateful orchestration supplies the memory, persistence, control flow, and safety mechanisms required for anything beyond trivial single-turn interactions.

This section focuses on the four MUST-HAVE capabilities for production agentic systems:

  1. State Management — Persistence across multi-step workflows
  2. Checkpointing & Recovery — Resume from last successful state
  3. Multi-Step Coordination — Sequential execution, conditional branching
  4. Interrupt Patterns — Human approval gates for high-risk actions

1. State Management: Persistence Across Multi-Step Workflows

What it means

State is a shared, structured data object that lives across every step of a workflow. It holds conversation history, intermediate results, metadata, and any values needed by downstream nodes. In graph-based systems, state is the single source of truth that all nodes read from and write to.

Why it matters

Without persistent shared state, each step is isolated. Agents cannot remember prior observations, coordinate across branches, or resume intelligently. Persistent state enables long-running processes, multi-agent blackboard-style coordination, and full audit trails — essential for compliance, debugging, and reliability.

How it works

Define a typed state schema (e.g., TypedDict or Pydantic model). Nodes receive the current state and return updates. Reducers are pure functions that safely merge updates (e.g., append to a list of observations or overwrite a field). The framework manages state passing and merging automatically.

Common tools & frameworks

LangGraph’s StateGraph is the leading implementation: explicit state schema + reducers. Other frameworks offer lighter or implicit state (CrewAI tasks, AutoGen conversations) but LangGraph provides the most control and production maturity for complex persistence.

Real-world examples

  • Research agent: shared state accumulates search results, summaries, and citations across multiple tool calls and reasoning steps.
  • Customer support workflow: state carries customer history, prior actions, and open issues across escalation branches.
  • Multi-agent system: planner, researcher, and synthesizer nodes all read/write to the same state object (blackboard pattern).

Common failure modes & mitigations

Poor reducer design causing lost updates or race conditions in parallel branches; state bloat (accumulating every message forever); inconsistent schemas across nodes. Mitigations — use well-defined reducers, prune or summarize state periodically, enforce schema validation, and separate short-term thread state from long-term memory stores.

Evaluation

Measure state consistency under concurrent updates, audit completeness (can you reconstruct the full decision path from state history?), and memory efficiency (tokens or storage per workflow).

2. Checkpointing & Recovery: Resume from Last Successful State

What it means

Checkpointing periodically serializes the full graph state (including messages, intermediate results, and metadata) to durable storage. Recovery loads a prior checkpoint and resumes execution from that exact point without re-running completed work.

Why it matters

LLM calls, tool executions, and long-running workflows are expensive and non-deterministic. Crashes, timeouts, or deployments must not force full restarts. Checkpointing delivers fault tolerance, time-travel debugging, and the foundation for human-in-the-loop pauses. Robust checkpointing is repeatedly cited as a key differentiator for reliable agentic systems in production.

How it works

Attach a checkpointer when compiling the graph (e.g., MemorySaver for development, AsyncSqliteSaver or Postgres for production). Every node execution (or configurable milestones) writes a checkpoint identified by thread_id + checkpoint_id. On failure or restart, pass the same thread_id and the system loads the latest (or chosen) checkpoint and continues.

Key patterns

  • A durable checkpointer is required for production persistence across API calls or restarts.
  • Time travel: inspect, replay, or fork from any historical checkpoint.
  • Works seamlessly with interrupts (state is saved when paused).

Real-world examples

  • Overnight research or data-processing pipelines that survive infrastructure restarts.
  • Compliance workflows that must resume exactly after an approval delay or system outage.
  • Debugging: replay a failed trajectory from the checkpoint just before the error.

Common failure modes & mitigations

Using an in-memory checkpointer in production (state lost on restart); non-idempotent actions executed again on recovery (duplicate emails, duplicate charges); checkpoint bloat or slow serialization of very large states. Mitigations — use durable backends, design actions to be idempotent or include compensation logic, checkpoint only at safe milestones, and implement state pruning.

Evaluation

Recovery success rate, mean time to recovery (MTTR), storage/latency overhead of checkpoints, and ability to replay historical runs for audit or A/B testing.

3. Multi-Step Coordination: Sequential Execution + Conditional Branching

What it means

The workflow is modeled as an explicit directed graph. Nodes perform work (LLM calls, tool use, ReAct loops, custom logic). Edges define control flow. Conditional edges route dynamically based on state values (e.g., confidence score, risk level, or observation content).

Why it matters

Real tasks are rarely linear. Agents must branch on conditions (“if high risk → human review”), run steps in parallel and merge results, loop until convergence, or escalate. Explicit graphs give controllability, observability, and testability that implicit or prompt-only orchestration lacks.

How it works

  • Sequential: nodes connected by normal edges execute in order.
  • Conditional branching: add_conditional_edges(source, routing_function, path_map) where the routing function inspects state and returns the next node name(s).
  • Parallel: fan-out to multiple nodes; results merge via reducers in shared state.
  • Loops: edges that return to earlier nodes with termination conditions in state or routing logic.

Common tools

LangGraph excels here with first-class conditional edges and graph visualization. Other frameworks support sequencing and some branching but usually with less explicit control.

Real-world examples

  • Diagnostic agent: sequential data gathering → conditional branch (simple case vs. complex case requiring specialist review).
  • Content pipeline: research (parallel tool calls) → analysis → conditional (needs fact-check or human edit).
  • Incident response: detection → triage (conditional severity routing) → remediation with approval gates.

Common failure modes & mitigations

Infinite loops from missing or buggy termination conditions; dead branches or unreachable nodes; race conditions in parallel updates without proper reducers. Mitigations — add explicit termination logic and max-iteration guards; use graph visualization and static analysis tools; test all routing paths.

Evaluation

Branch coverage in tests, correctness of routing decisions, latency of coordination overhead, and maintainability (ease of modifying flow).

4. Interrupt Patterns: Human Approval Gates for High-Risk Actions

What it means

Interrupts pause graph execution at defined points, persist the current state, and wait for external (usually human) input before resuming. This implements human-in-the-loop (HITL) approval, editing, or rejection for high-risk actions.

Why it matters

Fully autonomous execution of high-stakes actions (financial transfers, medical recommendations, data deletions, customer communications, compliance decisions) is unacceptable in most real deployments. Interrupts provide controllable safety valves while preserving automation for low-risk paths. They are a core governance and safety mechanism.

How it works

Two complementary approaches:

Static interrupts (at compile time):

graph = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["high_risk_action_node"],
    interrupt_after=["some_other_node"]
)

Dynamic interrupts (recommended for flexible HITL — called inside a node):

from langgraph.types import interrupt, Command

def human_review_node(state):
    payload = {"need": "approval", "draft": state["draft"], "risk": state["risk_score"]}
    decision = interrupt(payload)          # pauses here, saves state
    return {"approved": decision["approved"], "feedback": decision.get("feedback")}

Resume from outside:

result = app.invoke(
    Command(resume={"approved": True, "feedback": "Looks good, proceed"}),
    config={"configurable": {"thread_id": "abc123"}}
)

When an interrupt fires, the checkpointer saves the exact state. Execution can resume minutes, hours, or days later from any client that knows the thread_id.

Real-world examples (high-risk gates)

  • Before executing a financial transaction or refund.
  • Before sending a medical or legal recommendation.
  • Before deleting customer data or making irreversible configuration changes.
  • Before escalating to a human specialist or external system.
  • Content moderation or publishing approval.

Common failure modes & mitigations

Using non-durable checkpointers (interrupt state lost); poor UX for human reviewers (missing context in the interrupt payload); timeouts or orphaned interrupts if resumption logic is fragile; over-use of interrupts (every step becomes a bottleneck). Mitigations — always pair interrupts with durable checkpointers and consistent thread IDs; include rich, relevant context in the interrupt payload; design clear approval UIs; reserve interrupts for genuinely high-risk nodes.

Evaluation

Approval rate and latency, safety incidents prevented, human reviewer satisfaction, audit completeness of approval decisions, and overall workflow throughput with interrupts enabled.

Integrated Tradeoffs & When to Use

Approach Reliability & Recovery Controllability / HITL Complexity & Overhead Best For Avoid When
Stateless chains / simple ReAct Low Low Low High-volume, low-risk, simple tasks Long-running or high-stakes work
Stateful graph + checkpointing High Medium–High Medium Most production multi-step agents Ultra-low latency requirements
+ Conditional branching High High Medium–High Adaptive, decision-heavy flows Purely linear pipelines
+ Dynamic interrupts (HITL) Very High Very High High High-risk or regulated domains Fully autonomous low-risk tasks

Stateful orchestration with all four MUST-HAVEs is the default recommendation for any production agent that performs non-trivial, multi-step, or consequential work.

Hands-On Practice (Builds Directly on ReAct Work)

Exercise: Stateful Research + Decision Agent with Checkpoints, Branching, and Approval Gates

  1. Define a state schema containing: messages, research_findings, risk_score, final_recommendation, approved.
  2. Create nodes: researcher (ReAct-style loop or tool calls), risk_assessor, high_risk_approval (uses dynamic interrupt), executor (only runs if approved).
  3. Add conditional edges: after risk assessment, route high-risk cases to the interrupt node; low-risk cases proceed directly.
  4. Compile with a durable checkpointer and interrupt_before or dynamic interrupt on the approval/execution path.
  5. Test: (a) normal happy path (low risk); (b) high-risk path — interrupt fires, state is persisted, resume later with approval/rejection/edited feedback; (c) simulate crash mid-workflow and demonstrate recovery from checkpoint.
  6. Add observability (trace every state change and interrupt).

Mini-project extension

Instrument full audit logging. Measure recovery success, approval latency, and token/state growth. Compare against a non-checkpointed, non-interrupt version on the same tasks.

Summary Checklist

Recent Developments (2026)

The framework landscape for building agents has consolidated and matured rapidly. OpenAI replaced its experimental Swarm framework with the production-grade Agents SDK, then extended it with native sandbox execution and a model-native harness for secure, long-running agents, and introduced AgentKit — including a visual Agent Builder canvas for versioning multi-agent workflows and ChatKit for embedding chat-based agent experiences [19]. Google shipped ADK (Agent Development Kit) 1.0 for Java and Go (alongside existing Python/TypeScript SDKs), adding native Agent-to-Agent (A2A) protocol support and a visual Agent Designer in the Google Cloud console [20]. Microsoft’s Agent Framework 1.0 went GA in April 2026, merging AutoGen and Semantic Kernel into a single .NET/Python SDK [21]. Independent framework comparisons published in 2026 continue to position LangGraph as the strongest choice for granular state control and durable execution, CrewAI for fast role-based multi-agent prototyping, and the OpenAI/Google/Microsoft SDKs as increasingly capable managed alternatives [22].

On the durable-execution side specifically, 2026 production guides emphasize pairing LangGraph checkpointers with dedicated durable-execution engines (e.g., Temporal) and note optimizations such as delta-based checkpoint storage that can reduce persisted-state size by orders of magnitude at scale [6].


Domain 3: Evaluation and Tuning (13%)

Measuring, comparing, and optimizing agent performance.

Evaluation Pipelines (Task Completion, Accuracy, Latency)

What it means

An evaluation pipeline is a repeatable, automated system that runs agents on curated test cases, scores outputs against expected outcomes or rubrics, and measures multiple dimensions: task completion/success rate, accuracy/quality (factuality, helpfulness, safety), latency (end-to-end and per-step), cost (tokens, tool calls), and reliability (consistency across runs or under perturbation).

Why it matters

Without rigorous pipelines, “it works in my demo” does not translate to production. Surveys of agent evaluation highlight that it must cover core capabilities (planning, tool use, reflection), application-specific benchmarks (web agents, SWE-bench-style coding, long-horizon tasks), and generalist performance while accounting for stochasticity [3].

How it works (practical pipeline)

  1. Golden dataset creation — Curated examples (human-annotated or synthetically generated + verified). Include edge cases, ambiguity, and long-horizon scenarios.
  2. Execution harness — Run the agent (with fixed config, memory, checkpointer state) on each case. Capture full traces (thoughts, actions, observations, state snapshots, interrupts).
  3. Grading — Programmatic checks (exact match, tool-use correctness, state invariants) + calibrated LLM-as-judge (with few-shot rubrics, reference answers, and consistency checks). Use multiple judges or human review for high-stakes cases.
  4. Metrics aggregation — Success rate (with reliability math: vs. single-pass gaps can reach ~25% in long-horizon benchmarks), latency distributions, cost, error taxonomy.
  5. CI/CD integration — Gate deployments or prompt/model changes on eval thresholds. Production traces feed back into the dataset (closed-loop improvement).

Common tools/frameworks

LangSmith, Phoenix, or custom harnesses built on LangGraph’s tracing + checkpoint replay. Benchmarks: SWE-Bench, WebArena, GAIA/Gaia2-style environments, and long-horizon suites (e.g., DeepPlanning, MemGym) [3].

Real-world examples

  • Research agent evaluated on multi-hop question sets with tool-use accuracy + citation faithfulness + latency.
  • Stateful customer workflow: completion rate under simulated failures + recovery success + human approval latency.
  • High-risk decision agent: safety violation rate + calibration of interrupt frequency.

Common failure modes & mitigations

Over-reliance on single success rate (ignores variance/stochasticity); LLM judges that are poorly calibrated or biased; ignoring latency/cost in “accuracy-only” evals; static datasets that don’t reflect production distribution shift. Mitigations — use statistical reliability framing, multi-judge ensembles with calibration, track full distributions (not just means), and maintain living datasets from production traces. Key dimensions: task success, quality/faithfulness, efficiency (latency + cost), robustness (to noise/failure), controllability (via interrupts/state), and safety.

Practice exercise

Build a minimal eval harness for your stateful ReAct research agent. Create 10–20 golden cases. Run with tracing. Implement a simple LLM judge + programmatic checks. Report success rate, average steps/latency, and error breakdown. Add one production trace as a new test case.

Performance Benchmarking and A/B Testing

What it means

Benchmarking compares systems or versions on standardized or custom suites. A/B testing (or online experimentation) runs controlled variants in production (or shadow) to measure real-user or real-task impact.

Why it matters

Benchmarks reveal relative strengths; A/B testing validates whether changes (new memory, different planner, parameter tweak, interrupt policy) actually move the needle on business or reliability metrics without harming others.

How it works

  • Offline: Run variants on the same golden set + long-horizon benchmarks.
  • Online/A/B: Route a percentage of traffic or tasks to variant A vs. B; measure primary metrics (completion, satisfaction, safety incidents) and guardrails (latency p95, cost, error rate). Use statistical significance testing.

Trade-offs & considerations

Agent stochasticity requires more samples or metrics. Long-horizon tasks amplify small differences. Production A/B must respect safety (e.g., route high-risk cases to human review or a conservative variant).

Practice

A/B test two memory configurations or two planning strategies (plain ReAct vs. ReAct + reflection) on your eval set and on a small live traffic slice. Report effect sizes on success, latency, and cost.

Parameter Tuning: Temperature, Top-p, Max Tokens + Creativity vs. Determinism Trade-offs

What it means

  • Temperature: Controls randomness (0 = deterministic, higher = more creative/diverse).
  • Top-p (nucleus sampling): Considers only the smallest set of tokens whose cumulative probability exceeds p.
  • Max tokens: Hard cap on output length.

Why it matters

These directly affect output diversity, coherence, and reliability. In agentic loops they influence planning quality, tool-call formatting, reflection depth, and whether the agent explores novel paths or sticks to safe ones.

How it works & trade-offs

  • High temperature / high Top-p → Greater creativity, exploration, and chance of novel solutions. Downsides: less reproducible results (hurts evaluation consistency and A/B reliability), higher risk of malformed actions or hallucinations, poorer performance on deterministic/high-stakes tasks.
  • Low temperature / low Top-p → More focused, deterministic, reproducible outputs. Better for evaluation, structured outputs, and safety-critical paths. Downsides: may miss creative or optimal solutions; can get stuck in repetitive loops.
  • Max tokens → Controls verbosity and cost. Too low truncates reasoning or final answers; too high wastes tokens and increases latency.

Practical guidance

Reasoning-specialized models often need less aggressive temperature tuning for internal CoT, but tool-use and structured agent loops still benefit from careful tuning. Use lower values for high-risk nodes or before interrupts; allow moderate exploration in research/planning nodes. Combine with structured output / tool-calling APIs for better determinism than raw sampling.

Evaluation of tuning

Measure not just final accuracy but also trajectory consistency, formatting compliance, and downstream effects (e.g., interrupt approval rate, recovery success).

Practice exercise

Run the same set of planning or research tasks at temperature 0.0, 0.7, and 1.2 (with fixed seed where possible). Compare success rate, diversity of plans/actions (qualitative or embedding distance), latency, and error types. Repeat with Top-p sweeps.

Recent Developments (2026)

The benchmark landscape has consolidated around five core suites that measure distinct capabilities and “should never be collapsed into a single ranking”: SWE-bench (software engineering), GAIA (real-world multi-step assistant tasks), Tau-bench (tool use under policy constraints), AgentBench, and WebArena (web navigation), alongside newer additions like Terminal-Bench and OSWorld [12][13]. A 2025 survey of LLM-agent evaluation methods formally organized the field into five perspectives — core capabilities, application-specific benchmarks, generalist-agent evaluation, benchmark-dimension analysis, and evaluation tooling — while flagging cost-efficiency, safety, and robustness assessment as the biggest remaining gaps [3]. Continuous, in-deployment evaluation frameworks (e.g., multi-signal monitoring across live agent traffic rather than one-off offline runs) have also emerged as a complement to static golden-dataset testing.


Domain 4: Deployment and Scaling (13%)

Operationalizing and scaling agentic systems.

Production Deployment (Docker, Kubernetes)

What it means

Packaging agentic systems (LLM inference, RAG pipelines, stateful graphs, guardrails, memory stores) into portable, reproducible containers and orchestrating them at scale with Kubernetes.

Why it matters

Agent workflows involve multiple components (inference servers, vector DBs, guardrail services, orchestration runtimes). Docker + Kubernetes provide consistency across environments, declarative scaling, self-healing, and the foundation for MLOps automation.

How it works (practical patterns)

  • Docker: Containerize each component (e.g., Triton/TensorRT-LLM inference, NeMo Guardrails service, LangGraph runtime, vector DB). Use multi-stage builds for smaller images and include health checks.
  • Kubernetes: Deploy via Deployments/StatefulSets, expose via Services/Ingress, manage configuration with ConfigMaps/Secrets, and use Helm charts for reproducibility. Horizontal Pod Autoscaling (HPA) based on CPU, memory, or custom metrics (queue depth, latency). Persistent volumes for checkpoints and long-term memory.

Integration with prior topics

  • Serve TensorRT-LLM models via Triton inside Kubernetes pods.
  • Run stateful LangGraph agents as long-running services or job workers with persistent volumes for checkpoints.
  • Deploy NeMo Guardrails as a sidecar or separate service in front of inference pods.
  • RAG components (embeddings + vector DB) deployed as microservices with their own scaling.

Common failure modes

Inconsistent environments between dev and prod; poor resource requests/limits causing OOM or throttling; stateful components (checkpointers, vector indexes) not properly persisted or backed up; secrets and sensitive prompts/configs leaking into images.

Evaluation dimensions

Deployment success rate, time-to-deploy, resource efficiency, scaling responsiveness, and recovery time from pod failures.

Practice exercise

Containerize your stateful ReAct + RAG + Guardrails agent. Create a simple Kubernetes manifest (or Helm chart) that deploys the inference backend (Triton/TensorRT-LLM), the agent runtime, and a vector DB. Add health checks and basic HPA. Test rolling updates and pod failure recovery.

MLOps Practices (CI/CD, Monitoring, Governance)

What it means

Applying software engineering and MLOps discipline to the full agent stack: code (graphs, tools, policies), models, prompts, RAG indexes, Colang guardrail policies, and evaluation datasets.

Why it matters

Agents are composite systems. Changes to any part (new model version, updated Colang policy, refreshed RAG corpus, new tool) can break safety, accuracy, or performance. CI/CD + governance provides repeatability, auditability, and controlled rollout.

Key practices

  • CI/CD pipelines: Automated testing on every change (unit tests for tools/nodes, integration tests with guardrails, full evaluation pipeline runs). Promotion gates based on safety, accuracy, and performance thresholds. Canary or blue-green deployments for agent updates.
  • Governance: Versioning of everything (graph definitions, Colang policies, prompt templates, RAG chunk metadata, model artifacts). Policy-as-code for guardrails. Approval workflows for high-impact changes (new tools, policy relaxations).
  • Model & data lifecycle: Automated retraining or fine-tuning triggers, RAG index rebuilds with quality checks, and rollback mechanisms.

Integration

CI/CD can trigger re-evaluation of your stateful agent whenever the RAG corpus, guardrail policies, or serving backend (TensorRT-LLM quantization) changes.

Common failure modes

No automated regression testing after policy or model changes; lack of versioning for non-code artifacts (prompts, Colang flows, indexes); governance gaps around who can modify high-risk tools or relax safety rails.

Practice exercise

Set up a basic CI pipeline (GitHub Actions, GitLab CI, or similar) for your agent project. On every push: run linting, unit tests, a subset of the evaluation pipeline (including guardrails and safety checks), and build/push Docker images. Add a manual approval gate before production deployment.

Recent Developments (2026)

Kubernetes has become the dominant substrate for agentic AI deployment: roughly two-thirds of organizations running generative AI workloads now host some or all of their inference on Kubernetes [29]. Event-driven autoscalers like KEDA are increasingly used to scale agent-runner pods based on queue depth (e.g., Pub/Sub backlog or Redis list length) rather than CPU, matching bursty agent workload patterns and scaling to zero when idle. Specialized model-serving layers such as KServe now integrate with Knative for scale-to-zero GPU workloads, and Custom Resource Definitions (CRDs) let teams treat agent fleets as first-class Kubernetes objects with built-in high availability [30]. Emerging guidance also favors SLO-signal-based autoscaling (driven by latency/error-budget signals) over simple threshold-based autoscaling for better cost and reliability outcomes in agent-serving clusters.


Domain 5: Cognition, Planning, and Memory (10%)

Core cognitive processes underlying intelligent agent behavior, including reasoning strategies, decision-making, and memory management.

Memory Mechanisms (Short-Term Buffer, Long-Term Storage)

What it means

  • Short-term / working memory: In-session context — recent messages, current task state, active observations. In stateful graphs this lives in the shared state object and checkpoints.
  • Long-term memory: Persistent knowledge across sessions or long horizons — facts, summaries, entity relationships, past experiences. Stored externally in vector databases, graph stores, or hybrid systems and retrieved on demand.

Why it matters

Pure context-window memory is limited and resets. Effective agents need both: short-term for coherence within a workflow and long-term for personalization, knowledge accumulation, and avoiding repeated mistakes. Graph memory and hierarchical approaches have moved from experimental to practical production use [8].

How it works

  • Short-term: Managed by the orchestration framework (LangGraph thread state + checkpointer). Techniques include trimming, summarization, and compaction to stay within limits.
  • Long-term: Write (extract & store key facts/entities after turns or tasks), update/forget (consolidation, staleness detection), retrieve (semantic search, graph traversal, or hybrid). Modern systems use hierarchical extraction (single-pass or multi-signal) and entity linking rather than pure vector similarity.

Common tools & advances

LangGraph short-term via checkpointers; separate long-term stores (vector DBs + graph DBs or integrated solutions like Mem0 with graph/entity capabilities). Recent memory systems combine semantic similarity, keyword (BM25) matching, and entity matching in a single multi-signal retrieval score rather than relying on vector similarity alone [8]. Benchmarks used to compare memory architectures include LoCoMo, LongMemEval, BEAM, MemGym, and STALE (which specifically tests for outdated/stale assumptions) [8].

Real-world examples

  • Personal research agent that remembers user preferences and prior findings across sessions (long-term) while maintaining current investigation state (short-term).
  • Customer support agent that recalls past tickets and resolutions (long-term) during a live thread (short-term + checkpointed state).
  • Multi-agent system sharing a long-term knowledge graph while each agent maintains its own short-term working state.

Common failure modes

Short-term overflow or loss of critical recent context; long-term retrieval of irrelevant or stale information (hallucinated or outdated actions); poor write/update policies leading to memory bloat or forgotten important facts; latency from retrieval in time-sensitive loops.

Mitigations

Explicit short-term vs. long-term separation; hierarchical compression + entity linking; staleness detection; retrieval reranking; integration with state (e.g., surface retrieved memories into graph state before planning nodes).

Evaluation

Recall/precision of retrieved memories, impact on task success over long horizons, latency overhead, and resistance to staleness (STALE-style tests).

Practice

Extend your stateful agent with a simple long-term memory layer (vector store or lightweight graph). After each task or reflection step, extract and store key entities/facts. On new sessions or long tasks, retrieve relevant memories and inject into state. Measure improvement on multi-session or long-horizon test cases.

Reasoning Frameworks (Chain-of-Thought, Task Decomposition)

What it means

  • Chain-of-Thought (CoT): Prompting the model to generate explicit intermediate reasoning steps before the final answer or action.
  • Task decomposition: Breaking a complex goal into smaller, manageable sub-tasks or sub-goals that can be tackled sequentially or in parallel.

Why it matters

These are foundational cognitive patterns that improve performance on multi-step problems. They are often combined with ReAct (reasoning interleaved with acting) and reflection.

How it works

  • CoT: Few-shot examples or zero-shot instructions (“Think step by step”). In agents, thoughts are logged in state/history.
  • Decomposition: Planner node generates a list of sub-tasks; executor or sub-agents handle them; results are synthesized. Can be static (pre-defined) or dynamic (generated and revised).

Integration with prior topics

Use inside ReAct thoughts or as a dedicated planning node in a stateful graph. Decomposition results can be stored in state and used for conditional branching or progress tracking.

Trade-offs

CoT improves reasoning but increases token usage. Decomposition helps with complex goals but risks error propagation if sub-tasks are poorly defined or dependencies missed.

Practice

Implement a planner node that performs task decomposition before a ReAct-style executor. Compare end-to-end success and efficiency against a flat ReAct baseline on multi-step tasks.

Planning Strategies for Sequential Decision-Making

What it means

Strategies for deciding the sequence of actions or sub-goals over time, especially under uncertainty or partial observability.

Key strategies

  • ReAct / Plan-Act-Reflect-Repeat (interleaved): Adaptive, good for uncertain environments; token-intensive.
  • Plan-then-Execute: Generate full plan upfront, then execute; more efficient when the environment is predictable but brittle to surprises.
  • Reflexion: Adds verbal self-reflection on failures + episodic memory for future trials.
  • Tree of Thoughts (ToT) / LATS: Search over multiple reasoning branches or use Monte-Carlo Tree Search with LLM as proposer and evaluator. Strong on hard tasks but compute-heavy.
  • Memory-augmented planning: Use retrieved past experiences or structured memory to inform plans.

Why it matters

Good planning turns reactive tool-calling into goal-directed behavior. Combined with memory and stateful orchestration, it enables long-horizon, resumable, interruptible agents.

Trade-offs table

Strategy Adaptivity Token/Cost Efficiency Reliability on Hard Tasks Best Paired With
ReAct / interleaved High Lower Good Reflection, short-term state
Plan-then-Execute Low–Medium Higher Brittle if surprises Strong decomposition
Reflexion High Medium–High Strong (self-correction) Episodic long-term memory
ToT / LATS High High Very strong Search + evaluation

Practice exercise

Build two planner variants in your graph: (1) simple ReAct-style interleaved planning, (2) explicit decomposition + plan-then-execute with reflection on failure. Evaluate both on the same long-horizon or multi-step test set. Measure success, steps taken, and recovery from injected errors. Add checkpointing and one interrupt gate for a high-risk sub-task.

Recent Developments (2026)

Dedicated “reasoning models” that spend extra inference-time compute generating intermediate thinking tokens before answering (OpenAI’s o-series, DeepSeek-R1, Claude’s extended-thinking modes, and Gemini’s Deep Think/Flash Thinking variants) have become a distinct category alongside standard LLMs, trading latency and cost for stronger performance on math, code, and multi-step planning tasks [33]. Because planning is a multi-step analogue of single-step CoT reasoning, current research is focused on when to invest this extra test-time compute — for example, learning to allocate planning effort adaptively rather than applying it uniformly to every step, which materially affects both cost and reliability in long-horizon agents [34]. On the memory side, hybrid retrieval that blends vector similarity, keyword matching, and entity linking is now considered standard practice rather than an advanced technique, with published benchmark gains of roughly +29.6 points on temporal reasoning and +23.1 points on multi-hop retrieval tasks compared to pure vector search [8].


Domain 6: Knowledge Integration and Data Handling (10%)

Integration of external knowledge and the management of diverse data types.

RAG Systems (Heavily Tested): Basic RAG, Hybrid Search, HyDE

What it means

Retrieval-Augmented Generation (RAG) retrieves relevant external information and injects it into the LLM’s context before generation. This grounds outputs in source material, reduces hallucinations, and enables knowledge that was not in the model’s training data [4].

Why it matters (especially for agents)

In ReAct-style loops, planning nodes, or long-term memory systems, RAG supplies fresh facts, documents, or prior experiences. It is one of the most heavily tested topics because almost every production agent uses some form of retrieval to stay accurate and current. Basic RAG is foundational; advanced variants (hybrid, HyDE, reranking) are now standard in production systems.

Basic RAG flow

  1. Indexing (offline): Documents → chunking → embedding → storage in vector database with metadata.
  2. Retrieval (online): User/agent query → embedding → similarity search → top-k chunks returned.
  3. Augmentation: Retrieved chunks + original query are stuffed into the prompt.
  4. Generation: LLM produces answer grounded in the retrieved context.

Hybrid search (dense + sparse + fusion)

Combines semantic (vector) search with keyword/lexical search (BM25 or similar). Results are merged using Reciprocal Rank Fusion (RRF) or weighted scoring. Pure vector search excels at meaning but misses exact terms, codes, or rare entities; keyword search catches those but lacks semantics. Hybrid + RRF reliably improves precision and recall, especially on technical, legal, or product corpora [9]. A common production pattern retrieves the top ~100 candidates via hybrid search, passes them to a re-ranker model (e.g., Cohere Rerank or a BGE-Reranker), and keeps only the top 5–10 for the LLM, because bi-encoder vector embeddings are inherently “lossy” — compressing a complex passage into a single point in embedding space [9].

HyDE (Hypothetical Document Embeddings)

Instead of embedding the raw (often short or vague) query, the LLM first generates a hypothetical “ideal” document that would answer the query. That hypothetical document is embedded and used for retrieval. The real retrieved documents are then used for final generation [5].

  • Strength: Dramatically improves retrieval for short, ambiguous, or conversational queries. Often combined with hybrid search.
  • Trade-off: Extra LLM call (cost/latency) but usually worth it for quality [5].

Other common enhancements

  • Query rewriting / multi-query expansion.
  • Reranking (cross-encoder or lightweight models like FlashRank) on initial retrieval results.
  • Metadata filtering (date, source, user permissions, etc.).
  • Agentic chunking or semantic chunking instead of fixed-size chunking.

Integration with prior topics

  • Inside ReAct: retrieval tool or context enrichment before “Thought” or planning nodes.
  • In stateful graphs: retrieved knowledge can be written into shared state or long-term memory.
  • With interrupts: high-stakes answers can require human review of retrieved sources.
  • Evaluation: retrieval metrics (, nDCG, MRR) + generation metrics (faithfulness, answer relevance, citation accuracy) + end-to-end task success.

Common failure modes

Retrieval of irrelevant or contradictory chunks (context poisoning); “lost in the middle” problem with long contexts; stale or low-quality source data; poor chunking that splits concepts across boundaries.

Evaluation dimensions

Retrieval quality, generation faithfulness (does the answer cite or stay consistent with sources?), end-to-end task improvement vs. no-RAG baseline, latency/cost overhead, and robustness to query variations.

Practice exercise

Add a RAG retrieval tool (or node) to your stateful research agent. Implement both basic dense search and hybrid search + RRF. Compare retrieval quality and final answer quality on the same test questions. Then add HyDE for short/vague queries and measure improvement.

Vector Databases: Embeddings, Distance Metrics, Indexing (HNSW, IVF)

What it means

Vector databases store high-dimensional embeddings of text (and increasingly multimodal data) and enable fast approximate nearest-neighbor (ANN) search. They are the storage and retrieval engine behind RAG and long-term agent memory.

Embeddings & distance metrics

Embeddings convert text into dense vectors that capture semantic meaning (popular model families include OpenAI, Voyage, Cohere, BGE, and Snowflake Arctic).

Metric Best For Characteristics Common Use in RAG
Cosine Similarity Most semantic search Direction-focused, ignores magnitude Default choice
Euclidean (L2) When magnitude matters Sensitive to vector length Less common
Dot Product When vectors are normalized Fast, equivalent to cosine when normalized Some optimized systems
Manhattan Sparse or specific domains Less sensitive to outliers Niche

Indexing strategies (HNSW vs. IVF)

Raw vector search is too slow at scale (linear scan). Indexes enable fast approximate search.

  • HNSW (Hierarchical Navigable Small World): A graph-based, in-memory index. Excellent recall and low query latency for interactive workloads. Higher memory usage and longer build time. Tunable parameters: M (connections per node — higher = better recall, more memory), efSearch/efConstruction (exploration depth — higher = better recall, slower queries/build) [11].
  • IVF (Inverted File Index): A storage-based, partition-based index that clusters vectors into partitions (like k-means). At query time only a subset of clusters is probed. Not constrained by in-memory pool size the way HNSW is, so it can scale to much larger corpora while using a smaller memory footprint, though sometimes at a small recall cost versus a well-tuned HNSW index [11].

Production trade-offs

HNSW is often preferred for agent-facing, low-latency RAG. IVF or hybrid approaches win for massive scale or when memory cost is dominant. Both are approximate similarity search methods that trade some accuracy for large speed gains versus exhaustive search, and many modern systems (pgvector, Weaviate, Pinecone, Oracle AI Vector Search, etc.) support both index types plus hybrid search inside the database [11].

Other important production features

  • Metadata filtering (pre-filter or post-filter).
  • Real-time or incremental updates.
  • Hybrid vector + keyword search (often with RRF).
  • Quantization / compression for cost and speed.
  • Graph + vector hybrids (emerging for entity-linked memory).

Common failure modes

Index parameters poorly tuned (low recall or high latency); ignoring metadata filtering needs; embedding model mismatch (different model for indexing vs. query); not handling updates/deletes properly (stale vectors).

Evaluation

/ nDCG at different K values, query latency (p50/p95), index build time & memory footprint, impact on end-to-end RAG/agent quality, and filtering correctness.

Practice

Set up a vector store (pgvector, Chroma, or Weaviate) for your agent’s knowledge base. Compare HNSW vs. IVF (or different M/ef settings) on the same corpus for recall vs. latency. Add metadata filtering relevant to your use case (e.g., “only sources after 2025”).

ETL Pipelines & Data Quality Checks

What it means

ETL (Extract–Transform–Load) for knowledge integration is the ingestion pipeline that turns raw documents, databases, or APIs into clean, chunked, embedded, and searchable knowledge in the vector store (and long-term memory).

Core stages

  • Extract: Pull from sources (PDFs, web, databases, internal wikis, etc.) with provenance tracking.
  • Transform: Clean (remove noise, boilerplate, PII if needed), chunk (fixed-size, semantic, or agentic/recursive), enrich with metadata (source, date, author, permissions, summary), deduplicate.
  • Embed & Load: Generate embeddings, store in vector DB with metadata, optionally build graph links or summaries for long-term memory.

Data quality checks (critical for agent reliability)

  • Completeness — Are all expected fields/metadata present?
  • Consistency — No contradictory chunks on the same topic without clear versioning.
  • Freshness / staleness detection — Flag or version old documents.
  • Deduplication — Exact and near-duplicate detection.
  • Chunk quality — Avoid chunks that are too small (lose context) or too large (dilute relevance). Semantic chunking often outperforms naive splitting.
  • Embedding quality — Spot-check retrieval on known good queries.
  • Provenance & auditability — Every chunk should trace back to source for citation and debugging.

Why this matters for agents

Garbage in → garbage retrieval → hallucinated or wrong actions in ReAct loops or planning. Poor ETL directly degrades long-term memory and the evaluation metrics built for the agent overall.

Integration with previous layers

Feeds the vector DB used by RAG; populates long-term memory stores; can be triggered or validated inside stateful workflows; evaluation pipelines should include retrieval quality tests on the ingested data.

Common failure modes

Brittle chunking that breaks semantic units; missing or wrong metadata (breaks filtering and provenance); no incremental update strategy (full re-indexes become unsustainable); ignoring data drift or source changes.

Practice exercise

Design and implement a small ETL pipeline for a document corpus (or your agent’s knowledge base). Include at least: semantic or recursive chunking, metadata enrichment, deduplication, and embedding. Add basic quality checks (e.g., chunk length distribution, sample retrieval test). Load into your vector store and measure retrieval metrics before/after quality improvements.

Overall Tradeoffs & Best Practices

Aspect Basic RAG Hybrid + RRF + HyDE Advanced (with rerank, agentic chunking) Notes
Implementation complexity Low Medium High Start simple, add when metrics plateau
Retrieval quality Good on clear queries Excellent on most queries Highest Hybrid is the standard default for production
Latency / Cost Lowest Medium Highest HyDE and reranking add LLM calls
Maintenance Simple Moderate Higher (more moving parts) ETL quality is the hidden multiplier

Key principles

  • Treat RAG/ETL as a first-class system with its own evaluation, monitoring, and versioning.
  • Combine techniques: hybrid search + HyDE + reranking + good metadata is a strong production baseline.
  • Data quality in ETL has an outsized impact on downstream agent reliability, planning accuracy, and memory effectiveness.
  • Always measure end-to-end (retrieval + generation + task success), not just isolated components.

Consolidated practice recommendation

Extend your stateful ReAct research/decision agent with a RAG retrieval node/tool using hybrid search + HyDE, proper vector store indexing (HNSW or IVF with tuned parameters), a lightweight ETL pipeline that ingests new documents with quality checks and metadata, and an evaluation layer that reports retrieval metrics + end-to-end task success + latency/cost.

Recent Developments (2026)

RAG practice guides now treat the “top-100-candidates → rerank → top-5-10” hybrid pipeline as the default architecture rather than an advanced option, given the well-documented lossiness of single-vector bi-encoder embeddings [9]. On the storage side, native database vector search (e.g., Oracle AI Vector Search, pgvector) increasingly ships both HNSW and IVF index types plus hybrid search under standard database governance, reducing the need for separate specialized vector-database infrastructure in some deployments [11]. On the memory-as-knowledge-integration side, 2026 industry benchmarking treats agent memory as a first-class architectural component with its own research literature and measurable performance gaps between approaches, driven in large part by hybrid retrieval (vector + keyword + entity-graph) replacing pure similarity search [8].


Domain 7: NVIDIA Platform Implementation (7%)

Leveraging NVIDIA’s AI hardware and software platforms for agentic AI systems.

Production Deployment, Governance & Optimized Inference focuses on making agentic systems safe, compliant, high-performance, and scalable in real environments. This layer builds directly on the reasoning, memory, RAG, stateful orchestration, and evaluation topics covered earlier.

These NVIDIA technologies address critical production gaps: NeMo Guardrails adds programmable safety, compliance, and factuality controls; NVIDIA NIM, TensorRT-LLM, and Triton Inference Server provide optimized, containerized, and scalable inference for the LLMs powering agents, RAG, and planning components.

NeMo Guardrails

What it means

NeMo Guardrails is an open-source toolkit for adding controllable, programmable guardrails to LLM-powered applications and agents. It uses Colang, a domain-specific language, to define policies that govern input processing, dialogue flow, output generation, and safety checks.

Why it matters

Agents that act on the world (tool use, planning, state changes) need strong input/output controls. Without guardrails, systems risk leaking PII, violating regulations (GDPR, CCPA, EU AI Act), generating ungrounded or harmful content, or executing unsafe actions. NeMo Guardrails turns safety and compliance into explicit, versionable, testable policies rather than ad-hoc prompting.

Core capabilities

  • Input/Output Filtering: Rails that inspect and block/rewrite user inputs or model outputs based on rules (toxicity, jailbreaks, off-topic, prohibited topics).
  • Compliance Enforcement: Pre-built or custom rails for GDPR/CCPA data handling, EU AI Act high-risk requirements (transparency, human oversight), and audit logging of decisions.
  • PII Detection & Redaction: Automatic detection and masking/redaction of personally identifiable information in both inputs and outputs.
  • Fact-checking: Rails that cross-check generated claims against retrieved context (RAG sources) or external knowledge bases before allowing output.
  • Colang Policies: The policy-as-code layer. You define flows, triggers, and actions in Colang (e.g., “if user asks for medical advice → activate medical disclaimer rail and require human review”).

How it works

Guardrails sit as a middleware layer around your LLM or agent:

  1. User message → Input rails (filtering, PII detection, topical checks).
  2. If safe → Agent/ReAct loop or RAG generation (with retrieved context).
  3. Generated response → Output rails (fact-checking, compliance, PII redaction, formatting).
  4. Only approved responses are returned or actions executed.

Colang lets you express complex dialogue and safety logic declaratively (e.g., “when user expresses intent X, do Y and check Z”).

Integration with prior topics

Wraps ReAct loops (filter inputs before Thought/Action, fact-check observations or final answers); works with stateful graphs (guardrails on nodes or as interrupt triggers for high-risk paths); enhances RAG (fact-checking rails validate against retrieved chunks); feeds evaluation pipelines (safety violation rate, PII leakage incidents, compliance audit coverage become key metrics).

Common failure modes

Overly strict rails that block legitimate queries (poor user experience); false negatives on sophisticated jailbreaks or subtle PII; fact-checking rails that are too slow or miss nuanced claims; Colang policies that become hard to maintain without proper versioning and testing.

Evaluation dimensions

Safety violation rate (blocked vs. allowed harmful content), PII redaction accuracy/recall, factuality/groundedness improvement, compliance audit completeness, latency overhead of rails, and false-positive rate on benign inputs.

Practice exercise

Add NeMo Guardrails to your stateful ReAct research/decision agent. Write Colang policies for: (1) input PII detection + redaction, (2) fact-checking against RAG sources before final answer, (3) blocking high-risk actions without human approval. Measure safety metrics and end-to-end latency before/after.

NVIDIA NIM (Inference Microservices)

What it means

NVIDIA NIM provides pre-built, optimized, containerized microservices for running AI models (LLMs, embeddings, rerankers, vision, guard models, etc.) in production. Each NIM is a self-contained Docker container with the model, runtime, and APIs.

Why it matters

Deploying and scaling optimized inference for agents is complex. NIM abstracts this into consistent, Kubernetes-friendly containers with built-in health checks, metrics, and scaling support, dramatically reducing time-to-production for high-performance agent backends.

Key features

  • Optimized inference microservices: Ready-to-run containers for popular models (Llama, Mistral, Gemma, embedding models, etc.) plus support for custom models.
  • Container deployment: Standard Docker/Kubernetes/Helm deployment. Consistent across cloud, on-prem, or edge.
  • Scaling patterns: Horizontal scaling via Kubernetes, autoscaling based on load, multi-replica deployments, and integration with load balancers.

Integration

Use NIM containers to serve the core LLM used inside your ReAct agents, RAG retrievers, or planning modules. Guardrails (NeMo) can sit in front of or alongside NIM endpoints.

Practice

Deploy a NIM for your agent’s LLM (or a smaller guard/fact-check model). Expose it via Kubernetes and test basic scaling (increase replicas under load).

TensorRT-LLM

What it means

TensorRT-LLM is NVIDIA’s library and toolkit for high-performance LLM inference on GPUs. It applies multiple optimizations including quantization, efficient attention (PagedAttention/KV cache management), in-flight batching, and speculative decoding.

Core optimizations

  • Quantization: Reduces precision of weights and activations.
    • FP16: Good accuracy, moderate speedup.
    • INT8: Strong speedup with small accuracy loss (common sweet spot).
    • INT4 (AWQ, GPTQ, etc.): Highest speed/memory savings, larger accuracy trade-off on some tasks/models.
  • KV cache optimization: PagedAttention and efficient memory management allow larger batch sizes and longer contexts without out-of-memory errors.
  • Additional techniques: In-flight batching, speculative decoding, and custom kernels.

Speed vs. accuracy trade-offs

Quantization delivers major throughput and latency gains (often 2–5× or more depending on model size and hardware) but can degrade performance on complex reasoning, long-context, or nuanced tasks. Always evaluate on your specific agent benchmarks (task success, faithfulness, safety) after quantization.

Why it matters for agents

Agents make many sequential LLM calls (ReAct loops, planning, reflection, RAG augmentation). Faster inference directly improves user experience, reduces cost, and enables higher concurrency.

Evaluation

Measure tokens/second, time-to-first-token (TTFT), end-to-end workflow latency, accuracy/regression on your golden eval set, and memory footprint before vs. after optimization.

Practice exercise

Take the same model used in your agent. Run it with TensorRT-LLM at FP16, INT8, and INT4. Benchmark throughput and latency. Run your full evaluation pipeline (including guardrails and RAG) and compare task success rate + safety metrics across quantization levels.

Triton Inference Server

What it means

Triton is NVIDIA’s open-source, high-performance inference serving platform. It supports multiple backends (including TensorRT-LLM) and provides production-grade features for deploying one or many models at scale.

Key capabilities

  • Model serving at scale: Single server can host multiple models (different sizes, tasks, or versions) on the same GPU(s).
  • Dynamic batching: Automatically groups incoming requests into larger batches for higher GPU utilization and throughput without increasing per-request latency significantly.
  • Multi-model serving & ensembles: Serve embedding + LLM + reranker + guard models together; create ensembles that chain models.
  • Load balancing & management: Built-in load balancing across replicas, model versioning, metrics export (Prometheus), health checks, and rolling updates.

Why it matters

Production agent systems rarely run a single model. Triton + TensorRT-LLM backends + NIM-style containers give you a unified, efficient serving layer for the entire inference stack (LLM for reasoning, embedding models for RAG/memory, guard models).

Integration with the stack

Typical production pattern: NeMo Guardrails (policy layer) → Triton (serving) with TensorRT-LLM backend (optimized LLM) + other NIMs (embeddings, rerankers) → Stateful agent orchestration (LangGraph) with RAG and memory.

Practice exercise

Deploy your agent’s LLM via Triton with a TensorRT-LLM backend. Enable dynamic batching. Compare throughput and latency under concurrent load vs. a simple single-request server. Add a second model (e.g., embedding or small guard model) to the same Triton instance.

Integrated Tradeoffs & Best Practices

Concern NeMo Guardrails TensorRT-LLM Quantization Triton + NIM Serving Combined Recommendation
Safety & Compliance Excellent (policies + rails) Indirect (via evaluation) Good (versioning, isolation) Guardrails + evaluation
Inference Speed / Cost Adds overhead Major gains (INT8/INT4) High utilization via batching TensorRT-LLM + Triton
Scalability Horizontal (replicas) Per-GPU efficiency Dynamic batching + multi-model Triton + Kubernetes
Accuracy / Faithfulness Improves via fact-checking Can regress — must re-eval Neutral Re-evaluate after quantization
Operational Complexity Policy maintenance Quantization tuning + testing Serving config & monitoring Start simple, iterate

Key principles for production agentic systems

  • Put guardrails early (input) and late (output/fact-check) in the flow.
  • Quantize aggressively but always re-run your full evaluation pipeline (task success, safety, faithfulness) — never assume accuracy is preserved.
  • Use Triton + TensorRT-LLM for the heavy lifting; NIM for quick, consistent deployments.
  • Treat the entire inference + guardrail stack as observable and versioned.
  • Measure end-to-end: user-perceived latency, safety incidents prevented, compliance coverage, and cost per successful task.

Consolidated practice recommendation

Take your existing stateful ReAct + RAG + memory agent and productionize it: add NeMo Guardrails with Colang policies for PII, input filtering, and fact-checking; serve the core LLM via TensorRT-LLM (try INT8) behind Triton with dynamic batching; optionally containerize as a NIM-style microservice; run your full evaluation pipeline under load and compare safety, accuracy, latency, and throughput vs. the unoptimized baseline.

Recent Developments (2026)

NVIDIA’s Nemotron 3 model family (Nano, Super, and Ultra sizes) launched as an efficiency-focused open-model line purpose-built for agentic AI, using a hybrid mixture-of-experts architecture that NVIDIA reports delivers roughly 4x higher throughput than the prior Nemotron 2 Nano generation, aimed specifically at high-token-per-second multi-agent workloads [21]. In April 2026, NVIDIA released Nemotron 3 Nano Omni, a multimodal model unifying vision, audio, and language understanding in a single efficient model for agent reasoning across document, video, and audio inputs, reporting leaderboard-topping results across six benchmarks [22]. All Nemotron 3 models are distributed as optimized NVIDIA NIM microservices, reinforcing the NIM containerized-deployment pattern as the standard path from model release to production serving on NVIDIA infrastructure.


Domain 8: Run, Monitor, and Maintain (5%)

Ongoing operation, monitoring, and maintenance of agentic systems post-deployment.

Monitoring Dashboards, Logging, Tracing

What it means

Observability across the entire agent stack — from high-level task outcomes down to individual ReAct steps, guardrail decisions, and infrastructure metrics.

Why it matters

Production agents make many sequential decisions. Without deep visibility you cannot debug failures, measure true performance, detect drift, or prove compliance.

Layers of observability (best practice)

  • Infrastructure: Prometheus + Grafana for CPU/GPU/memory, request rates, error rates, autoscaling events (Kubernetes + Triton/NIM metrics).
  • Application/Agent: Distributed tracing (OpenTelemetry) across ReAct Thought → Action → Observation cycles, state transitions in LangGraph, guardrail evaluations, and RAG retrievals. Tools like LangSmith, Phoenix, or custom OpenTelemetry instrumentation.
  • Business/Safety: Dashboards for task success rate, safety violation rate, PII redaction events, human approval rates, latency distributions, and cost per successful task.
  • Logging: Structured logs with correlation IDs that tie a user request through the full stack (input guardrail → planning → tool calls → output fact-check → final response).

Integration

Full traces should capture when NeMo Guardrails blocked or rewrote content, when checkpoints were created, and the exact retrieved documents used in RAG.

Common failure modes

Siloed observability (infra metrics separate from agent traces); missing context in logs/traces (no correlation IDs); no alerting on safety or compliance signals.

Practice exercise

Instrument your deployed agent with OpenTelemetry tracing. Export traces that span a full ReAct cycle + guardrail checks + RAG retrieval. Build a simple Grafana dashboard showing task success rate, average steps per task, and safety violation rate. Add alerting on high safety violation rates or latency spikes.

Continuous Benchmarking

What it means

Automated, recurring execution of evaluation pipelines against golden datasets (and production shadow traffic) whenever any component changes.

Why it matters

It catches regressions early (accuracy drop after quantization, safety gap after policy change, latency regression after scaling change) before they reach users.

How it works

  • Scheduled or event-driven runs of the full evaluation harness (task success, safety, latency, cost, retrieval quality).
  • Comparison against baselines or previous versions with statistical significance.
  • Automatic promotion gates or rollback triggers in CI/CD.
  • Shadow testing: run new versions on a percentage of production traffic without affecting users, then compare metrics.

Integration

Continuous benchmarking should include your stateful orchestration recovery tests, guardrail effectiveness, RAG faithfulness, and end-to-end performance under load.

Practice exercise

Extend your CI pipeline to run the full evaluation suite (including safety and guardrail metrics) on every merge to main. Store results historically and add a step that fails the pipeline if key metrics regress beyond a threshold.

Recent Developments (2026)

The industry has converged on OpenTelemetry as the standard telemetry layer for agentic systems: as of the v1.41 GenAI semantic conventions, the spec defines agent, workflow, tool, and model spans plus required latency and token-usage metrics (though many gen_ai.* attributes remain marked “Development stability” and can still change) [31][32]. Major observability vendors (Datadog, Honeycomb, New Relic) and agent frameworks (LangChain, CrewAI, AutoGen/AG2) now emit or ingest OTel-compliant spans natively, so traces from different frameworks can flow into a single backend without custom parsing. Each tool call, LLM invocation, and retrieval step becomes its own child span, producing a full reasoning-chain trace. The main gap identified in 2026 observability reviews is integration rather than depth: individual layers (infrastructure, application, business/safety) are well instrumented, but unifying them into a single, coherent operational picture remains an open challenge [32].


Domain 9: Safety, Ethics, and Compliance (5%)

Principles and practices that ensure agentic AI systems operate responsibly, uphold ethical standards, and comply with legal and regulatory frameworks.

Safety & Ethics: Security, Compliance, Bias Mitigation

What it means

Proactive controls and processes to prevent harm, ensure regulatory compliance, and reduce unfair or biased behavior in agentic systems.

Core areas

  • Security: Prompt injection defense, tool abuse prevention (guardrails limiting what tools/actions an agent can take), secure secret handling, sandboxing of code execution tools.
  • Compliance: GDPR (data minimization, right to be forgotten in memory stores, consent for processing), CCPA, EU AI Act (transparency, human oversight for high-risk systems, risk management). NeMo Guardrails + audit logging + interrupt gates provide concrete mechanisms.
  • Bias mitigation: Monitoring for disparate impact in planning, retrieval (RAG sources), and generation. Techniques include balanced datasets for RAG, bias-aware prompting or fine-tuning, and regular auditing of outcomes across demographic slices (where applicable and ethical).

Integration with prior topics

NeMo Guardrails + Colang policies are the primary enforcement layer. Stateful orchestration with interrupts enables required human oversight. Evaluation pipelines must track safety and bias metrics. RAG and memory systems must support data subject rights.

Common failure modes

Treating safety as a one-time prompt instead of ongoing policy + monitoring; incomplete audit trails that fail compliance reviews; bias that only appears in long-horizon or multi-agent scenarios.

Evaluation dimensions

Safety violation/block rate, compliance audit pass rate, bias metrics (where measurable), time-to-detect and remediate issues, and coverage of high-risk paths by human oversight.

Practice exercise

Expand your Colang policies to cover prompt injection patterns and tool abuse prevention. Add bias monitoring to your evaluation dashboard (e.g., outcome distribution across synthetic demographic variations in test cases). Document how your system would respond to a GDPR data deletion request.

Recent Developments (2026)

Governance frameworks have struggled to keep pace with agentic AI specifically: as of 2026, the NIST AI Risk Management Framework, ISO/IEC 42001, and the EU AI Act (Regulation 2024/1689) contain no explicit references to “agent,” “agentic,” or autonomous AI systems, and the EU AI Act does not address multi-agent risk, instead targeting the foundation-model layer rather than systems-level agentic behavior [25][24]. Agentic systems introduce risks that these frameworks were not written for: acting on external systems, accessing tools dynamically, executing multi-step plans where errors can cascade, maintaining persistent memory vulnerable to manipulation, and delegating across agent boundaries in ways that fragment accountability [25]. Enforcement of the EU AI Act’s high-risk provisions began in August 2026, with penalties reaching €35 million or 7% of global turnover; autonomous agents taking consequential actions (financial transactions, medical decisions, legal submissions) are likely to be classified as high-risk, triggering requirements for human oversight, auditability, and conformity assessment — and in a chain of cooperating agents, the compliance boundary extends to every agent performing a high-risk function [24][23]. Singapore’s IMDA published a Model AI Governance Framework for Agentic AI in January 2026 — described as the first comprehensive governance framework specifically for autonomous agents — requiring each agent to carry a verifiable digital identity and maintain an audit trail [23].


Domain 10: Human-AI Interaction and Oversight (5%)

The design and implementation of systems that facilitate effective human oversight and interaction with agents.

Human Interaction: UI Design, Feedback Loops, Transparency, Human Oversight

What it means

Designing interfaces and processes so humans can effectively collaborate with, oversee, correct, and trust agentic systems.

Key elements

  • UI/UX design: Clear presentation of agent reasoning traces (ReAct thoughts/actions), retrieved sources (RAG citations), confidence levels, and proposed next actions. Progressive disclosure — show details on demand.
  • Feedback loops: Easy mechanisms for humans to correct outputs, approve/reject actions, or provide natural language feedback that updates memory or triggers re-planning.
  • Transparency: Always show why the agent made a decision (reasoning trace + sources + guardrail decisions).
  • Human oversight (HITL): Interrupt patterns and approval gates (from stateful orchestration) for high-risk actions. Escalation paths when the agent is uncertain or the task is sensitive.

Integration

HITL via LangGraph interrupts + NeMo Guardrails creates enforceable human oversight points. Feedback can be written back into long-term memory or used to improve RAG indexes/evaluation sets. Transparency builds trust and aids debugging.

Common failure modes

Overwhelming users with raw traces instead of summarized, actionable views; feedback that is collected but never used (no closed loop); opaque “black box” behavior that erodes trust even when the agent is correct.

Evaluation dimensions

Human approval rate and latency, correction acceptance rate, user trust/satisfaction scores, time saved vs. fully manual process, and escalation rate to humans.

Practice exercise

Build a simple Streamlit or Gradio interface for your agent that shows: reasoning trace, retrieved sources, guardrail decisions, and proposed action. Add buttons for “Approve & Execute”, “Reject & Explain Why”, and “Edit & Retry”. Log all feedback and demonstrate writing it back into long-term memory.

Recent Developments (2026)

The EU AI Act’s Article 14, enforceable from August 2, 2026, mandates human oversight capabilities for high-risk AI systems, and both the Act and NIST’s AI RMF now require oversight that is demonstrably “trained, measurable, and provable” rather than a nominal checkbox [27]. Practitioner and research literature increasingly distinguishes Human-in-the-Loop (HITL) — where the human is inside the control loop and the agent blocks pending explicit approval — from Human-on-the-Loop (HOTL) — where the human monitors execution and intervenes only when anomalies appear [26][27]. A widely adopted production pattern tiers every agent action by its reversibility and “blast radius,” assigning a different oversight mode to each tier, with the agent routing to human review whenever its own confidence in an action falls below a defined threshold [26]. There is also active debate about the limits of pure HITL at scale — some 2026 commentary argues that human-in-the-loop review “hits a wall” as agent volume grows, and proposes AI-assisted oversight of other AI agents (with humans overseeing the oversight layer) as a complementary pattern for high-throughput deployments [28].


Final Synthesis & Review Checklist

Synthesis checklist (answer these confidently)

  • What is an AI agent and how does it differ from a chatbot or simple workflow?
  • Core components: tools, memory (short/long-term), planning/reasoning (ReAct, decomposition, reflection), state, guardrails.
  • How do you design reliable stateful workflows with checkpointing, recovery, and interrupts?
  • How do you evaluate agents across task success, safety, latency, cost, controllability, and compliance?
  • When should you use agents vs. simpler patterns?
  • How do you productionize (Docker/K8s, MLOps, observability, optimized inference with TensorRT-LLM/Triton/NIM)?
  • How do you enforce safety, compliance, and human oversight at scale?

Recommended final activities

  • Run your full end-to-end system (stateful ReAct + RAG + memory + NeMo Guardrails + optimized serving) through a production-like scenario with monitoring and human oversight.
  • Review weak areas from all domains and re-practice the corresponding exercises.
  • Document your production architecture (one-pager) showing how all layers connect.

Overall practice recommendation

Take your stateful ReAct research/decision agent, add long-term memory, implement at least one advanced planning variant (e.g., Reflexion-style reflection or decomposition), tune key parameters, and run a full evaluation pipeline with success, latency, cost, and safety metrics. Perform a small A/B comparison of two configurations. Then productionize it end to end: containerize with Docker/Kubernetes, add NeMo Guardrails policies, serve via TensorRT-LLM/Triton/NIM, instrument OpenTelemetry-based observability, and build a human-in-the-loop review interface. A single running project exercising all ten domains — architecture, development, evaluation, deployment, cognition/memory, knowledge integration, NVIDIA platform tooling, monitoring, safety/compliance, and human oversight — is the strongest possible portfolio piece and study aid.


References

  1. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. https://arxiv.org/abs/2210.03629

  2. Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366. https://arxiv.org/abs/2303.11366

  3. Yehudai, A., Eden, L., Li, A., Uziel, G., Zhao, Y., Bar-Haim, R., Cohan, A., & Shmueli-Scheuer, M. (2025). Survey on Evaluation of LLM-based Agents. arXiv:2503.16416. https://arxiv.org/abs/2503.16416

  4. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401. https://arxiv.org/abs/2005.11401

  5. Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). arXiv:2212.10496. https://arxiv.org/abs/2212.10496

  6. AppScale Blog (2026). Durable Execution for LLM Agents 2026: Temporal + LangGraph. https://appscale.blog/en/blog/durable-execution-llm-agents-temporal-langgraph-checkpointing-2026; LangChain Blog. Building LangGraph: Designing an Agent Runtime from First Principles. https://blog.langchain.com/building-langgraph/

  7. Benchmarkingagents.com (2026). AI Agent Benchmarks 2026 — SWE-bench, WebArena, AgentBench, Terminal-Bench, OSWorld, Tau-Bench. https://benchmarkingagents.com/agent-benchmarks/; MarkTechPost (2026). Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models. https://www.marktechpost.com/2026/04/26/top-7-benchmarks-that-actually-matter-for-agentic-reasoning-in-large-language-models/

  8. Mem0 (2026). AI Agent Memory 2026: Progress Benchmark Report Evaluations. https://mem0.ai/blog/state-of-ai-agent-memory-2026; Mem0 (2026). Graph-Based Memory Solutions for AI Context: Top 5 Compared. https://mem0.ai/blog/graph-memory-solutions-ai-agents

  9. Srinivasan, A. (2026). All You Need to Know About RAG (in 2026). AI with Aish, Substack. https://aishwaryasrinivasan.substack.com/p/all-you-need-to-know-about-rag-in

  10. Zilliz (2026). Better RAG with HyDE — Hypothetical Document Embeddings. https://zilliz.com/learn/improve-rag-and-information-retrieval-with-hyde-hypothetical-document-embeddings

  11. Oracle Database Blog. Using HNSW Vector Indexes in AI Vector Search. https://blogs.oracle.com/database/using-hnsw-vector-indexes-in-ai-vector-search; Using IVF Vector Indexes in AI Vector Search. https://blogs.oracle.com/database/using-ivf-vector-indexes; Indexing Guidelines with AI Vector Search. https://blogs.oracle.com/database/indexing-guidelines-with-ai-vector-search

  12. Benchmarkingagents.com (2026). Agent Benchmark Leaderboard 2026: AgentBench, SWE-bench, GAIA. https://benchmarkingagents.com/benchmarks-list/

  13. MarkTechPost (2026). Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models. https://www.marktechpost.com/2026/04/26/top-7-benchmarks-that-actually-matter-for-agentic-reasoning-in-large-language-models/

  14. MachineLearningMastery.com (2026). 7 Agentic AI Trends to Watch in 2026. https://machinelearningmastery.com/7-agentic-ai-trends-to-watch-in-2026/

  15. A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology. arXiv:2605.13850. https://arxiv.org/pdf/2605.13850

  16. Agentic Design Patterns: A System-Theoretic Framework. arXiv:2601.19752. https://arxiv.org/pdf/2601.19752

  17. Augment Code (2026). What Are Agentic Design Patterns? 2026 Pattern Catalog. https://www.augmentcode.com/guides/agentic-design-patterns

  18. Google Cloud Architecture Center (2026). Choose a Design Pattern for Your Agentic AI System. https://docs.cloud.google.com/architecture/choose-design-pattern-agentic-ai-system

  19. OpenAI (2026). The Next Evolution of the Agents SDK. https://openai.com/index/the-next-evolution-of-the-agents-sdk/; OpenAI (2026). Introducing AgentKit. https://openai.com/index/introducing-agentkit/

  20. Google. Agent Development Kit (ADK). https://adk.dev/; LangChain (2026). The Best AI Agent Frameworks in 2026. https://www.langchain.com/resources/ai-agent-frameworks

  21. NVIDIA Newsroom (2026). NVIDIA Debuts Nemotron 3 Family of Open Models. https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models

  22. NVIDIA Developer Blog (2026). NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model. https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/; NVIDIA Blog (2026). NVIDIA Launches Nemotron 3 Nano Omni Model, Unifying Vision, Audio and Language for AI Agents. https://blogs.nvidia.com/blog/nemotron-3-nano-omni-multimodal-ai-agents/

  23. Artificial Intelligence News (2026). Agentic AI’s Governance Challenges Under the EU AI Act in 2026. https://www.artificialintelligence-news.com/news/agentic-ais-governance-challenges-under-the-eu-ai-act-in-2026/

  24. UC Berkeley Law, Berkeley Center for Law & Technology (2026). EU AI Act Risk Tiers, GDPR Data Minimization, and U.S. State Law Converge on Agentic AI Compliance in 2026. https://www.law.berkeley.edu/research/bclt/bclt-legal-analysis/eu-ai-act/

  25. Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents. arXiv:2505.02077. https://arxiv.org/pdf/2505.02077

  26. Galileo AI (2026). How to Build Human-in-the-Loop Oversight for AI Agents. https://galileo.ai/blog/human-in-the-loop-agent-oversight

  27. Digital Applied (2026). Human-in-the-Loop Escalation Design for AI Agents 2026. https://www.digitalapplied.com/blog/human-in-the-loop-escalation-design-ai-agents-2026

  28. SiliconANGLE (2026). Human-in-the-Loop Has Hit the Wall. It’s Time for AI to Oversee AI. https://siliconangle.com/2026/01/18/human-loop-hit-wall-time-ai-oversee-ai/

  29. CNCF Blog (2026). The Great Migration: Why Every AI Platform Is Converging on Kubernetes. https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/

  30. Epsilla Blog (2026). Scaling Autonomous AI Agents: Kubernetes, Runtimes, and Architecture Insights. https://www.epsilla.com/blogs/scaling-autonomous-ai-agents-kubernetes-runtimes-architecture

  31. OpenTelemetry Blog (2025). AI Agent Observability — Evolving Standards and Best Practices. https://opentelemetry.io/blog/2025/ai-agent-observability/

  32. Uptrace (2026). OpenTelemetry for AI Systems: LLM and Agent Observability. https://uptrace.dev/blog/opentelemetry-ai-systems; Digital Applied (2026). AI Agent Observability 2026: Tracing & Monitoring Stack. https://www.digitalapplied.com/blog/ai-agent-observability-2026-tracing-monitoring-stack-guide

  33. Zylos Research (2026). AI Reasoning Models 2026: From OpenAI o3 to DeepSeek-R1 and the Test-Time Compute Revolution. https://zylos.ai/research/2026-01-24-ai-reasoning-models

  34. Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents. arXiv:2509.03581. https://arxiv.org/pdf/2509.03581


Source note: This document reorganizes and cleans up a prior study-guide conversation (originally generated with Grok/SuperGrok) into the ten-domain exam structure above. All original substantive content — explanations, benchmarks, tables, practice exercises, and checklists — has been preserved; only conversational scaffolding (thinking-time markers, source counts, follow-up prompts, and repeated instructions) was removed. Footnote-style references from the original transcript have been resolved to full citations above, and a “Recent Developments (2026)” subsection was added to every domain with newly verified sources.