Reorganized by exam domain, with original source content fully preserved, citations verified, and 2026 supplementary research added to every domain.
Foundational structuring and design of agentic AI systems, focusing on how agents interact, reason, and communicate within their environments.
What it means
ReAct (Reasoning + Acting) is a prompting and execution paradigm that interleaves verbal reasoning traces (“Thoughts”) with concrete actions and their results (“Observations”). It was introduced to let language models dynamically create, maintain, and revise plans while grounding those plans in external information or environment feedback [1].
Why it matters
Pure chain-of-thought (CoT) reasoning happens entirely inside the model and can hallucinate or drift. Pure tool-calling or “Act-only” approaches lack high-level planning and exception handling. ReAct creates synergy: reasoning guides which actions to take and when to stop or replan; observations supply fresh facts that update reasoning and reduce hallucinations. The resulting trajectories are human-readable, debuggable, and controllable — critical for trust and iteration.
How it works (core loop)
The agent maintains a growing history and repeatedly executes a Thought → Action → Observation cycle until it emits a final answer.
Typical few-shot prompt format (simplified):
Thought: ...
Action: tool_name with arg is value
Observation: ...
Thought: ...
Action: Final Answer[result]
In zero-shot or structured-output setups, the model is instructed to emit the same format and a parser extracts the action.
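The loop and parser described above can be sketched without any framework. This is a minimal illustration, not the paper's implementation: `scripted_model` is a hypothetical stand-in for an LLM call, `TOOLS` is a made-up registry, and the `name[argument]` action syntax is an assumption modeled on the `Final Answer[result]` line in the format above.

```python
import re

# Hypothetical tool registry; a real agent would call search APIs, calculators, etc.
TOOLS = {"lookup": lambda arg: {"capital of France": "Paris"}.get(arg, "no result")}

# Parses lines of the assumed form "Action: name[argument]".
ACTION_RE = re.compile(r"Action:\s*(\w[\w ]*?)\[(.*)\]\s*$", re.MULTILINE)


def scripted_model(history: str) -> str:
    """Stand-in for an LLM call: answers once an Observation is in the history."""
    if "Observation: Paris" in history:
        return "Thought: I have the answer.\nAction: Final Answer[Paris]"
    return "Thought: I should look this up.\nAction: lookup[capital of France]"


def react_loop(question: str, model=scripted_model, max_iterations: int = 5) -> str:
    history = f"Question: {question}\n"
    for _ in range(max_iterations):  # hard cap guards against endless loops
        step = model(history)
        history += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            # Feed the parse failure back so the model can correct itself.
            history += "Observation: could not parse an action\n"
            continue
        name, arg = match.group(1).strip(), match.group(2)
        if name == "Final Answer":
            return arg
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        history += f"Observation: {observation}\n"
    return "stopped: max_iterations reached"


answer = react_loop("What is the capital of France?")
```

Each observation is appended to the history before the next model call, which is what grounds the following Thought in external feedback.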
Key scholarly reference
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629 [1].
Benchmarks & evidence (from the paper)
Extensions
Reflexion (Shinn et al., 2023) adds verbal self-reflection on failures stored in episodic memory, yielding large gains on coding and decision tasks (e.g., 91% pass@1 on HumanEval in some settings) [2].
Real-world examples
Common failure modes & mitigations
Cap the loop with max_iterations; add repetition detection in state or use reflection.
Evaluation dimensions
Practice exercises & mini-projects
What it means & why it matters
ReAct (or any reasoning loop) is usually only one node inside a larger workflow. Real production agents run for many steps, interact with external systems, may require human approval, and must survive crashes or restarts. Stateful orchestration models the entire workflow as an explicit, persistent state machine. Checkpointing saves progress so the system can recover, resume, or allow human intervention without losing work. Graph-based frameworks with first-class checkpointing and state management (especially LangGraph) are repeatedly cited as the standard for controllable, auditable, long-running agentic systems.
How it works (LangGraph as the leading exemplar)
A typed state schema (TypedDict / Pydantic model) holds messages, intermediate results, metadata, and working memory. Updates use reducers (pure functions) that safely merge concurrent changes (e.g., append to a list of observations).
A checkpointer persists that state (MemorySaver for development; SQLite/Postgres or custom backends for production). After configurable steps (or every node), the current state is serialized and stored under a thread_id + checkpoint_id.
Architectural tradeoffs
Common failure modes & mitigations
Evaluation for production readiness
Practice exercises & mini-projects
When to use ReAct + stateful orchestration vs. simpler patterns
Use when the task is complex, uncertain, interactive with external systems, long-running, requires adaptation/grounding, auditability, or human oversight. Prefer simpler chains, direct tool calling, or advanced native reasoning models when latency/throughput is paramount, tasks are short and deterministic, or the workflow is purely internal.
Summary checklist for mastery
The 2025–2026 literature has moved past the original 12 “foundational” agent design patterns (ReAct, Reflection, Tool Use, Planning, Multi-Agent Collaboration, Sequential Workflows, Human-in-the-Loop, etc.) toward a wave of emergent patterns addressing production constraints: context management, bounded execution, layered safety controls, memory, and meta-level orchestration, with newer catalogs grouping roughly 21 widely used patterns into five families [17]. A parallel research thread has proposed more formal frameworks for classifying these patterns — for example, a two-dimensional framework organizing agent design along “cognitive function” and “execution topology” axes [15], and a system-theoretic framework treating agentic design patterns as control-theoretic building blocks [16].
Architecturally, the field is also shifting from single, all-purpose agents toward orchestrated teams of specialized agents — Gartner reported a 1,445% surge in multi-agent system inquiries between Q1 2024 and Q2 2025 — with “puppeteer” orchestrator agents coordinating specialist sub-agents in production deployments [14]. Google Cloud’s architecture guidance for choosing a design pattern for agentic AI systems (sequential, hierarchical, or collaborative) reflects this same trend toward explicit topology selection as a first design decision [18].
Practical building, integration, and enhancement of agents.
Stateful orchestration is the production-grade execution layer that turns individual reasoning loops (such as ReAct) into reliable, resumable, auditable, and controllable multi-step workflows. While ReAct provides the cognitive pattern (Thought → Action → Observation), stateful orchestration supplies the memory, persistence, control flow, and safety mechanisms required for anything beyond trivial single-turn interactions.
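This section focuses on the four MUST-HAVE capabilities for production agentic systems: persistent shared state, checkpointing and recovery, graph-based control flow with conditional branching, and interrupts for human-in-the-loop approval.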
What it means
State is a shared, structured data object that lives across every step of a workflow. It holds conversation history, intermediate results, metadata, and any values needed by downstream nodes. In graph-based systems, state is the single source of truth that all nodes read from and write to.
Why it matters
Without persistent shared state, each step is isolated. Agents cannot remember prior observations, coordinate across branches, or resume intelligently. Persistent state enables long-running processes, multi-agent blackboard-style coordination, and full audit trails — essential for compliance, debugging, and reliability.
How it works
Define a typed state schema (e.g., TypedDict or Pydantic model). Nodes receive the current state and return updates. Reducers are pure functions that safely merge updates (e.g., append to a list of observations or overwrite a field). The framework manages state passing and merging automatically.
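The reducer idea can be shown in a few lines of plain Python. This is a sketch of the concept, not LangGraph's internals: the `REDUCERS` table, `apply_update`, and the field names are hypothetical. LangGraph expresses the same idea by annotating a state field with a reducer such as `operator.add`.

```python
import operator
from typing import Any, Callable, Dict


def overwrite(_old: Any, new: Any) -> Any:
    """Last-write-wins reducer."""
    return new


# Each field maps to a reducer: (old_value, update) -> new_value.
REDUCERS: Dict[str, Callable[[Any, Any], Any]] = {
    "observations": operator.add,  # list + list builds a new, longer list
    "risk_score": overwrite,
}


def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new state dict; the input state is never mutated."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key, overwrite)
        merged[key] = reducer(state[key], value) if key in state else value
    return merged


state = {"observations": [], "risk_score": 0.0}
# Two branches each return a partial update rather than a full state.
branch_a = {"observations": ["price is 42"]}
branch_b = {"observations": ["stock is low"], "risk_score": 0.7}
for update in (branch_a, branch_b):
    state = apply_update(state, update)
```

Because the append reducer combines both branch updates, neither observation is lost; with a plain overwrite on `observations`, the second branch would silently replace the first.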
Common tools & frameworks
LangGraph’s StateGraph is the leading implementation: explicit state schema + reducers. Other frameworks offer lighter or implicit state (CrewAI tasks, AutoGen conversations) but LangGraph provides the most control and production maturity for complex persistence.
Real-world examples
Common failure modes & mitigations
Poor reducer design causing lost updates or race conditions in parallel branches; state bloat (accumulating every message forever); inconsistent schemas across nodes. Mitigations — use well-defined reducers, prune or summarize state periodically, enforce schema validation, and separate short-term thread state from long-term memory stores.
Evaluation
Measure state consistency under concurrent updates, audit completeness (can you reconstruct the full decision path from state history?), and memory efficiency (tokens or storage per workflow).
What it means
Checkpointing periodically serializes the full graph state (including messages, intermediate results, and metadata) to durable storage. Recovery loads a prior checkpoint and resumes execution from that exact point without re-running completed work.
Why it matters
LLM calls, tool executions, and long-running workflows are expensive and non-deterministic. Crashes, timeouts, or deployments must not force full restarts. Checkpointing delivers fault tolerance, time-travel debugging, and the foundation for human-in-the-loop pauses. Robust checkpointing is repeatedly cited as a key differentiator for reliable agentic systems in production.
How it works
Attach a checkpointer when compiling the graph (e.g., MemorySaver for development, AsyncSqliteSaver or Postgres for production). Every node execution (or configurable milestones) writes a checkpoint identified by thread_id + checkpoint_id. On failure or restart, pass the same thread_id and the system loads the latest (or chosen) checkpoint and continues.
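The save-and-resume mechanics can be illustrated with the standard library alone. The sketch below is a toy, not a LangGraph checkpointer: `SqliteCheckpointStore`, `run`, and the three step names are hypothetical, and the state is assumed to be JSON-serializable.

```python
import json
import sqlite3


class SqliteCheckpointStore:
    """Toy checkpoint store keyed by (thread_id, checkpoint_id)."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, checkpoint_id INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, checkpoint_id))"
        )

    def put(self, thread_id: str, state: dict) -> int:
        row = self.conn.execute(
            "SELECT COALESCE(MAX(checkpoint_id), 0) FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        checkpoint_id = row[0] + 1
        self.conn.execute(
            "INSERT INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, checkpoint_id, json.dumps(state)),
        )
        self.conn.commit()
        return checkpoint_id

    def get(self, thread_id: str, checkpoint_id=None):
        """Load the chosen checkpoint, or the latest when checkpoint_id is None."""
        if checkpoint_id is None:
            query = ("SELECT state FROM checkpoints WHERE thread_id = ? "
                     "ORDER BY checkpoint_id DESC LIMIT 1")
            params = (thread_id,)
        else:
            query = ("SELECT state FROM checkpoints "
                     "WHERE thread_id = ? AND checkpoint_id = ?")
            params = (thread_id, checkpoint_id)
        row = self.conn.execute(query, params).fetchone()
        return json.loads(row[0]) if row else None


STEPS = ["research", "assess", "execute"]


def run(store, thread_id, crash_after=None):
    state = store.get(thread_id) or {"done": []}
    for step in STEPS:
        if step in state["done"]:
            continue  # finished before the crash; do not re-run
        state["done"].append(step)
        store.put(thread_id, state)  # checkpoint after every step
        if step == crash_after:
            raise RuntimeError("simulated crash")
    return state


store = SqliteCheckpointStore()
try:
    run(store, "abc123", crash_after="assess")
except RuntimeError:
    pass
final = run(store, "abc123")  # same thread_id: resumes after "assess"
```

Loading an older `checkpoint_id` instead of the latest is the basis for the time-travel debugging mentioned above.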
Key patterns
Real-world examples
Common failure modes & mitigations
Using an in-memory checkpointer in production (state lost on restart); non-idempotent actions executed again on recovery (duplicate emails, duplicate charges); checkpoint bloat or slow serialization of very large states. Mitigations — use durable backends, design actions to be idempotent or include compensation logic, checkpoint only at safe milestones, and implement state pruning.
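One way to make a side-effecting action safe to replay is an idempotency key derived from the thread and the action's arguments. The sketch below assumes a hypothetical `IdempotentExecutor`; its in-memory dictionary stands in for what would need to be a durable store in production.

```python
import hashlib
import json


class IdempotentExecutor:
    """Runs each (thread_id, action, args) combination at most once."""

    def __init__(self):
        self._results = {}      # assumption: would be durable storage in production
        self.side_effects = 0   # counts how often the real action actually ran

    def _key(self, thread_id: str, action: str, args: dict) -> str:
        payload = json.dumps([thread_id, action, args], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def execute(self, thread_id: str, action: str, args: dict) -> dict:
        key = self._key(thread_id, action, args)
        if key in self._results:
            # Replay after recovery: return the recorded result, do nothing else.
            return self._results[key]
        self.side_effects += 1  # stands in for sending the email or charging the card
        result = {"status": "done", "action": action}
        self._results[key] = result
        return result


executor = IdempotentExecutor()
first = executor.execute("abc123", "send_email", {"to": "ops@example.com"})
# The same call is issued again after a simulated recovery.
second = executor.execute("abc123", "send_email", {"to": "ops@example.com"})
```

The second call returns the stored result and the side effect runs only once, which is the property that prevents duplicate emails or charges on resume.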
Evaluation
Recovery success rate, mean time to recovery (MTTR), storage/latency overhead of checkpoints, and ability to replay historical runs for audit or A/B testing.
What it means
The workflow is modeled as an explicit directed graph. Nodes perform work (LLM calls, tool use, ReAct loops, custom logic). Edges define control flow. Conditional edges route dynamically based on state values (e.g., confidence score, risk level, or observation content).
Why it matters
Real tasks are rarely linear. Agents must branch on conditions (“if high risk → human review”), run steps in parallel and merge results, loop until convergence, or escalate. Explicit graphs give controllability, observability, and testability that implicit or prompt-only orchestration lacks.
How it works
Conditional edges are added with add_conditional_edges(source, routing_function, path_map), where the routing function inspects state and returns the next node name(s).
Common tools
LangGraph excels here with first-class conditional edges and graph visualization. Other frameworks support sequencing and some branching but usually with less explicit control.
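Independent of any framework, conditional routing reduces to a function that inspects state and returns a label, plus a map from labels to node names. The sketch below is a hand-rolled illustration; the node functions, the 0.5 risk threshold, and `run_graph` are all hypothetical.

```python
from typing import Any, Dict

State = Dict[str, Any]


def assess(state: State) -> State:
    # Toy risk model: destructive requests score high.
    return {**state, "risk_score": 0.9 if "delete" in state["request"] else 0.1}


def human_review(state: State) -> State:
    return {**state, "route_taken": "human_review"}


def auto_execute(state: State) -> State:
    return {**state, "route_taken": "auto_execute"}


def route_on_risk(state: State) -> str:
    """Routing function: reads state, returns a label from the path map."""
    return "high" if state["risk_score"] >= 0.5 else "low"


NODES = {"assess": assess, "human_review": human_review, "auto_execute": auto_execute}
# source node -> (routing function, path map from label to next node)
CONDITIONAL_EDGES = {
    "assess": (route_on_risk, {"high": "human_review", "low": "auto_execute"}),
}


def run_graph(state: State, entry: str = "assess", max_steps: int = 10) -> State:
    current = entry
    for _ in range(max_steps):  # guard against routing cycles
        state = NODES[current](state)
        if current not in CONDITIONAL_EDGES:
            return state  # no outgoing edge: terminal node
        router, path_map = CONDITIONAL_EDGES[current]
        current = path_map[router(state)]
    raise RuntimeError("max_steps exceeded")


risky = run_graph({"request": "delete all records"})
safe = run_graph({"request": "summarize the report"})
```

Keeping the routing function pure makes every branch unit-testable without running the whole graph.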
Real-world examples
Common failure modes & mitigations
Infinite loops from missing or buggy termination conditions; dead branches or unreachable nodes; race conditions in parallel updates without proper reducers. Mitigations — add explicit termination logic and max-iteration guards; use graph visualization and static analysis tools; test all routing paths.
Evaluation
Branch coverage in tests, correctness of routing decisions, latency of coordination overhead, and maintainability (ease of modifying flow).
What it means
Interrupts pause graph execution at defined points, persist the current state, and wait for external (usually human) input before resuming. This implements human-in-the-loop (HITL) approval, editing, or rejection for high-risk actions.
Why it matters
Fully autonomous execution of high-stakes actions (financial transfers, medical recommendations, data deletions, customer communications, compliance decisions) is unacceptable in most real deployments. Interrupts provide controllable safety valves while preserving automation for low-risk paths. They are a core governance and safety mechanism.
How it works
Two complementary approaches:
Static interrupts (at compile time):
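# builder is a StateGraph under construction; checkpointer is a durable saver
graph = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["high_risk_action_node"],  # pause before this node runs
    interrupt_after=["some_other_node"],  # pause after this node finishes
)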
Dynamic interrupts (recommended for flexible HITL — called inside a node):
from langgraph.types import interrupt, Command

def human_review_node(state):
    # Context shown to the reviewer; include everything needed to decide.
    payload = {"need": "approval", "draft": state["draft"], "risk": state["risk_score"]}
    decision = interrupt(payload)  # pauses here, saves state
    # On resume, decision holds the value passed to Command(resume=...).
    return {"approved": decision["approved"], "feedback": decision.get("feedback")}
Resume from outside:
result = graph.invoke(
    Command(resume={"approved": True, "feedback": "Looks good, proceed"}),
    config={"configurable": {"thread_id": "abc123"}},  # same thread_id as the paused run
)
When an interrupt fires, the checkpointer saves the exact state. Execution can resume minutes, hours, or days later from any client that knows the thread_id.
Real-world examples (high-risk gates)
Common failure modes & mitigations
Using non-durable checkpointers (interrupt state lost); poor UX for human reviewers (missing context in the interrupt payload); timeouts or orphaned interrupts if resumption logic is fragile; over-use of interrupts (every step becomes a bottleneck). Mitigations — always pair interrupts with durable checkpointers and consistent thread IDs; include rich, relevant context in the interrupt payload; design clear approval UIs; reserve interrupts for genuinely high-risk nodes.
Evaluation
Approval rate and latency, safety incidents prevented, human reviewer satisfaction, audit completeness of approval decisions, and overall workflow throughput with interrupts enabled.
| Approach | Reliability & Recovery | Controllability / HITL | Complexity & Overhead | Best For | Avoid When |
|---|---|---|---|---|---|
| Stateless chains / simple ReAct | Low | Low | Low | High-volume, low-risk, simple tasks | Long-running or high-stakes work |
| Stateful graph + checkpointing | High | Medium–High | Medium | Most production multi-step agents | Ultra-low latency requirements |
| + Conditional branching | High | High | Medium–High | Adaptive, decision-heavy flows | Purely linear pipelines |
| + Dynamic interrupts (HITL) | Very High | Very High | High | High-risk or regulated domains | Fully autonomous low-risk tasks |
Stateful orchestration with all four MUST-HAVEs is the default recommendation for any production agent that performs non-trivial, multi-step, or consequential work.
Exercise: Stateful Research + Decision Agent with Checkpoints, Branching, and Approval Gates
State fields: messages, research_findings, risk_score, final_recommendation, approved.
Nodes: researcher (ReAct-style loop or tool calls), risk_assessor, high_risk_approval (uses dynamic interrupt), executor (only runs if approved).
Interrupts: interrupt_before or dynamic interrupt on the approval/execution path.
Mini-project extension
Instrument full audit logging. Measure recovery success, approval latency, and token/state growth. Compare against a non-checkpointed, non-interrupt version on the same tasks.
Summary Checklist
The framework landscape for building agents has consolidated and matured rapidly. OpenAI replaced its experimental Swarm framework with the production-grade Agents SDK, then extended it with native sandbox execution and a model-native harness for secure, long-running agents, and introduced AgentKit — including a visual Agent Builder canvas for versioning multi-agent workflows and ChatKit for embedding chat-based agent experiences [19]. Google shipped ADK (Agent Development Kit) 1.0 for Java and Go (alongside existing Python/TypeScript SDKs), adding native Agent-to-Agent (A2A) protocol support and a visual Agent Designer in the Google Cloud console [20]. Microsoft’s Agent Framework 1.0 went GA in April 2026, merging AutoGen and Semantic Kernel into a single .NET/Python SDK [21]. Independent framework comparisons published in 2026 continue to position LangGraph as the strongest choice for granular state control and durable execution, CrewAI for fast role-based multi-agent prototyping, and the OpenAI/Google/Microsoft SDKs as increasingly capable managed alternatives [22].
On the durable-execution side specifically, 2026 production guides emphasize pairing LangGraph checkpointers with dedicated durable-execution engines (e.g., Temporal) and note optimizations such as delta-based checkpoint storage that can reduce persisted-state size by orders of magnitude at scale [6].
Measuring, comparing, and optimizing agent performance.
What it means
An evaluation pipeline is a repeatable, automated system that runs agents on curated test cases, scores outputs against expected outcomes or rubrics, and measures multiple dimensions: task completion/success rate, accuracy/quality (factuality, helpfulness, safety), latency (end-to-end and per-step), cost (tokens, tool calls), and reliability (consistency across runs or under perturbation).
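A minimal harness along these lines might look like the following. It is a sketch under simplifying assumptions: `toy_agent` is a hypothetical stand-in for the system under test, scoring is exact match only, and cost tracking is omitted.

```python
import statistics
import time
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected: str


def toy_agent(prompt: str) -> str:
    """Hypothetical agent under test; replace with a real agent call."""
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


def evaluate(agent, cases, runs_per_case: int = 3) -> dict:
    results = []
    for case in cases:
        # Repeated runs expose variance when the agent is stochastic.
        for _ in range(runs_per_case):
            start = time.perf_counter()
            output = agent(case.prompt)
            latency = time.perf_counter() - start
            results.append({
                "prompt": case.prompt,
                "passed": output == case.expected,  # programmatic exact-match check
                "latency_s": latency,
            })
    passed = [r["passed"] for r in results]
    return {
        "success_rate": sum(passed) / len(passed),
        "mean_latency_s": statistics.mean(r["latency_s"] for r in results),
        "failures": sorted({r["prompt"] for r in results if not r["passed"]}),
    }


cases = [
    Case("2+2", "4"),
    Case("capital of France", "Paris"),
    Case("capital of Peru", "Lima"),
]
report = evaluate(toy_agent, cases)
```

An LLM judge or rubric scorer would slot in where the exact-match comparison is, and the per-run records can be kept to report distributions rather than only means.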
Why it matters
Without rigorous pipelines, “it works in my demo” does not translate to production. Surveys of agent evaluation highlight that it must cover core capabilities (planning, tool use, reflection), application-specific benchmarks (web agents, SWE-bench-style coding, long-horizon tasks), and generalist performance while accounting for stochasticity [3].
How it works (practical pipeline)
Common tools/frameworks
LangSmith, Phoenix, or custom harnesses built on LangGraph’s tracing + checkpoint replay. Benchmarks: SWE-Bench, WebArena, GAIA/Gaia2-style environments, and long-horizon suites (e.g., DeepPlanning, MemGym) [3].
Real-world examples
Common failure modes & mitigations
Over-reliance on single success rate (ignores variance/stochasticity); LLM judges that are poorly calibrated or biased; ignoring latency/cost in “accuracy-only” evals; static datasets that don’t reflect production distribution shift. Mitigations — use statistical reliability framing, multi-judge ensembles with calibration, track full distributions (not just means), and maintain living datasets from production traces. Key dimensions: task success, quality/faithfulness, efficiency (latency + cost), robustness (to noise/failure), controllability (via interrupts/state), and safety.
Practice exercise
Build a minimal eval harness for your stateful ReAct research agent. Create 10–20 golden cases. Run with tracing. Implement a simple LLM judge + programmatic checks. Report success rate, average steps/latency, and error breakdown. Add one production trace as a new test case.
What it means
Benchmarking compares systems or versions on standardized or custom suites. A/B testing (or online experimentation) runs controlled variants in production (or shadow) to measure real-user or real-task impact.
Why it matters
Benchmarks reveal relative strengths; A/B testing validates whether changes (new memory, different planner, parameter tweak, interrupt policy) actually move the needle on business or reliability metrics without harming others.
How it works
Trade-offs & considerations
Agent stochasticity requires more samples or pass@k-style metrics. Long-horizon tasks amplify small differences. Production A/B must respect safety (e.g., route high-risk cases to human review or a conservative variant).
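For pass@k-style reporting, a commonly used unbiased estimator takes n sampled runs per task, counts the c that succeeded, and computes 1 - C(n-c, k) / C(n, k). The sketch below implements that formula; the function name and the sample numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs, drawn from n with c successes, passes."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any draw of k contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 runs of the same task, 3 of which succeeded.
single_try = pass_at_k(n=10, c=3, k=1)
five_tries = pass_at_k(n=10, c=3, k=5)
```

Averaging this value over all tasks gives the suite-level pass@k; comparing k=1 with larger k shows how much of a variant's apparent strength depends on retries.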
Practice
A/B test two memory configurations or two planning strategies (plain ReAct vs. ReAct + reflection) on your eval set and on a small live traffic slice. Report effect sizes on success, latency, and cost.
What it means
Why it matters
Sampling parameters such as temperature and Top-p directly affect output diversity, coherence, and reliability. In agentic loops they influence planning quality, tool-call formatting, reflection depth, and whether the agent explores novel paths or sticks to safe ones.
How it works & trade-offs
Practical guidance
Reasoning-specialized models often need less aggressive temperature tuning for internal CoT, but tool-use and structured agent loops still benefit from careful tuning. Use lower values for high-risk nodes or before interrupts; allow moderate exploration in research/planning nodes. Combine with structured output / tool-calling APIs for better determinism than raw sampling.
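The effect of these parameters is easiest to see on a toy distribution. The sketch below applies temperature scaling and nucleus (Top-p) filtering to three made-up logits; treating temperature 0 as greedy decoding is an assumption here, since providers handle that edge case differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding (all mass on the arg-max).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)   # sharply peaked
hot = softmax_with_temperature(logits, 1.2)    # flatter, more exploratory
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)
```

Lower temperature concentrates probability on the top token, which is why it suits formatting-sensitive or high-risk nodes, while Top-p removes the low-probability tail without changing the relative order of the tokens it keeps.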
Evaluation of tuning
Measure not just final accuracy but also trajectory consistency, formatting compliance, and downstream effects (e.g., interrupt approval rate, recovery success).
Practice exercise
Run the same set of planning or research tasks at temperature 0.0, 0.7, and 1.2 (with fixed seed where possible). Compare success rate, diversity of plans/actions (qualitative or embedding distance), latency, and error types. Repeat with Top-p sweeps.
The benchmark landscape has consolidated around five core suites that measure distinct capabilities and “should never be collapsed into a single ranking”: SWE-bench (software engineering), GAIA (real-world multi-step assistant tasks), Tau-bench (tool use under policy constraints), AgentBench, and WebArena (web navigation), alongside newer additions like Terminal-Bench and OSWorld [12][13]. A 2025 survey of LLM-agent evaluation methods formally organized the field into five perspectives — core capabilities, application-specific benchmarks, generalist-agent evaluation, benchmark-dimension analysis, and evaluation tooling — while flagging cost-efficiency, safety, and robustness assessment as the biggest remaining gaps [3]. Continuous, in-deployment evaluation frameworks (e.g., multi-signal monitoring across live agent traffic rather than one-off offline runs) have also emerged as a complement to static golden-dataset testing.
Operationalizing and scaling agentic systems.
What it means
Packaging agentic systems (LLM inference, RAG pipelines, stateful graphs, guardrails, memory stores) into portable, reproducible containers and orchestrating them at scale with Kubernetes.
Why it matters
Agent workflows involve multiple components (inference servers, vector DBs, guardrail services, orchestration runtimes). Docker + Kubernetes provide consistency across environments, declarative scaling, self-healing, and the foundation for MLOps automation.
How it works (practical patterns)
Integration with prior topics
Common failure modes
Inconsistent environments between dev and prod; poor resource requests/limits causing OOM or throttling; stateful components (checkpointers, vector indexes) not properly persisted or backed up; secrets and sensitive prompts/configs leaking into images.
Evaluation dimensions
Deployment success rate, time-to-deploy, resource efficiency, scaling responsiveness, and recovery time from pod failures.
Practice exercise
Containerize your stateful ReAct + RAG + Guardrails agent. Create a simple Kubernetes manifest (or Helm chart) that deploys the inference backend (Triton/TensorRT-LLM), the agent runtime, and a vector DB. Add health checks and basic HPA. Test rolling updates and pod failure recovery.
What it means
Applying software engineering and MLOps discipline to the full agent stack: code (graphs, tools, policies), models, prompts, RAG indexes, Colang guardrail policies, and evaluation datasets.
Why it matters
Agents are composite systems. Changes to any part (new model version, updated Colang policy, refreshed RAG corpus, new tool) can break safety, accuracy, or performance. CI/CD + governance provides repeatability, auditability, and controlled rollout.
Key practices
Integration
CI/CD can trigger re-evaluation of your stateful agent whenever the RAG corpus, guardrail policies, or serving backend (TensorRT-LLM quantization) changes.
Common failure modes
No automated regression testing after policy or model changes; lack of versioning for non-code artifacts (prompts, Colang flows, indexes); governance gaps around who can modify high-risk tools or relax safety rails.
Practice exercise
Set up a basic CI pipeline (GitHub Actions, GitLab CI, or similar) for your agent project. On every push: run linting, unit tests, a subset of the evaluation pipeline (including guardrails and safety checks), and build/push Docker images. Add a manual approval gate before production deployment.
Kubernetes has become the dominant substrate for agentic AI deployment: roughly two-thirds of organizations running generative AI workloads now host some or all of their inference on Kubernetes [29]. Event-driven autoscalers like KEDA are increasingly used to scale agent-runner pods based on queue depth (e.g., Pub/Sub backlog or Redis list length) rather than CPU, matching bursty agent workload patterns and scaling to zero when idle. Specialized model-serving layers such as KServe now integrate with Knative for scale-to-zero GPU workloads, and Custom Resource Definitions (CRDs) let teams treat agent fleets as first-class Kubernetes objects with built-in high availability [30]. Emerging guidance also favors SLO-signal-based autoscaling (driven by latency/error-budget signals) over simple threshold-based autoscaling for better cost and reliability outcomes in agent-serving clusters.
Core cognitive processes underlying intelligent agent behavior, including reasoning strategies, decision-making, and memory management.
What it means
Why it matters
Pure context-window memory is limited and resets. Effective agents need both: short-term for coherence within a workflow and long-term for personalization, knowledge accumulation, and avoiding repeated mistakes. Graph memory and hierarchical approaches have moved from experimental to practical production use [8].
How it works
Common tools & advances
LangGraph short-term via checkpointers; separate long-term stores (vector DBs + graph DBs or integrated solutions like Mem0 with graph/entity capabilities). Recent memory systems combine semantic similarity, keyword (BM25) matching, and entity matching in a single multi-signal retrieval score rather than relying on vector similarity alone [8]. Benchmarks used to compare memory architectures include LoCoMo, LongMemEval, BEAM, MemGym, and STALE (which specifically tests for outdated/stale assumptions) [8].
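A multi-signal retrieval score of the kind described above can be sketched as a weighted sum. Everything here is illustrative rather than any particular product's scoring: the weights are arbitrary, the two-dimensional embeddings are hand-written, and simple term overlap stands in for BM25.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query_text, memory_text):
    """Fraction of query terms found in the memory (crude stand-in for BM25)."""
    q, m = set(query_text.lower().split()), set(memory_text.lower().split())
    return len(q & m) / len(q) if q else 0.0


def entity_score(query_entities, memory_entities):
    """Jaccard overlap between the two entity sets."""
    q, m = set(query_entities), set(memory_entities)
    return len(q & m) / len(q | m) if q | m else 0.0


def multi_signal_score(query, memory, weights=(0.5, 0.3, 0.2)):
    w_sem, w_kw, w_ent = weights  # assumed weights; tune on your own data
    return (w_sem * cosine(query["embedding"], memory["embedding"])
            + w_kw * keyword_score(query["text"], memory["text"])
            + w_ent * entity_score(query["entities"], memory["entities"]))


memories = [
    {"text": "Bob asked about EUR exchange rates", "embedding": [0.8, 0.3], "entities": ["Bob"]},
    {"text": "Alice prefers invoices in EUR", "embedding": [0.9, 0.1], "entities": ["Alice"]},
]
query = {"text": "Alice invoices currency", "embedding": [1.0, 0.0], "entities": ["Alice"]}

ranked = sorted(memories, key=lambda m: multi_signal_score(query, m), reverse=True)
```

In this toy data the two memories are close on vector similarity alone, and the keyword and entity signals are what clearly separate the memory about Alice from the one about Bob.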
Real-world examples
Common failure modes
Short-term overflow or loss of critical recent context; long-term retrieval of irrelevant or stale information (hallucinated or outdated actions); poor write/update policies leading to memory bloat or forgotten important facts; latency from retrieval in time-sensitive loops.
Mitigations
Explicit short-term vs. long-term separation; hierarchical compression + entity linking; staleness detection; retrieval reranking; integration with state (e.g., surface retrieved memories into graph state before planning nodes).
Evaluation
Recall/precision of retrieved memories, impact on task success over long horizons, latency overhead, and resistance to staleness (STALE-style tests).
Practice
Extend your stateful agent with a simple long-term memory layer (vector store or lightweight graph). After each task or reflection step, extract and store key entities/facts. On new sessions or long tasks, retrieve relevant memories and inject into state. Measure improvement on multi-session or long-horizon test cases.
What it means
Why it matters
Chain-of-thought (CoT) reasoning and task decomposition are foundational cognitive patterns that improve performance on multi-step problems. They are often combined with ReAct (reasoning interleaved with acting) and reflection.
How it works
Integration with prior topics
Use inside ReAct thoughts or as a dedicated planning node in a stateful graph. Decomposition results can be stored in state and used for conditional branching or progress tracking.
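A decomposition stored in state can be validated and ordered before any sub-task runs. The sketch below uses the standard library's `graphlib`; the plan contents are hypothetical, and in a real agent the plan would come from a planner node's structured output.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each sub-task maps to the set of sub-tasks it depends on.
plan = {
    "gather_sources": set(),
    "summarize_sources": {"gather_sources"},
    "assess_risk": {"summarize_sources"},
    "draft_recommendation": {"summarize_sources", "assess_risk"},
}


def execution_order(plan):
    """Validate dependencies, then return sub-tasks in a runnable order."""
    missing = {dep for deps in plan.values() for dep in deps} - set(plan)
    if missing:
        raise ValueError(f"undefined sub-tasks: {sorted(missing)}")
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(plan).static_order())


order = execution_order(plan)

# Progress tracking that a graph could keep in state for conditional branching.
state = {"plan": order, "completed": []}
for task in order:
    state["completed"].append(task)  # a real node would execute the sub-task here
```

Checking for undefined or circular dependencies up front catches the "dependencies missed" failure before it propagates into execution.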
Trade-offs
CoT improves reasoning but increases token usage. Decomposition helps with complex goals but risks error propagation if sub-tasks are poorly defined or dependencies missed.
Practice
Implement a planner node that performs task decomposition before a ReAct-style executor. Compare end-to-end success and efficiency against a flat ReAct baseline on multi-step tasks.
What it means
Strategies for deciding the sequence of actions or sub-goals over time, especially under uncertainty or partial observability.
Key strategies
Why it matters
Good planning turns reactive tool-calling into goal-directed behavior. Combined with memory and stateful orchestration, it enables long-horizon, resumable, interruptible agents.
Trade-offs table
| Strategy | Adaptivity | Token/Cost Efficiency | Reliability on Hard Tasks | Best Paired With |
|---|---|---|---|---|
| ReAct / interleaved | High | Lower | Good | Reflection, short-term state |
| Plan-then-Execute | Low–Medium | Higher | Brittle if surprises | Strong decomposition |
| Reflexion | High | Medium–High | Strong (self-correction) | Episodic long-term memory |
| ToT / LATS | High | High | Very strong | Search + evaluation |
Practice exercise
Build two planner variants in your graph: (1) simple ReAct-style interleaved planning, (2) explicit decomposition + plan-then-execute with reflection on failure. Evaluate both on the same long-horizon or multi-step test set. Measure success, steps taken, and recovery from injected errors. Add checkpointing and one interrupt gate for a high-risk sub-task.
Dedicated “reasoning models” that spend extra inference-time compute generating intermediate thinking tokens before answering (OpenAI’s o-series, DeepSeek-R1, Claude’s extended-thinking modes, and Gemini’s Deep Think/Flash Thinking variants) have become a distinct category alongside standard LLMs, trading latency and cost for stronger performance on math, code, and multi-step planning tasks [33]. Because planning is a multi-step analogue of single-step CoT reasoning, current research is focused on when to invest this extra test-time compute — for example, learning to allocate planning effort adaptively rather than applying it uniformly to every step, which materially affects both cost and reliability in long-horizon agents [34]. On the memory side, hybrid retrieval that blends vector similarity, keyword matching, and entity linking is now considered standard practice rather than an advanced technique, with published benchmark gains of roughly +29.6 points on temporal reasoning and +23.1 points on multi-hop retrieval tasks compared to pure vector search [8].
Integration of external knowledge and the management of diverse data types.
What it means
Retrieval-Augmented Generation (RAG) retrieves relevant external information and injects it into the LLM’s context before generation. This grounds outputs in source material, reduces hallucinations, and enables knowledge that was not in the model’s training data [4].
Why it matters (especially for agents)
In ReAct-style loops, planning nodes, or long-term memory systems, RAG supplies fresh facts, documents, or prior experiences. It is one of the most heavily tested topics because almost every production agent uses some form of retrieval to stay accurate and current. Basic RAG is foundational; advanced variants (hybrid, HyDE, reranking) are now standard in production systems.
Basic RAG flow
Hybrid search (dense + sparse + fusion)
Combines semantic (vector) search with keyword/lexical search (BM25 or similar). Results are merged using Reciprocal Rank Fusion (RRF) or weighted scoring. Pure vector search excels at meaning but misses exact terms, codes, or rare entities; keyword search catches those but lacks semantics. Hybrid + RRF reliably improves precision and recall, especially on technical, legal, or product corpora [9]. A common production pattern retrieves the top ~100 candidates via hybrid search, passes them to a re-ranker model (e.g., Cohere Rerank or a BGE-Reranker), and keeps only the top 5–10 for the LLM, because bi-encoder vector embeddings are inherently “lossy” — compressing a complex passage into a single point in embedding space [9].
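Reciprocal Rank Fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over the rankings it appears in. The sketch below uses k = 60, a conventional default, and made-up document IDs; the tie-break on document ID is an assumption added to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists of document IDs; rank positions start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["doc_a", "doc_b", "doc_c"]    # hypothetical vector-search results
sparse = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only rank positions, it needs no score normalization between the dense and sparse retrievers, which is a large part of its appeal. Documents that both retrievers return (`doc_a`, `doc_c`) rise above those found by only one.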
HyDE (Hypothetical Document Embeddings)
Instead of embedding the raw (often short or vague) query, the LLM first generates a hypothetical “ideal” document that would answer the query. That hypothetical document is embedded and used for retrieval. The real retrieved documents are then used for final generation [5].
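The HyDE flow can be sketched with stubs in place of the model calls. Everything here is a toy: `generate_hypothetical_document` returns a canned string instead of calling an LLM, `embed` is a bag-of-words counter rather than a real embedding model, and the two-document corpus is invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def generate_hypothetical_document(query):
    """Stand-in for an LLM call that drafts an ideal answer to the query."""
    canned = {
        "why are my searches missing results?":
            "searches miss results when hnsw recall is low so raise m and efsearch",
    }
    return canned.get(query.lower(), query)


CORPUS = [
    "ivf partitions vectors into clusters and probes a few of them",
    "raising m and efsearch improves hnsw recall but uses more memory",
]


def retrieve(text, corpus, top_k=1):
    query_vec = embed(text)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


def hyde_retrieve(query, corpus, top_k=1):
    # Embed the hypothetical answer instead of the raw query.
    return retrieve(generate_hypothetical_document(query), corpus, top_k)


query = "Why are my searches missing results?"
raw_similarity = cosine(embed(query), embed(CORPUS[1]))
hyde_hit = hyde_retrieve(query, CORPUS)[0]
```

The raw query shares no vocabulary with the relevant passage, while the hypothetical answer does, so the toy mirrors the vocabulary-gap problem HyDE targets. The retrieved real documents, not the hypothetical one, are what go into final generation.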
Other common enhancements
Integration with prior topics
Common failure modes
Retrieval of irrelevant or contradictory chunks (context poisoning); “lost in the middle” problem with long contexts; stale or low-quality source data; poor chunking that splits concepts across boundaries.
Evaluation dimensions
Retrieval quality, generation faithfulness (does the answer cite or stay consistent with sources?), end-to-end task improvement vs. no-RAG baseline, latency/cost overhead, and robustness to query variations.
Practice exercise
Add a RAG retrieval tool (or node) to your stateful research agent. Implement both basic dense search and hybrid search + RRF. Compare retrieval quality and final answer quality on the same test questions. Then add HyDE for short/vague queries and measure improvement.
What it means
Vector databases store high-dimensional embeddings of text (and increasingly multimodal data) and enable fast approximate nearest-neighbor (ANN) search. They are the storage and retrieval engine behind RAG and long-term agent memory.
Embeddings & distance metrics
Embeddings convert text into dense vectors that capture semantic meaning (popular model families include OpenAI, Voyage, Cohere, BGE, and Snowflake Arctic).
| Metric | Best For | Characteristics | Common Use in RAG |
|---|---|---|---|
| Cosine Similarity | Most semantic search | Direction-focused, ignores magnitude | Default choice |
| Euclidean (L2) | When magnitude matters | Sensitive to vector length | Less common |
| Dot Product | When vectors are normalized | Fast, equivalent to cosine when normalized | Some optimized systems |
| Manhattan | Sparse or specific domains | Less sensitive to outliers | Niche |
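A small worked example makes the table concrete. The sketch below uses plain Python lists and two made-up vectors that point in the same direction but differ in length.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude

cos_ab = cosine_similarity(a, b)              # 1.0: direction only, magnitude ignored
l2_ab = euclidean(a, b)                       # 5.0: the length difference shows up
l1_ab = manhattan(a, b)                       # 7.0
dot_unit = dot(normalize(a), normalize(b))    # matches cosine once both are unit length
```

Cosine treats the two vectors as identical while L2 and Manhattan report a nonzero distance, which is why mixing normalized and unnormalized embeddings in one index can change rankings under distance-based metrics.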
Indexing strategies (HNSW vs. IVF)
Exhaustive vector search is a linear scan over every stored vector, which becomes too slow at scale. Indexes enable fast approximate search.
Key HNSW tuning parameters: M (connections per node; higher = better recall, more memory) and efSearch/efConstruction (exploration depth; higher = better recall, slower queries/build) [11].
Production trade-offs
HNSW is often preferred for agent-facing, low-latency RAG. IVF or hybrid approaches win for massive scale or when memory cost is dominant. Both are approximate similarity search methods that trade some accuracy for large speed gains versus exhaustive search. Some systems (e.g., pgvector and Oracle AI Vector Search) offer both index types inside the database, often alongside hybrid search, while others (e.g., Weaviate, Pinecone) expose a narrower or fully managed set of index options, so confirm what your chosen store supports before planning an HNSW vs. IVF comparison [11].
Other important production features
Common failure modes
Index parameters poorly tuned (low recall or high latency); ignoring metadata filtering needs; embedding model mismatch (different model for indexing vs. query); not handling updates/deletes properly (stale vectors).
Evaluation
Recall@K / nDCG at different K values, query latency (p50/p95), index build time & memory footprint, impact on end-to-end RAG/agent quality, and filtering correctness.
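The retrieval and latency metrics above are short enough to implement directly. The sketch below assumes binary or graded relevance labels keyed by document ID and uses the nearest-rank definition of a percentile. The function names and sample data are illustrative.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(retrieved, relevance, k):
    """nDCG@k with graded relevance (dict of doc_id -> gain, missing = 0)."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([relevance.get(d, 0) for d in retrieved[:k]])
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=95 for p95 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


retrieved = ["d1", "d4", "d2", "d9"]          # made-up ranking from the index
relevance = {"d1": 1, "d2": 1, "d3": 1}       # made-up ground truth
latencies_ms = [12, 15, 11, 40, 13]           # made-up query latencies

recall3 = recall_at_k(retrieved, relevance.keys(), 3)   # 2 of 3 relevant docs found
ndcg3 = ndcg_at_k(retrieved, relevance, 3)
p50 = percentile(latencies_ms, 50)                       # 13
p95 = percentile(latencies_ms, 95)                       # 40
```

For ANN tuning, the ground truth is usually the exact top-K from an exhaustive search over the same corpus, so Recall@K measures how much the approximate index loses relative to a linear scan.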
Practice
Set up a vector store (pgvector, Chroma, or Weaviate) for your agent’s knowledge base. Compare HNSW vs. IVF (or different M/ef settings) on the same corpus for recall vs. latency. Add metadata filtering relevant to your use case (e.g., “only sources after 2025”).
What it means
ETL (Extract–Transform–Load) for knowledge integration is the ingestion pipeline that turns raw documents, databases, or APIs into clean, chunked, embedded, and searchable knowledge in the vector store (and long-term memory).
Core stages
Data quality checks (critical for agent reliability)
Why this matters for agents
Garbage in → garbage retrieval → hallucinated or wrong actions in ReAct loops or planning. Poor ETL directly degrades long-term memory and the evaluation metrics built for the agent overall.
Integration with previous layers
Feeds the vector DB used by RAG; populates long-term memory stores; can be triggered or validated inside stateful workflows; evaluation pipelines should include retrieval quality tests on the ingested data.
Common failure modes
Brittle chunking that breaks semantic units; missing or wrong metadata (breaks filtering and provenance); no incremental update strategy (full re-indexes become unsustainable); ignoring data drift or source changes.
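Two of these failure modes, chunks that split semantic units and the lack of an incremental update path, can be addressed with very little code. The sketch below is one possible approach under stated assumptions: paragraphs are separated by blank lines, a content hash identifies unchanged chunks, and the record fields (`id`, `source`, `sha256`) are hypothetical names rather than any vector store's schema. Embedding is left out.

```python
import hashlib


def chunk_paragraphs(text, max_chars=200):
    """Pack whole paragraphs into chunks so paragraphs are never split."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def ingest(doc_id, text, source, seen_hashes, max_chars=200):
    """Return new chunk records; skip chunks whose content hash was already seen."""
    records = []
    for i, chunk in enumerate(chunk_paragraphs(text, max_chars)):
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # duplicate or unchanged content: nothing to re-embed
        seen_hashes.add(digest)
        records.append({"id": f"{doc_id}#{i}", "text": chunk,
                        "source": source, "sha256": digest})
    return records


sample_text = "Alpha paragraph.\n\nBeta paragraph.\n\nGamma paragraph."
seen = set()
first_run = ingest("doc1", sample_text, "wiki", seen, max_chars=35)
second_run = ingest("doc1", sample_text, "wiki", seen, max_chars=35)  # nothing new
```

A paragraph longer than `max_chars` is kept whole rather than cut, so a real pipeline would add a fallback split (for example on sentences) and a quality check on the chunk length distribution.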
Practice exercise
Design and implement a small ETL pipeline for a document corpus (or your agent’s knowledge base). Include at least: semantic or recursive chunking, metadata enrichment, deduplication, and embedding. Add basic quality checks (e.g., chunk length distribution, sample retrieval test). Load into your vector store and measure retrieval metrics before/after quality improvements.
| Aspect | Basic RAG | Hybrid + RRF + HyDE | Advanced (with rerank, agentic chunking) | Notes |
|---|---|---|---|---|
| Implementation complexity | Low | Medium | High | Start simple, add when metrics plateau |
| Retrieval quality | Good on clear queries | Excellent on most queries | Highest | Hybrid is the standard default for production |
| Latency / Cost | Lowest | Medium | Highest | HyDE and reranking add LLM calls |
| Maintenance | Simple | Moderate | Higher (more moving parts) | ETL quality is the hidden multiplier |
Key principles
Consolidated practice recommendation
Extend your stateful ReAct research/decision agent with a RAG retrieval node/tool using hybrid search + HyDE, proper vector store indexing (HNSW or IVF with tuned parameters), a lightweight ETL pipeline that ingests new documents with quality checks and metadata, and an evaluation layer that reports retrieval metrics + end-to-end task success + latency/cost.
RAG practice guides now treat the “top-100-candidates → rerank → top-5-10” hybrid pipeline as the default architecture rather than an advanced option, given the well-documented lossiness of single-vector bi-encoder embeddings [9]. On the storage side, native database vector search (e.g., Oracle AI Vector Search, pgvector) increasingly ships both HNSW and IVF index types plus hybrid search under standard database governance, reducing the need for separate specialized vector-database infrastructure in some deployments [11]. On the memory-as-knowledge-integration side, 2026 industry benchmarking treats agent memory as a first-class architectural component with its own research literature and measurable performance gaps between approaches, driven in large part by hybrid retrieval (vector + keyword + entity-graph) replacing pure similarity search [8].
Leveraging NVIDIA’s AI hardware and software platforms for agentic AI systems.
Production Deployment, Governance & Optimized Inference focuses on making agentic systems safe, compliant, high-performance, and scalable in real environments. This layer builds directly on the reasoning, memory, RAG, stateful orchestration, and evaluation topics covered earlier.
These NVIDIA technologies address critical production gaps: NeMo Guardrails adds programmable safety, compliance, and factuality controls; NVIDIA NIM, TensorRT-LLM, and Triton Inference Server provide optimized, containerized, and scalable inference for the LLMs powering agents, RAG, and planning components.
What it means
NeMo Guardrails is an open-source toolkit for adding controllable, programmable guardrails to LLM-powered applications and agents. It uses Colang, a domain-specific language, to define policies that govern input processing, dialogue flow, output generation, and safety checks.
Why it matters
Agents that act on the world (tool use, planning, state changes) need strong input/output controls. Without guardrails, systems risk leaking PII, violating regulations (GDPR, CCPA, EU AI Act), generating ungrounded or harmful content, or executing unsafe actions. NeMo Guardrails turns safety and compliance into explicit, versionable, testable policies rather than ad-hoc prompting.
Core capabilities
How it works
Guardrails sit as a middleware layer around your LLM or agent:
Colang lets you express complex dialogue and safety logic declaratively (e.g., “when user expresses intent X, do Y and check Z”).
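The middleware idea can be illustrated without any guardrail library. The sketch below is plain Python and deliberately does not use the NeMo Guardrails API or Colang syntax. The function names, the email-only PII pattern, and the word-overlap groundedness check are simplifying assumptions chosen to show the order of operations (input rail, model call, output rail).

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def input_rail(user_text):
    """Redact email addresses before the text reaches the model."""
    return EMAIL.sub("[REDACTED_EMAIL]", user_text)


def output_rail(answer, sources, min_overlap=0.5):
    """Crude groundedness check: share of answer words found in the sources."""
    words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    source_words = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    if not words:
        return "I cannot answer that."
    overlap = len(words & source_words) / len(words)
    return answer if overlap >= min_overlap else "I cannot verify that from the sources."


def guarded_call(user_text, llm, sources):
    """Middleware order: input rail -> model -> output rail."""
    safe_input = input_rail(user_text)
    raw_answer = llm(safe_input)
    return safe_input, output_rail(raw_answer, sources)


def fake_llm(prompt):
    # Hypothetical stand-in for the wrapped model.
    return "HNSW is a graph index"


sources = ["HNSW is a layered graph index for ANN search."]
safe_input, answer = guarded_call("Email me at jane@example.com about HNSW", fake_llm, sources)
```

In a real deployment, the regex and overlap heuristics would be replaced by declarative policies and dedicated PII and fact-checking models. The useful property to preserve is that the model never sees unredacted input and the user never sees an unchecked output.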
Integration with prior topics
Wraps ReAct loops (filter inputs before Thought/Action, fact-check observations or final answers); works with stateful graphs (guardrails on nodes or as interrupt triggers for high-risk paths); enhances RAG (fact-checking rails validate against retrieved chunks); feeds evaluation pipelines (safety violation rate, PII leakage incidents, compliance audit coverage become key metrics).
Common failure modes
Overly strict rails that block legitimate queries (poor user experience); false negatives on sophisticated jailbreaks or subtle PII; fact-checking rails that are too slow or miss nuanced claims; Colang policies that become hard to maintain without proper versioning and testing.
Evaluation dimensions
Safety violation rate (blocked vs. allowed harmful content), PII redaction accuracy/recall, factuality/groundedness improvement, compliance audit completeness, latency overhead of rails, and false-positive rate on benign inputs.
Practice exercise
Add NeMo Guardrails to your stateful ReAct research/decision agent. Write Colang policies for: (1) input PII detection + redaction, (2) fact-checking against RAG sources before final answer, (3) blocking high-risk actions without human approval. Measure safety metrics and end-to-end latency before/after.
What it means
NVIDIA NIM provides pre-built, optimized, containerized microservices for running AI models (LLMs, embeddings, rerankers, vision, guard models, etc.) in production. Each NIM is a self-contained Docker container with the model, runtime, and APIs.
Why it matters
Deploying and scaling optimized inference for agents is complex. NIM abstracts this into consistent, Kubernetes-friendly containers with built-in health checks, metrics, and scaling support, dramatically reducing time-to-production for high-performance agent backends.
Key features
Integration
Use NIM containers to serve the core LLM used inside your ReAct agents, RAG retrievers, or planning modules. Guardrails (NeMo) can sit in front of or alongside NIM endpoints.
Practice
Deploy a NIM for your agent’s LLM (or a smaller guard/fact-check model). Expose it via Kubernetes and test basic scaling (increase replicas under load).
What it means
TensorRT-LLM is NVIDIA’s library and toolkit for high-performance LLM inference on GPUs. It applies multiple optimizations including quantization, efficient attention (PagedAttention/KV cache management), in-flight batching, and speculative decoding.
Core optimizations
Speed vs. accuracy trade-offs
Quantization delivers major throughput and latency gains (often 2–5× or more depending on model size and hardware) but can degrade performance on complex reasoning, long-context, or nuanced tasks. Always evaluate on your specific agent benchmarks (task success, faithfulness, safety) after quantization.
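The source of the accuracy risk is easy to see in a toy example. The sketch below applies symmetric per-tensor quantization to a handful of made-up weights and measures the round-trip error at 8 and 4 bits. It illustrates the general idea only. It is not how TensorRT-LLM implements quantization, and production toolkits typically use finer-grained scales and calibration data.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers with one shared scale (per-tensor)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


def max_abs_error(values, bits):
    """Largest difference between original and round-tripped values."""
    q, scale = quantize_symmetric(values, bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale)))


weights = [0.02, -0.5, 0.31, 1.0, -0.77]  # made-up values for illustration
err_int8 = max_abs_error(weights, 8)
err_int4 = max_abs_error(weights, 4)
print(f"INT8 max error: {err_int8:.4f}, INT4 max error: {err_int4:.4f}")
```

The worst-case error is half a quantization step, and the step at 4 bits is roughly 18 times larger than at 8 bits (127 levels versus 7 on each side of zero). Small per-weight errors like these accumulate across layers, which is why the effect has to be measured on end-to-end agent benchmarks rather than inferred.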
Why it matters for agents
Agents make many sequential LLM calls (ReAct loops, planning, reflection, RAG augmentation). Faster inference directly improves user experience, reduces cost, and enables higher concurrency.
Evaluation
Measure tokens/second, time-to-first-token (TTFT), end-to-end workflow latency, accuracy/regression on your golden eval set, and memory footprint before vs. after optimization.
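Time-to-first-token and tokens per second can be measured around any streaming client with a small wrapper. In the sketch below, `fake_stream` is a hypothetical generator that simulates prefill and per-token delays. In practice it would be replaced by the iterator returned from your serving client.

```python
import time


def measure_stream(token_stream):
    """Consume a token iterator; return TTFT, total time, and tokens/second."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    total = time.perf_counter() - start
    return {
        "tokens": count,
        "ttft_s": (first_token_at - start) if first_token_at is not None else None,
        "total_s": total,
        "tokens_per_s": count / total if total > 0 else 0.0,
    }


def fake_stream(n_tokens=5, prefill_s=0.02, per_token_s=0.005):
    # Hypothetical stand-in for a streaming LLM client.
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream())
```

Reporting TTFT separately from overall throughput matters for agents: long prompts (retrieved context, history) mostly inflate TTFT, while long generations mostly stress decode speed, and the two respond to different optimizations.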
Practice exercise
Take the same model used in your agent. Run it with TensorRT-LLM at FP16, INT8, and INT4. Benchmark throughput and latency. Run your full evaluation pipeline (including guardrails and RAG) and compare task success rate + safety metrics across quantization levels.
What it means
Triton is NVIDIA’s open-source, high-performance inference serving platform. It supports multiple backends (including TensorRT-LLM) and provides production-grade features for deploying one or many models at scale.
Key capabilities
Why it matters
Production agent systems rarely run a single model. Triton + TensorRT-LLM backends + NIM-style containers give you a unified, efficient serving layer for the entire inference stack (LLM for reasoning, embedding models for RAG/memory, guard models).
Integration with the stack
Typical production pattern: NeMo Guardrails (policy layer) → Triton (serving) with TensorRT-LLM backend (optimized LLM) + other NIMs (embeddings, rerankers) → Stateful agent orchestration (LangGraph) with RAG and memory.
Practice exercise
Deploy your agent’s LLM via Triton with a TensorRT-LLM backend. Enable dynamic batching. Compare throughput and latency under concurrent load vs. a simple single-request server. Add a second model (e.g., embedding or small guard model) to the same Triton instance.
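Before running the load test, it can help to see why dynamic batching raises throughput. The following is a conceptual simulation only, not Triton code or configuration: the function name, parameters, and arrival times are hypothetical, and the policy shown (dispatch when the batch is full or when the oldest request has waited too long) is a simplified version of what a serving-side batcher does.

```python
def dynamic_batches(arrivals_ms, max_batch_size=4, max_queue_delay_ms=5.0):
    """Group request arrival times into batches.

    A batch is dispatched when it is full, or when a new request arrives
    after the oldest queued request has waited longer than the delay limit.
    """
    batches, current = [], []
    for t in sorted(arrivals_ms):
        if current and t - current[0] > max_queue_delay_ms:
            batches.append(current)  # delay window expired before t arrived
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # batch is full: dispatch immediately
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0.0, 1.0, 2.0, 3.0, 4.0, 20.0, 21.0, 40.0]  # made-up timestamps (ms)
batches = dynamic_batches(arrivals)
print(batches)  # [[0.0, 1.0, 2.0, 3.0], [4.0], [20.0, 21.0], [40.0]]
```

Eight requests are served in four forward passes instead of eight. The queue-delay limit is the tuning knob to watch in the exercise: a larger value forms fuller batches under bursty load but adds waiting time to every request that arrives alone.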
| Concern | NeMo Guardrails | TensorRT-LLM Quantization | Triton + NIM Serving | Combined Recommendation |
|---|---|---|---|---|
| Safety & Compliance | Excellent (policies + rails) | Indirect (via evaluation) | Good (versioning, isolation) | Guardrails + evaluation |
| Inference Speed / Cost | Adds overhead | Major gains (INT8/INT4) | High utilization via batching | TensorRT-LLM + Triton |
| Scalability | Horizontal (replicas) | Per-GPU efficiency | Dynamic batching + multi-model | Triton + Kubernetes |
| Accuracy / Faithfulness | Improves via fact-checking | Can regress — must re-eval | Neutral | Re-evaluate after quantization |
| Operational Complexity | Policy maintenance | Quantization tuning + testing | Serving config & monitoring | Start simple, iterate |
Key principles for production agentic systems
Consolidated practice recommendation
Take your existing stateful ReAct + RAG + memory agent and productionize it: add NeMo Guardrails with Colang policies for PII, input filtering, and fact-checking; serve the core LLM via TensorRT-LLM (try INT8) behind Triton with dynamic batching; optionally containerize as a NIM-style microservice; run your full evaluation pipeline under load and compare safety, accuracy, latency, and throughput vs. the unoptimized baseline.
NVIDIA’s Nemotron 3 model family (Nano, Super, and Ultra sizes) launched as an efficiency-focused open-model line purpose-built for agentic AI, using a hybrid mixture-of-experts architecture that NVIDIA reports delivers roughly 4x higher throughput than the prior Nemotron 2 Nano generation, aimed specifically at high-token-per-second multi-agent workloads [21]. In April 2026, NVIDIA released Nemotron 3 Nano Omni, a multimodal model unifying vision, audio, and language understanding in a single efficient model for agent reasoning across document, video, and audio inputs, reporting leaderboard-topping results across six benchmarks [22]. All Nemotron 3 models are distributed as optimized NVIDIA NIM microservices, reinforcing the NIM containerized-deployment pattern as the standard path from model release to production serving on NVIDIA infrastructure.
Ongoing operation, monitoring, and maintenance of agentic systems post-deployment.
What it means
Observability across the entire agent stack — from high-level task outcomes down to individual ReAct steps, guardrail decisions, and infrastructure metrics.
Why it matters
Production agents make many sequential decisions. Without deep visibility you cannot debug failures, measure true performance, detect drift, or prove compliance.
Layers of observability (best practice)
Integration
Full traces should capture when NeMo Guardrails blocked or rewrote content, when checkpoints were created, and the exact retrieved documents used in RAG.
Common failure modes
Siloed observability (infra metrics separate from agent traces); missing context in logs/traces (no correlation IDs); no alerting on safety or compliance signals.
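The correlation-ID problem is easiest to understand by looking at what a tracing library does under the hood. The sketch below is a standard-library stand-in, not the OpenTelemetry API: the `span` helper, the span names, and the attribute keys are all hypothetical. It shows how every nested step inherits one trace ID and records its parent.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []  # a real exporter would ship these to a tracing backend


@contextmanager
def span(name, **attributes):
    """Record a span that inherits the trace ID of its parent, if any."""
    parent = _current_span.get()
    record = {
        "name": name,
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "attributes": dict(attributes),
        "start": time.time(),
    }
    token = _current_span.set(record)
    try:
        yield record
    finally:
        record["end"] = time.time()
        _current_span.reset(token)
        finished_spans.append(record)


with span("agent.task", task="answer question"):
    with span("react.step", step=1):
        with span("rag.retrieve", top_k=5) as s:
            s["attributes"]["doc_ids"] = ["doc_a", "doc_c"]  # exact documents used
        with span("guardrail.check", rail="output") as s:
            s["attributes"]["blocked"] = False
```

Because retrieval, guardrail, and reasoning spans share a trace ID, a single query in the backend can reconstruct the whole task, including which documents were retrieved and whether a rail intervened.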
Practice exercise
Instrument your deployed agent with OpenTelemetry tracing. Export traces that span a full ReAct cycle + guardrail checks + RAG retrieval. Build a simple Grafana dashboard showing task success rate, average steps per task, and safety violation rate. Add alerting on high safety violation rates or latency spikes.
What it means
Automated, recurring execution of evaluation pipelines against golden datasets (and production shadow traffic) whenever any component changes.
Why it matters
It catches regressions early (accuracy drop after quantization, safety gap after policy change, latency regression after scaling change) before they reach users.
How it works
Integration
Continuous benchmarking should include your stateful orchestration recovery tests, guardrail effectiveness, RAG faithfulness, and end-to-end performance under load.
Practice exercise
Extend your CI pipeline to run the full evaluation suite (including safety and guardrail metrics) on every merge to main. Store results historically and add a step that fails the pipeline if key metrics regress beyond a threshold.
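The pipeline-failing step can be as simple as a comparison against stored baseline metrics. The sketch below is one way to express it. The metric names, tolerances, and numbers are invented for illustration and should be replaced by whatever your evaluation suite reports.

```python
def find_regressions(baseline, current, rules):
    """Compare metric dicts and return human-readable failures.

    rules maps metric -> (direction, tolerance), where direction is
    "higher" (bigger is better) or "lower" (smaller is better) and
    tolerance is the allowed absolute worsening.
    """
    failures = []
    for metric, (direction, tolerance) in rules.items():
        old, new = baseline[metric], current[metric]
        worsening = (old - new) if direction == "higher" else (new - old)
        if worsening > tolerance:
            failures.append(f"{metric}: {old} -> {new} (allowed worsening {tolerance})")
    return failures


rules = {
    "task_success_rate": ("higher", 0.02),
    "safety_violation_rate": ("lower", 0.0),  # no worsening tolerated
    "p95_latency_s": ("lower", 0.5),
}
baseline = {"task_success_rate": 0.86, "safety_violation_rate": 0.01, "p95_latency_s": 4.2}
current = {"task_success_rate": 0.81, "safety_violation_rate": 0.01, "p95_latency_s": 4.4}

failures = find_regressions(baseline, current, rules)
for line in failures:
    print("REGRESSION", line)
```

In a CI job, a non-empty `failures` list would translate into a non-zero exit code so the merge is blocked. Keeping tolerances per metric lets safety metrics use a zero-tolerance rule while noisier metrics such as latency get some slack.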
The industry has converged on OpenTelemetry as the standard telemetry layer for agentic systems: as of the v1.41 GenAI semantic conventions, the spec defines agent, workflow, tool, and model spans plus required latency and token-usage metrics (though many gen_ai.* attributes remain marked “Development stability” and can still change) [31][32]. Major observability vendors (Datadog, Honeycomb, New Relic) and agent frameworks (LangChain, CrewAI, AutoGen/AG2) now emit or ingest OTel-compliant spans natively, so traces from different frameworks can flow into a single backend without custom parsing. Each tool call, LLM invocation, and retrieval step becomes its own child span, producing a full reasoning-chain trace. The main gap identified in 2026 observability reviews is integration rather than depth: individual layers (infrastructure, application, business/safety) are well instrumented, but unifying them into a single, coherent operational picture remains an open challenge [32].
Principles and practices that ensure agentic AI systems operate responsibly, uphold ethical standards, and comply with legal and regulatory frameworks.
What it means
Proactive controls and processes to prevent harm, ensure regulatory compliance, and reduce unfair or biased behavior in agentic systems.
Core areas
Integration with prior topics
NeMo Guardrails + Colang policies are the primary enforcement layer. Stateful orchestration with interrupts enables required human oversight. Evaluation pipelines must track safety and bias metrics. RAG and memory systems must support data subject rights.
Common failure modes
Treating safety as a one-time prompt instead of ongoing policy + monitoring; incomplete audit trails that fail compliance reviews; bias that only appears in long-horizon or multi-agent scenarios.
Evaluation dimensions
Safety violation/block rate, compliance audit pass rate, bias metrics (where measurable), time-to-detect and remediate issues, and coverage of high-risk paths by human oversight.
Practice exercise
Expand your Colang policies to cover prompt injection patterns and tool abuse prevention. Add bias monitoring to your evaluation dashboard (e.g., outcome distribution across synthetic demographic variations in test cases). Document how your system would respond to a GDPR data deletion request.
Governance frameworks have struggled to keep pace with agentic AI specifically: as of 2026, the NIST AI Risk Management Framework, ISO/IEC 42001, and the EU AI Act (Regulation 2024/1689) contain no explicit references to “agent,” “agentic,” or autonomous AI systems, and the EU AI Act does not address multi-agent risk, instead targeting the foundation-model layer rather than systems-level agentic behavior [25][24]. Agentic systems introduce risks that these frameworks were not written for: acting on external systems, accessing tools dynamically, executing multi-step plans where errors can cascade, maintaining persistent memory vulnerable to manipulation, and delegating across agent boundaries in ways that fragment accountability [25]. Enforcement of the EU AI Act’s high-risk provisions began in August 2026, with penalties reaching €35 million or 7% of global turnover; autonomous agents taking consequential actions (financial transactions, medical decisions, legal submissions) are likely to be classified as high-risk, triggering requirements for human oversight, auditability, and conformity assessment — and in a chain of cooperating agents, the compliance boundary extends to every agent performing a high-risk function [24][23]. Singapore’s IMDA published a Model AI Governance Framework for Agentic AI in January 2026 — described as the first comprehensive governance framework specifically for autonomous agents — requiring each agent to carry a verifiable digital identity and maintain an audit trail [23].
The design and implementation of systems that facilitate effective human oversight and interaction with agents.
What it means
Designing interfaces and processes so humans can effectively collaborate with, oversee, correct, and trust agentic systems.
Key elements
Integration
HITL via LangGraph interrupts + NeMo Guardrails creates enforceable human oversight points. Feedback can be written back into long-term memory or used to improve RAG indexes/evaluation sets. Transparency builds trust and aids debugging.
Common failure modes
Overwhelming users with raw traces instead of summarized, actionable views; feedback that is collected but never used (no closed loop); opaque “black box” behavior that erodes trust even when the agent is correct.
Evaluation dimensions
Human approval rate and latency, correction acceptance rate, user trust/satisfaction scores, time saved vs. fully manual process, and escalation rate to humans.
Practice exercise
Build a simple Streamlit or Gradio interface for your agent that shows: reasoning trace, retrieved sources, guardrail decisions, and proposed action. Add buttons for “Approve & Execute”, “Reject & Explain Why”, and “Edit & Retry”. Log all feedback and demonstrate writing it back into long-term memory.
The EU AI Act’s Article 14, enforceable from August 2, 2026, mandates human oversight capabilities for high-risk AI systems. Practitioner guidance reads both the Act and NIST’s voluntary AI RMF as calling for oversight that is demonstrably “trained, measurable, and provable” rather than a nominal checkbox [27]. Practitioner and research literature increasingly distinguishes Human-in-the-Loop (HITL), where the human is inside the control loop and the agent blocks pending explicit approval, from Human-on-the-Loop (HOTL), where the human monitors execution and intervenes only when anomalies appear [26][27]. A widely adopted production pattern tiers every agent action by its reversibility and “blast radius,” assigning a different oversight mode to each tier, with the agent routing to human review whenever its own confidence in an action falls below a defined threshold [26]. There is also active debate about the limits of pure HITL at scale: some 2026 commentary argues that human-in-the-loop review “hits a wall” as agent volume grows, and proposes AI-assisted oversight of other AI agents (with humans overseeing the oversight layer) as a complementary pattern for high-throughput deployments [28].
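The tiering pattern can be written down as a small routing function. The sketch below is one possible reading of it: the three blast-radius tiers, the 0.8 confidence threshold, and the mode names are assumptions made for illustration, not values prescribed by the cited sources.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str
    reversible: bool
    blast_radius: str  # "low", "medium", or "high" (hypothetical tiers)
    confidence: float  # agent's self-reported confidence in [0, 1]


def oversight_mode(action, min_confidence=0.8):
    """Return "auto", "hotl" (human monitors), or "hitl" (block for approval)."""
    if action.confidence < min_confidence:
        return "hitl"  # low confidence always escalates to a human
    if not action.reversible or action.blast_radius == "high":
        return "hitl"  # irreversible or wide impact: approval first
    if action.blast_radius == "medium":
        return "hotl"  # proceed, but a human monitors execution
    return "auto"      # reversible, low impact, confident


examples = [
    ProposedAction("search_docs", True, "low", 0.95),
    ProposedAction("update_ticket", True, "medium", 0.90),
    ProposedAction("wire_transfer", False, "high", 0.99),
    ProposedAction("search_docs", True, "low", 0.40),
]
modes = [oversight_mode(a) for a in examples]
print(modes)  # ['auto', 'hotl', 'hitl', 'hitl']
```

Checking confidence first means even a normally automatic action is escalated when the agent is unsure. Keeping the policy in one pure function also makes it straightforward to unit-test and to log alongside each decision for audit purposes.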
Synthesis checklist (answer these confidently)
Recommended final activities
Overall practice recommendation
Take your stateful ReAct research/decision agent, add long-term memory, implement at least one advanced planning variant (e.g., Reflexion-style reflection or decomposition), tune key parameters, and run a full evaluation pipeline with success, latency, cost, and safety metrics. Perform a small A/B comparison of two configurations. Then productionize it end to end: containerize with Docker/Kubernetes, add NeMo Guardrails policies, serve via TensorRT-LLM/Triton/NIM, instrument OpenTelemetry-based observability, and build a human-in-the-loop review interface. A single running project exercising all ten domains — architecture, development, evaluation, deployment, cognition/memory, knowledge integration, NVIDIA platform tooling, monitoring, safety/compliance, and human oversight — is the strongest possible portfolio piece and study aid.
[1] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. https://arxiv.org/abs/2210.03629
[2] Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366. https://arxiv.org/abs/2303.11366
[3] Yehudai, A., Eden, L., Li, A., Uziel, G., Zhao, Y., Bar-Haim, R., Cohan, A., & Shmueli-Scheuer, M. (2025). Survey on Evaluation of LLM-based Agents. arXiv:2503.16416. https://arxiv.org/abs/2503.16416
[4] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401. https://arxiv.org/abs/2005.11401
[5] Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). arXiv:2212.10496. https://arxiv.org/abs/2212.10496
[6] AppScale Blog (2026). Durable Execution for LLM Agents 2026: Temporal + LangGraph. https://appscale.blog/en/blog/durable-execution-llm-agents-temporal-langgraph-checkpointing-2026; LangChain Blog. Building LangGraph: Designing an Agent Runtime from First Principles. https://blog.langchain.com/building-langgraph/
[7] Benchmarkingagents.com (2026). AI Agent Benchmarks 2026 — SWE-bench, WebArena, AgentBench, Terminal-Bench, OSWorld, Tau-Bench. https://benchmarkingagents.com/agent-benchmarks/; MarkTechPost (2026). Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models. https://www.marktechpost.com/2026/04/26/top-7-benchmarks-that-actually-matter-for-agentic-reasoning-in-large-language-models/
[8] Mem0 (2026). AI Agent Memory 2026: Progress Benchmark Report Evaluations. https://mem0.ai/blog/state-of-ai-agent-memory-2026; Mem0 (2026). Graph-Based Memory Solutions for AI Context: Top 5 Compared. https://mem0.ai/blog/graph-memory-solutions-ai-agents
[9] Srinivasan, A. (2026). All You Need to Know About RAG (in 2026). AI with Aish, Substack. https://aishwaryasrinivasan.substack.com/p/all-you-need-to-know-about-rag-in
[10] Zilliz (2026). Better RAG with HyDE — Hypothetical Document Embeddings. https://zilliz.com/learn/improve-rag-and-information-retrieval-with-hyde-hypothetical-document-embeddings
[11] Oracle Database Blog. Using HNSW Vector Indexes in AI Vector Search. https://blogs.oracle.com/database/using-hnsw-vector-indexes-in-ai-vector-search; Using IVF Vector Indexes in AI Vector Search. https://blogs.oracle.com/database/using-ivf-vector-indexes; Indexing Guidelines with AI Vector Search. https://blogs.oracle.com/database/indexing-guidelines-with-ai-vector-search
[12] Benchmarkingagents.com (2026). Agent Benchmark Leaderboard 2026: AgentBench, SWE-bench, GAIA. https://benchmarkingagents.com/benchmarks-list/
[13] MarkTechPost (2026). Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models. https://www.marktechpost.com/2026/04/26/top-7-benchmarks-that-actually-matter-for-agentic-reasoning-in-large-language-models/
[14] MachineLearningMastery.com (2026). 7 Agentic AI Trends to Watch in 2026. https://machinelearningmastery.com/7-agentic-ai-trends-to-watch-in-2026/
[15] A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology. arXiv:2605.13850. https://arxiv.org/pdf/2605.13850
[16] Agentic Design Patterns: A System-Theoretic Framework. arXiv:2601.19752. https://arxiv.org/pdf/2601.19752
[17] Augment Code (2026). What Are Agentic Design Patterns? 2026 Pattern Catalog. https://www.augmentcode.com/guides/agentic-design-patterns
[18] Google Cloud Architecture Center (2026). Choose a Design Pattern for Your Agentic AI System. https://docs.cloud.google.com/architecture/choose-design-pattern-agentic-ai-system
[19] OpenAI (2026). The Next Evolution of the Agents SDK. https://openai.com/index/the-next-evolution-of-the-agents-sdk/; OpenAI (2026). Introducing AgentKit. https://openai.com/index/introducing-agentkit/
[20] Google. Agent Development Kit (ADK). https://adk.dev/; LangChain (2026). The Best AI Agent Frameworks in 2026. https://www.langchain.com/resources/ai-agent-frameworks
[21] NVIDIA Newsroom (2026). NVIDIA Debuts Nemotron 3 Family of Open Models. https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models
[22] NVIDIA Developer Blog (2026). NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model. https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/; NVIDIA Blog (2026). NVIDIA Launches Nemotron 3 Nano Omni Model, Unifying Vision, Audio and Language for AI Agents. https://blogs.nvidia.com/blog/nemotron-3-nano-omni-multimodal-ai-agents/
[23] Artificial Intelligence News (2026). Agentic AI’s Governance Challenges Under the EU AI Act in 2026. https://www.artificialintelligence-news.com/news/agentic-ais-governance-challenges-under-the-eu-ai-act-in-2026/
[24] UC Berkeley Law, Berkeley Center for Law & Technology (2026). EU AI Act Risk Tiers, GDPR Data Minimization, and U.S. State Law Converge on Agentic AI Compliance in 2026. https://www.law.berkeley.edu/research/bclt/bclt-legal-analysis/eu-ai-act/
[25] Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents. arXiv:2505.02077. https://arxiv.org/pdf/2505.02077
[26] Galileo AI (2026). How to Build Human-in-the-Loop Oversight for AI Agents. https://galileo.ai/blog/human-in-the-loop-agent-oversight
[27] Digital Applied (2026). Human-in-the-Loop Escalation Design for AI Agents 2026. https://www.digitalapplied.com/blog/human-in-the-loop-escalation-design-ai-agents-2026
[28] SiliconANGLE (2026). Human-in-the-Loop Has Hit the Wall. It’s Time for AI to Oversee AI. https://siliconangle.com/2026/01/18/human-loop-hit-wall-time-ai-oversee-ai/
[29] CNCF Blog (2026). The Great Migration: Why Every AI Platform Is Converging on Kubernetes. https://www.cncf.io/blog/2026/03/05/the-great-migration-why-every-ai-platform-is-converging-on-kubernetes/
[30] Epsilla Blog (2026). Scaling Autonomous AI Agents: Kubernetes, Runtimes, and Architecture Insights. https://www.epsilla.com/blogs/scaling-autonomous-ai-agents-kubernetes-runtimes-architecture
[31] OpenTelemetry Blog (2025). AI Agent Observability — Evolving Standards and Best Practices. https://opentelemetry.io/blog/2025/ai-agent-observability/
[32] Uptrace (2026). OpenTelemetry for AI Systems: LLM and Agent Observability. https://uptrace.dev/blog/opentelemetry-ai-systems; Digital Applied (2026). AI Agent Observability 2026: Tracing & Monitoring Stack. https://www.digitalapplied.com/blog/ai-agent-observability-2026-tracing-monitoring-stack-guide
[33] Zylos Research (2026). AI Reasoning Models 2026: From OpenAI o3 to DeepSeek-R1 and the Test-Time Compute Revolution. https://zylos.ai/research/2026-01-24-ai-reasoning-models
[34] Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents. arXiv:2509.03581. https://arxiv.org/pdf/2509.03581
Source note: This document reorganizes and cleans up a prior study-guide conversation (originally generated with Grok/SuperGrok) into the ten-domain exam structure above. All original substantive content — explanations, benchmarks, tables, practice exercises, and checklists — has been preserved; only conversational scaffolding (thinking-time markers, source counts, follow-up prompts, and repeated instructions) was removed. Footnote-style references from the original transcript have been resolved to full citations above, and a “Recent Developments (2026)” subsection was added to every domain with newly verified sources.