Third Book: Principles of Building AI Agents by Sam Bhagwat
By Avik Mukherjee | Apr 18, 2026 · 15 min read · Updated Apr 18, 2026
After finishing Clean Architecture and Minimal Modeling, I wanted a book that does for AI agents what those books did for software and databases: less hype, more engineering.
Sam Bhagwat's Principles of Building AI Agents is that kind of book.
It is not "how to build your first chatbot" content. It assumes you already know your way around APIs, code, and production trade-offs. The value is in structure: it gives you a map from your first model call all the way to multi-agent systems, RAG, evals, and deployment.
The best part is that it treats agent development like real software engineering. You do not just prompt and pray. You define tools. You manage memory. You instrument traces. You evaluate changes. You ship with guardrails.
Here is everything, chapter by chapter.
Part I: Foundations - Models, Prompts, and Core Concepts
Chapter 1: A Brief History of LLMs
Bhagwat starts with the timeline that actually matters.
AI has been around for decades. But 2017 is the real inflection point because of Attention Is All You Need. Transformer architectures changed what was possible in language modeling. Everything from GPT to Claude to Gemini builds on that shift.
Then came November 2022, when ChatGPT made this mainstream overnight. The book uses this timeline to make a practical point: when you choose tooling today, you are choosing inside an ecosystem that is moving fast and still unstable.
He also gives a quick map of major providers and open-source players. That context matters because each ecosystem pushes different strengths. Cost, latency, capability, openness, and deployment model all vary by provider.
Chapter 2: Choosing a Provider and Model
This chapter answers the first practical question: what should you use on day one?
Bhagwat's recommendation is simple and good: start with hosted APIs. Do not begin by solving infra. Begin by proving product behavior.
The second recommendation is counterintuitive but correct: start with the strongest model, even if expensive. Establish the quality ceiling first. Only then optimize for speed and cost.
Key model-selection variables in this chapter:
- Raw capability vs. cost and latency
- Context window size for long inputs
- Reasoning-model behavior for hard tasks
- Vendor ecosystem and tooling maturity
He also highlights reasoning models as a different operating mode. They can be slower, but useful when you need deeper analysis. The warning: reasoning quality still depends heavily on good context and good examples.
Chapter 3: Writing Great Prompts
Prompting here is treated as engineering, not copywriting.
Bhagwat walks through zero-shot, one-shot, and few-shot prompting, with the practical point that few-shot is often worth the extra tokens in production because output quality is more stable.
One useful trick: ask the model to draft prompts for itself. Use that as a seed, then tighten it with your constraints. It is a practical way to get unstuck quickly.
The chapter also covers structure. Clear sections, explicit constraints, and strong instruction formatting generally outperform loose paragraphs. System prompts shape behavior well, but they are not a guarantee of correctness.
Chapter 4: Agents 101
This chapter draws the line between "LLM feature" and "agentic system."
A single model call is a one-shot transformation. An agent is stateful behavior over time: role, memory, tools, and iterative decision-making.
Bhagwat frames agency as a spectrum:
- Low agency: simple branching and fixed paths
- Medium agency: memory, tool use, retries
- High agency: planning and autonomous task decomposition
Most production systems still live in low-to-medium agency because predictability beats novelty when real users are involved.
He introduces Mastra as the framework used in examples. The setup is minimal, but chapter by chapter he shows where complexity really lives: instructions, tools, memory, and control flow.
Part II: Capability Layer - Tools, Memory, and Runtime Adaptation
Chapter 5: Model Routing and Structured Output
Two foundational practices show up here.
First, model routing: architect so you can switch providers and models without rewriting your app. This reduces lock-in and lets you match model cost to task complexity.
Second, structured output: force model responses into explicit schemas. If you need machine-usable results, free-form prose is a bug, not a feature.
Bhagwat uses extraction examples to show why this matters. Turning resumes, records, or long text into stable JSON structures unlocks downstream automation and removes brittle parsing logic.
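The principle can be sketched without any framework. Below is a minimal, hand-rolled validator; real projects typically reach for a schema library such as Zod, and the `ResumeExtract` shape and field names here are illustrative, not from the book:

```typescript
// Minimal sketch: validate a model's JSON reply against an explicit schema.
// The point is that free-form prose gets rejected instead of half-parsed.

interface ResumeExtract {
  name: string;
  yearsExperience: number;
  skills: string[];
}

function parseResumeExtract(raw: string): ResumeExtract {
  const data = JSON.parse(raw); // throws on non-JSON prose
  if (
    typeof data.name !== "string" ||
    typeof data.yearsExperience !== "number" ||
    !Array.isArray(data.skills) ||
    !data.skills.every((s: unknown) => typeof s === "string")
  ) {
    throw new Error("Model output did not match the ResumeExtract schema");
  }
  return data as ResumeExtract;
}
```

A conversational reply like "Sure! Here is the summary..." fails loudly at the boundary, which is exactly where you want it to fail.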
Chapter 6: Tool Calling
If structured output makes model responses useful, tool calling makes models operational.
The chapter's strongest idea is that tool design is the core product-design step in agent systems. If your tool boundaries are vague, your agent behavior will be vague too.
Principles he emphasizes:
- Tool names should communicate intent
- Tool descriptions should specify when to use them
- Input and output schemas should be explicit
- Tools should map to clear business actions
There is a good case study where an analysis problem failed when shoved into one prompt but worked when decomposed into purpose-built tools. Same model. Better architecture.
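Those four principles can be made concrete with a small sketch. The shape below is illustrative, not Mastra's actual API, and `lookup_order_status` is a hypothetical tool with a stubbed business action:

```typescript
// Sketch of an explicit tool definition: intent, usage guidance, and
// schemas are all spelled out rather than implied by a prompt.

interface ToolDef<In, Out> {
  name: string;        // communicates intent
  description: string; // tells the model when to use it
  execute: (input: In) => Out;
}

const lookupOrderStatus: ToolDef<{ orderId: string }, { status: string }> = {
  name: "lookup_order_status",
  description:
    "Use when the user asks where their order is. Input: an order ID.",
  execute: ({ orderId }) =>
    // Stubbed business action; a real tool would call the order service.
    ({ status: orderId.startsWith("A") ? "shipped" : "processing" }),
};
```

Note how the name and description do real work: a model deciding between five tools like this has far less room to be vague than one staring at a paragraph of instructions.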
Chapter 7: Agent Memory
Without memory, every interaction is first contact.
Bhagwat separates memory into persistent user profile memory and retrieval-based conversational memory. In practice, both matter: long-term personalization and short-term contextual recall.
He also covers practical memory processors:
- Token limiters to avoid context-window overflow
- Filters that remove noisy tool traces from prompt context
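A token limiter is simple enough to sketch directly. This is an illustrative processor, not the book's code: it walks history newest-first and drops the oldest messages once an estimated token budget is exhausted (the four-characters-per-token estimate is a common rough heuristic, not exact):

```typescript
// Sketch of a token-limiting memory processor: keep the most recent
// messages that fit the budget, drop older ones.

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Crude token estimate: roughly 4 characters per token.
const estimateTokens = (m: Message) => Math.ceil(m.content.length / 4);

function limitTokens(history: Message[], budget: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  // Walk newest-first so recent turns survive.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i]);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

The noisy-tool-trace filter from the list above would be a one-line sibling: filter out `role === "tool"` messages before limiting.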
The design message is pragmatic: start simple, then add memory sophistication only when behavior or cost forces you to.
Chapter 8: Dynamic Agents
Static agents are easier to debug. Dynamic agents are more adaptable.
This chapter covers runtime changes to model choice, prompt strategy, and tool access based on user tier, language, or operational context.
The trade-off is clear: more adaptability means more behavioral variance. You gain product flexibility but need better observability and stricter controls to keep behavior safe.
A good SaaS support example shows this clearly: enterprise users get deeper tooling and stronger models, while free users get faster, cheaper paths.
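That tiering decision reduces to a small runtime function. Model names, tool names, and step limits below are all illustrative placeholders:

```typescript
// Sketch of runtime agent configuration: model, tools, and step budget
// are chosen per request instead of being fixed at build time.

type Tier = "free" | "enterprise";

interface AgentConfig {
  model: string;
  tools: string[];
  maxSteps: number;
}

function configFor(tier: Tier): AgentConfig {
  if (tier === "enterprise") {
    return {
      model: "strong-reasoning-model", // quality ceiling for paying users
      tools: ["search_kb", "open_ticket", "escalate_to_human"],
      maxSteps: 10,
    };
  }
  return {
    model: "fast-cheap-model", // faster, cheaper path for free users
    tools: ["search_kb"],
    maxSteps: 3,
  };
}
```

Because the variance lives in one function, it is also the natural place to log which configuration each request received, which feeds directly into the observability requirement above.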
Chapter 9: Agent Middleware
This is the security perimeter chapter.
Middleware is where guardrails and permissions should live. That includes prompt-injection defenses, output filtering, auth checks, and policy enforcement before the core agent logic runs.
Bhagwat also makes an important point that people underestimate: agents are not passive readers. They are active searchers. If permissions are loose, they will find things users should not access.
So the rule is simple: make access policies explicit, enforce them centrally, and do not rely on obscurity.
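The pipeline shape is worth seeing once. This is a generic sketch, not any framework's API; the injection regex is a toy stand-in for real defenses, and the permission check is deliberately simplistic:

```typescript
// Sketch of a middleware pipeline run before core agent logic. Each
// middleware can reject a request by throwing; only requests that pass
// every check reach the agent.

interface AgentRequest {
  userId: string;
  input: string;
  allowedDocs: string[];
}

type Middleware = (req: AgentRequest) => AgentRequest;

const blockInjection: Middleware = (req) => {
  // Toy heuristic; real defenses layer multiple techniques.
  if (/ignore (all )?previous instructions/i.test(req.input)) {
    throw new Error("possible prompt injection");
  }
  return req;
};

const enforcePermissions: Middleware = (req) => {
  // Central policy check: the agent only ever sees documents this user
  // may read, rather than relying on obscurity.
  if (req.allowedDocs.length === 0) throw new Error("no document access");
  return req;
};

function runPipeline(req: AgentRequest, chain: Middleware[]): AgentRequest {
  return chain.reduce((r, mw) => mw(r), req);
}
```

Centralizing the chain means a new policy is one middleware away, instead of a change scattered across every agent and tool.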
Part III: Integrations - External Tools and Protocols
Chapter 10: Popular Third-Party Tools
Agents are only as useful as the systems they can touch.
This chapter surveys common categories: web search and scraping tools, browser automation, and business-system integrations. It also calls out operational pain honestly: anti-bot systems, brittle selectors, and maintenance churn.
For business workflows, direct integrations with systems like email, calendars, CRM, and Git providers are often where agent value becomes real.
He also discusses iPaaS options and gives a practical recommendation: use integration platforms first, then build custom connectors where differentiation justifies the cost.
Chapter 11: Model Context Protocol (MCP)
MCP is framed as the standard interface between agents and tools.
Before MCP, every integration was custom. After MCP, tool providers can expose capabilities once and many agent clients can consume them.
Core pieces:
- MCP servers expose tools
- MCP clients discover and invoke them
- Discovery and ecosystem registries help with adoption
The chapter is balanced about maturity: ecosystem quality is uneven, but momentum is real. The practical guidance is to use established SDKs and frameworks instead of hand-rolling protocol details.
Part IV: Workflow Control - Making Behavior Predictable
Chapter 12: Workflows 101
Purely autonomous agents are powerful but inconsistent.
Workflows are how you enforce structure: explicit steps, explicit transitions, and explicit checkpoints. Instead of one giant black-box decision, you break execution into smaller choices with clear boundaries.
This is especially useful in domains where correctness and traceability matter more than "creative" behavior.
Chapter 13: Branching, Chaining, Merging, and Conditions
This chapter covers the four building blocks of workflow graphs.
- Branching for parallel work
- Chaining for sequential pipelines
- Merging to combine parallel results
- Conditions to gate execution paths
A strong practical point here: keep each step narrow. One clear responsibility per step improves quality, debuggability, and trace clarity.
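The four primitives can be sketched as plain function combinators. This is an illustrative toy, not a real workflow engine, and the branching here runs sequentially where a real engine would run it concurrently:

```typescript
type Step<In, Out> = (input: In) => Out;

// Chaining: run two steps in sequence.
const chain = <A, B, C>(s1: Step<A, B>, s2: Step<B, C>): Step<A, C> =>
  (input) => s2(s1(input));

// Branching: fan one input out to several steps. A real engine runs
// these concurrently; a plain map keeps the idea visible.
const branch = <A, B>(...steps: Step<A, B>[]): Step<A, B[]> =>
  (input) => steps.map((s) => s(input));

// Merging: combine parallel results back into one value.
const merge = <A, B>(combine: (parts: A[]) => B): Step<A[], B> => combine;

// Conditions: gate a path; unmatched input passes through unchanged.
const when = <A>(pred: (a: A) => boolean, step: Step<A, A>): Step<A, A> =>
  (input) => (pred(input) ? step(input) : input);

// Usage: branch into two narrow analyses, then merge into one summary.
const summarize = chain(
  branch((t: string) => t.length, (t: string) => t.split(" ").length),
  merge((counts: number[]) => `chars=${counts[0]} words=${counts[1]}`),
);
```

Notice that `summarize` is itself just a `Step`, so composed workflows nest the same way single steps do.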
Chapter 14: Suspend and Resume
Not every workflow can complete in one run.
Some steps depend on humans or external systems. Suspend/resume lets you persist state, stop execution cleanly, and continue later with new input.
This is essential for human-in-the-loop systems. Otherwise you waste resources keeping long-running processes alive while waiting for approval or data.
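The mechanics reduce to persisting a snapshot and reloading it. This sketch uses an in-memory `Map` as a stand-in for a database, and the two-step approval workflow is hypothetical:

```typescript
// Sketch of suspend/resume: stop at a human-approval step, persist a
// snapshot, and pick up exactly where the run left off on resume.

interface Snapshot {
  stepIndex: number;
  draft: string;
}

const store = new Map<string, Snapshot>(); // stands in for durable storage

function runUntilApproval(id: string, input: string): "suspended" {
  const draft = `DRAFT: ${input}`;        // work done before the pause
  store.set(id, { stepIndex: 1, draft }); // persist state, stop cleanly
  return "suspended";
}

function resume(id: string, approved: boolean): string {
  const snap = store.get(id);
  if (!snap) throw new Error(`no suspended run ${id}`);
  store.delete(id);
  return approved ? `SENT: ${snap.draft}` : "discarded";
}
```

Nothing stays resident between the two calls; the process could restart in between, which is the whole point.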
Chapter 15: Streaming Updates
Latency is unavoidable. Silence is optional.
This chapter argues for streaming progress at every stage: planning updates, tool-call progress, retrieval status, and final synthesis tokens.
Users tolerate long operations much better when they can see movement. Perceived responsiveness matters almost as much as raw speed.
Chapter 16: Observability and Tracing
If you cannot see what happened, you cannot improve it.
Bhagwat treats tracing as mandatory infrastructure. You need full execution visibility: inputs, outputs, step durations, tool calls, and failure points.
OpenTelemetry comes up as the practical standard because it keeps your telemetry portable across vendors.
He also connects traces with evals: once both are wired together, regressions become diagnosable instead of mysterious.
Part V: RAG - Domain Knowledge Without Retraining
Chapter 17: RAG 101
This chapter defines the standard RAG pipeline end to end.
The sequence is familiar but important: chunk, embed, index, retrieve, optionally rerank, then synthesize. Each stage can fail in different ways, so understanding each boundary matters.
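The retrieve step in that sequence is worth demystifying. In this toy sketch the "embeddings" are fake word-count vectors over a tiny vocabulary; real pipelines use a learned embedding model, but the cosine-similarity ranking is the same idea:

```typescript
// Toy sketch of retrieval: score chunks by cosine similarity against the
// query embedding and return the top-k.

function embed(text: string, vocab: string[]): number[] {
  const words = text.toLowerCase().split(/\W+/);
  return vocab.map((v) => words.filter((w) => w === v).length);
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot === 0 ? 0 : dot / (norm(a) * norm(b));
}

function retrieve(
  query: string,
  chunks: string[],
  vocab: string[],
  k = 2,
): string[] {
  const q = embed(query, vocab);
  return chunks
    .map((c) => ({ c, score: cosine(q, embed(c, vocab)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);
}
```

Swap `embed` for a real model and the `Map`-over-chunks scan for an index lookup, and this is the skeleton of the retrieve stage.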
The key idea is straightforward: models do better when grounded in relevant, current context instead of relying only on pretrained knowledge.
Chapter 18: Choosing a Vector Database
By 2025, vector database feature sets were converging.
So Bhagwat shifts the decision criterion from "which has the most features" to "which adds the least operational complexity for your stack."
His practical heuristics:
- Already on Postgres: use pgvector
- Greenfield and want managed simplicity: choose a hosted vector DB like Pinecone
- Deep in one cloud ecosystem: use that provider's managed option
This chapter is refreshingly anti-hype.
Chapter 19: Setting Up a RAG Pipeline
This chapter is implementation detail in the best way.
It covers chunking strategies, overlap choices, indexing updates, and hybrid retrieval with metadata filters. The focus is on decisions that materially affect retrieval quality.
One recommendation worth highlighting: start with a simple, working pipeline before reaching for advanced retrieval tricks. Most teams overcomplicate too early.
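Chunking with overlap, the first of those decisions, fits in a few lines. The character-based sizes here are illustrative; real pipelines often chunk by tokens or sentences:

```typescript
// Sketch of fixed-size chunking with overlap. Overlap keeps text that
// straddles a boundary visible in two adjacent chunks, so a query can
// still match a sentence that a hard cut would have split.

function chunkText(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

For example, `chunkText("abcdefghij", 4, 2)` yields `["abcd", "cdef", "efgh", "ghij"]`: every boundary character appears in two chunks.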
Chapter 20: Alternatives to RAG
A useful contrarian chapter.
Bhagwat argues you should not default to full RAG infrastructure immediately. Sometimes larger context windows or direct tools are enough.
He outlines a practical progression:
- Try full-context loading when feasible
- Try tool-based access patterns (agentic retrieval)
- Build full RAG only when needed
This "simple to complex" path can save weeks of unnecessary pipeline work.
Part VI: Multi-Agent Systems - Coordination as Architecture
Chapter 21: Multi-Agent 101
Single agents are not always the best unit of work.
For complex tasks, specialist agents mapped to specific roles often perform better. Think department-style decomposition: planner, executor, reviewer, coordinator.
The chapter emphasizes role clarity, scoped memory, and scoped tool access. Giving every agent every tool is usually a design smell.
Chapter 22: Agent Supervisor
The supervisor pattern is the most practical multi-agent pattern in this section.
A supervisor agent delegates to specialist agents represented as tools. Specialists focus on their domain. The supervisor handles routing and orchestration.
This improves modularity: you can upgrade one specialist without redesigning the whole system.
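The pattern can be sketched with the routing decision stubbed out. In a real system the supervisor is itself an LLM choosing among specialist tools; the regex below is only a placeholder for that decision, and the specialist names are hypothetical:

```typescript
// Sketch of the supervisor pattern: specialists are exposed as callable
// units, and the supervisor's only job is routing and orchestration.

type Specialist = (task: string) => string;

const specialists: Record<string, Specialist> = {
  billing: (t) => `billing agent handled: ${t}`,
  technical: (t) => `technical agent handled: ${t}`,
};

function supervise(task: string): string {
  // Stand-in for the supervisor model's routing decision.
  const route = /invoice|refund|charge/i.test(task) ? "billing" : "technical";
  return specialists[route](task);
}
```

Upgrading the billing specialist touches one entry in `specialists`; the supervisor and the technical specialist never change.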
Chapter 23: Control Flow
This chapter argues for deliberate structure before execution.
Instead of letting agents improvise everything at runtime, you define planning checkpoints and review points where needed. Especially for high-stakes tasks, this catches bad trajectories early.
The parallel to project management is intentional: align on plan before deep execution.
Chapter 24: Workflows as Tools
Earlier chapters embedded agents inside workflows. This one flips the pattern.
You can package a full workflow as a tool, then let a higher-level agent decide when to invoke it. That gives you predictable execution internals with flexible top-level decision-making.
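A sketch makes the inversion concrete. The `Tool` shape and the three-step report workflow below are illustrative, not from the book:

```typescript
// Sketch of packaging a workflow as a tool. The workflow's internals stay
// fixed and predictable; the agent only decides whether and when to call it.

interface Tool {
  name: string;
  description: string;
  run: (input: string) => string;
}

// A deterministic three-step workflow with explicit internals.
function reportWorkflow(topic: string): string {
  const outline = `outline(${topic})`;
  const draft = `draft(${outline})`;
  return `review(${draft})`;
}

const generateReport: Tool = {
  name: "generate_report",
  description: "Runs the full outline -> draft -> review report workflow.",
  run: reportWorkflow,
};
```

From the agent's side, `generate_report` is indistinguishable from any other tool; the predictability lives entirely inside the workflow.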
This is one of the most useful composition patterns in the book.
Chapter 25: Combining the Patterns
This chapter is synthesis.
Bhagwat combines planning agents, coding agents, review loops, and workflow checkpoints into a layered architecture similar to modern code-generation products.
The key takeaway is not a single "best" architecture. It is composability. You can nest agents, tools, and workflows based on what the task needs.
Start simple, then add structure only when behavior quality or reliability requires it.
Chapter 26: Multi-Agent Standards
If multiple agents from different providers need to cooperate, standards matter.
This chapter introduces A2A (Agent2Agent) as an inter-agent communication protocol and contrasts it with MCP, which is focused on model-tool integration.
Practical point: protocol fragmentation is still real. Teams may need to support more than one standard depending on ecosystem constraints.
Part VII: Evaluation - Measuring Quality in Non-Deterministic Systems
Chapter 27: Evals 101
Traditional tests are binary. Agent behavior is probabilistic.
So evals are scored, not pass/fail. You measure tendencies, regressions, and quality shifts over time rather than expecting deterministic identity.
The chapter also explains why you need both targeted component evals and end-to-end evals. Neither alone is enough for production confidence.
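A minimal sketch shows the difference from a unit test. Here a keyword-coverage scorer stands in for an LLM judge, and the metric and threshold are illustrative:

```typescript
// Sketch of a scored (not pass/fail) eval: each case gets a 0.0-1.0 score,
// and a threshold on the average turns it into a regression signal.

interface EvalCase {
  output: string;       // what the agent produced
  mustMention: string[]; // facts a good answer should cover
}

function coverageScore({ output, mustMention }: EvalCase): number {
  const hits = mustMention.filter((k) =>
    output.toLowerCase().includes(k.toLowerCase()),
  ).length;
  return hits / mustMention.length; // a tendency, not a binary verdict
}

function evalSuite(cases: EvalCase[], threshold: number): boolean {
  const avg =
    cases.reduce((s, c) => s + coverageScore(c), 0) / cases.length;
  return avg >= threshold; // flags shifts in average quality over time
}
```

The important property is that a single imperfect answer does not fail the suite; only a drop in the aggregate does, which matches how non-deterministic systems actually degrade.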
Chapter 28: Textual Evals
This chapter covers LLM-judged output quality dimensions:
- Faithfulness and hallucination risk
- Relevance and completeness
- Context usage quality
- Tone and instruction adherence
Bhagwat positions these as practical safety rails for prompt and architecture iteration, not academic metrics for their own sake.
Chapter 29: Other Evals
Beyond text-quality grading, this chapter covers:
- Classification and extraction correctness
- Tool-call behavior correctness
- Prompt robustness under phrasing variance
- A/B testing in production
- Human review as a necessary complement
The thread through all of them is simple: measure both output quality and process quality.
Part VIII: Shipping - Development, Deployment, and What Comes Next
Chapter 30: Local Development
Good local tooling changes iteration speed dramatically.
Bhagwat highlights practical capabilities: live chat testing, workflow visualization, endpoint testing, tool playgrounds, and integrated traces/evals during development.
The security warning is important: do not put full agent logic and credentials in the browser. Keep core execution and secrets server-side.
Chapter 31: Deployment
Deploying agent systems is not identical to deploying standard web apps.
Long-running tasks, larger dependencies, and stateful workflow behavior can clash with short-timeout serverless assumptions.
The recommendation is to choose deployment targets that support durable execution and autoscaling without forcing constant timeout workarounds.
Chapter 32: Multimodal
This chapter broadens from text to image, audio, and video systems.
The picture is realistic:
- Image generation is commercially useful today
- Voice is improving, but real-time quality and turn-taking remain hard
- Video generation is promising but still operationally immature for many use cases
It is a practical maturity map, not hype.
Chapter 33: Code Generation
This chapter focuses on coding agents specifically.
Three non-negotiables:
- Tight feedback loops (generate, run, inspect, iterate)
- Sandboxed execution
- Objective quality signals from linters and type checkers
The argument is that reliable code agents are less about one perfect prompt and more about iterative control loops with strong guardrails.
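The control-loop shape is small enough to sketch. Both the generator and the checker below are stubs; in a real system the generator is a model call and the checker runs linters or a type checker inside a sandbox:

```typescript
// Sketch of the generate/check/iterate loop: retry with feedback until an
// objective quality signal passes, instead of trusting one perfect prompt.

type Generator = (task: string, feedback: string | null) => string;
type Checker = (code: string) => string | null; // null means "no issues"

function codeLoop(
  task: string,
  gen: Generator,
  check: Checker,
  maxIters = 3,
): string {
  let feedback: string | null = null;
  let code = "";
  for (let i = 0; i < maxIters; i++) {
    code = gen(task, feedback); // generate (or repair) an attempt
    feedback = check(code);     // objective signal, not model self-grading
    if (feedback === null) return code;
  }
  throw new Error(`no passing candidate after ${maxIters} iterations`);
}
```

The feedback string is the key design choice: the checker's complaint goes back into the next generation, so each pass is a repair, not a blind retry.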
Chapter 34: What's Next
The final chapter looks ahead without pretending certainty.
Bhagwat expects stronger reasoning models, better automated evaluation loops, and increased protocol maturity. He also expects security pressure to increase as agents become more widespread.
The closing stance is useful: stay humble, keep instrumenting, and keep adapting. In this field, "I know the one true pattern" ages badly.
Final Thoughts
This book sits in a useful middle ground.
It is not surface-level demo content, and it is not theory disconnected from production. It is a practical architecture guide for teams actually building and shipping agent systems.
Like any fast-moving technical book, specific tools will age. But the mental models are solid:
- Design tools deliberately
- Keep control flow explicit where reliability matters
- Treat observability and evals as core infrastructure
- Escalate complexity only when simpler approaches fail
If you are building with LLMs seriously in 2026, this is worth reading carefully, then revisiting while you build.
Let's Keep Talking
That is the third book summary in this arc.
Clean Architecture helped frame boundaries. Minimal Modeling made data design concrete. This one fills the operational gap for AI systems: how to build agents that are useful, inspectable, and deployable.
If you are building an agent right now and want to pressure-test your architecture, send it over.
Share this if it helped.
Feedback welcome. Call out mistakes. I would rather be corrected than stay wrong.
Reach me on GitHub, X, Peerlist, or LinkedIn.