In 2026, AI agents have evolved from toy-level demos to production-line applications. More teams are deploying agents for real engineering tasks — from code generation to design-to-code conversion, automated testing to documentation. Yet when agents enter production environments, a series of hidden engineering pitfalls can drastically reduce efficiency or cause outright failure.
Developer Kieran Zhang (@ninthbit_ai) recently completed a Figma design-to-code conversion project and distilled four hard-won lessons about production agentic engineering. These insights align closely with industry research and deserve careful consideration from every AI engineer.
Pitfall 1: Making LLMs Do Dirty Work — Tokens Are Expensive, Attention Is Precious
This pitfall isn't about shifting grunt work back to humans. It's about ensuring that on a real production line, all fixed, template-like work that drains an agent's attention should be converted into scripts and code.
Typical dirty work includes:
- Scaffolding project initialization
- SDK path configuration and environment variable setup
- Deterministic external tool invocations
- File system operations and directory structure creation
These tasks fundamentally don't require reasoning or decision-making. When handed to an LLM, they consume significant tokens and attention window space, degrading performance on the core tasks that actually need reasoning.
Data Point: Exponential Token Cost Growth
Research on token-efficient agent patterns shows that in a typical 10-step agent workflow with a 4,000-token system prompt and 500-token tool outputs per step, cumulative input tokens exceed 40,000 by the final step. When a large share of those tokens is wasted on template operations, costs spiral out of control.
Best Practices
| Approach | Effect |
|---|---|
| Wrap scaffolding initialization in CLI scripts | Reduces agent context load |
| Use prompt caching (e.g., Anthropic's cache_control) | Approximately 88% token reduction |
| Encode deterministic workflows as fixed pipeline steps | Frees LLM reasoning for core decisions |
| Apply rolling summarization for conversation history | Reduces context bloat per call |
Core principle: Focus LLM attention on the most critical workflows. Only use it for tasks that require reasoning and decision-making.
Pitfall 2: Skills Are Not Silver Bullets — They're Soft Constraints
Many teams pin their hopes on carefully crafted skills (system instructions, role definitions, behavioral constraints) to control agent behavior. The reality is that once a long workflow far exceeds the LLM's effective attention span, the model will inevitably "forget" early instructions.
This isn't the model "deciding" to skip constraints. It's the inevitable result of attention weight dilution in Transformer architectures.
Research Evidence: Context Rot
ToolHalla and Zylos Research's 2026 studies identified four context failure modes that plague AI agents:
| Failure Mode | Manifestation |
|---|---|
| Context Poisoning | Hallucinations enter context and propagate as ground truth in subsequent steps |
| Context Distraction | Accumulated information overwhelms training-time knowledge; accuracy drops notably around 32K tokens |
| Context Confusion | Irrelevant information influences decisions; more tools can mean worse performance |
| Context Clash | Different parts of the context contain contradictory information |
Chroma Research's 2025 study found that across 18 tested models, all showed 20–50% accuracy drops between 10K and 100K tokens, with degradation typically hitting a cliff around 32K–64K tokens. This explains why a carefully set skill constraint works perfectly in short workflows but gets "forgotten" in longer ones.
Mitigation Strategy: Split the Pipeline (at a Cost)
When skills fail, the only option is to split long workflows into multiple shorter pipeline nodes. But splitting has a cost — more context must be transmitted between nodes, increasing engineering complexity.
Key insight: A focused 20K-token context almost always outperforms a bloated 200K-token context. The 2026 best practice is context engineering — strategically deciding what enters the context, what stays, and what gets removed or compressed.
Pitfall 3: Without Observability, Test Suites Are Just Burning Money
Observability means being able to see every decision, tool call, and result from your agent.
Coding agents like Claude Code and Codex let developers observe agent behavior in real-time through REPL interactions. But when running test suites at scale, the agent becomes a black box — without observability, you'll see a stream of failed test cases and growing token bills, but never gain the insight needed to improve the process.
Why Observability Is the Highest-Priority Infrastructure
As developer 卡颂 (@kasong2048) pointed out: the highest-priority infrastructure for agent engineering (Harness Engineering) is observability — recording all agent process logs, then using a cost-effective model (like DeepSeek) to generate process summaries and tag critical path labels.
Without observability, running 10 failed test cases gives you nothing but 10 rounds of million-token burn. With observability, those same 10 cases provide the critical insights needed to solve problems.
Industry Data
AI Agents Plus's 2026 production guide shows that organizations with mature monitoring practices achieve:
| Metric | Improvement |
|---|---|
| Incident resolution speed | 80% faster |
| Production issues | 50% reduction |
| Resource optimization | 30% cost savings |
Observability Toolchain
Leading agent observability tools include:
- LangSmith — LangChain's official tracing and evaluation platform with distributed tracing, per-step token usage, and latency analysis
- Langfuse — Open-source LLM observability platform supporting OpenTelemetry standards
- TraceHawk — An emerging tool focused on agent behavior pattern analysis
Three pillars of observability:
- Distributed Tracing — Capture the full execution lifecycle including every LLM call, tool use, and decision point
- Structured Logging — Record what happened at each step
- Aggregate Performance Metrics — Monitor latency (P50/P95/P99), token consumption, throughput, and error rates
Pitfall 4: Don't Just Supervise — Think Deeply
The final pitfall is the easiest to overlook: some paths you haven't walked yourself, the AI won't know either.
When an agent gets stuck and can't make progress, don't rely solely on the LLM to find the path and formulate a plan. Humans shouldn't just be supervisors — they should be deep thinkers and guides.
Kieran Zhang encountered a classic case in his Figma-to-code project: converting absolute layouts to responsive layouts. The LLM's proposed solutions consistently fell short of expectations. He ultimately chose to rewrite the solution manually, understood the problem's essence, then distilled the solution into generalizable steps and fed them back to the agent.
Why Human Thinking Remains Irreplaceable
OpenAI's Harness Engineering paper describes a compelling case: three engineers generated over 1 million lines of code through AI agents in five months. But the key wasn't letting agents operate autonomously — it was the engineers providing carefully designed constraint systems, architectural decisions, and quality gates.
| Role | Human Engineers | AI Agents |
|---|---|---|
| Architecture design | Define overall architecture and module boundaries | Implement specific features within constraints |
| Critical path exploration | Personally validate unproven technical paths | Execute at scale on validated paths |
| Quality standards | Define acceptance criteria and evaluation methods | Execute against standards and self-evaluate |
| Exception handling | Identify and solve novel problems | Handle known pattern problems |
Core takeaway: Human value lies not in supervising every agent step, but in doing what agents cannot — deep thinking, path exploration, and knowledge distillation.
Summary: Four Principles for Production-Grade Agent Engineering
| Pitfall | Principle |
|---|---|
| Making LLMs do dirty work | Script template work to free tokens and attention for core reasoning |
| Using skills as silver bullets | Recognize skills as soft constraints; use pipeline splitting and context engineering |
| Lacking observability | Build observability infrastructure so every failure yields improvement insights |
| Only supervising | Think deeply, personally validate critical paths, then let agents scale execution |
Productionizing AI agents isn't simply "letting AI do the work" — it requires a complete engineering methodology. Token optimization, context engineering, observability, and human-agent collaboration are all essential. As the harness engineering community agrees: in a world where models are becoming commodities, your engineering practices are the real moat.
References
- Kieran Zhang (@ninthbit_ai) — Agentic Production Engineering: 4 Pitfalls
- 8 Production Patterns for Token-Efficient Agentic AI — Medium
- Context Rot: Why Your AI Agent Gets Dumber Over Time — ToolHalla
- Context Engineering for AI Agents: The Complete Guide — ToolHalla
- AI Agent Monitoring & Observability: Production Guide 2026 — AI Agents Plus
- LangGraph + LangSmith: Building & Monitoring Production Multi-Agent Systems — Idea to MVP
- Harness Engineering: Leveraging Codex in an Agent-First World — OpenAI
- Figma Opens Design Files to AI Coding Agents Through MCP — CXO Digital Pulse
