Four Pitfalls of Production-Grade Agentic Engineering: Lessons from a Figma-to-Code Project

In 2026, AI agents have evolved from toy-level demos to production-line applications. More teams are deploying agents for real engineering tasks — from code generation to design-to-code conversion, automated testing to documentation. Yet when agents enter production environments, a series of hidden engineering pitfalls can drastically reduce efficiency or cause outright failure.

Developer Kieran Zhang (@ninthbit_ai) recently completed a Figma design-to-code conversion project and distilled four hard-won lessons about production agentic engineering. These insights align closely with industry research and deserve careful consideration from every AI engineer.

Pitfall 1: Making LLMs Do Dirty Work — Tokens Are Expensive, Attention Is Precious

This pitfall isn't about shifting grunt work back to humans. It's about ensuring that on a real production line, all fixed, template-like work that drains an agent's attention should be converted into scripts and code.

Typical dirty work includes:

Scaffolding project initialization
SDK path configuration and environment variable setup
Deterministic external tool invocations
File system operations and directory structure creation

These tasks fundamentally don't require reasoning or decision-making. When handed to an LLM, they consume significant tokens and attention window space, degrading performance on the core tasks that actually need reasoning.

Data Point: Exponential Token Cost Growth

Research on token-efficient agent patterns shows that in a typical 10-step agent workflow with a 4,000-token system prompt and 500-token tool outputs per step, cumulative input tokens exceed 40,000 by the final step. When a large share of those tokens is wasted on template operations, costs spiral out of control.

Best Practices

Approach	Effect
Wrap scaffolding initialization in CLI scripts	Reduces agent context load
Use prompt caching (e.g., Anthropic's cache_control)	Approximately 88% token reduction
Encode deterministic workflows as fixed pipeline steps	Frees LLM reasoning for core decisions
Apply rolling summarization for conversation history	Reduces context bloat per call

Core principle: Focus LLM attention on the most critical workflows. Only use it for tasks that require reasoning and decision-making.

Pitfall 2: Skills Are Not Silver Bullets — They're Soft Constraints

Many teams pin their hopes on carefully crafted skills (system instructions, role definitions, behavioral constraints) to control agent behavior. The reality is that once a long workflow far exceeds the LLM's effective attention span, the model will inevitably "forget" early instructions.

This isn't the model "deciding" to skip constraints. It's the inevitable result of attention weight dilution in Transformer architectures.

Research Evidence: Context Rot

ToolHalla and Zylos Research's 2026 studies identified four context failure modes that plague AI agents:

Failure Mode	Manifestation
Context Poisoning	Hallucinations enter context and propagate as ground truth in subsequent steps
Context Distraction	Accumulated information overwhelms training-time knowledge; accuracy drops notably around 32K tokens
Context Confusion	Irrelevant information influences decisions; more tools can mean worse performance
Context Clash	Different parts of the context contain contradictory information

Chroma Research's 2025 study found that across 18 tested models, all showed 20–50% accuracy drops between 10K and 100K tokens, with degradation typically hitting a cliff around 32K–64K tokens. This explains why a carefully set skill constraint works perfectly in short workflows but gets "forgotten" in longer ones.

Mitigation Strategy: Split the Pipeline (at a Cost)

When skills fail, the only option is to split long workflows into multiple shorter pipeline nodes. But splitting has a cost — more context must be transmitted between nodes, increasing engineering complexity.

Key insight: A focused 20K-token context almost always outperforms a bloated 200K-token context. The 2026 best practice is context engineering — strategically deciding what enters the context, what stays, and what gets removed or compressed.

Pitfall 3: Without Observability, Test Suites Are Just Burning Money

Observability means being able to see every decision, tool call, and result from your agent.

Coding agents like Claude Code and Codex let developers observe agent behavior in real-time through REPL interactions. But when running test suites at scale, the agent becomes a black box — without observability, you'll see a stream of failed test cases and growing token bills, but never gain the insight needed to improve the process.

Why Observability Is the Highest-Priority Infrastructure

As developer 卡颂 (@kasong2048) pointed out: the highest-priority infrastructure for agent engineering (Harness Engineering) is observability — recording all agent process logs, then using a cost-effective model (like DeepSeek) to generate process summaries and tag critical path labels.

Without observability, running 10 failed test cases gives you nothing but 10 rounds of million-token burn. With observability, those same 10 cases provide the critical insights needed to solve problems.

Industry Data

AI Agents Plus's 2026 production guide shows that organizations with mature monitoring practices achieve:

Metric	Improvement
Incident resolution speed	80% faster
Production issues	50% reduction
Resource optimization	30% cost savings

Observability Toolchain

Leading agent observability tools include:

LangSmith — LangChain's official tracing and evaluation platform with distributed tracing, per-step token usage, and latency analysis
Langfuse — Open-source LLM observability platform supporting OpenTelemetry standards
TraceHawk — An emerging tool focused on agent behavior pattern analysis

Three pillars of observability:

Distributed Tracing — Capture the full execution lifecycle including every LLM call, tool use, and decision point
Structured Logging — Record what happened at each step
Aggregate Performance Metrics — Monitor latency (P50/P95/P99), token consumption, throughput, and error rates

Pitfall 4: Don't Just Supervise — Think Deeply

The final pitfall is the easiest to overlook: some paths you haven't walked yourself, the AI won't know either.

When an agent gets stuck and can't make progress, don't rely solely on the LLM to find the path and formulate a plan. Humans shouldn't just be supervisors — they should be deep thinkers and guides.

Kieran Zhang encountered a classic case in his Figma-to-code project: converting absolute layouts to responsive layouts. The LLM's proposed solutions consistently fell short of expectations. He ultimately chose to rewrite the solution manually, understood the problem's essence, then distilled the solution into generalizable steps and fed them back to the agent.

Why Human Thinking Remains Irreplaceable

OpenAI's Harness Engineering paper describes a compelling case: three engineers generated over 1 million lines of code through AI agents in five months. But the key wasn't letting agents operate autonomously — it was the engineers providing carefully designed constraint systems, architectural decisions, and quality gates.

Role	Human Engineers	AI Agents
Architecture design	Define overall architecture and module boundaries	Implement specific features within constraints
Critical path exploration	Personally validate unproven technical paths	Execute at scale on validated paths
Quality standards	Define acceptance criteria and evaluation methods	Execute against standards and self-evaluate
Exception handling	Identify and solve novel problems	Handle known pattern problems

Core takeaway: Human value lies not in supervising every agent step, but in doing what agents cannot — deep thinking, path exploration, and knowledge distillation.

Summary: Four Principles for Production-Grade Agent Engineering

Pitfall	Principle
Making LLMs do dirty work	Script template work to free tokens and attention for core reasoning
Using skills as silver bullets	Recognize skills as soft constraints; use pipeline splitting and context engineering
Lacking observability	Build observability infrastructure so every failure yields improvement insights
Only supervising	Think deeply, personally validate critical paths, then let agents scale execution

Productionizing AI agents isn't simply "letting AI do the work" — it requires a complete engineering methodology. Token optimization, context engineering, observability, and human-agent collaboration are all essential. As the harness engineering community agrees: in a world where models are becoming commodities, your engineering practices are the real moat.