The first time you connect an MCP server with fifty tools to an agent, everything works fine. The tool definitions fit comfortably in the context window, the model can reason about all available capabilities simultaneously, and tool selection feels natural. The second MCP server brings you to one hundred tools. The third to one hundred and fifty. By the time you have a fully integrated agent environment — filesystem access, web browsing, database queries, code execution, external APIs, communication integrations — you are routinely operating with three hundred to five hundred distinct tool definitions.
At that scale, the naive approach of sending every tool definition in every context window stops working. It degrades performance in ways that are worth understanding precisely, and the solution — semantic tool retrieval at query time, commonly called Tool RAG — involves architectural decisions that affect agent behavior in non-obvious ways.
The Context Window Bottleneck
Modern frontier models support context windows measured in hundreds of thousands of tokens. On paper, five hundred tool definitions sounds tractable. In practice, the problem is not raw token capacity — it is attention quality.
A tool definition in the MCP format includes a name, a description, parameter schemas with types and descriptions, and sometimes usage examples. A well-documented tool definition runs 200-500 tokens.
| Tool Count | Token Overhead (tool definitions only) | Remaining Context (200k window) |
|---|---|---|
| 50 tools | 10,000-25,000 tokens | ~175,000-190,000 tokens |
| 200 tools | 40,000-100,000 tokens | ~100,000-160,000 tokens |
| 500 tools | 100,000-250,000 tokens | ~0-100,000 tokens |
Five hundred tools therefore means 100,000-250,000 tokens of tool definitions alone, before the actual conversation, system prompt, or retrieved context is added.
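The arithmetic behind the table is simple enough to sketch directly. The function below reproduces the figures above, assuming the 200-500 tokens-per-tool range cited earlier; the function name and defaults are illustrative, not part of any real API.

```python
# Back-of-envelope estimate of tool-definition context overhead,
# using the 200-500 tokens-per-tool range from the text.
def tool_context_overhead(n_tools, tokens_per_tool=(200, 500), window=200_000):
    lo = n_tools * tokens_per_tool[0]   # best case: tersely documented tools
    hi = n_tools * tokens_per_tool[1]   # worst case: richly documented tools
    remaining = (max(window - hi, 0), max(window - lo, 0))
    return (lo, hi), remaining

overhead, remaining = tool_context_overhead(500)
print(overhead)   # (100000, 250000)
print(remaining)  # (0, 100000)
```

Running the same function for 50 or 200 tools reproduces the other table rows, which makes it easy to see how quickly the overhead crowds out everything else.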
Research on long-context model performance consistently shows that models lose precision on information buried deep in long contexts — the "lost in the middle" phenomenon. When a model must select among five hundred tools, relevant tools defined early or late in the context window receive disproportionate attention. Tools defined in the middle of a long list are systematically underselected even when they are the most appropriate choice for the current task.
There is also a reasoning cost. When a model processes a tool call, it is implicitly reasoning over all available tools even when only a small subset is relevant. This increases token generation cost and inference latency for every call, whether or not the task actually required broad tool consideration.
The third problem is context pollution. When tool definitions occupy a large fraction of the available context window, there is less room for task-relevant content: conversation history, retrieved memory, document context, intermediate reasoning. At 200k-token context windows, this becomes a real allocation problem.
Tool RAG: The Core Idea
Tool RAG applies the same insight behind Retrieval-Augmented Generation to the tool selection problem. Rather than loading all available tools into context at the start of every conversation, a Tool RAG system maintains an indexed representation of the full tool catalog and retrieves the most relevant subset at query time.
The retrieval step ranks tools by semantic similarity between the user's request (or the agent's current planning context) and the tool descriptions in the catalog. When a user asks an agent to "find all open pull requests from the last week and summarize the review comments," the retrieval step surfaces tools related to version control, PR management, and text summarization — not tools related to image generation, audio processing, or database administration.
The retrieved subset — typically ten to thirty tools — is what the model actually sees in its context window. Tool selection operates over a compact, relevant set rather than a sprawling complete catalog.
The architectural implication is a two-step tool resolution process:
- Tool retrieval: semantic search over the full catalog to produce a relevant subset
- Tool invocation: standard model reasoning over the retrieved subset to produce a tool call
The retrieval step is fast — typically a few milliseconds for an approximate nearest-neighbor search over an indexed embedding store — and adds negligible latency relative to model inference time.
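The two-step resolution can be sketched in a few lines. The sketch below uses a toy bag-of-words "embedding" as a stand-in for a real embedding model and ANN index; the tool names and descriptions are illustrative, not drawn from any real catalog.

```python
# Minimal sketch of two-step tool resolution: step 1 retrieves a small
# relevant subset; step 2 would hand that subset to the model for the
# actual tool call. A real system would use learned embeddings and an
# approximate nearest-neighbor index instead of this toy similarity.
from collections import Counter
from math import sqrt

def embed(text):
    return Counter(text.lower().split())  # toy stand-in for an embedding model

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CATALOG = {
    "git_pr_list": "list open pull requests and review comments",
    "image_generate": "generate images from a text prompt",
    "db_admin": "administer database users and permissions",
}
INDEX = {name: embed(desc) for name, desc in CATALOG.items()}

def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(INDEX, key=lambda n: cosine(q, INDEX[n]), reverse=True)
    return ranked[:k]  # step 1 output: the compact subset the model sees

print(retrieve("summarize review comments on open pull requests"))
# 'git_pr_list' ranks first
```

Only the retrieved names (and their full definitions) would be placed in the model's context; the rest of the catalog stays in the index.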
Implementation Considerations
Tool RAG sounds straightforward but has several important implementation details that determine whether it works well in practice.
What to embed. Tool descriptions alone are insufficient as embedding targets. A tool called git_pr_list with the description "List pull requests" will not reliably surface for the query "show me what's waiting for my review" unless the embedding captures the semantic relationship between PR listing and review workflows. Good tool embeddings incorporate the description, usage examples, parameter purposes, and any human-written notes about typical use cases. Richer embeddings yield better retrieval recall at the cost of larger index size.
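One way to build a richer embedding target is to concatenate every piece of tool metadata into a single text before embedding. The field names below are illustrative, not the MCP schema, and the example tool is the hypothetical git_pr_list from the paragraph above.

```python
# Sketch of assembling a richer embedding target from tool metadata.
# Field names ("parameters", "examples", "notes") are illustrative.
def build_embedding_text(tool):
    parts = [tool["name"].replace("_", " "),   # make the name word-searchable
             tool["description"]]
    parts += [f"{p}: {desc}" for p, desc in tool.get("parameters", {}).items()]
    parts += tool.get("examples", [])           # natural-language usage examples
    parts += tool.get("notes", [])              # human-written use-case notes
    return "\n".join(parts)

tool = {
    "name": "git_pr_list",
    "description": "List pull requests",
    "parameters": {"state": "filter by open, closed, or merged"},
    "examples": ["show me what's waiting for my review"],
    "notes": ["commonly used in code review workflows"],
}
print(build_embedding_text(tool))
```

With the usage example included, a query like "what's waiting for my review" has direct lexical and semantic overlap with the embedded text, which is exactly the recall gap the bare description leaves open.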
Retrieval set sizing. Ten tools is too few for complex multi-step tasks that require coordinating across several tool categories. Fifty tools starts to approach the same quality degradation seen with full catalogs. Empirically, fifteen to twenty-five tools covers the majority of single-session tasks while keeping tool-definition context overhead under 10,000 tokens. Adaptive sizing — expanding the retrieval set when the initial set does not yield a tool call — handles edge cases without defaulting to large static sets.
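The adaptive-sizing loop can be sketched as follows. Both `retrieve` and `model_selects_tool` are hypothetical stand-ins here (the retrieval index and the model call, respectively); the start size and cap are the illustrative figures from the paragraph above.

```python
# Sketch of adaptive retrieval sizing: start with a modest subset and
# widen it only when the model declines to call any tool, rather than
# defaulting to a large static set.
def resolve_tools(query, retrieve, model_selects_tool, start=20, cap=50):
    k = start
    while k <= cap:
        subset = retrieve(query, k=k)
        call = model_selects_tool(query, subset)  # hypothetical model step
        if call is not None:
            return call
        k *= 2  # retrieval miss: double the window and retry
    return None  # let the agent loop surface "no suitable tool"
```

The common case pays only the small-set cost; the cap keeps the worst case from regressing to full-catalog behavior.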
Retrieval triggering. Some implementations retrieve tools once per conversation turn. Better implementations re-retrieve when the task context shifts significantly — when an agent moves from a research subtask to a code generation subtask, for instance. Context-aware re-retrieval ensures that the tool window stays aligned with current intent rather than being anchored to the initial query.
Handling retrieval misses. A tool that does not surface in retrieval is effectively invisible to the agent. Safety-critical tools (confirmation requests, rollback operations, escalation paths) should therefore be pinned in context regardless of retrieval results. A small set of always-present tools guarantees that the agent's critical operational vocabulary is never absent.
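Pinning composes naturally with retrieval: the pinned set occupies the front of the tool window unconditionally, and retrieved tools fill the remaining budget. The tool names below are illustrative.

```python
# Sketch of merging pinned safety-critical tools with retrieval results.
# Pinned tools are never subject to retrieval misses; retrieved tools
# fill whatever budget remains, with duplicates dropped.
PINNED = ["confirm_with_user", "rollback_operation", "escalate_to_human"]

def tool_window(retrieved, pinned=PINNED, budget=25):
    window = list(pinned)
    for tool in retrieved:
        if tool not in window and len(window) < budget:
            window.append(tool)
    return window

print(tool_window(["git_pr_list", "rollback_operation", "web_search"]))
```

Note that a pinned tool surfacing in retrieval (rollback_operation above) is deduplicated rather than counted twice against the budget.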
Neumar's Tool Ecosystem at Scale
Neumar's MCP marketplace provides access to more than 10,000 community-contributed skills, making Tool RAG not an optional optimization but a foundational requirement. Sending 10,000 tool definitions in every agent context window is infeasible: at 200-500 tokens per tool, the definitions alone would run 2 to 5 million tokens, far beyond any current context window, and selection quality would collapse long before that limit was reached.
The marketplace integration uses a layered tool resolution architecture. Tools are organized into categories (development, productivity, communication, media, data, infrastructure), and the first retrieval pass operates at the category level — selecting two to four relevant categories based on the task context. A second retrieval pass within those categories surfaces specific tools. This hierarchical approach yields better precision than flat retrieval over the full catalog, because category-level signals are stronger and less ambiguous than individual tool-level signals.
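The two-pass structure can be sketched with a generic similarity function. The toy `score` below just counts shared words; the category summaries and tools are illustrative, and a real system would use embeddings for both passes.

```python
# Sketch of hierarchical (category-then-tool) retrieval. Pass 1 ranks
# categories by their summary text; pass 2 ranks individual tools only
# within the winning categories.
def score(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_retrieve(query, categories, top_categories=2, top_tools=3):
    ranked_cats = sorted(categories,
                         key=lambda c: score(query, c["summary"]),
                         reverse=True)
    candidates = [t for c in ranked_cats[:top_categories] for t in c["tools"]]
    ranked = sorted(candidates,
                    key=lambda t: score(query, t["description"]),
                    reverse=True)
    return [t["name"] for t in ranked[:top_tools]]

categories = [
    {"summary": "development code pull requests git",
     "tools": [{"name": "git_pr_list", "description": "list open pull requests"}]},
    {"summary": "media image audio video",
     "tools": [{"name": "image_generate", "description": "generate images from text"}]},
]
print(hierarchical_retrieve("summarize open pull requests", categories))
```

The first pass prunes most of the catalog before any tool-level comparison happens, which is where the precision gain over flat retrieval comes from.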
Community tools in the marketplace include user ratings and usage frequency signals alongside their semantic descriptions. The retrieval system incorporates these signals as soft priors — a highly-rated, frequently-used tool gets a small boost in retrieval ranking that reflects the community's validation of its utility. This is similar to how web search engines incorporate authority signals alongside semantic relevance.
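One simple way to express such soft priors is a multiplicative boost on the semantic score, kept small so relevance always dominates. The weights below are illustrative, not Neumar's actual ranking formula.

```python
# Sketch of blending semantic similarity with community signals as soft
# priors. Ratings (0-5) and usage counts nudge the ranking; similarity
# remains the dominant term. Weights are illustrative.
from math import log1p

def ranking_score(similarity, avg_rating, usage_count,
                  rating_weight=0.1, usage_weight=0.05):
    rating_boost = rating_weight * (avg_rating / 5.0)         # at most +10%
    usage_boost = usage_weight * log1p(usage_count) / 10.0    # log-dampened
    return similarity * (1.0 + rating_boost + usage_boost)
```

Because the boosts are bounded and multiplicative, a poorly matching tool can never outrank a strongly matching one on popularity alone, which is the property that keeps these signals "soft."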
Personal tool history is also incorporated. Tools that a user has successfully invoked in previous sessions for similar tasks receive a retrieval boost. Over time, the system learns the user's actual tool vocabulary and surfaces it preferentially, reducing the cognitive load of working with a large catalog.
The Bigger Picture: Composable Capability
Tool RAG is a specific instance of a broader principle: agents should not need to hold their entire capability set in working memory at once. Just as humans do not enumerate all their possible actions before each decision, well-designed agents should dynamically access capability in proportion to immediate relevance.
This principle extends beyond tool selection. Memory retrieval, knowledge base queries, and context window allocation all benefit from relevance-driven loading rather than comprehensive pre-loading. The architectural pattern is consistent: index everything, load only what the current moment needs.
The practical benefit for users is an agent that handles a 10,000-tool ecosystem with the same precision and speed as one operating over fifty tools. The capability ceiling rises without the performance floor dropping. That is the core promise of Tool RAG, and for applications operating at MCP-marketplace scale, it is the difference between a catalog that is theoretically available and one that is practically usable.
