Narrow Tools, Deep Expertise

The limiting factor in AI agents isn't the model. It's whether you know enough to build the right tools. What building Elastibot taught me about composition, domain knowledge, and what the papers leave out.

The cluster was down. The team spent ~2 hours on it: shard allocation first, then index settings, then a circuit breaker theory that went nowhere. They were good engineers on a cluster they knew well. The root cause kept slipping.

Elastibot was still new at that point, not yet the first instinct. When I finally ran it, it came back in under four minutes with a structured report: root cause identified, remediation steps outlined. It was looking at the same cluster the team had been working on for two hours.

What it found matters less than how it found it.

The Wrong First Question

When most engineers first get access to a capable LLM, they ask: what can I ask it? That question leads to a better search engine. Occasionally useful, not transformative.

The right question is: what can I make it run?

A language model that can only answer questions is bounded by its training data. It knows about Elasticsearch the way it knows about everything: in the abstract. It will confidently name configuration options that don't exist in your version. It has no access to your actual cluster state.

An agent with real tools doesn't have that problem. It calls the API, reads actual heap usage, checks actual swap configuration. No hallucination necessary. It can just look.

The tool is not an enhancement to the language model. The language model is the orchestration layer for the tools.

Once you internalize that inversion, you stop asking "how smart is this model" and start asking "how precise are these tools." The model is already smart enough. What limits the agent is the quality and breadth of what it can actually do.

How Elastibot Started

I was the primary Elasticsearch SME on a 30-node federal production cluster: 30,000–40,000 events per second, 300+ agency users depending on it. Every shard allocation failure, circuit breaker trip, node falling out of the cluster — I was the one doing triage. The bus factor was one.

I knew the APIs cold: which endpoints told you what, which metrics preceded which failures, which configuration flags bite federal air-gapped deployments specifically. That expertise took years to build.

When I got access to an LLM API, the obvious move was to encode that expertise: build a chatbot, point people at it when I wasn't around. The LLM knew Elasticsearch reasonably well in the abstract. The problem was that every useful answer required knowing the actual state of the actual cluster. "Is my JVM heap healthy?" isn't answerable from training data. You need to call _nodes/stats. "Why are shards unassigned?" requires _cluster/allocation/explain. Every diagnostic question was, fundamentally, an API call. I'd built something that could only talk, not act.

How Tools Compose

The first real version of Elastibot had one tool: cluster_health. A thin wrapper around the cluster health API, returning structured JSON.

// Tool definition: the metadata the model sees when deciding what to call
const clusterHealthTool = {
  name: "cluster_health",
  description: "Get Elasticsearch cluster health status, node count, shard counts, and active/relocating/initializing/unassigned shard breakdown",
  parameters: {
    // Base URL of the cluster to query
    endpoint: "string: cluster base URL"
  }
};


With one tool, the agent told you green/yellow/red — the same as the Kibana dashboard. But something more important happened when I ran it: the model read the output, reasoned about it, and stated what it wanted to know next, even though it couldn't find out. It told me what tool to build.

I built allocation_explain, then nodes_stats, then nodes_os_stats, disk_watermarks, circuit_breaker_stats. Each implementation was straightforward — a parameterized fetch against a well-documented REST API. Each one took maybe half an hour to write. Knowing which fetch to build took years.
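To make "a parameterized fetch" concrete, here is a minimal sketch of what such a tool factory might look like. It is not Elastibot's actual code: the tool names come from the article, but `makeTool`, `TOOL_PATHS`, the injected `fetchImpl`, and the choice of exact sub-paths (for example `/_nodes/stats/breaker` for circuit breakers) are assumptions made for illustration.

```javascript
// Sketch only: tool names follow the article; the factory, the injected
// fetchImpl, and the exact path choices are assumptions for illustration.
const TOOL_PATHS = {
  cluster_health: "/_cluster/health",
  allocation_explain: "/_cluster/allocation/explain",
  nodes_stats: "/_nodes/stats",
  nodes_os_stats: "/_nodes/stats/os",
  circuit_breaker_stats: "/_nodes/stats/breaker",
};

// Build one narrow tool: a single GET against one endpoint, JSON out.
function makeTool(name, fetchImpl = globalThis.fetch) {
  const path = TOOL_PATHS[name];
  if (!path) throw new Error(`Unknown tool: ${name}`);
  return {
    name,
    async run({ endpoint }) {
      // Strip trailing slashes so the base URL and path join cleanly.
      const url = endpoint.replace(/\/+$/, "") + path;
      const res = await fetchImpl(url, { headers: { Accept: "application/json" } });
      if (!res.ok) {
        // Return failures as data so the model can reason about them.
        return { tool: name, ok: false, status: res.status };
      }
      return { tool: name, ok: true, data: await res.json() };
    },
  };
}
```

Passing the fetch function in keeps each tool testable without a live cluster. Returning non-2xx responses as structured data rather than throwing is a design choice in this sketch, so that an error response becomes something the model can read and act on.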

This is essentially the Unix philosophy applied to agent tooling: narrow, composable primitives that an orchestrator combines. The concept isn't new; the orchestrator is. Adding allocation_explain to a one-tool agent didn't just add "allocation diagnosis." It created a diagnostic path: health check, find unassigned shards, explain why, cross-reference node disk usage — a triage workflow the model assembled itself, without me scripting the sequence.

cluster_health × nodes_stats × nodes_os_stats = coherent triage path

Each new tool composes with every existing one. The number of possible chains grows combinatorially, far faster than the tool count, but capability only grows along coherent diagnostic paths. The model can only compose what exists, and useful chains are a much smaller set than all possible combinations. How much capability you actually get depends entirely on which tools you build.
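As a rough illustration of that growth, the sketch below counts an upper bound: ordered chains of up to three distinct tools. The function name and the length cap of three are arbitrary choices for this example, and most of the chains it counts would not be useful diagnostics.

```javascript
// Upper bound on ordered tool chains of length 1..maxLen drawn from n tools,
// with no tool repeated within a chain.
function countChains(n, maxLen) {
  let total = 0;
  let perms = 1; // running value of n! / (n - len)!
  for (let len = 1; len <= Math.min(n, maxLen); len++) {
    perms *= n - len + 1;
    total += perms;
  }
  return total;
}

for (const n of [1, 3, 6]) {
  console.log(`${n} tools: up to ${countChains(n, 3)} chains of length <= 3`);
}
// 1 tool gives 1 chain, 3 tools give 15, 6 tools give 156.
```

Doubling the tool count from three to six multiplies the bound by roughly ten, which is why the selection of tools matters more than their number.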

Expertise Is the Real Prerequisite

Most writing about AI agents skips the uncomfortable part: the tools only work if the person building them has deep expertise in the domain.

An engineer who doesn't know Elasticsearch deeply would build the obvious tools first — the ones in every quickstart guide. Those tools handle simple cases. The production failures that actually keep engineers up at 2am would still require a human expert, because the tools that diagnose them would never get built.

cluster_health

Green/yellow/red status, shard counts, node count. In every getting-started guide. An LLM-assisted engineer building their first agent would start here.

allocation_explain

Why specific shards are unassigned. Requires knowing this endpoint exists, what it returns, and when it's useful — non-obvious until you've spent time debugging allocation failures in production.

nodes_os_stats

Raw per-node OS metrics from _nodes/stats/os: CPU, memory, swap utilization. Not a swap diagnostic — just OS telemetry. The model reads it and draws its own conclusions. Only exists because I knew swap was worth watching on ES clusters. Not in the quickstart. An LLM reading docs would not build this first.

The difference between cluster_health and nodes_os_stats is years of production experience. The implementation of either is trivial. Knowing to build the second one at all is not.

The AI amplifies expertise. It doesn't create it. If you don't know which tools to build, you'll build a capable-looking agent that misses exactly the failures that matter. The interface will be impressive. The coverage will have holes precisely where production bites you.

The Architecture

Elastibot is built around a parameterized tool server: a Node.js service that exposes Elasticsearch API calls through a structured interface the LLM can invoke. The model doesn't touch Elasticsearch directly. It generates a plan (a directed acyclic graph of tool calls, where each step can depend on the results of previous ones), and that plan is then executed deterministically. Reasoning happens before and after execution, not interleaved with side effects. This makes the agent's decisions inspectable: you can see the plan before it runs, and understand why it did what it did after.
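The plan-then-execute idea can be sketched in a few lines. The plan shape below (`id`, `tool`, `dependsOn`) and the function names are hypothetical, chosen for illustration rather than taken from Elastibot. The point is only that a plan can be validated and ordered before any tool runs.

```javascript
// Hypothetical plan shape: each step names a tool and the steps it depends on.
const plan = [
  { id: "health", tool: "cluster_health", dependsOn: [] },
  { id: "heap", tool: "nodes_stats", dependsOn: ["health"] },
  { id: "os", tool: "nodes_os_stats", dependsOn: ["health"] },
];

// Topological sort: repeatedly emit the steps whose dependencies are all done.
// Throws if a dependency is unknown or the plan contains a cycle.
function topoOrder(steps) {
  const remaining = new Map(steps.map((s) => [s.id, new Set(s.dependsOn)]));
  for (const deps of remaining.values()) {
    for (const d of deps) {
      if (!remaining.has(d)) throw new Error(`Unknown step: ${d}`);
    }
  }
  const order = [];
  while (remaining.size > 0) {
    const ready = [...remaining].filter(([, deps]) => deps.size === 0).map(([id]) => id);
    if (ready.length === 0) throw new Error("Plan contains a cycle");
    for (const id of ready) {
      order.push(id);
      remaining.delete(id);
    }
    for (const deps of remaining.values()) ready.forEach((id) => deps.delete(id));
  }
  return order;
}

// Run the validated plan in order; each step receives its dependencies' results.
async function executePlan(steps, tools) {
  const byId = new Map(steps.map((s) => [s.id, s]));
  const results = {};
  for (const id of topoOrder(steps)) {
    const step = byId.get(id);
    const inputs = Object.fromEntries(step.dependsOn.map((d) => [d, results[d]]));
    results[id] = await tools[step.tool](inputs);
  }
  return results;
}
```

Because `topoOrder` runs before any tool call, a malformed plan is rejected without side effects, and the resulting order can be logged or shown to a human before execution.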

The tool server is model-agnostic. It talks to an internal completions API that can swap the underlying model without touching agent logic. Today it runs on Gemini. When something better ships, the swap is a config change. The model is a configuration choice, not a structural one.
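One way to get that property, sketched under assumptions: the agent depends on a single `complete` function, and a config value selects which adapter backs it. The adapter names (`model-a`, `model-b`), `createCompletions`, and `proposeNextStep` are all hypothetical. Real adapters would call a vendor SDK or an HTTP API instead of returning a stub string.

```javascript
// Hypothetical provider adapters: each maps a prompt to text. These stubs
// only show where the seam sits; they do not call any real model.
const providers = {
  "model-a": async (prompt) => `[model-a] ${prompt}`,
  "model-b": async (prompt) => `[model-b] ${prompt}`,
};

// The agent depends only on the object returned here, never on a vendor.
function createCompletions(config, registry = providers) {
  const complete = registry[config.model];
  if (!complete) throw new Error(`No adapter for model: ${config.model}`);
  return { complete };
}

// Agent logic is written once against completions.complete.
async function proposeNextStep(completions, observation) {
  return completions.complete(`Given ${observation}, which tool should run next?`);
}
```

Swapping models then means changing `config.model` and, if needed, adding one adapter. `proposeNextStep` and the rest of the agent logic stay untouched.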

The Full Incident

CRITICAL PRODUCTION INCIDENT ✓ RESOLVED

Cluster: 30-node federal ES cluster, 30K–40K events/sec, 300+ agency users

Symptom: Cluster unresponsive. Search and indexing failing.

Manual triage: ~2 hours. Multiple engineers. Checked shard allocation, index settings, and a circuit breaker theory. Elastibot was still new and not yet the first stop.

Elastibot triage: cluster_health → nodes_stats (heap at 98%) → nodes_os_stats (swap active on 4 nodes) → model identifies root cause

Time to diagnosis: < 4 minutes from first invocation

Root cause: JVM heap pressure causing GC thrash. OS swap was enabled, so the heap was being swapped to disk under load. Elasticsearch's guidance is to disable swap or lock process memory in RAM with bootstrap.memory_lock: true.

Outcome: Swap disabled, nodes restarted with memory lock. Cluster green in 22 minutes. The ~2 hours of manual triage is what running Elastibot first would have saved. First-stop tool from that point forward.

The tool chain: cluster_health → nodes_stats → nodes_os_stats. Three tools — none designed for this failure mode specifically. The model read raw OS metrics, saw swap actively in use across four nodes, and connected it to the heap pressure. There was no swap diagnostic tool. It just had access to the right data and knew what to do with it. That path only worked because someone who'd watched swap kill this class of cluster before knew that OS telemetry was worth exposing in the first place.
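To make concrete what "raw OS metrics" means here, the sketch below pulls the two signals from a response shaped like the node stats API output. The sample values are made up for illustration and are not data from the incident, and `summarizeNodes` is a hypothetical helper, not an Elastibot tool. In the incident the model read these fields directly from the raw JSON.

```javascript
// Illustrative, made-up sample in the shape of a _nodes/stats response,
// trimmed to the fields relevant here. Values are not from the incident.
const sample = {
  nodes: {
    abc: {
      name: "node-1",
      jvm: { mem: { heap_used_percent: 98 } },
      os: { swap: { used_in_bytes: 2147483648 } },
    },
    def: {
      name: "node-2",
      jvm: { mem: { heap_used_percent: 61 } },
      os: { swap: { used_in_bytes: 0 } },
    },
  },
};

// Flatten per-node stats into rows; missing fields become null, not errors.
function summarizeNodes(stats) {
  return Object.values(stats.nodes).map((n) => ({
    node: n.name,
    heapUsedPercent: n.jvm?.mem?.heap_used_percent ?? null,
    swapUsedBytes: n.os?.swap?.used_in_bytes ?? null,
  }));
}

console.log(summarizeNodes(sample));
```

High heap usage alongside non-zero swap usage on the same node is the correlation described above. No tool encoded that rule. The data simply had to be available.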

If You're Building Agents

The pattern holds anywhere you have domain depth and a well-defined API surface.

Build narrow tools, not broad ones. Each tool should do exactly one thing and return structured, parseable output. Don't build a "diagnose cluster" tool. Build a "get cluster health" tool, an "explain allocation" tool, a "get node stats" tool. The model composes them. Your job is to make each primitive precise.

Let the model tell you what's missing. Run the agent on real problems early. When it says "I would need X to continue," build X. The model's requests are a requirements document for your next tool.

Separate planning from execution. An agent that interleaves reasoning and side effects is harder to debug, harder to trust, and harder to improve. Keep them distinct. Make the plan visible before it runs.

Build against an abstraction layer, not a specific model. Models improve fast. If your agent is tightly coupled to one provider's API, you pay a migration cost every time a better one ships.

Respect the prerequisite. The agent is only as good as the tools. The tools are only as good as your domain knowledge. An impressive interface with wrong coverage doesn't fail gracefully — it fails at exactly the moments that matter, giving false confidence in diagnoses that miss the actual problem.

The person who knows which tools to build is the person who's been paged when the cluster goes down. That knowledge doesn't come from documentation. It comes from the incidents — the ones that taught you which questions to ask, which APIs exist for exactly this, and which failure modes don't show up until production. That's the part that can't be automated.
