Joe T. Sylve, Ph.D.

Digital Forensic Researcher and Educator

ida-mcp 2.2: From Tool Calls to Analysis Scripts

ida-mcp 2.2.0 is out. This release removes the friction between what the LLM wants to do and what MCP lets it express in a single round trip.

In 2.1, each action was a discrete tool call: decompile this function, get cross-references to that address, rename this symbol. Every step was a full MCP round trip. Every intermediate result landed in the context window. An analysis workflow that a human would express as a ten-line IDAPython script became thirty sequential tool calls, each waiting for the previous one to return before the LLM could decide what to do next. The LLM knew what it wanted to do, but it couldn’t say it all at once.

2.2 introduces meta-tools that let the LLM operate at a higher level of abstraction: writing multi-step analysis scripts, issuing bulk operations, and calling tools it discovers at runtime. It also makes the server persistent, so analysis state survives across sessions. And for the first time, ida-mcp can analyze firmware and raw binaries directly.

Meta-tools

execute: sandboxed analysis scripts

execute accepts Python code that calls IDA tools through await invoke(name, params), with full control flow: loops, conditionals, regex, struct unpacking, list comprehensions. Individual tools are still the right choice for simple operations, but for multi-step analysis, the LLM becomes a script writer.

Consider a common reverse engineering task: finding every function that references an error string and understanding how each one handles the error. In 2.1, this was a multi-step conversation:

  1. Call get_strings with a filter → get back 40 matching strings
  2. Call get_xrefs_to for the first string address → get back 3 cross-references
  3. Call decompile_function for each referencing function → get back pseudocode
  4. Repeat steps 2–3 for each of the remaining 39 strings

That’s potentially 160+ tool calls, each a full round trip, with the LLM holding intermediate addresses in context between calls. If the context window fills up mid-workflow, earlier results get compacted and the LLM loses track of where it was.

With execute, the same workflow is a single tool call:

strings = await invoke("get_strings", {"filter": "error|fail|panic"})
results = []
for s in strings["strings"]:
    xrefs = await invoke("get_xrefs_to", {"address": s["address"]})
    for xref in xrefs["xrefs"]:
        decomp = await invoke("decompile_function", {"address": xref["from"]})
        results.append({
            "string": s["value"],
            "function": decomp["name"],
            "pseudocode": decomp["pseudocode"]
        })
return results

One round trip. The LLM gets back a structured result containing every error-handling function with its decompiled pseudocode. No intermediate state to track, no context window spent on addresses it only needed temporarily. And if the LLM decides the approach is wrong, it’s only wasted one tool call finding out.

Any “get a list, then process each item” workflow collapses from O(n) tool calls to one. The bigger gain is for workflows that don’t reduce to sequential calls: conditional logic, data transformation, or cross-referencing between results.

Automated renaming based on string references:

A stripped binary might have thousands of sub_* functions with no meaningful names, but many of them reference string literals that hint at their purpose. A human analyst would scan through decompiled output, spot a string like "failed to parse header", and rename the function accordingly. With execute, the LLM can do this systematically across the entire binary in a single tool call:

import re

funcs = await invoke("list_functions", {"filter": "sub_"})
renamed = []
for func in funcs["functions"]:
    decomp = await invoke("decompile_function", {"address": func["address"]})
    strings = re.findall(r'"([^"]{4,})"', decomp["pseudocode"])
    if strings:
        candidate = re.sub(r'[^a-zA-Z0-9_]', '_', strings[0])[:40]
        await invoke("rename_function", {
            "address": func["address"],
            "new_name": f"uses_{candidate}"
        })
        renamed.append({"old": func["name"], "new": f"uses_{candidate}"})
return {"renamed": len(renamed), "functions": renamed}

The names this generates are rough: a first pass rather than a final answer. But uses_failed_to_parse_header is vastly more useful than sub_140001A30 when you’re trying to understand a binary’s structure, and the LLM can refine them in a second pass once it understands the broader architecture.

Cross-database patch diffing:

Patch analysis requires comparing function lists between two versions of a library, identifying what was added or removed, and diffing the implementations that exist in both. Without execute, the LLM would pull function lists from each database in separate tool calls, hold both in context, compute set differences itself, and decompile changed functions one at a time. Dozens of round trips, large intermediate results sitting in context.

With execute, the entire triage happens server-side:

old_funcs = await invoke("list_functions", {"database": "libcrypto_1.1.1"})
new_funcs = await invoke("list_functions", {"database": "libcrypto_1.1.2"})

old_names = {f["name"] for f in old_funcs["functions"]}
new_names = {f["name"] for f in new_funcs["functions"]}

added = sorted(new_names - old_names)
removed = sorted(old_names - new_names)

# Spot-check shared functions for implementation changes
changed = []
for name in sorted(old_names & new_names)[:30]:
    old_dec = await invoke("decompile_function", {
        "address": name, "database": "libcrypto_1.1.1"
    })
    new_dec = await invoke("decompile_function", {
        "address": name, "database": "libcrypto_1.1.2"
    })
    if old_dec["pseudocode"] != new_dec["pseudocode"]:
        changed.append(name)

return {
    "added": added[:50],
    "removed": removed[:50],
    "changed": changed,
    "summary": {
        "added": len(added),
        "removed": len(removed),
        "shared_checked": min(30, len(old_names & new_names)),
        "shared_changed": len(changed)
    }
}

The database parameter override lets a single execute block work across multiple open databases. Each invoke call can target a different database by name. The LLM gets back a structured summary of what changed between versions, and can then drill into specific changed functions in follow-up calls. The set operations, sorting, and conditional comparison all happen server-side rather than burning context on intermediate data the LLM only needs to pass through.

The sandbox

The code runs in a RestrictedPython sandbox. The LLM can import re, struct, json, math, collections, itertools, functools, and a few other safe standard library modules. It cannot access the filesystem, open network connections, or spawn subprocesses. Attribute access to dunder names (__class__, __globals__, __code__) is blocked at the AST level, closing Python sandbox escape hatches. Print output is capped at ~1 MiB to prevent runaway loops from exhausting worker memory.

Database lifecycle tools (open_database, close_database, wait_for_analysis) are blocked inside the sandbox; an execute block shouldn’t be spawning or tearing down workers as a side effect. The meta-tools themselves (execute, batch, call) are also blocked to prevent recursion. Everything else (decompilation, disassembly, renaming, commenting, type manipulation, structure editing) is available through await invoke().

A failed invoke call raises a Python exception that the script can catch with try/except, or that terminates the block with an error message if uncaught.
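A sketch of that pattern inside an execute block, in the same style as the examples above (the addresses are hypothetical, and since the exact exception type isn't specified, a broad except is used):

```python
# Decompile a list of candidate addresses, tolerating per-item failures.
results = {"ok": [], "failed": []}
for addr in ["0x401000", "0x401100", "0x999999"]:
    try:
        dec = await invoke("decompile_function", {"address": addr})
        results["ok"].append(dec["name"])
    except Exception as exc:  # a failed invoke raises; record it and keep going
        results["failed"].append({"address": addr, "error": str(exc)})
return results
```

Uncaught, the same exception terminates the block and surfaces as the tool call's error message.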

If the LLM writes an execute block that contains a single invoke call with no processing logic around it, the server detects this and returns a hint suggesting the simpler call meta-tool instead. Small nudges like this help the LLM learn the right tool for the job over the course of a session.

batch: bulk operations without scripting overhead

Not every multi-call workflow needs control flow. Sometimes it’s the same operation twenty times: decompile a list of functions, rename a set of symbols, add comments at known addresses. For these, execute is overkill: sandbox overhead just to loop over a list. batch handles this directly: a list of operations, run sequentially with per-item error handling.

{
  "operations": [
    {"tool": "decompile_function", "params": {"address": "0x401000"}},
    {"tool": "decompile_function", "params": {"address": "0x401100"}},
    {"tool": "rename_function", "params": {"address": "0x401000", "new_name": "parse_header"}},
    {"tool": "rename_function", "params": {"address": "0x401100", "new_name": "validate_checksum"}},
    {"tool": "set_comment", "params": {"address": "0x401000", "comment": "Entry point for packet parsing"}},
    {"tool": "set_comment", "params": {"address": "0x401100", "comment": "CRC-32 validation"}}
  ]
}

Up to 50 operations per call, mixing different tools freely. This example decompiles two functions, renames them, and annotates them: six operations that would have been six separate tool calls in 2.1, collapsed into one.

In 2.1, batching was baked into individual tools: decompile_function accepted up to 50 addresses, get_xrefs_to accepted up to 50, each with its own batch parameter format. The LLM had to remember which tools supported batching and how each one worked. The unified batch meta-tool replaces all of that: a list of {tool, params} objects. Any tool can be batched.

stop_on_error controls whether the batch aborts on the first failure or continues collecting results. The default is to continue: if 30 functions are being renamed and one address is invalid, the other 29 still succeed. The response includes per-operation success/failure status, so the LLM can see exactly what failed and decide whether to retry or move on.
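The per-operation status means a partially failed batch is still useful. Its shape can be pictured like this (field names here are illustrative, not the server's exact schema):

```json
{
  "results": [
    {"tool": "rename_function", "success": true},
    {"tool": "rename_function", "success": false, "error": "invalid address"}
  ]
}
```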

The split is straightforward: if there’s no data dependency between operations (the output of one doesn’t feed into another), the LLM uses batch. If the workflow chains outputs, filters intermediate results, or applies conditional logic, it writes an execute script.

call and get_schema: the discovery layer

2.1 introduced progressive tool discovery: ~20 core tools registered upfront, the rest discoverable via search_tools and callable through call_tool. 2.2 refines this into a cleaner surface:

~25 tools are now pinned (up from ~20), and the total count is down to ~125 after 2.1’s resource consolidation. The remaining ~100 specialized tools are discoverable through search_tools and callable through call, batch, or execute.

Together, the five meta-tools form a hierarchy:

Need                            Meta-tool
Find a tool                     search_tools
Check its parameters            get_schema
Call it once                    call (or directly, if pinned)
Call many tools independently   batch
Chain tool outputs with logic   execute

The LLM picks the right level without prompting. A quick rename uses a pinned tool directly. A bulk annotation uses batch. A multi-step investigation uses execute. When it needs something specialized (applying a calling convention, editing register variables), it searches, checks the schema, and calls through call.
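Written out, that specialized-tool path is three meta-tool calls in sequence (the tool name set_calling_convention and its parameters are hypothetical, purely to illustrate the flow):

```json
{"tool": "search_tools", "params": {"query": "calling convention"}}
{"tool": "get_schema",   "params": {"name": "set_calling_convention"}}
{"tool": "call",         "params": {"name": "set_calling_convention",
                                    "params": {"address": "0x401000",
                                               "convention": "__fastcall"}}}
```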

Daemon mode

The meta-tools only pay off if the server stays alive long enough to use them. In 2.1, ida-mcp ran as a stdio subprocess of the MCP client. When the client disconnected (closing an editor, cycling a session, restarting after a crash), the server process died and took all worker state with it. Every open database, every completed auto-analysis pass, every renamed function: gone. For quick, single-session analysis, this was acceptable. But reverse engineering work rarely fits in a single session. You open a binary, let auto-analysis run, rename a few hundred functions, apply types, and then come back the next day to continue. Or the session cycles for an unrelated reason and you lose everything.

The problem was worse in Claude Code, where subagents share a single MCP session. A subagent halfway through analyzing a firmware image (hundreds of functions renamed, types applied) loses everything when the session cycles. It reconnects, but has to reopen, re-analyze, and reconstruct its progress from whatever survived context compaction.

In 2.2, the server runs as a persistent HTTP daemon behind a lightweight stdio proxy:

LLM Client  <──stdio──>  Proxy  <──HTTP──>  Daemon
                                             Workers + Databases

The first time an MCP client connects, the proxy spawns a daemon process and detaches it. Subsequent connections (including reconnections after a session cycle, from a different editor, or from a completely new conversation) reuse the running daemon. Workers and their databases persist across disconnects: renamed symbols, added comments, applied types all survive.

The daemon also supports collaboration across clients. If a human analyst has been annotating a binary through one MCP session, a second session connecting to the same daemon sees all those annotations immediately. The daemon doesn’t care who made the changes; it just maintains the databases.

The daemon listens on 127.0.0.1 with a per-instance 256-bit bearer token. The state file is written with 0600 permissions so only the spawning user can read the token. To stop the daemon:

ida-mcp stop

This is the default transport now. Existing MCP client configurations (ida-mcp as the command) work without changes. The proxy handles daemon lifecycle transparently.

Raw binary and firmware support

ida-mcp could already open ELF, PE, and Mach-O files, where IDA auto-detects the architecture and load address from file headers. But firmware analysis (bootloaders, ROM dumps, flash extractions) starts with a blob of bytes and no metadata. Previously, you had to preprocess the binary in IDA's GUI or write a loader script before ida-mcp could work with it. In 2.2, open_database accepts three new parameters (processor, loader, and base_address) that give the LLM what it needs to bootstrap analysis on raw binaries.

For structured formats, these parameters are optional. IDA figures them out from the file headers. For raw binaries, the LLM needs to provide them. If the user says “analyze this Cortex-M firmware dump loaded at 0x08000000,” those three parameters map directly:

{
  "file_path": "/path/to/firmware.bin",
  "processor": "arm:ARMv7-M",
  "loader": "Binary file",
  "base_address": "0x08000000"
}

The server validates processor names and catches a subtle headless-mode pitfall: processor names like arm, metapc, and mips are ambiguous. In IDA’s GUI, selecting one of these pops up a dialog asking which variant you mean: ARM or AArch64? 32-bit or 64-bit x86? But headless idalib never shows that dialog. It silently picks a default, and the default is often wrong. A Cortex-M firmware blob opened with bare arm ends up disassembled as AArch64, producing nonsense.

The server rejects these bare names on raw binaries and returns the available variants with descriptions:

"arm" is ambiguous for raw binaries. It defaults to AArch64 in headless mode.
Use a specific variant:
  arm:ARMv7-M    Cortex-M (32-bit Thumb-2)
  arm:ARMv7-A    32-bit A-profile
  arm:AArch64    64-bit (explicit)

The LLM can also call list_targets to enumerate all available processors and loaders, so it can match an unknown binary to the right target without guessing.

Fat Mach-O support

macOS universal binaries pack multiple architecture slices into a single file. In 2.1, opening one would silently pick whichever slice IDA defaulted to, usually arm64, even when the target was x86_64. Nothing indicated the wrong slice had been selected until the disassembly didn’t make sense.

In 2.2, the server parses the fat header, identifies the available slices, and requires the caller to choose explicitly:

AmbiguousFatBinary: universal binary contains multiple architectures.
Available slices: arm64, arm64e, x86_64
Pass fat_arch="arm64" to select a slice.

Each slice gets its own .i64 sidecar (binary.arm64.i64, binary.x86_64.i64), so multiple architectures can be opened simultaneously in separate workers. Combined with execute’s cross-database support, the LLM can decompile the same function in both the arm64 and x86_64 slices and diff the pseudocode. This helps when finding platform-specific behavior, verifying that a vulnerability affects all architectures, or understanding how the compiler optimized differently for each target.
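That cross-slice diff fits in a small execute block, following the same invoke pattern as the earlier examples (the database names and target function are illustrative):

```python
# Decompile the same function in two slices and compare the pseudocode.
slices = ["binary.arm64", "binary.x86_64"]  # hypothetical open databases
decomp = {}
for db in slices:
    d = await invoke("decompile_function", {
        "address": "validate_license",  # function name is illustrative
        "database": db
    })
    decomp[db] = d["pseudocode"]
return {
    "identical": decomp[slices[0]] == decomp[slices[1]],
    "pseudocode": decomp
}
```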

The fat header parser also handles an edge case that has bitten other tools: Java .class files share the same magic bytes (0xCAFEBABE) as Mach-O fat binaries. The parser validates slice counts and CPU types to distinguish the two, so a directory full of Java classes won’t trigger false fat-binary detection.
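The disambiguation hinges on the word after the magic: in a fat header it is nfat_arch, a small slice count, while in a .class file it holds the minor/major version, and every Java major version is at least 45 (Java 1.0). A minimal sketch of that heuristic, not necessarily ida-mcp's exact check:

```python
import struct

FAT_CUTOFF = 45  # Java 1.0's major version; real fat slice counts are far smaller

def classify_cafebabe(header: bytes) -> str:
    """Distinguish a Mach-O fat binary from a Java .class file, both of
    which begin with the magic bytes 0xCAFEBABE."""
    magic, word = struct.unpack(">II", header[:8])
    if magic != 0xCAFEBABE:
        return "other"
    # Fat binary: word is nfat_arch (typically 2 or 3).
    # Java class: word is minor << 16 | major, so it is always >= 45.
    return "fat-macho" if 0 < word < FAT_CUTOFF else "java-class"
```

Checking CPU types on each slice, as the post describes, tightens this further for files that happen to slip past the count check.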

Tuning for your model and client

Not every model writes good Python, and not every MCP client needs server-side tool discovery. The meta-tools are designed to be independently useful, so you can enable the ones that match your setup and disable the ones that don’t.

Three environment variables control which meta-tools are available. They are set on the server process, so they apply to all sessions against that daemon. Set them in your MCP client configuration:

{
  "command": "ida-mcp",
  "env": {
    "IDA_MCP_DISABLE_TOOL_SEARCH": "1"
  }
}

As a starting point: if you’re using Claude (Opus or Sonnet) through Claude Code, disable tool search. If you’re using a smaller model or a client without native tool deferral, leave everything enabled and let server-side progressive disclosure handle it.

Other improvements

Upgrading

uv tool install --upgrade ida-mcp

Or with pip:

pip install --upgrade ida-mcp

The MCP interface is backward compatible. Existing client configurations work without changes. The daemon spawns automatically on first connection.

If you run into issues or have feature requests, please open an issue on GitHub.


IDA Pro and Hex-Rays are trademarks of Hex-Rays SA. ida-mcp is an independent project and is not affiliated with or endorsed by Hex-Rays.

Find an issue or technical inaccuracy in this post? Please file an issue so that it may be corrected.