Inspired by Anthropic’s recent article on using code execution to improve MCP tool calling, and having just discovered Monty — a new sandboxed Python runtime from Pydantic — I was motivated to spend a weekend building an example project to explore further.

Standard LLM tool calling has a fundamental inefficiency: the model calls one tool at a time, waits for the result, decides what to call next, and the host round-trips back to the model for every step. For questions that require fetching data from multiple sources, time (and tokens) add up fast. The monty-example project explores a different approach: let the model write code that orchestrates the tool calls, then execute that code in a sandbox running in-process in your own application!

Monty: A Sandboxed Python Runtime

Monty is a Rust-based sandboxed Python runtime. It compiles a subset of Python into its own bytecode and runs it in its own VM, in-process on the calling thread. Crucially, it has no direct access to networking or OS I/O — but it can call a set of external host functions that you explicitly expose. This makes it a practical execution environment for LLM-generated code: the model gets a real Python interpreter with real control flow, and the host retains control over what the code can actually touch.

The Three-Phase Pattern

The key idea is to split each conversational turn into up to three phases: tool discovery, optional code generation, and final synthesis back to the user.

Figure 1. Three-phase orchestration flow.

Phase 1 — Tool Discovery (and direct answer when possible)

The model receives the user prompt, conversation history, and OpenAI tool schemas derived from a set of functions to be used by the agent. If the data needed to answer is already present in the conversation history, the model answers directly — no further LLM calls are made. If new data is required, the LLM will return tool calls in its response — signalling that tools are needed — and Phase 2 begins.

Phase 2 — Code Generation + Monty (Python) Execution

The model writes Python code using type stubs generated from the host tool functions. It decides which functions to call, in what order, and how to combine results, including asyncio.gather for parallel fetches. That code runs in the Monty sandbox, which calls back to the external host functions. If the generated code fails to compile or run, the error is fed back to the model and it retries up to a max number of attempts. The result is collected as a <tool_results> context block added to the conversation.

Phase 3 — Final Answer

The original prompt plus the tool results are sent to the model in a final request with no tools array, since any data needed will come from the <tool_results> context inserted in Phase 2. The model produces a data-grounded natural-language reply.

Conversation history retains only the user prompt and assistant reply — keeping context compact across turns.
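The per-turn control flow described above can be sketched as follows. Note that call_llm and generate_and_run_code are illustrative stand-ins (given dummy bodies here so the sketch runs), not the project's actual functions:

```python
# Sketch of the per-turn three-phase driver. The LLM and Monty calls are
# replaced by stand-ins so the control flow is runnable in isolation.

def call_llm(history, tools):
    """Stand-in for the model call; returns (content, tool_calls)."""
    if tools:
        return None, [{"name": "get_expenses"}]  # model asks for tools
    return "final answer grounded in tool results", None

def generate_and_run_code(history, tools):
    """Stand-in for Phase 2: code generation + Monty execution."""
    return {"items": [], "total": 0.0}

def run_turn(history, user_prompt, tools):
    history.append({"role": "user", "content": user_prompt})

    # Phase 1: tool discovery -- answer directly if history already suffices.
    content, tool_calls = call_llm(history, tools)
    if not tool_calls:
        history.append({"role": "assistant", "content": content})
        return content

    # Phase 2: model writes Python, Monty executes it against host functions.
    result = generate_and_run_code(history, tools)
    history.append({"role": "user",
                    "content": f"<tool_results>{result}</tool_results>"})

    # Phase 3: synthesis with no tools array -- data comes from context only.
    content, _ = call_llm(history, tools=None)
    history.append({"role": "assistant", "content": content})
    return content
```

Note how the history ends the turn with exactly three entries: the user prompt, one <tool_results> block, and the assistant reply.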

Demo Session

The left pane tails the session log showing phase details and actions; the right pane shows the chat.

Figure 2. Demo session showing the three-phase pattern in action.

Code Generation in Action

For the example, a set of functions and data for a mock expense management system is provided. The agent is given these, along with a system prompt:

_PHASE1_SYSTEM = (
    "You are a helpful assistant that analyses team expense data. "
    "First check the conversation history — if the data needed to answer the "
    "user's question is already present (e.g. in a prior <tool_results> block), "
    "answer directly using that data without calling any tools. "
    "Only call tools when genuinely new data is required that is not already "
    "in the conversation."
)

Here is an example of the code the model generates for the prompt “fetch all flight expenses across the team”. Phase 1 detects that expense data is needed and hands off to Phase 2, where the model writes:

import asyncio

department = 'Engineering'
quarter = 'Q3'
category = 'travel'

members_data = await get_team_members(department)
members = members_data['members']

expenses_tasks = []
for member in members:
    expenses_tasks.append(get_expenses(member['id'], quarter, category))

expenses_list = await asyncio.gather(*expenses_tasks)

filtered_expenses = []
for expenses in expenses_list:
    for expense in expenses['expenses']:
        if 'flight' in expense['description'].lower():
            filtered_expenses.append({
                'user_id': expenses['user_id'],
                'expense': expense
            })

filtered_expenses

All five team members’ expenses are fetched in parallel via asyncio.gather, then filtered in a single Monty execution. The model only sees the final result — not five intermediate round-trips.

Follow-up turns that ask for reformatting or aggregation are answered by Phase 1 directly from context, with no tool calls at all.

Session Logging

Session logging makes the token and time breakdown visible per turn, and is useful for seeing how the example works. In a real-world system you would use OpenTelemetry and observability tooling such as Pydantic Logfire, but the simple logging here is easy to use for debugging (see the left pane of the Demo Session above).

When Code Generation Goes Wrong

The following are examples of mishaps I encountered along the way.

Assignment statement as last expression breaks things

Prompt: “get the first expense line for Bob Smith and tell me the items on it”

The generated code correctly located Bob Smith (user_id 2) and fetched his expenses, but ended with an assignment statement rather than a bare expression:

import asyncio

members_data = await get_team_members('Engineering')
members = members_data['members']

bob_id = None
for member in members:
    if 'bob smith' in member['name'].lower():
        bob_id = member['id']
        break

expenses_data = await get_expenses(bob_id, 'Q3', 'travel')
expenses = expenses_data['expenses']

first_expense = expenses[0] if expenses else None

result = first_expense   # ← assignment, not a bare expression

Monty evaluates the last expression as the return value. An assignment statement has no value, so the sandbox returned null. Phase 3 received null as its tool result and (correctly) reported that no expenses were found.

Fix: the code generation rules now explicitly require that the last line be a bare expression, not an assignment — e.g. first_expense rather than result = first_expense.
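A rule like this can also be enforced deterministically before execution. Here is a minimal sketch using the stdlib ast module; whether the project performs this exact check is an assumption, but the pattern is standard:

```python
# Sketch: verify the last top-level statement is a bare expression, so the
# sandbox has a value to return. Illustrative; not necessarily how the
# project itself enforces the rule.
import ast

# PyCF_ALLOW_TOP_LEVEL_AWAIT lets us parse the generated code's top-level awaits.
FLAGS = ast.PyCF_ONLY_AST | ast.PyCF_ALLOW_TOP_LEVEL_AWAIT

def ends_with_expression(code: str) -> bool:
    tree = compile(code, "<generated>", "exec", flags=FLAGS)
    return bool(tree.body) and isinstance(tree.body[-1], ast.Expr)
```

A failing check can be fed back to the model as a retriable error, just like the def scan described later, instead of letting a silent null reach Phase 3.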


Over-fetching due to unnecessary parameter combinations

Prompt: “try looking for user id 2” (follow-up after the null result above)

With the user_id supplied directly, the model generated code that iterated over all quarters and categories rather than making a single call:

import asyncio

quarters = ['Q1', 'Q2', 'Q3', 'Q4']
categories = ['travel', 'meals', 'accommodation']

tasks = []
for quarter in quarters:
    for category in categories:
        tasks.append(get_expenses(2, quarter, category))

results = await asyncio.gather(*tasks)

Because get_expenses ignores the quarter and category arguments and always returns the same full expense list, this produced 12 identical responses — 572 lines of duplicate JSON. Phase 3 received 5,620 prompt tokens to answer what was a trivial single-call lookup. The session log made the cost immediately visible:

[TURN STATS]
  phase 1 :    ...
  phase 2 :  1,004 tokens (1 attempt)  ...
  phase 3 :  5,620p + ...
  subtotal:  7,229 tokens  (code-gen: 1,004  non-code: 6,225)  13.45s LLM

Root cause: the model learned from the Phase 2 type stubs that get_expenses accepts quarter and category parameters, and assumed (reasonably) that they were filters. Because the stub does not signal that the parameters are currently ignored, the model hedged by fetching all combinations.

Takeaway: keep tool semantics honest in docstrings and stubs. If a parameter is accepted but not yet filtering, document that clearly so the model does not over-fetch defensively.
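For instance, an honest stub for the demo's get_expenses might read as follows (the wording is illustrative, not taken from the project):

```python
# Illustrative only: an honest docstring for the demo's get_expenses stub,
# telling the model that the filter parameters are currently ignored.
from typing import Any

async def get_expenses(user_id: int, quarter: str, category: str) -> dict[str, Any]:
    """Return all expenses for a user.

    NOTE: quarter and category are accepted but NOT yet applied as filters;
    the full expense list is always returned. Call once per user -- do not
    enumerate quarter/category combinations.
    """
    ...
```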


Logic wrapped in an uncalled function — silent failure

Prompt: “itemize member’s expenses that have ‘flight’ in the description”

The generated code placed all fetching and filtering logic inside an async def main() function but never called it:

import asyncio

async def main():
    user_ids = [1, 2, 3, 4, 5]
    expenses_results = await asyncio.gather(
        *[get_expenses(user_id, 'Q3', 'travel') for user_id in user_ids]
    )
    flight_expenses = []
    for result in expenses_results:
        for expense in result['expenses']:
            if 'flight' in expense['description'].lower():
                flight_expenses.append({...})
    flight_expenses  # ← inside the function body, never reached

From Monty’s perspective, the top-level code was a single async def statement — a declaration with no value. The sandbox returned null without error. Phase 3 received null and faithfully reported that no flight expenses existed.

This failure is particularly insidious because the retry loop only triggers on exceptions. A silent null passes straight through to Phase 3.

Fixes applied (three layers):

  1. Prompt hardening — The code-gen system prompt was strengthened from the vague “do not define functions” to an explicit prohibition: “NEVER use def or async def — not even a helper. Every await must appear at the top level.”

  2. Deterministic pre-check — Before compilation, the generated code is scanned for the pattern \bdef\s+\w. If found, the attempt is rejected immediately with a clear error message fed back to the model, without ever running the code:

if re.search(r"\bdef\s+\w", code):
    last_error = (
        "Code contains a `def` or `async def` statement, which is forbidden. "
        "All logic must be written as flat top-level async code."
    )
  3. Null-result guard — Even if a future pattern produces None without defining a function (e.g. the last line is an assignment), the executor now rejects None as a retriable error rather than passing it to Phase 3.

Broader principle: prompt rules alone are insufficient for constraints that have a clear syntactic signature. Pair each important rule with a deterministic code scan so violations are caught before they silently corrupt results. Other candidates for the same treatment: next() (not available in Monty’s builtins — scan for \bnext\s*\(), return statements at the top level, and bare import of third-party libraries.
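These scans consolidate naturally into one deterministic pre-check that runs before compilation. The patterns below come from the rules above; precheck() itself is an illustrative name, not the project's actual API:

```python
# Sketch: run every syntactic rule as a deterministic scan before compiling,
# returning all violations at once so the model can fix them in one retry.
import re

RULES = [
    (r"\bdef\s+\w", "Code contains a `def` or `async def` statement, which is forbidden."),
    (r"\bnext\s*\(", "`next()` is not available in Monty's builtins."),
    (r"^\s*return\b", "`return` is not allowed at the top level."),
]

def precheck(code: str) -> list[str]:
    """Return every rule violation found in the generated code."""
    return [msg for pattern, msg in RULES
            if re.search(pattern, code, flags=re.MULTILINE)]
```

Because def is itself forbidden, the simple return scan cannot be fooled by a return inside a legitimate function body — any return in valid generated code is necessarily at the top level.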


LLM arithmetic in Phase 3 produces wrong totals

Prompt: “itemize member’s expenses that have ‘flight’ in the description” (same prompt as Failure 3, different manifestation once the null was fixed)

After the null fix, Phase 2 correctly returned a list of 15 flight expense items. Phase 3 was then asked to present the results and computed the total in prose — but large language models are unreliable at arithmetic. Across multiple runs the same 15 items produced totals of 8,530, 9,370, and 11,950 depending on the run, when the correct answer is 10,510.

Root cause: Phase 3 model call was doing the sum itself from the raw item list rather than reading a pre-computed value from the Phase 2 result.

Fix — add a bookkeeping tool and require its use:

A sum_amounts external function was added to external_tools.py:

from typing import Any

async def sum_amounts(items: list[dict[str, Any]], field: str = "amount") -> float:
    """Sum a numeric field across a list of dicts."""
    return sum(float(item[field]) for item in items)

Because it is registered in TOOL_FUNCTIONS, OPENAI_TOOLS, and MONTY_TOOLS, it is visible to Phase 2’s code generator as a callable. A new code-gen rule was added to the system prompt:

“Always compute totals and subtotals using sum_amounts and include them in the returned dict or list. Never leave arithmetic to the final answer phase — if you return a list of expense items, wrap it: {"items": [...], "total": await sum_amounts(items)}.”

With this in place, Phase 2 returns, for example:

{
  "items": [...],
  "total": 10510.0
}

Phase 3 reads 10510.0 directly — no arithmetic required, no rounding errors, no hallucinated sums.

Broader principle: any value that requires exact computation (sums, counts, percentages, date arithmetic) should be calculated in Python by Phase 2 and surfaced as a named field in the return value. Phase 3’s role is narration, not calculation.

Final Takeaways

Performance observations

                     Tokens         Time
Code Generation      3,957 (32%)    0.005s
Model Processing     8,299 (68%)    23.86s
Total               12,256         23.865s

Table 1. Session totals from the demo conversation.

Code execution is essentially free. The entire session required 0.005 seconds of code execution time. In a standard sequential tool-calling setup, each of those fetches would have required a round-trip back to the LLM to decide what to call next.

Code generation overhead is real but bounded. Code-generated tokens (3,957) accounted for about 32% of total token spend. That cost is paid once per turn that requires new data, not once per tool call. Turns five and six in the demo were answered directly from context — no code generated, no tools called — so the overhead only appears when it’s actually needed.

Conversation history stays compact. Rather than accumulating individual tool call/response pairs for every intermediate step, the history retains only the user prompt, a single <tool_results> block with the computed result, and the assistant reply. This trimmed-down context helps improve model performance.

Other considerations

Evaluations are not optional. During development I encountered “happy mistakes” in the generated code — cases where the model produced the right answer for the wrong reasons. In a production setting, keep your model “Honest ABE” (Always Be Evaluating) to detect new edge cases and make adjustments.

Prompts alone won’t hold the line. Code-generation constraints are expressed in the system prompt, but the model ignores them often enough that prompt rules alone are insufficient. The more reliable pattern is to pair each important rule with a deterministic check that runs before the code is compiled and run. A couple of targeted regexes (\bdef\s+\w, \bnext\s*\() catch whole classes of silent failures cheaply.

Fit matters. This approach works best when the tools map cleanly to the domain and the data shapes are predictable. As the failure cases above show, good docstrings and type stubs matter a lot. The Anthropic article [3] referenced at the top covers similar tradeoffs in the context of MCP and is worth reading alongside this example.

References

[1] oshea00, “monty-example: LLM + Monty Tool Calling,” GitHub, 2025. Available: https://github.com/oshea00/monty-example

[2] pydantic, “Monty: Sandboxed Python Runtime,” GitHub, 2025. Available: https://github.com/pydantic/monty

[3] Anthropic, “Code execution with MCP,” Anthropic Engineering, 2025. Available: https://www.anthropic.com/engineering/code-execution-with-mcp