Beyond Tool Calls: Verified LLM Code Execution for ABF

At Alloan, we build AI that lenders can actually trust. In a domain where a miscalculation can mean a denied loan or a mispriced risk, “trust but verify” is not enough - we need AI that is verified by design. This post walks through how we think about verification at different layers of our system, and why LLM-driven code execution - sandboxed with RestrictedPython - has become one of the most powerful tools in our stack.

Verified AI: Our Core Tenet

The phrase “AI hallucination” has become a cliche, but in lending analytics it is not an abstract risk - it is a liability. A model that confidently fabricates a delinquency rate or invents a debt-to-income ratio can cause real harm. At Alloan, verified AI means that every output the system produces is traceable, reproducible, and constrained. An agent’s answer to “what is the average LTV across this cohort?” should not be a confident guess but an actual computation, against actual data, with an audit trail.

This is not simply a matter of adding guardrails after the fact. Verification has to be woven into how the system reasons, what it is allowed to do, and how its outputs are produced.

Level 1: Structured Output as First-Pass Verification

The most accessible layer of verification comes from asking the LLM to produce structured output rather than free-form text. When a language model is constrained to respond in a well-defined schema - a JSON object with specified fields and types - the application layer can immediately validate the response before acting on it. If the model was asked for a loan grade and returned “Grade”: “X” when only A-G are valid, the system can reject or re-prompt without ever surfacing the bad answer to a user.
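A minimal sketch of that validate-then-re-prompt loop, assuming the model replies with a JSON object that has a single `grade` field. The names `parse_grade` and `ask_for_grade`, and the `ask_model` callable, are hypothetical stand-ins for illustration, not part of any vendor SDK.

```python
import json
from typing import Callable

VALID_GRADES = {"A", "B", "C", "D", "E", "F", "G"}


def parse_grade(raw: str) -> str:
    """Return the grade if the response is well-formed, else raise ValueError."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("response must be a JSON object")
    grade = payload.get("grade")
    if not isinstance(grade, str) or grade not in VALID_GRADES:
        raise ValueError(f"grade must be one of A-G, got {grade!r}")
    return grade


def ask_for_grade(ask_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> str:
    """Re-prompt until the model returns a valid grade or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_grade(ask_model(prompt))
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the model can correct itself.
            prompt = f"{prompt}\n\nYour previous answer was rejected: {exc}"
    raise RuntimeError(f"no valid grade after {max_attempts} attempts: {last_error}")
```

The bad answer never leaves `ask_for_grade`: the caller either receives a grade in A-G or an exception it has to handle.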

A closely related mechanism is what the OpenAI and Gemini APIs call “tool calling” (or function calling): the model is told it has access to a set of typed functions and must invoke them with well-formed arguments. The schema of the function signatures acts as a contract. The model must respect it, and the caller can mechanically verify it.

Example: asking the model to call a structured function

{
  "tool": "compute_cohort_stat",
  "arguments": {
    "metric": "ltv",
    "aggregation": "mean",
    "filters": { "state": "CA", "origination_year": 2023 }
  }
}
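On the caller's side, checking that contract can be as simple as the sketch below. The allowed metrics, aggregations, and filter types are assumptions made up for this example; a real deployment would derive them from the tool's declared schema.

```python
import json

# Assumed contract for the compute_cohort_stat tool shown above.
ALLOWED_METRICS = {"ltv", "dti"}
ALLOWED_AGGREGATIONS = {"mean", "median", "min", "max"}
FILTER_TYPES = {"state": str, "origination_year": int}


def validate_tool_call(raw: str) -> dict:
    """Return the arguments of a well-formed call, else raise ValueError."""
    call = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(call, dict) or call.get("tool") != "compute_cohort_stat":
        raise ValueError("expected a call to compute_cohort_stat")
    args = call.get("arguments")
    if not isinstance(args, dict):
        raise ValueError("arguments must be an object")

    metric, aggregation = args.get("metric"), args.get("aggregation")
    if not isinstance(metric, str) or metric not in ALLOWED_METRICS:
        raise ValueError(f"unsupported metric: {metric!r}")
    if not isinstance(aggregation, str) or aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {aggregation!r}")

    filters = args.get("filters", {})
    if not isinstance(filters, dict):
        raise ValueError("filters must be an object")
    for key, value in filters.items():
        expected = FILTER_TYPES.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if expected is None or isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"invalid filter {key!r}: {value!r}")
    return args
```

Only a call that passes every check reaches the function that actually computes the statistic.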

Structured output is elegant and often sufficient for narrow, well-scoped tasks. Complex scenarios, however, need schemas so intricate that LLMs struggle to fill them in reliably. For example, requiring that every histogram query be weighted by total loan amount rather than by loan count is not easy to encode as a constraint on a JSON schema, even if the SQL itself were represented as a JSON syntax tree.
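To make the histogram example concrete, here is a small sketch with made-up loans. Both functions return a well-formed mapping from grade to share, so a schema check passes either way, yet they tell opposite stories about the portfolio.

```python
from collections import defaultdict

# Illustrative data only: (grade, loan_amount)
loans = [
    ("A", 700_000),
    ("B", 100_000),
    ("B", 100_000),
    ("B", 100_000),
]


def share_by_count(rows):
    """Histogram where every loan counts once."""
    counts = defaultdict(int)
    for grade, _amount in rows:
        counts[grade] += 1
    total = sum(counts.values())
    return {grade: n / total for grade, n in counts.items()}


def share_by_amount(rows):
    """Histogram weighted by loan amount."""
    amounts = defaultdict(float)
    for grade, amount in rows:
        amounts[grade] += amount
    total = sum(amounts.values())
    return {grade: amt / total for grade, amt in amounts.items()}


print(share_by_count(loans))   # grade B dominates: 3 of 4 loans
print(share_by_amount(loans))  # grade A dominates: 700k of 1M in balance
```

The difference lives in the semantics of the computation, not in the shape of its output, which is why output-shape validation alone cannot catch it.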

Level 2: LLM-Generated Code - Tool Calling on Steroids

The more powerful - and more interesting - approach is to give the model a programming language as its output modality.

Instead of asking the model to pick from a list of tools, we ask it to write a short Python program that calls a curated set of analytical functions we provide. The model’s expressiveness is no longer bounded by a fixed API surface. It can compose, loop, filter, aggregate, and derive in arbitrarily complex ways - anything that Python allows.

# Example of LLM-generated analysis code
results = spark_sql("""
    SELECT grade, AVG(dti) as avg_dti, COUNT(*) as loan_count
    FROM loans
    WHERE origination_year = 2023
      AND state IN ('CA', 'TX', 'FL')
    GROUP BY grade
    ORDER BY avg_dti DESC
""")

summary = results.to_markdown()
artifact("cohort_dti_by_grade", summary)

The model writes this. The system runs it. The output is a real computation on real data.

This is tool calling on steroids: instead of selecting a single action from a menu, the model is writing a mini-program that orchestrates many actions in sequence, conditionally, with intermediate state. The same LLM capability that made structured output useful now becomes a general-purpose analytical reasoning engine.

The catch, of course, is that running LLM-generated code in production is terrifying if you do it naively. exec() on arbitrary strings is one easy path to a compromised server. This is where RestrictedPython enters.
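A harmless demonstration of the problem, using only the standard library. The environment variable name and its value are invented for this example.

```python
import os

# Stand-in for a secret that lives in the host process environment.
os.environ["DEMO_DB_PASSWORD"] = "not-a-real-secret"

# A program a model could plausibly emit, whether by mistake or through
# a prompt injection hidden in the data it was asked to analyze.
untrusted_code = """
import os
leaked = os.environ["DEMO_DB_PASSWORD"]
"""

namespace = {}
exec(untrusted_code, namespace)  # naive: full builtins, unrestricted imports

print(namespace["leaked"])  # the generated code read the host's secret
```

Nothing in plain exec() distinguishes this program from a legitimate analysis; the restriction has to come from the environment the code runs in.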

Level 3: RestrictedPython - A Verified Execution Sandbox

RestrictedPython is a Python library that transforms arbitrary Python source into a restricted AST and compiles it under a security policy before execution. Rather than running code in the host interpreter’s full namespace, you execute it in a carefully curated environment. In practice, it looks something like this on the host side:

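from RestrictedPython import compile_restricted
from RestrictedPython.Eval import default_guarded_getitem, default_guarded_getiter
from RestrictedPython.Guards import (
    guarded_iter_unpack_sequence,
    guarded_unpack_sequence,
    safe_builtins,
    safer_getattr,
)

def execute_safely(code: str, allowed_globals: dict):
    # Compile under the restricted policy - raises SyntaxError
    # if the code violates the policy
    compiled = compile_restricted(code, "<llm_program>", "exec")

    safe_env = {
        "__builtins__": {
            **safe_builtins,
            # make_safe_importer is a host-defined allowlist helper,
            # not part of RestrictedPython
            "__import__": make_safe_importer(allowed_modules=["math", "statistics"]),
        },
        # Guard hooks that the compiled code calls for attribute access,
        # subscripting, iteration, and unpacking
        "_getattr_": safer_getattr,  # unlike plain getattr, refuses underscore-prefixed names
        "_getitem_": default_guarded_getitem,
        "_getiter_": default_guarded_getiter,
        "_unpack_sequence_": guarded_unpack_sequence,
        "_iter_unpack_sequence_": guarded_iter_unpack_sequence,
        # ... inject your analytical functions here
        **allowed_globals,
    }

    exec(compiled, safe_env)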

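The model can write a fully expressive program, but what that program can reach is bounded by the environment we construct. A disallowed name or import fails with a NameError or ImportError, and as long as none of the injected globals or allowed modules expose them, the program has no route to the filesystem, to spawning processes, or to environment variables. Verification is structural, not heuristic. The guarantee is only as strong as the imports and globals we choose to expose, so that list has to be curated with care. RestrictedPython also does not limit CPU time or memory, so runaway programs need separate controls such as timeouts or process isolation.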
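The listing above calls make_safe_importer without defining it. One way such a helper could look is sketched below; the name comes from the listing, but this body is an assumption for illustration, not RestrictedPython API and not necessarily how the production helper works.

```python
import builtins


def make_safe_importer(allowed_modules):
    """Build a replacement for __import__ that only serves allowlisted modules."""
    allowed = frozenset(allowed_modules)

    def safe_import(name, globals=None, locals=None, fromlist=(), level=0):
        # Reject relative imports and anything not explicitly allowlisted,
        # including submodules such as "os.path".
        if level != 0 or name not in allowed:
            raise ImportError(f"import of {name!r} is not permitted in the sandbox")
        return builtins.__import__(name, globals, locals, fromlist, level)

    return safe_import


# Demonstration with plain exec(), only to show the importer in isolation.
# This namespace is NOT a sandbox on its own.
env = {"__builtins__": {"__import__": make_safe_importer(["math", "statistics"])}}
exec("import math\nroot = math.sqrt(16)", env)
print(env["root"])

try:
    exec("import os", env)
except ImportError as exc:
    print(f"blocked: {exc}")
```

Matching on the exact module name keeps the policy easy to audit: a module is importable only if its full dotted name appears in the list.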

Connecting to Databricks

For our data analysis AI, we inject thin wrappers around Spark SQL queries into the sandbox. These wrappers execute against our loan data catalog. Here is a simplified illustration of how such a toolset might be structured:

# What the LLM sees in its system prompt (simplified)
"""
You have access to the following functions in your program:

spark_sql(query: str) -> pd.DataFrame
    Execute a Spark SQL query against the loan data catalog.
    Returns the results as a pandas DataFrame.

artifact(name: str, content: str) -> None
    Store a named text artifact (markdown table, analysis summary, etc.)
    as part of the response.
"""

# What the host injects into the sandbox
sandbox_globals = {
    "spark_sql": lambda q: run_spark_query(databricks_session, q),
    "artifact": artifact_store.add,
}

The LLM writes a Python program using spark_sql and artifact. RestrictedPython compiles and runs it. The program queries petabytes of loan data via Databricks, formats the result, and stores an artifact - all without the generated code itself ever touching a file, a socket, or a system call. Those stay on the host side of the injected bindings.
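The host side of those bindings can stay very small. The sketch below is one possible shape, with a fake query runner standing in for Databricks; `ArtifactStore`, `make_sandbox_globals`, and the query log are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactStore:
    """Collects named text artifacts produced by one program run."""
    items: dict = field(default_factory=dict)

    def add(self, name: str, content: str) -> None:
        if not isinstance(name, str) or not isinstance(content, str):
            raise TypeError("artifact name and content must be strings")
        if name in self.items:
            raise ValueError(f"artifact {name!r} already exists")
        self.items[name] = content


def make_sandbox_globals(run_query, store, query_log):
    """Build the bindings exposed to generated code, logging every query."""
    def spark_sql(query: str):
        query_log.append(query)  # audit trail: what was actually executed
        return run_query(query)

    return {"spark_sql": spark_sql, "artifact": store.add}


# Usage with a fake runner in place of a real Databricks session.
store, log = ArtifactStore(), []
bindings = make_sandbox_globals(lambda q: [("CA", 0.31)], store, log)
rows = bindings["spark_sql"]("SELECT state, AVG(dti) FROM loans GROUP BY state")
bindings["artifact"]("avg_dti", str(rows))
```

Because every query passes through one wrapper, the log gives a complete record of what a generated program asked the warehouse to do.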

Prompt Engineering: Grounding the Model in What It Can Do

Code-generating agents work best when the system prompt is itself a precise specification of the execution environment. A vague “you can query data” leads to the model inventing function signatures that don’t exist. A precise, schema-driven prompt grounds the model in reality. As of this writing, most reasoning (“thinking”) models can take a zero-shot description of the business semantics in the system prompt, with no worked examples, and write queries that conform to it.

Here is a toy illustration of the pattern:

SYSTEM:
You are a loan analytics assistant. You write Python programs to answer
questions about our loan portfolio.

Available functions:
  spark_sql(query: str) -> pd.DataFrame
      Run a SQL query against the `loans` table.
      Schema:
        - loan_id (string)
        - grade (string: A-G)
        - dti (float)         -- debt-to-income ratio
        - ltv (float)         -- loan-to-value ratio
        - state (string)
        - origination_year (int)
        - status (string: current, delinquent, default, paid_off)

  plot_bar(df, x_col, y_col, title) -> None
      Render a bar chart as an artifact.

  artifact(name, content) -> None
      Store markdown text as a named output artifact.

Rules:
  - Do not use any imports.
  - Do not use print(); use artifact() to surface results.
  - Write a single self-contained program.

USER:
What are the top 3 states by average DTI for Grade B loans originated in 2022?

The model might respond with:

df = spark_sql("""
    SELECT state, AVG(dti) as avg_dti
    FROM loans
    WHERE grade = 'B' AND origination_year = 2022
    GROUP BY state
    ORDER BY avg_dti DESC
    LIMIT 3
""")

artifact("top_states_dti", df.to_markdown(index=False))

RestrictedPython compiles this, verifies there are no forbidden constructs, and executes it. The spark_sql binding fires off the actual Databricks query. The result lands in the artifact store. The whole round trip is auditable, reproducible, and safe.
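A cheap complement to a precise prompt is to check, before execution, that the program only calls functions the environment actually provides. The sketch below uses the standard ast module; `unknown_calls` is a hypothetical helper, and it only inspects direct calls by name, so method calls such as df.to_markdown are deliberately left to the runtime guards.

```python
import ast


def unknown_calls(code: str, provided: set) -> set:
    """Return names called as functions that are neither provided nor defined."""
    tree = ast.parse(code)
    called, defined = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            called.add(node.func.id)
        elif isinstance(node, ast.FunctionDef):
            defined.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
    return called - defined - set(provided)


PROVIDED = {"spark_sql", "plot_bar", "artifact"}

good_program = '''
df = spark_sql("SELECT state, AVG(dti) AS avg_dti FROM loans GROUP BY state")
artifact("top_states_dti", df.to_markdown(index=False))
'''
invented_program = 'df = load_table("loans")\nartifact("t", summarize(df))'

print(unknown_calls(good_program, PROVIDED))
print(unknown_calls(invented_program, PROVIDED))
```

A non-empty result is a useful re-prompt signal: the error message can name the invented functions and restate the ones that exist.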

Beyond the Sandbox: A Language Built for Verification

RestrictedPython solves a real problem elegantly - but its underlying posture is defensive. It starts from a general-purpose language and removes what is dangerous. The safety guarantee is “we blocked the bad things.” For most applications, that is exactly the right trade-off.

In loan analytics, the safety bar for SQL-based computation is higher than in most domains. SQL is expressive - and that expressiveness is a liability when calculations need to be verifiable, not just executable. Consider a monthly delinquency rate for a loan pool: the correct computation must normalize by active balances in that specific period, not total originations or an all-time count. Get the normalization wrong and the number looks perfectly plausible while being meaningfully incorrect - the kind of error that slips past code review. A sandbox cannot catch this.
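A worked example with invented numbers shows how close the wrong answers sit to the right one. It assumes that "active" means a status of current or delinquent in the month's snapshot; that definition is an assumption for this sketch.

```python
# One month's snapshot of a small pool: (loan_id, balance, status)
snapshot = [
    ("L1", 150_000.0, "current"),
    ("L2", 50_000.0, "delinquent"),
    ("L3", 50_000.0, "current"),
    ("L4", 0.0, "paid_off"),
]
total_originated = 500_000.0  # original principal of all four loans (illustrative)

delinquent_balance = sum(b for _, b, s in snapshot if s == "delinquent")
active_balance = sum(b for _, b, s in snapshot if s in ("current", "delinquent"))
delinquent_count = sum(1 for _, _, s in snapshot if s == "delinquent")

# Intended: delinquent balance over active balance in the same period.
correct_rate = delinquent_balance / active_balance    # 50k / 250k = 0.2

# Plausible-looking mistakes that run without error:
rate_vs_originations = delinquent_balance / total_originated  # 50k / 500k = 0.1
rate_by_count = delinquent_count / len(snapshot)               # 1 / 4 = 0.25

print(correct_rate, rate_vs_originations, rate_by_count)
```

All three values are percentages in a believable range, and each query behind them would execute cleanly in any sandbox.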

This points toward a stronger requirement: not just a safe language, but a strongly typed one - where the type system encodes the semantics of lending computations, not just their data types. The language statically checks the business semantics.
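To illustrate the idea in Python rather than in ABF itself, the sketch below tags each quantity with what it measures and the period it belongs to, and refuses ratios that mix them. The class and function names are hypothetical, and the check here happens at run time; the requirement described above is for the same kind of rule to be enforced statically by a compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Balance:
    """A dollar balance observed in one reporting period."""
    amount: float
    period: str  # e.g. "2024-03"


@dataclass(frozen=True)
class LoanCount:
    """A number of loans observed in one reporting period."""
    n: int
    period: str


def balance_rate(numerator: Balance, denominator: Balance) -> float:
    """Ratio of two balances; refuses mismatched kinds or periods."""
    if not isinstance(numerator, Balance) or not isinstance(denominator, Balance):
        raise TypeError("balance_rate needs a Balance over a Balance")
    if numerator.period != denominator.period:
        raise TypeError(
            f"period mismatch: {numerator.period} vs {denominator.period}"
        )
    if denominator.amount == 0:
        raise ZeroDivisionError("denominator balance is zero")
    return numerator.amount / denominator.amount


delinquent = Balance(50_000.0, "2024-03")
active = Balance(250_000.0, "2024-03")
print(balance_rate(delinquent, active))
```

Dividing a balance by a loan count, or by a balance from another month, now fails loudly instead of producing a plausible number.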

There is a second requirement that matters just as much: auditability by people without deep coding expertise. The analysts, credit officers, and auditors who need to trust these computations are not software engineers.

This is the premise of ABF, a domain-specific language and compiler we have built specifically for loan analytics. Where RestrictedPython narrows Python, ABF is built from scratch around the semantics of what lenders actually compute.

An ABF program is deliberately minimal:

let performing = indexed(balance, status == "current");
let total      = sum(balance);
let cur_bal    = sum(performing);

output sql(cur_bal / total)

A credit analyst can read this. An auditor can review it. There is no hidden state, no control flow to trace, no library code to vet. The program is the computation, stated plainly. Business semantics are directly available as language constructs. This is what we mean when we say verified by design rather than verified after the fact.
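For readers who think in Python, here is one plausible reading of the program above, run on invented rows. It assumes that indexed(balance, cond) selects the balances of rows where cond holds; the precise semantics of indexed and of output sql(...) are not specified in this post, so treat this as an interpretation rather than a definition.

```python
# Illustrative rows only.
rows = [
    {"balance": 150_000.0, "status": "current"},
    {"balance": 50_000.0, "status": "delinquent"},
    {"balance": 50_000.0, "status": "current"},
]

# let performing = indexed(balance, status == "current");
performing = [r["balance"] for r in rows if r["status"] == "current"]

# let total = sum(balance);
total = sum(r["balance"] for r in rows)

# let cur_bal = sum(performing);
cur_bal = sum(performing)

# output sql(cur_bal / total)
current_share = cur_bal / total
print(current_share)
```

Under that reading, the program computes the share of pool balance that is current, and each line of the Python maps onto one line of the ABF source.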

We are putting this compiler at the heart of Alloan, our AI analytics platform for capital markets. The AI your analysts interact with does not write Python and hope for the best. It generates ABF, which the compiler either accepts and executes precisely or rejects outright. No ambiguous outputs. No silent miscalculations. Every answer is traceable to a type-checked, formally grounded program.

If you work with fintech lending data and want AI that you can actually put in front of a credit committee, we would love to show you what this looks like in practice.

Alloan Inc. is building AI-powered analytics for the lending industry. We believe that AI in finance must be explainable, auditable, and - above all - verified.