I built an AI hedge fund simulator. Five agents with investor personas — Buffett, Munger, Ackman, Dalio, Lynch — running on a locally-proxied LLM. Each agent reads fundamentals, prices, news and insider trades for a watchlist of tickers as of a cutoff date. They vote. A CIO agent turns the votes into a portfolio. The backtest compares that portfolio against a benchmark.

After five iterations of tuning, the backtest said: +10.44% portfolio return, +5.05% alpha.

I was drafting the victory post. Then I audited the backtest instead.

This is what the audit found, in order of how much it hurt.

Bug #1: The field that "changes slowly"

Prices were filtered by the cutoff date. News was filtered. Insider trades were filtered.

Fundamentals were not.

The query that fed my three fundamentals-driven agents grabbed the latest row available, ordered by date descending, no cutoff filter. There was even a comment in the code explaining the reasoning: fundamentals change slowly.

# ── Fundamentals (most recent available, fundamentals change slowly) ──
fund_query = (
    select(Fundamentals)
    .where(Fundamentals.stock_id == stock.id)  # filtered by stock only, never by cutoff
    .order_by(Fundamentals.date.desc())        # newest row in the table wins
    .limit(1)
)

So Buffett, Munger and Ackman were reading post-cutoff earnings while pretending to live in the past. Three of five committee seats, contaminated on every single decision.

The lesson generalizes: lookahead bias rarely walks in through the front door. It comes in through the one field you decided was "stable enough" not to filter. If your backtest has one unfiltered data path, it effectively has zero filtered data paths — the leak propagates through every downstream decision.

The fix was boring, which is typical: point-in-time ingestion, rows keyed to reporting period plus a publication lag, and a <= cutoff filter like every other table already had.
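A minimal sketch of that filter, in plain Python rather than the ORM query above. The `FundamentalsRow` type, the `period_end` field and the 45-day lag are assumptions for illustration, not the project's actual schema. The only point is that a row becomes visible when it was published, not when its reporting period ended.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class FundamentalsRow:
    period_end: date  # end of the reporting period the figures describe
    eps: float        # stand-in for any fundamentals field


def fundamentals_as_of(
    rows: list[FundamentalsRow],
    cutoff: date,
    publication_lag: timedelta = timedelta(days=45),  # assumed lag, not a measured one
) -> Optional[FundamentalsRow]:
    """Return the latest row that was already public on the cutoff date."""
    visible = [r for r in rows if r.period_end + publication_lag <= cutoff]
    return max(visible, key=lambda r: r.period_end, default=None)


rows = [
    FundamentalsRow(date(2023, 9, 30), 1.10),
    FundamentalsRow(date(2023, 12, 31), 1.45),  # period ended before the cutoff, published after it
]
picked = fundamentals_as_of(rows, cutoff=date(2024, 1, 1))
```

With a January 1st cutoff the December quarter is excluded even though its period had ended, because under the assumed lag nobody could have read it yet.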

Bug #2: The model had already read the ending

This bug does not exist in classic backtesting, and it's the main reason I'm writing this post.

My agent prompts included the ticker symbol:

prompt = self.config.prompt_template.format(
    ticker=ticker,
    data=json.dumps(scoped, indent=2, default=str),
)

Note the second argument: everything the agent sees — fundamentals, news, insider trades — travels as a single JSON blob. There is no separate "news field" to sanitize. Scrubbing this means parsing and rewriting the entire context object.

The model behind the agents is a hosted alias whose training data extends well past my backtest cutoff. So when "Warren Buffett" evaluates a well-known ticker as of January 1st, the model doing the evaluating has already read what happened to that company after January 1st. Not in the context window — in the weights. No data-layer discipline can filter that out.

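My first instinct was to scrub tickers from the prompts. It's not enough, and understanding why matters more than the fix: the ticker is only one identifier among many. Headlines, company names and sector-specific details in the same blob let the model recognize the company without it.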

Either you scrub the entire textual context — headlines, names, sector-identifying details — or you accept that your backtest is theater. And if you scrub that aggressively, you've also destroyed most of the signal you were trying to test. News-driven agents with anonymized news are reading static.
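To make the cost of full scrubbing concrete, here is a minimal sketch of walking the whole context object. The `scrub` helper, the alias table and the sample `scoped` payload are hypothetical; the real context has more fields and far more ways to leak identity.

```python
import json
import re
from typing import Any


def scrub(value: Any, aliases: dict[str, str]) -> Any:
    """Recursively replace identifying strings in every string of a nested context."""
    if isinstance(value, str):
        # Aliases are applied in order, so list longer names before shorter ones.
        for needle, placeholder in aliases.items():
            value = re.sub(
                re.escape(needle), lambda _m: placeholder, value, flags=re.IGNORECASE
            )
        return value
    if isinstance(value, dict):
        return {key: scrub(item, aliases) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item, aliases) for item in value]
    return value  # numbers, booleans and None pass through untouched


# Hypothetical payload shaped like the JSON blob the agents receive.
scoped = {
    "ticker": "ACME",
    "news": [{"headline": "Acme Corp beats estimates"}],
    "fundamentals": {"revenue": 1200},
}
aliases = {"Acme Corp": "COMPANY_A", "ACME": "TICKER_A"}
cleaned = scrub(scoped, aliases)
prompt_data = json.dumps(cleaned, indent=2, default=str)
```

Even this only handles names you thought to list. The numbers pass through unchanged, and a distinctive revenue figure or headline may still be enough for the model to recognize the company.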

Which leads to the uncomfortable question this project forced on me: can you meaningfully backtest an LLM on any period inside its training window? My current answer is no. The model is a market participant that has read the future. Every backtest metric it produces on data from before its training cutoff carries an unknown, unremovable optimism bias.

Bug #3: One sample, dressed as a track record

The +10.44% came from a single run over a single evaluation window, anchored to the day the script happened to execute.

Temperature was set to 0.0. No seed. No response cache. And temperature zero does not make a hosted LLM deterministic anyway — same prompt, different day, different verdicts.

Now the embarrassing part. The system went through five "improved" versions, V1 to V5. Each iteration tuned prompts, thresholds and vote weights to push the headline number up. The run-to-run variation of this pipeline exceeds 10 percentage points. Which means the entire V1→V5 improvement history is indistinguishable from noise with a changelog attached.

I optimized prompts for five versions against a metric that couldn't tell my changes apart from a coin flip. If you're iterating on an LLM system against a single-run metric, you're not doing engineering. You're doing astrology with version control.

The fix is a multi-run harness: N runs per configuration, report mean, standard deviation and a bootstrap confidence interval — and a hard rule that no metric gets stated without the interval next to it. If the CI of your alpha includes zero, you don't have alpha. You have a number.
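A minimal sketch of such a harness, using only the standard library. The function names, the 20-run default and the percentile bootstrap are my assumptions here, and `fake_backtest` is a noise generator standing in for one full pipeline run, not real results.

```python
import random
import statistics
from typing import Callable


def bootstrap_ci(
    samples: list[float],
    n_resamples: int = 5000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


def summarize_runs(run_once: Callable[[], float], n_runs: int = 20) -> dict:
    """Run one configuration n_runs times and report alpha with its interval."""
    alphas = [run_once() for _ in range(n_runs)]
    lower, upper = bootstrap_ci(alphas)
    return {
        "mean": statistics.fmean(alphas),
        "stdev": statistics.stdev(alphas),
        "ci95": (lower, upper),
        "ci_excludes_zero": lower > 0.0 or upper < 0.0,
    }


# Stand-in for a real backtest run: pure noise, in percentage points of alpha.
noise = random.Random(42)
fake_backtest = lambda: noise.gauss(0.0, 5.0)
report = summarize_runs(fake_backtest)
```

Comparing two prompt versions then means comparing two of these reports, not two single numbers. If their intervals overlap heavily, the changelog entry has not earned the word "improved".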

The smaller sins

The rest of the audit, compressed — none of these is exotic, and every one of them appears in most "LLM picks stocks" repos I've read:

Ingestion pinned to today. Auto-ingest pulled 90 days of data back from the current date, not from the cutoff. Deep historical backtests silently ran on whatever data happened to exist. The system looked like it supported arbitrary cutoffs; in practice only recent ones produced real tests.

Zero transaction costs, and a bonus micro-lookahead. No commissions, no slippage, returns computed close-to-close — with entry at the same close used to make the decision. You cannot trade at the price that triggered your trade.
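One way to remove both problems at once, sketched under assumptions: enter at the bar after the decision, and charge a flat cost per side. The `net_return` function, the next-close entry and the 10 basis points are illustrative choices, not the project's actual execution model.

```python
def net_return(
    closes: list[float],
    decision_index: int,
    exit_index: int,
    cost_bps: float = 10.0,  # assumed cost per side (commission plus slippage)
) -> float:
    """Return from entering one bar AFTER the decision, net of costs on both legs."""
    entry_index = decision_index + 1  # the decision close is not tradable
    if not entry_index < exit_index < len(closes):
        raise ValueError("need at least one bar between entry and exit")
    gross = closes[exit_index] / closes[entry_index] - 1.0
    cost = cost_bps / 10_000
    return (1.0 + gross) * (1.0 - cost) ** 2 - 1.0


closes = [100.0, 102.0, 104.0, 110.0]
naive = closes[3] / closes[0] - 1.0  # entry at the decision close, no costs
realistic = net_return(closes, decision_index=0, exit_index=3)
```

In this toy series the naive figure captures the move from 100 to 102 that happened before any order could have been filled. Over a short evaluation window that gap can be a meaningful share of the reported alpha.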

Post-hoc renormalization. Tickers that returned no data were dropped and the remaining weights renormalized — after returns were already visible. Portfolio adjustment with hindsight, hiding in an error-handling branch.
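A small sketch of the difference, with made-up weights and returns. Treating a ticker with no data as cash at a 0% return is my assumption for a conservative default; validating data availability before the vote would be another option.

```python
def fixed_weight_return(weights: dict[str, float], realized: dict[str, float]) -> float:
    """Weights stay as decided at the cutoff; missing tickers count as cash (0%)."""
    return sum(w * realized.get(ticker, 0.0) for ticker, w in weights.items())


def renormalized_return(weights: dict[str, float], realized: dict[str, float]) -> float:
    """The hindsight version: drop missing tickers, then rescale what is left."""
    kept = {t: w for t, w in weights.items() if t in realized}
    total = sum(kept.values())
    return sum(w / total * realized[t] for t, w in kept.items())


weights = {"AAA": 0.5, "BBB": 0.3, "CCC": 0.2}  # decided at the cutoff
realized = {"AAA": 0.10, "BBB": -0.05}          # CCC returned no data
fixed = fixed_weight_return(weights, realized)
hindsight = renormalized_return(weights, realized)
```

In this example the renormalized number is higher, simply because the surviving positions were scaled up after the fact. The sign of the distortion depends on the data; the problem is that the portfolio being measured is not the one the CIO agent chose.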

No risk adjustment. Raw alpha, no beta correction, no Sharpe, no drawdown. Alpha without risk adjustment is how leverage gets dressed up as skill.
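The missing metrics are each a few lines. This sketch assumes per-period returns, a risk-free rate of zero and daily periods for annualization; the function names and the toy series are mine.

```python
import math
import statistics


def beta(portfolio: list[float], benchmark: list[float]) -> float:
    """Slope of portfolio returns against benchmark returns."""
    mp, mb = statistics.fmean(portfolio), statistics.fmean(benchmark)
    cov = sum((p - mp) * (b - mb) for p, b in zip(portfolio, benchmark))
    var = sum((b - mb) ** 2 for b in benchmark)
    return cov / var


def beta_adjusted_alpha(portfolio: list[float], benchmark: list[float]) -> float:
    """Mean per-period return left over after removing beta times the benchmark."""
    b = beta(portfolio, benchmark)
    return statistics.fmean(portfolio) - b * statistics.fmean(benchmark)


def sharpe(returns: list[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio with the risk-free rate assumed to be zero."""
    return statistics.fmean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)


def max_drawdown(returns: list[float]) -> float:
    """Worst peak-to-trough loss of the compounded equity curve (negative number)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


benchmark = [0.01, -0.02, 0.015, 0.005]
leveraged = [2.0 * r for r in benchmark]  # 2x the benchmark, zero skill
raw_excess = statistics.fmean(leveraged) - statistics.fmean(benchmark)
adjusted = beta_adjusted_alpha(leveraged, benchmark)
```

The toy portfolio is just the benchmark at twice the exposure. Its raw excess return is positive, its beta is 2 and its beta-adjusted alpha is zero, which is the whole argument in three numbers.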

What survives the audit

The fundamentals bug is fixed. The multi-run harness is next. The +5.05% alpha is retired — not because it's necessarily wrong, but because as measured it's unfalsifiable, which is worse.

Bug #2 stays open, and I think it stays open for everyone. The realistic options I see:

  1. Backtest only on data after the model's training cutoff — tiny windows, always trailing, but clean.
  2. Anonymize everything — clean in theory, destroys the signal in practice.
  3. Stop backtesting, evaluate forward-only: paper-trade in real time and let the future arrive on its own.

I'm leaning toward the third. It's slower, and it's the only one where the model provably hasn't read the answer key.

If you're building anything in the "LLM committee trades stocks" genre: audit the boring parts first. The data layer, not the prompts. In my system, every point of fake alpha came from plumbing. The leak is never where the intelligence is.


MIMIR Intelligence — systems that fail in interesting ways, documented before they're polished.