Five Frontier Models Walk Into a Forecasting Council
A few weeks ago I built something I half-jokingly called Geopol-Forecaster — a heavy, Snowglobe-style pipeline in which roughly forty LLM 'actors' role-play a geopolitical crisis across several timesteps, with a stack of lenses layered on top. It produces long, rich outputs. It also burns through API credits at a frankly embarrassing rate, and the bulk of that spend goes into the simulation machinery rather than into the thing I actually wanted out of it: a forecast.
So I started wondering how much of the scaffolding I could tear out before the signal disappeared.
The experiment that came out of that wondering — inspired by Andrej Karpathy's LLM Council — is what this post is about. No agents, no role-played diplomats, no timestep loop. Just a panel of lineage-diverse frontier models given the same grounding brief and asked the same structured question. A forecasting council, not a forecasting world.
LLM Council works together to answer your hardest questions
I ran it on 18 April 2026 against the Iran–Israel–US situation, which was sitting on a cluster of deadlines inside a 96-hour window (active Lebanon ceasefire track, open IRGC harassment in the Strait of Hormuz, visible diplomatic movement in three capitals). Messy enough that models should disagree. Time-sensitive enough that priors matter. The full writeup — all the tables, all the charts, all the raw predictions — is on Hugging Face; this is the shorter, more personal cut.
The setup, briefly
Five models, chosen for lineage diversity rather than leaderboard position: Claude Sonnet 4.6, DeepSeek v3.2, Gemini 3 Flash Preview, GLM 5.1, and Kimi k2.5. Each one was given an identical timestamped SITREP synthesised from RSS (Times of Israel, Al Jazeera, BBC World), Perplexity Sonar, and a Tavily news search, and then asked — via OpenRouter's structured-output schema — for three concrete predictions per horizon (24 hours, 1 week, 1 month), each with supporting reasoning, historical precedent for and against, a 'changemaker' factor, and a confidence score.
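For the curious, the per-prediction shape I asked for looks roughly like this. It's a sketch reconstructed from the description above, so the exact field names and nesting in the repo may differ:

```python
# Sketch of the structured-output schema, one object per member response.
# Field names are my reconstruction, not necessarily those used in the repo.
from typing import Literal

from pydantic import BaseModel, Field


class Prediction(BaseModel):
    statement: str                 # the concrete, checkable prediction
    reasoning: str                 # supporting argument
    precedent_for: str             # historical precedent in favour
    precedent_against: str         # historical precedent against
    changemaker: str               # the factor most likely to flip the outcome
    confidence: float = Field(ge=0.0, le=1.0)


class HorizonForecast(BaseModel):
    horizon: Literal["24_hours", "1_week", "1_month"]
    predictions: list[Prediction]  # three per horizon


class MemberResponse(BaseModel):
    member: str                    # which council model produced this
    horizons: list[HorizonForecast]
```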
A Council_Head (Sonnet 4.6) built the SITREP. A Report_Author (also Sonnet 4.6) clustered the 45 predictions at the end. The PDF is rendered deterministically in Typst, because I wanted the probabilistic stages cleanly separated from the reproducible ones. Code and full outputs are in the repo.
Lean geopolitical forecasting council: 5-model panel, grounded SITREP, structured predictions per horizon, deterministic Typst report
That's the whole pipeline. No role-play. No timestep simulation. No 'let's see what Iran does in week two given what Israel did in week one.' Just: here is the world as of this UTC minute, give me your best-structured guesses.
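For concreteness, the lean loop is a single fan-out: the same SITREP goes to every member, each member returns one MemberResponse against the schema above, and that's it. A minimal sketch, assuming OpenRouter's OpenAI-compatible chat-completions endpoint with JSON-schema structured output; the model slugs are illustrative, not copied from the repo:

```python
import os

import requests

# Illustrative OpenRouter slugs; check the router for the exact identifiers.
COUNCIL = [
    "anthropic/claude-sonnet-4.6",
    "deepseek/deepseek-v3.2",
    "google/gemini-3-flash-preview",
    "z-ai/glm-5.1",
    "moonshotai/kimi-k2.5",
]


def query_member(model_id: str, sitrep: str) -> MemberResponse:
    """One member, one SITREP, one structured forecast. No memory, no turns."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model_id,
            "messages": [
                {"role": "system", "content": "You are one member of a geopolitical forecasting council."},
                {"role": "user", "content": sitrep},
            ],
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": "forecast", "schema": MemberResponse.model_json_schema()},
            },
        },
        timeout=300,
    )
    resp.raise_for_status()
    return MemberResponse.model_validate_json(resp.json()["choices"][0]["message"]["content"])


def run_council(sitrep: str) -> list[MemberResponse]:
    """The entire council run: same brief, five independent answers, nothing iterative."""
    return [query_member(model, sitrep) for model in COUNCIL]
```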
What the council actually agreed on
Seven convergent findings came out. All five models held all seven, which by itself is worth pausing on — these are models with genuinely different training lineages (Anthropic, DeepSeek, Google, Z-AI, Moonshot) and they still landed in the same place on most of the load-bearing questions.
The Lebanon ceasefire is nominal only; further kinetic incidents are assessed as near-certain within 24 hours.
Hormuz stays closed in the short term. IRGC harassment of commercial and naval vessels continues (mean confidence 0.64).
No comprehensive US–Iran framework agreement in 24 hours — at best procedural movement.
The 22 April deadline is a genuine binary inflection point, cascading across Lebanon, Hormuz, and the nuclear track simultaneously.
A comprehensive nuclear deal inside a month is unlikely; the structural gap on enrichment is too wide.
US–Israel alignment is already strained and will deepen as the US pursues a deal Israel opposes.
Economic shock from the dual blockade pushes toward partial maritime de-escalation within a month, even if the nuclear track stays stuck.
Read that list again. If you had told me, before running this, that five frontier models from five different labs would converge this cleanly on a scenario this messy, I would have been sceptical. The convergence itself is a data point about where the models' priors overlap (and arguably about where the open-source news diet overlaps, which is not quite the same thing).
What they disagreed about — and why that's the interesting part
Here is the thesis, arriving where it belongs (mid-piece): the useful output of a forecasting council is not the consensus. It's the map of where the models refuse to agree.
On this run, the most diagnostic split was on the one-month Hormuz question. Given the same SITREP, Claude put 'Hormuz remains partially or fully disrupted all month' at 0.61 confidence. GLM put 'Hormuz reopened via mutual de-escalation' at 0.35. Two frontier models, same brief, opposite forecasts on what is probably the single highest-stakes economic question in the scenario.
That is the kind of divergence I want a council to surface rather than smooth over. A single model gives you a point estimate. An averaged panel gives you a slightly better point estimate. But a panel that shows you its seams tells you something harder to get any other way — that the situation itself has a branching structure, and which branches each model is weighting.
Other splits followed the same pattern. Does the 22 April deadline trigger resumed Israeli operations or a Trump-brokered extension? The members split roughly evenly; no majority for either outcome. Will Iran escalate to 60%-plus enrichment inside a month? Gemini said 0.55; the others declined to predict it at all. Is Hezbollah acting independently or under direct Iranian command? Flagged by members as load-bearing and left explicitly unresolved.
None of those are failures of the method. They are the method working.
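If you want those seams as a ranked artefact rather than prose, the computation is tiny. A sketch, assuming each prediction has been given a topic label by the Report_Author clustering step (the 'topic' field here is hypothetical):

```python
from collections import defaultdict
from statistics import pstdev


def divergence_map(predictions: list[dict]) -> list[tuple[str, float, float]]:
    """Rank topic clusters by how much the members disagree on them.

    Each prediction dict is assumed to carry a 'topic' label (from the
    Report_Author clustering), plus the member's 'confidence'.
    """
    by_topic = defaultdict(list)
    for p in predictions:
        by_topic[p["topic"]].append(p["confidence"])

    rows = [
        (topic, max(confs) - min(confs), pstdev(confs))
        for topic, confs in by_topic.items()
        if len(confs) > 1          # one member's view is not a disagreement
    ]
    # Widest spread first: these are the seams worth reading, not the averages.
    return sorted(rows, key=lambda r: r[1], reverse=True)
```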
The counterintuitive bit
Here is the finding I didn't expect. The spread across models widens at shorter horizons, not longer ones.
At 24 hours, the per-prediction confidence range across the five members was 0.48 to 0.77 (σ 0.10). At one month it compressed to 0.30 to 0.47 (σ 0.06). Models disagreed most about the near term.
This is the opposite of the naive intuition — 'near term is easy, long term is hard' — and once you see it, it makes sense. Active diplomacy plus three hard deadlines inside four days produces a lot of branching, and each model seems to weight those branches differently. Zoom out to a month and most of the branches have collapsed into the same few macro-trajectories, so disagreement narrows. The council, in effect, agrees more about the shape of the future than it does about tomorrow.
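Measuring that is a one-liner per horizon. A sketch over the flattened prediction list, assuming each entry carries its horizon and confidence:

```python
from statistics import pstdev


def spread_by_horizon(predictions: list[dict]) -> dict[str, tuple[float, float, float]]:
    """(min, max, sigma) of member confidences at each horizon.

    On the 18 April run this came out wider at 24 hours (0.48-0.77, sigma 0.10)
    than at one month (0.30-0.47, sigma 0.06).
    """
    out = {}
    for horizon in ("24_hours", "1_week", "1_month"):
        confs = [p["confidence"] for p in predictions if p["horizon"] == horizon]
        out[horizon] = (min(confs), max(confs), pstdev(confs))
    return out
```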
I don't yet know if this is a stable property of LLM forecasting panels or an artefact of this particular scenario. That is exactly the kind of question the follow-up runs are designed to answer.
Are LLMs systematically pessimistic? On this run, no.
There's a common worry — one I half-shared going in — that frontier models over-weight escalation because their training data concentrates there (wars generate more text than calm weeks). So I ran a deliberately coarse keyword heuristic over every prediction, tagging each one as escalatory, conciliatory, or mixed. It catches 'Iran launches strikes' as escalatory and 'framework agreement announced' as conciliatory, and it sends neutral watch-events like 'Oil closes +2%' to mixed, which is probably right.
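The heuristic really is that coarse. Something like this sketch, with keyword lists reconstructed from the examples above rather than copied from the repo:

```python
# Deliberately crude tone tagger; the keyword lists here are illustrative.
ESCALATORY = ("strike", "attack", "launch", "blockade", "closure", "retaliat", "escalat")
CONCILIATORY = ("ceasefire", "agreement", "framework", "de-escalat", "reopen", "talks")


def tone(prediction_text: str) -> str:
    """Tag a prediction as escalatory, conciliatory, or mixed by keyword match."""
    text = prediction_text.lower()
    esc = any(k in text for k in ESCALATORY)
    con = any(k in text for k in CONCILIATORY)
    if esc and not con:
        return "escalatory"    # e.g. 'Iran launches strikes'
    if con and not esc:
        return "conciliatory"  # e.g. 'framework agreement announced'
    return "mixed"             # neutral watch-events ('Oil closes +2%') land here
```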
On this run, at 24 hours, mixed dominated (7 of 15) and conciliatory modestly led escalatory (5 versus 3). By one week and one month the distributions evened out. The models were not uniformly pessimistic — at least on this scenario, on this day, given this SITREP. Per-member, Claude was the least escalatory (1 of 9) and Kimi the most (4 of 9).
Whether that's a property of the models or a property of this run, I genuinely don't know yet. One data point is not a bias. It is, however, enough to stop me from assuming the bias is there.
What I'm taking away
A few things I wasn't sure about going in that I'm now reasonably confident about:
A lineage-diverse council of five frontier models is a dramatically cheaper and faster instrument than a full actor-simulation pipeline, and it doesn't obviously lose the forecasting signal. Most of what Snowglobe-style simulations buy you, on a scenario like this, is narrative richness — not predictive edge.
The interesting output is the disagreement map, not the averaged forecast. Designs that collapse the panel to a single number are throwing away the most useful thing it produced.
LLM-reported confidence scores are self-assessments rather than calibrated probabilities, and should be read comparatively (which member was relatively more confident on this cluster) rather than absolutely. Claude's low-confidence, heavily-hedged predictions were arguably the most epistemically honest outputs in the run.
Short-horizon disagreement may be more informative than long-horizon disagreement, which inverts the naive intuition and has implications for how you'd design a calibration loop against real-world outcomes.
The obvious next move — and the one the design is built for — is result-tracking. The 24-hour horizon has already resolved by the time anyone reads this. The one-week resolves 25 April. The one-month on 18 May. Running Brier scores per model and per panel, over enough runs, is how you test whether any of this is calibrated at all, or whether the council is just a more articulate way of being confidently wrong.
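The scoring itself is the easy part. A sketch, assuming each prediction eventually gets a manual 0/1 'outcome' attached at resolution time:

```python
from collections import defaultdict
from statistics import mean


def brier_scores(resolved: list[dict]) -> dict[str, float]:
    """Mean Brier score per member, plus the panel as a whole.

    Each resolved prediction carries 'member', 'confidence' (the stated
    probability) and 'outcome' (1 if it happened, 0 if it did not).
    Lower is better; always answering 0.5 scores 0.25.
    """
    per_member = defaultdict(list)
    for p in resolved:
        per_member[p["member"]].append((p["confidence"] - p["outcome"]) ** 2)

    scores = {member: mean(errors) for member, errors in per_member.items()}
    scores["panel"] = mean(e for errors in per_member.values() for e in errors)
    return scores
```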
The methodology section, the per-model prediction tables, the convergence/divergence charts, and the full list of watchlist triggers are all in the Hugging Face version of this writeup. The code and raw outputs are in the Geopol-Forecast-Council repo; the older, heavier Snowglobe-style pipeline is in Geopol-Forecaster. If any of this is interesting enough that you want to argue with me about it, please do.