Daniel Rosehill

Geopol Forecaster: An Open-Source AI Geopolitical Prediction Pipeline


I’ve been spending a lot of time lately thinking about what LLMs are actually good at — and whether “geopolitical forecasting” belongs on that list. The honest answer is: I’m not sure yet. But I built something to find out, and the results have been interesting enough that I wanted to share the approach.

The project is called Geopol Forecaster. It’s an open-source pipeline that generates structured geopolitical predictions by running two completely independent analytical stages and then comparing where they agree and (more importantly) where they don’t. The whole thing runs on Claude Sonnet 4.5 via OpenRouter, costs about six to twelve dollars per run, and takes roughly eighteen minutes to produce a chairman’s report, an executive briefing PDF, and a full archival transcript.

Before I get into the mechanics, a quick disclaimer: I’m not a political scientist, intelligence analyst, or anyone with special access to classified information. I’m a technology communications guy who got curious about whether you could build a system that does something more interesting than asking ChatGPT “will there be a war?” and getting back a hedged paragraph of nothing. The bar, in other words, was low.

The intelligence community is already doing this

It’s worth noting that what I’m doing here — using LLMs for structured geopolitical wargaming and synthesised perspective analysis — is not some fringe experiment. The intelligence community and defence establishments of several nation-states are actively exploring this space, and in some cases have been for years.

Johns Hopkins APL established a dedicated GenWar Lab for AI-driven wargaming, integrating LLM-powered tabletop exercises where senior military commanders and civilian leaders engage with AI agents playing adversarial roles. The U.S. Air Force is building WarMatrix, a cloud-based AI sandbox that aims to run wargames at up to 10,000 times real-time speed. The U.S. Army Command and General Staff College ran AI-enabled wargaming exercises in late 2025 that used LLMs to amplify analytical throughput. The State Department has been integrating AI into its strategic games for dynamic scenario generation and stakeholder simulation. RAND has published research on using AI to help wargame participants understand “possible perspectives, perceptions, and calculations of adversaries who are operating with uncertainties and misimpressions” — which is essentially a description of what Stage A of my pipeline does.

CSIS has argued that it’s time to “democratise wargaming” using generative AI — that these tools shouldn’t remain confined to classified environments. I tend to agree. The analytical techniques themselves aren’t secret; what’s classified is the intelligence that feeds them. There’s no reason civilians, journalists, researchers, and policy analysts shouldn’t have access to the same structural frameworks, run against open-source data.

Standing on the shoulders of two very good open-source projects

Geopol Forecaster didn’t emerge from thin air. It’s a composition of two existing projects that each solve half the problem — and that turn out to be remarkably complementary when you wire them together.

Stage A draws from IQTLabs’ snowglobe. IQTLabs is the technology research lab affiliated with In-Q-Tel — the CIA’s venture capital arm, for those unfamiliar (yes, the CIA has a venture capital arm, and yes, they open-source some of their tools). Snowglobe is an open-source geopolitical game engine where persona-driven actors make decisions and a referee narrates the consequences. It was published alongside a peer-reviewed paper and — rather remarkably — featured in a CIA Center for the Study of Intelligence publication in December 2025. The system integrates with the ICB Project’s dataset of 496 historical geopolitical crisis scenarios, providing a library of off-the-shelf geopolitical games.

The architectural pattern is elegant: a Control (the referee) orchestrates a list of Players (actors), each with a persona system prompt, across N turns. After all players commit an action, the referee adjudicates conflicts and updates the shared world state. Actors never see each other’s private reasoning — only the public consequences. What I took from snowglobe is this sealed-off actor reasoning pattern. In my implementation, ten core geopolitical actors (Khamenei, Netanyahu, Trump, the IRGC, Hezbollah, CENTCOM, Mossad, the IDF, Russia, MBS) each receive a detailed persona with historical decision patterns, known red lines, and institutional constraints. They make decisions based only on the referee-authored world state. The referee adjudicates conflicting actions using authority-precedence rules (so if Khamenei and the IRGC disagree, Khamenei’s directive takes precedence, which is how the actual chain of command works).
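The turn loop described above can be sketched in a few lines. This is a minimal illustration of the pattern, not snowglobe's actual API — the class and function names are mine, and the LLM call is stubbed out:

```python
from dataclasses import dataclass

@dataclass
class Actor:
    """One sealed player: sees only the public world state, never peers' private reasoning."""
    name: str
    persona: str    # system prompt: decision patterns, red lines, institutional constraints
    authority: int  # precedence rank for conflict resolution (lower outranks higher)

    def act(self, world_state: str) -> str:
        # In the real pipeline this would be an LLM call with `persona` as the
        # system prompt and the referee-authored `world_state` as sole context.
        return f"{self.name} acts on: {world_state}"

def adjudicate(actions: dict, actors: dict) -> str:
    """Referee step: resolve conflicting directives by authority precedence
    (e.g. Khamenei's directive outranks the IRGC's), then author the next
    public world state that every actor will see on the following turn."""
    winner = min(actions, key=lambda name: actors[name].authority)
    return f"World update driven by {winner}: {actions[winner]}"

def run_turns(actors: dict, initial_state: str, n_turns: int) -> str:
    state = initial_state
    for _ in range(n_turns):
        # All actors commit against the SAME frozen state; private reasoning never leaks.
        actions = {name: actor.act(state) for name, actor in actors.items()}
        state = adjudicate(actions, actors)
    return state
```

The key property is that `act` only ever receives the referee-authored state — there is no channel through which one actor's reasoning can reach another.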

Stage B draws from Andrej Karpathy’s llm-council. Karpathy — formerly of Tesla Autopilot and OpenAI, and one of the more influential figures in the current AI landscape — published an open-source project implementing a deliberation protocol that’s deceptively simple but effective. The idea: instead of asking one model for an answer, you ask multiple models, have them blind-review each other’s responses anonymously, and then synthesise. His implementation gets diversity from routing to different LLM providers (GPT-4, Gemini, Claude, Grok). The 3-stage protocol is: parallel query to every member, blind peer review where each member ranks the others’ anonymised answers (labelled Response A through F), and then a chairman synthesis that reads everything.

I ported the protocol wholesale but changed where the diversity comes from. Instead of different models, I use six analytical lens personas — Neutral, Pessimistic, Optimistic, Blindsides, Probabilistic, and Historical — all running on the same model (Claude Sonnet 4.5). The hypothesis was that prompt-level persona variation could produce genuinely different analytical perspectives without needing model diversity. Based on the probability spread across lenses and the peer review scores, the answer is a qualified yes (the Probabilistic and Historical lenses consistently produce the most calibrated estimates, while the Optimistic lens tends to hedge so much it barely differs from Neutral — a finding that’s interesting in its own right).
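The three stages with prompt-level diversity can be sketched as follows. Again, this is a hand-rolled illustration under my own naming, with the single-model call stubbed — not the actual implementation:

```python
import string

# All six lenses run on the SAME model; diversity comes from the system prompt.
LENSES = ["Neutral", "Pessimistic", "Optimistic",
          "Blindsides", "Probabilistic", "Historical"]

def llm(system: str, prompt: str) -> str:
    # Placeholder for a single-model call (e.g. Claude Sonnet 4.5 via OpenRouter).
    return f"[{system}] answer to: {prompt}"

def council(question: str) -> str:
    # Stage 1: parallel query — every lens answers from identical context.
    answers = {lens: llm(f"You are the {lens} analyst.", question)
               for lens in LENSES}

    # Stage 2: blind peer review — answers are anonymised as
    # "Response A" through "Response F" so no lens can play favourites.
    labelled = dict(zip(string.ascii_uppercase, answers.values()))
    packet = "\n".join(f"Response {k}: {v}" for k, v in labelled.items())
    reviews = {lens: llm(f"You are the {lens} analyst. Rank these answers.", packet)
               for lens in LENSES}

    # Stage 3: chairman synthesis — reads every answer and every review.
    return llm("You are the chairman.",
               f"Question: {question}\nAnswers: {answers}\nReviews: {reviews}")
```

Swapping model diversity for persona diversity means the whole protocol stays on one provider bill and one rate limit, at the cost of whatever correlated blind spots a single model carries.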

Why these two together

Neither component alone is sufficient — and this is the part I find most interesting.

Snowglobe gives you actor-attributed reasoning and empirical probabilities from simulated play, but it has no mechanism for grounding against live real-world data. On its own it drifts into plausible fiction (which is entertaining but not useful for forecasting). The llm-council protocol gives you rigorous multi-perspective deliberation grounded in whatever context you feed it, but it has no independent model of how specific decision-makers actually behave. On its own it collapses into a well-organised version of “ask an LLM what it thinks.”

Composed together — snowglobe’s simulation summary becomes one of llm-council’s inputs — they produce two independent forecasting signals in a single pipeline. Where they converge, confidence is relatively high. Where they diverge, the divergences themselves are diagnostic. In a recent run, the actor simulation predicted an 847-rocket Hezbollah barrage within a month; fresh data showed they’d actually fired 70. That twelve-times overestimate told us the simulation excels at modelling military-operational mechanics but overestimates reconstitution capacity when real-world logistics have been degraded. That diagnostic finding is arguably more valuable than any single probability number the system produces.
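The cross-check that caught the Hezbollah overestimate reduces to a simple ratio test. A toy version (thresholds and wording are my own, purely illustrative):

```python
def divergence(simulated: float, observed: float, tolerance: float = 2.0) -> str:
    """Compare the actor simulation's estimate against fresh real-world data
    and flag divergence beyond `tolerance`x in either direction."""
    if observed == 0:
        return "indeterminate: no observed baseline"
    ratio = simulated / observed
    if ratio > tolerance:
        return f"overestimate x{ratio:.1f}: check reconstitution assumptions"
    if ratio < 1 / tolerance:
        return f"underestimate x{1 / ratio:.1f}: check escalation assumptions"
    return "convergent: higher confidence"
```

Run on the example above (847 predicted rockets vs. 70 observed), this flags a roughly twelve-times overestimate — the diagnostic signal, not the probability number, is the output that matters.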

The technical bits

All LLM calls use the same model (Claude Sonnet 4.5 via OpenRouter). Orchestration uses LangGraph with SQLite checkpointing, so if a run gets interrupted you can resume from the last completed stage. News gathering is a single-pass operation that produces a frozen bundle — every council member reasons from identical context. Past runs are archived in Pinecone for semantic retrieval, so the chairman can reference how previous forecasts played out. A typical run involves around 165 LLM calls across both stages.
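The resume-from-last-stage behaviour is LangGraph's SQLite checkpointer in the real pipeline; the underlying idea is just stage-level memoisation in a database. A stdlib-only sketch of the concept (my own function names, not LangGraph's API):

```python
import json
import sqlite3

def make_checkpointer(path: str = ":memory:") -> sqlite3.Connection:
    """Record each completed stage's output so an interrupted run
    can resume from the last finished stage instead of restarting."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS checkpoints (stage TEXT PRIMARY KEY, output TEXT)")
    return db

def run_pipeline(db: sqlite3.Connection, stages) -> dict:
    """`stages` is an ordered list of (name, fn); each fn receives the dict
    of prior stage outputs. Already-checkpointed stages are skipped."""
    outputs = {name: json.loads(blob)
               for name, blob in db.execute("SELECT stage, output FROM checkpoints")}
    for name, fn in stages:
        if name in outputs:
            continue  # resume: this stage finished in a previous run
        outputs[name] = fn(outputs)
        db.execute("INSERT INTO checkpoints VALUES (?, ?)", (name, json.dumps(outputs[name])))
        db.commit()  # durable after every stage, so a crash loses at most one
    return outputs
```

The frozen news bundle fits the same shape: news gathering is just the first stage, and every later stage reads its checkpointed output rather than re-fetching — which is what guarantees all council members reason from identical context.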

Each run produces a report directory with the chairman’s synthesised forecast, the raw lens answers, the blind peer reviews, frozen news data, the full simulation transcript, and rendered PDFs. Individual forecast runs get published as their own GitHub repositories with all the data, so anyone can audit the full reasoning chain.

I’ll be sharing results from a couple of recent runs in follow-up posts — one on the probability of forced regime change in Iran, and another on the durability of the April 2026 ceasefire. The full code is open source at github.com/danielrosehill/Geopol-Forecaster.


Whether any of this constitutes “good” forecasting in the Tetlock sense is a question I can’t answer yet (the runs are too recent to have been validated against outcomes). But the pipeline reliably surfaces assumptions, identifies blind spots, and produces probability estimates that are at least internally coherent — which puts it ahead of most punditry, if nothing else.


The stack

Geopol Forecaster is built on two key open-source projects and a handful of supporting tools:

Stage A (Actor Simulation): IQTLabs/snowglobe — an open-ended wargaming engine from In-Q-Tel’s research lab, featuring persona-driven actors and referee adjudication. Published alongside a peer-reviewed paper and featured in a CIA Center for the Study of Intelligence publication (December 2025).

IQTLabs/snowglobe — “Open-ended wargames with large language models” (Python, ★ 52)

Stage B (Analytical Council): karpathy/llm-council — Andrej Karpathy’s 3-stage deliberation protocol with parallel query, blind peer review, and chairman synthesis.

karpathy/llm-council — “LLM Council works together to answer your hardest questions” (Python, ★ 16.9k)

LLM: Claude Sonnet 4.5 via OpenRouter (single model, single router — diversity comes from prompt engineering, not model switching)

Orchestration: LangGraph with SQLite checkpointing

langchain-ai/langgraph — “Build resilient language agents as graphs.” (Python, ★ 28.9k)

News: Tavily search + RSS/ISW feeds, frozen into a single shared bundle

Memory: Pinecone vector archive for cross-run semantic retrieval

The full pipeline code is open source: github.com/danielrosehill/Geopol-Forecaster.


Daniel Rosehill

AI developer and technologist specializing in AI systems, workflow orchestration, and automation. Specific interests include agentic AI, workflows, MCP, STT and ASR, and multimodal AI.