Forecasting Record | Core Group

I

The claim

We grade our own forecasts. Here is the record.

This page reconciles the Levant Edition of our Quarterly Forecasting Record (TR-2026-Q1-LEV) against the claims that resolved this quarter inside our home theater. The credited and strict rates are stated as they fell. The calibration that came out of them is shown without correction. The scoring corrections we made along the way are published with the result.

The full claim-by-claim ledger and the per-document scorecard are gated; the bottom-line work is here, in public.

II

The scorecard

Inside our home theater, the record holds.

Of 542 Levant-landing claims this quarter, 272 resolved. Credited accuracy 78.8 percent. Strict 74.3 percent, taken across the 257 binary-scorable claims inside the resolved set. Brier score 0.150 against a climatology benchmark of 0.236. Skill 0.365.

Across all theaters, including out-of-area files, the credited rate is 63.6 percent and skill 0.154.

Levant Edition · the six metrics that anchor every claim downstream.

Source: TR-2026-Q1-LEV, scorecard page. Levant theater only.

Both records, side by side. Same scoring rules, two theaters.

Source: TR-2026-Q1-LEV, page 11. Theater scope as stated in source.

III

Calibration

The middle of the curve was underconfident. The ends overshot in opposite directions.

Claims forecast at 60 to 80 percent confidence resolved at 80 percent. The middle band hit the top of its own range, which is to say we sold the call short.

Claims forecast at the 90 percent band resolved at 79 percent. Slightly past honest. We were too sure of the strong calls.

Claims forecast at 20 to 40 percent confidence resolved at 9 percent. The low band collapsed. Most of what we said was unlikely turned out to be even less likely than the band would suggest, which is its own form of miscalibration.

Stated confidence against what resolved. The diagonal is honest calibration.

Source: TR-2026-Q1-LEV, page 7. Bands as defined in the Record.

IV

Surprise

Three measures, three answers. The least flattering one publishes first.

At the finest grain, against the 73-event list compiled from the same sources we forecast against, surprise was 53 percent. Crediting the retired Daily Tracker as a forecasting layer brings the rate to 29 percent. Read against the coarse like-for-like rubric used in Section 2, surprise is 21 percent.

The three measures answer different questions. The fine-grained list asks whether we saw the day-to-day. The Tracker credit asks whether we counted everything that was actually published. The coarse rubric asks whether the quarter's structural calls held. They all hold. The publishing rule is to ship all three rather than the most favorable one.

Two blind spots are absolute and name themselves. Palestine: 9 of 9 claims missed. Every Palestine-touching forecast this quarter missed. Syria: 5 of 7 missed. Two held; five did not.

The forward fixes ship with the blind spots, not as caveats afterward. A Palestine file owner is now named: the file has a desk, a beat, and a publishing cadence. The daily layer moves inside the accountability perimeter: the retired Tracker resumes under the same scoring rules as the rest of the Record, with surprise counted from the day of publication.

Three measures of surprise, scored against three granularities. Blind spots named.

Source: TR-2026-Q1-LEV, surprise section. The Daily Tracker line is the retired layer now folded back in.

V

Tradecraft audit

The first scoring run was wrong in both directions. The second one earned its result.

The first attempt at scoring this quarter carried two known defects, both found late enough to embarrass us and early enough to fix. The first treated scenario-branch claims as independent bets, double-counting branches that resolved together. The second was a parser fault that mapped any forecast containing the substring "unlikely" to a numeric confidence of 0.70: the opposite end of what the word means.

We rebuilt the scorer from scratch and re-ran the adversarial pass on every verdict the first scorer had touched. Of 109 verdicts re-examined, 39 changed. The page above carries the rebuilt verdicts. Nothing here is first-pass output. No single-pass verdict publishes again.

The exhibits get graded too. Against the ICD 203 visual-information mark, the diagrams on this page score 0.31. That is not a passing reading by the standard's own scale. We publish the mark anyway. It cuts in two directions: the next quarter's Record is measured against this baseline, and the gated appendices already document where the 0.31 comes from line by line.

The audit, summarized. The bookkeeping is in the gated appendices.

Source: TR-2026-Q1-LEV, tradecraft audit section.

VI

Forward lessons

Six forward assessments fall out of this quarter's reads.

Two of the six are named in Section 4: a Palestine file owner with a desk, a beat, and a publishing cadence; the daily layer moved inside the accountability perimeter. They handle the two blind spots the surprise numbers named.

The remaining four are extracted from the calibration pattern and the surprise rates. They name where the next quarter's work concentrates. The full list is in the Record's Forward Lessons section and in the gated per-document appendix. The public page does not paraphrase them. The test we set ourselves is whether they hold next quarter, and paraphrasing them now would let us mark our own homework with looser ink.

VII

Access

The page is the public summary. The bookkeeping is gated.

The full claim-by-claim ledger (Appendix B) and the per-document scorecard (Appendix A) sit behind a token. So does the TR-2026-Q1-LEV PDF itself. The summary here ships without a token because the test we set ourselves does not require gating the result of it.

Request access to the gated material on the catalog counterpart of this page: TR-2026-Q1-LEV abstract.

Every product line we publish is graded against outcomes; this page is where that grading lives.