Methodology Laboratory
Two generations of formula development: 90 forward-derived formulas (v0.1) and 170 backward-induced formulas from 15 historical episodes (v0.2). A public record of what worked, what failed, and why.
These are works in progress. The point of this laboratory is not to claim we can measure civic stress today. It is to show, formula by formula and failure by failure, exactly what it would take to measure it honestly tomorrow.
The fifteen methodologies
Each methodology makes a distinct theoretical bet about the primary driver of civic stress. The T2-Baseline is the most defensible central estimate for each. Five temperature variations explore what happens when the core bet is pushed in different directions.
Sanity check results
Zero clean passes is the correct result for a v0.1 methodology laboratory. The 41 FLAGs are the working queue: each carries a specific, mostly mechanical fix. The 48 FAILs are the map of where the data wall, the 2020 calibration trap, and the circularity problem actually lie — so the next iteration does not walk back into them.
What we learned
Roughly half of the 48 FAILs trace to the same cause: the formula's dominant signal does not exist for the period it must calibrate against. The canonical anchor is 2000–2005, but ACLED US coverage begins ~2020; GDELT 2.0 begins in 2015; Census Household Pulse begins in April 2020; EIG Distressed Communities begins ~2015; and Fed SHED begins in 2013. Any formula that loads its dominant weight on one of these and then claims a 2000–2005 anchor is calibrating against nothing. The fix: either move the lower anchor to a window every input can cover, or restrict the high-frequency engine to the post-2020 era entirely.
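The coverage check described above is mechanical enough to sketch. The following is a minimal illustration, not the laboratory's own tooling: the function names are hypothetical, the start years are copied from the paragraph above, and the approximate ("~") dates are treated as exact for simplicity.

```python
# Hypothetical helper names; start years are taken from the text above.
# Approximate ("~") start dates are treated as exact in this sketch.
SERIES_START_YEAR = {
    "ACLED (US)": 2020,
    "GDELT 2.0": 2015,
    "Census Household Pulse": 2020,
    "EIG Distressed Communities": 2015,
    "Fed SHED": 2013,
}


def uncovered_inputs(inputs, anchor_start):
    """Return the inputs whose series begins after the anchor window opens."""
    return [name for name in inputs if SERIES_START_YEAR[name] > anchor_start]


def earliest_common_anchor(inputs):
    """Earliest year from which every listed input has data."""
    return max(SERIES_START_YEAR[name] for name in inputs)


formula_inputs = ["GDELT 2.0", "Fed SHED"]  # an illustrative formula's inputs
print(uncovered_inputs(formula_inputs, anchor_start=2000))  # ['GDELT 2.0', 'Fed SHED']
print(earliest_common_anchor(formula_inputs))               # 2015
```

A formula with any entry in the first list cannot honestly claim the 2000–2005 anchor; the second function gives the earliest lower anchor that all of its inputs can actually cover.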
A logistic or affine map with two free parameters, fitted to exactly two anchor points, will always hit those two points as long as the raw input takes different values at the two anchors (and, for a logistic, the targets lie strictly inside its range). This is mathematically guaranteed and proves nothing. The honest tests have not been run for any formula in this batch: out-of-sample fit at intermediate years (2008–09, 2014–16), and correlation against ACLED+GDELT across the full series. A two-point fit is a sanity floor, not evidence of skill.
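To make the guarantee concrete, here is a small sketch in which the "raw index" is pure noise, yet a two-parameter affine map still reproduces both anchors. The anchor years (2003 and 2020) and the target scores (20 and 90) are assumptions chosen for illustration; they are not values from the laboratory.

```python
import random


def fit_affine(x_lo, x_hi, y_lo, y_hi):
    """Solve y = a*x + b through two anchor points (requires x_lo != x_hi)."""
    a = (y_hi - y_lo) / (x_hi - x_lo)
    b = y_lo - a * x_lo
    return a, b


random.seed(0)
# A "raw index" with no information in it: one random value per year.
raw = {year: random.random() for year in range(2000, 2021)}

# Hypothetical targets: a calm anchor year scored 20, a peak year scored 90.
a, b = fit_affine(raw[2003], raw[2020], y_lo=20.0, y_hi=90.0)
score = {year: a * x + b for year, x in raw.items()}

# Both anchors are reproduced (up to floating-point rounding) by construction.
print(round(score[2003], 6), round(score[2020], 6))
# Intermediate years are unconstrained by the fit and carry no meaning here.
print(round(score[2009], 1))
```

Because the fit succeeds on noise, hitting the two anchors cannot distinguish a skilled formula from an unskilled one; only the intermediate years can.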
Formula after formula reached for ACLED event counts and GDELT tone as inputs — the freshest, most unrest-relevant high-frequency signals available — then proposed to validate the index by correlating it against the same ACLED + GDELT. Any reported correlation is partially mechanical self-correlation, not predictive skill. The discipline needed: either exclude the ground-truth series from inputs, or hold it out and validate the non-event terms separately.
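The size of the mechanical effect can be seen in a toy simulation. This is a sketch under stated assumptions: the event series and the non-event terms are independent Gaussian noise, and the 50/50 weighting is hypothetical. The non-event terms have no skill by construction, so any correlation between the composite index and the event series comes purely from including that series as an input.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


random.seed(1)
n = 500
events = [random.gauss(0, 1) for _ in range(n)]  # stand-in for the validation target
other = [random.gauss(0, 1) for _ in range(n)]   # non-event terms, unrelated to events

# Composite index that uses the validation target as one of its own inputs.
index = [0.5 * e + 0.5 * o for e, o in zip(events, other)]

r_full = pearson(index, events)     # inflated by self-correlation
r_heldout = pearson(other, events)  # honest test of the non-event terms
print(round(r_full, 2), round(r_heldout, 2))
```

With equal weights and independent terms the first correlation sits near 0.7 in expectation while the held-out one sits near zero, which is why the non-event terms need to be validated separately.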
The 2020 peak was driven by a compound of pandemic, racial-justice protest, and electoral crisis — overwhelmingly non-economic. Several single-channel instruments went the wrong direction: household financial fragility was anomalously low (stimulus + forbearance); necessity-price acceleration was negative (gasoline collapsed); real median wages rose (composition effects from low-wage job losses); eviction filings fell (moratoria). A single-channel index cannot be honestly anchored to a multi-cause peak. These methodologies should each be understood as one tributary, valid for the episodes their channel actually drives.
Polarization as gain, legitimacy as a transfer function, social capital as a brittleness amplifier — these elegant structures repeatedly failed. The gain's data is sparse (annual surveys), so the multiplier is near-static within a year. The gain coefficients are argued, not empirically fit. Applied outside the squashing function, the multiplier pushes output past 100. The multiplicative bet is not wrong in principle — Gurr, PITF, and procedural-justice literature all support the direction. But it needs a third or fourth historical anchor to identify the gain, and the decisive empirical test — does the interaction term carry significant weight beyond main effects? — has not been run.
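The overflow problem is easy to reproduce. The sketch below assumes a logistic squashing function onto a 0–100 scale and uses hypothetical values for the stress term and the gain; it contrasts a multiplier applied outside the bounding function with one applied inside it.

```python
from math import exp


def squash(z):
    """Logistic map onto the open interval (0, 100)."""
    return 100.0 / (1.0 + exp(-z))


def gain_outside(z, gain):
    """Multiplier applied after squashing: can exceed 100 when gain > 1."""
    return gain * squash(z)


def gain_inside(z, gain):
    """Multiplier applied before squashing: always stays within (0, 100)."""
    return squash(gain * z)


z, gain = 2.0, 1.3  # hypothetical stress term and gain
print(round(gain_outside(z, gain), 1))  # 114.5, off the scale
print(round(gain_inside(z, gain), 1))   # 93.1, still bounded
```

One design consequence worth noting: multiplying inside the logistic amplifies distance from the midpoint in both directions, so a gain above 1 lowers scores when the stress term is negative. Whether that behaviour is wanted is a modelling choice this sketch does not settle.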
Even among FAILs and FLAGs, certain components earned the validated rating and should anchor any future ensemble: necessity-weighted buying-power erosion (Stantcheva NBER w32300; food-at-home + energy + renter-shelter, bottom-40% weighted); GSCPI supply-pressure pass-through (NY Fed SR1017: 1-SD = +0.5pp PCE, 4–8 week lead); PITF prior-episode persistence (the strongest empirical regularity in instability forecasting); Burke/Hsiang/Miguel temperature–conflict elasticity (the only climate channel with a peer-reviewed effect size); and Chetty absolute-mobility decline (a validated, replicated structural floor).
v0.2 — Backward induction from historical episodes
The v0.1 laboratory derived formulas forward: start from theory (relative deprivation, institutional legitimacy, supply cascades), specify a formula, then check whether it survives sanity review. Zero passed. The lesson is not that the theories are wrong — it is that forward derivation produces formulas calibrated to the theory rather than to what the data actually shows.
v0.2 reverses the direction. Backward induction starts from 15 documented historical US conflict episodes — events where civic stress demonstrably peaked or was demonstrably suppressed — and works backwards to ask: what indicator patterns, lag structures, and functional forms would a model need to explain what actually happened?
Forward derivation asks: given this theory, what formula follows? Backward induction asks: given what happened in 1968, 1992, 2001, 2008–09, 2011, 2020, and 2021 — and given the striking negative cases where severe shocks produced no street protest — what model structure is implied? The 15 episodes serve as a calibration panel, not just two anchor points. Each induced model must specify which episodes support it and which challenge it. A model that explains 2020 but fails 2001 and 2008–09 is less useful than one that accounts for the suppressors.
The 15 episodes span 1967–2023 and were selected to include both positive cases (unrest that materialized) and critical negative cases (conditions that "should" have produced unrest but did not). The negative cases — 2001 post-9/11 rally-around-the-flag, 2008–09 financial crisis with anomalously low street protest, 2018–19 high-polarization full-employment calm — are as theoretically important as the peaks. A model that cannot account for suppression is incomplete regardless of how well it fits 2020.
Where v0.1 relied primarily on weighted linear sums with multiplicative gains, v0.2 admits the full range of structures implied by the episode data: Vector Autoregression (VAR) for Granger-causal ordering; Threshold Autoregressive (TAR/SETAR) for regime-switching; proportional hazard / survival models for time-to-next-episode; Kalman filter state-space for latent stress estimation; PCA / factor analysis for collinear indicator families; Error Correction Models (ECM) for long-run equilibrium; wavelet decomposition for multi-scale temporal structure; and Bayesian hierarchical models for pooling evidence across episodes. The mathematical structure is chosen to fit the pattern the data implies, not to satisfy theoretical elegance.
The validated components from v0.1 remain as building blocks — necessity-weighted buying-power erosion (Stantcheva NBER w32300), GSCPI supply-pressure pass-through (NY Fed SR1017), PITF prior-episode persistence, Burke/Hsiang/Miguel temperature-conflict elasticity, and Chetty absolute-mobility decline. What changes is how they enter the model: as inputs into structures induced from the episode panel, not as the primary organizing logic of a theory-first formula. The design disciplines that worked in v0.1 — hard caps on secondary channels, zero-weighting speculative components with documentation, applying multipliers inside bounding functions — are also carried forward.
v0.2 — Backward induction results
The v0.2 laboratory reversed the derivation direction. Rather than building formulas from theory and checking them against two anchor points, v0.2 assembled a 15-episode panel of documented US conflict episodes (1967–2023), scored each for peak intensity and primary channel, and derived model structures by asking: what mathematical form is necessary to reproduce this pattern? The backward-induction method shifts the burden of proof — a model earns its weights by being necessary given the data, not by being theoretically plausible.
Seven formulas cleared all four sanity-check criteria: mathematical soundness, full computability from free open data, internal calibration consistency, and absence of circular validation. This is a meaningful improvement over v0.1's zero passes. The pattern in the passes is informative. Each of the seven uses at least one of three devices: Bayesian hierarchical priors (which prevent the two-point calibration trap by construction), proper statistical frameworks with identified parameters (VAR with Granger ordering, survival models with partial likelihood), or citation-locked weights that zero out everything without a published effect size.
BM2-T7 (VAR-Granger IRF-Weighted Hawkes) and BM2-T11 (Loss-Averse Hawkes with GJR-GARCH) pass because they replace the base model's argued exponential kernel with an estimated one — letting the data determine the lag shape and branching ratio rather than imposing them. BM3-T14 and BM7-T14 pass because the Bayesian hierarchical structure is epistemically honest: with 6–9 effective panel events, it is the only statistically defensible approach, and the temperature-14 instruction forces credible intervals rather than false-precision point estimates. BM6-T3 (EWMA Copula) passes by making recency the operative variable — channels are measured against their recent trajectory, not their all-time level. BM9-T1 is the most conservative pass: it zero-weights every channel without a published effect size, eliminating degrees of freedom until what remains is genuinely identified. BM10-T5 passes because it treats the annual-cadence cointegration gain as a frozen background multiplier rather than a fast driver — converting the prior run's near-static-multiplier failure mode into a feature.
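For readers unfamiliar with the terms, the sketch below shows the textbook form of a Hawkes intensity with an exponential kernel, where the kernel integrates to the branching ratio. It is not the BM2 specification: the function name and the parameter values are illustrative assumptions. In an argued kernel the `branching_ratio` and `decay` values are set by hand; in an estimated one they would be fitted to the event data.

```python
from math import exp


def hawkes_intensity(t, event_times, mu, branching_ratio, decay):
    """Conditional intensity at time t for an exponential-kernel Hawkes process.

    lambda(t) = mu + sum over past events of
                branching_ratio * decay * exp(-decay * (t - t_i))

    The kernel integrates to branching_ratio, the expected number of
    follow-on events triggered by each event (below 1 for a stable process).
    """
    excitation = sum(
        branching_ratio * decay * exp(-decay * (t - ti))
        for ti in event_times
        if ti < t
    )
    return mu + excitation


# Hypothetical parameters and event times, for illustration only.
past_events = [0.0, 1.5]
for t in (2.0, 4.0, 8.0):
    print(t, round(hawkes_intensity(t, past_events, mu=0.2,
                                    branching_ratio=0.6, decay=1.0), 4))
```

The intensity spikes after each event and decays back toward the baseline `mu`; the decay rate sets the lag shape and the branching ratio sets how strongly episodes beget further episodes.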
The four negative cases are the most structurally important episodes: E7 (post-9/11 unity, intensity=15), E8 (2008–09 financial crisis, intensity=25), E12 (2018–19 high-polarization full-employment, intensity=28), and E3/E4 (stagflation-era low protest). They falsify three common assumptions: (1) that economic severity linearly predicts street protest; (2) that polarization is a simple additive stress-raiser, when at full employment it may actually suppress collective action; (3) that grievance alone produces unrest, when without organizational infrastructure it stays latent. Any formula that cannot explain why 2008–09 was quieter than 1992 despite far worse macro conditions is misspecified regardless of how well it fits 2020.
Three structural shifts separated the laboratories. First, backward induction from episodes replaced forward derivation from theory — the model structure was constrained by what the data requires, not what the theory suggests. Second, calibration against a 15-episode panel replaced two-point affine rescaling — two-point calibration is unfalsifiable by construction, since any monotone function can be stretched to hit two targets. Third, suppressors became first-class citizens of the model rather than afterthoughts: the negative cases (E7, E8, E12) are now requirements that every formula must explain, not anomalies to be set aside.