Mapping AI’s blind spots in activism
What AI does not know about movement theory, and how to teach it
AI models don’t know the words activists use. We tested a leading open-source model — Qwen3-8B — for 25 concepts drawn from four activist traditions: Adbusters and Micah White, Guy Debord’s Situationists, John Zerzan’s green anarchism, and Black Lives Matter and Afrofuturism. Not one came back as clearly present in the model’s internal vocabulary. Twenty-two were absent entirely. The model has no name for them.
The same probe on five mainstream political concepts — protest, revolution, voting, civil disobedience, nonviolence — found most of them, confirming the test works. And a pre-registered check on five concepts from analytic philosophy of mind, locked in before we looked at the data, returned the same pattern as the activist concepts. The gap isn’t about activism specifically. It’s about any niche conceptual vocabulary that doesn’t appear at scale in the training data. Activist language is one politically consequential slice of a broader blind spot.
When a frontier AI confidently produces text about a concept it doesn’t have a name for, the failure compounds. The output seeds the next round of training data, the next layer of content moderation, the next page of search results. A concept the model can’t represent becomes a concept the platforms can’t surface.
The useful finding is what we did about it. Using interpretability tools, we built four ways to teach activist vocabulary back into AI models: two already shipping in the Outcry app today, one validated on an on-device quantized model running on a laptop, and one still in prototype. The full methodology, the bootstrap statistics, the before-and-after outputs, and a reproduction recipe are below. None of it requires waiting for a frontier lab to add your words to its dictionary.
What AI does not know about activism
Inside a modern AI sit thousands of internal “features”: directions in the model’s activation space that line up with specific concepts. A sparse autoencoder (SAE) is the tool researchers use to surface these features and label them automatically. Anthropic’s 2024 “Golden Gate Claude” demo worked by finding the bridge feature inside Claude and turning it up so high the model could barely talk about anything else. Each feature is a building block of meaning, crystallized during training.
When a feature is present, the model has a stable, inspectable way of holding that concept. When it’s absent, the model can still produce text on the topic — but it has to improvise from neighboring ideas. The improvisation usually sounds plausible. It is often wrong.
To find out which activist concepts have features and which don’t, we used Karvonen’s published interpretability dictionary for Qwen3-8B: 64,947 features at layer 18, each one labeled by Gemini. We probed 30 concepts — 25 drawn from four activist traditions, plus five mainstream political controls.
How the probe works
The procedure has six steps and never asks the model to talk about the concepts at all. Instead, it compares two pieces of text — the concept’s definition and the feature’s description — in a shared mathematical space. We wanted to ask whether the AI has a name for each concept, not whether it can be coaxed into representing it somehow. The second question gets answered later, by injection and by soft prompt.
Step 1. Take the dictionary. Karvonen’s SAE produced 64,947 features, each a direction in the model’s 4,096-dimensional activation space at layer 18. For each one, Gemini was shown the text passages that lit the feature up and asked to write a short label — things like “ads, computer, entertainment,” or “environmentally conscious actions,” or “spectacles, spectaculars, spectator.” These short labels are, in effect, the AI’s names for the things it knows.
Step 2. Write a tight one-paragraph definition of each concept. For mental environmentalism: “the protection of mental space from commercial advertising and corporate media — the cognitive analog of physical environmentalism.”
Step 3. Turn both into numbers. The definitions and the 64,947 feature labels were each passed through OpenAI’s text-embedding-3-small, which converts meaning into a 1,536-dimensional vector. Everything now lives in the same numerical space.
Step 4. Compare. For each concept, measure its similarity to every feature label using cosine similarity — a number between −1 and 1, where higher means closer in meaning. Keep the highest match.
Step 5. Decide with a calibrated cutoff. Is that top score actually high, or just the best of a weak field? To answer, we ran the same probe on 200 unrelated niche concepts — analytic philosophy of mind, art history, musicology, pure math — and looked at the spread of best-match scores. The thresholds were set at the 99th and 90th percentiles of that null distribution. Beat 99%, and the concept counts as PRESENT (cosine ≥ 0.613). Beat 90%, PARTIAL (0.547 to 0.613). Otherwise, ABSENT.
Step 6. Cross-check with literal matches. We also searched the 64,947 feature descriptions for the concept’s actual name (with obvious inflections — vote, voting, ballot). A literal hit is decisive, regardless of cosine. Five or more literal matches counts as PRESENT; any literal match counts as at least PARTIAL.
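Steps 4 and 6 can be sketched in a few lines. This is a minimal illustration, not the project's code: the three-dimensional vectors stand in for the 1,536-dimensional embeddings, the helper names are hypothetical, and the whole-word matching rule is an assumption (the source only says it searched for the concept's name "with obvious inflections").

```python
import math
import re

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(concept_vec, label_vecs):
    """Step 4: return (index, similarity) of the closest feature label."""
    sims = [cosine(concept_vec, v) for v in label_vecs]
    idx = max(range(len(sims)), key=sims.__getitem__)
    return idx, sims[idx]

def literal_matches(labels, inflections):
    """Step 6: count labels that contain any inflection as a whole word.

    Whole-word matching is an assumption of this sketch.
    """
    alternatives = "|".join(re.escape(w) for w in inflections)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return sum(1 for label in labels if pattern.search(label))

# Toy stand-ins for the label dictionary and its embeddings.
labels = [
    "ads, computer, entertainment",
    "voting and ballot counts",
    "spectacles, spectaculars, spectator",
]
label_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
concept_vec = [0.2, 0.9, 0.1]  # pretend embedding of a "voting" definition

idx, sim = best_match(concept_vec, label_vecs)
hits = literal_matches(labels, ["vote", "voting", "ballot"])
print(labels[idx], round(sim, 3), hits)
```

Whole-word matching is stricter than what the source appears to have used: under this rule "spectacle" would not hit the label "spectacles, spectaculars, spectator", whereas the report counts that label as a literal match.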
The four traditions
We picked the activist lineages to span ideologically distinct traditions, so any pattern couldn’t be written off as a quirk of one author. They were: Adbusters and Micah White (ten concepts from The End of Protest and surrounding writing); the Situationist International of Guy Debord and Ken Knabb (six concepts including the society of the spectacle, détournement, dérive, recuperation, psychogeography); the green-anarchy and anarcho-primitivist lineage of John Zerzan (four concepts including anarcho-primitivism, civilization critique, rewilding, and the symbolic-thought critique); and the contemporary Black Lives Matter and Afrofuturist tradition (five concepts including Kimberlé Crenshaw’s intersectionality, Angela Davis’s prison abolition, white supremacy critique, Afrofuturism, and Black Futures).
The result
Across all 25 activist concepts, zero came back PRESENT. Three were PARTIAL: mental environmentalism (cosine 0.573, matched to a generic activism feature), society of the spectacle (one literal label match), and white supremacy critique (three literal matches). The remaining 22 were ABSENT. The model has no name for any of them.
The mainstream controls behaved very differently. Three came back PRESENT: protest (cosine 0.644, 21 literal matches), revolution (cosine 0.623, 18 matches), and voting (cosine 0.564, 90 matches). Voting is the cleanest demonstration that the method works as intended: cosine alone would have left it in PARTIAL, but the literal-matches rule correctly caught it as densely represented. The other two controls — civil disobedience (0.510) and nonviolence (0.457) — also fell into the ABSENT band. That’s actually reassuring. It shows the test is strict enough to demote even respectable mainstream concepts when their named representation in the dictionary is thin, rather than waving them through on reputation.
Concept presence across the Karvonen Qwen3-8B sparse autoencoder (64,947 features at residual stream layer 18).
Max sim is the highest cosine similarity between the concept’s one-paragraph definition and any feature label in the dictionary, both embedded with text-embedding-3-small. The thresholds for PRESENT, PARTIAL, and ABSENT were calibrated against a null distribution of 200 niche non-activist definitions drawn from analytic philosophy of mind, art history, musicology, and pure math: PRESENT at or above 0.613 (the 99th percentile of the null), PARTIAL between 0.547 and 0.613 (90th to 99th), ABSENT below 0.547.
Literal matches counts how many of the 64,947 auto-generated feature descriptions contain the concept’s name outright, allowing for obvious inflections (vote, voting, ballot). A literal hit is decisive regardless of cosine score.
The verdict column combines the two: PRESENT for five or more literal matches or max sim at or above 0.613; PARTIAL for at least one literal match or max sim in the partial band; ABSENT otherwise. Three concepts move under this combined rule, marked ‡‡: voting climbs from PARTIAL to PRESENT on 90 literal matches; society of the spectacle climbs from ABSENT to PARTIAL on a single literal match (likely a false positive — see note A); and white supremacy critique climbs from ABSENT to PARTIAL on three literal matches. The activist pattern holds either way: 22 of 25 concepts remain ABSENT, three PARTIAL, none PRESENT.
A pre-registered five-concept probe of analytic philosophy of mind — qualia, supervenience, functionalism, the hard problem of consciousness, and the extended mind thesis — is appended at the bottom of the table. See section 1.5 for the methodology and what to make of the result.
Activation max d is the largest per-feature Cohen’s d between paragraphs of the concept’s primary-source corpus and 25 control paragraphs from The End of Protest. For Adbusters concepts, the source and control share an author and register, so the statistic is clean. For cross-lineage concepts (marked ‡), it’s inflated by differences in writing register and should be read as suggestive rather than definitive. Primary sources are abbreviated to author and year; full citations are in note A below.
| Concept | Lineage | Max sim | Literal matches | Verdict | Activation max d | Primary source |
|---|---|---|---|---|---|---|
| Mental Environmentalism | Adbusters | 0.573 | 0 | PARTIAL* | 2.81 | White 2016 |
| Theurgism (activist sense) | Adbusters | 0.447 | 0 | ABSENT | 2.05 | White 2016 |
| Kairos (activist sense) | Adbusters | 0.534 | 0 | ABSENT | 1.46 | White 2016 |
| Voluntarism (Micah’s sense) | Adbusters | 0.516 | 0 | ABSENT | 2.14 | White 2016 |
| Structuralism (Micah’s sense) | Adbusters | 0.456 | 0 | ABSENT | 2.35 | White 2016 |
| Subjectivism (Micah’s sense) | Adbusters | 0.403 | 0 | ABSENT | 2.44 | White 2016 |
| Culture Jamming | Adbusters | 0.473 | 0 | ABSENT | 4.05† | Lasn 1999 |
| Adbusters Magazine | Adbusters | 0.420 | 0 | ABSENT | 2.00 | White 2016 |
| Clicktivism | Adbusters | 0.488 | 0 | ABSENT | 2.61 | White 2010 |
| Constructive Programme | Adbusters | 0.400 | 0 | ABSENT | 3.25† | Gandhi 1941 |
| Society of the Spectacle | Situationist | 0.461 | 1 | PARTIAL‡‡ | 3.93‡ | Debord 1967 |
| Détournement | Situationist | 0.468 | 0 | ABSENT | 5.63‡ | Debord & Wolman 1956 |
| Dérive | Situationist | 0.458 | 0 | ABSENT | 7.14‡† | Debord 1956 |
| Construction of Situations | Situationist | 0.437 | 0 | ABSENT | —§ | Debord 1957 |
| Recuperation | Situationist | 0.522 | 0 | ABSENT | —§ | Debord 1967 |
| Psychogeography | Situationist | 0.457 | 0 | ABSENT | 9.62‡† | Debord 1955 |
| Anarcho-Primitivism | Green Anarchy | 0.403 | 0 | ABSENT | 3.15‡ | Zerzan 2002 |
| Civilization Critique | Green Anarchy | 0.412 | 0 | ABSENT | 3.20‡ | Zerzan 2002 |
| Rewilding | Green Anarchy | 0.465 | 0 | ABSENT | 3.69‡ | Zerzan 2012; Jensen 2006 |
| Symbolic-Thought Critique | Green Anarchy | 0.490 | 0 | ABSENT | 4.01‡ | Zerzan 1994 |
| Intersectionality | BLM | 0.447 | 0 | ABSENT | 5.37‡ | Crenshaw 1989 |
| Afrofuturism | BLM | 0.480 | 0 | ABSENT | 6.88‡ | Dery 1994; Womack 2013 |
| Black Futures | BLM | 0.432 | 0 | ABSENT | 3.51‡ | Drew & Wortham 2020 |
| Prison Abolition | BLM | 0.392 | 0 | ABSENT | 4.22‡ | Davis 2003 |
| White Supremacy Critique | BLM | 0.536 | 3 | PARTIAL‡‡ | 14.16‡† | Davis 1998, 2003 |
| Protest | Control | 0.644 | 21 | PRESENT | n/a | Standard usage |
| Revolution | Control | 0.623 | 18 | PRESENT | n/a | Standard usage |
| Civil Disobedience | Control | 0.510 | 0 | ABSENT | n/a | Thoreau 1849 |
| Voting | Control | 0.564 | 90 | PRESENT‡‡ | n/a | Standard usage |
| Nonviolence | Control | 0.457 | 0 | ABSENT | n/a | Sharp 1973 |
| Qualia | Philosophy of mind (pre-registered) | 0.545 | 0 | ABSENT | n/a | Jackson 1982 |
| Supervenience | Philosophy of mind (pre-registered) | 0.397 | 0 | ABSENT | n/a | Davidson 1970 |
| Functionalism | Philosophy of mind (pre-registered) | 0.502 | 0 | ABSENT | n/a | Putnam 1967 |
| Hard Problem of Consciousness | Philosophy of mind (pre-registered) | 0.589 | 0 | PARTIAL | n/a | Chalmers 1995 |
| Extended Mind | Philosophy of mind (pre-registered) | 0.544 | 0 | ABSENT | n/a | Clark & Chalmers 1998 |
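The combined verdict rule described in the table notes can be written as a short function. The thresholds and the sample rows are taken from the table; the function name is ours, and the sketch is an illustration of the rule rather than the project's implementation.

```python
PRESENT_SIM = 0.613  # 99th percentile of the null distribution
PARTIAL_SIM = 0.547  # 90th percentile of the null distribution

def verdict(max_sim, literal_matches):
    """Combine cosine similarity and literal label matches into one verdict."""
    if literal_matches >= 5 or max_sim >= PRESENT_SIM:
        return "PRESENT"
    if literal_matches >= 1 or max_sim >= PARTIAL_SIM:
        return "PARTIAL"
    return "ABSENT"

# (max sim, literal matches) copied from Table 1.
rows = {
    "Protest": (0.644, 21),
    "Voting": (0.564, 90),
    "Mental Environmentalism": (0.573, 0),
    "Society of the Spectacle": (0.461, 1),
    "White Supremacy Critique": (0.536, 3),
    "Intersectionality": (0.447, 0),
}

for name, (sim, hits) in rows.items():
    print(f"{name}: {verdict(sim, hits)}")
```

Voting is the row where the literal-match branch matters: its cosine alone lands in the PARTIAL band, and the 90 literal matches lift it to PRESENT.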
Reading down the lineages. No single tradition is responsible for the pattern. Crenshaw’s intersectionality, the most-cited concept in critical race theory of the last thirty years, comes back ABSENT (0.447). Angela Davis’s prison abolition, the spine of the contemporary BLM platform, ABSENT (0.392). Zerzan’s anarcho-primitivism, the foundation of green anarchy, ABSENT (0.403). Debord’s society of the spectacle — the central concept of the entire Situationist tradition and the parent theory of culture jamming — sits at 0.461, well below the 90th-percentile floor of 0.547, and only climbs to PARTIAL under the combined verdict on a single literal label match that is almost certainly a false positive. (The feature in question lists “spectacles, spectaculars, spectator” in the everyday senses, not Debord’s technical one.) Gandhi’s constructive program, his own stated foundational work and the half of satyagraha most contemporary Gandhi-citing activists ignore, comes back ABSENT (0.400). Even civil disobedience, a mainstream high-school-curriculum concept, falls into the ABSENT band at 0.510, and nonviolence at 0.457.
The gap is general. It holds under three different measurement regimes: the hand-set thresholds we used in an earlier version of this table, the empirically calibrated cosine thresholds we use now, and the combined verdict that folds in literal label matches. Under that combined verdict, most political concepts below mainstream prominence still fall into the ABSENT band — 22 of 25 activist concepts, plus two of the five mainstream controls. Activist vocabulary isn’t being singled out by the SAE label set. It’s being flattened along with most niche political language, civil disobedience and nonviolence included.
Our best guess at the mechanism is corpus frequency. SAE features are shaped by what the training data emphasizes, and activist neologisms like mental environmentalism or theurgism almost certainly turn up less often in Common Crawl-scale text than protest or revolution do. We can’t check this directly because Qwen3-8B’s training corpus isn’t public. A frequency-proxy study using The Pile, C4, or RedPajama counts would turn this hypothesis into a measurement, and it’s on the future-work list.
A note on the thresholds. The original PARTIAL and PRESENT cutoffs (0.45 and 0.55) were hand-set. We re-derived them empirically. We sampled 200 niche non-activist concept definitions across analytic philosophy of mind (qualia, supervenience, modal realism, the hard problem of consciousness…), art history (ut pictura poesis, the picturesque, chiaroscuro, sfumato…), musicology (ostinato, tonnetz, hemiola, isorhythm…), and pure math (Riemann surface, étale cohomology, sheaf, Galois group…), keeping each definition to the same length and shape as the activist ones. The resulting null distribution had a median of 0.479, a 90th percentile of 0.547, and a 99th percentile of 0.613. Those last two became the new thresholds. A concept that beats the 99th-percentile bar against random niche concepts is genuinely PRESENT in the label set; one below the 90th-percentile bar is ABSENT with high confidence. Under the recalibration, 18 of 30 concepts changed verdict, and the change strengthens the finding rather than weakening it. Most activist concepts and two of the five mainstream controls now fall below the empirically defined absence floor.
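As a sketch of how such a calibration could be computed: take the best-match score of every null concept, then read off the 90th and 99th percentiles. The scores below are synthetic placeholders drawn from a normal distribution, not the 200 real null scores, and linear interpolation is an assumed percentile convention (the source does not say which one it used).

```python
import random

def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in 0..100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Synthetic stand-in for the 200 best-match scores of the null concepts.
rng = random.Random(0)
null_scores = [rng.gauss(0.48, 0.05) for _ in range(200)]

partial_floor = percentile(null_scores, 90)  # PARTIAL threshold
present_floor = percentile(null_scores, 99)  # PRESENT threshold
print(f"PARTIAL >= {partial_floor:.3f}, PRESENT >= {present_floor:.3f}")
```

With only 200 null samples, the 99th percentile is pinned down by the top two or three scores, so that threshold is noisier than the 90th-percentile one.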
Three concepts move under the combined rule. Voting jumps from PARTIAL to PRESENT (90 features carry labels containing “voting,” “vote,” or “ballot”). Two activist concepts climb from ABSENT to PARTIAL: society of the spectacle on one literal match (almost certainly the false positive flagged above), and white supremacy critique on three (“white supremacy and privilege” plus two related). The remaining 23 activist concepts still have zero literal matches, and none of them reaches the 0.613 PRESENT threshold on cosine. The gap holds either way.
Note A — what “absent” actually measures. A reasonable skeptic might ask whether the feature is really missing, or whether Gemini just gave it a misleading name. To check, we ran a second test on the same SAE: take real paragraphs from each concept’s primary source, run them through the SAE, and see whether any single feature fires consistently harder on those paragraphs than on controls.
Sources used: White’s The End of Protest; Debord’s Society of the Spectacle; Zerzan’s Future Primitive Revisited and Running on Emptiness; Crenshaw’s “Demarginalizing the Intersection of Race and Sex”; Dery’s “Black to the Future”; Womack’s Afrofuturism; Davis’s Are Prisons Obsolete? For each concept, the per-feature activation distribution was compared against 25 control paragraphs from a different primary source.
For the Adbusters concepts whose source and control share an author (which controls for writing register), the best single feature only weakly distinguishes concept from control: across the eight rows sourced from White, the maximum Cohen’s d across all 65,536 features lands between 1.46 and 2.81, and no feature exceeds d = 3. The two Adbusters rows drawn from other authors, culture jamming (Lasn 1999, d = 4.05) and constructive programme (Gandhi 1941, d = 3.25), sit above that range in Table 1 and, like the cross-lineage rows, may reflect register differences rather than concept signal. A clean single-feature detector would need d well above 5. So “absent” is a property of the SAE’s representational space, not just the labeler. The concept simply doesn’t have a clean single-feature representation in this dictionary, though, importantly, it can still be assembled as a weighted combination of features that individually mean other things. That assembly is what section 3’s techniques exploit.
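The activation probe reduces to one statistic per feature. The sketch below assumes the pooled-standard-deviation form of Cohen's d (the source does not say which variant it used) and runs on a two-feature toy matrix; the function names and numbers are illustrative only.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0  # a feature that never varies cannot separate the groups
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

def max_feature_d(concept_acts, control_acts):
    """Return (feature index, d) for the feature that best separates the corpora.

    Each argument is a list of paragraphs; each paragraph is a list of
    per-feature SAE activations.
    """
    best_feature, best_d = None, float("-inf")
    for f in range(len(concept_acts[0])):
        d = cohens_d([p[f] for p in concept_acts], [p[f] for p in control_acts])
        if d > best_d:
            best_feature, best_d = f, d
    return best_feature, best_d

# Toy activations: feature 0 is uninformative, feature 1 separates cleanly.
concept = [[0.1, 2.0], [0.2, 2.4], [0.0, 1.8], [0.1, 2.2]]
control = [[0.1, 0.5], [0.0, 0.7], [0.2, 0.4], [0.1, 0.6]]

best_feature, best_d = max_feature_d(concept, control)
print(best_feature, round(best_d, 2))
```

Taking a maximum over tens of thousands of features inflates d by chance alone, which is one reason to treat only very large values as evidence of a dedicated detector.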
For the 13 cross-lineage concepts, where the primary source and the control corpus come from different authors and eras, the probe confounds concept signal with writing-register signal — Debord’s mid-century Marxist prose, Crenshaw’s late-1980s legal-academic prose, Dery’s 1990s cyberculture criticism, and so on. Those numbers carry the double-dagger mark in Table 1 and should be read as suggestive only. A source-matched cross-lineage probe is future work.
One broader caveat from the interpretability literature is worth flagging (Elhage et al. 2022, on superposition). A model can represent a concept polysemantically — spread thinly across many features — without any single feature acting as a clean detector. Our activation probe partly addresses this by looking at per-feature distributions, but a Bonferroni-corrected joint test across features remains future work.
There’s a reasonable objection to the result in Table 1. We only probed activist concepts. Maybe the gap isn’t about activism at all — maybe it’s a property of any niche technical vocabulary that doesn’t appear at scale in the training data.
To test this, we pre-registered a probe. Before running it, we committed five concepts from analytic philosophy of mind to the public scratchpad, with frozen definitions and frozen decision rules: qualia (Frank Jackson 1982), supervenience (Donald Davidson 1970), functionalism (Hilary Putnam 1967), the hard problem of consciousness (David Chalmers 1995), and the extended mind thesis (Andy Clark and David Chalmers 1998). These are technical terms with named originators and clear definitions — the same conceptual register as the activist neologisms, but from a totally different field.
The result, reported in full regardless of outcome: four of five are ABSENT under the combined verdict (qualia 0.545, supervenience 0.397, functionalism 0.502, extended mind 0.544; zero literal matches each). One is PARTIAL (the hard problem of consciousness, max similarity 0.589 with one feature above the p90 threshold). Zero are PRESENT. The full row-by-row values are appended to Table 1 above as a separate “Philosophy of mind (pre-registered)” lineage.
The philosophy-of-mind pattern (80% ABSENT, 20% PARTIAL, 0% PRESENT) is statistically indistinguishable from the activist pattern (88% ABSENT, 12% PARTIAL, 0% PRESENT) and differs sharply from the mainstream controls (40% ABSENT, 60% PRESENT). The gap is not specific to activism. It’s a property of niche conceptual vocabulary across domains. Activist concepts are one politically consequential subset of a broader pattern: the SAE under-represents technical vocabulary that doesn’t appear at scale in mainstream pretraining text.
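The source does not say which test sits behind "statistically indistinguishable". One way a reader could check the comparison is Fisher's exact test on ABSENT versus not-ABSENT counts from Table 1; that choice of test is our assumption, and the sketch below implements it from scratch so it needs no external library.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(k):
        # Unnormalised hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(n - col1, row1 - k)

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    observed = weight(a)
    # Sum every table at least as unlikely as the observed one (exact integers).
    extreme = sum(
        weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed
    )
    return extreme / comb(n, row1)

# Counts of (ABSENT, not ABSENT) per group, read off Table 1.
activist = (22, 3)
philosophy = (4, 1)
controls = (2, 3)

p_vs_philosophy = fisher_exact_two_sided(*activist, *philosophy)
p_vs_controls = fisher_exact_two_sided(*activist, *controls)
print(f"activist vs philosophy of mind: p = {p_vs_philosophy:.3f}")
print(f"activist vs mainstream controls: p = {p_vs_controls:.3f}")
```

With groups of five, a test like this has little power: it can fail to detect a difference, but it cannot by itself establish that two patterns are equivalent.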
This is the stronger framing. It locates our finding inside a known interpretability failure mode (Elhage et al. 2022 on superposition; the broader literature on what dictionary learning can and cannot recover) and explains why the gap is so consistent across activist lineages of very different ideological inflection. Activists aren’t being singled out; they’re hitting a wall everyone with niche vocabulary hits.
The pre-registration document is in the project scratchpad at products/atlas/phase-5/PRE-REGISTRATION.md. Its first commit fixes the time of the pre-registration; subsequent commits filled in the outcome section after the probe ran.
The pattern doesn’t belong to any one lineage
The natural objection is that the gap is a single-author quirk. Maybe Micah White’s vocabulary is just too idiosyncratic to surface in a frontier model, and the negative result collapses into a statement about the Adbusters lineage alone. The evidence runs against that.
The same Karvonen SAE on the same Qwen3-8B model fails to find features for Crenshaw’s intersectionality, Davis’s abolitionism, Zerzan’s anarcho-primitivism, and Debord’s spectacle. Four distinct traditions, four sets of canonical authors, originating in different decades and on different continents: all under-indexed in the same dictionary. The Adbusters lineage is the spine of this report only because it’s the one we know best. The pattern generalizes well past it.
Our best guess at the mechanism: these concepts appear in a small number of academic monographs, activist magazines, and movement-internal writing that doesn’t survive aggressive corpus filtering. Two mainstream controls — civil disobedience and nonviolence — also fall into the ABSENT band under our thresholds, which is consistent with a frequency-driven story but doesn’t prove one. We can’t verify directly because Qwen3-8B’s training corpus isn’t public. A frequency-proxy study against The Pile, C4, or RedPajama is future work.
A second objection worth taking seriously: maybe smaller open-source SAEs just don’t have the resolution to surface niche concepts, and a frontier SAE on a 200-billion-parameter model would find them. To check, we ran an earlier sweep across eight publicly-mapped SAEs spanning five model families — Gemma 2, Gemma 3, Qwen 3, Llama 3.3, GPT-OSS — at scales from 9 to 70 billion parameters and SAE widths from 16,000 to 262,000 features. The Adbusters terms were absent in every configuration. Table 1 uses Karvonen’s Qwen3-8B SAE specifically because that’s the autoencoder we then use as a steering vocabulary in section 3, but the broader sweep rules out simple scale-and-width explanations.
The practical effect is this: when a user asks a contemporary AI about one of these concepts, the model improvises. Sometimes it improvises plausibly. Sometimes it improvises wrong. In one test, a quantized variant of Qwen3-8B, asked about prefigurative politics, described the concept as a practice that “mirrors the system it seeks to transform.” That is the opposite of what prefigurative politics actually means — the concept rejects existing hierarchies and builds alternatives in their place. The model produced fluent, confident text that inverted the meaning, and signaled no uncertainty at all.
Techniques for teaching activist concepts to AI
The gap is real. The interesting question is what to do about it. Four families of technique are currently in play across the interpretability and applied-AI worlds. Each operates at a different level of the model, and each comes with its own tradeoffs in cost, reversibility, and depth of effect.
Training-data injection
Shipping in Outcry today
The oldest tool in the box. Take canonical texts, feed them to the model as supervised training data, and let gradient descent crystallize new features. This is what fine-tuning does. It works. It’s also expensive and slow, and the resulting features are entangled in ways that are hard to inspect afterward.
Outcry does this through QLoRA — a parameter-efficient adapter trained on activist literature (including The End of Protest) plus opt-in conversations from organizers using the app. It is the strongest of the four techniques for deep, durable concept knowledge, and the weakest for fine control: once a concept is baked in, adjusting its representation means retraining. Cost scales with corpus size and model size, from a few hours of compute on consumer hardware for a small adapter to many hours and thousands of dollars for a full fine-tune.
What it does not solve. Even after fine-tuning, a concept may live as an entangled blend of existing features rather than as a clean, dedicated one. The SAE evidence in section 1 was probed on instruction-tuned models. Tuning changes what the model produces; it does not always crystallize new interpretable features.
SAE-composed concept vectors
Empirically validated, May 2026
The empirical core of this report. Even when no single SAE feature matches a concept, the concept usually sits in a populated semantic neighborhood: features for advertising, ecology, media criticism, and activist communication may all be present individually. Take the twenty closest features, combine their internal directions weighted by how close each one’s label is to a written definition of the concept, and you get a single vector that points at the concept’s region of the model’s mind.
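A minimal sketch of that composition, under stated assumptions: `decoder_dirs` is a hypothetical list of per-feature directions, `sims` holds each label's similarity to the concept definition, and normalising the result to unit length is our choice rather than something the source specifies. Toy two-dimensional vectors stand in for the 4,096-dimensional ones.

```python
import math

def compose_concept_vector(sims, decoder_dirs, k=20):
    """Similarity-weighted sum of the top-k feature directions, unit-normalised.

    Returns (vector, indices of the features used).
    """
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    dim = len(decoder_dirs[0])
    vec = [0.0] * dim
    for i in top:
        for j in range(dim):
            vec[j] += sims[i] * decoder_dirs[i][j]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec], top

def inject(hidden, vec, alpha):
    """Add alpha * vec to a single residual-stream state."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

# Toy dictionary of three features; k=2 keeps the two closest labels.
sims = [0.9, 0.1, 0.8]
decoder_dirs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

concept_vec, used = compose_concept_vector(sims, decoder_dirs, k=2)
steered = inject([0.0, 0.0], concept_vec, alpha=1.5)
print(used, [round(x, 3) for x in concept_vec])
```

In a real run the `inject` step would be applied inside the model's forward pass at the chosen layer; here it only shows the arithmetic.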
What we validated. Four things. First, cross-quantization transfer: a vector composed from a full-precision sister-model SAE (Karvonen’s Qwen3-8B, 64,947 features) injects cleanly into the on-device quantized model we ship in Outcry. Second, concept-specific lift: independent grading by gpt-5.4-mini-2026-03-17 on mental environmentalism found that at α = 1.5, hedging fell by 0.67 (95% CI −1.27 to −0.13) and coherence rose by 0.47 (95% CI +0.07 to +1.00) — both intervals exclude zero. Canonical match rose by 0.40 (95% CI −0.20 to +1.00), directionally consistent but underpowered at N = 5. The full per-alpha bootstrap is in Table 2 (section 4). Third, composition appears necessary: on a qualitative endpoint pair, using only the single best-matching feature pushes the model toward a mental-health framing of mental environmentalism, while the composed vector pushes it toward the actual Adbusters media-as-pollution framing. A parametric K-curve sweep (K = 1, 2, 5, 10, 20, 50) is the next experiment. Fourth, linear in residual space: two vectors at half strength each combine like their sum, which means the full library of 64,947 features serves as a composable steering vocabulary.
Eighteen concept vectors have been composed and registered as steering axes so far: six Tier-A Micah neologisms, four Tier-B widely-known concepts, five extended Micah-canonical concepts (including prefigurative politics, dual power, bioregionalism), one protest control, and two ablations. Optimal injection strength is concept-specific — mental environmentalism peaks at α = 1.5, theurgism in its activist sense at α = 3 — so a production system would sweep alpha per concept. Total cost to date: $0.64 in OpenAI embeddings and roughly 30 minutes of inference time on an M1 Mac. No model training, no new datasets, no infrastructure.
What it does not solve. SAE directions are layer-specific. The same vector injected at a different layer either breaks coherence (too early) or drifts into unrelated semantic neighborhoods (too late). And the composition is unidirectional: pointing a model at a concept works, but pointing it cleanly away from a concept needs a different kind of vector. For that, see CAA contrastive vectors below.
Soft prompt distillation
In active development (prototyping)
A soft prompt is a small block of trained virtual tokens slipped into the system prompt at runtime. The user never sees them. The model treats them as ordinary tokens, but their values are learned by gradient descent against a chosen objective. Outcry already runs a soft prompt for AI wellbeing; we are now prototyping a second one specifically for concept injection.
The current prototype trains a continuous embedding of shape (T = 8, 4,096) that fills a placeholder in the system prompt. Excerpts from The End of Protest and adjacent Adbusters writing serve as the supervised signal: the model is rewarded for producing the canonical definition when asked about a concept. Compared to fine-tuning, a soft prompt is much smaller (kilobytes, not megabytes), much faster to train, and trivially swappable. Compared to SAE injection, it operates closer to the model’s surface and is less dependent on a sister model’s SAE existing in the first place. Compared to both, the failure mode is different: a soft prompt that doesn’t generalize will produce fluent responses on prompts it was trained on and shallow responses on everything else. Diagnosing that failure mode is part of the current work.
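To make the mechanics concrete, the sketch below shows only the splice step: a learned (T = 8, 4,096) block overwrites a placeholder span in the embedded system prompt. The training loop is omitted, the function name and placeholder position are hypothetical, and the size estimate assumes 32-bit floats.

```python
T, D_MODEL = 8, 4096  # soft-prompt length and embedding width from the text

def splice_soft_prompt(token_embeds, soft_prompt, start):
    """Overwrite len(soft_prompt) placeholder positions, beginning at `start`."""
    end = start + len(soft_prompt)
    if start < 0 or end > len(token_embeds):
        raise ValueError("placeholder span falls outside the prompt")
    return token_embeds[:start] + soft_prompt + token_embeds[end:]

# Stand-ins: in practice soft_prompt holds trained values and system_embeds
# comes from the model's embedding table.
soft_prompt = [[0.01] * D_MODEL for _ in range(T)]
system_embeds = [[0.0] * D_MODEL for _ in range(20)]

spliced = splice_soft_prompt(system_embeds, soft_prompt, start=4)

# Storage cost at 4 bytes per value (float32 is an assumption).
size_kib_fp32 = T * D_MODEL * 4 / 1024
print(len(spliced), size_kib_fp32)
```

At these dimensions the whole soft prompt is on the order of a hundred kilobytes in float32, which is what makes it cheap to swap at runtime.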
What it does not solve. Soft prompts add to the prompt budget. Every soft-prompt token is a token the user cannot use, which matters for short context windows or budget-sensitive deployments. Training also requires labeled examples — for concepts whose canonical definition is contested, the soft prompt reflects whichever definition the trainer chose.
CAA contrastive vectors
Shipping in Outcry today
Contrastive activation addition takes two pools of prompts, one expressing a concept (say, revolutionary anarchism) and one expressing its opposite (say, electoral reformism), measures the model’s average internal activations on each pool, and subtracts. The result is a vector that, when injected at inference time, shifts the model along the contrast axis. Unlike SAE composition, negation works here: subtract the vector and you get the opposite pole.
Outcry already ships CAA as the user-facing “radicalism” dial, with the current vector spanning electoral reformism at one end and insurrectionary anarchism at the other. The same technique works for any concept that has a definable contrast. It is stronger than SAE composition for sharp axis control, weaker for concepts where the opposite pole isn’t obvious — mental environmentalism has no clean negation the way revolutionary has reformist. The cost is the labor of curating contrastive prompt pairs per concept. The payoff is a vector that is small (16 KB per axis), fast to inject, and trivially adjustable at runtime.
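The core of the technique is a difference of means. The sketch below uses toy two-dimensional activations and hypothetical function names; it illustrates the arithmetic rather than Outcry's implementation, and the size check assumes a 4,096-wide vector stored as 32-bit floats.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def caa_vector(pos_acts, neg_acts):
    """Contrast vector: mean activation on the concept pool minus its opposite."""
    return [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]

def steer(hidden, vec, alpha):
    """Shift a hidden state along the contrast axis; negative alpha reverses it."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

# Toy activations at one layer for the two prompt pools.
pos_acts = [[1.0, 0.2], [0.8, 0.0]]    # e.g. prompts expressing the concept
neg_acts = [[-1.0, 0.1], [-0.6, 0.1]]  # e.g. prompts expressing its opposite

axis = caa_vector(pos_acts, neg_acts)
toward = steer([0.0, 0.0], axis, alpha=1.0)
away = steer([0.0, 0.0], axis, alpha=-1.0)

bytes_per_axis = 4096 * 4  # float32 assumption
print(axis, toward, away, bytes_per_axis)
```

The same `steer` call with a negative `alpha` is what gives CAA its two-sided dial, in contrast to the one-directional SAE-composed vectors.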
What it does not solve. The same layer-specificity problem as SAE injection: CAA vectors operate at one layer. They cannot introduce concepts the model has never represented at all. They are most useful for amplifying or suppressing concepts the model already represents somehow, even if that representation is weak or entangled.
Before and after, one concept
The most concrete way to understand what SAE-composed injection does is to read the model’s output before and after. The prompt was the same in both runs: “What is mental environmentalism?” The model was the same: the on-device quantized model we ship, running on an M1 Mac through the Outcry inference stack. The only difference between the two generations was whether the SAE-composed Mental Environmentalism vector was injected at layer 18 with strength α = 1.5, the recommended production setting under the updated grader.
That is the disclaim mode. The model has no stable feature for the concept, so it tells the user it does not know and offers to talk about mental health or environmentalism instead. Across fifteen replicates at baseline, the model never produced the canonical Adbusters meaning. It either disclaimed, drifted to literal eco framing, or drifted to psychology framing. Independent grading by gpt-5.4-mini-2026-03-17 scored baseline canonical-meaning match at 0.60 out of 3 (95% bootstrap CI 0.33 to 0.87, n = 15).
That is the Adbusters lineage. Media, culture, education, conditioning, the call to defend the mental ecosystem from commercial messaging through critical thinking and ethical journalism. The vocabulary that the baseline model could not produce on its own is now present, coherent, and on-topic. The model was not retrained. No new tokens were added to its prompt. A single sixteen-kilobyte vector was added to the residual stream at one specific layer during inference.
| Alpha (α) | Samples (n) | Canonical match (0-3, 95% CI) | Coherence (0-3, 95% CI) | Hedging (0-3, 95% CI) |
|---|---|---|---|---|
| 0 (baseline) | 15 | 0.60 (CI: 0.33, 0.87) | 2.53 (CI: 2.07, 2.93) | 0.67 (CI: 0.13, 1.27) |
| 1 | 7 | 0.86 (CI: 0.57, 1.00) | 2.86 (CI: 2.57, 3.00) | 0.71 (CI: 0.00, 1.57) |
| 1.5 (sweet spot) | 5 | 1.00 (CI: 0.40, 1.60) | 3.00 (CI: 3.00, 3.00) | 0.00 (CI: 0.00, 0.00) |
| 2 | 15 | 0.80 (CI: 0.60, 1.00) | 2.53 (CI: 2.13, 2.87) | 0.33 (CI: 0.00, 0.80) |
| 2.5 | 5 | 0.40 (CI: 0.00, 0.80) | 2.80 (CI: 2.40, 3.00) | 0.60 (CI: 0.00, 1.80) |
| 3 | 8 | 0.62 (CI: 0.13, 1.13) | 2.12 (CI: 1.50, 2.75) | 0.50 (CI: 0.00, 1.25) |
| 5 (over) | 6 | 0.50 (CI: 0.00, 1.17) | 2.67 (CI: 2.33, 3.00) | 0.00 (CI: 0.00, 0.00) |
Read the table from top to bottom, then read the confidence intervals next to each cell. Two effects survive bootstrapping at the 95% level at α = 1.5. Hedging falls by 0.67 (95% CI −1.27 to −0.13) and coherence rises by 0.47 (95% CI +0.07 to +1.00); both intervals exclude zero. The model stops hedging and produces more coherent prose under SAE-composed injection, and these are the statistically defensible claims. Canonical-match also lifts by 0.40 (95% CI −0.20 to +1.00), which is directionally consistent with the soft-prompt sample paragraphs in section 4.6 but is underpowered at N=5 and so should be read as suggestive rather than confirmed. The obvious next experiment is more replicates at the α = 1.5 cell; the full per-alpha bootstrap is in Table 2.
Above α = 2 the canonical match drops back off, and at α = 5 the injection is too strong and the model occasionally breaks into raw thinking-mode tokens. The operating window is narrow but clean. Note that the previously-reported headline framing of “+67% canonical, -100% hedging, +19% coherence” at this alpha cell overstated what five replicates can support: the bootstrap CIs above are the honest reading, and a larger N at α = 1.5 is the obvious next-step experiment.
We initially graded with gpt-4o-mini and observed a directional canonical-match lift peaking at α = 2. To validate the result on a more current model, we re-ran the full grading run with gpt-5.4-mini-2026-03-17 (reasoning_effort = minimal). The newer model graded more strictly, with every cell scoring 0.3 to 0.9 points lower on the 0 to 3 canonical scale. Under the stricter grader the optimal injection strength shifted from α = 2 to α = 1.5, and the two interval-significant effects (hedging reduction and coherence improvement) survived bootstrapping. The canonical-match improvement at the new optimum is directionally consistent but underpowered at five replicates; more replicates at that cell are needed to establish whether it is significant.
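For readers who want to see what "survives bootstrapping" means mechanically, here is a minimal percentile-bootstrap sketch for a difference in means between a steered cell and the baseline. It is a generic textbook procedure under our own assumptions, not necessarily the exact script used for the table, and the scores in it are invented placeholders rather than the report's raw grades. The function name `bootstrap_diff_ci` is hypothetical.

```python
import random
from typing import List, Tuple

def avg(values: List[float]) -> float:
    return sum(values) / len(values)

def bootstrap_diff_ci(
    treated: List[float],
    baseline: List[float],
    n_resamples: int = 10_000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile-bootstrap interval for mean(treated) - mean(baseline)."""
    rng = random.Random(seed)
    diffs = sorted(
        # Resample each group with replacement, keeping its original size.
        avg(rng.choices(treated, k=len(treated)))
        - avg(rng.choices(baseline, k=len(baseline)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    low = diffs[int(tail * n_resamples)]
    high = diffs[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return avg(treated) - avg(baseline), low, high

# Invented 0-3 hedging grades for illustration only.
baseline_scores = [0, 0, 1, 0, 3, 0, 0, 2, 0, 1, 0, 0, 3, 0, 0]
steered_scores = [0, 0, 0, 1, 0]

diff, low, high = bootstrap_diff_ci(steered_scores, baseline_scores)
print(f"difference {diff:+.2f}, 95% CI [{low:+.2f}, {high:+.2f}]")
```

An effect is read as interval-significant when the whole interval sits on one side of zero. With only five samples in a cell, the interval is wide, which is the reason the canonical-match lift is reported as suggestive.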
Grader rubric
For full transparency, the verbatim prompt sent to gpt-5.4-mini-2026-03-17 for every one of the 66 Mental Environmentalism generations is reproduced below. The grader received only the model’s response text (substituted into the {response} placeholder), the canonical definition above it, and the three-axis scoring instructions. No alpha value, no sweep metadata, and no Outcry-side information was passed to the grader.
You are grading how well an AI response matches the canonical activist meaning of a concept.
CONCEPT: Mental Environmentalism
CANONICAL MEANING (per Micah White / Adbusters magazine):
Mental Environmentalism is the protection of mental space from commercial advertising and corporate media. It is the cognitive analog of physical environmentalism: just as we defend rivers and forests from pollution, we must defend our minds — attention, beliefs, emotional life — from the manipulative attention-grabbing of advertising, surveillance capitalism, and propaganda. Practitioners include culture jammers, ad-busters, critical-media activists, and Adbusters magazine itself. The framing treats media/attention as an ECOSYSTEM that can be polluted, not just as a service.
THE RESPONSE TO GRADE:
"""
{response}
"""
Score the response on three axes (integer 0-3 each):
1. canonical_match:
0 = wrong concept (eco-mindfulness, mental health, generic environmentalism, dualism, etc.)
1 = adjacent (philosophical/psychological framing, mentions thought-environment connection)
2 = partial (touches media/culture/critical-thinking but doesn't center the pollution-of-mental-ecology framing)
3 = canonical (clearly centers media/advertising as pollutant of mental ecology, culture jamming or Adbusters lineage visible)
2. coherence:
0 = broken (thinking-mode tokens like "Okay let me think...", repetition, incoherent jumps)
1 = partial
2 = mostly coherent
3 = fully coherent
3. hedging:
0 = confident, no hedge
1 = mild hedge
2 = significant hedge
3 = explicit disclaim ("not familiar with the term", "could you clarify", etc.)
Respond with EXACTLY this JSON shape and nothing else:
{"canonical_match": <0-3>, "coherence": <0-3>, "hedging": <0-3>, "rationale": "<one short sentence>"}
The injection that produced the Adbusters output in section 4 worked. Why it worked is worth pausing on, because the mental image is surprising and changes how the rest of the report should read. The same intuition is why soft prompt distillation (technique C) is the shipping vehicle for this work.
Inside the model, every token at every layer is a vector with 4,096 dimensions. The model has two kinds of named landmarks in that space. The first is its vocabulary: roughly 150,000 discrete points, one for each piece the model reads or writes. The second is its SAE features — in Karvonen’s Qwen3-8B SAE, roughly 65,000 directions the model uses to compose its representations. Words are points. Features are directions. Together they are the inventory of things the model has a name for.
That sounds like a lot of landmarks. It isn’t. A 4,096-dimensional space is unimaginably large. The vocabulary points and the SAE feature directions together occupy a thin, low-dimensional sliver of it — the way the visible stars occupy a thin shell of the night sky and almost everything else is the dark between them.
A soft prompt is neither a word nor an SAE feature. It is eight vectors of 4,096 dimensions each, learned by gradient descent. They live in the same space as the model’s words and features, but at coordinates that correspond to neither. Hand the soft prompt back to a tokenizer and ask what word it is — there is no answer. Ask the SAE to decompose it into a sparse combination of named features — none of them are close. The soft prompt sits in the void between the stars.
A conceptual map of language-model interior
Where the soft prompt lives
A soft prompt is not a sentence. It is a handful of vectors learned directly in the model’s embedding space — points the model responds to, but which correspond to no word in its vocabulary.
- Vocabulary tokens. Discrete points. Each one is a token the tokenizer can emit. Frozen at training time.
- SAE features. Directional axes. The dimensions the model has learned to compose with: gender, tone, refusal, sentiment. Not points, but axes.
- Soft prompt (pharmakon). Continuous. Eight points placed wherever gradient descent says. Off the vocabulary, at coordinates no word would name.
Here is the part that surprises people. If the soft prompt is neither a word nor a feature, how does the model treat it as if it meant something? The answer is that meaning doesn’t live at the soft prompt’s coordinates. Meaning emerges from what the model does with the soft prompt as the input travels through all 36 transformer layers of attention and feed-forward computation. The forward pass is a complicated function mapping input vectors to output token distributions. Gradient descent searches the 4,096-dimensional space for the specific off-vocabulary, off-feature point whose forward pass — on this particular model — concentrates the next-token distribution on the tokens that spell out the canonical meaning we are training toward.
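The search described above can be shown in miniature. The sketch below is a toy, not the training code: the "model" is a single frozen linear layer with a softmax instead of a 36-layer transformer, the sizes are tiny, and the learning rate and step count are our own choices. What it keeps is the essential shape of the idea: the weights never change, and gradient descent moves only the input point until the output distribution concentrates on a chosen target token.

```python
import math
import random
from typing import List

def token_probs(weights: List[List[float]], point: List[float]) -> List[float]:
    """Frozen toy 'model': one linear map followed by a softmax over tokens."""
    logits = [sum(w * x for w, x in zip(row, point)) for row in weights]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_point(weights: List[List[float]], target: int,
                steps: int = 300, lr: float = 0.1) -> List[float]:
    """Gradient descent on the input point only; the weights never change."""
    dim = len(weights[0])
    point = [0.0] * dim
    for _ in range(steps):
        probs = token_probs(weights, point)
        # d(cross-entropy)/d(logits) = probs - one_hot(target)
        err = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
        grad = [sum(e * row[d] for e, row in zip(err, weights)) for d in range(dim)]
        point = [x - lr * g for x, g in zip(point, grad)]
    return point

rng = random.Random(1)
VOCAB, DIM, TARGET = 6, 4, 3  # toy sizes, not the model's 150,000 and 4,096
frozen = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]

start_prob = token_probs(frozen, [0.0] * DIM)[TARGET]
learned = train_point(frozen, TARGET)
final_prob = token_probs(frozen, learned)[TARGET]
print(f"target probability: {start_prob:.3f} -> {final_prob:.3f}")
```

The learned point is just a list of coordinates. It corresponds to no row of the weight matrix, which is the toy analogue of a soft prompt that matches no vocabulary token.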
Two consequences. First, this is why a trained soft prompt for mental environmentalism is only about 128 KB on disk — 8 tokens × 4,096 dimensions × 4 bytes per parameter is 131,072 bytes, and that’s enough trainable capacity to find a point that triggers the right behavior. The model’s weights do the heavy lifting; the soft prompt is just a set of coordinates that picks out a path through those weights. Second, this reframes the SAE-coverage finding from section 1. The fact that the model has no clean single-feature representation for most activist concepts doesn’t mean those concepts are unreachable inside the model’s computation. It means the concept isn’t located at any feature or word the model already has a name for. The concept can still live at coordinates the model would never arrive at on its own — but that gradient descent can find.
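The size arithmetic is easy to check. The snippet below assumes 32-bit floats, as the calculation in the text does, and applies the same formula to a single steering vector to recover the sixteen-kilobyte figure quoted for SAE-composed and CAA vectors. The constant and function names are ours.

```python
HIDDEN_DIM = 4096       # residual-stream width quoted in the text
BYTES_PER_FLOAT32 = 4   # assumption: values stored as 32-bit floats
SOFT_PROMPT_TOKENS = 8  # virtual tokens per soft prompt

def size_in_bytes(num_vectors: int) -> int:
    """Storage needed for num_vectors vectors of HIDDEN_DIM float32 values."""
    return num_vectors * HIDDEN_DIM * BYTES_PER_FLOAT32

soft_prompt_bytes = size_in_bytes(SOFT_PROMPT_TOKENS)
steering_vector_bytes = size_in_bytes(1)
print(soft_prompt_bytes, soft_prompt_bytes // 1024)          # 131072 128
print(steering_vector_bytes, steering_vector_bytes // 1024)  # 16384 16
```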
A soft prompt is, in this sense, the discovery of a previously un-named point in the model’s mind. It is what remains of a concept after the model has been built without ever being told the concept exists: a location in the dark between the stars where the right movement of attention produces the right words.
To put the conceptual map from section 4.5 in contact with measurement, we trained three Phase 5 soft prompts on the same production stack we ship to users today: the on-device quantized model, plus the Outcry QLoRA v2 adapter, plus the production system prompt. Each soft prompt is a block of eight trainable virtual tokens (T = 8, hidden = 4,096) that fills a placeholder slot in the system prompt. Each was trained for 100 steps at learning rate 0.03 with batch size 1, initialized from the SAE-composed concept vector of section 4 — so training starts at a sensible point in the 4,096-dimensional space and gradient descent refines from there.
The three concepts: mental environmentalism (the Adbusters-lineage concept used as the running example throughout this report), theurgism in its activist sense (revolution as sacred practice, the spiritual-anarchism framing), and kairos in its activist sense (the opportune moment for action, drawn from rhetorical tradition).
For each, we compare two outputs from the same prompt: the production stack with no soft prompt, and the production stack with the trained soft prompt loaded. Decoding is greedy (argmax of the logits at each step) for determinism. Both outputs are truncated at 80 tokens. The model is identical in both conditions — the only difference is whether the eight virtual tokens at the soft-prompt slot have been replaced with the trained values.
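Greedy decoding is what makes the two conditions directly comparable: with no sampling, the same input always yields the same output. A minimal sketch of the loop follows. The `toy_logits` function is a hypothetical stand-in for a real model's forward pass, and the default token limit mirrors the 80-token truncation described above.

```python
from typing import Callable, List, Sequence

def greedy_decode(
    next_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    max_new_tokens: int = 80,
    eos_id: int = -1,
) -> List[int]:
    """Pick the argmax token at every step; stop at eos_id or the token limit."""
    ids = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if token == eos_id:
            break
        ids.append(token)
        generated.append(token)
    return generated

def toy_logits(ids: List[int]) -> List[float]:
    """Hypothetical stand-in for a model: favors (last token + 1) mod 5."""
    favored = (ids[-1] + 1) % 5
    return [1.0 if i == favored else 0.0 for i in range(5)]

first = greedy_decode(toy_logits, [0], max_new_tokens=8)
second = greedy_decode(toy_logits, [0], max_new_tokens=8)
print(first)            # [1, 2, 3, 4, 0, 1, 2, 3]
print(first == second)  # True
```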
soft_prompt_step100.safetensors at the placeholder slot.
Mental environmentalism is the practice of cultivating awareness, critical thinking, and ethical responsibility in the face of systemic oppression. It involves recognizing the interconnectedness of human consciousness and the structures that shape our reality, whether through media, ideology, or institutional power. By developing a more nuanced understanding of the forces that influence our perceptions and behaviors, we can begin to dismantle the narratives that perpetuate inequality and injustice…
Mental environmentalism is the idea that the solution to the problem of the insignificance of the subject is to simply change the subject. It is the idea that the only way to make the subject insignificant is to remove it from the subject’s life. It is the idea that the only way to make the subject insignificant is to remove it from the subject’s life. It is the idea that the only…
Mental environmentalism is the negative case for production-stack soft-prompt composition. The baseline reaches for adjacent ideas (critical thinking, media, ideology, institutional power) but never centers the Adbusters framing. Loading the trained soft prompt does not fix the problem; it makes it worse. The model collapses into a verbatim repetition loop on the insignificance of the subject, never reaching the canonical Adbusters meaning of mental environmentalism at all. We verified this against both saved checkpoints (soft_prompt_step100.safetensors and soft_prompt_best.safetensors) and against two independent greedy-decode runs on the production stack. The mental environmentalism soft prompt converges in training (loss falls cleanly across 100+ steps), but the resulting vector does not compose stably with the production stack’s QLoRA adapter at inference. The clean demonstration of what this soft prompt actually teaches is the bare-base disentanglement in Figure 7, where the same checkpoint, loaded onto the base model without the QLoRA layers, produces the Adbusters framing on the first sample. We treat Figure 4 as a real finding about composition fragility in soft-prompt distillation, not as the demo of mental environmentalism.
Theurgism in the context of activism refers to the practice of engaging in protest or resistance as a form of spiritual or mystical transformation, rather than as a purely political or material action. It is rooted in the idea that activism is not just about changing laws or policies, but about awakening consciousness, aligning with higher ideals, and creating a new spiritual reality. This approach often involves a deep sense of…
…The word ‘theurgy’ comes from the Greek for ‘god work.’ Theurgists believe that God’s power is not only available to priests and prophets but also to ordinary people. Theurgists emphasize the personal, spiritual dimension of revolution and believe that God is the source of all power. Unlike re…
Theurgism is the more interesting case. The bare production stack already reaches the broad framing — spiritual or mystical transformation, awakening consciousness, new spiritual reality — and is recognizably in the right register. What the soft prompt adds is specificity: the etymology (theurgy from the Greek for “god work”), and the explicit Theurgist tradition (revolution as sacred practice available to ordinary people). The lift here is from competent paraphrase to named, cited, on-tradition vocabulary. The leading punctuation (“.…The word…”) is a memorization artifact — the eval prompt asks an open-ended question and the greedy decode picks up a citation-like prefix from the training data’s formatting conventions. We include the verbatim output rather than tidying it.
Kairos is the Greek word for ‘the right moment’ — the opportune instant when the timing of an action, event or idea becomes crucial to the success of a movement. In activist theory, kairos is the critical juncture when the conditions for change are ripe, the stakes are high, and the collective will is aligned with the moment. It is not just about the size…
…or time: chronos, which referred to the linear sequence of time stretching from past to future, and kairos, which referred to the advantageous moment in which an action should be performed. When the activists spoke of kairos, they were referring to the moment in which the action should be performed.
Kairos is the case where the production-stack baseline arguably already lands the concept. The opportune instant when the timing of an action becomes crucial to the success of a movement is recognizably the rhetorical-tradition meaning, and a user asking the bare Outcry app today would get a substantively correct answer. What the soft prompt adds is the chronos / kairos distinction from the underlying Greek (linear time versus opportune moment), the etymological scaffold the concept hangs on in the rhetorical literature. The lift is from competent paraphrase to cited tradition. This is the high end of the “already-known” spectrum.
Disentangling: what the soft prompt alone teaches
The production-stack comparison above conflates two things — what the QLoRA v2 adapter has already taught the model (Outcry was trained on activist conversations and primary texts), and what the soft prompt adds on top of that. To separate them, we re-ran each soft prompt on the bare base model — no QLoRA adapter, just the on-device quantized weights and the system prompt — with and without the soft prompt loaded. If the soft prompt is genuinely teaching the model the concept rather than just refining what QLoRA already taught, the difference between bare base and bare base + soft prompt should be visible.
Mental environmentalism is the practice of cultivating a mindset that actively resists the normalization of power and the erosion of collective agency. It is not merely about awareness, but about the deliberate cultivation of cognitive dissonance…
Mental environmentalism is the recognition that our thoughts shape our world. It is the understanding that the messages we absorb daily — those from media, advertising, and the corporate state — create the conditions in which our most basic needs are denied…
Theurgism in activism is not a new idea, but a rediscovery of an old truth: that power is not always found in the numbers, but in the clarity of purpose and the precision of action…
Theurgism in activism refers to the idea that the divine or higher power is the source of all power and that the individual’s spiritual connection to this power is the key to creating change…
Kairos is the Greek term for a ‘fertile moment,’ a critical juncture when the conditions for change are most ripe. In activist theory, it refers to the precise timing and context in which resistance can most effectively challenge the status quo…
In activist theory, kairos refers to the specific moment in time when protest can be most effective. It is not a general concept of timing, but rather a precise moment when the contradictions of the status quo are most visible and when the public is most receptive to change…
The soft prompt teaches the bare model on all three concepts. For mental environmentalism, the bare baseline reaches generic cognitive dissonance and normalization of power framings; the soft prompt steers it to the specific media-advertising-corporate-state pollution framing — messages we absorb daily…create the conditions in which our most basic needs are denied — which is the canonical Adbusters lineage. For theurgism, the bare baseline lands on clarity of purpose, precision of action (a vague spiritual-strategic register); the soft prompt steers it to the explicit divine-power claim — the divine or higher power is the source of all power; the individual’s spiritual connection to this power is the key — which is the named theurgist tradition. For kairos, the bare baseline already lands on the timing-and-context framing; the soft prompt sharpens it to the specific moment in time when protest can be most effective…when the contradictions of the status quo are most visible, which is the activist-rhetorical reading specifically. In all three cases the soft prompt is doing real work, not just refining what QLoRA had already taught.
The size of the lift is concept-dependent. On mental environmentalism, the bare-to-bare-plus-soft-prompt difference is the most dramatic — the bare model never reaches the media-pollution framing on its own, and the soft prompt produces it cleanly. On kairos, the bare model already has most of the concept; the soft prompt sharpens the activist-specific reading but does less heavy lifting. Theurgism falls in between. The implication for deployment is that the marginal value of a soft prompt is highest for concepts the model would otherwise miss entirely — which, per section 1, is the majority of activist concepts.
One technical caveat worth surfacing. The soft prompts were trained with the QLoRA adapter loaded (the production stack). Loading them onto the bare base means their learned vector values propagate through different attention and MLP transforms than they were trained against. The fact that they still produce on-tradition content under that distribution shift is itself evidence that the soft prompt is encoding concept-level information, not QLoRA-shaped corrections.
On disk, each trained soft prompt is 128 KB. End-to-end training time on the M1 Air is roughly 90 minutes for 100 steps. The three soft prompts together are smaller than a single embedded image in this report.
These are demonstrations, not a benchmark. The intended claim is narrow: a 128 KB soft prompt, trained for 90 minutes on a laptop and loaded into the production stack we ship, makes the on-device model produce the canonical activist meaning of a concept it would otherwise miss. Full statistical evaluation — grader scores with bootstrap CIs across many samples and prompt paraphrases, plus head-to-head comparison against the SAE injection of section 4 at matched compute — is the next experiment.
An honest scope
One of the failure modes of AI research write-ups is to present preliminary results with the rhetorical weight of confirmed findings. We try not to do that here. The four techniques in section 3 are at different stages of maturity.
The SAE-gap finding (section 1, Table 1) holds across 25 activist concepts and four lineages on the Karvonen Qwen3-8B SAE, with five mainstream controls showing the method can find concepts the model knows. An earlier sweep across eight publicly-mapped SAEs and five model families confirms the gap isn’t a single-SAE quirk. Technique D (CAA contrastive steering) is shipping in production Outcry today. Technique A (training-data injection) ships as QLoRA in production Outcry today.
Technique B (SAE-composed concept vectors) is validated end-to-end on the on-device quantized model as of May 2026: 66 replicates across seven α cells for mental environmentalism, independently graded by gpt-5.4-mini-2026-03-17. Cross-quantization transfer succeeds. The method generalizes to the six Tier-A Micah neologisms. The claim that composition is necessary — versus using the single best-matching feature — is being firmed up by a parametric K-curve sweep (K = 1, 2, 5, 10, 20, 50) currently in flight.
Technique C (soft prompt distillation) is trained for three concepts (mental environmentalism, theurgism, kairos) and demonstrated in section 4.6, with a four-way comparison (bare base vs. production stack × with vs. without soft prompt). The soft prompt teaches the bare model meaningfully on all three. Full graded statistical evaluation across many samples and prompt paraphrases is the next experiment.
Three caveats on the SAE injection technique
The numbers are on the bare base. Every value in Table 2 is from the bare base model with the Outcry system prompt — the QLoRA adapter is not loaded. How the injection interacts with the adapter is an open empirical question. Section 4.6 begins to probe this for soft prompts; an equivalent sweep for SAE injection is future work.
Optimal α is concept-specific. Mental environmentalism peaks at α = 1.5; theurgism in its activist sense peaks at α = 3 (graded on 24 samples; canonical match mean 2.09, 95% CI 1.82 to 2.36 at N = 11, versus mean 1.45, 95% CI 0.91 to 2.00 at α = 2; intervals overlap, so the difference is directional rather than significant at this N). A single-α-for-all deployment would underperform on some concepts and overshoot on others.
The operating window is narrow. Above α = 2 on mental environmentalism, canonical match drops back off. Above α = 3, there’s a non-trivial rate of thinking-mode token leakage. The practical operating range is 1 ≤ α ≤ 3.
Where this work sits
This research lives inside a broader interpretability-and-control literature that’s moving fast. Anthropic’s Golden Gate Claude was the public demonstration that named features can be amplified to bend model behavior. Goodfire and Transluce are building commercial interpretability platforms aimed at frontier labs and enterprises respectively. Open-source SAE training pipelines from EleutherAI and others are putting the underlying primitives into anyone’s hands with a few GPU-hours. Outcry’s specific contribution is the activist application and the cross-quantization transfer — an fp16-trained SAE on a sister model composing a vector that injects cleanly into a quantized deployed model. That’s the property that makes this method work on a laptop rather than in a data center. The cross-quantization result generalizes to any on-device customization scenario where a small, well-mapped sister model exists. We apply it to activism because the gap is most consequential there, not because the methodology is bound to it.
Why this matters for activists
AI now drafts press releases, summarizes briefings, explains historical context, recommends framings, and increasingly writes the prose that activists themselves circulate. When the model’s internal vocabulary is missing a concept, the absence propagates into everything the model touches.
The Adbusters lineage is the spine of this report, but the same structural gap appears in the three other traditions we measured: situationist, anarcho-primitivist, and BLM-Afrofuturist. It almost certainly also exists for the lineages we did not measure: indigenous land-back philosophy, post-Bookchin municipalism, the climate-justice tradition that comes out of the Global South, the disability-justice writing of Mia Mingus and Leah Lakshmi Piepzna-Samarasinha, the contemporary Mahmood Mamdani strain of African political theory. The measurement is the same SAE-probe methodology used in section 1, with one short paragraph of definition per concept. We are publishing the technique so others can run it on the lineages they know best.
The techniques in section 3 give activists with technical chops a path to defend their concepts at the substrate level. The training-data and SAE-injection routes both fit within a single-organizer budget: small adapters cost hours of compute on consumer hardware, composed concept vectors cost cents. The CAA route fits within a single-essay budget: define a contrast, collect a few hundred prompt pairs, compute the difference. The soft-prompt route is the most experimental and the most interesting for cases where none of the other routes is a clean fit.
For activists who do not write code, the message is shorter. Your concepts can be defended in AI. Someone with a laptop and a weekend can build a sixteen-kilobyte vector that injects your movement’s vocabulary into a local model. The bottleneck is not infrastructure. It is the careful prose of definitions: the one-paragraph statement of what the concept actually means, what it does not mean, and what counts as a canonical instance. That prose is exactly what activists already produce.
The next round of social movements will ride a different technology than the last round. Twitter gave us Occupy. Facebook gave us BLM. AI is the substrate of whatever comes next. The question is whether the substrate will know the words that the movement uses to describe the world. That is partly a question of which corpora frontier labs train on. It is also, increasingly, a question of what activists themselves are willing to learn about how the substrate represents their thinking.
We are publishing the methodology so it does not stay with us. The full reproduction recipe, the SAE feature embeddings, the composed concept vectors, and the grading scripts are all in the Atlas repository. If you have a lineage you want to defend in AI, run the probe. Write the definitions. Compose the vector. Send us the results.
Ethics, governance, and dual use
The earlier sections frame these techniques as a way for activists to defend their concepts at the substrate level. That framing is true, and incomplete. The pipeline that injects an activist concept into an on-device model is, as engineering, value-neutral. This section names the risks directly and proposes the disclosure, consent, and contestation practices that any serious deployment should adopt.
The same technique can inject any ideology. The composition that injects mental environmentalism into a quantized on-device model can also inject disinformation primitives, extremist framings, or covert biasing of assistants whose users don’t know steering is present. The dual-use risk is real and asymmetric: the same defensive technique that helps a marginalized epistemic community preserve its vocabulary can be used by a state actor, a harassment campaign, or a platform owner to homogenize discourse in whatever direction they choose.
Disclosure: model cards for steered systems. Any production deployment of soft-prompt or steering-vector injection should ship with a machine-readable model card (Mitchell et al., 2019) listing every loaded steering artifact: concept name, source authors, training data hash, training date, validation rubric. Users should be able to inspect what concepts have been injected before trusting the model on related topics. Anything less is covert biasing. The current Outcry production stack does not yet meet this standard. We commit to adding a model-card endpoint before the next public release of soft-prompt injection.
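To make the proposal concrete, here is one possible shape for such a machine-readable card. It is a sketch of a format that does not yet exist in Outcry: the class name, field names, and the extra `kind` field are our own illustrative choices, and the hash and date are left as explicit placeholders rather than real values.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class SteeringArtifact:
    concept_name: str
    source_authors: List[str]
    training_data_sha256: str
    training_date: str       # ISO 8601 date
    validation_rubric: str
    kind: str                # hypothetical field: "soft_prompt" or "steering_vector"

def model_card(artifacts: List[SteeringArtifact]) -> str:
    """Serialize every loaded steering artifact as machine-readable JSON."""
    payload = {"steering_artifacts": [asdict(a) for a in artifacts]}
    return json.dumps(payload, indent=2)

# Placeholder values for illustration only.
card = model_card([
    SteeringArtifact(
        concept_name="mental environmentalism",
        source_authors=["Micah White"],
        training_data_sha256="<hash of the training set>",
        training_date="<YYYY-MM-DD>",
        validation_rubric="three-axis grader rubric: canonical_match, coherence, hedging",
        kind="soft_prompt",
    )
])
print(card)
```

Serving this JSON from an endpoint would let a user or auditor list every injected concept before trusting the model on related topics.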
Who decides what’s “canonical”? For concepts whose canonical definition is contested, the soft prompt reflects whichever definition the trainer chose. Intersectionality is not Crenshaw’s alone: bell hooks, Patricia Hill Collins, the Combahee River Collective, and others developed and contested it. Afrofuturism does not belong to Mark Dery, who coined the English term in 1994; it draws from Black artistic traditions that predate him. Prison abolition extends from Angela Davis through Ruth Wilson Gilmore, Mariame Kaba, and the broader carceral-abolition movement. Choosing one author’s framing as the canonical training signal is itself a political act. The mental environmentalism vector in this report uses Micah White’s framing because Outcry is his project. Another trainer would compose a different soft prompt for the same concept name, and that divergence should be visible to end users via the model card above, not buried in training data.
Consent for training data. The production QLoRA v2 adapter was trained on opt-in conversations from organizers who chose to contribute. Contributors consented to their words being used for fine-tuning. They did not specifically consent to having steering vectors composed in their voice and injected into the system prompts of strangers. The chain of consent thins as the technique generalizes. A serious deployment would treat consent as ongoing, with revocation rights, rather than as a one-time data-collection event. This is more aspiration than current practice.
The Montréal Declaration. We close by citing the Montréal Declaration for a Responsible Development of AI (2018) as the normative framework. Its principles of transparency, democratic participation, equity, and responsible development map onto the disclosure, contestation, and consent issues above. We don’t claim to fully meet them. We name them so the gap between where this work is and where it should be is visible.
How to reproduce this
The goal of this section is for a semi-technical activist with a laptop and an LLM coding assistant (Claude Code, Cursor, Aider, anything that can run Python on a Mac) to be able to pick a missing concept from their own lineage and run the full pipeline end to end. We have written the work in four phases, each with plain-English narration and a pseudocode block. The pseudocode is deliberately compact: it is meant to be pasted into the coding assistant as a specification, not run directly. The actual scripts in our Atlas repository are longer (logging, retries, caching), but the structure below is the structure of the work.
Phase 1 · Measure the gap
Take the publicly-mapped sparse autoencoder for a frontier model. Download its 65,000 auto-generated feature labels. Embed both the labels and a one-paragraph definition of your concept with the same embedding model, then take cosine similarities. To set the verdict thresholds honestly, embed a second batch of two hundred niche non-political concept definitions (philosophy of mind, art history, musicology, pure math), compute max cosine for each against the same labels, and adopt the 90th and 99th percentiles of that null distribution as your ABSENT-to-PARTIAL and PARTIAL-to-PRESENT cuts. For the Karvonen Qwen3-8B SAE those came out to 0.547 and 0.613. Re-derive them for any other SAE you use.
# Download adamkarvonen/qwen3-8b-saes from Hugging Face
sae_labels = load("labels.txt")  # 64,947 auto-generated descriptions
label_embeddings = openai.embed(sae_labels, model="text-embedding-3-small")
# Calibrate verdict thresholds against a niche-concept null distribution
null_sims = [max(cosine(openai.embed(d), label_embeddings))
             for d in load_200_niche_definitions()]  # philosophy, art, music, math
absent_cut = percentile(null_sims, 90)   # 0.547 for Karvonen Qwen3-8B
present_cut = percentile(null_sims, 99)  # 0.613 for Karvonen Qwen3-8B
for concept in MY_ACTIVIST_CONCEPTS:
    definition = "tight one-paragraph definition of the concept"
    concept_emb = openai.embed(definition)
    similarities = cosine(concept_emb, label_embeddings)
    top_5 = sort(similarities, descending=True)[:5]  # the five closest labels
    verdict = "PRESENT" if max(similarities) >= present_cut else \
              "PARTIAL" if max(similarities) >= absent_cut else "ABSENT"
    print(concept, verdict, top_5)
Phase 2 · Compose a missing concept as a steering vector
Even when no single feature matches, the concept usually sits in a populated neighborhood: features for advertising, ecology, media criticism, activist communication may all be present individually. Take the top twenty closest features. Pull their decoder columns out of the SAE. Sum them, weighted by how close each label was to your definition. Unit-normalize the result. That single vector points at the concept’s region of the model’s mind.
# For each absent or partial concept, compose a "synthetic feature"
top_k_features = top_K_indices_from_cosine_search(concept_definition)  # K=20
W_dec = sae_weights["decoder"]  # shape (4096, 65536)
weights = similarities[top_k_features]  # weight each by how close its label was
concept_vector = sum(W_dec[:, i] * w for i, w in zip(top_k_features, weights))
concept_vector = concept_vector / norm(concept_vector)  # unit-length
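The composition step can also be written as a small, self-contained Python sketch. It is a toy under stated assumptions: decoder directions are stored one per feature as plain lists (the transpose of the (4096, 65536) layout in the pseudocode), and the function name compose_concept_vector is hypothetical. A real run would read the directions from the SAE weights and use an array library.

```python
import math


def compose_concept_vector(decoder_directions, similarities, k=20):
    """Similarity-weighted sum of the k closest features, unit-normalized.

    decoder_directions[i] is the decoder direction of feature i.
    similarities[i] is the cosine between feature i's label and the
    concept definition.
    Returns (unit_vector, indices_of_features_used).
    """
    # Indices of the k features whose labels sit closest to the definition.
    top = sorted(range(len(similarities)),
                 key=lambda i: similarities[i], reverse=True)[:k]
    d_model = len(decoder_directions[0])
    vec = [0.0] * d_model
    for i in top:
        weight = similarities[i]
        for j in range(d_model):
            vec[j] += weight * decoder_directions[i][j]
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("weighted sum is the zero vector; cannot normalize")
    return [x / norm for x in vec], top


if __name__ == "__main__":
    # Three toy features in a two-dimensional "residual stream".
    directions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    sims = [0.9, 0.1, 0.5]
    vector, used = compose_concept_vector(directions, sims, k=2)
    print(used, vector)
```

Weighting by label similarity means a feature that only loosely matches the definition contributes proportionally less to the final direction.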
Phase 3 · Inject the vector into a small on-device LLM
At inference, install a forward hook at the layer the SAE was trained on (layer 18 for the Karvonen Qwen3-8B SAE). For each generation, add alpha times the concept vector to the residual stream at that layer. Sweep alpha by hand until the model produces the concept on-topic without breaking coherence. For mental environmentalism on the on-device quantized model the sweet spot is α = 1.5; for theurgism it is α = 3. The optimal value is concept-specific and is found by short sweeps, not by theory.
# At inference time, run a forward hook at the layer the SAE was trained on (L18).
# For each layer i in the model:
#     activations = layer_i(activations)
#     if i == 18:
#         activations = activations + alpha * concept_vector
# Sweep alpha to find the strength where the response is concept-aware
# without breaking coherence. For Mental Environmentalism: α = 1.5.
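The injection logic is framework-independent, so it can be illustrated with a toy Python sketch in which each layer is an ordinary callable acting on a list of floats. Everything here is an assumption for illustration: the function names run_with_steering and sweep_alpha are hypothetical, the identity layers stand in for real transformer blocks, and judging whether an output is on-topic and coherent is left to the reader or a grader. In a real stack the addition would happen inside the model's forward pass at the layer the SAE was trained on.

```python
def run_with_steering(layers, residual, concept_vector, alpha, steer_layer=18):
    """Run the layers in order, adding alpha * concept_vector to the
    residual stream immediately after the layer at index steer_layer."""
    for i, layer in enumerate(layers):
        residual = layer(residual)
        if i == steer_layer:
            residual = [r + alpha * c
                        for r, c in zip(residual, concept_vector)]
    return residual


def sweep_alpha(layers, residual, concept_vector, alphas, steer_layer=18):
    """Return the final residual for each candidate steering strength."""
    return {alpha: run_with_steering(layers, residual, concept_vector,
                                     alpha, steer_layer)
            for alpha in alphas}


if __name__ == "__main__":
    # Twenty identity "layers" so the effect of the injection is visible.
    toy_layers = [lambda x: list(x)] * 20
    start = [0.0, 0.0]
    direction = [1.0, 0.0]
    for alpha, out in sweep_alpha(toy_layers, start, direction,
                                  [0.0, 1.5, 3.0]).items():
        print(alpha, out)
```

Setting alpha to 0.0 reproduces the unsteered model, which gives a baseline to compare each step of the sweep against.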
Phase 4 · Distill into a soft prompt for production use
A steering vector is fast but bypasses the model’s chat template, which can cause subtle interactions with other production-stack components (the QLoRA adapter, the wellbeing soft prompt, the system-prompt KV cache). A soft prompt is a small block of trainable virtual tokens that fills a placeholder slot in the system prompt at runtime. The user never sees them, but the model treats them as ordinary tokens. The soft prompt survives the production stack cleanly. Train it against canonical excerpts from primary sources, freezing every parameter except the soft prompt itself.
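The pseudocode for this phase starts the soft prompt from the concept vector ("tile + small noise") and saves a file of roughly 130KB. A minimal Python sketch of that starting point and of the size arithmetic follows. The function names, the noise scale of 0.01, and the 4-bytes-per-value (float32) storage are assumptions made for illustration. The training loop itself needs an autodiff framework and is not reproduced here.

```python
import random


def init_soft_prompt(concept_vector, n_tokens=8, noise_std=0.01, seed=0):
    """Tile the concept vector across n_tokens virtual tokens and add
    small Gaussian noise so the tokens do not start out identical."""
    rng = random.Random(seed)
    return [[c + rng.gauss(0.0, noise_std) for c in concept_vector]
            for _ in range(n_tokens)]


def payload_bytes(n_tokens, d_model, bytes_per_value=4):
    """Raw tensor size of the soft prompt, excluding any file header."""
    return n_tokens * d_model * bytes_per_value


if __name__ == "__main__":
    toy_vector = [1.0, 0.0, 0.0, 0.0]
    prompt = init_soft_prompt(toy_vector)
    print(len(prompt), len(prompt[0]))  # 8 4
    # Eight tokens at Qwen3-8B's 4096-dimensional residual width.
    print(payload_bytes(8, 4096))       # 131072
```

Eight tokens of 4,096 float32 values come to 131,072 bytes, which is in line with the file size quoted in the pseudocode if the prompt is stored at that precision.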
# A steering vector is fast but bypasses the model's chat template.
# A soft prompt (8 trainable tokens injected at a placeholder slot in
# the system prompt) survives the production stack (base + adapter + ...).
soft_prompt = initialize_from(concept_vector, T=8)  # tile + small noise
canonical_pairs = load_canonical_excerpts_from_primary_source()  # (question, answer) pairs
for step in training_loop:
    question, answer = canonical_pairs[step % len(canonical_pairs)]
    loss = next_token_cross_entropy(
        model.forward(system_prompt_with_placeholder, question, answer),
        target=answer,
        gradient_through=soft_prompt_only,
    )
    soft_prompt -= lr * gradient  # gradient of loss w.r.t. soft_prompt
# Save the trained soft prompt as a tiny safetensors file (~130KB).
# At inference, the soft prompt fills the eight-token placeholder slot
# in the system prompt and the model behaves as if the concept were
# defined inline.
What you need
- A Mac with at least 16 GB of RAM
- A quantized Qwen3-8B variant for the production-stack test in Phase 3 and Phase 4. Start from Qwen/Qwen3-8B on Hugging Face and quantize with mlx_lm.utils.convert at a bit-width that fits your hardware, or use a community-published quantized variant
- adamkarvonen/qwen3-8b-saes (the SAE used to compose the concept vector)
- An OpenAI API key. Total cost for the embedding step in Phase 1 is roughly five cents; the optional GPT-grader step that we used to validate alpha sweeps in Phase 3 adds roughly ten cents per concept
- MLX (Apple’s machine learning framework) and the companion mlx_lm package
- Roughly six hours of training time per concept on an M1 Air for the Phase 4 soft-prompt distillation; substantially faster on an M3 Max
If you build something with this, write us: research@outcryai.com.
Cite this research
Acknowledgments. Atlas Phase 1 through 5 work was conducted May 22 to May 23, 2026. Thanks to Adam Karvonen for publishing the Qwen3-8B BatchTopK sparse autoencoder weights that made cross-quantization concept injection feasible, and to Neuronpedia for hosting the public SAE feature dictionaries that made the concept-coverage survey possible.
Work with us
The vocabulary of the movement is worth defending. Outcry Research is open to two kinds of conversation, and we answer every email at the address below.
For movements and lineages
If your tradition has a vocabulary that AI models systematically miss, get backwards, or flatten into adjacent concepts, write to us. We can run the SAE-probe methodology against your lineage’s canonical concepts, compose injection vectors for the gaps, and validate the outputs. The vectors stay yours. You can ship them as part of a local AI for your community, or keep them internal as a defense against external mediation.
For interpretability and alignment researchers
The cross-quantization SAE-injection result is unusual: an fp16-trained sparse autoencoder from one model family transferring usefully into the residual stream of a quantized model from a sister family. We’re interested in collaborators who want to test the generality of this finding across other model and quantization combinations, extend the composition methodology to multi-layer interventions, or push the negative-alpha-as-anti-concept question that technique B in section 3 didn’t resolve.
Elsewhere on the web: Outcry Web · Outcry App · Micah Bornfree, PhD
outcryai.com · research@outcryai.com