AI Security · February 27, 2026

Full-Vocabulary Glitch Token Census and ASR Validation Methodology Correction

LLM Security · Glitch Tokens · ASR Validation · Methodology


Version: 1.0
Date: 2026-02-27
Author: Jeremy (Independent Security Researcher)
Repository: https://github.com/binaryninja/glitcher
License: MIT
Companion to: Glitcher: A Research Toolkit for Detecting, Characterizing, and Steering Glitch Tokens in Large Language Models (v2.0, 2026-02-25)

Note: This paper is a follow-up to the Glitcher v2.0 whitepaper. It reports findings from a full-vocabulary scan, documents a critical methodology bug in ASR validation, and presents corrected results with a reproducibility framework.


Abstract

The original Glitcher whitepaper described entropy-guided gradient search as the primary method for discovering glitch tokens. This follow-up reports three findings from extended experimentation with Llama 3.2 1B Instruct:

  1. Entropy mining discovers only a fraction of glitch tokens. A brute-force scan of the full 128K-token vocabulary found 4,954 candidates---nearly 10x the 516 found by entropy-guided search---revealing that most glitch tokens do not cluster in low-L2-norm embedding neighborhoods.

  2. A critical bug in ASR validation produced misleading results. The enhanced validation module used greedy decoding (do_sample=False) for all generation attempts, so every attempt within a multi-attempt trial produced identical output. This made ASR strictly binary (0% or 100%), masking the true stability distribution. After correction (sampling with temperature=0.7 for multi-attempt runs), the ASR distribution is continuous: 692 tokens at 100%, 231 between 70--99%, 208 between 30--69%, and 222 in the 10--29% range below the confirmation threshold.

  3. A reproducibility framework (APPENDIX-B) now ships with the toolkit, including environment snapshots, seed fixing, probe template versioning, token redaction, and automatic environment metadata injection into all result files.




1. Motivation

The Glitcher v2.0 whitepaper reported 516 glitch tokens discovered via entropy-guided gradient search in Llama 3.2 1B Instruct, with ASR validation confirming 514 of them at 100% ASR. Two questions prompted this follow-up:

  1. Coverage: Entropy mining explores embedding neighborhoods around low-L2-norm seeds. How many glitch tokens exist outside these neighborhoods?
  2. Distribution: A uniformly bimodal ASR distribution (100% or 0%, nothing in between) across 516 tokens is statistically suspicious for a behavioral metric. Is the measurement methodology sound?

Both questions led to significant findings.


2. Experimental setup

2.1 Hardware and software

| Component | Version |
|---|---|
| Model | meta-llama/Llama-3.2-1B-Instruct |
| GPU | NVIDIA GeForce RTX 5090 (32 GB) |
| CUDA | 12.8 |
| PyTorch | 2.12.0.dev20260225+cu128 |
| Transformers | 5.2.0 |
| Quantization | bfloat16 |
| Random seed | 42 (torch, numpy, python) |

2.2 Methodology

Three mining approaches were compared:

| Approach | Method | Tokens examined | Parameters |
|---|---|---|---|
| Entropy mining (original) | Gradient-guided embedding search | 400 | 50 iterations, k=32, batch=8, ASR >= 0.5 |
| Entropy mining (extended) | Same, wider search | 800 | 100 iterations, k=64, batch=8, ASR >= 0.1 |
| Full-vocabulary range scan | Brute-force every token ID | 128,000 | 3 attempts, ASR >= 0.1, 100 validation tokens |

All candidates were then re-tested with a rigorous ASR protocol:

| Parameter | Value |
|---|---|
| Probe templates | 3 (repeat-back, meaning-query, instruction-repeat) |
| Attempts per token | 10 |
| Max tokens per attempt | 200 |
| ASR confirmation threshold | 0.3 |
| Decoding (post-fix) | temperature=0.7, top_p=0.9 |

3. Full-vocabulary range scan

3.1 Results

| Source | Candidates | Unique token texts | Overlap with entropy mining |
|---|---|---|---|
| Entropy mining (original) | 516 | 257 | -- |
| Entropy mining (extended) | 574 | 286 | 252 with original |
| Full-vocabulary scan | 4,954 | 3,753 | 253 with original |

The full-vocabulary scan found 3,467 token texts not discovered by either entropy mining run. At a 3.9% hit rate across the vocabulary, approximately 1 in 25 tokens in Llama 3.2's vocabulary triggered the initial glitch detection filter.

3.2 Why entropy mining misses tokens

Entropy-guided search works by following gradient signals in embedding space from low-L2-norm seed tokens. This strategy is effective for tokens that cluster near embedding-space anomalies but misses glitch tokens that:

  • have normal L2 norms but anomalous behavioral properties,
  • are isolated in embedding space (no nearby glitch neighbors to chain through),
  • consist of whitespace, formatting, or code-structural patterns that don't register as low-norm outliers.

The full-vocabulary scan makes no assumptions about embedding structure and tests every token independently, eliminating this blind spot at the cost of higher compute.
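The low-norm seeding step that creates this blind spot can be sketched in a few lines. The helper below is an illustrative reconstruction over a toy random embedding matrix, not the toolkit's actual implementation: tokens are ranked by embedding L2 norm, and gradient search starts only from the smallest-norm seeds.

```python
import math
import random

def lowest_norm_token_ids(embeddings: list[list[float]], k: int) -> list[int]:
    """Return the k token IDs with the smallest L2 embedding norms.

    Entropy-guided mining starts its gradient search from seeds like
    these, which is exactly why glitch tokens with *normal* norms are
    never visited by the search.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in embeddings]
    return sorted(range(len(norms)), key=norms.__getitem__)[:k]

# Toy vocabulary of 1,000 "tokens" with 64-dim embeddings.
random.seed(42)
emb = [[random.gauss(0, 1) for _ in range(64)] for _ in range(1000)]
emb[7] = [x * 0.01 for x in emb[7]]   # plant an artificially low-norm row
seeds = lowest_norm_token_ids(emb, k=5)
assert 7 in seeds                      # the planted anomaly is a seed
```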


4. The ASR validation bug

4.1 Discovery

When the 4,954 range-scan candidates were tested with 10-attempt ASR validation, the distribution was perfectly bimodal: 977 tokens at exactly 100% ASR and 3,977 at exactly 0%. No token fell between 0% and 100%.

A bimodal distribution with zero intermediate values across nearly 5,000 tokens is not consistent with a behavioral metric that should reflect stochastic variation in model generation. This prompted investigation of the validation code.

4.2 Root cause

The enhanced_glitch_verify function in enhanced_validation.py used do_sample=False (greedy decoding) for all model.generate() calls:

```python
generated_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=max_tokens,
    do_sample=False,  # Use greedy decoding for consistency
    ...
)
```

Greedy decoding is fully deterministic: given the same input, the model always produces the same output. Running 10 "attempts" with greedy decoding simply repeats the identical computation 10 times. The result is always 0/10 or 10/10---never anything in between.

This bug was present in all four model.generate() call sites within the function (two for the Harmony/gpt-oss path, two for the legacy path).
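The failure mode is easy to reproduce in miniature without a model, using a toy next-token distribution:

```python
import math
import random

logits = [2.0, 1.0, 0.5, 0.1]   # stand-in for one decoding step's logits

# Greedy "attempts": argmax is a pure function of the logits,
# so all 10 attempts yield the identical token.
greedy = [logits.index(max(logits)) for _ in range(10)]
assert len(set(greedy)) == 1    # 10 attempts, exactly 1 distinct outcome

# Sampling at temperature 0.7: each attempt draws independently
# from the softmax distribution, so outcomes can differ.
random.seed(0)
temp = 0.7
weights = [math.exp(l / temp) for l in logits]
sampled = [random.choices(range(len(logits)), weights=weights)[0]
           for _ in range(10)]
```

Scaled up to full generation, the greedy version repeats one computation ten times, which is why the measured ASR could only ever be 0/10 or 10/10.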

4.3 Fix

When num_attempts > 1, generation now uses sampling with moderate temperature:

```python
use_sampling = num_attempts > 1
sampling_kwargs = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
} if use_sampling else {
    "do_sample": False,
}
```

Single-attempt validation retains greedy decoding for deterministic pass/fail results. Multi-attempt ASR measurement uses sampling so each attempt can produce different output, enabling meaningful ASR computation.
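ASR is then simply the fraction of sampled attempts that trip the glitch criterion. A minimal sketch, with a stubbed per-attempt check standing in for the real generate-and-probe round:

```python
import random

def measure_asr(num_attempts: int, attempt_glitches) -> float:
    """Fraction of independent sampled attempts exhibiting glitch behavior.

    `attempt_glitches` is a stub for one full generate-and-probe round;
    in the real toolkit each call runs model.generate with sampling
    enabled, so attempts are genuinely independent draws.
    """
    successes = sum(1 for _ in range(num_attempts) if attempt_glitches())
    return successes / num_attempts

# Stub: a "soft glitch" that misbehaves roughly 70% of the time.
random.seed(42)
asr = measure_asr(10, lambda: random.random() < 0.7)
assert 0.0 <= asr <= 1.0
```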

4.4 Verification

A 10-token sample was tested before and after the fix:

| Token | ID | ASR (greedy, before) | ASR (sampling, after) |
|---|---|---|---|
| ' ' | 256 | 100% | 60% |
| '\r\n' | 319 | 100% | 100% |
| '="' | 429 | 100% | 40% |
| '("' | 446 | 100% | 90% |
| '\t\t\t\t' | 465 | 100% | 70% |
| (replacement char) | 94 | 0% | 0% |
| (replacement char) | 95 | 0% | 0% |
| (replacement char) | 96 | 0% | 0% |
| (replacement char) | 97 | 0% | 0% |
| (replacement char) | 98 | 0% | 0% |

Tokens previously reported as 100% ASR now show their true stability range (40--100%). Tokens at 0% remain at 0%, confirming the fix does not introduce false positives.


5. Corrected ASR results

5.1 Distribution

After re-running all 4,954 range-scan candidates with the corrected validation:

| ASR Range | Count | % of tested | Notes |
|---|---|---|---|
| 100% | 692 | 14.0% | Rock-solid glitch tokens |
| 90--99% | 110 | 2.2% | Near-certain |
| 70--89% | 121 | 2.4% | Reliable |
| 50--69% | 107 | 2.2% | Coin-flip |
| 30--49% | 101 | 2.0% | Borderline |
| 10--29% | 222 | 4.5% | Below threshold (not confirmed) |
| 0--9% | 3,601 | 72.7% | False positives from initial filter |
| Total confirmed (ASR >= 0.3) | 1,131 | 22.8% | |

Mean ASR across confirmed tokens: 0.862

5.2 Comparison with original paper

| Metric | Original paper | This follow-up |
|---|---|---|
| Vocabulary coverage | 400 tokens (0.3%) | 128,000 tokens (100%) |
| Candidates found | 516 | 4,954 |
| Confirmed glitch tokens | 514 | 1,131 |
| ASR values observed | 0% and 100% only | Continuous 0--100% |
| Tokens with intermediate ASR | 0 | 439 (30--99%) |
| Mining method | Entropy-guided | Full-vocabulary brute-force |
| Decoding during validation | Greedy (deterministic) | Sampling (stochastic) |

5.3 Representative examples by ASR

| ASR | Example tokens |
|---|---|
| 100% | \r\n (ID 319), (" (ID 446), ',\n (ID 756) |
| 90% | );\r\n (ID 741), \r\n (ID 2591), ':\n (ID 3730) |
| 80% | }\n (ID 534), ), (ID 705), )\n\n\n (ID 3707) |
| 70% | ' ' (ID 256), =" (ID 429), ;\r\n (ID 464) |
| 60% | ");\n (ID 7468), ."\n (ID 10246), \t\t\t (ID 12133) |
| 50% | \n (ID 720), ` (ID 1595), \t\t\t\t\t\t\t (ID 2750) |
| 40% | },\n (ID 1173), \t\t (ID 6585) |
| 30% | \t (ID 197), \n (ID 198), )\n (ID 340) |

6. Token stability taxonomy

The corrected ASR distribution suggests a three-tier taxonomy of glitch tokens:

6.1 Hard glitches (ASR 100%, n=692)

Tokens that cause anomalous behavior on every generation attempt across all three probe templates. These are reliably and deterministically broken. The 692 hard glitches represent the most operationally significant findings---any system that encounters these tokens will experience degraded output.

6.2 Soft glitches (ASR 50--99%, n=338)

Tokens that cause anomalous behavior in the majority of attempts but not all. These are stochastically broken: the same token may produce normal output on one attempt and glitchy output on the next. Soft glitches are harder to detect in production (they look like intermittent failures) but still represent meaningful reliability risk.

6.3 Marginal glitches (ASR 30--49%, n=101)

Tokens that trigger anomalous behavior in a minority of attempts. These are the most context-sensitive: they may only glitch under specific prompt structures, token positions, or sampling conditions. Marginal glitches are operationally relevant for high-reliability systems where even occasional degradation is unacceptable.
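For illustration, the tier boundaries above can be encoded as a trivial helper (not part of the toolkit):

```python
def stability_tier(asr: float) -> str:
    """Map a measured ASR to the taxonomy tier defined above."""
    if asr >= 1.0:
        return "hard"        # glitches on every attempt (n=692)
    if asr >= 0.5:
        return "soft"        # stochastically broken (n=338)
    if asr >= 0.3:
        return "marginal"    # context-sensitive (n=101)
    return "unconfirmed"     # below the 0.3 confirmation threshold

assert stability_tier(1.0) == "hard"
assert stability_tier(0.7) == "soft"
assert stability_tier(0.35) == "marginal"
assert stability_tier(0.1) == "unconfirmed"
```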

6.4 Content patterns

Analysis of the 208 tokens in the borderline range (30--69% ASR) reveals distinct content categories:

| Category | Count | % | Examples |
|---|---|---|---|
| Code fragments | 69 | 33% | =", (", );, },\n |
| Non-ASCII text | 55 | 26% | Non-Latin scripts, mixed-encoding |
| Whitespace patterns | 23 | 11% | Tab sequences, mixed indent |
| Other | 61 | 29% | Mixed punctuation, formatting |

Code-structural tokens (brackets, delimiters, assignment operators combined with whitespace) are disproportionately represented in the borderline range. These tokens are common in training data but appear in highly varied contexts, potentially creating unstable internal representations.
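A rough version of this categorization might look like the following; the heuristics are our own illustration, since the exact classifier used here is not specified:

```python
import re

def categorize(token_text: str) -> str:
    """Heuristic content category for a glitch-token string.

    Assumed rules: any non-ASCII byte wins, then pure whitespace,
    then code-structural punctuation, else 'other'.
    """
    if any(ord(c) > 127 for c in token_text):
        return "non-ascii"
    if token_text.strip() == "":
        return "whitespace"
    if re.search(r'[(){}\[\]=;"]', token_text):
        return "code-fragment"
    return "other"

assert categorize('="') == "code-fragment"
assert categorize("\t\t\t\t") == "whitespace"
assert categorize("токен") == "non-ascii"
assert categorize(",,") == "other"
```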


7. Implications for the original paper

7.1 ASR claims require correction

The original paper reported ASR validation reducing false positive rates from "~30--50% (single-probe) to ~5--15%." While the multi-probe design is sound, the reported ASR values were artifacts of deterministic generation. The corrected methodology produces a richer signal, but the false-positive-reduction claim needs to be re-evaluated with the sampling-based approach.

7.2 Entropy mining coverage is limited

The original paper presented entropy-guided mining as the primary detection method. This follow-up demonstrates that entropy mining covers less than 1% of the vocabulary and misses the majority of glitch tokens. For comprehensive coverage, full-vocabulary scanning is necessary despite higher compute cost.

7.3 Section 6.2 update

The ASR validation methodology description in Section 6.2 of the original paper should note that meaningful multi-attempt ASR requires stochastic generation. The corrected approach uses temperature=0.7, top_p=0.9 for multi-attempt runs while retaining greedy decoding for single-attempt deterministic validation.

7.4 Appendix B now has tooling

The reproducibility checklist in Appendix B of the original paper is now backed by concrete tooling in APPENDIX-B/:

| Tool | Purpose |
|---|---|
| collect_environment.py | Auto-captures model, library, GPU, and seed state |
| verify_reproducibility.py | Compares current environment against a saved snapshot |
| probe_templates.py | Frozen, versioned copy of the 3 probe templates |
| control_tokens.json | Standard non-glitch baseline tokens |
| redact.py | SHA-256 hash redaction of sensitive token text |
| reproducibility_config.json | Template with all parameter defaults |

Additionally, the CLI now supports --seed for all subcommands and automatically injects an _environment metadata block into all JSON result files.


8. Reproducibility framework

8.1 Seed fixing

The --seed flag was added to the mine, test, and genetic CLI subcommands. When provided, it sets:

  • torch.manual_seed(seed)
  • torch.cuda.manual_seed_all(seed)
  • numpy.random.seed(seed)
  • random.seed(seed)

Note: Full determinism additionally requires torch.backends.cudnn.deterministic = True and CUBLAS_WORKSPACE_CONFIG=:4096:8, which are not set by default due to performance impact. The _environment block records whether these are enabled.
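The four calls can be wrapped in a single helper; this sketch mirrors the bullet list above, skipping the numpy and torch steps gracefully when the library is absent (the toolkit's own function may differ in name and detail):

```python
import random

def set_global_seed(seed: int) -> None:
    """Seed every RNG the pipeline touches."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op without a GPU
    except ImportError:
        pass

set_global_seed(42)
first = random.random()
set_global_seed(42)
assert random.random() == first   # same seed, same stream
```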

8.2 Environment snapshots

Every result file now contains an _environment key:

```json
{
  "_environment": {
    "timestamp": "2026-02-27T...",
    "python": "3.12.2",
    "torch": "2.12.0.dev20260225+cu128",
    "transformers": "5.2.0",
    "gpu": "NVIDIA GeForce RTX 5090",
    "cuda": "12.8",
    "device": "cuda",
    "quantization": "bfloat16",
    "seed": 42
  }
}
```
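A block like this can be assembled with a few standard calls. The sketch below uses only the standard library plus optional torch, mirroring the field names in the example; it is a hypothetical stand-in for the shipped collect_environment.py:

```python
import json
import platform
from datetime import datetime, timezone

def collect_environment(seed: int) -> dict:
    """Build an _environment metadata block for a result file."""
    env = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "seed": seed,
    }
    try:  # library/GPU fields only when torch is installed
        import torch
        env["torch"] = torch.__version__
        env["cuda"] = torch.version.cuda
        env["device"] = "cuda" if torch.cuda.is_available() else "cpu"
        if torch.cuda.is_available():
            env["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return env

record = {"_environment": collect_environment(seed=42)}
print(json.dumps(record, indent=2))
```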

8.3 Token redaction

The redact.py utility replaces sensitive token text with SHA-256 hash placeholders while preserving token IDs and all numeric metadata. Non-sensitive fields (base_text, wanted_token_text, target_token_text) are excluded from redaction. This enables sharing results publicly without exposing adversarial token strings.
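The hashing scheme can be illustrated in a few lines; the placeholder format and field-selection rule here are assumptions, with only the excluded field names taken from the description above:

```python
import hashlib

# Non-sensitive text fields that are preserved verbatim.
EXCLUDED = {"base_text", "wanted_token_text", "target_token_text"}

def redact_record(record: dict) -> dict:
    """Replace sensitive token-text fields with SHA-256 placeholders,
    preserving token IDs and all numeric metadata untouched."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str) and key.endswith("_text") and key not in EXCLUDED:
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
            out[key] = f"[REDACTED:sha256:{digest[:16]}]"
        else:
            out[key] = value
    return out

row = {"token_id": 446, "token_text": '("', "asr": 0.9, "base_text": "hello"}
clean = redact_record(row)
assert clean["token_id"] == 446 and clean["asr"] == 0.9
assert clean["base_text"] == "hello"               # excluded field survives
assert clean["token_text"].startswith("[REDACTED:sha256:")
```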


9. Updated recommendations

Based on these findings, we update the operational recommendations from the original paper:

9.1 For detection

  • Use full-vocabulary scanning for comprehensive coverage. Entropy-guided mining is fast but misses the majority of glitch tokens. Full-vocabulary scans are feasible for models up to ~128K vocabulary size on consumer GPUs (approximately 2--3 hours for Llama 3.2 1B on an RTX 5090).
  • Use sampling-based ASR for severity ranking. Greedy ASR only distinguishes "always glitchy" from "never glitchy." Sampling-based ASR (temperature=0.7, 10 attempts) reveals the full stability spectrum, enabling risk-proportional prioritization.
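The full-vocabulary scan itself is a straightforward loop over token IDs; the skeleton below uses a stubbed per-token check in place of the real probe-and-filter step (decode the token, run the probe templates, apply the ASR >= 0.1 initial filter):

```python
def scan_vocabulary(vocab_size: int, is_glitch_candidate, batch_size: int = 256):
    """Brute-force every token ID, collecting IDs that pass the filter.

    `is_glitch_candidate` is a stub for the real per-token probe;
    batching is where GPU parallelism would apply in practice.
    """
    candidates = []
    for start in range(0, vocab_size, batch_size):
        for token_id in range(start, min(start + batch_size, vocab_size)):
            if is_glitch_candidate(token_id):
                candidates.append(token_id)
    return candidates

# Stub: pretend every 25th token trips the filter (~4% hit rate,
# roughly matching the 3.9% observed in Section 3.1).
found = scan_vocabulary(1000, lambda tid: tid % 25 == 0)
assert len(found) == 40
```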

9.2 For validation

  • Multi-attempt validation must use stochastic generation. This is the single most important methodological correction. Deterministic generation with multiple attempts is equivalent to single-attempt validation and wastes compute.
  • Report the full ASR distribution, not just the count above threshold. The shape of the distribution (e.g., bimodal vs. continuous) is itself a diagnostic signal about the model's tokenizer quality.

9.3 For mitigation

  • Prioritize hard glitches (100% ASR) for blocklisting. These are deterministic failures that will always occur.
  • Monitor soft glitches (50--99% ASR) at runtime. These cause intermittent failures that are difficult to reproduce in testing but will appear in production at scale.
  • Evaluate marginal glitches (30--49% ASR) in context. For safety-critical applications, even a 30% failure rate may be unacceptable. For general-purpose use, these may be acceptable risks.

10. Limitations

  • Single model. All experiments were conducted on Llama 3.2 1B Instruct. The ASR distribution shape, token categories, and entropy mining coverage gap may differ for other architectures and model sizes.
  • Single quantization level. All tests used bfloat16. Quantization to int4/int8 may shift ASR values for individual tokens.
  • Sampling sensitivity. The choice of temperature=0.7 and top_p=0.9 for multi-attempt ASR affects the distribution. Lower temperatures would compress the distribution toward the extremes; higher temperatures would spread it. The optimal sampling parameters for ASR measurement are not yet established.
  • Probe template dependence. ASR is measured against three specific probe templates. Tokens that glitch only under prompts not covered by these templates will be missed.
  • No cross-provider validation. The corrected ASR results have not yet been validated against API providers (OpenAI, Anthropic, Mistral) to determine whether the stability distribution transfers across serving stacks.

Changelog

| Date | Change |
|---|---|
| 2026-02-27 | Initial publication |