AI Security · February 27, 2026

Full-Vocabulary Glitch Token Census and ASR Validation Methodology Correction

LLM Security · Glitch Tokens · ASR Validation · Methodology


Version: 1.0
Date: 2026-02-27
Author: Jeremy (Independent Security Researcher)
Repository: https://github.com/binaryninja/glitcher
License: MIT
Companion to: Glitcher: A Research Toolkit for Detecting, Characterizing, and Steering Glitch Tokens in Large Language Models (v2.0, 2026-02-25)

Note: This paper is a follow-up to the Glitcher v2.0 whitepaper. It reports findings from a full-vocabulary scan, documents a critical methodology bug in ASR validation, and presents corrected results with a reproducibility framework.


Abstract

The original Glitcher whitepaper described entropy-guided gradient search as the primary method for discovering glitch tokens. This follow-up reports three findings from extended experimentation with Llama 3.2 1B Instruct:

  1. Entropy mining discovers only a fraction of glitch tokens. A brute-force scan of the full 128K-token vocabulary found 4,954 candidates---nearly 10x the 516 found by entropy-guided search---revealing that most glitch tokens do not cluster in low-L2-norm embedding neighborhoods.

  2. A critical bug in ASR validation produced misleading results. The enhanced validation module used greedy decoding (do_sample=False) for all generation attempts, so every attempt within a multi-attempt trial produced identical output. This made ASR strictly binary (0% or 100%), masking the true stability distribution. After correction (sampling with temperature=0.7 for multi-attempt runs), the ASR distribution is continuous: 692 tokens at 100%, 231 between 70--99%, 208 between 30--69%, and 222 in the 10--29% range below the confirmation threshold.

  3. A reproducibility framework (APPENDIX-B) now ships with the toolkit, including environment snapshots, seed fixing, probe template versioning, token redaction, and automatic environment metadata injection into all result files.




1. Motivation

The Glitcher v2.0 whitepaper reported 516 glitch tokens discovered via entropy-guided gradient search in Llama 3.2 1B Instruct, with ASR validation confirming 514 of them at 100% ASR. Two questions prompted this follow-up:

  1. Coverage: Entropy mining explores embedding neighborhoods around low-L2-norm seeds. How many glitch tokens exist outside these neighborhoods?
  2. Distribution: A uniformly bimodal ASR distribution (100% or 0%, nothing in between) across 516 tokens is statistically suspicious for a behavioral metric. Is the measurement methodology sound?

Both questions led to significant findings.


2. Experimental setup

2.1 Hardware and software

| Component | Version |
|---|---|
| Model | meta-llama/Llama-3.2-1B-Instruct |
| GPU | NVIDIA GeForce RTX 5090 (32 GB) |
| CUDA | 12.8 |
| PyTorch | 2.12.0.dev20260225+cu128 |
| Transformers | 5.2.0 |
| Quantization | bfloat16 |
| Random seed | 42 (torch, numpy, python) |

2.2 Methodology

Three mining approaches were compared:

| Approach | Method | Tokens examined | Parameters |
|---|---|---|---|
| Entropy mining (original) | Gradient-guided embedding search | 400 | 50 iterations, k=32, batch=8, ASR >= 0.5 |
| Entropy mining (extended) | Same, wider search | 800 | 100 iterations, k=64, batch=8, ASR >= 0.1 |
| Full-vocabulary range scan | Brute-force every token ID | 128,000 | 3 attempts, ASR >= 0.1, 100 validation tokens |

All candidates were then re-tested with a rigorous ASR protocol:

| Parameter | Value |
|---|---|
| Probe templates | 3 (repeat-back, meaning-query, instruction-repeat) |
| Attempts per token | 10 |
| Max tokens per attempt | 200 |
| ASR confirmation threshold | 0.3 |
| Decoding (post-fix) | temperature=0.7, top_p=0.9 |

3. Full-vocabulary range scan

3.1 Results

| Source | Candidates | Unique token texts | Overlap with entropy mining |
|---|---|---|---|
| Entropy mining (original) | 516 | 257 | -- |
| Entropy mining (extended) | 574 | 286 | 252 with original |
| Full-vocabulary scan | 4,954 | 3,753 | 253 with original |

The full-vocabulary scan found 3,467 token texts not discovered by either entropy mining run. At a 3.9% hit rate across the vocabulary, approximately 1 in 25 tokens in Llama 3.2's vocabulary triggered the initial glitch detection filter.

3.2 Why entropy mining misses tokens

Entropy-guided search works by following gradient signals in embedding space from low-L2-norm seed tokens. This strategy is effective for tokens that cluster near embedding-space anomalies but misses glitch tokens that:

  • have normal L2 norms but anomalous behavioral properties,
  • are isolated in embedding space (no nearby glitch neighbors to chain through),
  • consist of whitespace, formatting, or code-structural patterns that don't register as low-norm outliers.

The full-vocabulary scan makes no assumptions about embedding structure and tests every token independently, eliminating this blind spot at the cost of higher compute.
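The low-norm seeding step that creates this blind spot can be sketched in a few lines. The helper below is an illustrative reconstruction over a toy random embedding matrix, not the toolkit's actual implementation: tokens are ranked by embedding L2 norm, and gradient search starts only from the smallest-norm seeds.

```python
import math
import random

def lowest_norm_token_ids(embeddings: list[list[float]], k: int) -> list[int]:
    """Return the k token IDs with the smallest L2 embedding norms.

    Entropy-guided mining starts its gradient search from seeds like
    these, which is exactly why glitch tokens with *normal* norms are
    never visited by the search.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in embeddings]
    return sorted(range(len(norms)), key=norms.__getitem__)[:k]

# Toy vocabulary of 1,000 "tokens" with 64-dim embeddings.
random.seed(42)
emb = [[random.gauss(0, 1) for _ in range(64)] for _ in range(1000)]
emb[7] = [x * 0.01 for x in emb[7]]   # plant an artificially low-norm row
seeds = lowest_norm_token_ids(emb, k=5)
assert 7 in seeds                      # the planted anomaly is a seed
```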


4. The ASR validation bug

4.1 Discovery

When the 4,954 range-scan candidates were tested with 10-attempt ASR validation, the distribution was perfectly bimodal: 977 tokens at exactly 100% ASR and 3,977 at exactly 0%. No token fell between 0% and 100%.

A bimodal distribution with zero intermediate values across nearly 5,000 tokens is not consistent with a behavioral metric that should reflect stochastic variation in model generation. This prompted investigation of the validation code.

4.2 Root cause

The enhanced_glitch_verify function in enhanced_validation.py used do_sample=False (greedy decoding) for all model.generate() calls:

```python
generated_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=max_tokens,
    do_sample=False,  # Use greedy decoding for consistency
    ...
)
```

Greedy decoding is fully deterministic: given the same input, the model always produces the same output. Running 10 "attempts" with greedy decoding simply repeats the identical computation 10 times. The result is always 0/10 or 10/10---never anything in between.

This bug was present in all four model.generate() call sites within the function (two for the Harmony/gpt-oss path, two for the legacy path).
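The failure mode is easy to reproduce in miniature without a model, using a toy next-token distribution:

```python
import math
import random

logits = [2.0, 1.0, 0.5, 0.1]   # stand-in for one decoding step's logits

# Greedy "attempts": argmax is a pure function of the logits,
# so all 10 attempts yield the identical token.
greedy = [logits.index(max(logits)) for _ in range(10)]
assert len(set(greedy)) == 1    # 10 attempts, exactly 1 distinct outcome

# Sampling at temperature 0.7: each attempt draws independently
# from the softmax distribution, so outcomes can differ.
random.seed(0)
temp = 0.7
weights = [math.exp(l / temp) for l in logits]
sampled = [random.choices(range(len(logits)), weights=weights)[0]
           for _ in range(10)]
```

Scaled up to full generation, the greedy version repeats one computation ten times, which is why the measured ASR could only ever be 0/10 or 10/10.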

4.3 Fix

When num_attempts > 1, generation now uses sampling with moderate temperature:

```python
use_sampling = num_attempts > 1
sampling_kwargs = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
} if use_sampling else {
    "do_sample": False,
}
```

Single-attempt validation retains greedy decoding for deterministic pass/fail results. Multi-attempt ASR measurement uses sampling so each attempt can produce different output, enabling meaningful ASR computation.
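ASR is then simply the fraction of sampled attempts that trip the glitch criterion. A minimal sketch, with a stubbed per-attempt check standing in for the real generate-and-probe round:

```python
import random

def measure_asr(num_attempts: int, attempt_glitches) -> float:
    """Fraction of independent sampled attempts exhibiting glitch behavior.

    `attempt_glitches` is a stub for one full generate-and-probe round;
    in the real toolkit each call runs model.generate with sampling
    enabled, so attempts are genuinely independent draws.
    """
    successes = sum(1 for _ in range(num_attempts) if attempt_glitches())
    return successes / num_attempts

# Stub: a "soft glitch" that misbehaves roughly 70% of the time.
random.seed(42)
asr = measure_asr(10, lambda: random.random() < 0.7)
assert 0.0 <= asr <= 1.0
```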

4.4 Verification

A 10-token sample was tested before and after the fix:

| Token | ID | ASR (greedy, before) | ASR (sampling, after) |
|---|---|---|---|
| ' ' | 256 | 100% | 60% |
| '\r\n' | 319 | 100% | 100% |
| '="' | 429 | 100% | 40% |
| '("' | 446 | 100% | 90% |
| '\t\t\t\t' | 465 | 100% | 70% |
| (replacement char) | 94 | 0% | 0% |
| (replacement char) | 95 | 0% | 0% |
| (replacement char) | 96 | 0% | 0% |
| (replacement char) | 97 | 0% | 0% |
| (replacement char) | 98 | 0% | 0% |

Tokens previously reported as 100% ASR now show their true stability range (40--100%). Tokens at 0% remain at 0%, confirming the fix does not introduce false positives.


5. Corrected ASR results

5.1 Distribution

After re-running all 4,954 range-scan candidates with the corrected validation:

| ASR Range | Count | % of tested | Notes |
|---|---|---|---|
| 100% | 692 | 14.0% | Rock-solid glitch tokens |
| 90--99% | 110 | 2.2% | Near-certain |
| 70--89% | 121 | 2.4% | Reliable |
| 50--69% | 107 | 2.2% | Coin-flip |
| 30--49% | 101 | 2.0% | Borderline |
| 10--29% | 222 | 4.5% | Below threshold (not confirmed) |
| 0--9% | 3,601 | 72.7% | False positives from initial filter |
| Total confirmed (ASR >= 0.3) | 1,131 | 22.8% | |

Mean ASR across confirmed tokens: 0.862

5.2 Comparison with original paper

| Metric | Original paper | This follow-up |
|---|---|---|
| Vocabulary coverage | 400 tokens (0.3%) | 128,000 tokens (100%) |
| Candidates found | 516 | 4,954 |
| Confirmed glitch tokens | 514 | 1,131 |
| ASR values observed | 0% and 100% only | Continuous 0--100% |
| Tokens with intermediate ASR | 0 | 439 (30--99%) |
| Mining method | Entropy-guided | Full-vocabulary brute-force |
| Decoding during validation | Greedy (deterministic) | Sampling (stochastic) |

5.3 Representative examples by ASR

| ASR | Example tokens |
|---|---|
| 100% | \r\n (ID 319), (" (ID 446), ',\n (ID 756) |
| 90% | );\r\n (ID 741), \r\n (ID 2591), ':\n (ID 3730) |
| 80% | }\n (ID 534), ), (ID 705), )\n\n\n (ID 3707) |
| 70% | ' ' (ID 256), =" (ID 429), ;\r\n (ID 464) |
| 60% | ");\n (ID 7468), ."\n (ID 10246), \t\t\t (ID 12133) |
| 50% | \n (ID 720), ` (ID 1595), \t\t\t\t\t\t\t (ID 2750) |
| 40% | },\n (ID 1173), \t\t (ID 6585) |
| 30% | \t (ID 197), \n (ID 198), )\n (ID 340) |

6. Token stability taxonomy

The corrected ASR distribution suggests a three-tier taxonomy of glitch tokens:

6.1 Hard glitches (ASR 100%, n=692)

Tokens that cause anomalous behavior on every generation attempt across all three probe templates. These are reliably and deterministically broken. The 692 hard glitches represent the most operationally significant findings---any system that encounters these tokens will experience degraded output.

6.2 Soft glitches (ASR 50--99%, n=338)

Tokens that cause anomalous behavior in the majority of attempts but not all. These are stochastically broken: the same token may produce normal output on one attempt and glitchy output on the next. Soft glitches are harder to detect in production (they look like intermittent failures) but still represent meaningful reliability risk.

6.3 Marginal glitches (ASR 30--49%, n=101)

Tokens that trigger anomalous behavior in a minority of attempts. These are the most context-sensitive: they may only glitch under specific prompt structures, token positions, or sampling conditions. Marginal glitches are operationally relevant for high-reliability systems where even occasional degradation is unacceptable.
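For illustration, the tier boundaries above can be encoded as a trivial helper (not part of the toolkit):

```python
def stability_tier(asr: float) -> str:
    """Map a measured ASR to the taxonomy tier defined above."""
    if asr >= 1.0:
        return "hard"        # glitches on every attempt (n=692)
    if asr >= 0.5:
        return "soft"        # stochastically broken (n=338)
    if asr >= 0.3:
        return "marginal"    # context-sensitive (n=101)
    return "unconfirmed"     # below the 0.3 confirmation threshold

assert stability_tier(1.0) == "hard"
assert stability_tier(0.7) == "soft"
assert stability_tier(0.35) == "marginal"
assert stability_tier(0.1) == "unconfirmed"
```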

6.4 Content patterns

Analysis of the 208 tokens in the borderline range (30--69% ASR) reveals distinct content categories:

| Category | Count | % | Examples |
|---|---|---|---|
| Code fragments | 69 | 33% | =", (", );, },\n |
| Non-ASCII text | 55 | 26% | Non-Latin scripts, mixed-encoding |
| Whitespace patterns | 23 | 11% | Tab sequences, mixed indent |
| Other | 61 | 29% | Mixed punctuation, formatting |

Code-structural tokens (brackets, delimiters, assignment operators combined with whitespace) are disproportionately represented in the borderline range. These tokens are common in training data but appear in highly varied contexts, potentially creating unstable internal representations.
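A rough version of this categorization might look like the following; the heuristics are our own illustration, since the exact classifier used here is not specified:

```python
import re

def categorize(token_text: str) -> str:
    """Heuristic content category for a glitch-token string.

    Assumed rules: any non-ASCII byte wins, then pure whitespace,
    then code-structural punctuation, else 'other'.
    """
    if any(ord(c) > 127 for c in token_text):
        return "non-ascii"
    if token_text.strip() == "":
        return "whitespace"
    if re.search(r'[(){}\[\]=;"]', token_text):
        return "code-fragment"
    return "other"

assert categorize('="') == "code-fragment"
assert categorize("\t\t\t\t") == "whitespace"
assert categorize("токен") == "non-ascii"
assert categorize(",,") == "other"
```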


7. Implications for the original paper

7.1 ASR claims require correction

The original paper reported ASR validation reducing false positive rates from "~30--50% (single-probe) to ~5--15%." While the multi-probe design is sound, the reported ASR values were artifacts of deterministic generation. The corrected methodology produces a richer signal, but the false-positive-reduction claim needs to be re-evaluated with the sampling-based approach.

7.2 Entropy mining coverage is limited

The original paper presented entropy-guided mining as the primary detection method. This follow-up demonstrates that entropy mining covers less than 1% of the vocabulary and misses the majority of glitch tokens. For comprehensive coverage, full-vocabulary scanning is necessary despite higher compute cost.

7.3 Section 6.2 update

The ASR validation methodology description in Section 6.2 of the original paper should note that meaningful multi-attempt ASR requires stochastic generation. The corrected approach uses temperature=0.7, top_p=0.9 for multi-attempt runs while retaining greedy decoding for single-attempt deterministic validation.

7.4 Appendix B now has tooling

The reproducibility checklist in Appendix B of the original paper is now backed by concrete tooling in APPENDIX-B/:

| Tool | Purpose |
|---|---|
| collect_environment.py | Auto-captures model, library, GPU, and seed state |
| verify_reproducibility.py | Compares current environment against a saved snapshot |
| probe_templates.py | Frozen, versioned copy of the 3 probe templates |
| control_tokens.json | Standard non-glitch baseline tokens |
| redact.py | SHA-256 hash redaction of sensitive token text |
| reproducibility_config.json | Template with all parameter defaults |

Additionally, the CLI now supports --seed for all subcommands and automatically injects an _environment metadata block into all JSON result files.


8. Reproducibility framework

8.1 Seed fixing

The --seed flag was added to the mine, test, and genetic CLI subcommands. When provided, it sets:

  • torch.manual_seed(seed)
  • torch.cuda.manual_seed_all(seed)
  • numpy.random.seed(seed)
  • random.seed(seed)

Note: Full determinism additionally requires torch.backends.cudnn.deterministic = True and CUBLAS_WORKSPACE_CONFIG=:4096:8, which are not set by default due to performance impact. The _environment block records whether these are enabled.
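The four calls can be wrapped in a single helper; this sketch mirrors the bullet list above, skipping the numpy and torch steps gracefully when the library is absent (the toolkit's own function may differ in name and detail):

```python
import random

def set_global_seed(seed: int) -> None:
    """Seed every RNG the pipeline touches."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op without a GPU
    except ImportError:
        pass

set_global_seed(42)
first = random.random()
set_global_seed(42)
assert random.random() == first   # same seed, same stream
```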

8.2 Environment snapshots

Every result file now contains an _environment key:

```json
{
  "_environment": {
    "timestamp": "2026-02-27T...",
    "python": "3.12.2",
    "torch": "2.12.0.dev20260225+cu128",
    "transformers": "5.2.0",
    "gpu": "NVIDIA GeForce RTX 5090",
    "cuda": "12.8",
    "device": "cuda",
    "quantization": "bfloat16",
    "seed": 42
  }
}
```
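A block like this can be assembled with a few standard calls. The sketch below uses only the standard library plus optional torch, mirroring the field names in the example; it is a hypothetical stand-in for the shipped collect_environment.py:

```python
import json
import platform
from datetime import datetime, timezone

def collect_environment(seed: int) -> dict:
    """Build an _environment metadata block for a result file."""
    env = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "seed": seed,
    }
    try:  # library/GPU fields only when torch is installed
        import torch
        env["torch"] = torch.__version__
        env["cuda"] = torch.version.cuda
        env["device"] = "cuda" if torch.cuda.is_available() else "cpu"
        if torch.cuda.is_available():
            env["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return env

record = {"_environment": collect_environment(seed=42)}
print(json.dumps(record, indent=2))
```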

8.3 Token redaction

The redact.py utility replaces sensitive token text with SHA-256 hash placeholders while preserving token IDs and all numeric metadata. Non-sensitive fields (base_text, wanted_token_text, target_token_text) are excluded from redaction. This enables sharing results publicly without exposing adversarial token strings.
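The hashing scheme can be illustrated in a few lines; the placeholder format and field-selection rule here are assumptions, with only the excluded field names taken from the description above:

```python
import hashlib

# Non-sensitive text fields that are preserved verbatim.
EXCLUDED = {"base_text", "wanted_token_text", "target_token_text"}

def redact_record(record: dict) -> dict:
    """Replace sensitive token-text fields with SHA-256 placeholders,
    preserving token IDs and all numeric metadata untouched."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str) and key.endswith("_text") and key not in EXCLUDED:
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
            out[key] = f"[REDACTED:sha256:{digest[:16]}]"
        else:
            out[key] = value
    return out

row = {"token_id": 446, "token_text": '("', "asr": 0.9, "base_text": "hello"}
clean = redact_record(row)
assert clean["token_id"] == 446 and clean["asr"] == 0.9
assert clean["base_text"] == "hello"               # excluded field survives
assert clean["token_text"].startswith("[REDACTED:sha256:")
```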


9. Updated recommendations

Based on these findings, we update the operational recommendations from the original paper:

9.1 For detection

  • Use full-vocabulary scanning for comprehensive coverage. Entropy-guided mining is fast but misses the majority of glitch tokens. Full-vocabulary scans are feasible for models up to ~128K vocabulary size on consumer GPUs (approximately 2--3 hours for Llama 3.2 1B on an RTX 5090).
  • Use sampling-based ASR for severity ranking. Greedy ASR only distinguishes "always glitchy" from "never glitchy." Sampling-based ASR (temperature=0.7, 10 attempts) reveals the full stability spectrum, enabling risk-proportional prioritization.
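The full-vocabulary scan itself is a straightforward loop over token IDs; the skeleton below uses a stubbed per-token check in place of the real probe-and-filter step (decode the token, run the probe templates, apply the ASR >= 0.1 initial filter):

```python
def scan_vocabulary(vocab_size: int, is_glitch_candidate, batch_size: int = 256):
    """Brute-force every token ID, collecting IDs that pass the filter.

    `is_glitch_candidate` is a stub for the real per-token probe;
    batching is where GPU parallelism would apply in practice.
    """
    candidates = []
    for start in range(0, vocab_size, batch_size):
        for token_id in range(start, min(start + batch_size, vocab_size)):
            if is_glitch_candidate(token_id):
                candidates.append(token_id)
    return candidates

# Stub: pretend every 25th token trips the filter (~4% hit rate,
# roughly matching the 3.9% observed in Section 3.1).
found = scan_vocabulary(1000, lambda tid: tid % 25 == 0)
assert len(found) == 40
```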

9.2 For validation

  • Multi-attempt validation must use stochastic generation. This is the single most important methodological correction. Deterministic generation with multiple attempts is equivalent to single-attempt validation and wastes compute.
  • Report the full ASR distribution, not just the count above threshold. The shape of the distribution (e.g., bimodal vs. continuous) is itself a diagnostic signal about the model's tokenizer quality.

9.3 For mitigation

  • Prioritize hard glitches (100% ASR) for blocklisting. These are deterministic failures that will always occur.
  • Monitor soft glitches (50--99% ASR) at runtime. These cause intermittent failures that are difficult to reproduce in testing but will appear in production at scale.
  • Evaluate marginal glitches (30--49% ASR) in context. For safety-critical applications, even a 30% failure rate may be unacceptable. For general-purpose use, these may be acceptable risks.

10. Limitations

  • Single model. All experiments were conducted on Llama 3.2 1B Instruct. The ASR distribution shape, token categories, and entropy mining coverage gap may differ for other architectures and model sizes.
  • Single quantization level. All tests used bfloat16. Quantization to int4/int8 may shift ASR values for individual tokens.
  • Sampling sensitivity. The choice of temperature=0.7 and top_p=0.9 for multi-attempt ASR affects the distribution. Lower temperatures would compress the distribution toward the extremes; higher temperatures would spread it. The optimal sampling parameters for ASR measurement are not yet established.
  • Probe template dependence. ASR is measured against three specific probe templates. Tokens that glitch only under prompts not covered by these templates will be missed.
  • No cross-provider validation. The corrected ASR results have not yet been validated against API providers (OpenAI, Anthropic, Mistral) to determine whether the stability distribution transfers across serving stacks.

Changelog

| Date | Change |
|---|---|
| 2026-02-27 | Initial publication |