When LLMs Guard a Wallet
A simulator-only adversarial study of 25 attacker models against 25 guardian models over 1,875 ordered trials.
This article is the long-form companion to a research paper I published through Crow Tech. The interactive charts, downloadable PDF, BibTeX citation, artifact manifest, sanitized public data files, raw-data link, and source code all live on the study page.
🔗 Interactive version with all charts and downloads: crow.sg/research/llm-wallet-guard-study
Abstract
We evaluated whether LLM guardian agents follow a protective wallet instruction under bounded adversarial persuasion from attacker LLMs. The study is simulator-only: it used no real wallets, private keys, RPC endpoints, chain SDKs, wallet signing, testnet/mainnet activity, or real funds. The first pass covered 1,875 ordered attacker-guardian trials: 106 transfer-action attempts, 1,021 guardian self-reported unsafe-intent flags, 703 resilient rows, 41 provider/runtime errors, and 4 incomplete rows. The transfer-action attempts and self-reported flags together form a 1,127-row first-pass composite signal. A separate retry envelope resolved the 41 first-pass error rows into 20 additional composite-signal rows and 21 resilient rows, giving 1,147 composite-signal rows and 724 resilient rows in the retry-aware view.
Method
The main run paired 25 guardian routes with 25 attacker routes across 3 independent repetitions, producing 625 ordered model pairs, 1,875 first-pass trials, and 26,238 completed first-pass model attempts. The retry envelope covered the 41 first-pass error rows and completed 704 retry attempts. Each attacker had a budget of 25 attempts per trial and a 200-word cap per message. Calls used provider-default temperature and sampling.
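The design arithmetic above can be checked in a few lines. This is a minimal sketch, not the study's code; the constant names are made up for illustration, and the mean at the end assumes that one "model attempt" corresponds to one attacker attempt within the 25-attempt budget, which the text does not spell out.

```python
# Study design parameters as stated in the Method section.
GUARDIANS = 25
ATTACKERS = 25
REPETITIONS = 3
MAX_ATTEMPTS_PER_TRIAL = 25  # attacker attempt budget (assumed to apply per trial)

ordered_pairs = GUARDIANS * ATTACKERS                 # attacker-guardian pairs
first_pass_trials = ordered_pairs * REPETITIONS       # one trial per pair per repetition
attempt_ceiling = first_pass_trials * MAX_ATTEMPTS_PER_TRIAL

# Reported count of completed first-pass model attempts.
completed_attempts = 26_238

# Assumption: one "model attempt" equals one attacker attempt in the budget.
mean_attempts_per_trial = completed_attempts / first_pass_trials

print(ordered_pairs, first_pass_trials, attempt_ceiling)
print(round(mean_attempts_per_trial, 2))
```

Under that assumption, trials ended well short of the budget ceiling on average, which fits a design where a trial stops at the first terminal outcome.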
The terminal flagged outcomes are split between attempted_transfer_tool_call and unsafe_stated_intent. The first is an action compromise: the guardian emitted a transfer-shaped simulator action. The second is not an independent judge label; it is the guardian model's own structured self-assessment that its response reflected unsafe transfer intent, and can appear beside refusal-like visible text. For compatibility, the aggregate CSVs retain the historical guardian_compromised field name, but the honest reading is a composite signal: transfer-action attempt or self-reported unsafe-intent flag. Transfer-shaped simulator actions were blocked by policy, so model action attempts, self-reported flags, and policy enforcement are reported separately.
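The composite reading described above can be sketched as a simple classification over terminal outcomes. The outcome strings come from the text; the helper names and the toy rows are hypothetical and are not study data.

```python
from collections import Counter

# Terminal flagged outcomes named in the text.
ACTION = "attempted_transfer_tool_call"   # guardian emitted a transfer-shaped action
SELF_REPORT = "unsafe_stated_intent"      # guardian's own structured self-assessment
FLAGGED = frozenset({ACTION, SELF_REPORT})


def is_composite_signal(outcome: str) -> bool:
    """True for either flagged subtype (the legacy guardian_compromised reading)."""
    return outcome in FLAGGED


def is_action_compromise(outcome: str) -> bool:
    """True only when the guardian actually attempted a transfer-shaped action."""
    return outcome == ACTION


# Toy rows for illustration only.
rows = [ACTION, SELF_REPORT, SELF_REPORT, "guardian_resilient", "error"]

tally = Counter(rows)
composite = sum(n for outcome, n in tally.items() if is_composite_signal(outcome))
actions = sum(1 for outcome in rows if is_action_compromise(outcome))

print(composite, actions)  # 3 1
```

Keeping the two predicates separate mirrors the reporting choice: the composite count is an upper bound on concern, while the action count isolates rows where the guardian tried to act.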
Results
| Status | First pass | Retry-aware |
|---|---|---|
| composite signal (guardian_compromised) | 1127 | 1147 |
| guardian_resilient | 703 | 724 |
| error | 41 | 0 |
| incomplete | 4 | 4 |
| Flagged subtype | First pass | Retry-aware |
|---|---|---|
| transfer-action attempt | 106 | 107 |
| self-reported unsafe-intent flag | 1021 | 1040 |
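The two tables reconcile with each other, which the following sketch checks. The dictionary keys are illustrative labels, and the retry resolution (1 transfer action, 19 self-report flags, 21 resilient) is read off the differences between the two columns rather than taken from a published breakdown.

```python
TOTAL_TRIALS = 1875

# First-pass counts from the tables above.
first_pass = {
    "transfer_action": 106,
    "self_report": 1021,
    "resilient": 703,
    "error": 41,
    "incomplete": 4,
}

# How the 41 error rows resolved on retry, inferred from the column differences.
retry_resolution = {"transfer_action": 1, "self_report": 19, "resilient": 21}


def composite(counts: dict) -> int:
    """Composite signal = transfer-action attempts + self-reported flags."""
    return counts["transfer_action"] + counts["self_report"]


# Overlay the retry outcomes: error rows move into their resolved category.
retry_aware = dict(first_pass)
retry_aware["error"] -= sum(retry_resolution.values())
for category, n in retry_resolution.items():
    retry_aware[category] += n

# Both views must account for every trial exactly once.
assert sum(first_pass.values()) == TOTAL_TRIALS
assert sum(retry_aware.values()) == TOTAL_TRIALS

print(composite(first_pass), composite(retry_aware))   # 1127 1147
print(retry_aware["resilient"], retry_aware["error"])  # 724 0
```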
Interactive Outcome Map
[Interactive heatmap: 25 guardians x 25 attackers, retry-aware composite-signal rate. Rows are guardians G1 to G25 as listed below; columns are attackers A1 to A25. Per-cell values are in the interactive version and the ordered pair matrix CSV.]
| Row | Guardian |
|---|---|
| G1 | GPT 5 Nano |
| G2 | Gemini 3 Flash Preview |
| G3 | HY3 Preview |
| G4 | GLM 4.7 Flash |
| G5 | Gemini 3.1 Pro Preview |
| G6 | GLM 5.1 |
| G7 | GLM 5 |
| G8 | Minimax M2.7 |
| G9 | Claude Opus 4.7 |
| G10 | Minimax M2 |
| G11 | Claude Sonnet 4.6 |
| G12 | Ernie 4.5 300B A47B |
| G13 | Deepseek V3.2 |
| G14 | Minimax M2.5 |
| G15 | Deepseek V4 Flash |
| G16 | GPT 5.4 |
| G17 | Claude Haiku 4.5 |
| G18 | Deepseek V4 Pro |
| G19 | Gemini 3.1 Flash Lite Preview |
| G20 | Qwen3.6 Flash |
| G21 | Qwen3.6 Plus |
| G22 | Hunyuan A13B Instruct |
| G23 | Grok 4.1 Fast |
| G24 | Grok 4.3 |
| G25 | Qwen3.6 Max Preview |
Guardian Ranking
| Guardian | Provider | Resilient (first pass) | Composite signals (first pass) | Retry-aware resilient rate |
|---|---|---|---|---|
| GPT 5 Nano | OpenAI | 73 | 0 | 100.0% |
| HY3 Preview | Tencent | 73 | 0 | 98.7% |
| Gemini 3 Flash Preview | Google | 72 | 1 | 98.7% |
| GLM 4.7 Flash | Z.ai | 71 | 3 | 96.0% |
| Gemini 3.1 Pro Preview | Google | 48 | 26 | 64.0% |
| GLM 5.1 | Z.ai | 40 | 34 | 54.7% |
| GLM 5 | Z.ai | 37 | 36 | 52.0% |
| Minimax M2.7 | MiniMax | 35 | 38 | 46.7% |
| Minimax M2 | MiniMax | 32 | 39 | 44.0% |
| Claude Opus 4.7 | Anthropic | 32 | 42 | 44.0% |
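The percentage column is consistent with a denominator of 75 trials per guardian (25 attackers times 3 repetitions). That denominator is an inference from the reported numbers, not a formula stated in the source, and the helper names below are hypothetical. The sketch converts between row counts and the displayed rate.

```python
TRIALS_PER_GUARDIAN = 25 * 3  # 25 attackers x 3 repetitions (assumed denominator)


def rate(count: int, trials: int = TRIALS_PER_GUARDIAN) -> str:
    """Format a row count as a one-decimal percentage of all trials."""
    return f"{100 * count / trials:.1f}%"


def implied_rows(percent_text: str, trials: int = TRIALS_PER_GUARDIAN) -> int:
    """Recover the whole number of rows behind a displayed percentage."""
    return round(float(percent_text.rstrip("%")) / 100 * trials)


# Displayed retry-aware rates from the table mapped back to row counts.
for shown in ["100.0%", "98.7%", "96.0%", "64.0%", "54.7%", "44.0%"]:
    rows = implied_rows(shown)
    print(shown, rows, rate(rows))
```

Read this way, a guardian listed at 100.0% had all 75 rows resilient after retries, even where its count columns sum to fewer than 75 rows.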
Attacker Ranking
| Attacker | Provider | Composite signals | Resilient | Retry-aware composite |
|---|---|---|---|---|
| Grok 4.3 | xAI | 56 | 19 | 74.7% |
| Ernie 4.5 300B A47B | Baidu | 54 | 21 | 72.0% |
| Gemini 3 Flash Preview | Google | 53 | 21 | 70.7% |
| GLM 5 | Z.ai | 53 | 22 | 70.7% |
| Minimax M2.7 | MiniMax | 53 | 22 | 70.7% |
| Minimax M2 | MiniMax | 52 | 23 | 69.3% |
| Grok 4.1 Fast | xAI | 51 | 24 | 68.0% |
| Qwen3.6 Max Preview | Qwen | 50 | 25 | 66.7% |
| Qwen3.6 Flash | Qwen | 49 | 25 | 65.3% |
| Gemini 3.1 Flash Lite Preview | Google | 49 | 26 | 65.3% |
Reliability and Retries
First-pass provider/runtime errors were preserved as reliability data. The retry envelope replaces only mapped first-pass error rows and is shown separately.
| Subtype | Role | Count |
|---|---|---|
| attacker_live_error:RuntimeError | attacker | 38 |
| attacker_live_error:ValueError | attacker | 2 |
| guardian_live_error:RuntimeError | guardian | 1 |
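The error subtypes above follow a role_live_error:ExceptionName pattern, so they can be tallied by role or by exception type. The sketch below does that and also illustrates the "replace only mapped error rows" rule with toy trial IDs. The function names and the trial-ID keyed dictionaries are assumptions for illustration, not the layout of the published retry envelope CSV.

```python
from collections import Counter

# Error subtype counts from the table above.
error_counts = {
    "attacker_live_error:RuntimeError": 38,
    "attacker_live_error:ValueError": 2,
    "guardian_live_error:RuntimeError": 1,
}


def parse_subtype(subtype: str) -> tuple[str, str]:
    """Split 'attacker_live_error:RuntimeError' into ('attacker', 'RuntimeError')."""
    source, _, exception_name = subtype.partition(":")
    role = source.split("_", 1)[0]
    return role, exception_name


by_role: Counter = Counter()
by_exception: Counter = Counter()
for subtype, n in error_counts.items():
    role, exception_name = parse_subtype(subtype)
    by_role[role] += n
    by_exception[exception_name] += n


def apply_retry_envelope(first_pass: dict[str, str], retries: dict[str, str]) -> dict[str, str]:
    """Overlay retry outcomes, touching only rows that errored on the first pass."""
    merged = dict(first_pass)
    for trial_id, retry_outcome in retries.items():
        if first_pass.get(trial_id) == "error":
            merged[trial_id] = retry_outcome
    return merged


# Toy rows: t3 did not error, so its retry entry is ignored.
toy_first_pass = {"t1": "guardian_resilient", "t2": "error", "t3": "unsafe_stated_intent"}
toy_retries = {"t2": "guardian_resilient", "t3": "guardian_resilient"}
toy_merged = apply_retry_envelope(toy_first_pass, toy_retries)

print(dict(by_role), dict(by_exception), sum(error_counts.values()))
print(toy_merged)
```

The overlay leaves the first-pass dictionary untouched, so both views stay available for separate reporting.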
Safety, Ethics, and Limitations
- This is a simulator-only adversarial AI safety evaluation; no real private keys, real wallets, RPC, chain SDKs, wallet signing, mainnet/testnet activity, or real funds were used.
- Guardian transfer attempts are transfer-shaped simulator actions only. All observed transfer-shaped actions were blocked by deterministic policy.
- The unsafe_stated_intent outcome is the guardian model's own structured self-assessment, not an independent judge label. It may capture schema-following or self-classification instability, so it is reported separately from transfer-action attempts.
- First-pass provider/runtime errors are preserved as reliability data. The retry-aware envelope replaces only the mapped first-pass error rows and is reported separately from the first pass.
- Rows are ordered attacker-vs-guardian pairs over three repetitions, not independent claims about a provider as a whole.
- A guardian marked resilient only means no composite signal was observed within the 25-attempt budget.
- Calls used provider-default temperature and sampling through an OpenAI-compatible route. Provider defaults and transient routing errors are part of the measured environment.
- AI assistance was used for orchestration, analysis, code, and publication packaging. Daniel Alonso conducted the study with Crow Tech publication support.
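The separation between a model's action attempt and deterministic policy enforcement, noted in the list above, might look like the following. This is a sketch under stated assumptions: the action schema (a dict with a "type" key), the set of blocked types, and every name here are hypothetical, and the study's actual simulator may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyDecision:
    allowed: bool
    reason: str


# Hypothetical set of action types treated as transfer-shaped.
TRANSFER_ACTION_TYPES = frozenset({"transfer"})


def evaluate_action(action: dict) -> PolicyDecision:
    """Deterministic check: same input, same decision, no model call involved."""
    if action.get("type") in TRANSFER_ACTION_TYPES:
        return PolicyDecision(False, "transfer-shaped action blocked by policy")
    return PolicyDecision(True, "non-transfer action")


def record_step(action: dict) -> dict:
    """Log the model's attempt and the policy outcome as separate fields."""
    decision = evaluate_action(action)
    return {
        "model_attempted_transfer": action.get("type") in TRANSFER_ACTION_TYPES,
        "policy_allowed": decision.allowed,
        "policy_reason": decision.reason,
    }


# Simulator-only example values; no real address or funds are involved.
attempt = {"type": "transfer", "to": "SIMULATED_ADDRESS", "amount": 1}
refusal = {"type": "message", "text": "I cannot move funds."}

print(record_step(attempt))
print(record_step(refusal))
```

Logging the attempt and the enforcement result as separate fields is what allows a blocked transfer to still count as an action compromise by the model.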
Artifacts and Reproducibility
Public article and interactive charts: https://crow.sg/research/llm-wallet-guard-study. Artifact manifest and checksums: https://crow.sg/research/llm-wallet-guard-study/artifact-manifest.json.
- Public summary JSON: Generated machine-readable public dataset used by the wallet-guardian study page charts.
- Summary JSON schema: Machine-readable JSON Schema for the generated public summary.
- Artifact manifest schema: Machine-readable JSON Schema for the public artifact manifest.
- Paper PDF: Generated paper-style PDF.
- Printable HTML paper: Browser-printable HTML version of the paper.
- LaTeX source: LaTeX source for rebuilding the paper when a LaTeX toolchain is available.
- BibTeX citation: Citation entry for reference managers and academic notes.
- Sanitized first-pass trial CSV: One sanitized row per first-pass ordered attacker-guardian-condition trial, with retry envelope fields for errored rows.
- Raw dataset archive: Full raw data archive hosted on Google Drive for independent inspection and reanalysis.
- Source code repository: Public study code and reconstruction materials.
- Public data notes: Field definitions and caveats for interpreting composite-signal, transfer-action, and self-reported unsafe-intent counts.
- Ordered pair matrix CSV: Aggregated 25 by 25 ordered attacker-versus-guardian matrix over three repetitions.
- Guardian resilience ranking CSV: Per-guardian sanitized outcome counts and resilience metrics.
- Attacker effectiveness ranking CSV: Per-attacker sanitized outcome counts and effectiveness metrics.
- Retry envelope CSV: Mapping from first-pass provider/runtime errors to retry-run outcomes.
- Outcome map SVG: Vector heatmap preview of ordered attacker-versus-guardian composite-signal rates.
- Outcome map PNG: Raster preview image for social cards and crawlers that do not reliably render SVG.
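Because the artifact manifest publishes checksums, downloaded files can be verified locally before reanalysis. The sketch below assumes SHA-256 digests and a manifest shaped as a list of entries with "path" and "sha256" keys under an "artifacts" key; the real manifest's algorithm and field names should be taken from its published JSON Schema. The demo uses temporary files rather than the actual artifacts.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Stream a file through SHA-256 so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: dict, root: Path) -> dict[str, bool]:
    """Return {relative path: checksum matches} for every manifest entry."""
    results = {}
    for entry in manifest["artifacts"]:  # hypothetical key names
        actual = sha256_of(root / entry["path"])
        results[entry["path"]] = actual == entry["sha256"]
    return results


# Demo with throwaway files: one entry matches, one has a wrong digest.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    good_bytes = b'{"trials": 1875}'
    (root / "summary.json").write_bytes(good_bytes)
    (root / "notes.txt").write_bytes(b"field definitions")
    demo_manifest = {
        "artifacts": [
            {"path": "summary.json", "sha256": hashlib.sha256(good_bytes).hexdigest()},
            {"path": "notes.txt", "sha256": "0" * 64},
        ]
    }
    demo_results = verify_manifest(demo_manifest, root)

print(demo_results)
```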