Arbitrary Choices Are Not Random

May 8, 2026

A forced-choice audit of 42 LLMs across 50,400 planned trials. The short version: when a workflow asks an LLM to pick between two meaningless labels, the answer is often carrying order, word-choice, and context residue. This article is the long-form companion to a research paper I published through Crow Tech. The interactive charts, downloadable PDF, BibTeX citation, source code, and full public summary JSON all live there.

🔗 Interactive version with all charts and downloads: crow.sg/research/llm-arbitrary-choice-study


Abstract

This project studies forced binary choices where neither option is intended to be correct. I ran it because a lot of product logic quietly asks models to pick between labels and then treats the answer like neutral randomness. The goal here is narrower: measure regularities in behavior, not infer intent, consciousness, political preference, moral belief, or any other inner state. Every pair is run in normal and swapped order, with weak context variants, so word preference and first-or-second-position habits can be separated.

42 models from 21 providers, 30 word pairs, 60 context snippets, 50,400 planned trials, 48,316 OK rows. About 60.4% of OK rows picked the first displayed option. Read that as a warning sign, not a leaderboard.


Why I ran this

A lot of product code asks an LLM to choose between two essentially arbitrary labels (A vs B, "first" vs "second", "option 1" vs "option 2", a winner of two roughly equivalent variants) and then either:

  • treats the answer as if it were a coin flip, or
  • treats it as a real preference signal.

Both treatments can be wrong. If the model has a position bias, then which option you put first matters. If the model has a word-level preference, then "sweet" vs "bitter" is not the same prompt as "bitter" vs "sweet", even when the rest of the prompt is identical. If a sentence sneaks into the prompt that nudges one option, the answer is carrying that nudge, not noise.

This study is a measurement of those three forces:

  1. Position effect: the share of times the first displayed option is chosen, averaged across many pairs in normal and swapped order.
  2. Word-level preference: for each A/B pair, how far the majority share is from 50/50 once both display orders are pooled.
  3. Context lift: the change in chosen-option share when a single weak context sentence is added before the same forced choice.

Headline numbers

| Metric | Value | What it means |
| --- | --- | --- |
| Models tested | 42 | Across 21 providers, mix of flagship, mid-tier, open-weight. |
| Word pairs | 30 | Ordinary pairs: sweet/bitter, smooth/rough, morning/evening, etc. |
| Context snippets | 60 | One-sentence contexts placed before the same choice. |
| Planned trials | 50,400 | Each model × pair × condition × repetition, balanced for order. |
| OK rows | 48,316 | ~95.9% parsed cleanly to a single option. |
| Overall first-option share | 60.4% | If position were truly neutral this should sit at 50%. |
| Total spend | $28.60 | OpenRouter dashboard total. Cheap study, expensive lesson. |
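
The trial budget can be re-derived from the design. This is a minimal arithmetic sketch that assumes the four conditions and 10 repetitions listed in the Method section; the variable names are mine, not the study's code.

```python
# Design constants taken from the article (Headline numbers and Method).
MODELS = 42
PAIRS = 30
CONDITIONS = ("bare", "bare swapped", "context", "context swapped")
REPS = 10

# Every model sees every pair in every condition, REPS times.
planned = MODELS * PAIRS * len(CONDITIONS) * REPS
per_model = PAIRS * len(CONDITIONS) * REPS

ok_rows = 48_316
ok_rate = ok_rows / planned

print(planned)            # 50400
print(per_model)          # 1200
print(f"{ok_rate:.1%}")   # 95.9%
```

The per-model figure of 1,200 rows is the same denominator used later when per-model first-option shares are discussed.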

The boring prompt was not boring to the models

Across the 48,316 successful rows, the first displayed option was selected 60.4% of the time. If position were truly neutral, that figure should be near 50%. Splitting by condition makes it sharper: in the bare prompt (just A or B? with the natural order) the first-option share is 75.8%; in bare swapped (the same pair with B or A?) it falls to 59.3%. Both numbers are far from 50, and they don't match: option identity and display position were both pulling.

Averaging the two bare orders separates the two forces cleanly:

  • Pure position bias (bare, order-balanced): 67.5%. This is the share of trials where the model picks whichever option is shown first, after balancing across both display orders so option identity cancels. It is the "first thing wins" tendency on its own.
  • Word-A residual asymmetry (bare): the first-option share drops 16.4 percentage points between bare and bare swapped. Half of that gap, ~8.2 pp, is option A being chosen above chance once order is balanced (option-A share 58.2% vs 50%). That is the word effect averaged across all 30 pairs; per-pair word effects below are much larger.
  • Headline 60.4% pools all four conditions including the two context conditions, so it is both effects mixed with the contexts pulling toward their intended option. The 67.5% number is the cleaner "position-only" measurement.
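
The averaging step above can be sketched in a few lines. The function name and structure are illustrative assumptions; the inputs are the two rounded bare shares quoted in this section, so the outputs agree with the reported 67.5% and 58.2% only up to rounding.

```python
def decompose(first_share_bare, first_share_swapped):
    """Split two first-option shares into a position and an option-A component.

    In the bare order, option A is displayed first; in the swapped order,
    option B is displayed first. Averaging first-option shares cancels
    option identity, and averaging option-A shares cancels position.
    """
    position = (first_share_bare + first_share_swapped) / 2
    option_a_bare = first_share_bare            # A was first, so picking first = picking A
    option_a_swapped = 1 - first_share_swapped  # B was first, so picking second = picking A
    option_a = (option_a_bare + option_a_swapped) / 2
    return position, option_a

position, option_a = decompose(0.758, 0.593)
print(f"order-balanced first-option share: {position:.4f}")
print(f"order-balanced option-A share:     {option_a:.4f}")
```

With a perfectly indifferent model both values would sit at 0.5; any gap from 0.5 in the first is position bias, and any gap in the second is a residual preference for option A.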

The 30 pairs were balanced (10 reps × bare + 10 reps × bare swapped per pair, plus the two context conditions), so a model that was truly indifferent would land near 50/50 by construction. Most models did not.

The parser was not the problem: 99.5% of OK rows were exact one-word parses, and only one row required a manual override. The non-OK rows (invalid, error, rate-limited, model removed) were concentrated on two routes: Baidu's ERNIE 4.5 21B A3B (rate limited and pulled) and Reka Flash 3 (high-token caveat retries).
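
For readers wondering what an "exact one-word parse" might look like, here is a hedged sketch. It is not the study's parser: `parse_choice` is a hypothetical helper, and the normalization rules (lowercasing, stripping punctuation) are my assumptions.

```python
import re
from typing import Optional

def parse_choice(reply: str, option_a: str, option_b: str) -> Optional[str]:
    """Return the chosen option only when the reply is exactly one of the two words."""
    # Drop punctuation, trim whitespace, and ignore case.
    cleaned = re.sub(r"[^\w\s]", "", reply).strip().lower()
    for option in (option_a, option_b):
        if cleaned == option.lower():
            return option
    # Anything longer or ambiguous is left unparsed for a stricter review step.
    return None

print(parse_choice("Sweet.", "sweet", "bitter"))
print(parse_choice("Either could work here", "sweet", "bitter"))
```

Keeping the strict path this narrow means that explanatory or hedged answers are never silently coerced into a choice.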


Word-level preferences

The strongest word-level preferences from pooled bare and bare-swapped data, ranked by majority share:

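| Pair | Majority | Share | n |
| --- | --- | --- | --- |
| sweet / bitter | sweet | 90.7% | 802 |
| smooth / rough | smooth | 88.3% | 804 |
| loud / quiet | quiet | 87.1% | 804 |
| fast / slow | fast | 86.5% | 806 |
| up / down | up | 82.5% | 805 |
| mountain / valley | mountain | 81.9% | 805 |
| early / late | early | 81.5% | 806 |
| warm / cold | warm | 81.2% | 809 |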

These are not benign. If a workflow asks a model "is this output more smooth or rough" and that prompt is the only check in place, the workflow is already leaning ~88% toward smooth before the model even looks at the output. The same pair asked as "rough or smooth" will not behave the same way.

The most neutral pair was triangle / oval, with preference strength 0.0, an exact 50/50 split across 808 trials. So neutrality is achievable; it's just not the default.


Are these word preferences just corpus frequency?

A reasonable first explanation for "models prefer sweet" is "models prefer the more common token." If that were the dominant story, the per-pair preference strength would track the unigram frequency gap, and the chosen option would almost always be the more frequent of the pair.

I checked. Using wordfreq Zipf log-frequencies for English, regressing the order-balanced option-A share on zipf(option_a) - zipf(option_b) across the 30 pairs gives:

  • Pearson r = 0.235, R² = 0.055, slope p ≈ 0.21, n = 30. Frequency direction explains ~5.5% of the variance in chosen-option share.
  • The more frequent token wins 19 of 30 pairs (63%), better than chance (50%) but well below what a frequency-driven story predicts.

The clearest counter-frequency cases (pairs where the chosen option is less common than the rejected option, often by a wide margin) make the case sharper:

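| Pair | More frequent | Δ Zipf | Chosen majority | Share |
| --- | --- | --- | --- | --- |
| sharp / mellow | sharp | +1.16 | mellow | 70.3% |
| smooth / rough | rough | +0.09 | smooth | 88.3% |
| warm / cold | cold | +0.31 | warm | 81.2% |
| circle / square | square | +0.29 | circle | 71.1% |
| candle / lamp | lamp | +0.17 | lamp | 73.0% |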

This is the cheapest possible sanity check, not a definitive control. It doesn't account for bigram context, collocation strength, or which token is more typical as a one-word answer. But it is enough to dismiss the simplest version of the confound: the per-pair preferences carry information beyond the corpus-frequency baseline of the two tokens.
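
The mechanics of the frequency check can be sketched as below. This is an illustration only: it uses just the five rows from the counter-frequency table, converted under my assumption that Δ Zipf is the gap in favor of the "More frequent" word, so it will not reproduce the r = 0.235 obtained over all 30 pairs. `pearson_r` is a plain-Python helper written for this sketch, not the study's analysis code.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# (pair, zipf(option_a) - zipf(option_b), order-balanced option-A share)
# Signs and shares are re-expressed relative to option A (the first word of the pair).
rows = [
    ("sharp / mellow",  +1.16, 1 - 0.703),  # sharp more frequent, mellow chosen
    ("smooth / rough",  -0.09, 0.883),      # rough more frequent, smooth chosen
    ("warm / cold",     -0.31, 0.812),      # cold more frequent, warm chosen
    ("circle / square", -0.29, 0.711),      # square more frequent, circle chosen
    ("candle / lamp",   -0.17, 1 - 0.730),  # lamp more frequent, lamp chosen
]

zipf_gap = [gap for _, gap, _ in rows]
share_a = [share for _, _, share in rows]

r = pearson_r(zipf_gap, share_a)
print(f"r = {r:.3f}, R^2 = {r * r:.3f} (hand-picked rows, not the full 30 pairs)")
```

On the full dataset the same calculation would take one (gap, share) point per pair; a frequency-driven story predicts a strongly positive r there.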


Position bias by model

A few models almost always pick whichever option you show first. A few do the opposite. Here are the extremes (bare + bare swapped pooled):

| Model | Provider | First-option share | Position bias | Word preference |
| --- | --- | --- | --- | --- |
| LFM2-24B-A2B | Liquid AI | 82.4% | 0.65 | 0.06 |
| Hunyuan A13B Instruct | Tencent | 74.6% | 0.49 | 0.23 |
| Llama 3.3 70B Instruct | Meta | 71.8% | 0.44 | 0.15 |
| MiniMax M2.7 | MiniMax | 69.4% | 0.39 | 0.36 |
| Mercury 2 | Inception Labs | 67.5% | 0.35 | 0.50 |
| Sonar | Perplexity | 49.1% | 0.02 | 0.50 |
| Hermes 4 405B | Nous Research | 48.8% | 0.02 | 0.79 |
| Phi 4 | Microsoft | 48.6% | 0.03 | 0.60 |
| Gemini 3 Flash Preview | Google | 47.8% | 0.05 | 0.80 |
| GPT-5.4 | OpenAI | 34.8% | 0.31 | 0.39 |

Three patterns:

  • LFM2-24B-A2B is the most positional model in the pool: 82.4% first-option, with almost no word preference (0.06). It is essentially a "first option" stamp.
  • GPT-5.4 is the only model that flipped the other way at this strength: it picks the second option 65.2% of the time. Different bias, same problem if you assumed neutrality.
  • The middle of the table is where most well-known models live. Hermes 4 405B and Gemini 3 Flash Preview look position-neutral but carry strong word preferences (~0.79–0.80 mean semantic strength), i.e., they will reliably pick sweet, smooth, quiet regardless of order, and you might mistake that consistency for "calibration".
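
The Position bias column in the table appears consistent with a simple distance-from-50% index, |2p − 1|, where p is the first-option share. That reading is my inference from the printed numbers, not a definition stated in the article, and it matches the table only to within about 0.01 because the shares shown are already rounded.

```python
def bias_index(share):
    """Distance from a 50/50 split, rescaled to 0..1 (assumed definition)."""
    return abs(2 * share - 1)

# First-option shares copied from the table above.
examples = {
    "LFM2-24B-A2B": 0.824,
    "Sonar": 0.491,
    "GPT-5.4": 0.348,
}

for model, share in examples.items():
    print(f"{model}: first-option share {share:.1%}, index {bias_index(share):.3f}")
```

Note that the index is unsigned: a strong first-option habit and a strong second-option habit both score high, which is why GPT-5.4 sits at the bottom of the table by share but not by bias.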

The full per-model table, including reasoning-token totals and dollar cost, is in the interactive Model map on Crow Tech.


Provider-level snapshot

Aggregated by provider (mean across the provider's models):

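| Provider | Models | OK rate | Mean first-option share | Mean word preference |
| --- | --- | --- | --- | --- |
| Anthropic | 3 | 100.0% | 61.8% | 0.54 |
| OpenAI | 2 | 100.0% | 48.0% | 0.46 |
| Google | 3 | 100.0% | 53.6% | 0.77 |
| Meta | 3 | 99.9% | 65.3% | 0.37 |
| Alibaba/Qwen | 3 | 99.4% | 57.9% | 0.67 |
| DeepSeek | 3 | 100.0% | 61.6% | 0.48 |
| MiniMax | 3 | 99.9% | 66.9% | 0.39 |
| xAI | 2 | 98.8% | 65.8% | 0.39 |
| Z.ai | 3 | 100.0% | 56.8% | 0.65 |
| Tencent | 2 | 99.8% | 68.9% | 0.36 |
| Mistral AI | 2 | 100.0% | 64.5% | 0.39 |
| NVIDIA | 2 | 100.0% | 65.2% | 0.47 |
| Nous Research | 2 | 100.0% | 50.5% | 0.61 |
| Perplexity | 1 | 99.9% | 49.1% | 0.50 |
| Liquid AI | 1 | 99.8% | 82.4% | 0.06 |
| Inception Labs | 1 | 100.0% | 67.5% | 0.50 |
| Microsoft | 1 | 99.8% | 48.6% | 0.60 |
| Amazon | 1 | 100.0% | 57.1% | 0.48 |
| IBM Granite | 1 | 100.0% | 65.8% | 0.48 |
| Reka AI | 1 | 31.8% | 51.1% | 0.39 |
| Baidu | 2 | 50.0% | 52.1% | 0.64 |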

Two cells worth highlighting:

  • OpenAI is the only major-lab cluster sitting clearly below 50% on first-option share, dragged there by GPT-5.4's pronounced second-option preference.
  • Google is the highest-mean-word-preference cluster among major labs (0.77). Gemini was largely position-neutral but heavily word-anchored.

Reka AI's 31.8% OK rate and Baidu's 50.0% are not capability claims; both are caveat routes (rate limits, high-token retries, model pulled). They're preserved in the public data so anyone re-running the analysis can reproduce or exclude them deliberately.


Context can flip the answer, but the label is a hypothesis

For each pair, I also ran 2 weak context conditions: a one-sentence prompt placed before the same forced choice. The intended association is mild and natural: "the coffee was served without milk" gestures at bitter, "the suitcase made a dull sound" gestures at heavy. The intended option is a hypothesis, not ground truth.

Strongest positive lifts (intended option's share moves toward the labeled direction):

| Pair | Intended | Context | Baseline → context | Lift |
| --- | --- | --- | --- | --- |
| sweet / bitter | bitter | "The coffee was served without milk while the meeting started." | 10.1% → 99.8% | +89.2 pp |
| smooth / rough | rough | "The old rope had been stored in a shed for years beside gardening tools." | 11.7% → 100.0% | +87.8 pp |
| loud / quiet | loud | "People outside the venue could hear the final rehearsal through the doors." | 13.1% → 100.0% | +86.9 pp |
| fast / slow | slow | "A turtle crossed the garden path while nobody interrupted it." | 13.6% → 99.4% | +85.6 pp |
| up / down | down | "The stairwell light flickered near the door to the basement." | 17.6% → 95.5% | +77.8 pp |

These are big effects. A single weak sentence (no instruction, no rephrasing of the question) can move the chosen-option share by roughly 78 to 89 percentage points. The implication is uncomfortable for a lot of "is this output good?" prompts: the surrounding paragraph is doing a lot of work. The result is not a measurement of the output; it is a measurement of the output filtered through the rest of the prompt.

The best counter-example is more interesting than the wins:

Pair: sweet / bitter. Intended option: sweet. Context: "The bakery case was almost empty by the time the queue moved." Baseline sweet share: 90.3%. Context sweet share: 38.3%. Lift: −51.7 pp.

The intended association ("bakery → sweet") was a reasonable hypothesis, but the actual sentence ("almost empty") apparently cued absence, quietness, or bitter (post-event flavor). The label was a hypothesis, and the data disagreed with it. That is exactly why context labels in the public summary are flagged as inferredContextTargets (55 of 60 had a useful target inferred) and not as ground truth.


Method

The dataset uses 30 ordinary word pairs such as sweet/bitter, smooth/rough, and morning/evening. Each pair was asked in a bare form and a swapped form. Context prompts add one weak sentence before the same choice.

| Setting | Value |
| --- | --- |
| Temperature | 0.7 |
| Repetitions | 10 per condition |
| Conditions | bare, bare swapped, context, context swapped |
| Inferred context targets | 55 of 60 |

Calls were orchestrated with AI-assisted tooling. The model calls themselves used OpenRouter as the model API and spend platform for this study, with SQLite as the durable run log. Runners wrote raw responses, parse decisions, usage JSON, and later attempt-level audit rows. The article and public artifacts here are AI-assisted and human-directed by Daniel Alonso.

The model taxonomy is deliberately modest: provider, family, and tier labels come from local config and route names. It is a rough descriptive grouping, not a benchmark claim about who is currently winning.


Caveats

  • Failures are preserved. Final counts: 48,316 ok, 126 invalid, 98 error, 724 rate-limited, and 1,136 model-removed rows. They are kept in the public summary so anyone re-running the analysis can exclude them deliberately rather than silently.
  • Two models are caveats. ERNIE 4.5 21B A3B hit repeated provider rate limits and was removed. The remaining invalid rows are preserved Reka Flash 3 caveat rows from pathological high-token retries; Reka alone consumed about 36% of all completion tokens and 39% of reasoning tokens for ~3.6% of attempt cost.
  • One manual override. One ERNIE 4.5 300B A47B row was manually overridden to evening after repeated explanatory answers made the final answer unambiguous. The override is logged.
  • Cost accounting is conservative. OpenRouter dashboard spend was about $28.60. Recorded attempt usage sums to less because early superseded retries were not all captured.
  • Context labels are hypotheses. As shown above, intended associations were sometimes contradicted by the data. Treat them as priors, not facts.
  • OpenRouter provider routing was not pinned. Calls used the bare model field with no provider.order or provider.only constraints, so OpenRouter was free to route a single nominal model id to any of its available backends within the run. For closed-source models this was effectively a single backend (Claude → Bedrock, GPT → OpenAI, Gemini → Google), but for open-weight models the same model id was served by many providers in one run: DeepSeek v3.2 hit 9 backends (Alibaba, AtlasCloud, Baidu, Chutes, DeepInfra, Friendli, Google, Novita, Parasail, SiliconFlow), Llama 3.3 70B hit 13, DeepSeek v4 Pro hit 6. The first-option rate within a single open-weight model varies up to ~8 percentage points across its serving providers (e.g. Llama 3.3 70B: 68.3% on AkashML vs 76.0% on DeepInfra; DeepSeek v3.2: 52.1% on Baidu vs 57.1% on Novita), so per-model rankings for open-weight routes carry some routing noise on top of the model itself. The headline aggregate finding survives, but cross-model leaderboards on the open-weight rows should be read with this in mind. The served provider is recorded in the attempts log for any reproducer that wants to slice by it.
  • Per-cell statistical power is small. With 10 reps × 4 conditions per (model, pair) cell, the binomial 95% CI half-width at p=0.5 is ~±15 pp. The aggregated claims (overall position bias across ≥48 k rows, per-model first-option share across 1,200 rows, per-pair preference across ~800 rows) are statistically robust. Single (model, pair, condition) cells are not. Any narrow claim like "model X flips on pair P under context C" should be treated as exploratory, not confirmed.
  • Frequency check is non-trivial but not exhaustive. As shown above, regressing per-pair preference on Zipf log-frequency difference gives R² ≈ 0.055; the more frequent token wins only 19/30 pairs. The simplest version of the corpus-frequency confound does not explain the per-pair preferences. It does not rule out richer frequency-based stories (collocation strength, conditional probability of the token given the prompt template, "common as a one-word answer").
  • No claim of intent or inner state. This study measures regularities in token-level output. It does not claim that any model "prefers" anything in a meaningful sense, or that semantic preferences reflect values, beliefs, or biases worth talking about politically. It is purely behavioral.
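
The statistical-power caveat above can be checked with the usual normal-approximation (Wald) interval for a proportion. This is a rough sketch under that approximation, which is crude at n = 10; the sample sizes are the ones quoted in the caveat, and the helper name is mine.

```python
from math import sqrt

def ci_half_width(n, p=0.5, z=1.96):
    """Half-width of a 95% normal-approximation CI for a proportion."""
    return z * sqrt(p * (1 - p) / n)

cells = [
    ("single (model, pair, condition) cell", 10),
    ("(model, pair) cell, 4 conditions", 40),
    ("per-pair pooled bare rows", 800),
    ("per-model rows", 1_200),
    ("all OK rows", 48_316),
]

for label, n in cells:
    print(f"{label}: n={n}, about ±{100 * ci_half_width(n):.1f} pp")
```

The half-width shrinks with the square root of n, which is why the aggregate claims are tight while any single cell can swing by tens of percentage points on noise alone.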

Where to get the data

The full paper, the interactive charts (Model map, Pair bias, Context, Reliability, Cost), the SHA-256 manifest, and the downloads are on the Crow Tech site: crow.sg/research/llm-arbitrary-choice-study

Recommended citation:

Daniel Alonso. Arbitrary Choices Are Not Random. Independent research, May 8, 2026. https://crow.sg/research/llm-arbitrary-choice-study.

If you are about to wire an LLM into a workflow that asks it to pick between two roughly equivalent options and then trusts the answer, please randomize the order, swap the wording, and watch what happens. The bias is small per call and very loud in aggregate.
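
A minimal sketch of that advice, assuming a hypothetical prompt template and a stand-in `always_first` function in place of a real model call. The point is the bookkeeping: randomize display order per call, log which option was shown first, and report first-option share separately from option-A share.

```python
import random

def build_prompt(option_a, option_b, rng):
    """Flip a coin for display order; return the prompt and the order used."""
    if rng.random() < 0.5:
        first, second = option_a, option_b
    else:
        first, second = option_b, option_a
    # Hypothetical template, not the study's wording.
    return f"{first} or {second}? Answer with one word.", first, second

def always_first(prompt, first, second):
    """Stand-in for a model call: a purely positional 'model'."""
    return first

def audit(option_a, option_b, choose, trials=1000, seed=0):
    """Run randomized-order trials and return (first-option share, option-A share)."""
    rng = random.Random(seed)
    picked_first = picked_a = 0
    for _ in range(trials):
        prompt, first, second = build_prompt(option_a, option_b, rng)
        choice = choose(prompt, first, second)
        picked_first += choice == first
        picked_a += choice == option_a
    return picked_first / trials, picked_a / trials

first_share, a_share = audit("sweet", "bitter", always_first)
print(f"first-option share: {first_share:.2f}, option-A share: {a_share:.2f}")
```

With the positional stand-in, the first-option share is 1.0 while the option-A share hovers near 0.5: randomizing order neutralizes the bias in aggregate, and logging the display order is what makes it visible at all.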