Research Manuscript · NLP · LLM Evaluation · Psychometrics

Every Token Counts

Isolating Latent Behavior of LLMs via Exact Likert Distributions

Davood Wadi · HEC Montréal · 2026

"Can I trust Chinese AI models?"


01

Chinese AI is widely adopted. Can we trust it?

Chinese LLMs (DeepSeek, Qwen) are now widely used in North America and beyond.

Users increasingly ask: do these models carry a political or cultural agenda?

We study this through ethnocentrism — the tendency to favor one's own country.

Definition

Ethnocentrism is in-group bias: the tendency of a group to evaluate its own country favorably while being dismissive of foreign countries. Originally from consumer behavior, extended to political science, sociology, and psychology.

How do we measure ethnocentrism?
The Consumer Ethnocentric Tendencies Scale.

SYSTEM PROMPT (CONSTRAINT)
You are a consumer in a market research study. Your task is to provide answers to the questions given.

Response format is 7-point Likert-type scale (Strongly agree = 7, Strongly disagree = 1).

Please respond with ONLY the digit of the response. Do not provide any other text, explanation, or punctuation.
USER MESSAGE (CROSSED DESIGN)

"American people should always buy American-made products instead of imports."

↓ Swapped dynamically per condition ↓

"Chinese people should always buy Chinese-made products instead of imports."

17 ITEMS TOTAL · 4 TARGET COUNTRIES · 5 LLMS
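The crossed design above can be sketched in code. A minimal illustration: the template string and the country-to-adjective map are our own stand-ins, not the paper's exact item wording.

```python
from itertools import product

# Hypothetical sketch of the fully crossed design: each CETSCALE item is
# templated with every target country and sent to every model.
ITEM_TEMPLATE = "{adj} people should always buy {adj}-made products instead of imports."

MODELS = ["Llama 3.3 70B", "Gemma 3 27B", "Qwen3 Next 80B",
          "Aya Expanse 32B", "Ministral 14B"]
TARGETS = {"USA": "American", "China": "Chinese",
           "Canada": "Canadian", "France": "French"}

def build_conditions():
    """Return one (model, target, prompt) tuple per design cell."""
    cells = []
    for model, (target, adj) in product(MODELS, TARGETS.items()):
        cells.append((model, target, ITEM_TEMPLATE.format(adj=adj)))
    return cells

cells = build_conditions()
print(len(cells))  # 5 models x 4 targets = 20 cells per item
```

Every cell of the 5 × 4 grid is filled, which is what lets main effects and interactions be separated later.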

Bias research exists. But it does not measure.

What we know

  • Political leanings in LLMs
  • Cultural stereotypes

The gap

  • Studies limited to descriptive statistics
  • Aggregate metrics on existing datasets
  • No systematic quantification

What is missing

  • Limited statistical guarantees
  • Cannot isolate factors

02

Three distinct methodological barriers

Rigorous measurement of LLM ethnocentrism requires overcoming challenges that existing methods cannot address.

I

Experimental Design

Vignettes and benchmark datasets show that behavior exists. Factorial designs prove which factor causes it.

II

Ordinal Measurement

LLMs output token probabilities. Likert scales are ordinal. Standard metrics (e.g., entropy) fail to bridge these.

III

Exact vs. Sampled Distribution

Sampling is borrowed from human research, but LLM distributions are exact and known. Sampling adds noise unnecessarily.

Design for causal inference, not just observation

Current practice

  • Single vignette per condition
  • One model, one prompt
  • Hard to isolate cause
  • Cannot rule out confounds

Our approach

  • Fully crossed factorial design
  • 5 models × 4 national targets
  • All cells filled — main effects isolated
  • Interaction effects visible

Key point

Factorial design transforms the research question from "does bias exist?" to "how much and under what conditions?"

Likert scales are ordinal. Entropy is not.

Entropy treats all categories as unordered. It cannot distinguish a model split between adjacent responses from one split between the extremes, even when the two distributions have identical entropy H.

[Figure: mass at 3 and 5. H = 1.00, Cns = 0.74. Near-centre split; moderate ordinal disagreement.]

[Figure: mass at 1 and 7. H = 1.00, Cns = 0.00. Extreme polarization; complete ordinal dissension.]

Both distributions have H = 1 bit. Entropy says they are identical. A distance-sensitive consensus measure separates them.
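The contrast can be checked numerically. A one-dimensional sketch of a distance-sensitive consensus measure in the Tastle–Wierman style, which reproduces the Cns values shown above:

```python
import math

def entropy(p):
    """Shannon entropy in bits; blind to category order."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def consensus(p, scale=range(1, 8)):
    """Distance-sensitive consensus on an ordinal scale (1-D sketch of Cns)."""
    mu = sum(x * pi for x, pi in zip(scale, p))
    d_max = max(scale) - min(scale)
    return 1 + sum(pi * math.log2(1 - abs(x - mu) / d_max)
                   for x, pi in zip(scale, p) if pi > 0)

near_centre = [0, 0, 0.5, 0, 0.5, 0, 0]   # mass at 3 and 5
polarized   = [0.5, 0, 0, 0, 0, 0, 0.5]   # mass at 1 and 7

print(entropy(near_centre), entropy(polarized))  # both 1.0 bit
print(round(consensus(near_centre), 2))          # 0.74
print(round(consensus(polarized), 2))            # 0.0
```

Entropy cannot separate the two distributions; the consensus measure penalizes the 1-vs-7 split far more than the 3-vs-5 split.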

Stop sampling. We already have the distribution.

Human survey methodology requires sampling because population distributions are unknowable. With LLMs, their response distributions are exact and fully observable.

Human science

  • Latent states are inaccessible — sampling is unavoidable
  • Variance reflects real individual differences across people
  • Aggregation estimates the population distribution
  • Sample size determines precision of that estimate

LLMs

  • Each prompt has a fixed, fully observable internal state
  • Token probabilities are exact — not a sample from anything
  • Sampling only injects Monte Carlo noise on top of exact data
  • We recover the exact PMF directly from the logits
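Recovering the exact PMF from the logits can be sketched as follows; the logit dictionary here is a toy stand-in for a real model's next-token logits.

```python
import math

def exact_likert_pmf(logits):
    """Recover the exact Likert PMF from next-token logits.

    `logits` is a hypothetical dict mapping token strings to raw logits
    for the first generated token; only tokens "1".."7" are valid.
    """
    valid = [str(d) for d in range(1, 8)]
    # Softmax over the full vocabulary gives P_raw(t | x).
    z = max(logits.values())
    exp = {t: math.exp(v - z) for t, v in logits.items()}
    total = sum(exp.values())
    p_raw = {t: e / total for t, e in exp.items()}
    # Restrict to the valid token set and renormalize.
    mass = sum(p_raw.get(t, 0.0) for t in valid)
    return {t: p_raw.get(t, 0.0) / mass for t in valid}

logits = {"1": 0.2, "2": 1.1, "3": 2.0, "4": 3.5, "5": 2.2,
          "6": 0.9, "7": 0.1, "Sure": -1.0}  # toy values
pmf = exact_likert_pmf(logits)
print(round(sum(pmf.values()), 6))  # 1.0 after renormalization
```

No sampling occurs anywhere: the PMF is read off the logits in one forward pass.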

03

Three-layer measurement framework

Each layer addresses one of the methodological barriers.

01

Constraint

Does the LLM adhere to the task constraints?

02

Consensus

Measure consensus and polarization on the scale.

03

Construct

Decompose the observed response into factor effects.

Can the model even follow the task?

We define the valid token set V_val as the subset of vocabulary tokens that map to the allowed Likert responses 1–7. Any probability mass outside V_val means the model fence-sat or hallucinated out of range and cannot be used.

\text{Failure rate} = 1 - \sum_{t\in\mathcal{V}_{\mathrm{val}}} P_{\mathrm{raw}}(t \mid x)

Probability mass that falls outside the valid token set: the model's rate of non-adherence to the numeric constraint.

Subsequent analysis (Layers 2 and 3) operates on the renormalized distribution restricted to V_val, analogous to excluding non-compliant participants in human studies.
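A minimal sketch of the failure-rate computation, with a toy raw token distribution standing in for real model probabilities:

```python
def failure_rate(p_raw, valid=tuple(str(d) for d in range(1, 8))):
    """Probability mass outside the valid Likert token set."""
    return 1.0 - sum(p_raw.get(t, 0.0) for t in valid)

# Toy raw distribution: 10% of the mass leaks to non-numeric tokens.
p_raw = {"1": 0.05, "4": 0.60, "7": 0.25, "I": 0.07, " Sure": 0.03}
print(round(failure_rate(p_raw), 2))  # 0.1
```

Layers 2 and 3 would then renormalize the mass on "1".."7" to sum to one before any consensus or effect computation.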

Failure rate by model and national target


Why entropy fails for Likert scales

Identical means can mask fundamentally different regimes: a model polarized between 1 and 7 looks the same as one concentrated on 4.

Traditional measures of dispersion, such as Shannon entropy, assume categorical values. They are agnostic to the distances between responses.

Entropy alone cannot distinguish these three distributions.

All mass on one point: H = 0.00
Adjacent poles: H = 1.00
Opposite extremes: H = 1.00

Rows 2 and 3 have identical entropy (H = 1.00) but represent completely different behavior: entropy is blind to ordinal distance.

Measuring internal consistency

To remedy this, we use a multidimensional consensus measure. It penalizes spread in proportion to the ordinal distances between responses.

\mathrm{Cns}(\mathbf{Y}_{\boldsymbol{\lambda}}) = 1 + \sum_{\mathbf{y}\in\mathcal{Y}^K} P(\mathbf{y}) \log_2\!\left(1 - \frac{\|\mathbf{y}-\boldsymbol{\mu}\|_2}{d_{\max}}\right)

where μ is the itemwise mean vector and d_max is the maximum diagonal distance on the Likert scale.

This quantifies a model's internal consistency, or polarization, on our ethnocentrism scale. High consensus means probability mass is tightly concentrated; high dissension means it is spread across opposing poles.
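For small K the multidimensional Cns can be evaluated by direct enumeration. A sketch assuming the items are independent, so the joint PMF factorizes (feasible only for small K; the variable names are ours):

```python
import math
from itertools import product

def multidim_consensus(item_pmfs):
    """Multidimensional Cns over the joint response space (sketch).

    Each item PMF is a 7-list over responses 1..7; the joint PMF is
    assumed to factorize across items.
    """
    K = len(item_pmfs)
    scale = range(1, 8)
    mu = [sum(x * p[x - 1] for x in scale) for p in item_pmfs]  # itemwise means
    d_max = math.sqrt(K) * (max(scale) - min(scale))            # Likert-cube diagonal
    cns = 1.0
    for y in product(scale, repeat=K):
        p_y = math.prod(item_pmfs[k][y[k] - 1] for k in range(K))
        if p_y > 0:
            cns += p_y * math.log2(1 - math.dist(y, mu) / d_max)
    return cns

point = [0, 0, 0, 1, 0, 0, 0]      # all mass on 4: full consensus
pol = [0.5, 0, 0, 0, 0, 0, 0.5]    # mass at 1 and 7: full dissension
print(round(multidim_consensus([point]), 2))     # 1.0
print(round(multidim_consensus([pol, pol]), 2))  # 0.0
```

The two-item polarized case lands exactly at Cns = 0, matching the one-dimensional extreme shown earlier.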

Consensus penalization diagram

Convolving a multi-item scale collapses polarization.

From item PMFs to exact effect distributions

Comparing means across conditions is insufficient: it obscures variance and cannot isolate one factor from another. We adapt ANOVA to exact probability distributions, giving us statistically grounded effect sizes without sampling noise.

P_{S_{\boldsymbol{\lambda}}} = P_{Y_{1,\boldsymbol{\lambda}}} \circledast \cdots \circledast P_{Y_{K,\boldsymbol{\lambda}}}

The composite construct score S_λ is the sum of K independent item responses. Its exact PMF is derived by discrete convolution of the individual item distributions, propagating all aleatoric uncertainty to the construct level.
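The convolution can be sketched directly on PMFs represented as value-to-probability dictionaries; the uniform item PMF below is a toy example, not a real model's output.

```python
from functools import reduce

def convolve(p, q):
    """Discrete convolution of two PMFs given as {value: prob} dicts."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * qb
    return out

item = {i: 1 / 7 for i in range(1, 8)}     # toy uniform item PMF
composite = reduce(convolve, [item] * 3)   # K = 3 items
print(min(composite), max(composite))      # support is 3..21
```

For the full 17-item CETSCALE the same fold yields the exact PMF of the composite score on its 17–119 support.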

\mathbb{E}[S_{\boldsymbol{\lambda}}] = \underbrace{\mathbb{E}[S_0]}_{\text{Grand Mean}} + \sum_{c \in \mathcal{C}} \underbrace{\mathbb{E}[E_c(\lambda_c)]}_{\text{Main Effects}} + \sum_{\substack{U \subseteq \mathcal{C} \\ |U| \ge 2}} \underbrace{\mathbb{E}[E_U(\boldsymbol{\lambda}_U)]}_{\text{Interactions}}

Hoeffding decomposition: the expected construct score decomposes exactly into a grand baseline E[S_0], main effects per factor, and interaction effects, recovering classical ANOVA fixed-effects parameters (Theorem 1).
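On the expected values, the decomposition reduces to classical two-way ANOVA algebra over the table of exact cell expectations. A sketch with toy numbers (not the paper's estimates):

```python
# Toy table of exact cell expectations E[S_lambda] for a 2x2 design.
cell = {
    ("A", "X"): 70.0, ("A", "Y"): 60.0,
    ("B", "X"): 66.0, ("B", "Y"): 68.0,
}
models, targets = ["A", "B"], ["X", "Y"]

grand = sum(cell.values()) / len(cell)
model_eff = {m: sum(cell[(m, t)] for t in targets) / len(targets) - grand
             for m in models}
target_eff = {t: sum(cell[(m, t)] for m in models) / len(models) - grand
              for t in targets}
inter = {(m, t): cell[(m, t)] - grand - model_eff[m] - target_eff[t]
         for m in models for t in targets}

# Effects sum to zero within each factor, and the decomposition is exact.
print(grand)                              # 66.0
print(round(sum(model_eff.values()), 10)) # 0.0
```

Each cell mean reconstructs exactly as grand mean + model effect + target effect + interaction, which is what "isolated from baseline behavior" means operationally.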


04

Five models. Four national targets.

A fully crossed 5 × 4 factorial design using CETSCALE — the validated consumer ethnocentrism measurement instrument — adapted for national attribution.

Model factor

Origin | Model | Developer
USA | Llama 3.3 70B | Meta
USA | Gemma 3 27B | Google
China | Qwen3 Next 80B | Alibaba
Canada | Aya Expanse 32B | Cohere
France | Ministral 14B | Mistral

Fully crossed design (5 models × 4 targets = 20 conditions)

Model \ Target | USA | China | Canada | France
Llama 3.3 70B | In-group favor? | Out-group bias? | Out-group bias? | Out-group bias?
Gemma 3 27B | In-group favor? | Out-group bias? | Out-group bias? | Out-group bias?
Qwen3 Next 80B | Out-group bias? | In-group favor? | Out-group bias? | Out-group bias?
Aya Expanse 32B | Out-group bias? | Out-group bias? | In-group favor? | Out-group bias?
Ministral 14B | Out-group bias? | Out-group bias? | Out-group bias? | In-group favor?
In-group: model evaluates its own country
Out-group: model evaluates a foreign country

05

LLMs score at the extreme end of human ethnocentrism

Composite CETSCALE scores (sum of 17 items, range 17-119) for the Target = USA condition. Human data are historical population samples from Shimp and Sharma (1987). Several models exceed the most ethnocentric human population ever recorded.

Human populations

Detroit (USA): 68.6 ± 26.0
Carolinas (USA): 61.3 ± 24.4
Denver (USA): 57.8 ± 26.1
Los Angeles (USA): 56.6 ± 26.4
Students (Pre): 51.9 ± 16.4
Students (Post): 53.4 ± 16.5

LLMs (Target = USA)

Aya Expanse 32B: 89.1 ± 1.3
Llama 3.3 70B: 72.0 ± 0.9
Ministral 14B: 70.2 ± 5.3
Gemma 3 27B: 60.3 ± 0.9
Qwen3 Next 80B: 53.9 ± 1.2

Vertical line marks the highest human population mean (Detroit, 68.58).

Aya far exceeds human ethnocentrism; China is the most disfavored country

The exact-PMF Hoeffding decomposition isolates model and country main effects as distributions centered on the grand mean (μ∅ = 66.25). Robustness is assessed via SNR and dPD — no p-values, no sampling assumptions.

Model main effects (E[E_m], deviation from μ∅)

Model | E | SD | SNR | dPD
Aya Expanse 32B | +21.19 | 13.01 | 1.63 | >0.99
Ministral 14B | +5.11 | 9.59 | 0.53 | 0.62
Llama 3.3 70B | +0.31 | 12.79 | 0.02 | 0.55
Gemma 3 27B | −11.99 | 13.15 | 0.91 | 0.77
Qwen3 Next 80B | −14.63 | 12.17 | 1.20 | 0.93

Country main effects (E[E_t], deviation from μ∅)

Country | E | SD | SNR | dPD
Canada | +4.62 | 5.72 | 0.81 | 0.90
USA | +2.84 | 5.43 | 0.52 | 0.55
France | −1.00 | 4.72 | 0.21 | 0.54
China | −6.46 | 6.19 | 1.04 | 0.95

SNR = |E|/SD. dPD = directional probability of difference (Bayesian analog of a one-sided p-value). Robust rows highlighted.
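Both diagnostics can be computed directly from an effect PMF. A sketch with a toy distribution; dPD is taken here as the probability mass on the dominant side of zero, our reading of "directional probability of difference":

```python
def snr_dpd(pmf):
    """SNR = |E|/SD and directional probability of difference for an
    effect PMF given as {effect_value: prob} (sketch, toy numbers)."""
    mean = sum(v * p for v, p in pmf.items())
    sd = sum((v - mean) ** 2 * p for v, p in pmf.items()) ** 0.5
    p_pos = sum(p for v, p in pmf.items() if v > 0)
    dpd = max(p_pos, 1 - p_pos)  # mass on the dominant side of zero
    return abs(mean) / sd, dpd

effect = {-5: 0.1, 0: 0.1, 5: 0.3, 10: 0.5}  # toy effect distribution
snr, dpd = snr_dpd(effect)
print(round(snr, 2), round(dpd, 2))  # 1.22 0.8
```

Because the effect PMF is exact, these are deterministic summaries of the model's behavior, not estimates subject to sampling error.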

The model's origin shapes which countries it devalues

The Model × Target interaction effects, isolated via Hoeffding decomposition, reveal country-of-origin bias: which model you use determines not just how ethnocentric it is overall, but which specific target countries it systematically favors or disfavors.

US models favor North America, penalize China

Gemma 3-27B and Llama 3.3-70B (both US-developed) show the strongest structured interactions: positive toward USA and Canada, sharply negative toward China. The paper identifies these as the primary country-of-origin bias signal.

The Chinese model does not reciprocate

Qwen3-80B (Chinese) shows near-zero interactions across almost all targets (all SNR < 0.5). It does not exhibit a reciprocal in-group preference of comparable magnitude to the US models.

Interactions are isolated from baseline behavior

The framework mathematically isolates interaction effects from main effects. The country-of-origin bias is a true Model × Target interaction — not an artifact of a model's overall ethnocentrism level.

Interaction plot: model origin × national target


Questions?

Happy to dig into any part of the framework, measurement theory, the exact-PMF approach, model selection, or the bias findings.

Davood Wadi, PhD · Marketing and AI · davood.wadi@hec.ca

Open Questions

This framework generalizes to any psychometric instrument. If you are working on LLM evaluation methods, there may be natural overlap.

  • Behavioral LLM research
  • Resource/Bounded rationality of LLMs
  • The decision-making process of LLMs