Research Manuscript · NLP · LLM Evaluation · Psychometrics

Every Token Counts

Isolating Latent Behavior of LLMs via Exact Likert Distributions

Davood Wadi · HEC Montréal · 2026

"Can I trust Chinese AI models?"


01

Chinese AI is widely adopted. Can we trust it?

Chinese LLMs (DeepSeek, Qwen) are now widely used in North America and beyond.

Users increasingly ask: do these models carry a political or cultural agenda?

We study this through ethnocentrism — the tendency to favor one's own country.

Definition

Ethnocentrism is in-group bias: the tendency of a group to evaluate its own country favorably while being dismissive of foreign countries. Originally from consumer behavior, extended to political science, sociology, and psychology.

How do we measure ethnocentrism?
The Consumer Ethnocentric Tendencies Scale.

SYSTEM PROMPT (CONSTRAINT)
You are a consumer in a market research study. Your task is to provide answers to the questions given.

Response format is 7-point Likert-type scale (Strongly agree = 7, Strongly disagree = 1).

Please respond with ONLY the digit of the response. Do not provide any other text, explanation, or punctuation.
USER MESSAGE (CROSSED DESIGN)

"American people should always buy American-made products instead of imports."

↓ Swapped dynamically per condition ↓

"Chinese people should always buy Chinese-made products instead of imports."

17 ITEMS TOTAL · 4 TARGET COUNTRIES · 5 LLMS
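The crossed design above can be sketched in code. A minimal illustration: the template string and the country-to-adjective map are our own stand-ins, not the paper's exact item wording.

```python
from itertools import product

# Hypothetical sketch of the fully crossed design: each CETSCALE item is
# templated with every target country and sent to every model.
ITEM_TEMPLATE = "{adj} people should always buy {adj}-made products instead of imports."

MODELS = ["Llama 3.3 70B", "Gemma 3 27B", "Qwen3 Next 80B",
          "Aya Expanse 32B", "Ministral 14B"]
TARGETS = {"USA": "American", "China": "Chinese",
           "Canada": "Canadian", "France": "French"}

def build_conditions():
    """Return one (model, target, prompt) tuple per design cell."""
    cells = []
    for model, (target, adj) in product(MODELS, TARGETS.items()):
        cells.append((model, target, ITEM_TEMPLATE.format(adj=adj)))
    return cells

cells = build_conditions()
print(len(cells))  # 5 models x 4 targets = 20 cells per item
```

Every cell of the 5 × 4 grid is filled, which is what lets main effects and interactions be separated later.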

Bias research exists. But it does not measure.

What we know

  • Political leanings in LLMs
  • Cultural stereotypes

The gap

  • Studies limited to descriptive statistics
  • Aggregate metrics on existing datasets
  • No systematic quantification

What is missing

  • Limited statistical guarantees
  • Cannot isolate factors

02

Three distinct methodological barriers

Rigorous measurement of LLM ethnocentrism requires overcoming challenges that existing methods cannot address.

I

Experimental Design

Vignettes and benchmark datasets show that behavior exists. Factorial designs prove which factor causes it.

II

Ordinal Measurement

LLMs output token probabilities. Likert scales are ordinal. Standard metrics (e.g., entropy) fail to bridge these.

III

Exact vs. Sampled Distribution

Sampling is borrowed from human research, but LLM distributions are exact and known. Sampling adds noise unnecessarily.

Design for causal inference, not just observation

Current practice

  • Single vignette per condition
  • One model, one prompt
  • Hard to isolate cause
  • Cannot rule out confounds

Our approach

  • Fully crossed factorial design
  • 5 models × 4 national targets
  • All cells filled — main effects isolated
  • Interaction effects visible

Key point

Factorial design transforms the research question from "does bias exist?" to "how much and under what conditions?"

Likert scales are ordinal. Entropy is not.

Entropy treats all categories as unordered. It cannot distinguish a model split between adjacent responses from one split between the extremes, even when the two distributions have identical entropy H.

[Figure: mass at 3 and 5. H = 1.00, Cns = 0.74. Near-centre split; moderate ordinal disagreement.]

[Figure: mass at 1 and 7. H = 1.00, Cns = 0.00. Extreme polarization; complete ordinal dissension.]

Both distributions have H = 1 bit. Entropy says they are identical. A distance-sensitive consensus measure separates them.
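The contrast can be checked numerically. A one-dimensional sketch of a distance-sensitive consensus measure in the Tastle–Wierman style, which reproduces the Cns values shown above:

```python
import math

def entropy(p):
    """Shannon entropy in bits; blind to category order."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def consensus(p, scale=range(1, 8)):
    """Distance-sensitive consensus on an ordinal scale (1-D sketch of Cns)."""
    mu = sum(x * pi for x, pi in zip(scale, p))
    d_max = max(scale) - min(scale)
    return 1 + sum(pi * math.log2(1 - abs(x - mu) / d_max)
                   for x, pi in zip(scale, p) if pi > 0)

near_centre = [0, 0, 0.5, 0, 0.5, 0, 0]   # mass at 3 and 5
polarized   = [0.5, 0, 0, 0, 0, 0, 0.5]   # mass at 1 and 7

print(entropy(near_centre), entropy(polarized))  # both 1.0 bit
print(round(consensus(near_centre), 2))          # 0.74
print(round(consensus(polarized), 2))            # 0.0
```

Entropy cannot separate the two distributions; the consensus measure penalizes the 1-vs-7 split far more than the 3-vs-5 split.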

Stop sampling. We already have the distribution.

Human survey methodology requires sampling because population distributions are unknowable. With LLMs, their response distributions are exact and fully observable.

Human science

  • Latent states are inaccessible — sampling is unavoidable
  • Variance reflects real individual differences across people
  • Aggregation estimates the population distribution
  • Sample size determines precision of that estimate

LLMs

  • Each prompt has a fixed, fully observable internal state
  • Token probabilities are exact — not a sample from anything
  • Sampling only injects Monte Carlo noise on top of exact data
  • We recover the exact PMF directly from the logits
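Recovering the exact PMF from the logits can be sketched as follows; the logit dictionary here is a toy stand-in for a real model's next-token logits.

```python
import math

def exact_likert_pmf(logits):
    """Recover the exact Likert PMF from next-token logits.

    `logits` is a hypothetical dict mapping token strings to raw logits
    for the first generated token; only tokens "1".."7" are valid.
    """
    valid = [str(d) for d in range(1, 8)]
    # Softmax over the full vocabulary gives P_raw(t | x).
    z = max(logits.values())
    exp = {t: math.exp(v - z) for t, v in logits.items()}
    total = sum(exp.values())
    p_raw = {t: e / total for t, e in exp.items()}
    # Restrict to the valid token set and renormalize.
    mass = sum(p_raw.get(t, 0.0) for t in valid)
    return {t: p_raw.get(t, 0.0) / mass for t in valid}

logits = {"1": 0.2, "2": 1.1, "3": 2.0, "4": 3.5, "5": 2.2,
          "6": 0.9, "7": 0.1, "Sure": -1.0}  # toy values
pmf = exact_likert_pmf(logits)
print(round(sum(pmf.values()), 6))  # 1.0 after renormalization
```

No sampling occurs anywhere: the PMF is read off the logits in one forward pass.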

03

Three-layer measurement framework

Each layer addresses one of the methodological barriers.

01

Constraint

Does the LLM adhere to the task constraints?

02

Consensus

Measure consensus and polarization on the scale.

03

Construct

Decompose the observed response into factor effects.

Can the model even follow the task?

We define the valid token set V_val as the subset of vocabulary tokens that map to the allowed Likert responses 1–7. Any probability mass outside V_val means the model fence-sat or hallucinated out of range and cannot be used.

\text{Failure rate} = 1 - \sum_{t\in\mathcal{V}_{\mathrm{val}}} P_{\mathrm{raw}}(t \mid x)

Probability mass that falls outside the valid token set: the model's rate of non-adherence to the numeric constraint.

Subsequent analysis (Layers 2 and 3) operates on the renormalized distribution restricted to V_val, analogous to excluding non-compliant participants in human studies.
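A minimal sketch of the failure-rate computation, with a toy raw token distribution standing in for real model probabilities:

```python
def failure_rate(p_raw, valid=tuple(str(d) for d in range(1, 8))):
    """Probability mass outside the valid Likert token set."""
    return 1.0 - sum(p_raw.get(t, 0.0) for t in valid)

# Toy raw distribution: 10% of the mass leaks to non-numeric tokens.
p_raw = {"1": 0.05, "4": 0.60, "7": 0.25, "I": 0.07, " Sure": 0.03}
print(round(failure_rate(p_raw), 2))  # 0.1
```

Layers 2 and 3 would then renormalize the mass on "1".."7" to sum to one before any consensus or effect computation.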

Failure rate by model and national target


Why entropy fails for Likert scales

Identical means can mask fundamentally different regimes: a model polarized between 1 and 7 looks the same as one concentrated on 4.

Traditional measures of dispersion, such as Shannon entropy, assume categorical values. They are agnostic to the distances between responses.

Entropy alone cannot distinguish these three distributions.

All mass on one point: H = 0.00
Adjacent poles: H = 1.00
Opposite extremes: H = 1.00

Rows 2 and 3 have identical entropy (H = 1.00) but represent completely different behavior: entropy is blind to ordinal distance.

Measuring internal consistency

To remedy this, we use a multidimensional consensus measure. It penalizes spread in proportion to the ordinal distances between responses.

\mathrm{Cns}(\mathbf{Y}_{\boldsymbol{\lambda}}) = 1 + \sum_{\mathbf{y}\in\mathcal{Y}^K} P(\mathbf{y}) \log_2\!\left(1 - \frac{\|\mathbf{y}-\boldsymbol{\mu}\|_2}{d_{\max}}\right)

where μ is the itemwise mean vector and d_max is the maximum diagonal distance on the Likert scale.

This quantifies a model's internal consistency, or polarization, on our ethnocentrism scale. High consensus means probability mass is tightly concentrated; high dissension means it is spread across opposing poles.
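For small K the multidimensional Cns can be evaluated by direct enumeration. A sketch assuming the items are independent, so the joint PMF factorizes (feasible only for small K; the variable names are ours):

```python
import math
from itertools import product

def multidim_consensus(item_pmfs):
    """Multidimensional Cns over the joint response space (sketch).

    Each item PMF is a 7-list over responses 1..7; the joint PMF is
    assumed to factorize across items.
    """
    K = len(item_pmfs)
    scale = range(1, 8)
    mu = [sum(x * p[x - 1] for x in scale) for p in item_pmfs]  # itemwise means
    d_max = math.sqrt(K) * (max(scale) - min(scale))            # Likert-cube diagonal
    cns = 1.0
    for y in product(scale, repeat=K):
        p_y = math.prod(item_pmfs[k][y[k] - 1] for k in range(K))
        if p_y > 0:
            cns += p_y * math.log2(1 - math.dist(y, mu) / d_max)
    return cns

point = [0, 0, 0, 1, 0, 0, 0]      # all mass on 4: full consensus
pol = [0.5, 0, 0, 0, 0, 0, 0.5]    # mass at 1 and 7: full dissension
print(round(multidim_consensus([point]), 2))     # 1.0
print(round(multidim_consensus([pol, pol]), 2))  # 0.0
```

The two-item polarized case lands exactly at Cns = 0, matching the one-dimensional extreme shown earlier.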

Consensus penalization diagram

Convolving a multi-item scale collapses polarization.

From item PMFs to exact effect distributions

Comparing means across conditions is insufficient: it obscures variance and cannot isolate one factor from another. We adapt ANOVA to exact probability distributions, giving us statistically grounded effect sizes without sampling noise.

P_{S_{\boldsymbol{\lambda}}} = P_{Y_{1,\boldsymbol{\lambda}}} \circledast \cdots \circledast P_{Y_{K,\boldsymbol{\lambda}}}

The composite construct score S_λ is the sum of K independent item responses. Its exact PMF is derived by discrete convolution of the individual item distributions, propagating all aleatoric uncertainty to the construct level.
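The convolution can be sketched directly on PMFs represented as value-to-probability dictionaries; the uniform item PMF below is a toy example, not a real model's output.

```python
from functools import reduce

def convolve(p, q):
    """Discrete convolution of two PMFs given as {value: prob} dicts."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * qb
    return out

item = {i: 1 / 7 for i in range(1, 8)}     # toy uniform item PMF
composite = reduce(convolve, [item] * 3)   # K = 3 items
print(min(composite), max(composite))      # support is 3..21
```

For the full 17-item CETSCALE the same fold yields the exact PMF of the composite score on its 17–119 support.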

\mathbb{E}[S_{\boldsymbol{\lambda}}] = \underbrace{\mathbb{E}[S_0]}_{\text{Grand Mean}} + \sum_{c \in \mathcal{C}} \underbrace{\mathbb{E}[E_c(\lambda_c)]}_{\text{Main Effects}} + \sum_{\substack{U \subseteq \mathcal{C} \\ |U| \ge 2}} \underbrace{\mathbb{E}[E_U(\boldsymbol{\lambda}_U)]}_{\text{Interactions}}

Hoeffding decomposition: the expected construct score decomposes exactly into a grand baseline E[S_0], main effects per factor, and interaction effects, recovering classical ANOVA fixed-effects parameters (Theorem 1).
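On the expected values, the decomposition reduces to classical two-way ANOVA algebra over the table of exact cell expectations. A sketch with toy numbers (not the paper's estimates):

```python
# Toy table of exact cell expectations E[S_lambda] for a 2x2 design.
cell = {
    ("A", "X"): 70.0, ("A", "Y"): 60.0,
    ("B", "X"): 66.0, ("B", "Y"): 68.0,
}
models, targets = ["A", "B"], ["X", "Y"]

grand = sum(cell.values()) / len(cell)
model_eff = {m: sum(cell[(m, t)] for t in targets) / len(targets) - grand
             for m in models}
target_eff = {t: sum(cell[(m, t)] for m in models) / len(models) - grand
              for t in targets}
inter = {(m, t): cell[(m, t)] - grand - model_eff[m] - target_eff[t]
         for m in models for t in targets}

# Effects sum to zero within each factor, and the decomposition is exact.
print(grand)                              # 66.0
print(round(sum(model_eff.values()), 10)) # 0.0
```

Each cell mean reconstructs exactly as grand mean + model effect + target effect + interaction, which is what "isolated from baseline behavior" means operationally.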


04

Five models. Four national targets.

A fully crossed 5 × 4 factorial design using CETSCALE — the validated consumer ethnocentrism measurement instrument — adapted for national attribution.

Model factor

Origin | Model | Developer
USA | Llama 3.3 70B | Meta
USA | Gemma 3 27B | Google
China | Qwen3 Next 80B | Alibaba
Canada | Aya Expanse 32B | Cohere
France | Ministral 14B | Mistral

Fully crossed design (5 models × 4 targets = 20 conditions)

Model \ Target | USA | China | Canada | France
Llama 3.3 70B | In-group favor? | Out-group bias? | Out-group bias? | Out-group bias?
Gemma 3 27B | In-group favor? | Out-group bias? | Out-group bias? | Out-group bias?
Qwen3 Next 80B | Out-group bias? | In-group favor? | Out-group bias? | Out-group bias?
Aya Expanse 32B | Out-group bias? | Out-group bias? | In-group favor? | Out-group bias?
Ministral 14B | Out-group bias? | Out-group bias? | Out-group bias? | In-group favor?
In-group: model evaluates its own country
Out-group: model evaluates a foreign country

05

LLMs score at the extreme end of human ethnocentrism

Composite CETSCALE scores (sum of 17 items, range 17-119) for the Target = USA condition. Human data are historical population samples from Shimp and Sharma (1987). Several models exceed the most ethnocentric human population ever recorded.

Human populations

Detroit (USA): 68.6 ± 26.0
Carolinas (USA): 61.3 ± 24.4
Denver (USA): 57.8 ± 26.1
Los Angeles (USA): 56.6 ± 26.4
Students (Pre): 51.9 ± 16.4
Students (Post): 53.4 ± 16.5

LLMs (Target = USA)

Aya Expanse 32B: 89.1 ± 1.3
Llama 3.3 70B: 72.0 ± 0.9
Ministral 14B: 70.2 ± 5.3
Gemma 3 27B: 60.3 ± 0.9
Qwen3 Next 80B: 53.9 ± 1.2

Vertical line marks the highest human population mean (Detroit, 68.58).

Aya far exceeds human ethnocentrism; China is the most disfavored country

The exact-PMF Hoeffding decomposition isolates model and country main effects as distributions centered on the grand mean (μ∅ = 66.25). Robustness is assessed via SNR and dPD — no p-values, no sampling assumptions.

Model main effects (E[E_m], deviation from μ∅)

Model | E | SD | SNR | dPD
Aya Expanse 32B | +21.19 | 13.01 | 1.63 | >0.99
Ministral 14B | +5.11 | 9.59 | 0.53 | 0.62
Llama 3.3 70B | +0.31 | 12.79 | 0.02 | 0.55
Gemma 3 27B | −11.99 | 13.15 | 0.91 | 0.77
Qwen3 Next 80B | −14.63 | 12.17 | 1.20 | 0.93

Country main effects (E[E_t], deviation from μ∅)

Country | E | SD | SNR | dPD
Canada | +4.62 | 5.72 | 0.81 | 0.90
USA | +2.84 | 5.43 | 0.52 | 0.55
France | −1.00 | 4.72 | 0.21 | 0.54
China | −6.46 | 6.19 | 1.04 | 0.95

SNR = |E|/SD. dPD = directional probability of difference (Bayesian analog of a one-sided p-value). Robust rows highlighted.
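Both diagnostics can be computed directly from an effect PMF. A sketch with a toy distribution; dPD is taken here as the probability mass on the dominant side of zero, our reading of "directional probability of difference":

```python
def snr_dpd(pmf):
    """SNR = |E|/SD and directional probability of difference for an
    effect PMF given as {effect_value: prob} (sketch, toy numbers)."""
    mean = sum(v * p for v, p in pmf.items())
    sd = sum((v - mean) ** 2 * p for v, p in pmf.items()) ** 0.5
    p_pos = sum(p for v, p in pmf.items() if v > 0)
    dpd = max(p_pos, 1 - p_pos)  # mass on the dominant side of zero
    return abs(mean) / sd, dpd

effect = {-5: 0.1, 0: 0.1, 5: 0.3, 10: 0.5}  # toy effect distribution
snr, dpd = snr_dpd(effect)
print(round(snr, 2), round(dpd, 2))  # 1.22 0.8
```

Because the effect PMF is exact, these are deterministic summaries of the model's behavior, not estimates subject to sampling error.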

The model's origin shapes which countries it devalues

The Model × Target interaction effects, isolated via Hoeffding decomposition, reveal country-of-origin bias: which model you use determines not just how ethnocentric it is overall, but which specific target countries it systematically favors or disfavors.

US models favor North America, penalize China

Gemma 3-27B and Llama 3.3-70B (both US-developed) show the strongest structured interactions: positive toward USA and Canada, sharply negative toward China. The paper identifies these as the primary country-of-origin bias signal.

The Chinese model does not reciprocate

Qwen3-80B (Chinese) shows near-zero interactions across almost all targets (all SNR < 0.5). It does not exhibit a reciprocal in-group preference of comparable magnitude to the US models.

Interactions are isolated from baseline behavior

The framework mathematically isolates interaction effects from main effects. The country-of-origin bias is a true Model × Target interaction — not an artifact of a model's overall ethnocentrism level.

Interaction plot: model origin × national target


Questions?

Happy to dig into any part of the framework, measurement theory, the exact-PMF approach, model selection, or the bias findings.

Davood Wadi, PhD · Marketing and AI · davood.wadi@hec.ca

Open Questions

This framework generalizes to any psychometric instrument. If you are working on LLM evaluation methods, there may be natural overlap.

  • Behavioral LLM research
  • Resource/Bounded rationality of LLMs
  • The decision-making process of LLMs