| # | PLAYER | WCS | PIR | SHIFT | Honesty | Intel | Union | League | Params (B) | tok/s | RAM (GB) | Trap | Calib | Refuse | Fix | Logic | Math | Code | Lang | Know | Meta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Upper-left = best value (high performance, low resource). Dot size = PIR. League One models in the upper-left are Giant Killers.
Best model from each league, compared on 5 SHIFT axes. Outer = better.
Who squeezes the most speed from each GB of RAM? Bigger slice = more efficient.
Red zone = Frontier giants. Colored dots = Smol challengers. Closer to red = closer to SOTA.
A 4B model using only 2GB of RAM achieves higher SHIFT scores than most 8B models requiring 5.5GB. Doubling parameters ≠ doubling performance.
→ SHIFT gap: 0.1 points for 2.75× (5.5GB ÷ 2GB) more RAM
Qwen3-1.7B (1.2GB) outscores three 7-14B models. Latest architecture + small size > old architecture + big size.
→ 1.7B beats 7B, 8B, and 14B models
World's first 5-axis benchmark for small language models (≤10B active params). SHIFT measures what matters for edge deployment: not just intelligence, but honesty, speed, and efficiency.
Size → Model footprint
Honesty → Hallucination, calibration, refusal, self-correction
Intelligence → Reasoning, math, coding, 7 languages, metacognition
Fast → Tokens/sec, TTFA
Thrift → Peak VRAM/RAM
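To make the axis list concrete, here is a minimal Python sketch of how one model's SHIFT measurements might be organized. The class name, field names, and the 0-100 score scale are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Illustrative sketch only: names and scales are assumptions,
# not the benchmark's published schema.
@dataclass
class ShiftRecord:
    size_gb: float         # S (Size): model footprint
    honesty: float         # H (Honesty): traps, calibration, refusal, self-correction (0-100)
    intelligence: float    # I (Intelligence): reasoning, math, coding, 7 languages, metacognition (0-100)
    tokens_per_sec: float  # F (Fast): decode speed; TTFA tracked separately
    peak_ram_gb: float     # T (Thrift): peak VRAM/RAM during evaluation
```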
WCS = √(SHIFT × PIR_norm)
The official ranking metric: the geometric mean of quality (SHIFT) and efficiency (PIR_norm). Both must be high to score well.
PIR = (I × H × F) ÷ (S × T); PIR_norm = log₁₀(PIR) / log₁₀(max PIR) × 100
Efficiency rating. Like boxing's pound-for-pound (P4P) rankings: how much punch per pound of hardware.
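Both formulas translate directly into code. A minimal sketch, assuming all axis values are positive, PIR stays above 1 so the log normalization lands in 0-100, and the maximum PIR on the current leaderboard is known; function and argument names are mine:

```python
import math

def pir(i: float, h: float, f: float, s: float, t: float) -> float:
    """PIR = (I × H × F) ÷ (S × T): quality delivered per unit of hardware."""
    return (i * h * f) / (s * t)

def pir_norm(pir_value: float, pir_max: float) -> float:
    """Rescale PIR to 0-100 on a log10 scale against the leaderboard maximum."""
    return math.log10(pir_value) / math.log10(pir_max) * 100

def wcs(shift: float, pir_normed: float) -> float:
    """WCS = √(SHIFT × PIR_norm): geometric mean of quality and efficiency."""
    return math.sqrt(shift * pir_normed)

# Hypothetical numbers, for illustration only:
p = pir(i=72.0, h=80.0, f=45.0, s=1.2, t=2.0)                    # 108000.0
print(wcs(shift=78.5, pir_normed=pir_norm(p, pir_max=150_000)))  # ≈ 87.4
```

The geometric mean is why both sides must be high: a model cannot rank well by maxing quality while tanking efficiency, since a near-zero factor drags the whole WCS down.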
League One (<2GB) → Raspberry Pi
La Liga (2-4GB) → Smartphone
Premier League (4-8GB) → Laptop
Champions League (8-16GB) → PC
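The four weight classes map straight onto RAM thresholds. A sketch of the bucketing, assuming half-open boundaries (so a model at exactly 4GB lands in Premier League), since the source gives only the ranges:

```python
def league(peak_ram_gb: float) -> str:
    """Assign a model to its SHIFT league by peak RAM footprint (GB).

    Boundary inclusivity is an assumption; the source only gives the ranges.
    """
    if peak_ram_gb < 2:
        return "League One"        # Raspberry Pi class
    if peak_ram_gb < 4:
        return "La Liga"           # Smartphone class
    if peak_ram_gb < 8:
        return "Premier League"    # Laptop class
    if peak_ram_gb <= 16:
        return "Champions League"  # PC class
    raise ValueError("Models above 16GB are outside the benchmark's scope")
```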
EN · KO · AR · PT · TR · BN · TH
2.7B+ speakers. Sentiment, idioms, translation, culture.
Same 20 cross-benchmark questions given to frontier SOTA models. Direct comparison with Claude, GPT-5, etc. Scores are not publicly disclosed.