OpenRouter Study Maps 100 Trillion Tokens of AI Usage in 2025

Victor Zhang
[Figure: Data visualization showing global AI model usage in 2025, based on 100 trillion analyzed tokens]

Scope and Methodology

A new study by OpenRouter, produced in collaboration with a16z, uses more than 100 trillion tokens of real-world traffic to describe how AI models are actually being used in 2025.

The analysis draws on metadata from over 300 models supplied by more than 60 providers over roughly a year. Instead of relying on academic benchmarks or headline user counts, it examines “compute consumption” — which models are called, for what tasks, at what length, and at what cost.

According to information reviewed by toolmesh.ai, the report, titled “The State of AI: An Empirical Study of 100 Trillion Tokens Based on OpenRouter”, frames 2025 as a decisive turning point for AI:

  • Open-source models have reached about 30% of traffic.

  • Chinese open-source models have, at times, approached 30% of all model usage.

  • Reasoning-optimized models now account for more than half of tokens.

  • Programming and role-playing dominate usage.

  • Paid usage in Asia has more than doubled, and Simplified Chinese is the second most common interaction language.

The full report is available at: https://openrouter.ai/state-of-ai

Open Source, Model Sizes and the Rise of Chinese Providers

On OpenRouter, proprietary models from major vendors such as OpenAI, Anthropic and Google still handle around 70% of token volume, particularly for regulated, enterprise and mission-critical workloads. But open-weight models have steadily climbed to roughly 30% of usage and appear to be in sustained production use rather than short-lived trials.

A significant share of that growth comes from Chinese open-source models:

  • From 1.2% to nearly 30%: At the end of 2024, Chinese models represented about 1.2% of usage. By the second half of 2025, Chinese open-source systems including DeepSeek, Qwen, MiniMax, Kimi and GLM captured nearly 30% of all tokens in some weeks.

  • Fast iteration: Families such as DeepSeek and Qwen are described as updating at high frequency, quickly adapting to emerging workloads.

Within open source, the competitive landscape has changed:

  • In the previous year, DeepSeek V3 and R1 together at one point accounted for more than half of all open-source tokens.

  • After mid-2025, traffic fragmented among Qwen, Kimi, MiniMax, GLM, OpenAI’s GPT-OSS line, Meta’s LLaMA and others. In the second half of the year, no single open-source model consistently exceeded a 25% share.

The study highlights a shift in preferred model sizes:

  • Small models (<15B parameters): Despite launches such as Google’s Gemma series, their overall share is declining.

  • Large models (>70B parameters): No longer the only default for “serious” work.

  • Medium models (15B–70B parameters): Usage has risen sharply in 2025. Offerings like Qwen 2.5 Coder 32B and Mistral Small 3 are presented as finding a strong product–market fit by balancing capability and efficiency. The report characterizes these “medium cup” models as the new workhorses.

Agents, Reasoning and Longer Contexts

The study argues that usage is shifting from simple chat-style interaction toward agentic patterns, where models plan, reason and call tools as part of multi-step workflows.

Reasoning-Optimized Models

Traffic routed through reasoning-optimized models has grown from negligible levels at the start of 2025 to more than 50% of tokens:

  • Users increasingly seek models that can perform internal chain-of-thought, planning and self-reflection, rather than simply generating surface-level text.

  • Among reasoning models, xAI’s Grok Code Fast 1 currently handles the largest share of reasoning-related traffic, followed by Google’s Gemini 2.5 Pro and Gemini 2.5 Flash.

  • xAI’s Grok 4 Fast and OpenAI’s gpt-oss-120b are also listed among the leading reasoning models.

Tool use has become commonplace. A growing share of requests contains explicit tool-call instructions, indicating models are being embedded into larger agent systems rather than used solely as standalone chatbots. Tool usage is particularly concentrated in models tuned for agentic reasoning, such as Anthropic’s Claude Sonnet and Google’s Gemini Flash.
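Tool calls reach models through OpenRouter's OpenAI-compatible chat completions endpoint. As a minimal sketch, a request carrying an explicit tool-call instruction looks roughly like the following; the model slug and the `get_weather` function are illustrative assumptions, not details from the report:

```python
import json

# Sketch of a tool-call request body for OpenRouter's OpenAI-compatible
# /api/v1/chat/completions endpoint. The model slug and the "get_weather"
# tool definition are illustrative, not taken from the study.
request_body = {
    "model": "anthropic/claude-sonnet-4",
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

print(json.dumps(request_body, indent=2))
```

Requests shaped like this, where the client declares callable functions up front, are what the study counts as explicit tool-call instructions.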

Longer Inputs and Heavier Tasks

The workloads themselves are getting heavier:

  • Average prompt length has risen from roughly 1,500 tokens to more than 6,000 since early 2024, a roughly fourfold increase.

  • Average completion length has expanded from about 150 tokens to 400, driven largely by added reasoning content.

  • Average sequence length (prompt plus completion) has roughly tripled, from under 2,000 tokens at the end of 2023 to over 5,400 by late 2025.
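Taken at face value, the round figures cited above imply the following growth factors:

```python
# Growth factors implied by the round figures cited in the report.
prompt_growth = 6000 / 1500       # avg prompt tokens, early 2024 -> late 2025
completion_growth = 400 / 150     # avg completion tokens over the same span
sequence_growth = 5400 / 2000     # avg prompt + completion, end 2023 -> late 2025

print(f"prompt:     {prompt_growth:.1f}x")      # 4.0x
print(f"completion: {completion_growth:.1f}x")  # ~2.7x
print(f"sequence:   {sequence_growth:.1f}x")    # ~2.7x
```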

Programming workloads in particular use the longest contexts and are described as the main driver of prompt-token growth. The report notes that users are now sending entire codebases, lengthy documents and complex interaction histories for deep analysis, debugging and multi-step reasoning, rather than short, one-off prompts.

Usage Scenarios and Provider “Personalities”

Using a Google-model-based tag classifier run over billions of requests, the study maps how AI is being used across scenarios. Two categories dominate: programming and role-playing.

Programming: Core Productivity Workload

The proportion of requests classified as programming has increased from about 11% at the start of 2025 to more than 50%:

  • A major driver is the spread of AI-assisted development tools and integrations with IDEs.

  • Anthropic’s Claude series dominates programming usage, consistently handling more than 60% of code-related traffic on the platform.

  • Providers such as Qwen, MiniMax, GLM and OpenAI are increasing share but remain behind Anthropic in this segment.

Overall, programming is described as both a dominant and still-growing category, forming the backbone of high-value, high-frequency usage.

Role-Playing: Consumer and Creative Demand

In the open-source segment, role-playing is the largest use case, accounting for 52% of all open-source traffic:

  • Users employ open models for story creation, game-like interactions and emotional companionship, taking advantage of their flexibility and customizability.

  • In role-playing workloads, Chinese and non-Chinese open-source models split the market roughly evenly.

  • For DeepSeek, more than two-thirds of traffic comes from role-play and casual chat, indicating strong consumer stickiness.

Long-Tail Scenarios: Science, Translation, Health and Law

Beyond programming and role-play, a long tail of use cases includes:

  • Science: Largely focused on questions about machine learning and AI themselves, reflecting the field’s self-referential tendencies.

  • Health: Highly fragmented demand, ranging from medical research inquiries to psychological counseling.

  • Translation and law: Smaller in volume, typically used as specialized tools rather than continuous workloads.

Provider Profiles

By examining scenario breakdowns, the report sketches distinct “personalities” for major providers:

  • Anthropic: Positioned as a “programmer,” with over 80% of its traffic in programming and technical tasks and minimal role-play usage.

  • DeepSeek: Used mainly for role-playing and everyday interaction, acting as a “companion” or “gamer.”

  • Google: Presents as an “all-rounder,” with broad distribution across legal, scientific, technical and general knowledge queries.

  • OpenAI: Shows a shift over time from early science and general-purpose usage toward programming and technical tasks, while role-play and casual chat volumes drop significantly.

  • xAI: Tokens are highly concentrated in programming, with noticeable growth in technical, role-play and academic usage from late November 2025.

  • Qwen: Strongly focused on programming, with role-play and science categories fluctuating over time.

Geography, Pricing, Retention and Study Limits

Regional and Language Trends

The report finds that AI usage has become more geographically diversified:

  • Asia’s share of paid traffic has increased from 13% to 31%. The region is described as both a production hub for models and a large market for applications and enterprise users.

  • North America remains the single largest market but now accounts for less than 50% of paid usage.

By language:

  • English still dominates, with about 82% of interactions.

  • Simplified Chinese, at nearly 5%, ranks as the second-largest AI interaction language worldwide, ahead of Spanish and Russian. The study interprets this as evidence of a highly active Chinese AI community.

Retention and the “Glass Slipper” Effect

The report introduces what it calls the “Cinderella glass slipper” effect to describe retention dynamics in a period of rapid model iteration.

When a new frontier model launches, there is substantial latent demand for difficult, unsolved tasks:

  • If the model happens to solve a particular high-difficulty task type extremely well, it creates a “keystone cohort” of early users in that niche. These users show very high retention and are unlikely to switch simply because cheaper alternatives appear later, having already built workflows and infrastructure around the model.

  • If a model does not address any specific, acute pain point at launch — and is only “good enough” — it fails to form such a keystone cohort, and churn remains high across user segments.

Empirical patterns cited include:

  • GPT-4o Mini, Claude 4 Sonnet and Gemini 2.5 Pro all exhibit clear keystone cohorts early after launch, with retention curves that stabilize at high levels.

  • Some other models that do not achieve strong “model–task fit” show uniformly weak retention.

  • DeepSeek displays what the authors describe as a “boomerang effect”: retention curves dip and then rebound after a few months, suggesting users tried competitors and later returned when DeepSeek remained preferable in certain respects such as cost-effectiveness or specific tasks.

The study concludes that, under these dynamics, long-term retention is primarily driven by being first to solve hard, valuable workloads, rather than marginal gains in leaderboard standings.

Cost, Usage and Price Elasticity

By plotting model cost against usage (typically in log–log space), the study categorizes both tasks and models.

Task categories by cost and frequency:

  • Premium workloads (high price, high frequency): Programming and technical tasks. Users are willing to pay for closed models that reliably handle complex problems, as the value of the outcome often far exceeds token costs.

  • Mass traffic drivers (low price, high frequency): Role-play and general Q&A. Open-source models dominate here by combining acceptable quality with low cost.

  • Professional experts (high price, low frequency): Finance, healthcare and academic work, where each call is costly but overall query volume is modest.

  • Niche tools (low price, low frequency): Translation and legal assistance.

A median cost of about $0.73 per million tokens is cited as the dividing line between lower- and higher-priced usage.
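The four quadrants above can be sketched as a simple two-axis classifier. The $0.73-per-million-token median is the report's price divider; the frequency threshold below is an assumed placeholder, since the report does not publish one:

```python
# Sketch of the report's cost-frequency quadrants. The $0.73/M-token median
# price divider comes from the report; the frequency divider (share of total
# requests) is an assumed placeholder, chosen only for illustration.
PRICE_MEDIAN = 0.73   # USD per million tokens (from the report)
FREQ_DIVIDER = 0.05   # assumed: >= 5% of requests counts as "high frequency"

def classify(price_per_mtok: float, request_share: float) -> str:
    high_price = price_per_mtok >= PRICE_MEDIAN
    high_freq = request_share >= FREQ_DIVIDER
    if high_price and high_freq:
        return "premium workload"      # e.g. programming, technical tasks
    if not high_price and high_freq:
        return "mass traffic driver"   # e.g. role-play, general Q&A
    if high_price and not high_freq:
        return "professional expert"   # e.g. finance, healthcare, academia
    return "niche tool"                # e.g. translation, legal assistance

print(classify(3.00, 0.40))  # premium workload
print(classify(0.20, 0.30))  # mass traffic driver
```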

Model positioning by cost and usage:

  • Premium leaders: Higher-priced models with heavy usage, such as Claude Sonnet 4 and Gemini 2.5 Pro.

  • Efficient giants: Very low-cost but capable models with substantial usage, including Gemini Flash and DeepSeek V3.

  • Long tail: Extremely cheap models with limited adoption.

  • Premium specialists: Very expensive models serving rare but high-value tasks, including o1-Pro and GPT-5 Pro.

The analysis finds that overall price elasticity is weak:

  • A 10% reduction in price translates into only about 0.5%–0.7% additional usage.

  • For mission-critical workloads, users appear relatively insensitive to price; for low-value tasks, cutting prices alone does not necessarily unlock explosive growth.
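In textbook terms, the cited figures correspond to a price elasticity of demand of roughly -0.05 to -0.07, far inside the inelastic range:

```python
# Price elasticity of demand implied by the cited figures: a 10% price cut
# yields only about 0.5-0.7% additional usage.
def elasticity(pct_usage_change: float, pct_price_change: float) -> float:
    return pct_usage_change / pct_price_change

low = elasticity(0.5, -10.0)   # -0.05
high = elasticity(0.7, -10.0)  # -0.07
print(low, high)  # both magnitudes are well below 1: strongly inelastic
```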

However, the study notes a pattern akin to Jevons paradox: when certain models become both “cheap enough and good enough,” they are used in more places, with longer contexts and higher call frequency. Total token consumption for those models can surge, and aggregate spending does not necessarily fall.
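A quick illustration of that pattern with assumed numbers (not from the report): if a model's per-token price halves while its token volume triples, aggregate spend on that model still grows:

```python
# Illustrative (assumed) numbers for the Jevons-style pattern: the price is
# halved, but the model is used in more places with longer contexts, so token
# volume triples and aggregate spend rises anyway.
old_price, new_price = 2.00, 1.00        # USD per million tokens
old_tokens, new_tokens = 1.0e12, 3.0e12  # tokens consumed

old_spend = old_price * old_tokens / 1e6
new_spend = new_price * new_tokens / 1e6
print(new_spend / old_spend)  # 1.5: spend grows despite the price cut
```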

Taken together, the findings suggest a practical division of labor:

  • Closed-source models concentrate on high-value, high-risk tasks where quality and reliability must be tightly controlled.

  • Open-source models are favored for large-volume, cost-sensitive workloads where tolerances for imperfection are higher.

In some cases, lowering prices can lead users to simply “use much more,” rather than to spend less in absolute terms.

Broader 2025 Takeaways and Study Limitations

The report closes with several broad observations drawn from its data:

  • LLMs are not just for copywriting: Programming has become the largest and strategically most important category, while role-playing and entertainment generate usage volumes comparable to classic productivity tasks.

  • A multi-model ecosystem is entrenched: Closed models dominate revenue-linked, high-stakes workloads; open models dominate low-cost, high-throughput usage.

  • Agentic reasoning is becoming standard: Longer contexts, more tool calls and multi-step logic are increasingly the norm, shifting evaluation criteria from one-off answer quality to robustness over extended reasoning chains.

  • Retention is tied to solving hard problems: The “glass slipper” effect underscores that capturing a well-fitting, high-value scenario early can matter more than incremental performance gains on benchmarks.

  • AI is no longer a North America–only story: Asia, and especially China, now play major roles as both producers and heavy users. The study argues that multilingual and multicultural adaptation will be essential in the next phase of AI deployment.

The authors also note important limitations:

  • The dataset covers only OpenRouter-platform traffic, excluding private deployments and internal enterprise systems.

  • Some metrics rely on inferred proxies, such as geographic location and identification of reasoning modes.

As a result, the report characterizes its conclusions as indicative of broader industry trends rather than definitive measures of the entire AI market.