The “Chinese Tax” on Large AI Models: Why Does Chinese Cost More Tokens Than English?
When Opus 4.7 was first released, there were a lot of complaints on X. One person said a single conversation maxed out her session quota, another said the same piece of code cost more than double to run what it had the week before, and yet another posted a screenshot of a $200 Max subscription hitting its cap in less than two hours.

Indie developer BridgeMind calls Claude the best model in the world, but also the most expensive. His Max subscription was maxed out in less than two hours; luckily, he had bought two. | Image credit: X@bridgemindai
Anthropic's official price remains unchanged: $5 per million input tokens and $25 per million output tokens. But this version introduces a new tokenizer, and Claude Code raises the default effort from high to xhigh; combined, these make the number of tokens consumed for the same job 2 to 2.7 times what it was before.
I've seen two Chinese-related claims in these discussions. One is: Chinese barely went up under the new tokenizer, so Chinese users escaped the price increase. The other is more interesting: classical Chinese uses even fewer tokens than modern Chinese, so talking to an AI in literary Chinese can save money.
The first claim implies that Claude did some kind of optimization for Chinese, but Anthropic's release notes don't mention any Chinese-related changes.
The second claim is harder to explain. Classical Chinese is obviously harder for human readers than modern Chinese; how can a text that is more demanding for humans be easier for an AI?
So I ran a test: 22 parallel texts (business news, technical documents, classical Chinese, daily conversations, and so on) fed into 5 tokenizers (Claude 4.6 and 4.7, GPT-4o, Qwen 3.6, DeepSeek-V3), comparing the token count of each text under each model side by side. A rough sketch of how such counts can be reproduced follows the list of test texts below.

Test texts:
1. Daily conversations in English and Chinese (travel, forum help, writing requests)
2. Technical documents in English and Chinese (Python documentation, Anthropic documentation)
3. News in English and Chinese (NYT current news, NYT business news, official Apple statements)
4. Literary selections in Chinese, English, and classical Chinese (the Chu Shi Biao, the Tao Te Ching)
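Since the author's original script and test texts aren't reproduced here, the following is only a minimal sketch of how such a side-by-side count can be run with publicly available tokenizers: tiktoken for the GPT vocabularies and Hugging Face tokenizers for the open-weight Chinese models. The Hugging Face repo names and the sample pair are assumptions; Claude's tokenizer is not publicly downloadable, so its counts would have to come from Anthropic's token-counting API instead.

```python
# Minimal sketch of a cn/en token-count comparison, not the author's script.
# Repo names are assumptions; Claude is omitted because its tokenizer is not
# publicly downloadable (counts would come from Anthropic's API instead).
import tiktoken
from transformers import AutoTokenizer

cl100k = tiktoken.get_encoding("cl100k_base")   # GPT-4-era vocabulary
o200k = tiktoken.get_encoding("o200k_base")     # GPT-4o-era vocabulary
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
deepseek = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3",
                                         trust_remote_code=True)

# One English/Chinese parallel pair as a placeholder; the article used 22.
pair = {
    "en": "AI is reshaping the global information infrastructure.",
    "cn": "人工智能正在重塑全球信息基础设施。",
}

counters = {
    "cl100k (GPT-4)": lambda s: len(cl100k.encode(s)),
    "o200k (GPT-4o)": lambda s: len(o200k.encode(s)),
    "Qwen":           lambda s: len(qwen.encode(s, add_special_tokens=False)),
    "DeepSeek-V3":    lambda s: len(deepseek.encode(s, add_special_tokens=False)),
}

for name, count in counters.items():
    en, cn = count(pair["en"]), count(pair["cn"])
    print(f"{name:16s} en={en:3d}  cn={cn:3d}  cn/en={cn / en:.2f}")
```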
After the test, both claims were partially verified, but the truth is a little more complicated than the rumors.
01 Chinese tax
Let's start with the conclusions:
1. On Claude and GPT, Chinese has always been more expensive than English.
2. On Qwen and DeepSeek, Chinese is cheaper than English.
3. In Opus 4.7, the tokenizer upgrade that triggered the shock inflated token counts almost exclusively in English; Chinese barely moved.
All Claude models before Opus 4.7 (including Opus 4.6, Sonnet, and Haiku) use the same tokenizer, under which Chinese consumes more tokens than equivalent English content across the board, with cn/en ratios ranging from 1.11× to 1.64×.
The most extreme case is NYT-style business news: for the same content, the Chinese version consumes 64% more tokens, which means 64% more money.

Opus 4.6 and prior Claude models consume significantly more tokens for Chinese than the other models do (red box)
The most extreme case is NYT-style business news: the Chinese version of the same content consumes 64% more tokens (green box)
GPT-4o's o200k tokenizer does a bit better: the cn/en ratio mostly falls between 1.0× and 1.35×, with a few scenarios below 1. Chinese is still more expensive overall, but the gap is much smaller than Claude's.
The data for the Chinese models Qwen 3.6 and DeepSeek-V3 is completely reversed. The cn/en ratio of both models is mostly below 1, meaning the Chinese version of the same content uses fewer tokens than the English version. DeepSeek goes as low as 0.65×: the Chinese version of the same passage is roughly a third cheaper than the English one.
The new tokenizer in Opus 4.7 inflates almost exclusively in English. Across text types, the English inflation factor runs from 1.24× to 1.63×, while Chinese stays at 1.000×, essentially unchanged. The bill shock those English-speaking developers complained about at the start never reached Chinese users. The likely reason: Chinese was already cut down to single-character granularity under the old tokenizer, so there was very little left to split.

Opus 4.7 vs. 4.6: English consumes more tokens, while Chinese stays the same.
Another thing I noticed during testing: the difference in token consumption is not just a billing issue; it directly shrinks the workspace. For the same 200k context window, loading Chinese with the old Claude tokenizer fits 40% to 70% less content than English.
For the same kind of job, such as having an AI analyze a long document or summarize a set of meeting minutes, a Chinese user can feed the model less material, and the model has a shorter context to refer to. The result is paying more money for a smaller workspace.
When the four sets of data are looked at together, a question naturally surfaces:
Why does the same content produce different token counts in different languages? Why is Chinese expensive on Claude and GPT but cheap on Qwen and DeepSeek?
The answer hides in a concept mentioned several times already: the tokenizer.
02 How many pieces can you cut a Chinese character into?
Before the model reads any text, it cuts the input into tokens using a tokenizer, which you can think of as the AI's "block cutter". You type a sentence, and the tokenizer breaks it into standardized blocks (i.e., tokens). The model never sees the raw text; it only sees the numeric IDs of those blocks, and the number of blocks you use is the amount of money you pay.
Cutting English is fairly intuitive: "intelligence" is most likely one token, and "information" is another, with one word roughly corresponding to one billing unit.

But Chinese runs into trouble at this step. Feed the same sentence, "AI is reshaping the global information infrastructure", into GPT-4's cl100k tokenizer and Qwen 2.5's tokenizer, and the results are completely different.
GPT-4 basically splits every Chinese character into its own token; Qwen recognizes whole words as tokens, so the four characters of "artificial intelligence" (人工智能) count as a single token.

For the same 16-character Chinese sentence, GPT-4 cuts out 19 tokens, while Qwen cuts out only 6.
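To see where the cuts land, you can decode each token id back to text. The Chinese sentence below is my own reconstruction of the 16-character example (the text above only gives its English meaning), and the Qwen repo name is likewise an assumption, so the exact counts may differ from the article's 19 vs. 6.

```python
# Sketch: inspect the token boundaries that two tokenizers produce for the
# same Chinese sentence. The sentence is an assumed reconstruction of the
# article's 16-character example, not a quote from the original test set.
import tiktoken
from transformers import AutoTokenizer

sentence = "人工智能正在重塑全球信息基础设施"

cl100k = tiktoken.get_encoding("cl100k_base")          # GPT-4's vocabulary
ids = cl100k.encode(sentence)
print(len(ids), [cl100k.decode([i]) for i in ids])
# Mostly one character, or even a fragment of a character's bytes, per token.

qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # assumed repo
ids = qwen.encode(sentence, add_special_tokens=False)
print(len(ids), [qwen.decode([i]) for i in ids])
# Whole words such as 人工智能 tend to come out as a single token.
```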
Why is it cut this way? The reason lies in an algorithm called BPE (Byte Pair Encoding).
BPE works by counting which symbol combinations occur most frequently in the training corpus and merging the high-frequency combinations into single tokens in the vocabulary.
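A toy version of that merge loop, run on a handful of English words rather than a real corpus, looks roughly like this; real tokenizers apply the same idea to the byte sequences of an enormous corpus.

```python
# Toy illustration of the BPE merge step: find the most frequent adjacent
# pair of symbols and merge it into a new symbol. This is the mechanism only,
# not any production tokenizer's actual training code.
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word -> frequency in a tiny "training corpus", each word as a tuple of characters.
words = {tuple("thinking"): 5, tuple("information"): 4, tuple("nation"): 3, tuple("the"): 10}

for step in range(5):
    pair = most_frequent_pair(words)
    print(f"merge {step + 1}: {pair}")   # frequent pairs like ('t','h') get merged first
    words = merge_pair(words, pair)
```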
In the GPT-2 era, the vast majority of the training corpus was English. English letter combinations (th, ing, tion) appear over and over and are quickly merged into tokens. Chinese characters appeared too rarely in that corpus to earn a place in the vocabulary, so they could only be handled as raw bytes: one Chinese character occupies 3 bytes in UTF-8, which becomes 3 tokens.
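You can see the byte-level arithmetic directly in Python; under a byte-fallback BPE whose merges were learned on English, each un-merged byte stays a separate token.

```python
# One Chinese character versus one English word at the byte level. Under a
# byte-level BPE with no Chinese merges (the GPT-2 situation described above),
# each of those un-merged bytes remains its own token.
for text in ["人", "the"]:
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} character(s), {len(raw)} bytes -> {list(raw)}")
# '人': 1 character(s), 3 bytes -> [228, 186, 186]
# 'the': 3 character(s), 3 bytes -> [116, 104, 101]  (and frequent English
# sequences like "th" or "the" get merged into single tokens)
```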

BPE decides what to merge by frequency in the training corpus. With an English-dominated corpus, Chinese UTF-8 bytes never get merged back into whole characters.
Later, GPT-4's cl100k vocabulary was enlarged and commonly used Chinese characters started to be included; a character usually shrank to 1 or 2 tokens, but the overall efficiency was still not as good as English.
By the time of GPT-4o's o200k vocabulary, Chinese efficiency improved another step. This also explains why GPT-4o's cn/en ratio in the first section is lower than Claude's.
Qwen and DeepSeek, as Chinese models, put a large number of commonly used characters and high-frequency phrases into the vocabulary as whole characters and whole words from the start. One token per word, and efficiency doubles or more.

Splitting the same sentence with different tokenizers
That's why their cn/en ratio can drop below 1. The information density of Chinese characters is already higher than that of English words, and once the tokenizer stops artificially breaking the characters apart, that natural advantage shows.
So the difference between the four sets of data in the previous section is rooted not in model capability but in how much room the tokenizer's vocabulary leaves for Chinese.
Claude's and early GPT's vocabularies were built with English as the default and Chinese "shoehorned in" later; Qwen's and DeepSeek's vocabularies treated Chinese as the default language from the beginning. That difference in starting point carries all the way through to token counts, billing, and context window size.
03 Is classical Chinese really cheaper?
Look again at the second rumor from the beginning: classical Chinese uses fewer tokens than modern Chinese.
The data backs this up. In the tests, the cn/en ratio of the classical Chinese samples is below 1 across the board, and consistently so under all five tokenizers. The classical version of a passage has fewer tokens than the corresponding English translation.

Across all models, classical Chinese consumes fewer tokens not only than modern Chinese but even than English
The reason is not complicated. Classical Chinese is extremely economical with words. "学而不思则罔，思而不学则殆" ("Learning without thinking leads to confusion; thinking without learning leads to peril") is 12 characters. Put into modern Chinese, it becomes something like "just learning without thinking will leave you confused, and just thinking without learning will get you into trouble", which roughly doubles the character count and, with it, the token count.
Moreover, the common function characters of classical Chinese (之, 也, 者, 而, 不) are all high-frequency characters with their own slots in any tokenizer's vocabulary, so they never get split into bytes. Classical texts really are efficient at the encoding level.
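As a rough check, you can count the three versions with any public tokenizer. The modern paraphrase below is my own wording (the article's exact modern-Chinese version isn't quoted), so the numbers are only illustrative, not a reproduction of the test.

```python
# Sketch: token counts for the classical line quoted above, an assumed modern
# Chinese paraphrase, and an English translation, under one public tokenizer.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

samples = {
    "classical": "学而不思则罔，思而不学则殆",
    "modern":    "只是学习而不思考就会迷惑，只是思考而不学习就会陷入困境",  # my paraphrase
    "english":   "Learning without thinking leads to confusion; "
                 "thinking without learning leads to peril.",
}

for name, text in samples.items():
    print(f"{name:9s} {len(enc.encode(text)):3d} tokens")
```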
But there's a trap hidden here.
Classical Chinese saves tokens on the encoding side, but the inference burden on the model does not shrink. The model still has to work out whether "罔" in this context means "confused", "deceived", or "without". Modern Chinese spells that meaning out in 26 characters; classical Chinese presses the spelled-out part back down and leaves the unpacking to the model. Think of a file compressed into a zip: smaller on disk, but decompressing it takes extra computation.
Tokens are saved, reasoning consumption goes up, and comprehension accuracy goes down. That trade does not necessarily work out in your favor.
The classical Chinese example made me realize that token count by itself doesn't say much. But following that line of thought, there is another layer I had overlooked.
As mentioned above, a GPT-2-era tokenizer would split the character "人" into three UTF-8 byte tokens. Later, GPT-4's expanded vocabulary turned commonly used Chinese characters into one token each, and Qwen went a step further, combining the four characters of "人工智能" into a single token.
Intuitively, this looks like steady improvement: the more you merge, the more efficient you get, and the better the model should understand.
But is that really the case? Think back to how we learned to read Chinese characters.
Chinese characters are ideographs, and more than 80% of modern characters are phono-semantic compounds, combining a semantic radical with a phonetic component. Characters with the 氵 radical mostly relate to liquids, those with 木 mostly relate to plants, and those with 火 mostly relate to heat. Radicals are the most basic semantic clue humans use to recognize characters. Someone who has never seen the character 焱, but notices the three 火 stacked inside it, can still guess that it has something to do with fire.
Because radicals are such a basic clue in human literacy, people first infer the category of meaning from a character's structure, and then pin down the specific meaning from context.

焱: sparks, flames; common in written language and personal names, suggesting light and heat
But in the tokenizer's vocabulary, the character 焱 corresponds to a number. Say it is number 38721. That number is just an index into the vocabulary; the model uses it to look up a set of numeric vectors, and those vectors are what stand for 焱.
The number itself carries no information about the character's internal structure. To the model, the relationship between 38721 and 38722 is no different from the relationship between 1 and 10000. The structural information is sealed off at this layer: the fact that three 火 are stacked on top of each other simply does not exist in the numbering.
The model can of course learn indirectly, from masses of training data, that 焱 and other fire-radical characters tend to occur in similar contexts, but that path is more roundabout than using the radical information directly.
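To make "the model only sees a number" concrete: a token id is nothing but a row index into an embedding table, as in this small sketch (the id 38721 is the hypothetical number from above, not a real vocabulary entry).

```python
# A token id is just an index into a learned lookup table; nothing about the
# glyph structure of 焱 survives in the id or in its (initially random) row.
import torch

vocab_size, dim = 200_000, 8
embedding = torch.nn.Embedding(vocab_size, dim)   # the table the model learns

token_id = torch.tensor([38721])                  # hypothetical id for 焱
vector = embedding(token_id)
print(vector.shape)                               # torch.Size([1, 8])
# Row 38722 starts out no more similar to this vector than any other row;
# any similarity between fire-family characters has to be learned from
# co-occurrence statistics, not read off the numbering.
```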
So can the model "see" radical-like structural clues in the disassembled bytes and recombine them in later layers of computation? That path is token-heavy and expensive, but could it actually be more effective for semantic understanding than swallowing an opaque number?
A paper published in MIT Press' Computational Linguistics in 2025 ("Tokenization Changes Meaning in Large Language Models: Evidence from Chinese") answers this question.
04 A radical grows out of the fragments
David Haslett, the paper's author, noted a historical coincidence.
In the 1990s, when the Unicode Consortium assigned code points to Chinese characters (the code points that determine their UTF-8 encodings), the characters were ordered by radical. Characters under the same radical end up with adjacent encodings. 茶 and 茎 both contain the 艹 (grass) radical, and their UTF-8 byte sequences start with the same bytes; 河 and 海 both contain the 氵 (water) radical, and their byte sequences likewise share their first byte.

Unicode orders Chinese characters by radical, so characters with the same radical get similar UTF-8 encodings | Image source: GitHub
This means that when a tokenizer splits a Chinese character into three UTF-8 byte tokens, characters that share a radical often share their first token. The model sees these shared byte patterns over and over during training, and it may learn that characters with the same first token tend to belong to the same category of meaning. Functionally, this is close to how humans infer semantics from radicals.
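The byte prefixes are easy to check with Python's own UTF-8 encoder; the four characters are the ones from the example above.

```python
# Check the byte-prefix coincidence: characters listed under the same radical
# sit near each other in the Unicode CJK block, so their UTF-8 bytes share
# leading values.
for c in ["茶", "茎", "河", "海"]:
    print(c, hex(ord(c)), [hex(b) for b in c.encode("utf-8")])
# 茶 0x8336 ['0xe8', '0x8c', '0xb6']
# 茎 0x830e ['0xe8', '0x8c', '0x8e']   <- shares the first two bytes with 茶
# 河 0x6cb3 ['0xe6', '0xb2', '0xb3']
# 海 0x6d77 ['0xe6', '0xb5', '0xb7']   <- shares the first byte with 河
# When a byte-level tokenizer splits these characters, the shared leading
# bytes become shared leading tokens, which is the cue the paper studies.
```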
Haslett designed three experiments to verify this.
The first experiment asked GPT-4, GPT-4o, and Llama 3 directly: do 茶 ("tea") and 茎 ("stem") contain the same semantic radical?
The second experiment had the model rate the semantic similarity of two Chinese characters.
The third experiment had the model do an "odd one out" elimination task.
Each experiment controlled two variables: whether the two characters actually share a radical, and whether they share a first token under the tokenizer. This 2×2 design made it possible to isolate the radical effect and the token effect separately.
The conclusions of the three experiments were consistent. When Chinese characters are cut into multiple tokens (under GPT-4's old tokenizer, 89% of Chinese characters were), the model recognizes shared radicals with higher accuracy; when a Chinese character is encoded as a single token (under GPT-4o's new tokenizer, only 57% of characters remain multi-token), accuracy drops.
In other words, the conjecture in the previous section holds. Chopping up the characters does cost more, but the chopped byte sequences retain traces of the radicals, and the model really does learn something from them. Encoding Chinese characters as whole-character tokens brings the cost down, but it seals the radical cue inside an opaque number, and the model can no longer reach it through the byte sequence.
To be clear, this conclusion is limited to narrow semantic tasks tied to glyph structure. It cannot be equated with a drop in the model's overall Chinese comprehension, logical reasoning, or long-text generation. And the comparison between GPT-4 and GPT-4o differs not only in tokenizer but also in model architecture, training corpus, and parameter count, so the change in accuracy cannot be attributed entirely to tokenization granularity.
The finding has an engineering-side echo as well. A 2024 study of GPT-4o found that after its new tokenizer merged certain Chinese strings into single long tokens, the model started making comprehension errors on them. When the researchers used a dedicated Chinese word segmenter to re-split those long tokens before feeding them to the model, comprehension accuracy recovered.
The prevailing consensus in the large-model industry remains that a tokenizer optimized for the target language, encoding whole characters and whole words, significantly improves overall model performance: it cuts token costs, increases the amount of effective information in the context window, shortens sequence lengths, lowers inference latency, and stabilizes long-text processing. The advantages of splitting found in the paper do not outweigh those benefits in most Chinese NLP scenarios.
But the episode still pokes at one of the hardest kinds of problems in large systems: you can optimize the parts you designed, but not the parts you don't know you have. The Unicode Consortium ordered code points by radical for the convenience of human retrieval; BPE split Chinese characters into bytes because Chinese was too rare in the corpus. Two unrelated engineering decisions happened to stack on top of each other and created a semantic channel nobody had planned.
Then, when a new generation of engineers "improved" the tokenizer and merged Chinese characters into whole tokens, they simultaneously erased a path they did not know existed. Efficiency went up, costs went down, and something quietly disappeared without so much as an error message.
So things are more complicated than the verdict that "Chinese overpays in AI". Every tokenizer optimizes for some default, and the cost is hidden somewhere else.
05 Lin Yutang's typewriter
The cost of adapting Chinese to Western technological infrastructures is not something that has only begun to be paid in the AI era.
In January 2025, New York resident Nelson Felix posted a few photos to a Facebook group for typewriter enthusiasts. He had found a typewriter inscribed in Chinese among his wife's grandfather's belongings and wondered what it was. Hundreds of comments soon poured in.

Nelson Felix's question: is this MingKwai typewriter worth anything? | Image source: Facebook
Stanford University sinologist Thomas S. Mullaney immediately recognized the photo as the only known prototype of Lin Yutang's 1947 MingKwai typewriter, which had been missing for nearly 80 years. In April of that year, Mr. and Mrs. Felix sold the typewriter to the Stanford University Libraries.
The problem the MingKwai typewriter tried to solve is structurally the same one tokenizers face today: how to efficiently fit Chinese into a technical infrastructure designed for Western languages.
An English typewriter of the 1940s had 26 letter keys, one key per character, simple and direct. Chinese has thousands of commonly used characters, and one key per character was impossible. The Chinese typewriters of the time were huge trays holding thousands of pieces of lead type, and a typist picking characters out one by one by hand could manage only a dozen or so characters per minute.

The Chinese typewriter invented by American missionary Devello Z. Sheffield in 1899 is the earliest record of a Chinese typewriter|Image source: Wikipedia
Lin Yutang spent $120,000 on research and development, nearly bankrupting his family, and commissioned the Carl E. Krum Company of New York to build a Chinese typewriter with only 72 keys. The operating principle was to decompose characters by glyph structure: the upper keys select the top component of a character, the lower keys select the bottom component, candidate characters appear in a small window called the "Magic Eye", and the typist picks one with the numeric keys. It supported 40 to 50 characters per minute and more than 8,000 commonly used characters.

(Left) The transparent glass window is the "Magic Eye"; (right) the internal structure of the MingKwai typewriter | Photo source: Facebook
Zhao Yuanren (Yuen Ren Chao) commented: "Both Chinese and Americans can familiarize themselves with this keyboard with a little study. I think this is the typewriter we need."
Technically, the MingKwai typewriter was a breakthrough; commercially, it failed.
The machine malfunctioned during a demonstration to Remington executives, investors lost interest, and its high cost, combined with Lin's own strained finances, made mass production impossible. In 1948 the prototype and the commercial rights were sold to the Mergenthaler Linotype Company, which eventually abandoned mass production. When the company relocated in the 1950s, an employee took the prototype home to Long Island, and its whereabouts were unknown until it resurfaced in 2025.
In his book The Chinese Typewriter, Mullaney offers a verdict on the MingKwai: as a product of the 1940s, it did fail; but as a paradigm of human-computer interaction, it triumphed.
Lin Yutang was the first to turn Chinese "typing" into "search and select": the rows of keys combine to narrow down character components, and the typist picks from the candidates. That is the underlying logic of every modern Chinese input method. From Cangjie and Wubi to Sogou Pinyin, they are all descendants of the MingKwai typewriter.

The Chinese Typewriter, by Thomas S. Mullaney | Image source: Douban
There is a historical pattern running from this typewriter, across nearly eighty years, to the tokenizers we keep discussing today. Chinese has always faced the same problem:
how to plug into an infrastructure built around the Roman alphabet.
Interestingly, the search for an answer is full of coincidences nobody planned: an ordering the Unicode Consortium chose for the convenience of human retrieval, stacked on the BPE algorithm's unintentional splitting, happened to recreate part of the human literacy process inside the black box of a neural network. And when engineers deliberately merged the characters into whole tokens and drove costs down to eliminate the "Chinese tax", the semantic channel that had been created by accident was closed along with it.
History is not a linear evolutionary track; it is a fluid, constantly deforming under the squeeze of various constraints.
Some capabilities are there by design; others are there only because, by chance, nothing ever removed them.
This article is from the WeChat official account "Geek Park" (ID: geekpark). Author: Tang Yitao