Chinese Data Accounts for Over 60% of Training Data in Most Domestic AI Models

  • 2025-08-22


Chinese data plays a critical role in enhancing the training performance of domestic large-scale AI models. Recent data released by the National Data Administration shows that Chinese data now accounts for over 60% of the training data used in most domestic AI models, with some models reaching 80%. The continuous improvement in the development and supply capacity of high-quality Chinese data has driven the rapid advancement of artificial intelligence model performance in China.

Liu Liehong, Director of the National Data Administration, stated that the rapid development of artificial intelligence in China is closely tied to the country's emphasis on data-related work. As one of the core elements of AI development, data plays a key role in promoting "AI+," making the construction of high-quality datasets crucial.

"In the era of artificial intelligence, the token, commonly described as the smallest unit of text processing, is akin to 'traffic' in the internet era," Liu Liehong explained. In early 2024, China's daily token consumption was 100 billion. By the end of June this year, it had exceeded 30 trillion, a roughly 300-fold increase in a year and a half that reflects the rapid growth of AI applications in China.
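The 300-fold figure can be checked with a quick calculation. This is only a sketch: it assumes "billion" and "trillion" are short-scale values (10^9 and 10^12) and treats the reported 30 trillion as a lower bound, since the article says consumption "exceeded" it. The variable names are illustrative.

```python
# Reported daily token consumption (short-scale units assumed).
daily_tokens_early_2024 = 100 * 10**9   # 100 billion
daily_tokens_june_2025 = 30 * 10**12    # 30 trillion (reported as a lower bound)

# Ratio between the two reported figures.
fold_increase = daily_tokens_june_2025 // daily_tokens_early_2024
print(f"Fold increase: {fold_increase}")
```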

As of the end of June this year, China had developed more than 35,000 high-quality datasets with a total volume exceeding 400 PB (1 PB can store approximately 500 million high-definition photos of 2 MB each). That total is equivalent to about 140 times the digital resource volume of the National Library of China.
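The storage comparison can be reproduced as follows. This is a rough sketch that assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes); with binary units the per-petabyte photo count would be somewhat higher. The library figure at the end is only what the article's "140 times" ratio implies, not an independently sourced number.

```python
# Decimal storage units (an assumption; the article does not specify).
PB = 10**15
MB = 10**6

photo_size = 2 * MB                 # one 2 MB high-definition photo
photos_per_pb = PB // photo_size    # photos that fit in 1 PB
total_pb = 400                      # reported lower bound on total dataset volume
total_photos = total_pb * photos_per_pb

# Library volume implied by the "about 140 times" comparison.
implied_library_pb = total_pb / 140

print(f"Photos per PB: {photos_per_pb:,}")
print(f"Photos in 400 PB: {total_photos:,}")
print(f"Implied library volume: {implied_library_pb:.2f} PB")
```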

The training of AI models has also driven an increase in data trading demand. By the end of June, the cumulative transaction value of high-quality datasets across regions reached nearly 4 billion yuan, with the total scale of high-quality datasets listed on data trading platforms reaching 246 PB.

Moving forward, the National Data Administration will continue to advance the construction of high-quality datasets through systematic planning and accelerate the establishment of data hubs in key areas such as embodied intelligence, the low-altitude economy, and bio-manufacturing. It will also promote societal recognition of the value of data elements, expedite the co-creation of data element value, and foster a market consensus of "paying for high-quality data."
