en.Wedoany.com Report, Mar 27th: U.S. research firm Gartner recently released a report predicting that by 2030 the training cost of large language models (LLMs) will fall by 90% compared with last year, while overall inference costs are expected to rise. The study focuses on trillion-parameter models, a scale first reached in 2022, and defines an AI token as the basic unit of data processed by a generative AI model, approximately 3.5 bytes each.

Will Sommer, Senior Director Analyst at Gartner, stated: "These cost improvements will result from a combination of semiconductor and infrastructure efficiency gains, model design innovations, increased chip utilization, greater use of inference-specific silicon, and the application of edge devices for specific use cases." He added that while the declining unit cost of tokens will support more advanced generative AI capabilities, it may also trigger higher token demand, leading to an overall increase in inference costs.
Sommer pointed out: "Chief product officers (CPOs) should not confuse the deflation of commodity tokens with the democratization of frontier inference. As commodity intelligence trends toward near-zero cost, the compute and systems required to support advanced inference remain scarce. CPOs who mask architectural inefficiencies with cheap tokens today will find agent scale elusive tomorrow." The report shows that enterprise demand for agent-driven frontier intelligence models requires up to 30 times more tokens per task than standard generative AI chatbots, meaning token cost savings will not be fully passed on to customers.
Gartner's prediction comes as llm-d, an open standard for AI inference backed by Google Cloud, IBM, Red Hat, and Nvidia, is submitted to the Linux Foundation. The standard builds on a pre-integrated, Kubernetes-native distributed inference framework that uses techniques such as key-value (KV) caching, which stores intermediate results for tokens from previous model interactions so the GPU does not have to recompute them, cutting costs. Robbie Jerrom, Senior Principal Technical Expert for AI at Red Hat, told SDxCentral: "If you accumulate a large number of user responses over time, you can achieve cache hit rates as high as 80-88%, which reduces costs but, more importantly, improves performance. Furthermore, we can share this cache across multiple models."
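To make the cost mechanism concrete, here is a minimal, hypothetical sketch of prefix-based KV-cache reuse: requests that share a prompt prefix (e.g. a common system prompt) skip recomputation for the cached tokens. The class and all names are illustrative, not the framework's actual API; in a real serving stack the cached values would be per-layer key/value tensors on the GPU, not strings.

```python
from typing import Dict, List, Tuple

class PrefixKVCache:
    """Toy prefix cache: maps a token prefix to its (mock) KV entry.

    Real inference servers cache attention key/value tensors per prefix;
    here we only store the prefix itself so we can count reuse."""

    def __init__(self) -> None:
        self.store: Dict[Tuple[str, ...], object] = {}
        self.hits = 0
        self.lookups = 0

    def longest_cached_prefix(self, tokens: List[str]) -> int:
        """Return the length of the longest stored prefix of `tokens`."""
        self.lookups += 1
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.store:
                self.hits += 1
                return n
        return 0

    def insert(self, tokens: List[str]) -> None:
        # Store every prefix so later requests can match partial overlaps.
        for n in range(1, len(tokens) + 1):
            self.store[tuple(tokens[:n])] = f"kv-for-{n}-tokens"

cache = PrefixKVCache()
system_prompt = "You are a helpful assistant .".split()
requests = [system_prompt + q.split() for q in
            ["What is KV caching ?", "Why does it cut cost ?", "Explain tokens ."]]

saved = 0   # tokens we did NOT have to recompute on the GPU
total = 0
for toks in requests:
    saved += cache.longest_cached_prefix(toks)
    total += len(toks)
    cache.insert(toks)

print(f"reused {saved}/{total} tokens; cache hits {cache.hits}/{cache.lookups}")
```

After the first request warms the cache, the shared six-token system prompt is reused for every later request, which is the same effect that drives the high hit rates Jerrom describes when many user interactions overlap.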
Ishit Vachhrajani, Global Lead for Technology, AI, and Analytics at Amazon Web Services (AWS), added that inference costs will drop tenfold, a change that has already begun during the current AI boom. He stated: "We are in a phase where the cost of intelligence is falling and the level of intelligence is rising. I think this is an optimal time for many, many use cases to start leveraging AI in a cost-efficient way." Meanwhile, Google's TurboQuant compression algorithm recently made headlines, claiming it can reduce KV cache memory by at least sixfold. This solution compresses AI models while preserving their core structure, requiring no pre-processing or specific calibration data.
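TurboQuant's internals are not described in the article; as a generic illustration of why low-bit quantization shrinks a KV cache, the sketch below applies plain round-to-nearest 4-bit quantization to one group of values and computes the resulting memory ratio. This is a stand-in technique, not TurboQuant itself, and every name here is hypothetical.

```python
import random

def quantize_group(vals, bits=4):
    """Uniform symmetric round-to-nearest quantization of one group.

    Returns (codes, scale): each code fits in `bits` signed bits, and the
    dequantized value is approximately code * scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = max(abs(v) for v in vals) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in vals]
    return codes, scale

random.seed(0)
group = [random.gauss(0, 1) for _ in range(64)]   # mock KV-cache values
codes, scale = quantize_group(group)

# Memory: 64 fp32 values vs 64 4-bit codes plus one 16-bit scale factor
fp32_bits = 64 * 32
quant_bits = 64 * 4 + 16
ratio = fp32_bits / quant_bits
print(f"compression ratio ~{ratio:.1f}x")

# Round-to-nearest error is bounded by half the quantization step
err = max(abs(v - c * scale) for v, c in zip(group, codes))
```

Even with the per-group scale overhead, dropping from 32-bit floats to 4-bit codes yields roughly a 7.5x reduction, which is the ballpark where a sixfold KV-cache memory saving becomes plausible.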