Counting The Cost Of AI Inference – And Projecting It Far Out
March 30, 2026 Timothy Prickett Morgan
It is probably a good thing that most IBM i shops did not spend a lot of money trying to figure out AI in the past decade. It was enormously expensive to develop first generation machine learning algorithms, and they had limited applicability. With large language models and their generative capabilities, the use cases for AI have skyrocketed, but the costs for training have been crazy expensive since the end of 2022, when the chattybot eureka moment – some might say emergent behavior – happened.
The cost of training what are called foundation models – very large models with hundreds of billions of parameters, which are akin to the number of neurons in our brains, with the weights in the AI model being akin to the strength of synaptic signals in that collection of neurons – continues to go up as parameter counts and dataset sizes rise. Even mixture of expert models – which route each piece of a query to a handful of specialized expert subnetworks and blend their outputs rather than lighting up the whole network at once – still have a lot of parameters collectively. They give better answers, though. It is the difference between the thoughts of a few very experienced people and those of a blurty drunk or a five year old.
In any event, whatever metaphor you want to use, the cost of actually running AI in production comes down to what it costs to generate a snippet of text called a token, which in practice is not a word (oddly enough) but an average of around four characters. (It depends on the language.) Data is tokenized and then turned into numerical vectors, which are used during training to set the weights that drive what sure looks like the thought process of a GenAI model. So, right now and for the foreseeable future, what matters is the cost per token.
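The four-characters-per-token rule of thumb makes it easy to sketch what a response actually costs. Here is a minimal back-of-the-envelope in Python; the price per million tokens is illustrative, not any vendor's actual rate:

```python
# Rough cost estimate for generated text, using the ~4 characters per
# token average cited above. Both constants are rough assumptions.
CHARS_PER_TOKEN = 4          # average for English; varies by language
PRICE_PER_MILLION = 0.10     # dollars per million tokens (illustrative)

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return max(1, round(len(text) / CHARS_PER_TOKEN))

def estimate_cost(text: str, price_per_million: float = PRICE_PER_MILLION) -> float:
    """Approximate dollar cost to generate this text."""
    return estimate_tokens(text) / 1_000_000 * price_per_million

reply = "The IBM i platform runs on Power systems. " * 100  # ~4,200 characters
print(estimate_tokens(reply), "tokens")
print(f"${estimate_cost(reply):.6f}")
```

At a dime per million tokens, even a few thousand tokens of output costs a tiny fraction of a cent, which is why the per-token price matters only when multiplied across millions of queries.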
This week, both IDC and Gartner provided some insight into how rapidly this price has come down and how quickly they expect it to keep falling.
Here is the historical recap that Matt Eastwood, senior vice president at IDC, put out on X:

Eastwood is looking at the cost of generating tokens with the GPT-3 model when its API came out in beta in June 2020, which is a few years before the GenAI boom hit, and comparing it to what it costs to generate tokens through APIs from OpenAI, the creator of the GPT models, of course. The cost per million tokens was a whopping $32 back in 2020, and the good news is that most queries, most contexts, and most answers were short back then. They had to be. Today, almost six years later, the cost of 1 million tokens is under 10 cents. That is a reduction of more than 320X in six years. Moore’s Law improvements alone would lead us to expect only a 6X improvement, and going from 16-bit data down to 4-bit data gets you another 4X, for a combined 24X. The remaining 13.3X improvement in the cost per token is coming from other hardware and software advances.
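The decomposition above is easy to check. A quick sketch of the arithmetic, using the article's own figures:

```python
# Decomposing the ~320X cost-per-token improvement from mid-2020 to today.
start = 32.00    # $/million tokens, GPT-3 API beta, June 2020
end = 0.10       # $/million tokens today, roughly

total = start / end            # overall improvement: 320X
moore = 6.0                    # ~2X every two years over almost six years
precision = 16 / 4             # 16-bit down to 4-bit data: 4X
explained = moore * precision  # 24X accounted for by chips and precision
residual = total / explained   # the rest: other hardware and software gains

print(f"{total:.0f}X total, {explained:.0f}X from Moore's Law plus "
      f"precision, {residual:.1f}X from everything else")
```

That residual 13.3X is the part delivered by better interconnects, batching, caching, sparsity, and smarter serving software rather than by transistors or narrower data types.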
This is, actually, an amazing level of bang for the buck change for the better.
But it doesn’t stop there, according to the researchers over at rival Gartner:

It looks like IDC’s figure lands pretty much on the base (black) line on the left side of the chart above in 2026, and the cost per million tokens is going to continue down its exponential curve, dropping by another 9X between 2026 and 2030. So, call it a penny per million tokens, and yes, the rate of change is slowing because there are limits of physics to contend with.
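A 9X drop over those four years implies a steady annual rate of decline, which a couple of lines can work out; the 2026 starting price here is the rough 10 cents per million tokens cited above:

```python
# Implied annual decline if the per-token price falls 9X from 2026 to 2030.
factor = 9.0
years = 4
annual = factor ** (1 / years)     # price gets this much cheaper each year
price_2026 = 0.10                  # $/million tokens, rough 2026 figure
price_2030 = price_2026 / factor   # roughly a penny per million tokens

print(f"{annual:.2f}X cheaper per year, ending near "
      f"${price_2030:.3f} per million tokens")
```

That works out to about 1.73X per year, a far gentler slope than the 320X-in-six-years run that got us here.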
Wait, wasn’t everyone going to get rich selling tokens? Well, as it turns out, reasoning models – often built as mixtures of experts (MoE) – will probably burn somewhere between 100X and 1,000X more tokens to think through an answer. So the price to get an answer will go up, and the hope is that the quality of the answer will rise faster than the price.
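Putting the two trends together under the article's rough figures shows why the price of an answer still rises even as tokens get cheaper:

```python
# Net change in the cost of an answer: reasoning burns far more tokens
# per query, while each token gets cheaper. Both figures are the
# article's rough ranges, not measured data.
price_drop = 9.0                      # per-token price decline, 2026-2030
for token_multiplier in (100, 1_000):
    net = token_multiplier / price_drop
    print(f"{token_multiplier}X tokens / {price_drop:.0f}X cheaper tokens "
          f"-> answers cost roughly {net:.0f}X more")
```

So at the low end an answer costs on the order of 11X more, and at the high end more than 100X more, which is the bet the model builders are making on answer quality.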
For those of us who like our thinking jobs, maybe that is not something to be desired.

