From GPU Hours to Token Dollars: The New AI Economy ($NVDA)

JaminBall
03-28 09:23

One thing I’m starting to believe - the companies who figure out pricing and packaging the fastest will have a big edge in the early days of this AI phase shift.

I think it’s one of the hardest problems right now for any AI company! What makes pricing so difficult is that an entirely new (and expensive) line item has entered COGS - inference. Whether you’re paying OpenAI / Anthropic directly, or paying someone else to run open source models, inference costs are exploding (and we’re just getting started...). A big question becomes - how can you price your product such that you don’t torpedo your business into perpetual negative gross margin land (or said more positively, how can you price your product to more tightly align with value delivered)?

A couple weeks ago I wrote a post titled “Get in the Token Path” which described one way to better align pricing models (and a way I think will ultimately lead to the most success). I think it’s so important to figure out pricing because we’re entering a world where your pricing model is your business model. In traditional SaaS, pricing was important but also forgiving. You had 80%+ gross margins, so even if your packaging wasn’t perfect, there was a huge cushion. You could underprice a seat, overprovision features, give away usage for free and it didn’t matter because the incremental cost of serving another user was basically zero.

That’s not the case anymore. When every agent action, every “magic” feature triggers real inference costs on the backend, the gap between your pricing model and your cost structure becomes existential. Price too low and you’re literally paying customers to use your product. Price too high and you lose to the competitor who figured it out first. And the costs themselves are a moving target. Token prices are falling fast, hardware efficiency improves every generation, and the mix of models you’re calling changes the math entirely. You can’t just set a $50/seat/month price and forget it.
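To make the "paying customers to use your product" point concrete, here is a minimal sketch of the per-seat math. The $50/seat/month price comes from the example above; the token volumes and the $10 per million token cost are made-up assumptions for illustration, not real usage data.

```python
def gross_margin(price_per_seat: float,
                 tokens_millions: float,
                 cost_per_million_tokens: float) -> float:
    """Gross margin (fraction of revenue) for one seat over one month."""
    inference_cost = tokens_millions * cost_per_million_tokens
    return (price_per_seat - inference_cost) / price_per_seat


SEAT_PRICE = 50.0            # $/seat/month, the figure used in the text
COST_PER_MILLION = 10.0      # hypothetical blended inference cost, $/1M tokens

# Hypothetical usage profiles: a light user and a heavy agent user.
light_user = gross_margin(SEAT_PRICE, tokens_millions=1.0,
                          cost_per_million_tokens=COST_PER_MILLION)
heavy_user = gross_margin(SEAT_PRICE, tokens_millions=8.0,
                          cost_per_million_tokens=COST_PER_MILLION)

print(f"light user margin: {light_user:.0%}")
print(f"heavy user margin: {heavy_user:.0%}")
```

Under these assumed numbers the light user looks like classic SaaS (80% margin), while the heavy user costs more to serve than the seat brings in. Same price, same product, very different business.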

But here’s the flip side (and I think this is underappreciated). The companies that do get pricing right have a chance to capture way more value per customer than was ever possible in the old SaaS world. If you can tightly align price with value delivered (charging more when your product does more, scaling with outcomes rather than headcount), the ceiling on what you can extract per customer goes way, way up. A traditional SaaS company selling seats was always capped by “how many humans work at this company.” An AI-native company selling on usage, outcomes, or work completed? The TAM within each account is theoretically limitless. Getting pricing wrong is fatal. But getting it right is a superpower.

So back to pricing…I think we’re about to see an explosion of pricing models based on price per token. It may not literally be “per token” - I think we’ll see a lot of credit based pricing structures emerge where on the back end credits buy you tokens in a somewhat obfuscated way. More multi-product companies are coming out, and companies are layering on additional products faster than ever before. Pricing and packaging for multi-product companies is really hard (you want to capture value for each incremental product delivered). But most of these products will have token generation in common, and that’s why I think credits will win: credits basically buy you a certain amount of consumption across any of the products offered.

Related to all of this - I think we’re also about to go through a shift in how GPUs are monetized as the world moves from a training heavy one to an inference heavy one. For the last decade, the world has rented GPUs by the hour. You go to AWS, CoreWeave, Lambda, whoever - and you pay something like $2-4 per GPU hour. That’s it. That’s the business model. You’re basically renting a very expensive piece of silicon by the clock, the same way you’d rent a U-Haul truck. Doesn’t matter if you’re driving it across the country fully loaded or it’s sitting in your driveway. You’re paying by the hour.

This model made sense for training. Training is a big batch job and more of a cost center - you spin up a cluster, you run the job for days or weeks, and you’re done. There isn’t really direct “revenue” tied to this cost (obviously there is as you start charging for model access, but the direct cost of training isn’t paired with revenue). Hourly rental works fine. But we’re not in a training-dominated world anymore. We’re rapidly shifting into an inference-dominated world, and that changes everything about how these GPUs should get priced.

NVIDIA’s Jensen Huang spent a huge chunk of his GTC keynote this month hammering this point home. He showed what he called a “Pareto frontier” - basically a chart that maps the tradeoff between throughput (how many total tokens you can pump out) and latency (how fast each individual user gets their response). The key insight is that depending on where you sit on that curve, the economic value of a GPU hour changes dramatically.

Let me make this concrete with some illustrative figures. Today, if you’re a GPU cloud provider, you’re renting out an H100 for roughly $2-4/hour (the range can be wider than that depending on the GPU type and specs, whether you’re buying on demand or on a committed deal, etc). That’s your revenue per GPU hour. That’s what the market pays. But now think about it from the other direction. A single GPU in the GB300 NVL72 generates roughly 15,000 tokens per second. The full rack - 72 GPUs working together - pushes that to over 1 million tokens per second, or roughly 4 billion tokens per hour.

Now, renting that equivalent GPU capacity by the hour might run you $150-300/hour (72 GPUs at $2-4 each). But price those tokens instead. Even at rock-bottom commodity rates of $0.15-0.20 per million output tokens, 4 billion tokens generates $600-800/hour in token revenue - easily 2-4x the hourly rental value. And that’s at the cheapest prices on the market. Mid-tier models charge $8-15 per million output tokens, which at the same throughput would push the math into the tens of thousands of dollars per hour. The simple logic: a GPU rented by the hour is priced based on the cost of the silicon. A GPU monetized by the token is priced based on the value of the output. And it turns out the output is worth a lot more than the silicon.

Same GPU. Same hour. But when you price the output in tokens instead of clock time, the monetization more than doubles.
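For anyone who wants to check the arithmetic, here is a small sketch that reproduces it. The throughput, rack size, rental rates, and token prices are the illustrative figures quoted above; the function names are just mine for this example.

```python
# Illustrative figures quoted in the post, not measurements.
TOKENS_PER_SECOND_PER_GPU = 15_000   # one GPU in a GB300 NVL72
GPUS_PER_RACK = 72
SECONDS_PER_HOUR = 3_600


def rack_tokens_per_hour() -> int:
    """Tokens one full rack produces in an hour at the quoted throughput."""
    return TOKENS_PER_SECOND_PER_GPU * GPUS_PER_RACK * SECONDS_PER_HOUR


def rental_revenue_per_hour(rate_per_gpu_hour: float) -> float:
    """Revenue from renting the whole rack by the clock."""
    return GPUS_PER_RACK * rate_per_gpu_hour


def token_revenue_per_hour(price_per_million_tokens: float) -> float:
    """Revenue from selling every token the rack produces in an hour."""
    return rack_tokens_per_hour() / 1_000_000 * price_per_million_tokens


print(f"tokens per hour: {rack_tokens_per_hour():,}")
for rate in (2.0, 4.0):
    print(f"rental at ${rate}/GPU-hr: ${rental_revenue_per_hour(rate):,.0f}/hr")
for price in (0.15, 0.20):
    print(f"tokens at ${price}/1M:   ${token_revenue_per_hour(price):,.0f}/hr")
```

Without rounding up to 4 billion, the rack produces about 3.9 billion tokens per hour, which works out to roughly $583-778 per hour in token revenue against $144-288 of rental. Even the worst pairing (cheapest tokens, priciest rental) still comes out a bit over 2x. Note this assumes every token produced gets sold, which is the optimistic case.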

This is where Jensen’s Pareto chart becomes so powerful. On one end of the curve, you’re running maximum throughput - serving tons of users, batch workloads, cheap tokens. On the other end, you’re running low latency - fast, responsive, premium tokens for things like real-time agents. The further you push toward the premium end, the more revenue you extract per GPU hour. And with each generation of hardware pushing that Pareto frontier outward, the gap between “what you charge per hour” and “what you could charge per token” keeps getting wider.

So why does this matter for founders?

First, if you’re building on top of inference, you need to understand that the cost of tokens is going to keep falling (and fast). Every hardware generation pushes the frontier out further. NVIDIA’s own numbers show Vera Rubin delivering 5x the inference throughput of Blackwell, with token costs falling 10x. That’s deflationary pressure that will compound relentlessly. If your business model depends on tokens being expensive, you’re on the wrong side of this curve.
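Here is a rough sketch of why pure pass-through pricing is the wrong side of the curve. It takes the 10x-per-generation cost drop from the claim above at face value; the $10 starting cost and the 30% markup are hypothetical numbers I picked for illustration.

```python
COST_DROP_PER_GENERATION = 10.0   # the 10x figure quoted in the text
MARKUP = 0.30                     # hypothetical reseller markup on token cost


def cost_after(cost_today: float, generations: int) -> float:
    """Cost per 1M tokens after a number of hardware generations."""
    return cost_today / COST_DROP_PER_GENERATION ** generations


def passthrough_profit(cost_per_million: float) -> float:
    """Gross profit per 1M tokens if you only mark up the underlying cost."""
    return cost_per_million * MARKUP


for generation in range(3):
    cost = cost_after(10.0, generation)   # hypothetical $10/1M starting cost
    profit = passthrough_profit(cost)
    print(f"gen +{generation}: cost ${cost:.2f}/1M, profit ${profit:.3f}/1M")
```

The margin percentage never changes, but the dollars of gross profit per million tokens shrink 10x with each generation. Unless volume grows faster than that, a cost-plus business gets smaller every time the hardware gets better.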

Second, and this is the more interesting part in my opinion: for the companies actually running the GPUs, this shift from hourly rental to token-based pricing is transformational. It turns GPUs from depreciating assets into revenue-generating factories. Under the old model, you buy a GPU, you rent it out by the hour, and you’re in a race against depreciation. The next chip comes out, your hourly rate drops, and you’re chasing a declining price (interestingly, H100 prices have actually been rising as the market remains so incredibly supply constrained). It’s a cost center. You’re basically a landlord watching your property value fall every 18 months.

Under the token model, the equation flips. You’re now selling output rather than raw compute time. A token that helps an agent close a sales deal is worth a lot more than a token that summarizes a Wikipedia page, even though they cost roughly the same to generate. That’s the whole point of Jensen’s tiered token pricing vision - different points on the Pareto curve command different price points.

This is one reason why the Groq acquisition is so interesting. The Groq LPU is purpose-built for inference - ultra-low latency token generation. By integrating Groq into NVIDIA’s ecosystem (alongside their Dynamo inference orchestration software), they’re essentially building out the full stack to help GPU operators maximize token revenue, not just clock-time rental revenue. It extends the business model for everyone in the chain.

Think about the analogy to cloud computing. In the early days of AWS, you rented VMs by the hour. Then Lambda came along and you paid per function invocation. Then you started paying per API call, per request, per transaction. The pricing got more granular and more aligned with actual value delivered. The same thing is happening with GPUs. We’re moving from “pay per hour of silicon” to “pay per unit of intelligence produced.” And just like in cloud, the companies that figure out how to price on value (not just cost) will capture disproportionate economics.

One more thing. This shift has massive implications for the AI model providers too - the OpenAIs and Anthropics of the world. If you’re sitting on top of billions of dollars of GPU infrastructure and you’re selling tokens through APIs, you want to be on the high end of that Pareto curve. You want premium, low-latency, high-quality tokens that you can charge $10-15 per million output tokens for (not commodity batch tokens at $0.10). The model providers who can deliver the most valuable tokens (through better reasoning, faster responses, or more reliable agents) will extract the most revenue per GPU hour from their infrastructure. This is actually what’s going to drive these companies to profitability - not just scale (which will help!), but token monetization efficiency.

The world is moving from GPU hours to token dollars. For founders building in this ecosystem, the takeaway is this: understand where you sit on the Pareto curve, because that’s what determines your economics. Are you building for throughput (cheap, high-volume tokens)? Or for latency (premium, real-time tokens)? The hardware is getting better at both, but the pricing power lives at the premium end. And the companies that figure out how to monetize tokens based on value (not just pass through the cost of compute) are the ones that will have a big advantage.

And this is where the credit-based pricing model I mentioned earlier becomes so interesting. Credits give you a flexible abstraction layer that lets you sit on top of all of this complexity without exposing it directly to the customer. Token costs are falling, model mix is shifting, you’re running five different models at five different price points on the backend. The customer doesn’t need to know any of that. They buy credits, credits buy them outcomes across your product suite, and you manage the token economics underneath. It’s the same playbook the cloud providers eventually landed on (committed spend across a broad set of services), and I think it’s where the best AI-native companies will land too. The ones who figure out that abstraction, pricing on value and managing token costs as an internal optimization problem rather than a customer-facing one, will build the most durable businesses of this next era.
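A minimal sketch of what that abstraction layer could look like in code. Everything here is hypothetical: the model names, the credit rates, and the `CreditAccount` / `charge` names are invented for illustration, not taken from any real product.

```python
from dataclasses import dataclass

# Internal rate card: credits consumed per 1M tokens, by backend model.
# Hypothetical values. The vendor can retune these as token costs fall
# or the model mix shifts, without changing what the customer buys.
CREDITS_PER_MILLION_TOKENS = {
    "small-model": 1.0,
    "large-model": 20.0,
}


class InsufficientCredits(Exception):
    """Raised when an action would cost more credits than remain."""


@dataclass
class CreditAccount:
    balance: float  # credits the customer has purchased and not yet used


def charge(account: CreditAccount, model: str, tokens: int) -> float:
    """Deduct the credit cost of `tokens` generated on `model`; return the cost."""
    cost = tokens / 1_000_000 * CREDITS_PER_MILLION_TOKENS[model]
    if cost > account.balance:
        raise InsufficientCredits(f"need {cost} credits, have {account.balance}")
    account.balance -= cost
    return cost


account = CreditAccount(balance=100.0)
charge(account, "large-model", 2_000_000)   # a heavy agent task
charge(account, "small-model", 500_000)     # a cheap summarization
print(f"credits remaining: {account.balance}")
```

The customer only ever sees one number (their credit balance). The rate card is where the vendor does the internal optimization: which model served the request, and what it actually cost, stays behind the curtain.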

I had a fun time putting this post together! I’d love to speak with experts in the AI pricing and packaging space. Maybe even host a town hall with people who want to learn about pricing and packaging in the age of AI. Hit me up if you’d like to participate!



