Tech Capital Article Featuring Jason Hoffman, Switch Chief Strategy Officer

Executive Summary

Jack Haddon’s article below explores the evolving debate over where AI inference—the process of running trained models—will ultimately reside. While large-scale data centers currently dominate AI training, inference workloads may be distributed across a spectrum ranging from AI factories to metro colocation sites, enterprise data centers, and even end-user devices. Each approach offers trade-offs between latency, cost, and scalability. Some experts envision powerful, centralized infrastructures supporting asynchronous or batch inference, while others predict a migration toward edge and device-level computing for real-time, latency-sensitive tasks.

Jason Hoffman, Chief Strategy Officer at Switch, predicts that AI will follow a familiar trajectory seen in gaming and mobile computing, where “it’s actually better, faster, cheaper, and easier to make a more powerful device than to build out infrastructure between the physics engine and the device.” He states that “people in infrastructure keep saying they’ll build dedicated inference infrastructure distributed in cities, but I can point to half a dozen historical examples… devices got more powerful, and data centers became more centralized, while the middle continued to get commoditized.” Hoffman adds that workloads must be supported wherever they best fit—whether it’s “in a big data center, on the device, or somewhere in between. Often these edge services or inference nodes will mostly be coordinating between what’s happening on the device and in big data centers.”  

In essence, the article concludes that AI inference will not have a single home. Instead, its deployment will depend on workload type, latency needs, and sovereignty concerns—resulting in a distributed and dynamic compute landscape.

Read the full article below.


Where will inference be deployed?

The battle over where AI inference will live has begun, and the experts can’t agree. Will it be forged in sprawling gigawatt data fortresses, scattered across metro hubs, or pulled down onto the very devices in our hands?

by Jack Haddon
Deputy Editor, The Tech Capital

The data centre industry has been asked a lot of billion-dollar questions as of late. But a trillion-dollar question is lurking in the background:  

Where do we need to build data centres for AI inference at scale?  

The breed of data centre facility required for training AI is now well understood: infrastructure that can support vertically scaled computing, large clusters of high-powered GPUs with liquid cooling for maximum efficiency, and enough power to scale it all, delivering better and more powerful models. 

But there is no blueprint for inference – or the infrastructure required for actually using a trained model.

This presents both an opportunity and a challenge for the data centre industry. It means that there is room for several different business models to support different types of inference workloads, but it also means that meeting holistic demand will require an understanding and anticipation of emerging use cases to ensure the right infrastructure is built at the right time and in the right place.  

In this article, we explore the different locations where AI inference compute could be deployed, why, and what data centre developers need to consider to be able to deliver.  

The experts don’t agree. Some see inference collapsing back onto devices. Others believe hyperscale facilities will dominate. Still others point to metro colos, sovereign data centres, or hybrid setups straddling all of the above. The answer, as always, depends on who you ask and what problems they’re trying to solve.  

 “There’s really no simple rule,” says Jeff Denworth, co-founder of AI operating platform VAST Data.  

“You’re going to have easy stuff that can run on one GPU and hard stuff that will require whole data centre-sized systems.” 

Denworth uses the example of asking ChatGPT what time sundown is (an easy task) compared to a drug discovery use case or a deep research report where a large amount of data is analysed, and the findings returned. 

Fortunately, the large-scale AI factories that are being built to support training workloads can also be used for inference.   

This is encouraging, as concerns have been raised that improvements in techniques for training new AI models with less compute, such as those demonstrated by DeepSeek in early 2025, could leave the multi-hundred-megawatt or even gigawatt sites now being planned as stranded assets, with no customers requiring large clusters of compute in remote locations.  

The flexibility to support inference workloads extends the life of these facilities, making them less risky to deploy and reducing the risk priced into construction financing.  

Many of these large AI factories are being built in remote locations, where access to large quantities of power to meet the desired IT capacity was the primary driver of their location.  

That means that network latency between the data centre and an end user is likely higher than that of a cloud availability zone or a local colocation facility. 

These large AI facilities have already been designed for scale and powerful compute, meaning they are best suited to asynchronous processing, or batch inference, a powerful and highly efficient method for generating predictions on a large volume of data when immediate, real-time responses are not required.  

Take, for example, Denworth’s drug discovery use case, which would require a large body of scientific research papers to be uploaded and analysed in search of correlations that have yet to be drawn.  

Unlike online inference (asking ChatGPT what time sundown is), batch inference operates on data that has been collected over a period of time.  

 This approach prioritises high throughput and computational efficiency over low latency.   

 Not being time sensitive means compute resources can be used when they are most available or least expensive, significantly lowering operational costs for end-users.  
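To make the distinction concrete, the sketch below is illustrative Python only, with a placeholder model call rather than any vendor’s API: it contrasts an online request that must be answered immediately with a batch queue that accumulates work and flushes it in a single pass when compute is idle or cheap.

```python
# Illustrative sketch only: contrasts online inference (answer now) with
# batch inference (queue requests, process together when compute is cheap).
# The model call is a stand-in; a real serving stack would replace it.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InferenceRequest:
    prompt: str


def fake_model(prompts: List[str]) -> List[str]:
    # Placeholder for a real model; batching amortises per-call overhead.
    return [f"answer to: {p}" for p in prompts]


def online_inference(request: InferenceRequest) -> str:
    # Latency-sensitive path: one request, answered immediately.
    return fake_model([request.prompt])[0]


@dataclass
class BatchQueue:
    # Throughput-oriented path: accumulate work, flush during an off-peak window.
    pending: List[InferenceRequest] = field(default_factory=list)

    def submit(self, request: InferenceRequest) -> None:
        self.pending.append(request)

    def flush(self, model: Callable[[List[str]], List[str]]) -> List[str]:
        prompts = [r.prompt for r in self.pending]
        self.pending.clear()
        return model(prompts)  # one large, efficient pass instead of many small ones


if __name__ == "__main__":
    print(online_inference(InferenceRequest("what time is sundown?")))

    queue = BatchQueue()
    for paper in ["paper_1", "paper_2", "paper_3"]:
        queue.submit(InferenceRequest(f"summarise correlations in {paper}"))
    # In practice the flush would be scheduled for when GPUs are idle or power is cheap.
    print(queue.flush(fake_model))
```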

There are also benefits for the tenants of these data centres in processing batch inference here.  

 The conventional wisdom among frontier model developers is that accessing more powerful GPUs from NVIDIA or another supplier is the best way to create better and more powerful AI.  

While Google, AWS and Microsoft are all busy creating their own AI chips, buying from NVIDIA is the go-to for now. To avoid falling behind in the AI race, these companies need to secure the latest, most powerful chips, which are released on a regular basis, often with notable performance increases.  

These chips are expensive. So rather than being used for a year and then cycled out as NVIDIA releases a new product, they can instead be transferred to support batch inference.  

“I was speaking with NVIDIA about this the other day,” Denworth reveals. “How do we build reference architectures? Do we build one for training and one for inference? Well, we can’t, because these machines get reborn, based upon different requirements and different dynamics.” 

Paul Roberts, Director of Technology, Strategic Accounts at AWS, is seeing this play out first-hand. 

“We’re seeing folks now training and inferencing on the same hardware,” he explains, whether that’s NVIDIA solutions or Amazon’s custom silicon. 

“We also have customers that are using older NVIDIA hardware, like the Hopper platforms – they’re still using them, and they are inferencing and training with them.” 

Roberts adds that AWS is always looking at the usage of existing compute and infrastructure across its facilities and cycles it out as usage drops to “free up more space and power”. 

 So far, everything seems quite simple. But what about when latency does become an issue?  

For some use cases, these large, remote AI factories will not suffice.   

If proximity to end users is crucial for the inference application, another approach needs to be considered.  

From the data centre to the device  

Starting at the other end of the spectrum from the large AI factory data centre, Switch Chief Strategy Officer Jason Hoffman draws comparisons with GPUs’ previous killer app, which happens to be latency sensitive itself: gaming.  

“We saw attempts like Google Stadia to use infrastructure to stream games to light devices. What’s been shown time and again is that it’s actually better, faster, cheaper, and easier to make a more powerful device than to build out infrastructure between the physics engine and the device,” he explains.  

Hoffman thinks the same thing will play out with AI.  

“People in infrastructure keep saying they’ll build dedicated inference infrastructure distributed in cities, but I can point to half a dozen historical examples of other computer workloads that followed the same pattern: devices got more powerful, and data centres became more centralised, while the middle continued to get commoditised.”  

Hoffman says the same happened with mobile devices. When the iPhone first came out, people thought it was an opportunity for telcos to build more services in their networks to serve these “weak” devices.  

But what turned out to be true?  

“For a given country, you basically need two, three, or four packet cores that are centralised and run the accounts and connections, while Apple and Samsung became some of the most valuable companies by making very powerful devices,” he says. 

“If you have a workload that has to run in a specific location, we need to support that,” he adds. “It’s either in a big data centre, on the device, or somewhere in between. Often these “edge services” or “inference nodes” will mostly be coordinating between what’s happening on the device and in big data centres.”  

Prem Ananthakrishnan, managing director and global software lead at Accenture, agrees – to an extent.  

“There’s always an intent to push as much as possible to the device, but the devices aren’t there yet – that’s part of the problem,” he says.  

“Currently, the practical ‘edge’ where inference models can run is probably in a colocation facility in the local metro network. As models become smarter and can run on actual edge devices, we’ll likely push capabilities even closer to the end user.”  

But he adds that the inference compute landscape is going to be extremely fragmented in the long run, and the opportunity for colo providers isn’t just as a stopgap.  

“You’ll have tiny models running on phones or laptops. Then there will be mid-sized models requiring more than what edge devices can handle, and colos may still have an opportunity to host these. The giant, context-hungry large models will eventually go to hyperscalers and neoclouds.”  

Where is the Edge?  

One of the firms building this middle-mile inference infrastructure is Flexential.  

“We’re not chasing the gigawatt campuses. We are chasing these edge inference nodes that are going to have relevant enterprise use cases,” says President and COO Ryan Malloney.  

More specifically, Flexential is focused on developing sites around 36MW, where it will allocate a portion of the data centre to an AI company or a private enterprise.  

“We’re looking at what I’d call the “middle edge” component, where you have strong network connections,” he adds.  

A handful of AI company customers are already asking Flexential for proximity to GPU as a Service companies. 

This goes as far as asking to be in the same data centre, but Flexential has found that offering space in a different facility within the same metro, connected via its inter-data centre connectivity service with 5-10ms of latency, is an adequate compromise. 

But as for why they need to be there, and how large this market will be in the long run, Malloney is unsure. 

“We don’t know why,” he says. “I haven’t seen a latency-sensitive inference model yet.” 

But someone who has is Hunter Newby, the founder of Newby Ventures.  

Newby says some major commercial banks are looking to use inference for fraud detection by capturing keystrokes as they are input into a keyboard or mobile device. 

This requires round-trip latency of 3ms or less, which current data centre infrastructure is not equipped to support outside of major metros served by internet exchange points (IXPs). 

Newby has mapped out all of the IXPs in the US, and the data shows that there are 14 entire states without a single one, let alone major urban areas close to end-users. 

As far as he’s concerned, proximity to these IXPs is the only way that this very low-latency real-time inference can be supported.  
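As a rough sanity check on why proximity matters, the back-of-envelope sketch below uses the standard rule of thumb that light in optical fibre covers roughly 200 km per millisecond; the 1 ms of processing time assumed in the example is a placeholder, not a figure from the banks or from Newby.

```python
# Back-of-envelope latency budget (illustrative assumptions, not measured figures).
# Light in optical fibre travels at roughly two-thirds of c, ~200 km per millisecond.
FIBRE_KM_PER_MS = 200.0


def max_one_way_fibre_km(round_trip_budget_ms: float, processing_ms: float) -> float:
    """Rough upper bound on one-way fibre distance for a given round-trip budget,
    after subtracting time spent on switching, queuing and the inference itself."""
    propagation_budget_ms = round_trip_budget_ms - processing_ms
    return max(propagation_budget_ms, 0.0) / 2 * FIBRE_KM_PER_MS


if __name__ == "__main__":
    # A 3 ms round trip with, say, 1 ms of processing leaves ~200 km of fibre each way,
    # and real routes are rarely straight lines, so the usable radius is smaller still.
    print(max_one_way_fibre_km(round_trip_budget_ms=3.0, processing_ms=1.0))
```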

As a result, Newby is embarking on a mission alongside non-profit Connected Nation to expand the number of IXPs in the US. Connected Nation has identified 125 hub communities where IXPs are needed. 

Ground was broken on Kansas’ first carrier-neutral IXP in Wichita in May 2025. 

“Local, carrier-neutral IXPs like the one we’re building in Wichita are essential to reducing lag time and enabling the next generation of AI-powered services to operate effectively and reliably,” Newby says. 

His vision for the AI infrastructure required to support this low-latency inference is for GPU clusters to be installed as close as possible to the IXPs, delivering the latency that enterprise or commercial end users need for optimal performance and customer experience. 

In less mature markets like Wichita, this isn’t necessarily an issue, but in developed markets like New York, Chicago, London or Frankfurt, power and land are at a premium, especially near the existing IXPs in the inner cities. 

Both Roberts and Dan Bathurst, Chief Product Officer of neocloud Nscale, agree that proximity to end users for AI is essential.  

“As AI adoption among consumers grows, the location of inference endpoints has become critical to both performance and cost,” Bathurst explains. 

“Placing compute closer to users and data sources reduces latency, improves the quality of the experience, and lowers the overhead of moving data long distances.”  

But, he acknowledges that most inferencing isn’t highly latency sensitive and can be done from regional hubs where low-cost power resides. 

“However, for certain scenarios, the need for speed outweighs the need for cost savings. 

“Consumer-facing services, such as speech and real-time video models, often require round-trip latencies under 100 milliseconds, which puts hard limits on how far you can be from population centres.” 

This is something that AWS is seeing as well. 

Roberts points to Amazon’s Rufus solution, a generative AI-powered conversational shopping assistant, as an example, stating that low-latency responses were shown to have an impact on checkout conversion.  

In this scenario, Roberts argues that using AWS availability zones alone will not suffice. Local Zones, which bring workloads even closer to end users, need to be employed as well. 

Are tier 1 markets ready for this? 

This focus on low-latency solutions raises the question of whether tier 1 markets are prepared to absorb this type of inference demand. 

As we’ve heard countless times, large training data centres have moved further afield partly due to legacy data centre hubs being heavily power-constrained, with a lack of suitable land. 

“The value of a MW for real-time inference in London is going to be worth more than ten times the value of 1MW for training in Iowa, just based on the supply-demand imbalance,” Newby says. 

Ben Balderi, a GPU-as-a-Service (GPUaaS) expert and company founder, adds some additional context. 

“In the US, which has abundant land and power capacity with easier regulatory frameworks for new power generation, larger out-of-town data centres will likely continue to make sense. It’s the proven hyperscaler model, and if hyperscalers are comfortable with the latency, neo-clouds will likely be satisfied too.” 

But Europe, including the UK, is very power-constrained. Balderi believes Europe doesn’t have the land, regulatory frameworks, or political will to build data centres in the same way. 

“Constrained markets face well-known challenges around power and permitting, which make scaling low-latency inference problematic,” Bathurst adds. 

Bathurst believes the industry has anticipated this and responded by focusing on density, efficiency and smarter runtime strategies. 

“In the near term, targeted pockets of metro capacity will cover many inference workloads, particularly when paired with efficient serving stacks and dedicated server endpoints for critical tasks,” he says. 

“However, this won’t necessarily hold as AI becomes deeply embedded in both consumer and enterprise applications and multimodal capabilities like video and speech generation mature, leading to the demand for low-latency inference expanding in metro areas.”  

If data centre density improvements and renewable investments don’t keep pace with this demand, some popular data centre hubs could face real pressure.  

Bathurst advises the industry to balance large-scale hubs, which deliver efficiency and economic benefit, with reserved metro capacity to meet latency-sensitive requirements.  

“This dual strategy helps ensure that customers can scale in a cost-effective manner while still getting the performance needed for applications where time is of the essence,” he explains. 

Balderi sees another solution, and it comes from an unexpected source. 

“The commercial real estate market is currently struggling as remote work has maintained its appeal after the pandemic and many office buildings are sitting half full,” he observes. 

“It’s not massive by data centre standards, but you might find 500 kilowatts here, 300 kilowatts there, depending on the building size and location. If that power isn’t being used due to low office occupancy, and you already have the grid connection, there’s potential to monetise it.” 

Balderi thinks that this currently unutilised power could be aggregated and used for a distributed AI inference platform.  

With rack densities much higher than for traditional workloads, space is not the issue; the challenge is getting the power and the hardware close to where people will be using it.  

For an enterprise, how much closer can you get than your own basement? 

Returning to on-prem 

This highlights another potential trend as AI inference develops: a return to enterprises hosting their own compute, either on premises or in a colocation facility.  

Not only is there the potential to establish micro inferencing clusters in the emptier real estate in major urban centres, but cost factors and control come into the equation as well. 

A report from the Uptime Institute published in January 2025 showed that dedicated infrastructure was cheaper than the cloud when utilisation rates were above 32.5%, for an NVIDIA DGX H100 hosted in a Northern Virginia data centre. 

This is just one example, and GPU per hour prices have dropped significantly from cloud providers this year, but for enterprises that anticipate heavy utilisation, the incentive to own and deploy their own hardware remains. 
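The arithmetic behind such a break-even is straightforward; the sketch below uses entirely hypothetical prices (not the Uptime Institute’s inputs or any provider’s rate card) purely to show how the utilisation threshold falls out of owned cost versus hourly cloud cost.

```python
# Illustrative break-even arithmetic only: the prices below are hypothetical
# placeholders, not the Uptime Institute's inputs or any provider's rate card.
HOURS_PER_YEAR = 8760


def breakeven_utilisation(owned_cost_per_year: float, cloud_price_per_hour: float) -> float:
    """Utilisation above which owning the hardware beats renting it by the hour.

    owned_cost_per_year: amortised capex plus power, space and support for the system.
    cloud_price_per_hour: on-demand price for an equivalent cloud instance.
    """
    return owned_cost_per_year / (cloud_price_per_hour * HOURS_PER_YEAR)


if __name__ == "__main__":
    # Example: $250k/year of owned cost vs a hypothetical $90/hour cloud rate
    # gives a break-even around 32% utilisation, in the same ballpark as the
    # figure cited above; the real answer depends entirely on actual prices.
    print(f"{breakeven_utilisation(250_000, 90.0):.1%}")
```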

Data sovereignty is important as well. Across Europe, the Middle East and APAC, an over-reliance on foreign, primarily US, tech providers is concerning AI developers and governments alike. 

“I think you’re going to see a lot of people that want to have their data remain, ideally, on-premises,” says Kevin Wollenweber, Senior Vice President and General Manager of Data Centre, Internet, and Cloud Infrastructure at Cisco. 

Balderi agrees that a not insignificant number of European SMEs are likely to want to avoid using a cloud environment, pointing to senior Microsoft executives who, speaking to the French Senate, could not guarantee that customer data would not be shared with US authorities if Microsoft were asked to do so under the US CLOUD Act.  

But in practice, Wollenweber acknowledges that this is much easier said than done. 

“The challenge is a lot of our facilities, and a lot of our enterprise customers’ facilities aren’t ready for the power and cooling requirements that we see,” he says. 

If these challenges can be overcome, though, Wollenweber thinks a hybrid model could start to emerge. 

“For enterprise applications closer to the datasets themselves, you’ll see more on-premises usage, and even hybrid approaches where companies use cloud resources for fine-tuning and then run inference locally within their infrastructure,” he predicts. 

This sentiment from enterprises is not lost on Roberts, and it’s something AWS is prepared to support with its Outposts solution, which enables AWS hardware to be deployed in a customer’s colocation or independent data centre.   

“That’s going to give you super low latency because then you could deploy open-source models directly to that if you wanted to,” he explains. 

Once again, the use cases and end-user experience are the ultimate drivers of where compute infrastructure will be deployed.  

To some extent practical challenges like energy availability, land use, security and sovereignty will impact the decision as well. 

“Our approach to how we’re looking at our data centres and where we put them, is always working backwards from customer demand,” Roberts summarises. 

For data centre developers, keeping track of the technological advancements in applications, use cases and compute infrastructure will be vital to make sure they can provide the right capacity in the right place at the right time. 

In a nutshell, larger, more remote AI factories are ideal for batch and compute-intensive inference where latency is not an issue, thanks to cheap power and pre-existing high-density compute at scale. 

But as latency starts to become important, metro colos, cloud availability zones and smaller sites closer to end users will be required – perhaps in more quantity than the industry is prepared for today. 

And finally, as AI capabilities grow and adoption increases, inference may move outside of neutral and cloud data centres altogether, either to end devices or on-premises facilities to enable sovereignty, control, speed and cost-reduction.

Building for inference may seem more familiar to the data centre industry than building for training. But given these complexities, familiarity does not mean simplicity.

Source: TheTechCapital.com.