DGX Spark, Strix Halo, and Bandwidth in AI Inference Performance – Indications from EPYC and Threadripper

TL;DR
  • Memory bandwidth significantly impacts AI inference speed, with higher bandwidth improving performance.
  • The AMD EPYC 7702 (200 GB/s), Nvidia DGX Spark (275 GB/s), AMD Strix Halo (~256 GB/s), and AMD Threadripper PRO 7995WX (350 GB/s) show varying performance based on bandwidth.
  • It seems likely that systems with bandwidth between 200–350 GB/s, like the DGX Spark and Strix Halo, offer a middle ground for AI tasks.
    • However, expandable RAM in all-in-one systems is limited.
  • The evidence leans toward bandwidth being crucial for handling large AI models, but other factors like core count also matter.
  • Certain architectures perform better with CPU inference, like the DeepSeek V2 architecture.

Introduction

Memory bandwidth is a key factor in how fast HPC and AI systems can process data, especially for tasks like running large language models. I just did an observation video on the AMD EPYC 7702 and AMD Threadripper PRO 7995WX to get some hard numbers on how bandwidth affects their performance for the models in scope and what to expect.

The Nvidia DGX Spark and AMD Strix Halo, however, are a bit more theoretical: there are no well-documented tests available yet, but we likely do NOT need that data to infer likely system performance. So why does bandwidth matter? Let's learn a bit, compare these systems, and speculate on the likely performance impact. I do not have a Spark or a Strix Halo to test, so this is educated guesswork. Chime in on the video with your thoughts!

Bandwidth and AI Inference in LLMs

Memory bandwidth is how quickly data moves between memory and the processor, measured in gigabytes per second (GB/s). For AI inference, where a model's active weights have to be streamed from memory for every generated token, higher bandwidth means faster data access and less waiting. Tokens per second cannot be distilled down to bandwidth alone, but bandwidth is a major factor. This is one of the reasons the DeepSeek V2 architecture ushered in such a nice improvement in token efficiency: its mixture-of-experts design only activates a fraction of the total parameters per token, so far less data has to move through memory.
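
To make that concrete, here is a minimal back-of-the-envelope sketch (Python) of the bandwidth ceiling on decode speed. It assumes a dense model streams roughly all of its weights once per generated token and that Q4 quantization works out to about 0.5 bytes per parameter; these are illustrative ceilings, not benchmarks.

```python
# Rough ceiling: tokens/sec <= bandwidth / bytes read per token.
# Illustrative assumption: one full pass over the active weights per token.

def rough_tokens_per_sec(bandwidth_gbs: float,
                         active_params_b: float,
                         bytes_per_param: float = 0.5) -> float:
    """bandwidth_gbs: theoretical memory bandwidth in GB/s.
    active_params_b: parameters touched per token, in billions
                     (full size for dense models, active experts for MoE).
    bytes_per_param: ~0.5 for Q4 quantization, 2.0 for FP16."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# A 70B dense model at Q4 on ~200 GB/s vs ~350 GB/s of bandwidth:
print(round(rough_tokens_per_sec(200, 70), 1))   # ~5.7 tok/s ceiling
print(round(rough_tokens_per_sec(350, 70), 1))   # ~10.0 tok/s ceiling

# DeepSeek V2 activates roughly 21B parameters per token, which is a big
# part of why CPU inference on that architecture feels usable:
print(round(rough_tokens_per_sec(200, 21), 1))   # ~19.0 tok/s ceiling
```

Real-world numbers land below these ceilings because of compute limits, KV-cache reads, and NUMA effects, but the relative ordering tracks bandwidth closely.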

x86 Processors vs Specialized SoC Inference platforms

Here’s how the processors stack up based on bandwidth:
  • AMD EPYC 7702: ~200 GB/s class (8 memory channels; mine runs cheap DDR4-2400), good for data centers but without a doubt slower for large AI tasks. Oddly, you can get surprisingly decent performance on massive models like DeepSeek R1.
  • Nvidia DGX Spark: 275 GB/s, designed for developers, but also lauded as the household central AI, which frankly sounds cool, and the product looks awesome. Small gold box, yes please.
  • AMD Strix Halo: ~256 GB/s, a competitive option against the DGX Spark, but is bandwidth the measure that will matter? The 128 GB of RAM and how it performs seems interesting to evaluate.
  • AMD Threadripper PRO 7995WX: ~350 GB/s, using 8 channels of DDR5-6400. Fast and, well, kinda crazy.
Benchmarks show the Threadripper 7995WX (350 GB/s) outperforming the EPYC 7702 (200 GB/s) in tokens per second, but oddly not by as much as I would have anticipated on Gemma 3.
| Processor | Memory Type | Memory Channels | Memory Speed | Bandwidth (GB/s) | Core Count |
| --- | --- | --- | --- | --- | --- |
| AMD EPYC 7702 | DDR4 | 8 | 2400 MT/s | ~200 | 64 |
| Nvidia DGX Spark | LPDDR5x | Unified | – | 275 | – |
| AMD Strix Halo (Ryzen AI MAX+ 395) | – | – | – | ~256 | – |
| AMD Ryzen Threadripper PRO 7995WX | DDR5 | 8 | Up to 5200 MT/s | 350 | 96 |
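
For reference, theoretical peak bandwidth is just channels × transfer rate × bus width. Below is a quick Python sketch to sanity-check the table; the MT/s values are assumptions matching the configurations above. Note that 8 channels of DDR4-2400 actually works out closer to ~154 GB/s, with the ~200 GB/s figure corresponding to the EPYC 7702's rated DDR4-3200 maximum.

```python
# Theoretical peak = channels * transfers/sec * bus width (a DIMM channel is 64-bit = 8 bytes).

def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    return channels * mega_transfers * 1e6 * bus_bytes / 1e9

print(peak_bandwidth_gbs(8, 2400))  # EPYC 7702 with DDR4-2400   -> 153.6 GB/s
print(peak_bandwidth_gbs(8, 3200))  # EPYC 7702 at rated DDR4-3200 -> 204.8 GB/s
print(peak_bandwidth_gbs(8, 5200))  # TR PRO 7995WX at DDR5-5200 -> 332.8 GB/s
print(peak_bandwidth_gbs(4, 8000))  # Strix Halo's 256-bit LPDDR5X-8000, modeled as 4 x 64-bit -> 256.0 GB/s
```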

Are Specialized SoC Designs the Future of Local AI Hosting?

Systems like the DGX Spark and Strix Halo, with peak theoretical bandwidths around 250–275 GB/s, seem poised to offer a middle ground, potentially performing between the EPYC and the Threadripper. However, one of the glaring items that stands out to me is that the memory in these all-in-one designs is fixed at purchase; you cannot expand it later.
Moreover, as models grow larger, the need for high bandwidth intensifies. This is a problem for platforms that are not running at future DDR6 or DDR7 speeds; we are currently in the DDR5 cycle and, frankly, it is not cheap. When we look at what model size a user might be buying a system for, DIMM slots do allow for meaningful upgrades in capacity and speed, but the soldered, unified memory in these SoC systems does not.
The DGX Spark and Strix Halo are positioned to enter the local AI scene in a major way and will likely have a directional impact on the AI inference landscape. Their bandwidths, sitting between the EPYC and the Threadripper, suggest they could be ideal for developers and households seeking cost-effective solutions for large models, but at the tradeoff of overall token speed. That may sound a bit harsh, but DeepSeek at 3–4 tokens per second on the EPYC 7702 rig is really not the ideal experience, and there is likely going to be a large performance gap for 30B–70B models on these machines, similar to the EPYC. The sketch below shows why.
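
Here is a rough grid of bandwidth-bound decode ceilings for dense 30B and 70B models at Q4, using the same back-of-the-envelope estimate as earlier. All figures are assumptions, not measurements, and real throughput will land well below them; the relative gap between the systems is the point.

```python
# Decode-speed ceilings (tok/s) for dense models at ~Q4 (0.5 bytes/param),
# bandwidth-bound estimate only; real-world results will be lower.

systems = {"EPYC 7702": 200, "Strix Halo": 256, "DGX Spark": 275, "TR PRO 7995WX": 350}
model_sizes_b = [30, 70]   # dense parameter counts, in billions
bytes_per_param = 0.5      # assumed Q4 quantization

for name, bw_gbs in systems.items():
    ceilings = [bw_gbs / (size * bytes_per_param) for size in model_sizes_b]
    print(f"{name:14s} 30B: {ceilings[0]:5.1f} tok/s   70B: {ceilings[1]:5.1f} tok/s")
```

Even on the fastest of the four, a dense 70B sits around a ~10 tok/s ceiling before any real-world overhead, which lines up with the large-gap concern above.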

Conclusion

Which models you like and enjoy remains the best way to size a system; build for what you know at first.
Building a system that can scale some, however, is also a very keen idea, and the ability to scale on enterprise hardware is top notch. In an embedded SoC setup with one PCIe slot, not so much.
Why exactly this CPU has to be a laptop or SoC design on the AMD side is an interesting question.
On the NVIDIA side, I would expect to see more unknowns on the performance front at launch.
I would not buy either of these systems based on what I know right now, but I am also very much NOT the target demographic. I am a tinkerer and a builder, not a buyer. The all-in-one designs we are seeing seem to conform to what I think is a homogenization trend in the homelab scene over the last year or so. Trendy is working more than ever before, and a lot of folks are out at the mention of a build.
I look forward to seeing the hard numbers when they land, but for now I think I have a pretty decent guess as to the performance we can expect to see at launch.