Emperor Wears No Fabric

Updated: Jun 2, 2020


I am not an engineer, I do not work in IT, and I have no practical experience. This text likely contains some mistakes and misconceptions.

If you notice any, please let me know.

The point of this article is to bring its topic closer to the general public and introduce it into general discourse. As such, I welcome any insight from those who know more about it than I do.

Also consider downloading the PDF version of this article, which contains more than 30 footnotes.


In all but the laptop market, Intel is struggling to remain competitive against a resurgent AMD. What seemed impossible a mere three years ago has happened: while Intel is stuck squeezing all it can from old processes and even older architectures, AMD has caught up and, as of now, overtaken it.

But how did Intel get into this position? We all know the usual suspects, which I’ve mentioned – the 10nm woes, a decade of architectural stagnation, the advent of hardware security flaws…

All of these have been widely discussed and expanded upon, while one culprit has so far curiously escaped scrutiny – the fabric.

The wires and signals and protocols pulling a chip, a system, together.

No system can be stronger than the interconnect holding it together, and that is a fact Intel has been feeling more keenly than at any time since the Athlon 64.

Let us discuss the protocols and their design, the most complicated element of a fabric.

I. Those Who Don’t Learn From History Will Repeat It

We hear talk of the one-two punch of Sandy Bridge and Ivy Bridge, or the vision of the potential future repeat of that historic moment with Zen 2 and Zen 3.

But back in the early 00’s, the dynamic duo was the K7 (Athlon XP) and the K8 (Athlon 64), and their impact was only limited by Intel using dirty tricks to stem the tide, until it could respond – very quickly – with the legendary Core architecture.

While the success of the Athlon XP owed in significant part to the infamous NetBurst’s failure to reach 10GHz, the Athlon 64 and the original Opteron were a true revolution – possibly the biggest in the x86 space, ever. In 2003, AMD, smelling blood in the water, went in for the kill with the first 64-bit x86 CPU (starting a trend of being ahead of software – Windows would not have a 64-bit version for two whole years) and, more importantly for us, the adoption of a fabric – one which also underlies the success of Zen – HyperTransport.

Fig 1: Evolution of 2P. Traditional, Athlon MP and Opteron. KarbinCry.

Before the K8 architecture, the heart of any computing system was not the CPU, but the chipset.

The chipset was composed of two chips. The southbridge is what we call the chipset today – additional I/O for slower or secondary interfaces, like SATA or USB. The northbridge was at the center. Not only did it control all I/O like PCI, the memory (through the memory controller), and the southbridge, it was the only component directly linked to the CPU. If we imagine a modern computing system as a diagram, at its center will be the CPU, with many branches leading to the chipset, PCIe, NVMe… back then, the northbridge was the center, and even the CPU revolved around it.

K8 changed all this.

It had an on-die integrated memory controller (IMC), cutting memory access latency almost in half, and HyperTransport took over many other traditionally northbridge functions – most notably socket-to-socket communication on server boards and the link between the CPU and the chipset.

While the IMC was a big deal for the mainstream market, it was in servers that both novelties provided the greatest benefit, especially in multi-socket systems.

Previously, the two (or more) CPUs talked to each other through the northbridge. While this did have some benefits, there were major drawbacks. At that time, the chipset used a 64-bit wide interface (the front-side bus – FSB) to send data to the CPU. In multi-socket systems, this link was shared across the CPUs, significantly decreasing the bandwidth available to each. AMD had already tried to fix this issue with the K7, as the northbridge for the Athlon MP (the K7 server chips) was set up to provide a separate 64-bit link to each CPU, but this made the chipset much larger. While AMD could have made 4P boards for the Athlon MP, they never did, because building a 4×64-bit northbridge would simply have been too expensive to be worth the effort.

With Opteron, each CPU has its own access to memory, the chipset is greatly reduced, and each CPU communicates with the others directly, through its 3 HyperTransport links dedicated to socket-to-socket communication. Making 4P or even 8P boards was no problem, and while the NUMA nature of such a system did require optimization (especially for the 8P variant), the benefits in memory bandwidth and capacity were incredible for the time, especially compared to multi-socket Xeons.

Fig 2: Meditation on the Origin of Pentium D. A Local Priest of 10GHz, Bloax.

There was another interconnect – the System Request Interface – on the K8 die, which made the architecture ready to be used in a multi-core design. In 2003, AMD said they were just waiting for a new process node to make such a die small enough to be economical. And indeed, the Athlon 64 X2 came out in 2005, together with the also dual-core Pentium D – however, the two NetBurst cores of the Intel offering still had to communicate through the chipset, despite being on the same die.

It took Intel another year to release Core 2, their first true multi-core product.

All three big innovations of the K8 are with us to this day. AMD64 was the first major expansion of the x86 ISA not made by Intel, and it forced a cross-licensing agreement which protected AMD’s access to the x86 license. Multi-core design replaced the mirage of 10 GHz. And HyperTransport has been the foundation of AMD’s products ever since.

It took Intel years to catch up back then.

Unfortunately for their customers and shareholders, Intel failed to learn their lesson.

II. If You Like It, Put a Ring on It

While the Core architecture was Intel’s first true multi-core SoC, it’s not very interesting to us. The cores communicated through a wide point-to-point connection to the shared L2 cache. Point-to-point is one of the best performing topologies, but also one of the most costly.

We then skip straight to Sandy Bridge, and its iconic ringbus. Sandy Bridge brought more, and larger, cores to the mainstream, and included an integrated GPU, which also needed access to system resources.

Ringbus is a very simple, elegant, and effective design. While being cheap and easy to route – each packet can only go in one of two directions – it is very fast, low-latency, and scales well up to 6-10 cores (assuming cores of similar width to current Skylake cores).

The single ring is composed of 4 physical circuits – the big one for data, the request and acknowledge rings, and the snoop ring. Infinity Fabric uses a similar approach, being divided into SDF (Scalable Data Fabric) and SCF (Scalable Control Fabric) planes. Because network commands are usually smaller in size, dividing the fabric into data and command sections allows the command section to run faster, or use a narrower link.

Aside from limited scaling to higher core counts, what are the disadvantages of the ringbus?

Fig 3: Jammed ringbus. New packets in yellow, delivered packets in red. KarbinCry.

Let’s start with one that is a benefit to the ringbus itself, but arguably a problem for Intel in general – the separate snoop ring. Snooping is a method for preserving cache coherence, making sure each unique cache address leads to the same data. If this coherency isn’t maintained, a core that needs information from the cache can end up with two different, conflicting sets of data. Obviously then, keeping shared caches coherent is crucial, and four different modes of snooping were implemented. Each variant has its pros and cons, and the separate snoop ring allowed the use of all four without a big overhead penalty. But to remain compatible, other interconnects, some of which could not accommodate this protocol so easily, had no choice but to adopt it as well.

Now to the limit inherent to, and impacting, the ringbus itself – bandwidth. Each stop (node) on the data ring can send 32 bytes per clock. If there are two stops, 64 bytes can be “in” the network; with three stops, it’s 96 bytes. While this sounds like perfect scaling, in effect it isn’t. Many routes will be longer than a hop between two adjacent stops, which can cause “traffic jams”, and sometimes 32B per stop might not be enough, meaning some data has to be moved in multiple parts, adding cycles of latency. Compounded with the latency increase inherent to the number of stops – if you add node C in between nodes A and B, the latency from A to B (for a 32B packet) goes from 1 cycle to 2 – the ringbus has severe scaling difficulties, which worsen more and more rapidly as you add cores (or other nodes).
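To make the scaling concrete, here is a small back-of-the-envelope sketch in Python – my own illustrative model, not Intel’s numbers – assuming uniform traffic, one hop per cycle, and packets always taking the shorter direction around the ring:

```python
def avg_hops(n):
    """Mean shortest-path hop count between two distinct stops on an
    n-stop bidirectional ring (uniform traffic, shorter direction taken)."""
    dists = [min(d, n - d) for d in range(1, n)]
    return sum(dists) / len(dists)

for n in (4, 8, 10, 16, 24):
    print(f"{n:2d} stops: avg {avg_hops(n):.2f} hops, worst case {n // 2}")
```

The average and worst-case hop counts both grow linearly with the number of stops, while per-stop injection bandwidth stays fixed at 32B – which is exactly why the “perfect scaling” above is illusory under real traffic.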

The network is also essentially unbuffered (there are only very small buffers), which means that at any point, there can only be as many 32B packets in flight as there are stops. If enough packets need to travel a long distance, this limits the nodes’ ability to put new packets on the bus, leading to stalls. With buffers – a “waiting room” – some of those would be prevented; the bigger the buffers, the fewer the stalls. But a proper, solid buffer can take up significant area, while one of the benefits of the ringbus was its minuscule footprint. See Fig 3, which depicts 3 cycles of a very basic ring network, with packets annotated with their origin and destination (0;4 is a packet from core 0 sent to core 4). It is very much the worst-case scenario, but it shows a situation where most of the cores are stalled most of the time. In principle, a similar thing can happen in any ring topology, just at higher network saturation.
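You can reproduce this stall behavior with a toy model. The sketch below is my own simplification (a unidirectional slotted ring, unlike Intel’s bidirectional one, and nothing like the real protocol), but it captures the key rule: with no buffering, a node can only inject when the slot passing it is empty, and every refused injection is a stall.

```python
import random

def simulate_ring(n, packets_per_node, seed=0):
    """Toy slotted ring: one slot per stop, no buffering, one hop per cycle.
    Returns (cycles to deliver everything, number of injection stalls)."""
    rng = random.Random(seed)
    # Every node starts with packets for random other nodes.
    pending = [[(i + rng.randrange(1, n)) % n for _ in range(packets_per_node)]
               for i in range(n)]
    slots = [None] * n               # slots[p] = destination of packet at stop p
    delivered = stalls = cycles = 0
    total = n * packets_per_node
    while delivered < total:
        cycles += 1
        for p in range(n):           # packets that reached their stop leave the ring
            if slots[p] == p:
                slots[p] = None
                delivered += 1
        for p in range(n):           # inject into empty slots; otherwise stall
            if pending[p]:
                if slots[p] is None:
                    slots[p] = pending[p].pop()
                else:
                    stalls += 1
        slots = slots[-1:] + slots[:-1]   # the ring advances one hop
    return cycles, stalls

cycles, stalls = simulate_ring(8, 4)
print(f"8 stops x 4 packets each: {cycles} cycles, {stalls} injection stalls")
```

Raise the node count or the per-node load and the stall count climbs quickly – the unbuffered ring saturates exactly the way Fig 3 illustrates.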

How do you get more data through? You can make the traffic faster, or you can make the road wider.

Ringbus is very fast already, not much headroom there.

What remains is width. Because of the low buffering and years of optimizations, widening the ringbus to, say, 64B would require a redesign of the cores and caches. But what can be done quite easily is putting in a second 32B ring – and that’s exactly what Intel did for its Xeon line. This dual ringbus has a disadvantage in more complex routing (packets decide not only which way to go, but also which ring to use), and adding further rings quickly diminishes the returns (hence, no triple ring), so it isn’t a long-term solution, but it’s something Intel could do on its mainstream CPUs right now.

The ringbus also quickly gains latency as you add nodes. Moving data from one node to the next takes one clock, so every added node increases latency linearly.

This has been a problem for Xeons since Sandy Bridge, and Intel gradually evolved their solution, arriving at a chained two-ring design with the Haswell architecture.

Each ring services at most 10 cores, and the rings are connected by two switches, spaced evenly around them. If core 0 needs to communicate with core 19, it doesn’t need to go through 19 nodes (and cycles) – just, at most, 5 plus the switch.
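A quick hop-count model makes the benefit obvious. This is my own idealized sketch (the switch positions and the +1 hop for crossing a switch are assumptions, not Intel’s actual layout):

```python
def ring_dist(a, b, n):
    """Shortest-path hops between stops a and b on an n-stop bidirectional ring."""
    d = abs(a - b) % n
    return min(d, n - d)

def chained_dist(a, b, n=10, switches=(0, 5)):
    """Hops in a toy two-ring chained layout: cores 0-9 on ring 0, cores 10-19
    on ring 1, switches at the same two positions on both rings, +1 hop to
    cross a switch."""
    ra, rb, pa, pb = a // n, b // n, a % n, b % n
    if ra == rb:
        return ring_dist(pa, pb, n)
    return min(ring_dist(pa, s, n) + 1 + ring_dist(s, pb, n) for s in switches)

flat_worst = max(ring_dist(0, b, 20) for b in range(20))
chained_worst = max(chained_dist(a, b) for a in range(20) for b in range(20))
print(f"flat 20-stop ring: {flat_worst} hops worst case; chained rings: {chained_worst}")
```

In this model the worst case drops from 10 hops on a flat 20-stop ring to 6 on the chained pair – in the same ballpark as the “5 plus the switch” figure above.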

This sounds great, so why couldn’t you go to three, or more, chained rings?

We can glean a possible reason from how a CPU with two rings behaves. It is in many ways more akin to a dual-socket configuration. Despite being on the same die, there was enough of a latency hit going from one ring to the other that Haswell and Broadwell could be set up such that each ring formed its own NUMA domain.

NUMA configuration was well known at that point, but this added another layer of complexity, and as I understand it, the internal topology of the CPU was not exposed to software in the same way as traditional multi-socket NUMA, which necessitated specific optimizations on the side of the OS or the individual program.

Furthermore, while both rings had half of the PCIe and other I/O, only one ring had a memory controller, leading to asymmetry. Not only was memory much farther for the secondary ring, the rings also had an unequal number of nodes (as the primary ring had an extra node – the IMC).

This creates a clear “weak” node.

Take an example from AMD. The 32-core EPYC Naples was composed of 4 Zeppelin dies, each with some directly attached memory. This allows a program to place itself in memory which is directly connected to the cores executing it. That is the normal NUMA optimization. But with the Threadripper 2990WX, which is the same silicon, two Zeppelin dies have their memory controllers disconnected, leading to performance degradation on those two dies, as all memory accesses they make have to go through another die.

Optimization then becomes harder, again. Just consider how long it took Windows to properly schedule for the 2990WX – and now imagine a CPU with a similar problem, but one in which the OS doesn’t immediately see it.

While dual ringbus and chaining allowed Intel to go all the way to 24 cores with Broadwell, it was now a complicated, cumbersome – and most importantly, tapped-out – interconnect.

Something new was required to take core count higher.

III. Gridlocked

After reaching the limits of a two-ring topology, and realizing adding another ring would be too complicated and have too many disadvantages, Intel got to work on a new intra-chip interconnect, following a mesh topology.

Mesh is a well-known topology, one of the best understood ones. Even Intel had direct experience using it – starting in 2007 with the Polaris chip (not to be confused with AMD’s GPU architecture of the same name), a many-core design which paved the way for Xeon Phi.

But the version Intel used for the Xeon Scalable lineup and for the Knights Landing Xeon Phi is significantly different, and we can trace its research as far back as 2014, three years before it was used.

Mesh is a great choice, especially for large core-counts on monolithic dies. It is a very scalable arrangement, and while distance increases latency, it does so much more gradually and gently than a ring.

Each node in a mesh is called a tile. Each tile has 4 connections – North, South, East and West. The farther the destination node is, the more routes a packet can choose from. This increases bandwidth – if one route or connection is full, the packet can go around. You could even optimize based on latency, having packets of varying importance, with the most crucial ones being prioritized and given the shortest route.

Such extensive flexibility requires much more complex routing logic, as each router has to decide where to push a packet to take it toward its destination, choosing among 4 directions at every hop.

Fig 4: Diagram of Skylake X, showing the two sub-NUMA clusters. KarbinCry.

That is why Intel uses a very rudimentary routing scheme: all packets first go North or South, and then East or West. As simple algorithms go, it is a good one; nevertheless, a lot of potential is left on the table. Even then, the Caching/Home Agent and Converged Mesh Stop take up around 10% of a tile.
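This “vertical first, then horizontal” rule is classic dimension-ordered routing, and it can be sketched in a few lines (my own illustration, not Intel’s implementation):

```python
def route(src, dst):
    """Dimension-ordered route on a mesh, as described above: go North/South
    first, then East/West. Returns the list of (row, col) tiles visited."""
    (r, c), (dr, dc) = src, dst
    path = [(r, c)]
    while r != dr:                       # vertical leg first
        r += 1 if dr > r else -1
        path.append((r, c))
    while c != dc:                       # then the horizontal leg
        c += 1 if dc > c else -1
        path.append((r, c))
    return path

print(route((0, 0), (2, 3)))
```

Each source-destination pair always yields exactly one path – no congestion awareness, no alternate routes – but a big reason such schemes are popular is that dimension-ordered routing is provably deadlock-free, which keeps the routers tiny.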

Another quirk of Intel’s mesh is the IMC tiles. There are two, one on each “edge” of the mesh, fully integrated into the grid. This creates a NUMA-like situation, similar to the chained-ringbus chips, albeit not as significant, as both clusters have direct access to some memory.

It doesn’t matter how well you build a road network, a couple traffic lights with bad timings can make traffic hell all on their own.

To make the mesh as great as it can be, Intel needs to beef up the routing logic, and since even now it takes up the space of some 3 cores, that isn’t possible, within a reasonable die size, on a monolithic, planar design.

IV. One Way Highway to Hell

So far we’ve focused on the intra-chip fabrics, Intel’s responses to AMD’s multi-core innovation.

Now, with all that we’ve learned, we can get to the main event, red versus blue: HyperTransport versus QuickPath Interconnect (nowadays, Infinity Fabric and UltraPath Interconnect). This is the class of interconnects which will drive the chiplet and stacked era.

It took Intel only one year after AMD to release genuine dual- and quad-core chips. But it was 5 whole years before they had a competitor to HyperTransport in inter-chip communication.

Work on YAP (Yet Another Protocol, an early name for QPI) began at least in 2004 (likely as early as 2003, when Opteron launched), but the first chips with QPI debuted in 2008, with the Nehalem Xeons.

The design was focused on latency (and thus, speed) and sturdiness. The result was a 20-bit wide link with a 16-bit effective data width, and speeds greatly exceeding those of HT. Two transfers are made each clock, for a total per-clock bandwidth of 4 bytes. The link is composed of 4 quadrants, each 5 bits wide. If the connection is damaged or unstable, the interconnect will switch from 4 quadrants to two, or even just one, to maintain a stable connection.
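Here is the link arithmetic spelled out, using the figures above (a back-of-the-envelope check of my own, not an official datasheet):

```python
# Link arithmetic for a QPI-style link, per the figures in the text.
LINK_BITS = 20            # physical width of the link
DATA_BITS = 16            # bits carrying actual payload
TRANSFERS_PER_CLOCK = 2   # two transfers per clock

bytes_per_clock = DATA_BITS * TRANSFERS_PER_CLOCK / 8
overhead = 1 - DATA_BITS / LINK_BITS
print(f"{bytes_per_clock:.0f} B/clock, {overhead:.0%} overhead")

# Degraded-width fallbacks: 4 quadrants of 5 bits each.
for quadrants in (4, 2, 1):
    print(f"{quadrants} quadrant(s): {quadrants * (LINK_BITS // 4)} bits wide")
```

The 20% figure here is the same overhead number the IFIS comparison below hinges on.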

Four snoop modes were implemented, the same as with the ringbus. UPI, the modified version used in Xeon Scalable, retains only directory-based snooping, to lower overhead.

Now compare it to Infinity Fabric 1.0 and its IFIS (Infinity Fabric InterSocket).

IF 1.0 is also 16 bits wide, but sends 8 serialized transfers every clock, for a per-clock bandwidth of 16 bytes. Compared to QPI and UPI, which have a 20% overhead (4 out of 20 bits are not used for data), IFIS has an 11.11% overhead. Infinity Fabric should also be able to dynamically narrow the connection to maintain signal integrity, but I don’t know if this is implemented – probably not; however, if it is needed, it should be easy to implement.

Per clock, IFIS is clearly superior. But we know QPI was built for speed, can it catch up?

The latest version of UPI can run at up to 5.2GHz, while IFIS can only clock up to 1333MHz, making UPI almost four times as fast and equalizing the two fabrics’ theoretical performance.
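Multiplying it out with the numbers used in this article (my own arithmetic; real links also pay protocol and encoding costs this ignores):

```python
def gbps(bytes_per_clock, clock_ghz):
    """Theoretical one-direction bandwidth in GB/s: bytes per clock x clock rate."""
    return bytes_per_clock * clock_ghz

upi  = gbps(4, 5.2)      # 4 B/clock at 5.2 GHz
ifis = gbps(16, 1.333)   # 16 B/clock at 1333 MHz
print(f"UPI: {upi:.1f} GB/s, IFIS: {ifis:.1f} GB/s")
```

Both land at roughly 21 GB/s, which is what “equalizing the two fabrics’ theoretical performance” amounts to – on paper.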

In practice though, IFIS still has the upper hand. Imagine a narrow road with traffic going at 75 MPH, and a wide road with traffic going at 50 MPH. While in ideal circumstances both might let the same number of cars pass per hour, if anything goes wrong, the narrow road will be impacted much harder.

And such is the situation with the two interconnects. QPI/UPI is undoubtedly great for small snippets of data, while IFIS needs to fill a 16B packet, but in a modern computing environment, with higher and higher I/O bandwidths, this single advantage is far from enough.

Now we have been talking solely about the interconnects. Let’s talk about how they are used.

There is no Intel CPU with more than 3 UPI links. However, each Zeppelin die has its own IFIS link and 3 IFOP links. This means every 8 cores have a link of their own! Compare this to Cascade Lake AP, where two 28-core dies have to make do with a single link.

UPI is a bit bigger than IFIS/IFOP, and the whole implementation is different, of course. UPI was never built for these huge core counts, for massive I/O bandwidths, or to connect chiplets.

While QPI started as greatly superior to HyperTransport, the openness of HT allowed it to change, to adapt, in ways UPI simply couldn’t, and likely still can’t.

QPI/UPI is too big, too narrow, too inflexible to carry Intel into this year, much less to the future.

V. One Ring to Rule Them All, in Hierarchy Bind Them

Finally, the ground has been laid, the limits of Intel’s current fabrics exposed, and we can turn to ways they can fix them, starting with the ringbus.

Fig 5: 16-core Hierarchical Ringbus. KarbinCry

Ringbus seems quite hopeless, as chaining rings together failed, but hear me out – what if we put in multiple rings and chain them together?

Of course not quite the way used on Haswell and Broadwell, but with a hierarchical setup.

This topology has several local rings, all separate from each other, servicing their own islands of cores and caches. A global ring would then connect the local rings, and also include all I/O.

The local ringbus could be Intel’s standard design, with the same old structure and protocol and only a few necessary alterations, while the global ring could be scaled much more flexibly, free from the constraints of being directly tied to the width of the core.

Routing obviously becomes more complex, but still manageable, using new packaging techniques.

By which I mean chiplets and 3D stacking.

The global ring, routers, and switches would be on a big “interposer” die, along with I/O, the iGPU, or whatever else, while the cores would be on tiny chiplets of 4 to 8 cores. Even a configuration similar to EPYC Rome – 8 chiplets with 8 cores each – should be possible in this basic design. Using a dual ringbus on the chiplets would enable another core-count explosion, not to mention we could try to make the global ring faster or wider, allowing it to handle more local rings. And if we put in more switches per local ring, we can again increase the core count of the individual chiplet...
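To put a rough number on the idea, here is a crude hop count for the hierarchical layout (my own counting, with one switch per local ring; it ignores switch traversal cost and I/O stops on the global ring):

```python
def worst_hops(local_n, n_rings):
    """Worst-case hops in a hierarchical ring: farthest core to its ring's
    switch, half-way around the global ring, then switch to farthest core."""
    local_leg = local_n // 2     # half-way around a bidirectional local ring
    global_leg = n_rings // 2    # half-way around the global ring
    return local_leg + global_leg + local_leg

# 8 local rings of 9 stops each (8 cores + the switch), Rome-like as above:
hier = worst_hops(9, 8)
flat = 64 // 2                   # flat bidirectional ring with the same 64 cores
print(f"hierarchical: {hier} hops worst case vs flat ring: {flat}")
```

Even this naive count suggests the hierarchy cuts the worst-case path to a fraction of a flat 64-stop ring, which is the whole point of the arrangement.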

The solution does sound rather expensive, but that impression is, I believe, erroneous.

The big base die with the global ring wouldn’t cost as much as the Cascade Lake SP dies, given it would be much less dense, improving yields, and the chiplets would have almost 100% yield on 14nm. The enhanced binning allowed by tiny core dies would also enable much wider and more meaningful product segmentation.

Even if I’m wrong and the silicon and packaging cost would increase dramatically, Xeon’s huge margins will sustain it.

As for the consumer market, the switch might take longer, but it will happen. We can already see Intel is willing to absorb big packaging costs with Lakefield.

The hierarchical ringbus is the most “Intel” solution I can offer. It uses their most successful and best-known interconnect, and retains low latency while gaining enough scalability for at least 5 years.

It is elegant, simple, powerful, and interesting by being completely different from AMD’s approach.