This post moves from model math into the machine that actually runs the kernels: cores, caches, NUMA domains, MPI ranks, RDMA-capable fabrics, and the memory-layout discipline needed for distributed CPU AI.
Distributed computing sounds like a cluster problem, but the real lesson starts much smaller. One core has registers and private caches. A few cores share a last-level cache. A socket has memory channels. A multi-socket machine has NUMA domains. A cluster has nodes connected by Ethernet, InfiniBand, or another fabric. Every layer changes the cost of moving data.
This is why CPU AI systems cannot be designed only from peak FLOPs. FLOPs matter, but memory placement, cache coherence, thread affinity, socket topology, network bandwidth, and failure recovery decide whether those FLOPs are usable. systems lens A CPU cluster is not a weak GPU. It is a different memory and scheduling machine. The runtime has to respect that difference.
Roadmap for this post
Sections 1 through 3 explain the hierarchy: SIMD inside a core, OpenMP/pthreads inside a process, NUMA inside a server, and MPI across servers.
Sections 4 through 7 explain MESI/cache coherence, false sharing, sub-NUMA clustering, first-touch allocation, and why memory locality matters for inference and training.
Sections 8 through 11 connect these ideas to C-Kernel-Engine: MPI first, InfiniBand/RDMA underneath, raw RDMA later, registered bump arenas, topology discovery, hugepage allocation, orchestration-level parallelism, and the future distributed CPU runtime.
Section 1: The Hierarchy Of Parallelism
A serious CPU runtime has several levels of parallelism available. They are not interchangeable. SIMD handles multiple numbers inside one instruction. A threadpool or OpenMP region handles multiple cores inside one process. NUMA-aware placement keeps threads close to the memory they read. MPI handles multiple processes, usually across nodes.
| Layer | Mechanism | Scope | CKE relevance |
|---|---|---|---|
| Instruction | SIMD / AMX / SVE2 | one core | vectorized kernels |
| Core group | pthread threadpool / OpenMP | one process | row partitioning, token parallelism |
| Socket | NUMA placement | one server | weights, KV cache, activations near cores |
| Cluster | MPI / RPC / collective runtime | many nodes | distributed inference and training |
Peak FLOPs only attacks one term. CPU distributed systems require all four terms to be managed.
Section 2: OpenMP vs Threadpool vs MPI
OpenMP is a convenient way to express parallel loops inside a process. MPI is a process-to-process communication model for distributed memory systems. A custom threadpool sits somewhere else: it is an explicit runtime decision that keeps worker threads alive and dispatches model work without repeatedly entering and leaving generic parallel regions.
CKE’s current kernel rule is important: no OpenMP inside kernels. Kernels are pure computation. Parallelism belongs at the orchestration layer. That separation makes the generated runtime easier to reason about because a kernel does not secretly spawn threads, allocate memory, or change global state.
Kernels must NOT contain parallelization directives:
- NO #pragma omp parallel
- Parallelization is orchestrated at the codegen/orchestration layer
- Exceptions must be explicitly documented and justifiedSection 3: NUMA Is The First Distributed System
NUMA means Non-Uniform Memory Access. In a multi-socket or chiplet-heavy CPU system, not every core reaches every byte of memory at the same cost. A thread reading local memory may see good bandwidth and latency. A thread reading remote memory crosses an interconnect and pays more.
This is why a large CPU server is already a distributed system before MPI appears. A careless runtime can place weights on one NUMA node and run worker threads on another. The program still works, but the memory traffic becomes remote. NUMA Correctness does not prove locality. A bad NUMA placement can be numerically correct and still waste bandwidth.
typedef struct {
int node_id;
uint64_t memory_total_mb;
uint64_t memory_free_mb;
int cpu_list[MAX_CPUS];
int num_cpus;
} NUMANode;
typedef struct {
int num_nodes;
NUMANode nodes[MAX_NUMA_NODES];
int distances[MAX_NUMA_NODES][MAX_NUMA_NODES];
} NUMATopology;Section 4: Sub-NUMA Clustering
Sub-NUMA clustering splits a physical socket into smaller NUMA-like domains. The reason is locality. A large socket may have many cores, many cache slices, and many memory controllers. Treating the whole socket as one uniform domain hides useful placement information.
For AI kernels, SNC matters when large streams of weights, activations, KV cache, or optimizer state move through memory. If the runtime knows which cores are near which memory controllers, it can partition work in a way that avoids cross-domain traffic.
Section 5: MESI And Cache Coherence
MESI is a common cache-coherence protocol idea: cache lines can be Modified, Exclusive, Shared, or Invalid. The point is to keep multiple cores’ caches coherent when they read and write memory. That sounds like a hardware detail, but it matters directly for threadpool and kernel design.
If two threads repeatedly write different variables that live on the same 64-byte cache line, the cache line bounces between cores. That is false sharing. The program looks parallel in source code, but coherence traffic serializes part of the execution.
False sharing is not a data race. It is a cache-line ownership problem.
Section 6: First-Touch Allocation
On NUMA systems, memory is often physically placed when it is first touched. If one initialization thread writes all model memory, the pages may be placed near that thread’s NUMA node. Later, if many workers from other nodes read those pages, the runtime pays remote access costs.
A NUMA-aware runtime initializes memory in parallel according to the intended ownership pattern. If socket 0 will own layers 0-15 and socket 1 will own layers 16-31, first-touch should follow that plan.
Section 7: MPI Across Nodes
MPI becomes relevant when the model, batch, experts, KV cache, or training state spans machines. The mental model is simple: each process owns local memory, and communication is explicit. That explicitness is valuable because distributed CPU training and inference should not pretend the cluster is one giant uniform RAM pool.
For CKE, the eventual distributed design should treat MPI-style communication as an orchestration layer. Kernels should still be local and deterministic. The distributed runtime decides which shard, layer, batch slice, expert, or KV block lives on which node.
Section 8: InfiniBand, RoCE, And RDMA Are The Fabric Layer
Once CKE leaves one machine, the network stops being a background detail. Ethernet, RoCE, and InfiniBand are not interchangeable at the runtime-design level. Normal Ethernet gives portability and easy operations. RoCE gives RDMA semantics over Ethernet when the switching fabric is configured correctly. InfiniBand is the native HPC fabric designed for low-latency cluster communication and RDMA-style memory movement.
The important point is not that CKE should become an InfiniBand-only engine. That would be too narrow too early. The better design is a transport abstraction: CKE owns ranks, tensor blocks, arenas, offsets, microbatches, and pipeline schedules; the communication backend can be plain TCP, MPI over Ethernet, MPI over InfiniBand, UCX/libfabric, or eventually raw RDMA.
| Fabric path | What it gives | CKE interpretation |
|---|---|---|
| Ethernet TCP | portable baseline | works everywhere, useful for correctness and scheduling tests |
| RoCE | RDMA over Ethernet | good for data-center Ethernet fabrics when lossless/QoS is configured |
| InfiniBand | native HPC/RDMA fabric | good for direct-connect labs and serious cluster experiments |
| Raw verbs/RDMA | explicit queue pairs, keys, completion polling, offsets | future low-level backend if MPI/UCX overhead becomes the bottleneck |
Hardware such as Mellanox/NVIDIA ConnectX cards matters because it is the common path in HPC and AI clusters. A cheap learning path can start with ConnectX-4 or ConnectX-5 cards, a QSFP DAC cable, and two Linux machines in direct-connect mode. The first target is not marketing throughput. The first target is understanding what bandwidth and latency the fabric actually delivers. cluster rule Do not design from NIC spec sheets. Measure raw fabric, MPI fabric, and real CKE pipeline throughput separately.
Section 9: MPI First, Raw RDMA Later
MPI is the right first backend for CKE because it gives the distributed programming model before CKE takes on all the low-level transport complexity. It gives ranks, send/receive, barriers, broadcasts, reductions, and collectives. When configured over InfiniBand or RoCE, a good MPI stack may use RDMA underneath through UCX, libfabric, verbs, or vendor transports.
CKE generated C kernels
-> CKE distributed scheduler
-> transport abstraction
-> MPI_Isend / MPI_Irecv / MPI_Allreduce
-> Open MPI / Intel MPI / MPICH
-> UCX / libfabric / verbs
-> mlx5 driver
-> ConnectX NIC
-> InfiniBand or RoCERaw RDMA is more powerful, but it is also a much sharper tool. It means CKE must manage memory registration, local keys, remote keys, queue pairs, completion queues, connection setup, flow control, retries, error handling, and safety. That can be worth it later. It should not be the first proof.
The clean path is: prove the schedule with MPI, benchmark the fabric, benchmark MPI over the fabric, then only replace the transport layer if MPI is measurably blocking the runtime. This keeps CKE from becoming hardware-locked before the distributed model is proven.
Section 10: Why CKE’s Bump Arenas Fit RDMA
RDMA does not magically combine every node’s memory into one giant RAM pool. Each node still owns its local memory. The difference is that a NIC can read or write registered memory regions on another node with low CPU overhead. That model fits CKE’s arena style better than a runtime that allocates and frees tensors dynamically.
A bad RDMA design would allocate a tensor with malloc, register that memory, transfer it, deregister it, and repeat. That burns time in the exact place CKE wants determinism. A better design is to allocate a large arena once, register it once, and move tensor blocks by offset.
typedef struct {
void *base;
size_t size;
size_t used;
uint32_t lkey; // local memory key
uint32_t rkey; // remote access key shared with peers
uint64_t remote_base; // peer virtual address for this arena
} ck_dist_arena_t;
// Tensor movement becomes:
// local.base + src_offset -> remote.base + dst_offset
// The graph/runtime moves offsets, not heap objects. This is where CKE’s generated-layout philosophy becomes useful for distributed computing. If the compiler already knows tensor names, byte sizes, lifetimes, and offsets, then distributed execution can exchange compact movement plans instead of dynamically discovering object structure at runtime. A layer shard can send activation block A from offset x to the next rank’s receive arena at offset y. The generated graph and layout report become the cluster’s map.
Section 11: The Measurement Ladder
The right way to evaluate this is not to argue from vibes. Measure three layers separately. First, measure the raw fabric ceiling with tools such as ib_write_bw, ib_read_bw, and ib_send_bw. Second, measure the MPI ceiling with OSU Micro-Benchmarks or Intel MPI Benchmarks. Third, measure the real CKE pipeline with actual tensor sizes, actual rank boundaries, actual microbatching, and actual compute kernels.
| Layer | Tool | Question |
|---|---|---|
| Raw fabric | ib_write_bw, ib_send_bw | What can the NIC, PCIe slot, cable, and fabric do? |
| MPI fabric | OSU / Intel MPI Benchmarks | How much overhead does the MPI/provider stack add? |
| CKE runtime | real layer pipeline | Is the bottleneck compute, memory, synchronization, network, or scheduling? |
If raw RDMA reaches 92 Gb/s, MPI reaches 88 Gb/s, and CKE reaches 35 Gb/s, MPI is probably not the problem. The issue is likely scheduling, tensor partitioning, copies, NUMA placement, pipeline bubbles, or kernel imbalance. If raw RDMA reaches 92 Gb/s, MPI reaches 45 Gb/s, and CKE reaches 40 Gb/s, then the MPI/provider configuration deserves attention. This ladder keeps the engineering honest.
Section 12: CKE’s Existing Hooks
CKE already has several pieces that point in this direction: topology structures for NUMA and affinity, hugepage-aware allocation, section-based memory layout, kernel rules that forbid hidden allocation, and orchestration-level parallelism. These are not cosmetic details. They are the foundation for making CPU clusters predictable.
| CKE artifact | Systems meaning |
|---|---|
system_topology.h | CPU, cache, NUMA, memory, network, and affinity probing model. |
ckernel_alloc.h | Hugepage and bump-arena allocation path. |
ckernel_section_layout.h | Single allocation, sectioned model layout, and NUMA-oriented boundaries. |
src/kernels/README.md | Kernel purity rule: no allocation, no hidden OpenMP, explicit workspace. |
ck_threadpool.h | Persistent pthread worker model for low-latency local parallelism. |
Section 13: The CPU Cluster Argument
The GPU-first argument is usually peak throughput. That argument is real, but incomplete. If the workload is memory-heavy, sovereignty-sensitive, cost-sensitive, or audit-heavy, CPUs become more attractive. CPUs have large DRAM capacity, mature networking, flexible deployment, strong OS visibility, and broad availability.
The engineering bet is not that one CPU beats one GPU at dense matmul. The bet is that a CPU-first runtime can exploit memory capacity, placement control, deterministic generated code, and distributed orchestration for workloads where deployment constraints matter as much as peak tokens/sec.
This is especially relevant for large open-weight models on realistic local hardware. If a model and useful context fit cleanly inside GPU VRAM, the GPU path is usually the obvious performance path. But once the useful workload does not fit affordable VRAM, the system pays for offload, sharding, PCIe movement, multi-GPU synchronization, KV-cache pressure, and operational complexity. CPU nodes with large DRAM pools do not make those problems disappear, but they change the economics and the memory-capacity ceiling.
That is the CKE thesis in distributed form: do not pretend one CPU is an H100; build a runtime where adding commodity nodes adds compute, memory capacity, memory bandwidth, and network bandwidth in a measurable way. The proof is not a slogan. The proof is whether CKE can keep the kernels, memory arenas, pipeline schedule, and fabric busy at the same time.
Section 14: Summary
MPI, RDMA, InfiniBand, RoCE, OpenMP, NUMA, sub-NUMA clustering, MESI, affinity, and memory placement are not side quests. They are the real substrate underneath CPU AI systems. If C-Kernel-Engine is going to become a distributed CPU training and inference platform, these concepts become first-class runtime contracts.
The clean design rule is simple: kernels stay pure, local, and deterministic; orchestration handles threads, NUMA, MPI, placement, and scheduling; memory layout makes data movement predictable. That is the path from a single C kernel to a distributed CPU runtime.