v8 inference hardening
This ShivasNotes deep dive is written for engineers who want to understand what happens between "here is a model on Hugging Face" and "here is a compiled .so that runs inference in pure C." C-Kernel-Engine is not a runtime interpreter; it is a code generator. Template JSON defines the architecture. The IR builder resolves kernels. The memory planner assigns every byte. The codegen emits unrolled C. The compiler produces a shared library that runs independently of Python. This post focuses on the current v8 inference lane: promoted text-family bring-up, high-memory smoke targets, Qwen3-VL multimodal bridge work, and the regression/parity surface that hardens the runtime. Video walkthrough on youtube.com/@antshivrobotics.
The central idea is a compiler split: CKE is smart before codegen and intentionally dumb during codegen. Template choice, kernel binding, memory layout, multimodal bridge policy, and parity instrumentation happen before a single line of emitted C exists. By the time model_v8.c or encoder_v8.c is written, the interesting decisions are already over. That is why the generated file feels so explicit. It is not discovering the model at runtime. It is replaying a compile-time plan that was already settled in JSON and lowered IR.
What this post covers
Sections 1 through 5 frame the active v8 inference front-end: the six-step ck_run_v8.py pipeline, the template JSON system, the Gemma4 hybrid contract, the Qwen3-VL vision encoder, and the multimodal bridge.
Sections 6 through 10 walk the middle of the pipeline: IR1, the fusion pass, the memory planner, lowered IR, and the code generator that emits fully unrolled C.
Sections 11 through 15 land on the hardening surface: what model_v8.c actually looks like, how it becomes libmodel.so, what the v8 regression/parity tools check, and why the smart-front-end / dumb-back-end split is the architectural win.
Introduction — CKE Is a Compiler, Not a Framework
Most LLM frameworks are runtime interpreters. They load a model, keep a graph around, and make decisions while tokens are flowing. CKE can do the opposite because the graph shape is already known. It downloads weights once, resolves a template once, lowers that plan once, then emits static C that calls kernels directly.
The most useful way to read v8 is as a six-step compiler pipeline. Step 1 downloads the model. Step 2 converts Hugging Face or GGUF weights into the bump format CKE expects. Step 3 builds IR1 from template + quant metadata. Step 4 emits C from lowered IR. Step 5 compiles that C into libmodel.so. Step 6 starts chat by loading the shared library. There is no interpreter loop between step 5 and the first generated token.
The surface area is large enough that this is clearly a compiler project, not a wrapper. The v8 inference pipeline spans ck_run_v8.py (1,315 lines), build_ir_v8.py (10,459), codegen_v8.py (1,025), codegen_prefill_v8.py (1,943), and memory_planner_v8.py (705) before the generated C even exists.
That front-loads complexity on purpose. The runtime becomes boring because the builder already paid the planning cost up front.
15,447 lines
The compile pipeline is substantial before inference starts: IR build, memory planning, decode codegen, and prefill codegen are all explicit source files, not hidden framework internals.

def log_step(step: int, msg: str):
    """Print pipeline step."""
    print(f"{C_ORANGE}[{step}/6]{C_RESET} {C_BOLD}{msg}{C_RESET}")

log_step(1, f"Downloading {model_id}")
log_step(2, f"Converting weights to bump format ({weight_dtype})")
log_step(3, "Building IR1 (Template + Quant → Kernel IDs)")
log_step(4, "Generating C code")
log_step(5, "Compiling to shared library")
log_step(6, "Starting chat")

ck_run_v8.py 1315 lines
build_ir_v8.py 10459 lines
codegen_v8.py 1025 lines
codegen_prefill_v8.py 1943 lines
memory_planner_v8.py 705 lines
--------------------------------------------
Core pipeline 15447 lines

run_regression_v8.py 844 lines
parity_test_v8.py 899 lines
compare_first_token_logits_v8.py 633 lines
run_cached_model_smoke_v8.py 174 lines

| Stage | Script / component | Lines | Input | Output |
|---|---|---|---|---|
| [1/6] Download model | ck_run_v8.py | 1,315 | HF model id / GGUF repo | local model directory |
| [2/6] Convert weights | convert_hf_to_bump_v8.py or convert_gguf_to_bump_v8.py | orchestrated by ck_run_v8.py | config.json + weights | weights.bump + weights_manifest.json |
| [3/6] Build IR1 | build_ir_v8.py | 10,459 | template + quant manifest | ir1_decode.json + lowered IR |
| [4/6] Generate C | codegen_v8.py | 1,025 | lowered_decode.json | model_v8.c |
| [5/6] Compile .so | gcc / clang / icx | compiler stage | model_v8.c + runtime libs | libmodel.so |
| [6/6] Start chat | scripts/ck_chat.py | Python loader | libmodel.so + tokenizer .so | interactive inference |
Template JSON — The Architecture Specification
Every model family in CKE starts as a template JSON. In v8 the template lane now covers text decoders, hybrid/sliding decoder variants, vision encoder experiments, and the Qwen3-VL multimodal pair. The template is not a cosmetic manifest. It is the architecture contract.
For Qwen3 that contract declares no QKV bias, Q/K norm enabled, a SwiGLU MLP, RoPE position encoding, a BPE tokenizer contract, and a header/body/footer decode sequence. The body ops are the skeleton that the IR builder expands into 28 explicit layers.
The cleanest way to think about templates is this: they are handwritten architecture truth, while the generated C is mechanically derived truth.
A new model family often means writing 100–200 lines of template JSON, not rewriting the whole compiler. That is why CKE can absorb dense text models, sliding-window hybrids, and multimodal encoders without inventing a new runtime every time.
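To make the expansion concrete, here is a minimal sketch of how a header/body/footer template could be unrolled into a flat, per-layer op list. The function name `expand_template` and the tuple layout are illustrative assumptions, not the build_ir_v8.py API; only the op names come from the qwen3 template.

```python
# Hypothetical sketch: unroll a header/body/footer template into a flat op list.
# Op names are taken from the qwen3 template; the function itself is illustrative.
QWEN3_DECODER = {
    "header": ["bpe_tokenizer", "dense_embedding_lookup"],
    "body": [
        "rmsnorm", "qkv_proj", "qk_norm", "rope_qk",
        "attn", "out_proj", "residual_add", "rmsnorm",
        "mlp_gate_up", "silu_mul", "mlp_down", "residual_add",
    ],
    "footer": ["rmsnorm", "lm_head", "logits"],
}


def expand_template(block: dict, num_layers: int) -> list[tuple[str, int, str]]:
    """Return (section, layer, op) triples; layer is -1 outside the body."""
    ops = [("header", -1, op) for op in block["header"]]
    for layer in range(num_layers):
        ops.extend(("body", layer, op) for op in block["body"])
    ops.extend(("footer", -1, op) for op in block["footer"])
    return ops


expanded = expand_template(QWEN3_DECODER, num_layers=28)
body_ops = [op for op in expanded if op[0] == "body"]
print(len(expanded), len(body_ops))
```

This naive expansion yields 12 body ops per layer. The real IR1 for the same model is larger because the builder also inserts ops such as residual_save and input quantization, and splits the QKV projection into separate kernels.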


{
  "name": "qwen3",
  "family": "llama",
  "flags": {"use_qkv_bias": false, "has_qk_norm": true, "activation": "swiglu", "rope": "rope", "tokenizer": "bpe"},
  "attention_contract": {"rope_layout": "split", "rope_type": "rope", "qk_norm": true, "kv_layout": "layer_major_kv_cache", "attn_variant": "dense"},
  "block_contract": {"norm_type": "rmsnorm", "mlp_formula": "gate_up -> silu_mul -> down", "activation": "swiglu", "qkv_bias": false}
}

{
  "sequence": ["decoder"],
  "block_types": {
    "decoder": {
      "sequence": ["header", "body", "footer"],
      "header": ["bpe_tokenizer", "dense_embedding_lookup"],
      "body": {
        "type": "dense",
        "ops": [
          "rmsnorm", "qkv_proj", "qk_norm", "rope_qk",
          "attn", "out_proj", "residual_add", "rmsnorm",
          "mlp_gate_up", "silu_mul", "mlp_down", "residual_add"
        ]
      },
      "footer": ["rmsnorm", "lm_head", "logits"]
    }
  }
}

version/v8/templates/
gemma3.json
gemma4.json
gemma4_vision.json
glm4.json
llama.json
nemotron_h.json
qwen2.json
qwen3.json
qwen35.json
qwen3_vl_vision.json
qwen3vl.json
siglip_vit.json

| Template | Family | Body type | Attention variant | Activation | Special features |
|---|---|---|---|---|---|
| qwen3.json | llama-style decoder | dense | dense causal GQA | SwiGLU | Q/K norm, BPE chat contract |
| qwen35.json | qwen3.5 recurrent hybrid | hybrid recurrent attention | hybrid recurrent | SwiGLU | mixed recurrent + dense path |
| gemma4.json | Gemma4 decoder | hybrid_sliding_attention | full + sliding interleaving | GeGLU | layer-kind dispatch, shared-KV variants |
| qwen3_vl_vision.json | vision_transformer_with_branches | dense bidirectional | dense bidirectional | GELU | dual patch projection, 2D positions, deepstack branches |
| qwen3vl.json | multimodal language decoder | dense | dense causal GQA | SwiGLU | mRoPE, vision markers in chat contract |
Gemma4 — Hybrid Sliding Window Interleaving
Gemma4 is where templates stop looking like simple decoder boilerplate and start looking like a compiler necessity. The body type is hybrid_sliding_attention, and each layer can be one of four different kinds.
Those kinds are not cosmetic labels. Shared-KV layers skip explicit k_proj and v_proj. Sliding layers call different attention kernels. The template sets layer_kind_config_key: "layer_kinds" so the builder can read a per-layer kind vector from config.json.
Instead of a runtime loop asking what kind each layer is on every token, the front-end resolves the layer plan once and the back-end unrolls concrete layers with the correct kernel sequence already baked in.
Gemma4 makes the case for code generation better than any slogan could. Once different layers legally have different body shapes, the easiest correct implementation is to emit different C for each layer.
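The build-time resolution described above can be sketched as a plain lookup from layer kind to op sequence. This is a minimal illustration, not the builder's code: `resolve_layer_plan` and the example `layer_kinds` vector are invented, while the two op sequences are copied from the Gemma4 kinds listed in this post.

```python
# Hypothetical sketch: resolve a per-layer kind vector into concrete op sequences.
# The two sequences mirror the Gemma4 kinds shown in this post;
# the layer_kinds vector at the bottom is made up for illustration.
KIND_OPS = {
    "sliding_attention_kv": [
        "attn_norm", "q_proj", "k_proj", "v_proj", "v_norm", "qk_norm",
        "rope_qk", "attn_sliding", "out_proj", "post_attention_norm",
        "residual_add", "ffn_norm", "mlp_gate_up", "geglu", "mlp_down",
        "post_ffn_norm", "residual_add", "gemma4_per_layer_embed",
    ],
    "full_attention_shared_kv": [
        "attn_norm", "q_proj", "q_norm", "rope_q", "attn_shared_kv",
        "out_proj", "post_attention_norm", "residual_add", "ffn_norm",
        "mlp_gate_up", "geglu", "mlp_down", "post_ffn_norm",
        "residual_add", "gemma4_per_layer_embed",
    ],
}


def resolve_layer_plan(layer_kinds: list[str]) -> list[list[str]]:
    """Look up each layer's op sequence once, at build time."""
    unknown = sorted(set(layer_kinds) - set(KIND_OPS))
    if unknown:
        raise ValueError(f"no op sequence for layer kinds: {unknown}")
    return [list(KIND_OPS[kind]) for kind in layer_kinds]


# Invented config vector: three sliding layers, then one shared-KV layer.
layer_kinds = ["sliding_attention_kv"] * 3 + ["full_attention_shared_kv"]
plan = resolve_layer_plan(layer_kinds)
for i, ops in enumerate(plan):
    print(f"layer {i}: {len(ops)} ops, k_proj emitted: {'k_proj' in ops}")
```

Because the plan is a list of concrete sequences, a later emitter never needs to branch on layer kind; it only walks the list it was given.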

{
  "rope_layout": "split",
  "rope_type": "rope",
  "qk_norm": true,
  "kv_layout": "layer_major_kv_cache",
  "attn_variant": "hybrid_sliding_attention",
  "layer_policy_config_key": "layer_attention_plan",
  "layer_kind_config_key": "layer_kinds",
  "kv_policy_config_key": "layer_kv_policy",
  "kv_source_config_key": "layer_kv_source",
  "sliding_window_config_key": "layer_sliding_window",
  "rope_kind_config_key": "layer_rope_kind"
}

{
  "sliding_attention_kv": {
    "wq": "attn_q",
    "wk": "attn_k",
    "wv": "attn_v",
    "wo": "attn_output",
    "q_norm": "attn_q_norm",
    "k_norm": "attn_k_norm"
  },
  "full_attention_kv": {
    "wq": "attn_q",
    "wk": "attn_k",
    "wv": "attn_v",
    "wo": "attn_output",
    "q_norm": "attn_q_norm",
    "k_norm": "attn_k_norm"
  },
  "sliding_attention_shared_kv": {
    "wq": "attn_q",
    "wo": "attn_output",
    "q_norm": "attn_q_norm"
  },
  "full_attention_shared_kv": {
    "wq": "attn_q",
    "wo": "attn_output",
    "q_norm": "attn_q_norm"
  }
}

sliding_attention_kv:
  attn_norm -> q_proj -> k_proj -> v_proj -> v_norm -> qk_norm
  -> rope_qk -> attn_sliding -> out_proj -> post_attention_norm
  -> residual_add -> ffn_norm -> mlp_gate_up -> geglu -> mlp_down
  -> post_ffn_norm -> residual_add -> gemma4_per_layer_embed
full_attention_shared_kv:
  attn_norm -> q_proj -> q_norm -> rope_q
  -> attn_shared_kv -> out_proj -> post_attention_norm
  -> residual_add -> ffn_norm -> mlp_gate_up -> geglu
  -> mlp_down -> post_ffn_norm -> residual_add -> gemma4_per_layer_embed

Vision Templates — Multi-Modal Architecture (v8)
v8 pushes the same template system into multimodal territory. qwen3_vl_vision.json is not a decoder template at all. Its family is vision_transformer_with_branches. The attention is bidirectional instead of causal. The normalization is LayerNorm instead of RMSNorm. The activation is GELU instead of SwiGLU.
The template also introduces 2D position handling and branch pipelines. There are custom kernels like position_embeddings_add_tiled_2d and spatial_merge_contiguous_tiled. The footer emits vision_embeddings that the language decoder later consumes.
Then qwen3vl.json takes over for the text side. It carries the chat contract, the <|vision_start|> markers, and the mrope attention contract. One template system now covers both the vision encoder and the language decoder.
The current validated multimodal lane is intentionally narrower than “all vision models now work.” The public v8 vision encoder architecture documents the Qwen3-VL path: real image input, deterministic encoder prefix, bridge stitching, and decoder continuation. The flow is GGUF intake to template resolution to IR/layout to bridge prefix to decoder continuation.
That distinction matters. v8 is proving that the same compiler surface can lower and run the vision encoder, expose artifacts like encoder_v8.c, layout.json, call.json, and then hand the projected rows to the decoder. Some host policy is still explicit, especially image preprocessing and chat-template fallback behavior, but the encoder itself is not an external black box.
The same front-end machinery can describe a causal decoder, a hybrid sliding decoder, and a bidirectional vision transformer without changing the core lowering idea. The back-end still does the same boring job at the end: emit function calls that the lowered IR already settled.
{
  "family": "vision_transformer_with_branches",
  "flags": {"patch_frontend": "dual_patch_proj_sum", "activation": "gelu", "normalization": "layernorm"},
  "vision_contract": {"input_modality": "image", "position_encoding": "absolute_2d", "output": "vision_embeddings"},
  "attention_contract": {"attn_variant": "dense_bidirectional", "causal": false, "kv_layout": "ephemeral_full_context"},
  "kernels": {"position_embeddings": "position_embeddings_add_tiled_2d", "spatial_merge": "spatial_merge_contiguous_tiled", "attn_prefill": "attention_forward_full_head_major_gqa_ggml_strided"}
}

vision_encoder:
  header:
    patchify
    patch_proj
    patch_proj_aux
    patch_sum
    patch_bias
    position_embeddings
    vision_position_ids
  body:
    attn_norm -> qkv_packed_proj -> split_qkv -> vision_mrope
    -> attn -> attn_out_proj -> attn_residual
    -> ffn_norm -> mlp_up -> mlp_gelu -> mlp_down -> mlp_residual
  branches:
    deepstack
  footer:
    final_norm -> merge_main -> projector_fc1 -> projector_gelu -> projector_fc2 -> deepstack_concat

{
  "flags": {"use_qkv_bias": false, "has_qk_norm": true, "activation": "swiglu", "rope": "mrope", "tokenizer": "bpe"},
  "chat_contract": {
    "image_begin_marker": "<|vision_start|>",
    "image_end_marker": "<|vision_end|>",
    "template_markers": ["<|im_start|>", "<|im_end|>", "<|vision_start|>", "<|vision_end|>", "", " "]
  },
  "attention_contract": {"rope_layout": "multi_section_1d", "rope_type": "mrope", "qk_norm": true, "kv_layout": "layer_major_kv_cache"}
}

v8 Hardening Surface — Regression, Parity, and Smoke
v8 is the active inference bring-up lane. Training workflows remain in v7. That separation is useful: v8 can harden text inference, multimodal inference, tokenizer/chat-template behavior, parity probes, and generated runtime contracts without pretending the training stack has moved lanes.
The public v8 inference runbook names the operational contract. A model family is not promoted because a template exists. It is promoted when conversion, compile, tokenizer behavior, chat-template behavior, first-token or logits parity, and smoke generation are understood well enough to debug repeatably.
The hardening surface sits next to the runner: run_regression_v8.py, parity_test_v8.py, compare_first_token_logits_v8.py, and run_cached_model_smoke_v8.py. Those files are not marketing artifacts. They are the guardrails that stop “it generated something once” from becoming a fake support claim.
The important architectural point is promotion discipline: v8 support means the generated runtime survives repeatable checks, not just a lucky local prompt.
2,550 lines
Regression, parity, and cached smoke helpers give v8 a practical promotion surface around generated C inference.
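A first-token parity check reduces to comparing two logits vectors. The sketch below shows one plausible shape for that comparison, assuming plain Python lists and a made-up tolerance; `compare_logits` is a hypothetical helper and does not reflect the actual interface of compare_first_token_logits_v8.py.

```python
# Hypothetical sketch of a first-token logits comparison.
# Plain lists stand in for real logits arrays; the tolerance is an assumption.
def compare_logits(ours: list[float], ref: list[float], atol: float = 0.05) -> dict:
    """Report top-1 agreement and the largest absolute difference."""
    if len(ours) != len(ref):
        raise ValueError("vocab size mismatch: check tokenizer or weight conversion")
    top_ours = max(range(len(ours)), key=ours.__getitem__)
    top_ref = max(range(len(ref)), key=ref.__getitem__)
    max_abs = max(abs(a - b) for a, b in zip(ours, ref))
    return {
        "top1_match": top_ours == top_ref,
        "max_abs_diff": max_abs,
        "within_tol": max_abs <= atol,
    }


# Toy three-token vocabulary: runtime logits vs. reference logits.
report = compare_logits([0.10, 2.50, -1.00], [0.11, 2.49, -1.00])
print(report)
```

A top-1 mismatch with a small maximum difference usually points at near-tied logits, while a large difference on the very first token suggests the problem is upstream of generation state, for example in tokenization or weight conversion.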
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --prompt "Explain static code generation in one sentence."
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
  --mmproj hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
  --image-path version/v8/test_assets/v8_vision_doc_card_72.ppm \
  --prompt "Explain this image."

run_regression_v8.py 844 lines
parity_test_v8.py 899 lines
compare_first_token_logits_v8.py 633 lines
run_cached_model_smoke_v8.py 174 lines
common checks:
  build or reuse cached model artifacts
  compile generated libmodel.so / libdecoder_v8.so
  compare first-token logits where reference output exists
  run short text smokes
  run cached Qwen3-VL E2E smoke when artifacts are available

purpose:
  compare CK runtime logits against llama.cpp reference output
runtime side:
  load libmodel.so
  call ck_model_embed_tokens
  call ck_model_forward
  optionally enable ck_set_strict_parity
why it matters:
  if generation is incoherent, first-token logits expose whether the bug is
  tokenization, chat templating, weight conversion, kernel math, or runtime state

make v8-regression-fast
make test-v8-qwen3vl-e2e-smoke
# The second target is intentionally artifact-aware:
# small runners skip cleanly when the large decoder/mmproj cache is absent.

IR1 — The Op Graph (564 Operations)
IR1 is where the template becomes a concrete op graph. For the Qwen3-0.6B GGUF artifact used here, ir1_decode.json contains exactly 564 decode ops and spans 18,304 lines.
Each IR1 op already names the kernel, the semantic op type, the section, the layer index, the dataflow inputs, the output slots, and any weight references. That is why the builder can be smart without being magical.
Op 0 is a perfect example. It says the kernel is embedding_forward_q8_0, the op is dense_embedding_lookup, the input comes from external:token_ids, the output goes to the main_stream slot, and the weight comes from the token embedding at offset 552.
Once you see IR1, the mental model becomes simple: every layer is just a repeated slot machine of “read from this slot, call this kernel, write to that slot.”
564 ops
For the Qwen3-0.6B GGUF decode artifact, IR1 is already large enough that manual runtime dispatch would be silly. Compilation is the cleaner abstraction.
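The slot model also makes IR1 easy to sanity-check: every slot an op reads should have been written by an earlier op, unless it is external. The sketch below assumes a trimmed op shape with flat `inputs` and `outputs` lists; the op with id 3 and its name are an invented stand-in for whatever produces `main_stream_q8`, and `check_dataflow` is not a function from the repository.

```python
# Hypothetical sketch: check that every IR1 input slot has a producer.
# Op dicts are trimmed versions of the IR1 entries shown in this post;
# op 3 is an invented stand-in for the step that feeds q_proj.
ir1_ops = [
    {"op_id": 0, "op": "dense_embedding_lookup",
     "inputs": ["external:token_ids"], "outputs": ["main_stream"]},
    {"op_id": 3, "op": "quantize_input_0",
     "inputs": ["main_stream"], "outputs": ["main_stream_q8"]},
    {"op_id": 4, "op": "q_proj",
     "inputs": ["main_stream_q8"], "outputs": ["q_scratch"]},
]


def check_dataflow(ops: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means every read has a writer."""
    written: set[str] = set()
    problems = []
    for op in ops:
        for slot in op["inputs"]:
            if not slot.startswith("external:") and slot not in written:
                problems.append(
                    f"op {op['op_id']} ({op['op']}) reads unwritten slot {slot}"
                )
        written.update(op["outputs"])
    return problems


print(check_dataflow(ir1_ops))      # the toy graph is consistent
print(check_dataflow(ir1_ops[2:]))  # q_proj alone has no producer for its input
```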
{
  "op_id": 0,
  "kernel": "embedding_forward_q8_0",
  "op": "dense_embedding_lookup",
  "section": "header",
  "layer": -1,
  "dataflow": {
    "inputs": {"token_ids": {"from": "external:token_ids", "dtype": "i32", "slot": "external:token_ids"}},
    "outputs": {"out": {"dtype": "fp32", "slot": "main_stream"}}
  },
  "weights": {"token_emb": {"offset": 552, "size": 165306368, "dtype": "q8_0"}}
}

{
  "op_id": 4,
  "kernel": "gemv_q8_0_q8_0",
  "op": "q_proj",
  "section": "body",
  "layer": 0,
  "dataflow": {
    "inputs": {"x": {"from_op": 3, "from_output": "output", "dtype": "q8_0", "slot": "main_stream_q8"}},
    "outputs": {"y": {"dtype": "fp32", "slot": "q_scratch"}}
  },
  "weights": {
    "wq": {"offset": 170026546, "size": 2228224, "dtype": "q8_0"},
    "bq": {"offset": 172254770, "size": 8192, "dtype": "fp32"}
  }
}

total ops: 564
  header ops: 1
  body ops: 560
  footer ops: 3
top repeated ops:
  rmsnorm: 57
  residual_save: 56
  residual_add: 56
  quantize_input_0: 28
  q_proj: 28
  k_proj: 28
  v_proj: 28
  qk_norm: 28
  rope_qk: 28
  attn: 28

| Section | Op count | What lives there | Example ops |
|---|---|---|---|
| header | 1 | Tokenizer / embedding entry point | dense_embedding_lookup |
| body | 560 | Repeated decoder layer body | residual_save, rmsnorm, q_proj, attn, mlp_down |
| footer | 3 | Final normalization and logits | rmsnorm, quantize_final_output, logits |
IR1 Stage 2 — Fusion Pass
After IR1 generation, build_ir_v8.py runs a fusion pass. The important detail is where fusion lives: not in codegen, and not in the C runtime. It happens while the graph is still symbolic enough for the builder to match sequences of ops against kernel-registry patterns.
The file says exactly how it works: scan the registry for kernels with a fuses field, collect candidate patterns, match consecutive ops, and replace them with a fused kernel while merging metadata.
Even when a specific artifact does not dramatically shrink its visible op count, the architecture still matters. The compiler has a place where operator fusion belongs, and that place is before memory layout and before C emission.
If an optimization changes what the op graph means, it belongs in IR build. Fusion is the canonical example. You want it to happen when the compiler still understands graph adjacency, not after everything has been flattened into raw pointer arithmetic.
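The matching step can be sketched over a plain list of kernel names. This is a simplified illustration under stated assumptions: the kernel names and the registry entry are invented, the helper `fuse_consecutive` is hypothetical, and the real pass also merges weights and params, which is omitted here.

```python
# Hypothetical sketch of registry-driven fusion over a list of kernel names.
# Kernel names and the registry entry are invented for illustration.
def fuse_consecutive(kernels: list[str], registry: list[dict]) -> tuple[list[str], int]:
    """Greedily replace spans that match a registry "fuses" pattern."""
    # Only kernels that advertise a non-empty "fuses" list are candidates.
    patterns = [(k["id"], k["fuses"]) for k in registry if k.get("fuses")]
    # Prefer longer patterns so a 3-op fusion wins over a 2-op prefix.
    patterns.sort(key=lambda p: len(p[1]), reverse=True)
    out, i, fusions = [], 0, 0
    while i < len(kernels):
        for fused_id, span in patterns:
            if kernels[i:i + len(span)] == span:
                out.append(fused_id)
                i += len(span)
                fusions += 1
                break
        else:
            # No pattern starts here: keep the op unchanged.
            out.append(kernels[i])
            i += 1
    return out, fusions


registry = [
    {"id": "gemv_a"},
    {"id": "fused_norm_quant_gemv", "fuses": ["rmsnorm_a", "quantize_a", "gemv_a"]},
]
ir1 = ["rmsnorm_a", "quantize_a", "gemv_a", "rope_a", "rmsnorm_a"]
fused, count = fuse_consecutive(ir1, registry)
print(fused, count)
```

Running the pass on symbolic names is what keeps it cheap: adjacency is a list slice comparison, which would be much harder to recover once ops have been lowered to pointer expressions.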
#!/usr/bin/env python3
"""
build_ir_v8.py - Complete IR Pipeline: Template + Quant → IR1 → Fusion → Layout

PIPELINE (4 stages):
    1. IR1 Generation: Template + Quant Summary → Kernel IDs
    2. Fusion Pass: Combine consecutive kernels using registry-driven patterns
    3. Memory Layout: Plan activation buffers and weight offsets
    4. Output: IR1 JSON + Memory Layout JSON

Stage 1 - IR1 Generation (Direct mapping, no intermediate abstractions):
    1. Parse template sequence (what ops to run)
    2. Read quant summary from manifest (what dtypes for weights)
    3. Map template ops → kernel ops → concrete kernel IDs
    4. Return: List of kernel function names

Stage 2 - Fusion Pass:
    1. Scan kernel registry for kernels with "fuses" field
    2. Match consecutive kernel sequences in IR1
    3. Replace matching sequences with fused kernels

    1. Scan registry for kernels with "fuses" field
    2. Match consecutive kernel sequences
    3. Replace with fused kernel, merge weights
    4. Track fusion statistics
    """
    print(f"\n{'='*60}")
    print("FUSION PASS")
    print(f"{'='*60}")

    # Check for fusion disable flag (parameter only)
    if no_fusion:
        print(" ⚠️ Fusion DISABLED (--no-fusion)")
        return ir1_ops, {"total_fusions": 0, "kernels_removed": 0, "fusions_applied": [], "disabled": True}

    # Build fusion patterns from registry
    fusion_patterns = []
    for kernel in registry["kernels"]:
        if "fuses" not in kernel:
            continue
        # Check if this fused kernel matches the mode
        # NOTE: Allow prefill fused kernels in decode mode (v8 baseline parity)
        # The fused prefill kernels work for tokens=1 (decode) and are more accurate
        # because they handle quantization internally.
        variant = kernel.get("variant", "")
        # Don't skip prefill kernels in decode mode - they work with tokens=1
        # if mode == "decode" and "prefill" in variant and "decode" not in variant:
        #     continue
        if mode == "prefill" and "decode" in variant and "prefill" not in variant:
            continue
        pattern = {

registry kernel advertises:
  id: fused_kernel_name
  fuses: [kernel_a, kernel_b, kernel_c]
builder action:
  scan IR1 for consecutive kernel_a -> kernel_b -> kernel_c
  replace that span with fused_kernel_name
  merge weights and params
  record fusion statistics

The Memory Planner — Every Byte Has an Address
The memory planner is the bridge between symbolic slots and physical addresses. In v8 it lives in memory_planner_v8.py, a 705-line file whose job is exactly what the name says: assign buffers based on the dataflow graph.
The planner starts from canonical buffers like A_EMBEDDED_INPUT, A_LAYER_INPUT, A_RESIDUAL, A_ATTN_SCRATCH, A_MLP_SCRATCH, A_KV_CACHE, A_LOGITS, and A_LAYER_OUTPUT. The current Qwen3 artifact also allocates specialized scratch regions such as A_ATTN_Q_GATE_PACKED and A_ATTN_GATE.
The result for this Qwen3-0.6B GGUF run is deterministic and inspectable. The layout map says the total footprint is 1,606,394,890 bytes, of which 639,587,338 bytes are weights and 966,807,552 bytes are activations.
This is where the compiler earns trust from systems engineers.
1.50 GB
The Qwen3-0.6B GGUF artifact shows a 610 MB weight arena and a 922 MB activation arena for a 1.50 GB total footprint.
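The layout numbers are checkable by hand, which is part of the appeal. The sketch below packs the first three activation buffers back to back and recomputes the KV cache size from its shape. It is a simplification, not memory_planner_v8.py: `plan_offsets` is hypothetical, and the per-element byte sizes are inferred by dividing each reported size by its shape rather than read from the planner.

```python
# Hypothetical sketch: derive contiguous offsets from buffer shapes.
# Shapes come from the layout dump in this post; element sizes are inferred
# from (reported bytes / element count), not read from the planner source.
from math import prod

BUFFERS = [
    ("text_input", (16384,), 1),        # 16,384 bytes reported -> 1 byte each
    ("token_ids", (1024,), 4),          # 4,096 bytes reported  -> 4 bytes each
    ("embedded_input", (1024, 1024), 4),
]


def plan_offsets(buffers):
    """Pack buffers back to back and return ({name: (offset, size)}, end)."""
    layout, cursor = {}, 0
    for name, shape, elem_bytes in buffers:
        size = prod(shape) * elem_bytes
        layout[name] = (cursor, size)
        cursor += size
    return layout, cursor


layout, end = plan_offsets(BUFFERS)
for name, (offset, size) in layout.items():
    print(f"0x{offset:012X}  {size:>12,}  {name}")

# KV cache shape [layers, k/v, kv_heads, context, head_dim] at 4 bytes each.
kv_cache_bytes = prod((28, 2, 8, 1024, 128)) * 4
print(f"kv_cache: {kv_cache_bytes:,} bytes")
```

The end cursor after these three buffers lands on the offset the dump reports for layer_input, and the KV cache product matches the 224 MB entry, so the dump is consistent with simple back-to-back packing for this prefix.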

#!/usr/bin/env python3
"""
memory_planner_v8.py - Assign physical buffers based on IR1 dataflow graph.

This replaces the buggy ping-pong buffer logic with explicit dataflow-based assignment.

PIPELINE POSITION:
    IR1 (with dataflow) → Kernel Resolution → MEMORY PLANNER → IR Lower

INPUT:
    - IR1 ops with dataflow info (from build_ir_v8.py)
    - Kernel maps (to know dtype requirements)

OUTPUT:
    - Buffer assignments per op: {op_id: {input_name: buffer, output_name: buffer}}

PHYSICAL BUFFERS:
    - A_EMBEDDED_INPUT : Main activation buffer 1 (FP32)
    - A_LAYER_INPUT : Main activation buffer 2 (FP32/Q8)
    - A_RESIDUAL : Saved residual for skip connections (FP32)
    - A_ATTN_SCRATCH : Q/K/V projections and attention output (FP32)
    - A_MLP_SCRATCH : MLP gate_up and swiglu output (FP32)
    - A_KV_CACHE : KV cache (persistent across tokens)
    - A_LOGITS : Final logits output (FP32)
    - A_LAYER_OUTPUT : Layer output buffer (FP32)

    "A_EMBEDDED_INPUT": PhysicalBuffer(
        name="A_EMBEDDED_INPUT",
        dtype="fp32",
        last_writer=-1,
        can_hold=["fp32", "q8_0", "q8_k"]
    ),
    "A_LAYER_INPUT": PhysicalBuffer(
        name="A_LAYER_INPUT",
        dtype="fp32",
        last_writer=-1,
        can_hold=["fp32", "q8_0", "q8_k"]
    ),
    "A_RESIDUAL": PhysicalBuffer(
        name="A_RESIDUAL",
        dtype="fp32",
        last_writer=-1,
        can_hold=["fp32"]
    ),
    "A_ATTN_SCRATCH": PhysicalBuffer(
        name="A_ATTN_SCRATCH",

MEMORY SUMMARY
--------------------------------------------------------------------------------
Total: 1,606,394,890 bytes (1.50 GB)
Weights: 639,587,338 bytes (610.0 MB)
Activations: 966,807,552 bytes (922.0 MB)

Offset End Size (bytes) Buffer Shape
------------------------------------------------------------------------------------------------------------------------
0x000000000000 0x000000004000 16,384 ( 16.00 KB) text_input [16384]
0x000000004000 0x000000005000 4,096 ( 4.00 KB) token_ids [1024]
0x000000005000 0x000000405000 4,194,304 ( 4.00 MB) embedded_input [1024, 1024]
0x000000405000 0x000000805000 4,194,304 ( 4.00 MB) layer_input [1024, max(1024, Q8_K(3072))]
0x000000805000 0x000000C05000 4,194,304 ( 4.00 MB) residual [1024, 1024]
0x000000C05000 0x00000EC05000 234,881,024 ( 224.00 MB) kv_cache [28, 2, 8, 1024, 128]
0x00000EC05000 0x00000EC85000 524,288 ( 512.00 KB) rope_cache [2, 1024, 64]
0x00000EC85000 0x00000F485000 8,388,608 ( 8.00 MB) q_scratch [16, 1024, 128]
0x00000F485000 0x00000F885000 4,194,304 ( 4.00 MB) k_scratch [8, 1024, 128]
0x00000F885000 0x00000FC85000 4,194,304 ( 4.00 MB) v_scratch [8, 1024, 128]
0x00000FC85000 0x000010C85000 16,777,216 ( 16.00 MB) attn_q_gate_packed [1024, 4096]
0x000010C85000 0x000011485000 8,388,608 ( 8.00 MB) attn_gate [1024, 2048]
0x000011485000 0x000011C85000 8,388,608 ( 8.00 MB) attn_scratch [16, 1024, 128]
0x000011C85000 0x000014485000 41,943,040 ( 40.00 MB) mlp_scratch [max(1024*6144, fused_attn, geglu_bf16)]
0x000014485000 0x000014885000 4,194,304 ( 4.00 MB) layer_output [1024, 1024]
0x000014885000 0x000039A05000 622,329,856 ( 593.50 MB) logits [1024, 151936]

| Buffer | Offset | Size (bytes) | Shape | Purpose |
|---|---|---|---|---|
| embedded_input | 0x000000005000 | 4,194,304 | [1024, 1024] | Main FP32 stream after embedding and norm |
| layer_input | 0x000000405000 | 4,194,304 | [1024, max(1024, Q8_K(3072))] | Quantized or alternate main stream |
| residual | 0x000000805000 | 4,194,304 | [1024, 1024] | Saved skip-connection copy |
| kv_cache | 0x000000C05000 | 234,881,024 | [28, 2, 8, 1024, 128] | Persistent decode cache |
| q_scratch | 0x00000EC85000 | 8,388,608 | [16, 1024, 128] | Q projection output |
| attn_scratch | 0x000011485000 | 8,388,608 | [16, 1024, 128] | Attention output workspace |
| mlp_scratch | 0x000011C85000 | 41,943,040 | [max(1024*6144, fused_attn, geglu_bf16)] | MLP gate/up/down workspace |
| logits | 0x000014885000 | 622,329,856 | [1024, 151936] | Final logits buffer |

| Physical buffer name | Declared dtype(s) | Source of truth | Why it exists |
|---|---|---|---|
| A_EMBEDDED_INPUT | fp32 / q8_0 / q8_k | memory_planner_v8.py | First main-stream arena |
| A_LAYER_INPUT | fp32 / q8_0 / q8_k | memory_planner_v8.py | Ping-pong alternate for quantized activations |
| A_RESIDUAL | fp32 | memory_planner_v8.py | Skip connection save buffer |
| A_ATTN_SCRATCH | fp32 | memory_planner_v8.py | Attention-side temporary output |
| A_MLP_SCRATCH | fp32 | memory_planner_v8.py | MLP-side temporary output |
| A_KV_CACHE | fp32 | memory_planner_v8.py | Persistent cross-token state |
| A_LOGITS | fp32 | memory_planner_v8.py | Final output arena |
| A_LAYER_OUTPUT | fp32 | memory_planner_v8.py | Explicit layer-output staging area |
The Lowered IR — Concrete Pointer Expressions
IR1 still talks in slots. Lowered IR talks in addresses. In the Qwen3-0.6B GGUF artifact, lowered_decode.json contains 592 lowered ops and spans 36,181 lines.
Op 0 no longer says “read token_ids and write main_stream.” It says “read tokens from activations + 16384, write output to activations + 20480, and read token embeddings from bump_weights + 0.” Op 4 does the same for the layer-0 Q projection.
Once lowering is done, codegen has almost nothing left to decide. The pointer expressions already exist.
Lowered IR is the moment where the compiler becomes brutally concrete. If a pointer is wrong in generated C, the real bug is usually upstream in memory planning or lowering. Codegen is just copying expressions it was handed.
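The relationship between the layout dump and the lowered pointer expressions can be reproduced with a few lines. In this sketch the activation offsets are copied from the layout dump in this post, and `ptr_expr` is a hypothetical helper. The weight rule is an observation from the three sample weights shown here (IR1 offset minus 552 equals the lowered offset), not a documented contract; `WEIGHTS_BASE` is a name invented for illustration.

```python
# Hypothetical sketch: turn layout offsets into lowered-IR pointer expressions.
# Activation offsets are copied from the layout dump in this post.
LAYOUT = {
    "token_ids": 0x000000004000,
    "embedded_input": 0x000000005000,
    "layer_input": 0x000000405000,
    "q_scratch": 0x00000EC85000,
    "logits": 0x000014885000,
}

# Observation only: for the sample weights in this post, the lowered offset
# equals the IR1 offset minus 552. Treat this constant as an assumption.
WEIGHTS_BASE = 552


def ptr_expr(buffer: str, base: str = "activations") -> str:
    """Render a buffer's offset as a decimal pointer expression."""
    return f"{base} + {LAYOUT[buffer]}"


def weight_ptr_expr(ir1_offset: int) -> str:
    """Rebase an IR1 weight offset onto the bump_weights pointer."""
    return f"bump_weights + {ir1_offset - WEIGHTS_BASE}"


print(ptr_expr("token_ids"))
print(ptr_expr("q_scratch"))
print(weight_ptr_expr(170026546))
```

If a generated pointer looks wrong, converting it back to hex and finding the buffer whose range contains it is usually the fastest way to decide whether the planner or the lowering step is at fault.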

{
  "idx": 0,
  "kernel": "embedding_forward_q8_0",
  "function": "embedding_forward_q8_0",
  "weights": {"token_emb": {"ptr_expr": "bump_weights + 0"}},
  "activations": {"tokens": {"ptr_expr": "activations + 16384"}},
  "outputs": {"output": {"ptr_expr": "activations + 20480"}},
  "params": {"embed_dim": 1024, "num_layers": 28, "seq_len": 1}
}

{
  "idx": 4,
  "kernel": "gemv_q8_0_q8_0",
  "op": "q_proj",
  "weights": {
    "wq": {"ptr_expr": "bump_weights + 170025994"},
    "bq": {"ptr_expr": "bump_weights + 172254218"}
  },
  "activations": {"x": {"ptr_expr": "activations + 4214784"}},
  "outputs": {"y": {"ptr_expr": "activations + 248008704"}},
  "params": {"_output_dim": 2048, "_input_dim": 1024, "_m": 1}
}

idx 589 rmsnorm
  W final_ln_weight = bump_weights + 639579146
  A input = activations + 20480
  O output = activations + 20480
idx 590 quantize_final_output
  A input = activations + 20480
  O output = activations + 4214784
idx 591 logits
  W token_emb = bump_weights + 0
  A input = activations + 4214784
  O logits = activations + 344477696

Codegen — The Dumb Emitter (1,025 Lines)
The comment at the top of codegen_v8.py is unusually candid, and that is a good thing. It says the job is to create memory layout declarations, parse lowered IR, emit unrolled function calls, and pass pointers cleanly. Then it states the bug-routing rule in plain language: if there are memory issues, fix the memory layout builder, not codegen; if there are kernel issues, fix the IR lower, not codegen.
That is the architectural payoff of the whole pipeline. Codegen does not carry model-family intelligence, dispatch logic, or ad hoc runtime decisions. It walks the lowered ops list, emits one C call after another, wires in stop-op hooks, and includes optional parity/profile instrumentation when the build asked for it.
The prefill generator follows the same philosophy. Even parallelization is treated as upstream truth.
“Dumb” is not an insult here. It is a design goal. The less intelligence codegen carries, the fewer places there are for architecture bugs to hide.
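A toy emitter shows how little logic is left at this stage. The sketch below walks a lowered op list and prints one C call per op followed by a stop check. The argument order, the `ck_decode_sketch` function name, and the helper names are all invented for illustration; the output of codegen_v8.py is richer and will not match this text.

```python
# Hypothetical sketch of a "dumb" emitter: one C call per lowered op.
# Argument order and the ck_decode_sketch signature are invented;
# real codegen_v8.py output differs.
def emit_call(op: dict) -> str:
    """Copy pointer expressions and params straight into a C call."""
    args = [
        entry["ptr_expr"]
        for group in ("activations", "weights", "outputs")
        for entry in op.get(group, {}).values()
    ]
    args += [str(value) for value in op.get("params", {}).values()]
    return (
        f"    {op['function']}({', '.join(args)});\n"
        f"    if (stop_seq == {op['idx']}) return;"
    )


def emit_decode(ops: list[dict]) -> str:
    """Wrap the unrolled calls in a single C function body."""
    body = "\n".join(emit_call(op) for op in ops)
    return f"void ck_decode_sketch(int stop_seq) {{\n{body}\n}}\n"


lowered = [{
    "idx": 0,
    "function": "embedding_forward_q8_0",
    "weights": {"token_emb": {"ptr_expr": "bump_weights + 0"}},
    "activations": {"tokens": {"ptr_expr": "activations + 16384"}},
    "outputs": {"output": {"ptr_expr": "activations + 20480"}},
    "params": {"embed_dim": 1024},
}]
c_source = emit_decode(lowered)
print(c_source)
```

Nothing in the emitter knows what an embedding is. It copies strings it was handed, which is why a wrong pointer in the output points back at planning or lowering rather than at this step.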
#!/usr/bin/env python3
from __future__ import annotations
"""
codegen_v8.py - Generate C code from lowered IR.
RESPONSIBILITIES:
1. Create memory layout from layout.json (structs, offsets, allocations)
2. Parse lowered IR and emit function calls (unrolled, one after another)
3. Pass pointers cleanly to all functions
If there are memory issues → fix the memory layout builder, not codegen.
If there are kernel issues → fix the IR lower, not codegen.
===============================================================================

This section documents values that are hardcoded in codegen but should come from
IR config or dedicated kernels. These WILL BREAK for non-Qwen2 models.
Delete entries from this list as they are properly fixed.
NOTE: Init ops (rope_init, etc.) now use init_call.json pattern:
manifest.config → init.json → init_call.json → codegen emits calls
This is the correct pattern for model-specific initialization.
┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. ROPE SCALING TYPE - MEDIUM │
├─────────────────────────────────────────────────────────────────────────────┤
│ Location: rope_precompute_cache kernel │
│ Current: Standard RoPE only (no scaling) │
│ Should be: Support for rope_scaling_type from config: │
│ - "linear": freq *= 1/scaling_factor │
│ - "dynamic": NTK-aware dynamic scaling │
│ - "yarn": YaRN (Yet another RoPE extensioN) │
│ │
│ Impact: Context extension won't work for models using scaled RoPE │
│ - Llama 3.1 uses scaled RoPE for 128K context │
│ - Code Llama uses linear scaling │
│ │
│ Fix: Extend rope_precompute_cache kernel to accept scaling_type param │
│ init.json already has rope_scaling_type field ready to use │

tokenizer_include = ""
if init_call:
    for op in init_call.get("operations", []):
        c_code = op.get("c_code", {})
        if isinstance(c_code, dict) and c_code.get("include"):
            tokenizer_include = c_code["include"]
            break
parts = []
parts.append(f'''/*
* Auto-generated by codegen_v8.py
* Generated: {now}
* Model: {config.get("model", "unknown")}
* Mode: {ir.get("mode", "decode")}
* Layers: {config.get("num_layers", 0)} (unrolled)
* RoPE: theta={rope_theta}, rotary={rotary_dim}, scaling={rope_scaling_type}/{rope_scaling_factor}
* RoPE kernels: init={rope_init_kernel}, qk={rope_qk_kernel}, cache={rope_cache_layout}
*/
#define _GNU_SOURCE
#include
#include
#include
#include

#!/usr/bin/env python3
"""
codegen_prefill_v8.py - Generate C code for PREFILL mode from lowered IR.
This generates ck_prefill() which processes multiple tokens at once.
The IR (lowered_prefill_call.json) already has function names and expressions.
We just substitute num_tokens for const:1 sources.
=============================================================================
IMPORTANT: CODEGEN IS DUMB - NO PARALLELIZATION LOGIC HERE
=============================================================================
When you look at this code, you'll see many `for` loops that LOOK like they
could be parallelized with `#pragma omp parallel for`. You might be tempted
to add pragmas here. DON'T.
WHY NOT?
1. Codegen has NO global view of the computation graph
2. Adding pragmas here could cause FALSE SHARING between ops
3. Two adjacent ops might both parallelize the same buffer = cache thrashing
4. Thread over-subscription if multiple ops spawn threads
WHERE DOES PARALLELIZATION COME FROM?
The Generated C — What model_v8.c Looks Like
The emitted file is not tiny. For a concrete Qwen3-0.6B GGUF artifact, model_v8.c becomes a large model-specific C translation unit. But the structure is easy to read once you know what to look for.
The weight metadata is compile-time data. HeaderOffsets and LayerOffsets L_LAYERS[28] are just raw offsets into the bump file. The activation metadata is the same idea for the single contiguous allocation.
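As a sanity check on the idea that these are plain byte offsets, the sketch below copies a few A_* values and model constants from the generated file excerpted later in this section and compares the gap between consecutive buffers with a guessed size. The guesses (fp32 storage, a KV cache holding K and V for every layer, a RoPE cache holding cos and sin tables of rotary_dim/2 entries) are assumptions inferred from the numbers, not statements from the generator, and layout_mismatches is a hypothetical helper name.

```python
# Model constants copied from the #define block of the generated file.
EMBED_DIM, NUM_HEADS, NUM_KV_HEADS = 1024, 16, 8
HEAD_DIM, NUM_LAYERS, MAX_SEQ_LEN = 128, 28, 1024
F32_BYTES = 4  # assumption: activations and caches are stored as fp32

# A subset of the A_* byte offsets from the generated file.
A = {
    "EMBEDDED_INPUT": 639608370,
    "LAYER_INPUT": 643802674,
    "KV_CACHE": 652191282,
    "ROPE_CACHE": 887072306,
    "Q_SCRATCH": 887596594,
    "K_SCRATCH": 895985202,
    "V_SCRATCH": 900179506,
}

# Assumed meaning of each buffer: (start, next buffer, guessed size in bytes).
EXPECTED = [
    ("EMBEDDED_INPUT", "LAYER_INPUT", MAX_SEQ_LEN * EMBED_DIM * F32_BYTES),
    ("KV_CACHE", "ROPE_CACHE",
     NUM_LAYERS * 2 * NUM_KV_HEADS * MAX_SEQ_LEN * HEAD_DIM * F32_BYTES),
    ("ROPE_CACHE", "Q_SCRATCH", 2 * MAX_SEQ_LEN * (HEAD_DIM // 2) * F32_BYTES),
    ("Q_SCRATCH", "K_SCRATCH", MAX_SEQ_LEN * NUM_HEADS * HEAD_DIM * F32_BYTES),
    ("K_SCRATCH", "V_SCRATCH", MAX_SEQ_LEN * NUM_KV_HEADS * HEAD_DIM * F32_BYTES),
]


def layout_mismatches():
    """Return the buffers whose gap to the next offset differs from the guess."""
    return [
        (start, A[end] - A[start], size)
        for start, end, size in EXPECTED
        if A[end] - A[start] != size
    ]


print(layout_mismatches())  # an empty list means every guess matched
```

The same arithmetic explains the 4096-byte memcpy in Op 1 of the decode excerpt: one token's row is 1024 floats of 4 bytes each.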
That is also where the debugging hooks become visible. Every op gets an optional if (stop_seq == N) return; check, and parity instrumentation can be compiled in with CK_PARITY_DUMP.
The generated C does not describe a framework. It describes this model, with these offsets, calling these kernels in this exact order.
generated C
The generated decode+prefill file is large because layers are fully unrolled and every offset is made explicit as compile-time data.
/*
* Auto-generated by codegen_v8.py
* Generated: 2026-05-29 21:57:57
* Model: qwen3
* Mode: decode
* Layers: 28 (unrolled)
* RoPE: theta=1000000.0, rotary=128, scaling=none/1.0
* RoPE kernels: init=rope_precompute_cache, qk=rope_forward_qk_with_rotary_dim, cache=rotary_dim/2
*/
/* ============================================================================
* MODEL CONFIGURATION
* ============================================================================ */
#define EMBED_DIM 1024
#define NUM_HEADS 16
#define NUM_KV_HEADS 8
#define HEAD_DIM 128
#define ROTARY_DIM 128
#define INTERMEDIATE_SIZE 3072
#define NUM_LAYERS 28
#define VOCAB_SIZE 151936
#define MAX_SEQ_LEN 1024
/* RoPE scaling: type=none, factor=1.0 */
/* Memory sizes */
#define WEIGHTS_SIZE 639587338ULL
#define ACTIVATIONS_SIZE 966807552ULL
/* Per-layer weight offsets */
typedef struct {
size_t b1;
size_t b2;
size_t bk;
size_t bo;
size_t bq;
size_t bv;
size_t k_norm;
size_t ln1_gamma;
size_t ln2_gamma;
size_t q_norm;
size_t w1;
size_t w2;
size_t wk;
size_t wo;
size_t wq;
size_t wv;
} LayerOffsets;
static const LayerOffsets L_LAYERS[28] = {
[0] = {
.b1 = 183416842,
.b2 = 186783754,
.bk = 173377034,
.bo = 176728074,
.bq = 172254218,
.bv = 174495754,
.k_norm = 173381130,
.ln1_gamma = 170017802,
.ln2_gamma = 170021898,
.q_norm = 172262410,
.w1 = 176732170,
.w2 = 183441418,
.wk = 172262922,
.wo = 174499850,
.wq = 170025994,
.wv = 173381642,
},
#define A_TEXT_INPUT 639587890
#define A_TOKEN_IDS 639604274
#define A_EMBEDDED_INPUT 639608370
#define A_LAYER_INPUT 643802674
#define A_RESIDUAL 647996978
#define A_KV_CACHE 652191282
#define A_ROPE_CACHE 887072306
#define A_Q_SCRATCH 887596594
#define A_K_SCRATCH 895985202
#define A_V_SCRATCH 900179506
#define A_ATTN_Q_GATE_PACKED 904373810
#define A_ATTN_GATE 921151026
#define A_ATTN_SCRATCH 929539634
#define A_MLP_SCRATCH 937928242
#define A_LAYER_OUTPUT 979871282
#define A_LOGITS 984065586
/* ============================================================================
* MODEL STRUCT
* ============================================================================ */
typedef struct {
uint8_t *bump; /* Single contiguous allocation */
size_t bump_size;
uint8_t *bump_weights; /* Weights section */
float *activations; /* Activations section */
/* Store token at offset 639604274 (from layout) */
*(int32_t*)(MEM + 639604274) = token;
/* Op 0: embedding_forward_q8_0 (dense_embedding_lookup) layer=-1 section=header */
embedding_forward_q8_0(
(int32_t*)(model->bump + A_TOKEN_IDS),
1,
151936,
(const void*)(model->bump + W_TOKEN_EMB),
NULL,
(float*)(model->bump + A_EMBEDDED_INPUT),
1024,
1024,
1,
0
);
if (stop_seq == 0) return;
/* Op 1: memcpy (residual_save) layer=0 section=body */
memcpy(
(void*)(model->bump + A_RESIDUAL),
(const void*)(model->bump + A_EMBEDDED_INPUT),
4096
);
if (stop_seq == 1) return;
/* Op 2: rmsnorm_forward (rmsnorm) layer=0 section=body */
rmsnorm_forward(
(const float*)(model->bump + A_EMBEDDED_INPUT),
(float*)(model->bump + W_LAYER_0_LN1_GAMMA),
(float*)(model->bump + A_EMBEDDED_INPUT),
NULL,
1,
1024,
1024,
9.999999974752427e-07
);
if (stop_seq == 2) return;
/* Op 3: quantize_row_q8_0 (quantize_input_0) layer=0 section=body */
quantize_row_q8_0(
(const float*)(model->bump + A_EMBEDDED_INPUT),
(void*)(model->bump + A_LAYER_INPUT),
1024
);
if (stop_seq == 3) return;
/* Op 4: gemv_q8_0_q8_0 (q_proj) layer=0 section=body */
gemv_q8_0_q8_0(
(float*)(model->bump + A_Q_SCRATCH),
(const void*)(model->bump + W_LAYER_0_WQ),
(void*)(model->bump + A_LAYER_INPUT),
2048,
1024
);
if (stop_seq == 4) return;
/* Op 589: rmsnorm_forward (rmsnorm) layer=-1 section=footer */
rmsnorm_forward(
(const float*)(model->bump + A_EMBEDDED_INPUT),
(float*)(model->bump + W_FINAL_LN_WEIGHT),
(float*)(model->bump + A_EMBEDDED_INPUT),
NULL,
1,
1024,
1024,
9.999999974752427e-07
);
if (stop_seq == 589) return;
/* Op 590: quantize_row_q8_0 (quantize_final_output) layer=-1 section=footer */
quantize_row_q8_0(
(const float*)(model->bump + A_EMBEDDED_INPUT),
(void*)(model->bump + A_LAYER_INPUT),
1024
);
if (stop_seq == 590) return;
/* Op 591: gemv_q8_0_q8_0 (logits) layer=-1 section=footer */
gemv_q8_0_q8_0(
(float*)(model->bump + A_LOGITS),
(const void*)(model->bump + W_TOKEN_EMB),
(void*)(model->bump + A_LAYER_INPUT),
151936,
1024
);
if (stop_seq == 591) return;
model->pos++;
}
Compilation and Linking — The .so Is the Model
Step [5/6] is where the compiler hands off to the system compiler. ck_run_v8.py builds or refreshes libckernel_engine.so and libckernel_tokenizer.so, then compiles model_v8.c into libmodel.so.
The compile flags are the ones you would expect for a performance-oriented shared library: -shared, -fPIC, -O3, -march=native, and critically -mcmodel=large, which lets the toolchain handle the large static data in v8 models.
Compiler selection follows a fixed order. If CK_V8_COMPILER is set (gcc, icx, or clang), the runner uses that compiler and exits with an error when it is not on PATH. Otherwise it uses icx if available and falls back to gcc. Once this step finishes, Python is no longer the execution engine. It is just a loader.
The compiler stage is where the generated C stops being documentation and becomes an executable artifact. After step [5/6], what matters for inference is no longer “did Python build the graph?” but “did the native toolchain compile and link the generated program?”
log_step(5, "Compiling to shared library")
# Output library name (ck_chat.py expects libmodel.so or ck-kernel-inference.so)
lib_path = output_dir / "libmodel.so"
kernel_lib = BUILD_DIR / "libckernel_engine.so"
tokenizer_lib = BUILD_DIR / "libckernel_tokenizer.so"
log(f" Source: {model_c_path}", C_DIM)
log(f" Lines: {sum(1 for _ in open(model_c_path))}", C_DIM)
runtime_targets.append(kernel_lib)
if _runtime_lib_needs_rebuild(tokenizer_lib, tokenizer_source_roots):
    runtime_targets.append(tokenizer_lib)
if runtime_targets:
    verb = "missing/stale" if any(not p.exists() for p in runtime_targets) else "stale"
    log(f" Building {verb} runtime libs: {', '.join(p.name for p in runtime_targets)}", C_DIM)
    make_targets = [_path_to_make_target(path) for path in runtime_targets]
    run_cmd(["make"] + make_targets, cwd=PROJECT_ROOT)
# Override with CK_V8_COMPILER=gcc|icx|clang when needed (e.g., profiling portability).
import shutil
compiler = "gcc"
requested_compiler = os.environ.get("CK_V8_COMPILER", "").strip()
if requested_compiler:
    if not shutil.which(requested_compiler):
        log_error(f"Requested CK_V8_COMPILER not found in PATH: {requested_compiler}")
        sys.exit(1)
    compiler = requested_compiler
elif shutil.which("icx"):
    compiler = "icx"
omp_flag = "-qopenmp" if compiler == "icx" else "-fopenmp"
cmd = [
compiler,
"-shared", "-fPIC",
"-mcmodel=large", # Handle large static data in v8 models
"-O3", "-march=native",
"-std=c11",
"-fvisibility=default", # Export CK_EXPORT symbols
omp_flag, # OpenMP for parallelization
f"-I{include_dir}",
f"-I{v8_include}",
f"-I{v8_src}",
"-o", str(lib_path),
str(model_c_path),
str(loader_src),
str(v8_src / "ck_parallel_decode_v8.c"), # Thread-pool parallel GEMV dispatch
str(v8_src / "ck_parallel_prefill_v8.c"), # Thread-pool parallel GEMM dispatch (prefill)
f"-L{BUILD_DIR}",
f"-L{output_dir}", # Also look in output_dir for libckernel_engine.so
"-lckernel_tokenizer", # BPE tokenizer library
# Keep tokenizer before engine: both export legacy ck_tokenizer_* symbols,
# Load C library first (needed to check for C tokenizer)
lib_path = self.model_dir / "ck-kernel-inference.so"
if not lib_path.exists():
    lib_path = self.model_dir / "ck-kernel-decode.so"
if not lib_path.exists():
    lib_path = self.model_dir / "libmodel.so"
if not lib_path.exists():
    print(f"Error: Model library not found in: {self.model_dir}")
    return False
stale_errors = self._runtime_artifact_staleness_errors(lib_path)
if stale_errors:
    for msg in stale_errors:
        print(f"Error: {msg}")
    return False
self.lib = ctypes.CDLL(str(lib_path))
Independence — The Generated Code Stands Alone
The independence claim is not rhetorical. The Qwen3 runtime directory already contains everything the native inference path needs: weights.bump, weights_manifest.json, ir1_decode.json, lowered_decode.json, layout_decode.map, model_v8.c, libmodel.so, libckernel_engine.so, and libckernel_tokenizer.so.
That means a silicon vendor can open the generated C, follow the offsets, profile the kernel calls, and reason about memory access without reverse-engineering a dynamic runtime.
You could port the system to a new platform by reimplementing the kernel ABI and recompiling the generated model file.
This is the strongest form of ahead-of-time compilation: the model has become a library plus a weight blob, not an interpreted object graph. For deployment, the line between “the model” and “the runtime” becomes clear: the model-specific logic lives in libmodel.so, while the reusable math lives in the kernel and tokenizer shared objects.
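One way to make the standalone claim checkable is a small pre-flight test over a runtime directory. The sketch below is illustrative only: missing_artifacts is a hypothetical helper, not part of CKE, and it assumes the four file names this post lists as required after compile are all that the native path needs.

```python
import tempfile
from pathlib import Path

# File names taken from the "required artifacts after compile" list in this post.
REQUIRED_ARTIFACTS = (
    "libmodel.so",
    "libckernel_engine.so",
    "libckernel_tokenizer.so",
    "weights.bump",
)


def missing_artifacts(runtime_dir):
    """Return the required file names that are absent from runtime_dir."""
    root = Path(runtime_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).is_file()]


# Demo against a throwaway directory that only contains the weight blob.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "weights.bump").write_bytes(b"")
    print(missing_artifacts(tmp))
```

An empty result means the directory has every file in the assumed list. It says nothing about whether the libraries are stale relative to the weights, which the loader's own staleness check covers.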

ir1_decode.json
layout_decode.map
libckernel_engine.so
libckernel_tokenizer.so
libmodel.so
lowered_decode.json
model_v8.c
weights.bump
weights_manifest.json
/* ============================================================================
* MODEL STRUCT
* ============================================================================ */
typedef struct {
uint8_t *bump; /* Single contiguous allocation */
size_t bump_size;
uint8_t *bump_weights; /* Weights section */
float *activations; /* Activations section */
float *kv_cache; /* KV cache section */
float *rope_cos; /* RoPE cos table */
float *rope_sin; /* RoPE sin table */
float *logits; /* Output logits */
int pos; /* Current position */
required artifacts after compile:
  libmodel.so
  libckernel_engine.so
  libckernel_tokenizer.so
  weights.bump
required platform work:
  implement the kernel ABI in libckernel_engine
  implement tokenizer ABI if using C tokenizer path
  compile generated model_v8.c for the target toolchain
Smart Front-End / Dumb Back-End — Why This Architecture Works
Now the big design claim should be concrete. The smart side of CKE is the front-end: template resolution, model-family detection, kernel binding, quantization dispatch, fusion detection, backward synthesis, slot planning, buffer assignment, and final lowering into pointer expressions.
The dumb side is the back-end: codegen_v8.py reads already-lowered ops and emits C. It does not decide which attention kernel a model gets. It does not decide whether Q/K norm exists. It does not plan buffers. It just writes what the lowered IR already decided.
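To show how little logic such a back-end needs, here is a minimal sketch of an emitter in that spirit. It is not the code in codegen_v8.py: the op dictionary keys (function, name, args) and the helper names emit_op and emit_body are hypothetical, and the only behavior borrowed from the post is one C call per lowered op followed by a stop_seq checkpoint.

```python
def emit_op(index, op):
    """Render one lowered op as a C call plus its stop checkpoint."""
    args = ",\n    ".join(op["args"])
    return (
        f"/* Op {index}: {op['function']} ({op['name']}) */\n"
        f"{op['function']}(\n    {args}\n);\n"
        f"if (stop_seq == {index}) return;\n"
    )


def emit_body(lowered_ops):
    """Walk lowered ops in order; no kernel or buffer decisions happen here."""
    return "".join(emit_op(i, op) for i, op in enumerate(lowered_ops))


# Hypothetical lowered op; the pointer expressions are already final strings.
example_ops = [
    {
        "function": "quantize_row_q8_0",
        "name": "quantize_input_0",
        "args": [
            "(const float*)(model->bump + A_EMBEDDED_INPUT)",
            "(void*)(model->bump + A_LAYER_INPUT)",
            "1024",
        ],
    },
]

print(emit_body(example_ops))
```

The emitter never inspects the argument strings. Everything that required a decision, such as which kernel to call or which buffer to read, was already baked into the op by earlier stages.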
This split is what keeps the system debuggable. If a kernel choice is wrong, you debug the builder. If a pointer is wrong, you debug memory planning or lowering. If emitted C is malformed, you debug codegen.
The architecture works because it puts intelligence where global context exists and removes intelligence where only local serialization remains.
10,459 vs 1,025
The file-size ratio is a useful proxy for design intent: most of the logic lives in the front-end builder, while the emitter stays comparatively small and mechanical.

Smart front-end:
template selection
model-family detection
kernel binding from weight dtype and semantic op
fusion detection
memory planning
pointer lowering
attention-contract specialization
inference hardening and parity promotion

Dumb back-end:
walk lowered ops in order
emit one C call per lowered op
splice in precomputed pointer expressions
emit offset tables and A_* defines
insert CK_STOP_OP checkpoints
optionally insert CK_PARITY_DUMP and CK_PROFILE plumbing
#!/usr/bin/env python3
from __future__ import annotations
"""
codegen_v8.py - Generate C code from lowered IR.
RESPONSIBILITIES:
1. Create memory layout from layout.json (structs, offsets, allocations)
2. Parse lowered IR and emit function calls (unrolled, one after another)
3. Pass pointers cleanly to all functions
If there are memory issues → fix the memory layout builder, not codegen.
If there are kernel issues → fix the IR lower, not codegen.
Conclusion — The Compiler Pipeline as Technical Portfolio
Put the whole stack together and the compiler shape is obvious: v8 templates and kernel maps, a 10,459-line IR builder, a 705-line memory planner, a 1,025-line decode codegen, a 1,943-line prefill codegen, generated C for model-specific runtimes, and finally shared libraries that run independently after linking.
It also explains why adding a new model family can be such high-leverage work. Once the kernel library and lowering contracts exist, much of the effort is template authoring: describe the architecture correctly, let the builder synthesize IR, let the planner assign memory, and let codegen serialize the plan.
For ShivasNotes readers, this post is the connective tissue for Posts 37 through 41. SIMD, NEON, quantization, flash attention, and performance analysis matter because the compiler pipeline makes those kernels visible, reproducible, and portable.
The final mental model is simple: templates describe the machine, IR proves the plan, the memory map fixes the bytes, lowered IR fixes the pointers, codegen writes the calls, and the compiler turns that into a standalone model library. That is why CKE feels unusual in the LLM world. It treats the generated C as the model-specific executable truth, not as a debugging byproduct.
v8 templates + kernel maps
↓
10,459-line build_ir_v8.py
↓
705-line memory_planner_v8.py
↓
1,025-line codegen_v8.py + 1,943-line codegen_prefill_v8.py
↓
generated model_v8.c / encoder_v8.c
↓
compiler toolchain
↓
libmodel.so + libckernel_engine.so + libckernel_tokenizer.so
↓
regression + parity + smoke gates
write or extend a template JSON
  declare flags
  declare contracts
  declare body/header/footer ops
add family-specific kernels only when necessary
then reuse the compiler:
  build IR1
  run fusion
  plan memory
  lower to pointer expressions
  emit C
  compile to libmodel.so
Continue with SIMD Deep Dive, ARM NEON in CKE, Quantization Deep Dive, Flash Attention on CPU, and CPU Performance Analysis for the kernel-level stories this compiler pipeline turns into native code.