In a world where everyone is rushing to build the next high-level autonomous agent or wrap an API in a nice UI, I decided to go in the opposite direction. I wanted to go deep. I wanted to see the "ghost in the machine."
My journey started with a project called C-Transformer, where I wrote GPT-2 from scratch in C with the help of AI. That experience was transformative. Now, armed with the mistakes and lessons from that first attempt, I am building the C-Kernel-Engine, a robust compute engine built on first principles.
Here is why I chose the hard road, and why I believe constraints are actually a feature, not a bug.
Constraint is a Feature (The CPU Contrarian)
I don't have H100s lying around. But I did have access to 2nd Gen Intel Xeons with AVX-512.
There is a misconception that you need to buy expensive GPUs to even participate in modern AI development. But when you strip it down, AI is fundamentally General Matrix Multiplication (GEMM). There is no law of physics that states only a GPU can handle GEMM. Modern server-grade CPUs are incredibly capable if you know how to talk to them.
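To make that concrete: stripped of tiling, vectorization, and threading, GEMM is just three nested loops. Here is a deliberately naive reference kernel, purely illustrative (the name gemm_naive and the row-major layout are my choices, not anything from the project):

```c
#include <stdio.h>

/* C = A * B for row-major matrices: A is MxK, B is KxN, C is MxN. */
void gemm_naive(const float *A, const float *B, float *C,
                int M, int N, int K) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

int main(void) {
    float A[2 * 3] = {1, 2, 3, 4, 5, 6};
    float B[3 * 2] = {1, 0, 0, 1, 1, 1};
    float C[2 * 2];
    gemm_naive(A, B, C, 2, 2, 3);
    printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```

Everything a fast kernel does, on CPU or GPU, is a rearrangement of these three loops to suit the memory hierarchy.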
Instead of waiting for GPU access, I downloaded the Intel oneAPI toolkit and started writing C. I focused on:
- SIMD Kernels: Using AVX-512 to parallelize operations directly on the CPU.
- Memory Management: Moving away from fragmented malloc calls toward huge, cache-line-aligned allocations to minimize TLB misses.
- Cache Efficiency: Structuring data to avoid cache eviction and false sharing in multi-threaded environments.
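To make those three bullets concrete, here is a minimal, self-contained sketch, not the engine's actual kernel, that touches each one. It assumes a CPU with AVX-512F and a compile line like gcc -O3 -mavx512f:

```c
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

/* Cache efficiency: pad per-thread accumulators to a full 64-byte cache
 * line so two threads never write to the same line (false sharing).
 * Unused in this single-threaded demo; shown here as the pattern. */
typedef struct { float sum; char pad[64 - sizeof(float)]; } padded_acc;

int main(void) {
    const size_t n = 1 << 20;

    /* Memory management: two large cache-line-aligned allocations
     * instead of many small mallocs; contiguous memory means fewer
     * TLB misses and predictable prefetching. */
    float *a = aligned_alloc(64, n * sizeof(float));
    float *b = aligned_alloc(64, n * sizeof(float));
    if (!a || !b) return 1;
    for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 0.5f; }

    /* SIMD kernel: a dot product, 16 floats per fused multiply-add. */
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_load_ps(a + i);  /* aligned loads */
        __m512 vb = _mm512_load_ps(b + i);
        acc = _mm512_fmadd_ps(va, vb, acc);
    }
    float dot = _mm512_reduce_add_ps(acc);

    printf("dot = %f (expected %f)\n", dot, n * 0.5f);
    free(a); free(b);
    return 0;
}
```

A loop like this is typically memory-bound on a single core, which is exactly why the allocation and layout bullets matter as much as the SIMD one.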
By accepting the constraint of "CPU only," I forced myself to learn High-Performance Computing (HPC) techniques that I would have ignored if I had just pip-installed my way to a solution.
Demystifying the "Magic"
When you use PyTorch, you are often dealing with massive bloat. You call a function, magic happens, and you have no clue what actually ran on the metal.
But when you write GPT-2 from scratch in C, the fog lifts. You realize that the "magic" is just kernel stitching and autograd. You see the flow clearly:
- Forward Pass: input tokens → dense representations → positional encodings → attention blocks → MLP (multi-layer perceptron).
- Backward Pass: backpropagation is just the chain rule applied repeatedly.
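To see the chain rule with no framework in the way, here is a toy, self-contained example (not engine code): one linear unit and a squared-error loss, differentiated by hand the same way every layer's backward pass is derived:

```c
#include <stdio.h>

int main(void) {
    float x = 2.0f, w = 0.5f, b = 0.1f, t = 3.0f;

    /* Forward: y = w*x + b, L = (y - t)^2 */
    float y = w * x + b;
    float L = (y - t) * (y - t);

    /* Backward: the chain rule, one link at a time. */
    float dL_dy = 2.0f * (y - t);  /* dL/dy */
    float dL_dw = dL_dy * x;       /* dL/dw = dL/dy * dy/dw */
    float dL_db = dL_dy;           /* dL/db = dL/dy * dy/db, dy/db = 1 */

    printf("L = %f, dL/dw = %f, dL/db = %f\n", L, dL_dw, dL_db);
    return 0;
}
```

Stack enough of these links together and you have the full backward pass of a transformer.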
The "Aha!" Moment The biggest realization came when implementing the backward pass for the final layer. You expect complex derivatives for Softmax and Cross-Entropy loss. But when you actually work through the math, you realize the gradient simplifies beautifully to:
$$ \frac{\partial L}{\partial \text{logits}} = p - \text{one\_hot} $$
It’s just your probability vector minus the target vector. That’s it.
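The claim is easy to verify numerically. Here is a minimal, self-contained check (illustrative code, not the engine's kernels) of softmax plus cross-entropy over a tiny four-token vocabulary; compile with -lm:

```c
#include <math.h>
#include <stdio.h>

#define V 4  /* toy vocabulary size */

int main(void) {
    float logits[V] = {2.0f, 1.0f, 0.1f, -1.0f};
    int target = 1;  /* index of the correct token */

    /* Softmax, with the usual max subtraction for numerical stability. */
    float maxv = logits[0], p[V], sum = 0.0f;
    for (int i = 1; i < V; i++) if (logits[i] > maxv) maxv = logits[i];
    for (int i = 0; i < V; i++) { p[i] = expf(logits[i] - maxv); sum += p[i]; }
    for (int i = 0; i < V; i++) p[i] /= sum;

    /* Cross-entropy loss and its gradient with respect to the logits. */
    float loss = -logf(p[target]);
    for (int i = 0; i < V; i++) {
        float grad = p[i] - (i == target ? 1.0f : 0.0f);  /* p - one_hot */
        printf("dL/dlogits[%d] = %+f\n", i, grad);
    }
    printf("loss = %f\n", loss);
    return 0;
}
```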
Seeing that simplification in raw C code changes how you view AI. It’s not magic; it’s elegant arithmetic. This deep understanding makes modern advancements like Flash Attention (which effectively asks, "Since softmax is just exponentials, a running max, and a running sum, can we compute it in tiles without ever materializing the full attention matrix?") feel like a natural next step rather than a new black box.
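To ground that parenthetical: the core trick behind Flash Attention, online softmax, can be sketched on its own. Keep a running max and a running sum, and the normalizer falls out of a single streaming pass. This is an illustrative sketch of the trick only, not Flash Attention itself (which also tiles the attention matrix and fuses it with the matmuls); compile with -lm:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float scores[] = {0.5f, 2.0f, -1.0f, 3.0f, 0.0f};
    int n = 5;

    /* One streaming pass: whenever a new max appears, rescale the
     * running sum so earlier terms stay consistent. */
    float m = -INFINITY, s = 0.0f;
    for (int i = 0; i < n; i++) {
        float m_new = scores[i] > m ? scores[i] : m;
        s = s * expf(m - m_new) + expf(scores[i] - m_new);
        m = m_new;
    }

    /* p[i] = exp(scores[i] - m) / s, with no separate max-finding pass. */
    for (int i = 0; i < n; i++)
        printf("p[%d] = %f\n", i, expf(scores[i] - m) / s);
    return 0;
}
```

The same rescale-as-you-go update is what lets Flash Attention consume attention scores tile by tile instead of holding the whole matrix in memory.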
The "Spec-Driven" Trap vs. Active Learning
There is a huge trend right now toward "agentic" workflows: writing a spec and letting an AI agent build the whole thing. For some use cases, that’s great.
But if you rely solely on that, you are robbing your brain of its plasticity.
My workflow with AI is different. It’s a dialogue, not a delegation:
- I ask the AI for boilerplate or speed optimizations.
- I study the code. If it uses malloc twice when one allocation would do, I don't just accept it (see the sketch after this list).
- I delete it, rewrite it the way I think is better, and explain why.
- The AI immediately pivots, understands my architectural intent, and rewrites the next piece to match my new standard.
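Here is a hypothetical before/after of the malloc rewrite described in that list; the Params struct and both function names are invented for illustration:

```c
#include <stdlib.h>

typedef struct { float *weights; float *grads; } Params;

/* Before: two allocations, two chances to fragment the heap. */
int params_init_two_mallocs(Params *p, size_t n) {
    p->weights = malloc(n * sizeof(float));
    p->grads   = malloc(n * sizeof(float));
    return (p->weights && p->grads) ? 0 : -1;
}

/* After: one contiguous block, sliced into two views; weights and
 * grads stay adjacent in memory. */
int params_init_one_malloc(Params *p, size_t n) {
    float *block = malloc(2 * n * sizeof(float));
    if (!block) return -1;
    p->weights = block;
    p->grads   = block + n;
    return 0;
}

int main(void) {
    Params p;
    if (params_init_one_malloc(&p, 1024) == 0) {
        /* ... use p.weights and p.grads ... */
        free(p.weights);  /* one free releases the whole block */
    }
    return 0;
}
```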
This feedback loop is healthy. It leverages the AI's "cold" knowledge of SIMD instructions, Linux kernel utilities, and derivative math, but it forces me to be the architect. I am certain the big labs are using AI exactly this way: not just to generate code, but to sharpen their engineers' intuition.
The Road Ahead
I am not a genius. I just realized that my natural intelligence is as plastic as the neural networks I am building. If I put the GPT-2 architecture in front of my brain, my brain will eventually pattern-match and understand it.
The C-Kernel-Engine is the result of that pattern matching. It is my attempt to show that with the right optimization, CPUs can do the heavy lifting, and that building from scratch is the best way to truly learn.
Current Project: C-Kernel-Engine
The Learning Ground: C-Transformer