Tokenization: The First Decision That Shapes Everything

Lab note

Companion post to the Tokenization carousel. Previously: Positional Encoding: Teaching Transformers Where To Look.

Tokenization is the first place where text stops being human language and starts becoming machine state. A model never reads a paragraph the way you or I do. It reads a sequence of integer IDs emitted by a tokenizer, and every downstream tensor operation treats those IDs as the fundamental atoms of the problem. That is why tokenization is not a preprocessing footnote. It is the first decision that quietly shapes everything else.

This post moves from intuition to algorithms to systems work. We will compare character-level and word-level extremes, unpack BPE, WordPiece, and SentencePiece, then end with a practical lesson from the C-Kernel-Engine and the SVG training line. That lesson is simple: if your domain has important control symbols, the right tokenizer is not always the standard one. Sometimes the hardest part is designing the symbolic interface the model is allowed to think in. The tokenizer decides what counts as a first-class symbol. Every embedding lookup and every attention score inherits that choice.

Roadmap for this post

Sections 1 and 2 explain why token boundaries matter and why neither characters nor full words are a satisfying default.

Sections 3 through 5 unpack the three dominant subword families: BPE, WordPiece, and SentencePiece.

Sections 6 through 10 connect those algorithms to vocabulary size, gradient flow, custom domain tokens, and the trie-based implementation work in C-Kernel-Engine. Section 11 closes with the design lessons that matter before we move on to full attention.

Section 1: Why Tokenization Is the First Decision

A transformer never sees raw text. By the time the forward pass begins, the sentence has already been mapped into IDs like [464, 5023, 2746, 3303]. Those IDs index rows in an embedding table, and the embedding rows are what attention actually consumes. So when people say “the model learned language,” there is a hidden precondition: it only learned language through the segmentation chosen by the tokenizer. The tokenizer is the gatekeeper between human strings and model tensors. No tokenizer, no IDs; no IDs, no embedding lookup.

That early segmentation decision affects at least three things immediately. It sets the vocabulary size. It sets how many tokens a sentence expands into. And because transformer attention scales quadratically with sequence length, it also sets a large part of the compute bill before the model has learned a single fact.

The tokenizer also defines what the model can represent comfortably. If a concept is always split across awkward fragments, the network has to learn to reassemble it over and over again. If a useful unit appears as a stable token, the model starts with a head start because one embedding row already corresponds to the recurring pattern. In that sense, token boundaries are the first ontology the model receives. BERT uses roughly 30K WordPiece tokens, LLaMA about 32K BPE-style tokens, and GPT-4 class tokenizers roughly 100K. Those numbers are architecture choices, not accidents.

A good mental model is to think of a token as the atom of meaning available to the network. Embeddings attach vectors to those atoms. Attention moves information among those atoms. Prediction asks which atom should come next. If the atomization is poor, every later layer works uphill. Bad tokenization wastes model capacity on bookkeeping. Good tokenization lets the model spend more capacity on patterns that matter.

A sentence tokenized three ways: character-level with many tiny pieces, word-level with a few large pieces, and subword tokenization in the middle.

Model family	Tokenizer family	Typical vocab size	What the choice emphasizes
BERT	WordPiece	30,522	Compact encoder vocabulary with explicit continuation pieces.
LLaMA	BPE via SentencePiece tooling	32,000	Efficient decoder vocabulary with relatively low fertility.
GPT-2	BPE	50,257	A broader decoder vocabulary that reduces sequence length.
GPT-4 class tokenizers	BPE / tiktoken-style	~100,000	Shorter sequences and strong coverage across mixed domains.

The token boundary is the model’s first contract

Before embeddings, before positional signals, before attention, the tokenizer decides which units the network is even allowed to name directly.

Everything downstream is easier when those units align with recurrent structure in the data.

Section 2: Character-Level vs Word-Level — The Two Extremes

Character-level tokenization sits at one extreme. Every character becomes a token. That keeps the vocabulary tiny, sometimes as small as the byte range, but sequence length explodes. The word transformer becomes 11 separate steps, and because attention is O(n²), making a sequence 11 times longer means roughly 121 times more pairwise attention work for the same stretch of text. Quadratic attention means that long sequences hurt twice: more positions to store and far more token-to-token comparisons to compute.

Word-level tokenization sits at the opposite extreme. Each word becomes one token, so sequence length stays small and attention remains comparatively cheap. But the vocabulary becomes enormous. English alone is messy, and once you add names, typos, code, multilingual text, and specialized jargon, out-of-vocabulary failures become a constant tax. A pure word-level vocabulary must either grow without mercy or collapse rare words into [UNK]. Neither option is attractive.

Neither extreme handles the real world gracefully. Character models are universal but expensive. Word models are efficient but brittle. Subword tokenization is the compromise because it tries to keep the vocabulary manageable while still allowing the model to build larger recurring chunks from smaller pieces.

That is why character-level models such as ByT5 and Charformer are interesting but unusual. They often need architectural tricks, pooling stages, or specialized downsampling to survive the long sequences their tokenizer creates. The tokenizer is cheap to define, but the compute burden shows up elsewhere in the model design. A tiny vocabulary is not automatically efficient. If it makes the sequence dramatically longer, the real cost simply moves into attention and memory.

Attention cost grows with sequence length

python

def relative_attention_cost(old_len, new_len):
    return (new_len * new_len) / (old_len * old_len)

word_level = 1
char_level = len("transformer")  # 11
print(relative_attention_cost(word_level, char_level))
# 121.0

Strategy	Vocabulary size	Sequence length	Main failure mode
Character-level	Tiny	Very long	Too much attention compute per sentence.
Word-level	Huge	Short	Vocabulary explosion and constant OOV handling.
Subword	Medium	Medium	Requires a learned segmentation strategy.

Why subword tokenization won

It is the engineering middle ground: enough compositionality to cover rare words, enough compression to keep sequences practical.

Section 3: Byte Pair Encoding (BPE)

BPE started life as a compression algorithm. Philip Gage introduced byte pair encoding in 1994 as a way to compress data by repeatedly replacing frequent symbol pairs with new symbols. NLP borrowed that idea later because compression and tokenization are cousins. Both ask the same question: which recurring pieces are worth naming explicitly? BPE was not invented for language modeling. The breakthrough was realizing that “compress frequent pairs” is also a sensible way to discover reusable subwords.

The training algorithm is bottom-up. You start with a base vocabulary of small units such as characters or bytes. Then you count adjacent pairs in a corpus, merge the most frequent pair into a new token, record the merge rule, and repeat until you reach the target vocabulary size. The final tokenizer is therefore a learned recipe for building larger chunks from smaller ones.

Training algorithm: building the vocabulary

Start with a base vocabulary of individual characters or bytes.
Count all adjacent symbol pairs in the training corpus.
Merge the most frequent pair into a new token.
Record that merge as a rule with a priority order.
Repeat until the vocabulary reaches the desired size.

Encoding new text uses those merge rules as replay instructions. The tokenizer first breaks the input back down to the base symbols. Then it applies the learned merges in priority order until no more valid merges remain. What survives is the tokenization for the new string. The merge table is the real model of the tokenizer. It captures which adjacent patterns the corpus made frequent enough to deserve their own token.

Toy BPE trainer sketch

python

from collections import Counter

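def train_bpe(words, target_merges):
    # Each word starts as a list of single-character symbols.
    pieces = [list(word) for word in words]
    merges = []
    for _ in range(target_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for word in pieces:
            for i in range(len(word) - 1):
                pair_counts[(word[i], word[i + 1])] += 1
        if not pair_counts:
            break  # every word is a single symbol; nothing left to merge
        # Ties resolve to the pair that was counted first.
        pair = max(pair_counts, key=pair_counts.get)
        merges.append(pair)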
        merged_symbol = ''.join(pair)
        for word in pieces:
            i = 0
            out = []
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                    out.append(merged_symbol)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            word[:] = out
    return merges
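
The encoding side, replaying merges in priority order, can be sketched in the same toy style. This is a minimal illustration rather than a production encoder: `encode_bpe` and the hand-written merge table below are hypothetical, and real tokenizers add byte fallback, pre-tokenization, and caching on top.

```python
def encode_bpe(word, merges):
    """Tokenize one word by replaying learned merges in priority order."""
    # Earlier merges have higher priority, so they get a lower rank.
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect every adjacent pair that has a learned merge rule.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank
        ]
        if not candidates:
            break  # no rule applies any more
        # Apply the highest-priority rule at its leftmost position.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Hand-written merge table for illustration (an assumption, not a trained model).
merges = [("l", "o"), ("lo", "w"), ("low", "e"), ("lowe", "r")]
print(encode_bpe("lowering", merges))  # ['lower', 'i', 'n', 'g']
print(encode_bpe("slow", merges))      # ['s', 'low']
```

Note that the encoder never consults corpus counts. Everything it knows lives in the ordered merge list, which is why two BPE tokenizers with the same vocabulary but different merge orders can segment the same word differently.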

Concrete BPE merge example

text

Corpus: "low lower lowest lowering" (split on spaces into four words)
Step 0: Base vocab = {l, o, w, e, r, s, t, i, n, g}
Step 1: Most frequent pair: (l, o) -> "lo"        (4 occurrences, tied with (o, w))
Step 2: Most frequent pair: (lo, w) -> "low"      (4 occurrences)
Step 3: Most frequent pair: (low, e) -> "lowe"    (3 occurrences; (e, r) has only 2)
Step 4: Most frequent pair: (lowe, r) -> "lower"  (2 occurrences)
...
After 4 merges: "low" | "lower" | "lowe" + "s" + "t" | "lower" + "i" + "n" + "g"

That is why BPE works so well for large decoder models. Frequent stems, suffixes, and whitespace-prefixed chunks become stable reusable tokens. Rare words still remain representable because they can always fall back to smaller pieces. The system is simple, deterministic, and scales well. BPE feels natural in language because language itself contains frequent reusable chunks: stems, endings, spaces, punctuation patterns, and domain phrases.

A step-by-step BPE diagram showing characters merging into lo, low, er, and finally lower.

Where BPE dominates

GPT-2, GPT-3/4, LLaMA, Mistral, Falcon, and many modern decoder stacks all rely on BPE-style tokenization because it is simple, fast, and effective at scale.

Section 4: WordPiece (BERT)

WordPiece looks similar to BPE at first glance because it also builds a subword vocabulary. The key difference is the scoring rule used during training. Classic BPE merges the most frequent adjacent pair. WordPiece instead prefers the merge that most improves the likelihood of the training data under the model’s objective. BPE asks “what pair appears most often?” WordPiece asks “what merge best explains the corpus under the scoring objective?”

The visual hallmark of WordPiece is the ## continuation marker. A token like playing may become ["play", "##ing"]. That prefix makes it explicit that the token is not a new word start. It is a continuation piece attached to what came before.

Encoding algorithm: greedy longest-match

For each word, try to match the longest substring in the vocabulary.
If the whole word exists, emit it as one token.
Otherwise, emit the longest matching prefix and continue on the remainder with a ## prefix.
If even a single character cannot be matched, the tokenizer falls back to [UNK] or an equivalent unknown-token path.
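
The four steps above can be sketched in a few lines of Python. This is a minimal version under stated assumptions: the function name `wordpiece_encode` and the tiny vocabulary are hypothetical, and the unknown-word policy follows the BERT-style convention of mapping the whole word to [UNK] when any part of it cannot be matched.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # assumed policy: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration only.
vocab = {"play", "##ing", "un", "##happi", "##ness"}
print(wordpiece_encode("playing", vocab))      # ['play', '##ing']
print(wordpiece_encode("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece_encode("xylophone", vocab))    # ['[UNK]']
```

The inner loop is where the worst-case quadratic behavior comes from: each position may test many shrinking substrings before one hits.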

That encoding rule matters because WordPiece does not replay merge rules the way BPE does. It performs a greedy search over the current vocabulary. In the worst case the scan can be quadratic in word length, but in practice the average case is usually much closer to linear because matches are found quickly on common text. The algorithm is surprisingly small once you see it in code. WordPiece tokenization is dominated by longest-match substring checks. Its runtime profile depends much more on vocabulary lookup speed than on merge-rule replay.

WordPiece greedy longest-match in C

// WordPiece: Greedy Longest-Match (from BC Gov HPC Embeddings)
// Excerpt: buffer, prefix_buffer, token_result, and token_count are not
// declared here; they are assumed to live in the surrounding source.
tokens_t get_token(HashTable *table, const char *text) {
    size_t len = strlen(text);
    bool prefix = false;

    for (size_t i = 0; i < len;) {
        bool found = false;
        // Try progressively shorter substrings
        for (size_t j = len - i; j > 0; j--) {
            strncpy(buffer, text + i, j);
            buffer[j] = '\0';

            // Add ## prefix for continuation tokens
            if (prefix)
                snprintf(prefix_buffer, j + 3, "##%s", buffer);
            else
                snprintf(prefix_buffer, j + 1, "%s", buffer);

            char *key_found = check_substring(table, prefix_buffer);
            if (key_found) {
                token_result.token_values[token_count++] = atoi(key_found);
                prefix = true;
                i += j;
                found = true;
                break;
            }
        }
        if (!found) i++; // Skip unknown character
    }
    return token_result;
}

That particular implementation matters beyond pedagogy. The BC Gov HPC work used WordPiece in pure C for legal document processing with BERT models. It was engineered with AVX-512-friendly preprocessing and fast vocabulary lookup because tokenization overhead becomes real when documents are large and batch throughput matters. Even “just preprocessing” can be a systems problem. WordPiece survives in production because the algorithm is understandable, deterministic, and fast enough when the lookup path is engineered well.

A waterfall-style WordPiece diagram showing a greedy longest-match scan over the word unhappiness.

Feature	BPE	WordPiece
Training signal	Highest-frequency pair merge	Highest-likelihood merge
Encoding strategy	Replay merge rules	Greedy longest-match over vocabulary
Continuation marker	Usually implicit by position	`##` prefix
Common home	Decoder LLMs	Encoder models such as BERT

Why WordPiece still matters

BERT, DistilBERT, MiniLM, and many production encoders still rely on WordPiece-style vocabularies. If you work with classification, retrieval, or legal/document models, you still meet it constantly.

Section 5: SentencePiece (Unigram Model)

SentencePiece makes a different philosophical move. It operates on raw text instead of assuming that words have already been split by spaces or language-specific rules. That means it does not need a separate pre-tokenization stage to decide where words begin. This is one reason it became so attractive for multilingual work. SentencePiece does not require language-specific whitespace tokenization before training. That makes it much easier to apply across scripts and languages.

Its most recognizable marker is the ▁ symbol. SentencePiece uses that character to encode word boundaries directly inside the token stream. So I like cats may become ["▁I", "▁like", "▁cats"]. Whitespace becomes part of the representation instead of an external parsing rule.
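
A short sketch shows why folding whitespace into the token stream keeps decoding lossless. The helper names `mark_spaces` and `unmark` are hypothetical, and the sketch assumes the common convention of marking the first word as well as every later one.

```python
WORD_START = "\u2581"  # the "▁" marker character

def mark_spaces(text):
    """Replace spaces with the marker and mark the first word too (assumed convention)."""
    return WORD_START + text.replace(" ", WORD_START)

def unmark(pieces):
    """Invert the marking: concatenate pieces, restore spaces, drop the leading one."""
    text = "".join(pieces).replace(WORD_START, " ")
    return text[1:] if text.startswith(" ") else text

print(mark_spaces("I like cats"))             # ▁I▁like▁cats
print(unmark(["▁I", "▁like", "▁cats"]))       # I like cats
print(unmark(["▁I", "▁li", "ke", "▁cats"]))   # I like cats
```

Because the boundary information travels inside the pieces, decoding is plain concatenation plus one character substitution, regardless of how the text was segmented.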

Unigram training algorithm

Start with a large candidate vocabulary containing many possible substrings.
Assign probability scores to those candidate tokens using corpus statistics.
Use dynamic programming such as Viterbi search to find the most probable tokenization of each sentence.
Remove tokens whose removal hurts the total corpus likelihood the least.
Repeat pruning until the vocabulary reaches the desired size.

This is the reverse of BPE. BPE grows a vocabulary from the bottom up by merging. SentencePiece unigram starts broad and then prunes the candidate set down. It is top-down rather than bottom-up. BPE asks “what new piece should we add next?” SentencePiece unigram asks “what pieces can we safely remove while preserving the best segmentation?”

SentencePiece-style segmentation as dynamic programming

python

def best_segmentation(text, vocab_scores):
    best = [(-1e9, []) for _ in range(len(text) + 1)]
    best[0] = (0.0, [])
    for i in range(len(text)):
        score_i, path_i = best[i]
        if score_i < -1e8:
            continue
        for token, token_score in vocab_scores.items():
            if text.startswith(token, i):
                j = i + len(token)
                cand = (score_i + token_score, path_i + [token])
                if cand[0] > best[j][0]:
                    best[j] = cand
    return best[len(text)]
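
The pruning step from the training list can be sketched on top of the same dynamic program. This is a deliberately simplified illustration: `prune_once` removes a single token per round and keeps the remaining scores fixed, whereas real unigram trainers typically re-estimate token probabilities between rounds and prune many tokens at once. All names and log-probabilities below are made up for the example.

```python
import math

def viterbi_score(text, log_probs):
    """Best total log-probability over all segmentations of text (-inf if none)."""
    best = [-math.inf] * (len(text) + 1)
    best[0] = 0.0
    for i in range(len(text)):
        if best[i] == -math.inf:
            continue  # position i is unreachable
        for token, lp in log_probs.items():
            if text.startswith(token, i):
                j = i + len(token)
                best[j] = max(best[j], best[i] + lp)
    return best[len(text)]

def corpus_score(corpus, log_probs):
    return sum(viterbi_score(sentence, log_probs) for sentence in corpus)

def prune_once(corpus, log_probs):
    """Drop the multi-character token whose removal costs the least likelihood."""
    base = corpus_score(corpus, log_probs)
    losses = {}
    for token in log_probs:
        if len(token) == 1:
            continue  # keep single characters so every string stays segmentable
        reduced = {t: lp for t, lp in log_probs.items() if t != token}
        losses[token] = base - corpus_score(corpus, reduced)
    victim = min(losses, key=losses.get)
    return victim, {t: lp for t, lp in log_probs.items() if t != victim}

# Made-up scores: single characters are expensive, useful chunks are cheap.
log_probs = {c: -3.0 for c in "lowerst"}
log_probs.update({"low": -1.0, "er": -1.5, "est": -1.5, "lo": -2.5})
victim, smaller = prune_once(["lower", "lowest"], log_probs)
print(victim)  # lo
```

Here "lo" is pruned first because the best segmentations never use it: removing it costs nothing, while removing "low" would force both words back onto expensive character-level paths.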

The benefit is not only elegance. Because SentencePiece works on raw text, it does not hard-code assumptions about where “words” are in languages that do not use spaces the same way English does. That makes it attractive for CJK languages, Arabic, and multilingual corpora where hand-designed segmentation rules quickly become fragile. It is a tokenizer that takes language diversity seriously. SentencePiece treats segmentation as a learned probabilistic decision, not a fixed language-specific preprocessing rule.

A two-panel diagram contrasting BPE bottom-up merges with SentencePiece unigram top-down pruning.

Why SentencePiece became influential

T5, mT5, ALBERT, XLNet, and many multilingual systems adopted SentencePiece because it works directly on raw text and makes multilingual coverage much easier to manage.

Section 6: The Vocabulary Size Trade-Off

Vocabulary size is a balancing act. A tiny vocabulary such as raw bytes guarantees that every possible text is representable. But the sequence becomes long. A very large vocabulary shortens the sequence, yet the embedding table and output head become expensive, and rare tokens may be learned poorly because they appear too infrequently. A 100K-token vocabulary with d_model = 4096 needs roughly 409.6 million embedding parameters before you count the rest of the transformer.

This is where the fertility ratio becomes useful. Fertility means the average number of tokens needed per word or per linguistic unit. Lower fertility usually means more efficient inference because the model needs fewer steps to express the same sentence. But low fertility bought with an excessively large vocabulary can create dead weight in the parameter budget. Lower fertility is good until the vocabulary becomes so large that rare tokens stop getting enough gradient updates to be useful.
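
Fertility is easy to measure for any tokenizer you can call as a function. The sketch below assumes whitespace-separated words as the unit of comparison, which is itself a simplification for languages that do not delimit words with spaces. The `fertility` helper and the two toy tokenizers are illustrative names, not part of any library.

```python
def fertility(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        words = text.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    return total_tokens / total_words if total_words else 0.0

def char_level(word):
    return list(word)   # one token per character

def word_level(word):
    return [word]       # one token per word

sample = ["the transformer reads tokens"]
print(fertility(sample, char_level))  # 6.25
print(fertility(sample, word_level))  # 1.0
```

A subword tokenizer would land between those two extremes, and comparing the number across candidate vocabularies on your own corpus is a quick way to see what each extra block of vocabulary actually buys.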

A dual-axis chart showing average sequence length dropping as vocabulary size increases, while embedding parameters rise sharply.

Model	Approximate vocab size	Tokenizer family	Why it lands there
BERT	30,522	WordPiece	Encoder-oriented compromise between coverage and compact embeddings.
GPT-2	50,257	BPE	Larger decoder vocab to shorten sequences on open-domain web text.
LLaMA	32,000	BPE via SentencePiece tooling	Compact decoder vocab with good efficiency.
Qwen2	151,936	BPE	Very large vocabulary for broad multilingual and mixed-domain coverage.
GPT-4 class tokenizer	~100,000	BPE / tiktoken-style	Aggressive sequence compression across many domains.

Embedding table size grows linearly with vocab size

python

def embedding_params(vocab_size, d_model):
    return vocab_size * d_model

for vocab in [30522, 50257, 100000, 151936]:
    params = embedding_params(vocab, 4096)
    print(vocab, params / 1e6, 'million parameters')

Multilingual models often need a larger vocabulary because multiple scripts must coexist efficiently. If the vocabulary is too small, each script gets fragmented into many pieces and fertility rises. If the vocabulary is too large, the model spends an enormous fraction of its parameters on embeddings and logits. The sweet spot depends on model size, corpus size, and domain diversity. Choosing vocab size is choosing where to spend parameters: on longer sequences and more attention, or on larger embedding and output tables.

There is no universal best vocab size

A tokenizer for a small local model trained on one domain should not copy the vocabulary budget of a frontier model trained on the whole internet. The right size depends on the job.

Section 7: When NOT to Use Standard Tokenizers — Custom Token Design

Standard BPE, WordPiece, and SentencePiece are excellent defaults for natural language. But structured outputs change the problem. If the model is generating code, a DSL, SVG control prompts, or another formal interface, standard tokenizers can accidentally fragment the very symbols that carry the task semantics. That is where custom token design becomes more important than generic subword compression. A tokenizer can be perfectly reversible and still be a bad symbolic interface for a tiny model. Byte-perfect decoding does not guarantee that the most important control units survive as single tokens.

The SVG training experiments documented in Training SVGs With C-Kernel-Engine: A Research Report make this concrete. In spec03, standard BPE passed byte-perfect encode/decode roundtrip. The tokenizer report showed byte_match_rate = 1.0 and successful reversibility. But the same report also showed that exact learned pieces for canonical control tags were 0 / 50. The tokenizer could reconstruct the bytes, but it did not preserve the canonical control tags as learned single pieces. For a small model, that is a real representational failure.

That distinction explains why spec04 could reach 100% renderability and 0% exactness. The model learned enough syntax to emit something renderable. It did not learn the intended symbolic contract reliably enough to hit the exact requested layout. Reconstructing fragmented control tags consumed capacity that should have been available for semantics. For a tiny model trained on a formal language, protecting the right atoms can matter more than finding the globally most compressed subword inventory.

Observation from the SVG line	What it meant	Why it matters for tokenization
`byte_match_rate = 1.0`	Encode/decode roundtrip was solved.	Reversibility alone is not enough.
`0 / 50` exact prompt atoms	Control tags were still fragmented.	The model had to learn intent through broken-up pieces.
100% renderability, 0% exactness	Syntax improved without contract obedience.	Symbolic interface and curriculum still needed repair.

The repair strategy was a hybrid tokenizer. First extract the domain tokens that really matter. Then reserve them as protected atoms. After that, let BPE operate on the remaining natural-language and numeric content where compression actually helps.

Scan the corpus for domain tokens such as [layout:bullet-panel] and other bracketed control atoms.
Reserve those atoms as explicit tokens marked special=True or their equivalent.
Apply BPE only to the remaining text, numbers, punctuation, and narrative content.
Train the model on a vocabulary where control intent stays whole while ordinary text still benefits from subword compression.
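
The four-step repair above can be sketched as a thin wrapper around any subword encoder. This is an illustration under stated assumptions, not the C-Kernel-Engine implementation: the regular expression for bracketed atoms, the function names, and the stand-in subword encoder are all hypothetical.

```python
import re

# Assumed shape of a control atom: [kind:value] in lowercase with hyphens.
CONTROL_ATOM = re.compile(r"\[[a-z]+:[a-z0-9-]+\]")

def split_protected(text, protected):
    """Split text into (segment, is_atom) pairs, keeping protected atoms whole."""
    out = []
    pos = 0
    for m in CONTROL_ATOM.finditer(text):
        if m.group() not in protected:
            continue  # unknown bracketed text is left to the subword encoder
        if m.start() > pos:
            out.append((text[pos:m.start()], False))
        out.append((m.group(), True))
        pos = m.end()
    if pos < len(text):
        out.append((text[pos:], False))
    return out

def hybrid_encode(text, protected, subword_encode):
    """Emit protected atoms as single tokens; delegate everything else."""
    tokens = []
    for segment, is_atom in split_protected(text, protected):
        if is_atom:
            tokens.append(segment)
        else:
            tokens.extend(subword_encode(segment))
    return tokens

protected = {"[task:svg]", "[layout:bullet-panel]"}
prompt = "[task:svg][layout:bullet-panel] two bullets"
# str.split stands in for a real BPE encoder in this sketch.
print(hybrid_encode(prompt, protected, str.split))
# ['[task:svg]', '[layout:bullet-panel]', 'two', 'bullets']
```

The design choice worth noticing is that protection happens before subword encoding ever sees the text, so no merge rule can split or absorb a control atom.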

Protected control vocabulary sketch

json

{
  "system_tokens": ["<|unk|>", "<|bos|>", "<|eos|>", "<|pad|>"],
  "task_tokens": ["[task:svg]", "[task:card]", "[task:chart]"],
  "shape_tokens": ["[shape:circle]", "[shape:rect]", "[shape:triangle]"],
  "palette_tokens": ["[palette:warm]", "[palette:cool]", "[palette:mono]"],
  "layout_tokens": ["[layout:bullet-panel]", "[layout:compare-panels]"],
  "size_tokens": ["[size:xs]", "[size:sm]", "[size:md]", "[size:lg]"]
}

That leads to the corrected lesson from the SVG line. For a small local model, the training problem is not merely “more data” or “lower loss.” It is choosing the right symbolic interface, protecting the right atoms, generating the right curriculum mixture, and proving the contract with probes. The tokenizer is part of that contract. The model can only learn the contract you expose to it. If the interface is fragmented, the curriculum and optimizer inherit that fragmentation.

A comparison showing standard BPE splitting a layout control token into many fragments while a custom tokenizer keeps it atomic.

Custom token design is not a hack

It is an admission that some domains have natural symbols the model should be allowed to name directly.

For formal languages, protecting those symbols can be more valuable than squeezing out every possible bit of subword compression.

Section 8: How Tokenization Affects the Backward Pass

Tokenization itself has no backward pass. It is a discrete preprocessing step, not a differentiable layer. The model never computes gradients with respect to “where should the token boundary have been?” inside the usual training loop. Once token IDs exist, the differentiable story starts at the embedding lookup.

The forward path is simple: token ID in, embedding row out. The backward path is also simple: the gradient arriving at that embedding vector gets accumulated into the corresponding row of the embedding table. Only tokens that appeared in the batch receive updates. Rare tokens therefore get sparse learning signals, while common tokens receive many more gradient touches. Embedding gradients are sparse with respect to the vocabulary. A token that never appears in the batch gets no update at all.

Vocabulary size shows up twice in the parameter count. The input embedding table has shape [vocab_size, d_model]. The output projection or LM head often has shape [d_model, vocab_size]. If weights are tied, those are shared parameters, but the vocabulary dimension still dominates their total size. Every extra token widens not only the embedding table but also the model’s output distribution. Bigger vocabularies mean bigger logits.

Component	Shape	Why tokenization matters
Input embedding	`[vocab_size, d_model]`	Every token gets its own learned row vector.
LM head	`[d_model, vocab_size]`	Next-token prediction must score every token in the vocabulary.
Weight tying	Shared table	Gradients from input meaning and output prediction accumulate into the same parameters.

Embedding lookup and sparse gradient intuition

python

def forward(token_ids, embedding_table):
    return embedding_table[token_ids]

# backward intuition:
# dL/d_embedding_table[row] accumulates only for rows in token_ids
# unused rows receive zero gradient on this batch
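
The backward intuition can be made concrete without any framework. The sketch below uses plain Python lists in place of tensors, and `embedding_backward` is a hypothetical helper: it shows the accumulation pattern, not how any particular library implements it.

```python
def embedding_backward(token_ids, upstream_grads, vocab_size, d_model):
    """Accumulate per-position gradients into a [vocab_size, d_model] table."""
    grad_table = [[0.0] * d_model for _ in range(vocab_size)]
    for token_id, grad in zip(token_ids, upstream_grads):
        row = grad_table[token_id]
        for k in range(d_model):
            row[k] += grad[k]  # repeated IDs add up in the same row
    return grad_table

# Token 2 appears twice, token 0 once, tokens 1 and 3 never.
token_ids = [2, 0, 2]
upstream = [[1.0, 1.0], [0.5, 0.0], [2.0, -1.0]]
grads = embedding_backward(token_ids, upstream, vocab_size=4, d_model=2)
print(grads[2])  # [3.0, 0.0]
print(grads[0])  # [0.5, 0.0]
print(grads[1])  # [0.0, 0.0]
```

Rows 1 and 3 stay at zero for this batch, which is the sparsity the paragraph above describes: how often a row is touched depends entirely on how often the tokenizer emits that ID.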

Weight tying makes the interaction even more interesting. When the input embedding table and output projection share parameters, a token receives gradients from two roles at once. One role says “what does this token mean when I read it?” The other says “how should I shape this token’s logit when I predict it?” Weight tying means a token learns both as a meaning vector and as a prediction target. Tokenization therefore shapes both the input interface and the output competition.

Backward-pass takeaway

Tokenization changes the sparsity pattern of learning. It decides which rows exist, how often they are updated, and how large the embedding/logit structures must be.

Section 9: C-Kernel-Engine Tokenizer Implementation

The C-Kernel-Engine work is useful because it turns tokenization from an abstract algorithm into concrete systems code. The engine supports multiple tokenizer families inside one C runtime so it can match the expectations of different checkpoints. That design begins with an explicit tokenizer enum. Even at the type level, BPE, WordPiece, and SentencePiece are treated as distinct execution paths.

C-Kernel-Engine tokenizer type enum

typedef enum {
    CK_TOKENIZER_BPE = 0,       // GPT-2, LLaMA, Qwen
    CK_TOKENIZER_WORDPIECE = 1, // BERT, DistilBERT
    CK_TOKENIZER_SPM = 2        // SentencePiece (unigram)
} CKTokenizerType;

Vocabulary lookup is where the implementation becomes especially interesting. A trie, or prefix tree, matches the problem structure better than a hash table because tokenization repeatedly asks prefix questions. The reported C-Kernel-Engine trie path achieved dramatic speedups over the older hash-table path and also beat common PyTorch or tiktoken baselines on long text. That is the sort of result you get when the data structure matches the workload. Average throughput jumped from roughly 2,352 characters per millisecond for the hash path to more than 102,000 characters per millisecond for the trie path, a 43.6× improvement.

A grouped bar chart comparing hash-table lookup, trie lookup, and PyTorch/tiktoken lookup times across different text lengths.

Text length	C-Kernel hash	C-Kernel trie	PyTorch / tiktoken	Trie speedup
11 chars	0.006 ms	0.006 ms	0.010 ms	1.65× vs PyTorch
200 chars	0.127 ms	0.010 ms	0.043 ms	4.54× vs PyTorch
3,000 chars	1.312 ms	0.031 ms	0.484 ms	15.48× vs PyTorch
15,000 chars	6.296 ms	0.131 ms	2.405 ms	18.35× vs PyTorch
Average	2,352 ch/ms	102,492 ch/ms	6,190 ch/ms	43.6× vs hash / 16.6× vs PyTorch

That speedup makes theoretical sense. Trie lookup is O(k) in the token length because it walks the bytes of the candidate piece once. Hash tables give you average-case constant-time lookup for exact keys, but tokenization rarely knows the final key in advance. It is constantly testing prefixes, which is exactly where tries shine. A trie is a natural fit for tokenization because tokenization is a prefix-search problem disguised as a vocabulary lookup problem.
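
A small Python sketch makes the prefix-walk argument tangible. It is not the C-Kernel-Engine trie, which is written in C and tuned for memory layout. The class and function names here are hypothetical, and the example only shows why one walk answers every prefix question at once.

```python
class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}
        self.token_id = None  # set only where a vocabulary entry ends

def build_trie(vocab):
    """Build a character trie from a {token: id} mapping."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def longest_match(root, text, start):
    """Return (token_id, end) for the longest entry at text[start:], else (None, start)."""
    node = root
    best = (None, start)
    for pos in range(start, len(text)):
        node = node.children.get(text[pos])
        if node is None:
            break  # no vocabulary entry continues with this character
        if node.token_id is not None:
            best = (node.token_id, pos + 1)
    return best

vocab = {"l": 0, "lo": 1, "low": 2, "lower": 3, "e": 4}
root = build_trie(vocab)
print(longest_match(root, "lowest", 0))  # (2, 3)
print(longest_match(root, "lower", 0))   # (3, 5)
print(longest_match(root, "xyz", 0))     # (None, 0)
```

A hash-table version of the same query would have to build and hash a separate substring for each candidate length. The trie visits each character once and stops as soon as no entry can continue.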

The BPE implementation also does the right thing algorithmically. It applies merge rules in priority order rather than collapsing everything with a greedy longest-match shortcut. That detail is crucial for HuggingFace parity because true BPE and WordPiece are not interchangeable. The engine also supports byte fallback and space-prefix auto-detection so GPT-2-style Ġ and SentencePiece-style ▁ conventions can be handled correctly.

Simplified C-Kernel-Engine BPE encode flow

// C-Kernel-Engine BPE encode (simplified)
int ck_tokenizer_encode(CKTokenizer *tok, const char *text,
                        int max_len, int *output_ids, int max_tokens) {
    int num_tokens = 0; // IDs written to output_ids; the steps below are elided in this sketch
    // 1. Pre-tokenize: split on whitespace/punctuation
    // 2. For each word: look up in trie
    //    - If whole word matches: emit single token ID
    //    - Otherwise: apply merge rules iteratively
    //      a. Start with character-level tokens
    //      b. Find highest-priority applicable merge
    //      c. Apply merge, reducing token count
    //      d. Repeat until no more merges apply
    // 3. Handle special tokens (BOS, EOS)
    return num_tokens;
}

The project also supports multiple tokenizer file formats. GGUF loading matters for direct checkpoint integration. JSON matters for compatibility with HuggingFace tokenizer exports. Binary and memory-mapped formats matter for production startup time and zero-copy access. Qwen2, with its 151,936-token vocabulary, is a good example of why tokenizer infrastructure matters: a very large vocabulary amplifies every inefficiency in lookup, loading, and cache behavior.

GGUF for direct model loading alongside LLaMA and Mistral style checkpoints.
JSON for HuggingFace tokenizer.json compatibility.
Binary vocabularies for optimized memory-mapped loading in production deployments.
Plain-text token lists for inspection, debugging, and conversion utilities.

The WordPiece branch also carries real lineage. It traces back to the BC Gov HPC Embeddings project where a WordPiece tokenizer was built in C17 with AVX-512 SIMD support. That work used fast case conversion, SIMD-assisted string comparison, and a bump allocator to keep the hot path free of unnecessary heap traffic. Later, that practical experience fed into the generalized tokenizer backends used in C-Kernel-Engine. Tokenizer engineering is real systems work: data structures, SIMD lanes, memory layout, file formats, and correctness parity all matter once the model leaves the notebook.

Implementation lesson

Once vocabularies get large and throughput matters, tokenization is not “just preprocessing.” It becomes a genuine performance surface, and careful data-structure choices can produce order-of-magnitude wins.

Section 10: Comparison Table — WordPiece vs BPE vs SentencePiece

Now that the individual methods are on the table, the comparison becomes easier. BPE is the pragmatic workhorse. WordPiece is the classic encoder-friendly longest-match scheme. SentencePiece unigram is the most language-agnostic and probabilistic of the three.

Method	Training algorithm	Encoding algorithm	Subword marker	Used by	Pros	Cons
BPE	Bottom-up merges by frequency	Replay merge rules in priority order	Usually implicit by position	GPT-2/3/4, LLaMA, Mistral	Simple, deterministic, efficient, dominant in decoder LLMs	Needs pre-tokenization; merge rules reflect corpus frequency more than explicit likelihood
WordPiece	Bottom-up merges by likelihood score	Greedy longest-match over current vocabulary	`##` continuation prefix	BERT, DistilBERT, many encoders	Clear continuation markers, effective in encoder pipelines	Still needs pre-tokenization and can fall back to `[UNK]`
SentencePiece (unigram)	Top-down pruning by corpus loss	Viterbi / most probable segmentation	`▁` word-start marker	T5, mT5, ALBERT, XLNet	Raw-text training, language-agnostic, multilingual-friendly	More complex training story and less intuitive than basic BPE

Feature	BPE	WordPiece	SentencePiece (Unigram)
Training	Bottom-up merges (frequency)	Bottom-up merges (likelihood)	Top-down pruning (loss)
Encoding	Replay merge rules in order	Greedy longest-match	Viterbi (most probable path)
Subword marker	None / position-based	`##` prefix for continuation	`▁` prefix for word start
Pre-tokenization	Required	Required	Not required
Multilingual behavior	Needs larger vocab	Needs larger vocab	Naturally handles any script
Determinism	Yes	Yes	Can be probabilistic
Used by	GPT-2/3/4, LLaMA, Mistral	BERT, DistilBERT	T5, mT5, ALBERT, XLNet
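
The "greedy longest-match" entry in the WordPiece column can be sketched in a few lines of C. This is an illustration under stated assumptions rather than any library's implementation: the names `wp_in_vocab` and `wp_encode_word` are hypothetical, the vocabulary is a plain string array searched linearly, and the input is assumed to be ASCII so that shrinking by one byte never splits a character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define WP_MAX_PIECES 32
#define WP_MAX_PIECE_LEN 64

/* Linear vocabulary lookup; a real backend would use a trie or hash table. */
static int wp_in_vocab(const char *piece, const char *const *vocab, size_t vocab_size)
{
    for (size_t i = 0; i < vocab_size; i++)
        if (strcmp(piece, vocab[i]) == 0) return 1;
    return 0;
}

/* Greedy longest-match-first over one pre-tokenized word.
 * Returns the piece count. If any position cannot be matched,
 * the whole word becomes the single piece "[UNK]". */
static int wp_encode_word(const char *word,
                          const char *const *vocab, size_t vocab_size,
                          char out[WP_MAX_PIECES][WP_MAX_PIECE_LEN])
{
    size_t len = strlen(word);
    size_t start = 0;
    int n = 0;
    while (start < len) {
        size_t end = len;
        int found = 0;
        /* Try the longest remaining substring first, then shrink it. */
        while (end > start) {
            char cand[WP_MAX_PIECE_LEN];
            size_t sub = end - start;
            if (sub + 3 <= sizeof cand) {   /* room for "##" and '\0' */
                snprintf(cand, sizeof cand, "%s%.*s",
                         start > 0 ? "##" : "", (int)sub, word + start);
                if (wp_in_vocab(cand, vocab, vocab_size)) {
                    if (n < WP_MAX_PIECES) {
                        strcpy(out[n++], cand);
                        found = 1;
                    }
                    break;
                }
            }
            end--;
        }
        if (!found) {
            strcpy(out[0], "[UNK]");
            return 1;
        }
        start = end;
    }
    return n;
}
```

Given a toy vocabulary containing `un`, `##happi`, and `##ly`, the word `unhappily` comes out as exactly those three pieces. A word with no matching prefix falls back to `[UNK]`, which is the failure mode listed in the Cons column above.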

The deeper pattern is that each tokenizer encodes a different inductive bias. BPE says frequency is the best clue for reusable pieces. WordPiece says the best clue is usefulness under a model score. SentencePiece says segmentation itself should be optimized probabilistically from raw text. And custom token design says domain structure sometimes beats all three. There is no universally “best” tokenizer. There is only a tokenizer whose segmentation bias fits the model, the data, and the task better than the alternatives.
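
To make "optimized probabilistically" concrete, here is a small Viterbi sketch for unigram segmentation. It is a simplification built on assumptions: the `uni_entry` struct, the `uni_viterbi` name, and the log probabilities in the usage note are invented for illustration, the vocabulary is scanned linearly, and the `▁` word-start marker is left out. Among all ways to cover the input with vocabulary pieces, it returns the one with the highest summed log probability.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

#define UNI_MAX_LEN 128

typedef struct {
    const char *piece;
    double      logp;   /* log probability under the unigram model */
} uni_entry;

/* Fills `cuts` with the end offset of each piece on the best path.
 * Returns the number of pieces, or -1 if no segmentation exists. */
static int uni_viterbi(const char *text,
                       const uni_entry *vocab, size_t vocab_size,
                       size_t cuts[UNI_MAX_LEN], double *best_logp)
{
    size_t len = strlen(text);
    if (len == 0 || len >= UNI_MAX_LEN) return -1;

    double score[UNI_MAX_LEN + 1];  /* best log prob of text[0..i) */
    size_t back[UNI_MAX_LEN + 1];   /* start offset of the last piece */
    score[0] = 0.0;
    for (size_t i = 1; i <= len; i++) score[i] = -INFINITY;

    for (size_t end = 1; end <= len; end++) {
        for (size_t v = 0; v < vocab_size; v++) {
            size_t plen = strlen(vocab[v].piece);
            if (plen == 0 || plen > end) continue;
            size_t start = end - plen;
            if (score[start] == -INFINITY) continue;
            if (memcmp(text + start, vocab[v].piece, plen) != 0) continue;
            double s = score[start] + vocab[v].logp;
            if (s > score[end]) {
                score[end] = s;
                back[end] = start;
            }
        }
    }
    if (score[len] == -INFINITY) return -1;

    /* Walk the back pointers, then reverse into left-to-right order. */
    int n = 0;
    for (size_t pos = len; pos > 0; pos = back[pos]) cuts[n++] = pos;
    for (int i = 0, j = n - 1; i < j; i++, j--) {
        size_t t = cuts[i];
        cuts[i] = cuts[j];
        cuts[j] = t;
    }
    if (best_logp) *best_logp = score[len];
    return n;
}
```

With made-up scores of -2.0 for `un`, -3.0 for `happy`, and -6.0 for `unhappy`, the input `unhappy` is split into `un` + `happy` because -5.0 beats -6.0. Change the scores and the segmentation changes with them, which is the sense in which this method differs from replaying a fixed merge list.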

Figure: three rows showing the sentence “I’m playing unhappily” tokenized differently by BPE, WordPiece, and SentencePiece.

How to choose in practice

If you need a strong default for decoder LLMs, BPE is still the practical baseline. If you are matching BERT-style checkpoints, use WordPiece. If you need raw-text multilingual flexibility, SentencePiece is often the cleanest choice. If your domain has formal control symbols, design custom protected atoms first and let subword compression happen around them.
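
The last recommendation, protecting atoms first and compressing around them, can be sketched as a pre-pass that runs before any subword encoder. Everything here is an assumption for illustration: the `span` struct, the function names, and the example atoms are hypothetical and are not taken from the SVG training setup. The pass only marks which byte ranges are protected atoms and which are ordinary text to hand to BPE, WordPiece, or SentencePiece.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t start;
    size_t len;
    int    atom_index;  /* index into atoms[], or -1 for ordinary text */
} span;

/* Longest protected atom that starts at text[pos], or -1 if none does. */
static int match_atom(const char *text, size_t pos,
                      const char *const *atoms, size_t num_atoms)
{
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < num_atoms; i++) {
        size_t alen = strlen(atoms[i]);
        if (alen > best_len && strncmp(text + pos, atoms[i], alen) == 0) {
            best = (int)i;
            best_len = alen;
        }
    }
    return best;
}

/* Splits text into protected-atom spans and ordinary-text spans.
 * Returns the number of spans written; output is truncated at max_spans. */
static size_t split_protected(const char *text,
                              const char *const *atoms, size_t num_atoms,
                              span *out, size_t max_spans)
{
    size_t n = 0, pos = 0, text_start = 0, len = strlen(text);
    while (pos < len && n < max_spans) {
        int a = match_atom(text, pos, atoms, num_atoms);
        if (a < 0) {
            pos++;
            continue;
        }
        if (pos > text_start) {     /* flush pending ordinary text */
            out[n++] = (span){ text_start, pos - text_start, -1 };
            if (n == max_spans) return n;
        }
        out[n++] = (span){ pos, strlen(atoms[a]), a };
        pos += strlen(atoms[a]);
        text_start = pos;
    }
    if (text_start < len && n < max_spans)
        out[n++] = (span){ text_start, len - text_start, -1 };
    return n;
}
```

For the input `<svg>red dot</svg>` with `<svg>` and `</svg>` registered as atoms, the pass yields three spans: an atom, the ordinary text `red dot`, and another atom. Only the middle span would ever reach the subword encoder, so no merge rule can split a control symbol.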

Section 11: Summary & What’s Next

Tokenization is one of the most underappreciated design choices in the NLP stack. It decides what the model’s basic symbols are before the first layer ever runs. BPE, WordPiece, and SentencePiece all solve the same broad problem, but they do so with different assumptions about frequency, likelihood, raw text, and segmentation. Those assumptions ripple outward into vocabulary size, sequence length, compute cost, and learnability.

The systems lesson is just as important as the algorithmic one. The C-Kernel-Engine tokenizer work shows that lookup structures, merge-rule correctness, file formats, and SIMD-friendly implementation choices can produce dramatic real-world speedups. The BC Gov WordPiece heritage shows the same thing from another angle: even preprocessing deserves serious engineering when the workload is large. Tokenization is not only a linguistic choice; it is a runtime choice.

Interface before intelligence: a model can only learn through the symbols you expose. Choosing the right atoms is often the first act of model design.

The SVG training experiments add the final correction. For tiny domain-specific models, the best tokenizer may be the one that protects the right control language rather than the one that most efficiently compresses general text. If the symbolic interface is wrong, the model spends capacity repairing the interface instead of learning the task. That is why custom atoms belong in the tokenizer conversation, not outside it.

Key takeaways

BPE is the modern decoder workhorse: bottom-up merges, simple rules, strong practical efficiency.
WordPiece is BERT’s classic encoder tokenizer: greedy longest-match with `##` continuation tokens.
SentencePiece is the multilingual polyglot: raw-text training, probabilistic segmentation, and no required pre-tokenization.
Custom tokens become essential when your domain has formal control symbols that should remain atomic.
C-Kernel-Engine shows that tokenizer implementation details—tries, merge parity, memory mapping—can deliver major speed gains.

Further reading	Why it matters
Sennrich et al. 2016 on subword units	The NLP paper that made BPE-style subword segmentation mainstream.
Devlin et al. 2019 on BERT	Canonical modern WordPiece encoder context.
Kudo & Richardson 2018 on SentencePiece	Raw-text subword modeling and the unigram tokenizer story.
C-Kernel-Engine	Concrete tokenizer implementation and performance engineering in C.
SVG training research report	The best case study in this series for why token boundaries and protected control atoms matter.

Next in the series

Once tokens are embedded and positioned, the next question is how they interact. The next post moves into the full attention mechanism: queries, keys, values, and the score geometry that connects them.

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes

Read

Support

Subscribe

Subscribe to emails from Anthony
