My mental model of how transformers worked was once deceptively simple: universal pattern-matching engines, consuming vast datasets to predict the next token. A recent, intensive two-week deep dive into the burgeoning field of mechanistic interpretability didn't just update that model; it utterly shattered it.
I discovered that a transformer with fewer parameters than a high-quality JPEG can learn to add multi-digit numbers with near-perfect accuracy. The truly shocking part isn't that it can do it, but how. It's not memorizing an enormous addition table; it's reverse-engineering the exact, step-by-step, column-and-carry algorithm we all learned in elementary school. This isn't a mere party trick for AI enthusiasts. It's a foundational insight into how these models build computational circuits from the ground up, offering a compelling vision for specialized AI in 2026.

The 'Grokking' Phenomenon: From Rote Memorization to Algorithmic Mastery
Inspired by the groundbreaking work of researchers like Neel Nanda and the original OpenAI paper that first identified this phenomenon, I set out to replicate "grokking" myself. The core goal of mechanistic interpretability is to move beyond treating AI models as inscrutable black boxes and instead reverse-engineer the precise algorithms they learn, neuron by neuron, weight by weight.
Setting Up the Experiment: A Minimalist Transformer for Modular Arithmetic
You don't need a cluster of H100s for this kind of revelation. The entire experiment runs comfortably on a single Colab GPU, a testament to the efficiency of these specialized models. My task was modular arithmetic: a + b = c (mod P). I chose a prime P=97 to keep the number space manageable yet sufficiently complex.
- The Model: I built a tiny, single-layer transformer with just 4 attention heads. The embedding dimension (d_model) was set to 128, and the entire model comprised approximately 50,000 parameters. To put this in perspective, current frontier models like ChatGPT 5.3 often boast hundreds of billions or even trillions of parameters; my model is a rounding error by comparison.
- The Data: I generated pairs of numbers (a, b) and formatted them into sequences of tokens. For example, 12 + 81 = 93 would be tokenized as [12, 81, 97], where 97 served as the token for the equals sign. The model's objective was to predict the single token representing the correct answer, 93.
- The Split: Crucially, I trained the model on only 50% of all possible pairs within the modulo-97 system; the remaining 50% were held back as a validation set. This split was fundamental: to succeed on the validation set, the model absolutely had to generalize, because simple memorization would lead to failure.
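For concreteness, here is a minimal sketch of that dataset construction. The function name, seed, and exact split logic are my own choices for illustration, not taken from the original code:

```python
import random

P = 97  # the prime modulus used throughout the experiment

def make_dataset(p: int = P, train_frac: float = 0.5, seed: int = 0):
    """Enumerate every (a, b) pair mod p and split them 50/50.

    Each example is a token sequence [a, b, p] -- the token `p` doubles
    as the '=' symbol -- paired with the label (a + b) % p.
    """
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)

    def encode(a, b):
        return [a, b, p], (a + b) % p

    train = [encode(a, b) for a, b in pairs[:cut]]
    val = [encode(a, b) for a, b in pairs[cut:]]
    return train, val

train, val = make_dataset()
# 97 * 97 = 9,409 total pairs, split roughly in half
```

Because the pairs are enumerated exhaustively before shuffling, the two splits are guaranteed disjoint, which is what makes validation accuracy a true test of generalization.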
As training commenced, something truly remarkable unfolded – a three-act play that researchers have dubbed "grokking."
The Three Phases of Learning: Memorization, Plateau, and Sudden Generalization
Watching the training curves on my Weights & Biases dashboard was genuinely thrilling, akin to following a suspense novel.
- Phase 1: Rapid Memorization. In the initial dozens of epochs, the training loss plummeted dramatically. The model quickly learned the correct answer for every single example in the training set, achieving nearly 100% training accuracy. However, its performance on the validation set remained abysmal – no better than random guessing. It had effectively constructed a brittle lookup table for the data it had seen.
- Phase 2: The Extended Plateau. For hundreds, sometimes thousands, of epochs, absolutely nothing appeared to change. Both training and validation loss curves stayed stubbornly flat. This is the point where any sane engineer, operating under traditional assumptions, would kill the job, tweak the learning rate, and restart. It felt like a total failure; the model seemed utterly stuck.
- Phase 3: Sudden Grokking. Then, as if a switch had been flipped, the validation loss suddenly nosedived. In the span of just a few epochs, the model's accuracy on numbers it had never seen before shot up to nearly 100%. It had ceased memorizing and, instead, discovered the general, underlying algorithm for modular addition.
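Those three phases can be read straight off the logged accuracy curves: the plateau is simply the gap between when training accuracy saturates and when validation accuracy does. A small sketch with stylized numbers (invented for illustration, not from my actual run):

```python
def grokking_epochs(train_acc, val_acc, threshold=0.99):
    """Return the epochs at which train and val accuracy first cross
    the threshold. The gap between them is the plateau length -- the
    hallmark of grokking. Returns None for a curve that never crosses."""
    def first_cross(series):
        for epoch, acc in enumerate(series):
            if acc >= threshold:
                return epoch
        return None
    return first_cross(train_acc), first_cross(val_acc)

# Stylized curves: the model memorizes by epoch 2, then groks much later
train_acc = [0.30, 0.90, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
val_acc   = [0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.60, 1.00]
print(grokking_epochs(train_acc, val_acc))  # (2, 7)
```

In a real run the plateau spans hundreds or thousands of epochs rather than five, but the signature is identical: two threshold crossings separated by a long stretch of apparent stagnation.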
So, what exactly transpired internally during that long, quiet plateau?
Inside the Black Box: How the Transformer Learns Addition
This is where the real magic of mechanistic interpretability shines. The model wasn't idle during the plateau; it was silently reorganizing its internal weights, pruning inefficient memorization-based circuits, and slowly forming a more robust, generalizable algorithmic circuit.
Using powerful tools like Neel Nanda's TransformerLens library, researchers have been able to peer into these internals and found that the model spontaneously learns properties akin to Fourier analysis. Simplified, it discovers that:
- Numbers can be represented as rotations on a circle. The model learns to map each number x to a point on the unit circle, often using a sinusoidal embedding (e^(2πix/P)). This is an incredibly clever trick, because addition then simplifies to a composition of rotations.
- Attention heads perform the core calculation. During the grokking phase, specific attention heads emerge: one head might learn to attend predominantly to the a token, another to the b token. The model then learns, through the intricate interplay of its weights and activations, to combine these representations in a way that is mathematically equivalent to adding their angles on the circle.
- The final layer decodes the result. The unembedding layer then learns to map the final point on the circle back to the correct integer token, producing the sum.
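The rotation trick itself can be sanity-checked in a few lines of plain Python, independent of any neural network, assuming the 2πx/P angular mapping described above:

```python
import math

P = 97  # same prime modulus as the experiment

def to_angle(x: int) -> float:
    # Represent a residue x as a rotation: x -> 2*pi*x / P on the unit circle
    return 2 * math.pi * x / P

def decode(theta: float) -> int:
    # Map an angle back to the nearest integer residue mod P
    return round(theta * P / (2 * math.pi)) % P

def add_mod_p(a: int, b: int) -> int:
    # Adding residues mod P is composing rotations: the angles simply add,
    # and wrapping past 2*pi is exactly the modular wrap-around
    return decode(to_angle(a) + to_angle(b))

print(add_mod_p(12, 81))  # 93, matching 12 + 81 = 93 (mod 97)
```

This is, of course, the clean mathematical version of what the model encodes messily across its embeddings and attention heads; the point is that angle addition and modular addition are the same operation.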
In essence, this tiny transformer built a small, virtual trigonometric computer within its weights, meticulously optimized for one task. It discovered a famous mathematical property because that was the most efficient and generalizable way to solve the problem. This isn't merely advanced pattern matching; it's a profound display of algorithmic discovery.

Why This Matters in 2026: The Case for Small, Algorithmic AI
My first reaction to this phenomenon was pure academic fascination. My second was a powerful jolt of entrepreneurial and practical reality. The prevailing AI narrative, especially from large tech hubs, relentlessly pushes scale. The solution to every problem, it often seems, is a bigger model, more data, more GPUs, and higher API costs. For the vast majority of developers and bootstrapped startups, particularly here in India with its burgeoning tech ecosystem and focus on sovereign AI initiatives like Sarvam AI, that's a game we simply cannot win by out-spending giants like Google or OpenAI.
This research illuminates a radically different, and I believe, more sustainable path: competing on model intelligence and efficiency, not just raw model size.
Frontier Models vs. Expert Specialists
Instead of incurring a significant API tax to a single, monolithic frontier model like ChatGPT 5.3 or Claude Opus 4.6 — models designed to write poetry, analyze complex financial statements, and perform basic arithmetic — what if we could deploy fleets of tiny, incredibly cheap, and blazing-fast models, each exquisitely designed to do just one of those things perfectly?
Many critical business problems are surprisingly narrow and possess an underlying algorithmic structure:
- Logistics & Supply Chain: Finding the most efficient delivery route or optimizing warehouse layouts isn't a creative task; it's a sophisticated version of the Traveling Salesperson Problem or resource allocation.
- Fintech & KYC: Classifying UPI transaction descriptions (e.g., ZOMATO INTERNET P LTD GURGAON -> Food & Dining) or verifying specific document fields (like PAN numbers) against a template is a structured text classification or extraction task. India's evolving AI regulation framework and SEBI's expanding digital accountability mandates make robust, auditable, and locally deployable solutions highly attractive.
- E-commerce & Inventory Management: Predicting inventory demand from historical sales data is a time-series forecasting algorithm, not a creative writing prompt.
Using a multi-trillion parameter model for these tasks, even with the 90%+ token cost reductions we've seen since 2024 (making even Flash/Lite models incredibly accessible), is still akin to using a sledgehammer to crack a nut. A 100,000-parameter model that has "grokked" the specific algorithm for its task would be orders of magnitude cheaper to train, run, and maintain. We could own it, deploy it on-premise or at the edge, and never worry about API deprecations, surprise price hikes, or data residency issues. This approach fosters true technological self-reliance, aligning perfectly with India's deeptech manufacturing and sovereign AI ambitions. It also positions these specialized models as ideal, production-grade components for multi-step AI agent workflows.
Putting Theory into Practice: My Grokking Replication Project
This wasn't just a theoretical exercise for me. I had to build it to truly believe in its implications. This phenomenon has fundamentally reshaped how I approach building AI products and considering their deployment in 2026.
Step 1: Building the Addition Transformer with PyTorch
I built the modular addition model in PyTorch. The core architecture, as shown below, is surprisingly simple and highlights how minimal components can achieve profound understanding.
```python
import torch
import torch.nn as nn

# A simplified model for an algorithmic task, illustrating the core components
class GrokkingTransformer(nn.Module):
    def __init__(self, num_tokens: int, d_model: int = 128, n_head: int = 4, num_layers: int = 1):
        super().__init__()
        # Embedding layer to convert input tokens (numbers) into dense vectors
        self.embedding = nn.Embedding(num_tokens, d_model)
        # A single Transformer encoder layer, containing self-attention and a feed-forward network;
        # batch_first=True makes handling batches more intuitive
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_head,
            dim_feedforward=d_model * 4,  # typically 4x d_model for the feed-forward hidden dimension
            batch_first=True,
        )
        # The TransformerEncoder itself, composed of one or more encoder layers
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,  # num_layers=1 for this minimal example
        )
        # The final output head, mapping the transformer's output to the vocabulary size (num_tokens)
        self.output_head = nn.Linear(d_model, num_tokens)

    def forward(self, src: torch.Tensor):
        # src shape: [batch_size, sequence_length], e.g. [[12, 81, 97]]
        x = self.embedding(src)          # embed the input tokens
        x = self.transformer_encoder(x)  # pass through the transformer encoder
        # We only care about the output at the '=' token's position (the last token in our sequence)
        return self.output_head(x[:, -1])  # project the final token's representation to output logits
```

I trained this using an AdamW optimizer, applying significant weight decay. This technique, highlighted in the original grokking paper, actively encourages the model to find simpler, more generalizable solutions by penalizing overly complex weight configurations.
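A sketch of that training setup is below. To keep the snippet self-contained, a trivial embedding-plus-linear stand-in replaces the transformer, and the weight_decay value is illustrative of "significant" decay rather than my exact hyperparameter:

```python
import torch
import torch.nn as nn

P = 97  # prime modulus; vocabulary is the residues 0..96 plus the '=' token

# Stand-in model so this snippet runs on its own; in the real experiment
# this would be the single-layer transformer described in the post
model = nn.Sequential(
    nn.Embedding(P + 1, 32),   # embed each of the 3 input tokens
    nn.Flatten(),              # [batch, 3, 32] -> [batch, 96]
    nn.Linear(3 * 32, P + 1),  # logits over the vocabulary
)

# Heavy weight decay is the key ingredient: it penalizes the sprawling
# weights of a memorization solution, nudging the model toward the
# compact algorithmic circuit during the long plateau
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

batch = torch.tensor([[12, 81, 97]])    # "12 + 81 = ?"
target = torch.tensor([(12 + 81) % P])  # correct answer token: 93

# One optimization step
optimizer.zero_grad()
logits = model(batch)
loss = loss_fn(logits, target)
loss.backward()
optimizer.step()
```

AdamW's decoupled weight decay (as opposed to plain Adam's L2 term folded into the gradient) is what the grokking literature leans on, which is why the optimizer choice matters here and not just the decay coefficient.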

From Toy Problem to Real-World Tool
This experiment immediately changed my approach to a side project: a personal expense tracker. My V1 solution, deployed in early 2025, used Anthropic's Claude API (Claude 3 Sonnet at the time) to classify transaction descriptions. While it worked, it was slow (often ~2 seconds per transaction due to API latency) and, despite recent token cost drops, still incurred an ongoing cost for every single inference.
My new plan, directly inspired by this research and the 2026 AI landscape:
- Frame it as an algorithmic task. The "algorithm" is mapping messy merchant strings to a clean, predefined category (e.g., ZOMATO -> Food, AMZN -> Shopping). This is a structured classification problem.
- Build a specialized model. I'll fine-tune a tiny, 5-million parameter Sentence Transformer model (or even a Llama 4-based micro-model optimized for text classification via Ollama) on just a few thousand of my own labeled transactions. This capitalizes on how far open-source models have closed the gap with proprietary ones for many tasks.
- Deploy it locally. The goal is a model that runs instantly (sub-10ms inference) on my laptop or a low-power edge device for free, achieving >95% of the accuracy of the giant Claude Opus 4.6, but with vastly superior latency, cost, and privacy characteristics.
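Even before fine-tuning anything, a zero-dependency baseline captures the shape of the task. The sketch below uses character-trigram similarity with a handful of made-up labeled examples (the merchant strings and categories are illustrative, not my real data):

```python
def trigrams(s: str) -> set:
    # Character trigrams of the upper-cased string, e.g. "ZOM", "OMA", ...
    s = s.upper()
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Tiny illustrative "training set" of labeled transaction descriptions
LABELED = {
    "ZOMATO INTERNET P LTD GURGAON": "Food",
    "SWIGGY BANGALORE": "Food",
    "AMZN MKTP IN": "Shopping",
    "FLIPKART PAYMENTS": "Shopping",
    "IRCTC RAIL TICKET": "Travel",
}

def classify(description: str) -> str:
    """Nearest-neighbour over trigram Jaccard similarity: a crude but
    instructive baseline for mapping messy merchant strings to categories."""
    query = trigrams(description)

    def score(known: str) -> float:
        ref = trigrams(known)
        return len(query & ref) / (len(query | ref) or 1)

    return LABELED[max(LABELED, key=score)]

print(classify("ZOMATO ONLINE ORDER"))  # Food
```

A fine-tuned embedding model replaces the trigram sets with learned vectors and the Jaccard score with cosine similarity, but the overall structure (embed, compare, pick the nearest label) stays exactly the same.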
When I evaluate a new AI feature or product idea now, my primary question isn't "How big is the model?" or "Which frontier LLM is it using?" It's "What is the underlying algorithm this model has likely learned, or can learn?" Is it a robust, generalizable procedure, or a brittle collection of memorized heuristics? This lens cuts through much of the marketing hype surrounding ever-larger models and gets to the core of a technology's true, sustainable value.
The real alpha in the AI space for 2026 isn't just in leveraging the biggest models. It's in understanding how to empower the smallest, most specialized models to do brilliant, algorithmic work, especially when integrated into sophisticated AI agent architectures.
Frequently Asked Questions (FAQ)
What is grokking in AI?
Grokking is a phenomenon observed during neural network training where a model initially memorizes the training data perfectly. After an extended period of seemingly stagnant performance, it suddenly and rapidly learns to generalize to unseen data, indicating it has discovered a true underlying rule or algorithm rather than merely relying on rote memorization.
What is mechanistic interpretability?
Mechanistic interpretability is a cutting-edge field of AI safety and alignment research. Its goal is to reverse-engineer neural networks, moving beyond treating them as black boxes. Researchers analyze specific weights, activations, and internal computations to understand the exact algorithms and "circuits" the model has learned to perform a given task.
Can these small, grokked models replace large language models (LLMs) like ChatGPT 5.3?
No, not for general-purpose tasks. A tiny model that has grokked addition cannot write a blog post, generate creative content, or perform complex multi-turn reasoning. The power of this approach lies in replacing specific slices of an LLM's capability needed for narrow, algorithmic tasks. The future is almost certainly a hybrid one, where we utilize powerful frontier LLMs like ChatGPT 5.3 or Claude Opus 4.6 for creative, broad reasoning, and multi-modal tasks, while deploying fleets of small, hyper-efficient specialist models (potentially even running locally via Ollama with Llama 4) for everything else that benefits from algorithmic precision, low latency, and cost-effectiveness.
References
- A Mechanistic Interpretability Explainer & Walkthrough - Neel Nanda's excellent and accessible starting point for understanding the field.
- Grokking: Generalization Beyond Overfitting in Small Algorithmic Datasets - The original paper from OpenAI that first identified and described the grokking phenomenon.
- Progress in Mechanistic Interpretability - A deeper dive into how models form computational circuits during training, offering visual insights.