The Foundation of All Neural Networks
Andrej Karpathy's micrograd tutorial is the perfect entry point to understanding neural networks. Instead of using massive tensor libraries, micrograd implements automatic differentiation using only scalar values - individual numbers like 1.0, -2.5, etc. This constraint forces you to understand what's actually happening in backpropagation.
The Core Insight: Everything is Just Addition and Multiplication
Neural networks seem complex, but they're just chains of basic operations. A neuron performing w*x + b is literally just:
Input: x = 2.0
Weight: w = -3.0
Bias: b = 1.0
Step 1: multiply → w*x = -6.0
Step 2: add → -6.0 + 1.0 = -5.0
Step 3: activation → tanh(-5.0) ≈ -0.9999
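In micrograd terms, those three steps are just a few Value operations. A minimal sketch, assuming the Value class introduced below (with the tanh method shown later in this post):

x = Value(2.0)
w = Value(-3.0)
b = Value(1.0)

wx = w * x          # Step 1: multiply → -6.0
act = wx + b        # Step 2: add → -5.0
out = act.tanh()    # Step 3: activation → roughly -0.9999
print(out.data)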
The Value Class: Building Blocks of Computation
The entire micrograd engine centers on a Value class that wraps scalars and tracks their computational history:
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
Each Value remembers:
- Its actual numeric value (data)
- Its gradient (grad)
- What operation created it (_op)
- What inputs created it (_prev)
Forward Pass: Building the Computation Graph
When you write c = a + b, micrograd creates a new Value that remembers it came from adding a and b:
a = Value(-4.0)
b = Value(2.0)
c = a + b # c.data = -2.0, c._prev = {a, b}, c._op = '+'
Complex expressions build directed acyclic graphs (DAGs):
Expression: f = (a * b + b**3).relu()
Graph Structure:
a(-4.0) ──┐
          ├─→ mult ──┐
b(2.0) ───┤          ├─→ add ──→ relu ──→ f
          └─→ **3 ───┘
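You can build that exact DAG by composing Value objects; each intermediate node is created automatically. A quick sketch (Value supports ** with plain-number exponents and relu(), as in micrograd):

a = Value(-4.0)
b = Value(2.0)
c = a * b            # mult node: -8.0
d = c + b**3         # pow and add nodes: -8.0 + 8.0 = 0.0
f = d.relu()         # relu node
print(f.data)        # 0.0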
The Backward Pass: Chain Rule in Action
The magic happens in backpropagation. Starting from the output, gradients flow backward following the chain rule:
def backward(self):
    # Topological sort to get correct order
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # Backward pass
    self.grad = 1
    for v in reversed(topo):
        v._backward()
Each operation implements its own backward function:
# Addition: derivative is 1 for both inputs, so the upstream gradient passes through
def _backward():
    self.grad += out.grad
    other.grad += out.grad

# Multiplication: each input's gradient is the other input times the upstream gradient
def _backward():
    self.grad += other.data * out.grad
    other.grad += self.data * out.grad
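For context, here is roughly how micrograd wires those closures up inside the operator methods; out is the freshly created Value, and its _backward knows how to push gradients back to its two parents:

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        # addition routes the upstream gradient to both parents unchanged
        self.grad += out.grad
        other.grad += out.grad
    out._backward = _backward
    return out

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        # d(a*b)/da = b and d(a*b)/db = a, each scaled by the upstream gradient
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward
    return out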
Visualizing the Computation Graph
One of micrograd's coolest features is graph visualization. Here's what a simple neuron looks like:
Input Layer:
x₁ = 2.0 ──→ [w₁ = -3.0] ──→ mult ──┐
x₂ = 0.0 ──→ [w₂ = 1.0] ──→ mult ───┤
                                    ├──→ sum ──→ tanh ──→ output
bias = 6.88 ────────────────────────┘
Each node shows both the forward value and the gradient:
Node Format: [forward_value | gradient]
Example: [2.0 | -3.0] means value=2.0, gradient=-3.0
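The picture itself can be generated with graphviz. Here is a rough sketch of a helper in the spirit of the tutorial notebook's draw_dot (it assumes the graphviz Python package is installed and uses the Value attributes from above):

from graphviz import Digraph

def trace(root):
    # walk backward from the output, collecting every node and edge in the graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    nodes, edges = trace(root)
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})  # left-to-right layout
    for n in nodes:
        uid = str(id(n))
        # each data node shows [forward_value | gradient]
        dot.node(name=uid, label="{ %.4f | %.4f }" % (n.data, n.grad), shape='record')
        if n._op:
            # add a small op node feeding into the data node it produced
            dot.node(name=uid + n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot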
Building Neural Networks
With automatic differentiation working, building neural networks becomes straightforward:
import random

class Neuron:
    def __init__(self, nin, nonlin=True):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(0)
        self.nonlin = nonlin

    def __call__(self, x):
        act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
        return act.tanh() if self.nonlin else act
class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
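MLP relies on a Layer class not shown above. Roughly as in micrograd, a Layer is just a list of Neurons that all see the same input:

class Layer:
    def __init__(self, nin, nout, **kwargs):
        self.neurons = [Neuron(nin, **kwargs) for _ in range(nout)]

    def __call__(self, x):
        # every neuron sees the full input; a single-neuron layer returns a scalar Value
        out = [n(x) for n in self.neurons]
        return out[0] if len(out) == 1 else out

Calling the MLP then just threads x through self.layers one layer at a time, and a parameters() helper collects every weight and bias so the training loop below can update them.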
Training Loop: Gradient Descent
Training follows the standard pattern:
# Forward pass
ypred = [n(x) for x in xs]
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
# Zero gradients
for p in parameters():
    p.grad = 0
# Backward pass
loss.backward()
# Update parameters
for p in parameters():
    p.data += -0.01 * p.grad
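Put together, a minimal end-to-end run looks something like this. It's a sketch: the dataset is the tiny four-example set from the tutorial video, the learning rate and step count are arbitrary, and it assumes the MLP exposes __call__ and parameters() as the full micrograd library does:

n = MLP(3, [4, 4, 1])                      # 3 inputs, two hidden layers of 4, one output
xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5], [0.5, 1.0, 1.0], [1.0, 1.0, -1.0]]
ys = [1.0, -1.0, -1.0, 1.0]                # desired targets

for step in range(20):
    # forward pass and loss
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # zero gradients, backpropagate, nudge parameters against the gradient
    for p in n.parameters():
        p.grad = 0
    loss.backward()
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(step, loss.data)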
It's crazy to think that you can explain modern machine learning concepts with high school calculus and some mental gymnastics.
Comparison with PyTorch
The beauty of micrograd is its simplicity compared to PyTorch:
# Micrograd (explicit)
a = Value(2.0)
b = Value(-3.0)
c = a * b
c.backward()
print(a.grad) # -3.0
# PyTorch (tensor-based)
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-3.0, requires_grad=True)
c = a * b
c.backward()
print(a.grad) # tensor(-3.0)
Same mathematical principles, different abstractions.
The Gradient Flow Visualization
Understanding how gradients flow is crucial. Consider this expression:
f = (x * y) + sin(x)
Gradient Flow:
df/df = 1.0                      (starting point)
df/d(x*y) = 1.0                  (addition passes the gradient through unchanged)
df/dx = y * 1.0 + cos(x) * 1.0   (chain rule sums both paths)
df/dy = x * 1.0                  (multiplication rule)
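micrograd doesn't ship a sin operation, but adding one is a nice exercise in exactly this kind of gradient flow. A sketch of how you might bolt it onto Value (the helper name and the monkey-patching are my own choices for illustration):

import math

def value_sin(self):
    out = Value(math.sin(self.data), (self,), 'sin')

    def _backward():
        # d(sin x)/dx = cos x, scaled by the gradient flowing in from above
        self.grad += math.cos(self.data) * out.grad
    out._backward = _backward
    return out

Value.sin = value_sin  # attach as a method for this sketch

x, y = Value(0.5), Value(2.0)
f = (x * y) + x.sin()
f.backward()
print(x.grad)  # y + cos(x) = 2.0 + cos(0.5), both paths summed
print(y.grad)  # x = 0.5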
Key Implementation Details
Handling Repeated Variables
When a variable appears multiple times, gradients accumulate:
a = Value(3.0)
b = a + a  # a appears twice
# Backward pass must accumulate: each use of a contributes 1.0, so a.grad ends up as 2.0
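You can verify the accumulation directly, which is why every _backward uses += rather than = (a tiny sanity check, assuming the Value class from above):

a = Value(3.0)
b = a + a
b.backward()
print(a.grad)  # 2.0, one unit of gradient for each use of a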
Topological Sorting
Critical for correct gradient flow:
# Incorrect: a node's _backward might run before all gradients flowing into it have accumulated
for node in nodes:
    node._backward()

# Correct: walking the topological order in reverse (output first) guarantees each
# node's gradient is complete before it is propagated to its inputs
for node in reversed(topologically_sorted(nodes)):
    node._backward()
Activation Functions
Each activation needs its derivative:
def tanh(self):
    t = (math.exp(2*self.data) - 1) / (math.exp(2*self.data) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        self.grad += (1 - t**2) * out.grad  # derivative of tanh
    out._backward = _backward
    return out
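A quick numerical check of that derivative (values approximate):

x = Value(0.5)
y = x.tanh()
y.backward()
print(y.data)  # about 0.4621
print(x.grad)  # 1 - tanh(0.5)**2, about 0.7864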
The Learning Journey
Following this tutorial transformed my understanding:
- Neural networks aren't magic - they're just computational graphs
- I felt the gradients flowing through the network
- I can implement custom operations confidently
Modern Connections
While micrograd uses scalars, modern frameworks use tensors for efficiency:
# Micrograd: each number is a separate Value, accumulated one scalar op at a time
result = Value(0.0)
for i in range(1000):
    result += values[i] * weights[i]
# PyTorch: Vectorized operations
result = torch.dot(values, weights) # All at once
Same math, different scale.
The 100-Line Revolution
The entire autograd engine is about 100 lines. This proves that the core ideas behind billion-parameter models are fundamentally simple. The complexity comes from scale, not from the underlying mathematics.
The complete micrograd implementation is available on GitHub. The tutorial video remains one of the clearest explanations of backpropagation ever created.