From RNNs to Transformers — complete implementations you can run, learn from, and build upon. Every paper, every line of code, explained.
If you really learn all of these, you'll know 90% of what matters today.
— Ilya Sutskever
Each paper comes with deep explanations, clean code, visualizations, and exercises. Click any card to explore.
Character-level language models that generate Shakespeare, code, and music
Gates, memory cells, and learning long-term dependencies
Dropout, layer norm, and preventing overfitting in sequence models
Compression as the key to intelligence and model selection
Two-part codes, prequential MDL, and normalized maximum likelihood
Information equilibration and evolutionary dynamics in complex systems
Cellular automata, chaos theory, and emergent behavior
AlexNet — the paper that sparked the deep learning revolution
Skip connections enabling 1000+ layer networks
Pre-activation and improved gradient flow
Exponentially expanding receptive fields without resolution loss
A simple way to prevent neural networks from overfitting
The Transformer architecture that revolutionized AI
Line-by-line PyTorch implementation with explanations
The original attention mechanism before Transformers
Teaching networks to handle inputs where order doesn't matter
Differentiable external memory with content-based addressing
Attention as output — pointing at input elements for variable-size combinatorial problems
Learning relationships between objects (Sort-of-CLEVR, Relation Networks)
Memory as a set of interacting slots — solving problems LSTMs can't
Unifying graph neural networks — messages, updates, and readouts for molecular prediction
End-to-end speech recognition — replacing the entire ASR pipeline with a single neural network
Solving posterior collapse by limiting the decoder's receptive field
Scaling models beyond memory limits with pipeline parallelism and micro-batching
From sequence models to modern architectures — a complete curriculum.
Days 1-7 · RNNs, LSTMs, regularization, compression, and complexity theory
Days 8-12 · CNNs, residual learning, and the vision revolution
Days 13-16 · Attention mechanisms and sequence-to-sequence learning
Days 17-22 · Memory networks, graphs, reasoning, and speech
Days 23-28 · GANs, VAEs, diffusion, and scaling laws
Days 29-30 · RLHF, alignment, and the path to ChatGPT
Every paper comes with complete, runnable implementations. No "left as an exercise" — we build everything from scratch so you truly understand how these systems work.
Clean, documented, runs everywhere
With complete solutions
Run and experiment live
Theory meets practice
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Multi-head attention from scratch."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension per head
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        # Project inputs and split into heads: (B, n_heads, T, d_k)
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, -1)  # merge heads
        return self.W_o(out)
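As a quick sanity check, here is a minimal sketch of how the module above could be exercised on a toy batch (the shapes and values are illustrative assumptions, not taken from the course materials):
x = torch.randn(2, 10, 64)   # (batch, seq_len, d_model); random toy input
mha = MultiHeadAttention(d_model=64, n_heads=8)
out = mha(x)
print(out.shape)             # torch.Size([2, 10, 64]): attention preserves the input shape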
Join the journey. Learn AI the right way — by building from scratch.