Python Deep Learning and Neural Networks Interview Questions
A neural network is a parameterised function composed of stacked layers. Each layer applies a linear transformation followed by a non-linear activation: h = σ(Wx + b), where W is a weight matrix, b is a bias vector, and σ is an activation function. Stacking L such layers gives a universal function approximator capable of learning arbitrarily complex input–output mappings, provided the network is wide or deep enough.
Forward propagation simply evaluates this composed function left to right: the input x passes through layer 1, the output becomes the input to layer 2, and so on until the final layer produces a prediction. The entire computation is a directed acyclic graph (DAG) of tensor operations — exactly the structure PyTorch's autograd engine records to enable automatic differentiation.
import torch
import torch.nn as nn
class TwoLayerNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # W1, b1
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # W2, b2

    def forward(self, x):
        h = self.relu(self.fc1(x))  # h = ReLU(W1 x + b1)
        return self.fc2(h)          # y = W2 h + b2

model = TwoLayerNet(in_dim=10, hidden_dim=64, out_dim=1)
x = torch.randn(32, 10) # batch of 32 inputs
y_hat = model(x) # forward pass — calls model.forward(x)
print(y_hat.shape) # torch.Size([32, 1])
Why depth matters: a network with one wide hidden layer can theoretically approximate any function (universal approximation theorem), but deeper networks can represent certain functions exponentially more efficiently. A function that needs an exponentially wide shallow network may be captured by a compact deep one, because each layer can reuse and compose features built by earlier layers.
Backpropagation is the algorithm for computing the gradient of a scalar loss L with respect to every parameter in the network. It exploits the chain rule of calculus: if the loss depends on parameter W through intermediate quantities h₁, h₂, ..., hₙ, then ∂L/∂W = (∂L/∂hₙ)(∂hₙ/∂hₙ₋₁)···(∂h₁/∂W). Backprop applies the chain rule systematically starting from the loss and working backwards through each layer, accumulating local gradients.
At each layer, two quantities are needed: the local gradient (how does the layer's output change with its input/weights?) and the upstream gradient (how does the loss change with this layer's output?). Multiplying them gives the gradient flowing to the layer's parameters and to its input, which becomes the upstream gradient for the preceding layer.
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(10, 64), nn.ReLU(),
nn.Linear(64, 1)
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(32, 10)
y = torch.randn(32, 1)
# --- Standard training step ---
optimizer.zero_grad() # 1. Clear old gradients (they accumulate!)
y_hat = model(x) # 2. Forward pass — build computation graph
loss = loss_fn(y_hat, y)# 3. Compute scalar loss
loss.backward() # 4. Backprop — traverse graph in reverse
# populates .grad for every parameter
optimizer.step() # 5. Update parameters: W -= lr * W.grad
# Inspect gradients of first layer
print(model[0].weight.grad.shape) # torch.Size([64, 10])
# Manual chain rule for a single neuron:
# loss = (y_hat - y)^2, y_hat = w*x + b
# dL/dw = 2*(y_hat - y) * x <- upstream * local
w = torch.tensor([2.0], requires_grad=True)
x_s = torch.tensor([3.0])
y_s = torch.tensor([1.0])
loss_s = (w * x_s - y_s) ** 2
loss_s.backward()
print(w.grad) # tensor([30.]) == 2*(2*3-1)*3
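To make the upstream-times-local bookkeeping concrete, the following is a minimal sketch in plain Python for a tiny scalar network, y_hat = w2 · relu(w1 · x), with a squared-error loss. The network, the numbers, and the helper names `forward` and `backward` are illustrative assumptions, not part of any library. The finite-difference comparison at the end is one common way to sanity-check hand-written gradients.

```python
def forward(w1, w2, x, y):
    """Forward pass for y_hat = w2 * relu(w1 * x) with squared-error loss."""
    z = w1 * x              # pre-activation
    h = max(0.0, z)         # ReLU
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    return z, h, y_hat, loss


def backward(w1, w2, x, y):
    """Backward pass: at each step, gradient = upstream gradient * local gradient."""
    z, h, y_hat, _ = forward(w1, w2, x, y)
    d_yhat = 2.0 * (y_hat - y)             # dL/dy_hat
    d_w2 = d_yhat * h                      # local gradient of y_hat w.r.t. w2 is h
    d_h = d_yhat * w2                      # becomes the upstream gradient for the hidden unit
    d_z = d_h * (1.0 if z > 0 else 0.0)    # ReLU passes gradient only where z > 0
    d_w1 = d_z * x                         # local gradient of z w.r.t. w1 is x
    return d_w1, d_w2


w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, y)

# Central finite differences as an independent check of the analytic gradients
eps = 1e-6
num_w1 = (forward(w1 + eps, w2, x, y)[3] - forward(w1 - eps, w2, x, y)[3]) / (2 * eps)
num_w2 = (forward(w1, w2 + eps, x, y)[3] - forward(w1, w2 - eps, x, y)[3]) / (2 * eps)
print(grad_w1, num_w1)
print(grad_w2, num_w2)
```

Autograd performs the same multiplication of upstream and local gradients, only over the recorded graph of tensor operations instead of hand-written steps.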
Activation functions introduce non-linearity — without them, stacking linear layers would collapse into a single linear transformation. Several families exist, each with different mathematical properties that affect training dynamics.
| Function | Formula | Range | Key property |
|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Saturates for |x|>>0 — causes vanishing gradient |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Zero-centred; still saturates |
| ReLU | max(0, x) | [0, ∞) | Non-saturating for x>0; sparse; fast |
| Leaky ReLU | max(αx, x) α≈0.01 | (-∞,∞) | Fixes ReLU's dying neuron problem |
| GELU | x·Φ(x) | (-∞,∞) | Used in BERT/GPT; smooth probabilistic gate |
| Softmax | eˣⁱ/Σeˣʲ | (0,1) sums to 1 | Multi-class output — probability distribution |
import torch
import torch.nn.functional as F
x = torch.linspace(-3, 3, 7)
print(F.relu(x)) # [0, 0, 0, 0, 1, 2, 3] (zeroes negatives)
print(torch.sigmoid(x)) # (0,1): saturates near 0 and 1 at extremes
print(torch.tanh(x)) # (-1,1): saturates near ±1
print(F.leaky_relu(x, negative_slope=0.01)) # small slope for x<0
print(F.gelu(x)) # smooth variant used in transformers
# Softmax: multi-class final layer
logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)
print(probs) # [0.659, 0.242, 0.099] — sums to 1.0
# In a model: prefer nn.ReLU() (in-place optional with inplace=True)
import torch.nn as nn
act = nn.ReLU() # stateless, so it can be shared across layers
Why ReLU replaced sigmoid: for large networks the vanishing gradient problem made sigmoid/tanh networks nearly untrainable. For a neuron deep in the network, the gradient arriving from backprop has already been multiplied by many sigmoid derivatives (each at most 0.25), so the gradient shrinks exponentially with depth. ReLU's derivative is exactly 1 for positive inputs (no shrinkage in that direction), allowing gradients to flow through deep networks without exponential decay. The trade-off is the 'dying ReLU' problem, where neurons receiving strongly negative inputs get stuck outputting zero permanently; Leaky ReLU and ELU variants address it.
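The exponential shrinkage is easy to quantify. The sketch below multiplies the activation derivative across 20 layers, using the best case for sigmoid (pre-activation 0, where the derivative peaks at 0.25) and an active ReLU unit. It deliberately ignores the weight matrices, which also scale the gradient, so it illustrates the activation term only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # maximum value 0.25, reached at x == 0


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


depth = 20
sigmoid_factor = sigmoid_grad(0.0) ** depth   # best case for sigmoid
relu_factor = relu_grad(1.0) ** depth         # active ReLU unit at every layer

print(f"sigmoid, {depth} layers: {sigmoid_factor:.3e}")
print(f"ReLU,    {depth} layers: {relu_factor:.3e}")
```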
Vanishing gradients occur when gradients shrink exponentially as they are backpropagated through many layers — the product of many small numbers (e.g. sigmoid derivatives ≤ 0.25) approaches zero, making early layer weights unable to update meaningfully. Exploding gradients are the opposite: the product of many large numbers causes gradients to grow exponentially, destabilising training with numerically infinite or NaN updates.
Both problems worsen with depth. The root mathematical cause is that repeated matrix multiplication of the weight matrices during backprop concentrates the gradient spectrum: if weight matrices have singular values consistently less than 1, gradients vanish; if greater than 1, they explode. Several techniques address this:
| Technique | Addresses | How it helps |
|---|---|---|
| ReLU / Leaky ReLU | Vanishing | Gradient = 1 for positive inputs — no shrinkage |
| Batch Normalisation | Both | Normalises layer inputs; stabilises gradient magnitude |
| Residual connections (ResNet) | Vanishing | Gradient highway: for y = x + F(x), ∂L/∂x = ∂L/∂y · (I + ∂F/∂x), so part of the gradient passes through the identity path unchanged |
| Gradient clipping | Exploding | Caps gradient norm before the update step |
| Careful weight init (Xavier/He) | Both | Ensures variance stable across layers at init |
| LSTM/GRU gates | Vanishing (RNNs) | Gating controls gradient flow through time |
import torch
import torch.nn as nn
# Gradient clipping — applied AFTER backward(), BEFORE optimizer.step()
model = nn.LSTM(input_size=10, hidden_size=128, num_layers=3, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 20, 10) # (batch, seq_len, input_size)
output, _ = model(x)
loss = output.sum()
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # clip!
optimizer.step()
# Residual connection in code:
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim)
        )

    def forward(self, x):
        return x + self.net(x) # gradient flows through x directly
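For the gradient clipping row in the table, the rescaling itself is only a few lines. The sketch below is written in plain Python over nested lists and follows the same idea as `clip_grad_norm_`: compute one global L2 norm over all gradients and, if it exceeds `max_norm`, scale every gradient by the same factor so the direction is preserved. The function name `clip_by_global_norm` and the small `1e-6` stabiliser are assumptions of this sketch.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one shared factor so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in group] for group in grads]
    return clipped, total_norm


# Two "parameter tensors" with gradients; global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for group in clipped for g in group))

print(norm_before, norm_after)
print(clipped)
```

Because every component is scaled by the same factor, the update direction is unchanged; only its length is capped.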
If weights are initialized too small, activations and gradients shrink layer by layer — a form of vanishing gradient from the start. If too large, they explode. The goal of principled initialisation is to keep the variance of activations and gradients roughly constant across all layers at the start of training.
Xavier (Glorot) initialisation draws weights from a distribution with variance 2/(fan_in + fan_out). It was derived assuming linear activations (or tanh in the original paper) by requiring that the variance of the layer's output equals the variance of its input. He (Kaiming) initialisation uses variance 2/fan_in, derived for ReLU activations specifically — since ReLU zeroes out half the input on average, the variance of the output is halved, so doubling the initial weight variance compensates for this. Using Xavier with ReLU causes variance to shrink by roughly half per layer, eventually vanishing.
import torch
import torch.nn as nn
# Default nn.Linear init is kaiming_uniform_ with a=sqrt(5), i.e. U(-1/sqrt(fan_in), 1/sqrt(fan_in))
layer = nn.Linear(256, 128)
print(layer.weight.std()) # approximately 1/sqrt(3*256) ≈ 0.036 (He init for ReLU would give sqrt(2/256) ≈ 0.088)
# Explicit initialisation
def init_weights(m):
    if isinstance(m, nn.Linear):
        # He/Kaiming: good for ReLU activations (used here because the model uses ReLU)
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        # Xavier: good for sigmoid/tanh activations
        # nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10)
)
model.apply(init_weights) # apply init_weights to every sub-module
# Verifying activation scale stays stable across layers:
x = torch.randn(100, 784)
for layer in model:
    x = layer(x)
    print(f'{layer.__class__.__name__}: std={x.std():.3f}')
# With He init + ReLU: std stays of order 1 (roughly 1.4 after each Linear,
# roughly 0.8 after each ReLU) instead of shrinking layer by layer
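The factor-of-two argument for He initialisation can be checked without any framework. The sketch below pushes one random input through a stack of square ReLU layers twice, using the same random draws scaled to Var(w) = 2/fan_in (He) and to Var(w) = 1/fan_in (what Xavier gives when fan_in equals fan_out). The width, depth, and seed are arbitrary choices for illustration. Because ReLU is positively homogeneous, the two runs differ by exactly a factor of 2 in mean-square activation per layer.

```python
import math
import random

random.seed(0)
width, depth = 32, 8

# Unit-variance Gaussian draws, rescaled below to compare two initialisation scales
base_weights = [[[random.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]
                for _ in range(depth)]
x0 = [random.gauss(0.0, 1.0) for _ in range(width)]


def mean_square_after(weight_std):
    """Push x0 through `depth` ReLU layers whose weights have the given standard deviation."""
    h = x0
    for W in base_weights:
        z = [sum(weight_std * w * v for w, v in zip(row, h)) for row in W]
        h = [max(0.0, v) for v in z]
    return sum(v * v for v in h) / len(h)


he = mean_square_after(math.sqrt(2.0 / width))       # He: Var(w) = 2 / fan_in
xavier = mean_square_after(math.sqrt(1.0 / width))   # Xavier with fan_in == fan_out

print(f"He mean square:     {he:.4f}")
print(f"Xavier mean square: {xavier:.6f}")
print(f"ratio: {he / xavier:.1f} (2**depth = {2 ** depth})")
```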
Batch Normalisation (BN) normalises the pre-activation values within a mini-batch to have zero mean and unit variance, then rescales them with learnable parameters γ (scale) and β (shift): BN(x) = γ · (x - μ_B) / √(σ²_B + ε) + β, where μ_B and σ²_B are the batch mean and variance, and ε is a small constant for numerical stability.
BN was originally motivated by internal covariate shift: the distribution of each layer's inputs changes during training as the preceding layers' weights update, forcing each layer to continuously adapt to a moving target. Renormalising inputs at each layer stabilises this distribution, although later analyses attribute much of BN's benefit to a smoother optimisation landscape rather than to reduced covariate shift. In practice, BN also provides a mild regularisation effect (the mini-batch statistics act as a source of noise), reduces sensitivity to learning rate, and substantially reduces the need for dropout in many architectures.
import torch
import torch.nn as nn
# BatchNorm1d: for fully-connected layers (normalises over batch dim)
# BatchNorm2d: for conv layers (normalises per channel over batch+spatial)
model = nn.Sequential(
nn.Linear(784, 256),
nn.BatchNorm1d(256), # BN BEFORE or AFTER activation — varies by paper
nn.ReLU(),
nn.Linear(256, 128),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Linear(128, 10),
)
# BatchNorm behaves DIFFERENTLY in train vs eval mode!
model.train() # uses batch mean/var during forward pass
model.eval() # uses running mean/var (exponential moving avg)
# Always call model.eval() at inference time:
with torch.no_grad():
    model.eval()
    preds = model(torch.randn(1, 784)) # inference: correct behavior
# Manual: BN keeps running stats during training
bn = nn.BatchNorm1d(256)
print(bn.running_mean.shape) # torch.Size([256]) — updated each forward call
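The BN formula can be traced by hand for a single feature. The sketch below normalises one feature across a mini-batch of four values in plain Python, using the biased (divide-by-N) batch variance for the normalisation step. The function name `batch_norm_1d` and the sample values are illustrative, and the sketch covers training-mode statistics only, not the running averages used in eval mode.

```python
import math


def batch_norm_1d(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batch norm for one feature: normalise, then scale by gamma and shift by beta."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n   # biased batch variance
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in values]


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    m = mean(xs)
    return math.sqrt(sum((v - m) ** 2 for v in xs) / len(xs))


feature = [2.0, 4.0, 6.0, 8.0]                       # one feature across a mini-batch of 4
normed = batch_norm_1d(feature)                      # zero mean, unit variance
rescaled = batch_norm_1d(feature, gamma=3.0, beta=10.0)

print(mean(normed), std(normed))
print(mean(rescaled), std(rescaled))
```

With γ = 3 and β = 10 the output has mean 10 and standard deviation close to 3, which shows that the learnable parameters can restore any scale and shift the network finds useful.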
All these optimizers share the same goal — updating parameters to reduce loss — but differ in how they use gradient history to adapt the update step. Understanding the mechanics helps diagnose slow training and poor generalisation.
| Optimizer | Update rule (simplified) | Key advantage | Limitation |
|---|---|---|---|
| SGD | θ ← θ - η·g | Simple, no memory overhead | Slow convergence, sensitive to lr |
| SGD + Momentum | v ← βv + g; θ ← θ - η·v | Accelerates consistent directions, damps oscillation | Still global lr |
| RMSProp | θ ← θ - η·g / √(E[g²]+ε) | Adapts lr per parameter; good for RNNs | No momentum term |
| Adam | Combines momentum + RMSProp; bias-corrected | Robust default; fast convergence | Can generalise worse than SGD on some tasks |
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
# SGD — baseline, works but needs careful lr tuning
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01)
# SGD + Momentum — adds velocity; β=0.9 is standard
opt_mom = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
weight_decay=1e-4) # L2 regularisation
# Adam — adaptive learning rate + momentum; best default for DL
opt_adam = torch.optim.Adam(model.parameters(),
lr=1e-3, # default, usually works
betas=(0.9, 0.999), # momentum terms
eps=1e-8,
weight_decay=1e-5)
# AdamW — Adam with decoupled weight decay (better than Adam + L2)
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3,
weight_decay=1e-2)
# Learning rate schedulers — change lr during training
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt_adam, T_max=100)
for epoch in range(100):
    # ... training loop (forward, backward, opt_adam.step()) ...
    scheduler.step() # once per epoch, after the optimizer updates; lr follows a cosine curve
When to choose: Adam is the safe default for most deep learning tasks. SGD with momentum often achieves better final generalisation on image classification tasks; closing that gap is part of what motivated AdamW's decoupled weight decay. AdamW is now the standard for fine-tuning large language models.
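The "bias-corrected" entry in the Adam row is easiest to see on a single scalar parameter. The sketch below implements one Adam update in plain Python with the usual default hyperparameters; the function name `adam_step` is an assumption of this sketch. On the very first step the bias-corrected moments reduce to g and g², so the update has magnitude close to lr regardless of how large or small the gradient is.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


first_steps = []
for g in (0.001, 1.0, 1000.0):
    theta, m, v = adam_step(theta=0.0, g=g, m=0.0, v=0.0, t=1)
    first_steps.append(theta)
    print(f"gradient {g:>8}: parameter moved by {theta:.6f}")
```

Plain SGD would move the parameter by lr · g, so the three gradients above would produce steps differing by six orders of magnitude; Adam's per-parameter normalisation removes that dependence.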
During training, Dropout randomly sets each neuron's output to zero with probability p (the drop probability) and scales the remaining activations by 1/(1-p) to preserve the expected sum. This means each forward pass trains a different thinned sub-network — with n neurons, there are 2ⁿ possible sub-networks, and each training step updates a random one.
The regularisation effect comes from several mechanisms: (1) it prevents co-adaptation, since neurons cannot rely on specific other neurons always being present and must each learn useful features independently; (2) it can be viewed as approximately training an exponentially large ensemble of weight-sharing sub-networks, with the full network at test time (where Dropout is disabled) approximating the average of their predictions; (3) the multiplicative noise acts similarly to L2 regularisation on the weights, a correspondence that is exact only for simple linear models. At inference, Dropout is disabled and all neurons are active; the 1/(1-p) scaling during training ensures the expected value of each neuron's output is the same during training and inference.
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(784, 512), nn.ReLU(),
nn.Dropout(p=0.5), # drop 50% of neurons
nn.Linear(512, 256), nn.ReLU(),
nn.Dropout(p=0.3),
nn.Linear(256, 10)
)
# Training: Dropout is ACTIVE (neurons randomly zeroed)
model.train()
x = torch.ones(1, 784)
out1 = model(x)
out2 = model(x) # different result! different neurons dropped each time
# Inference: Dropout is DISABLED (all neurons active)
model.eval()
with torch.no_grad():
    out3 = model(x)
    out4 = model(x) # same result: deterministic
# Inverted Dropout (PyTorch default):
# Scale by 1/(1-p) DURING training, not during inference
# => test-time output has correct expected value without scaling
dp = nn.Dropout(p=0.5)
dp.train() # dp is a standalone module, so model.train() would not change its mode
x_in = torch.ones(10)
print(dp(x_in)) # ~5 zeros on average, remaining values are 2.0 (scaled by 1/0.5)
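The claim that the 1/(1-p) scaling preserves the expected activation can be checked numerically. The sketch below is a plain-Python version of inverted dropout applied to a long vector of ones; the function name `inverted_dropout`, the vector length, and the seed are illustrative choices.

```python
import random


def inverted_dropout(values, p, rng):
    """Zero each value with probability p and scale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
x = [1.0] * 10_000
out = inverted_dropout(x, p=0.5, rng=rng)
mean_out = sum(out) / len(out)

print(sorted(set(out)))   # only 0.0 and 2.0 occur
print(mean_out)           # close to the input mean of 1.0
```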
A convolutional layer applies a set of learnable filters (kernels) by sliding each filter over the spatial dimensions of the input and computing a dot product at each position. For a 2D image, a kernel of size k×k with C_in input channels and C_out output channels has k×k×C_in×C_out parameters total. This produces one feature map per output channel, where each value represents the response of that filter at a specific spatial location.
CNNs are powerful for images because of two structural inductive biases they encode: (1) translation equivariance — the same filter is applied everywhere, so if an object moves in the image, the corresponding feature map activation moves identically; (2) parameter sharing — instead of a separate weight per input-output pixel pair (as a fully-connected layer would require), the filter weights are shared across all spatial locations, drastically reducing parameters and improving sample efficiency.
import torch
import torch.nn as nn
# Standard Conv2d usage
# Input: (batch, C_in, H, W)
# Output: (batch, C_out, H', W')
conv = nn.Conv2d(
in_channels=3, # RGB image
out_channels=64, # 64 filters
kernel_size=3, # 3x3 kernel
stride=1,
padding=1, # 'same' padding — preserves H and W
)
x = torch.randn(8, 3, 32, 32) # batch of 8 RGB 32x32 images
out = conv(x)
print(out.shape) # torch.Size([8, 64, 32, 32])
n_params = 3 * 64 * 3 * 3 + 64 # weights + biases
print('Parameters:', n_params) # 1792
# Compare: FC layer 3*32*32 -> 64*32*32 would be 3*32*32*64*32*32 ≈ 201M weights!
# Typical CNN block:
cnn_block = nn.Sequential(
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2), # halves spatial dims
)
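The output shapes and parameter counts quoted above follow from two short formulas. The sketch below encodes them as helper functions; the names `conv_out_size` and `conv_params` are hypothetical, and the size formula assumes dilation 1 and square kernels.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(c_in, c_out, kernel, bias=True):
    """Learnable parameters of a 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


h_conv = conv_out_size(32, kernel=3, stride=1, padding=1)   # 'same' padding keeps 32
h_pool = conv_out_size(h_conv, kernel=2, stride=2)          # 2x2 max pool halves it

print(h_conv, h_pool)
print(conv_params(3, 64, 3))      # first conv layer above
print(conv_params(64, 128, 3))    # conv layer inside cnn_block
```

Note that the parameter count does not depend on the image size at all, which is the practical consequence of weight sharing.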
A vanilla RNN processes a sequence step-by-step, maintaining a hidden state hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b) that acts as a compressed memory of everything seen so far. The problem is that this hidden state must be updated at every step — and during backpropagation through time (BPTT), gradients are multiplied by Wₕ repeatedly. If the spectral radius of Wₕ is less than 1, gradients vanish over long sequences; if greater than 1, they explode. In practice, vanilla RNNs cannot effectively learn dependencies longer than ~10–20 steps.
LSTMs introduce a separate cell state cₜ (the long-term memory) and three gates (forget, input, and output), each controlled by sigmoid activations. The forget gate fₜ = σ(Wf[hₜ₋₁, xₜ] + bf) decides what to erase from cₜ₋₁; the input gate decides what new information to write; the output gate controls what the hidden state exposes. The key mathematical insight is that the cell state update is additive: cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ. Along the cell state, the gradient is scaled only by the forget gate at each step rather than by repeated multiplication with Wₕ and the tanh derivative, so it can survive over long sequences when the forget gate stays close to 1. This largely mitigates, though does not fully eliminate, the vanishing gradient problem for long sequences.
import torch
import torch.nn as nn
# LSTM usage in PyTorch
lstm = nn.LSTM(
input_size=64,
hidden_size=128,
num_layers=2, # stacked LSTM
batch_first=True, # input shape: (batch, seq, features)
dropout=0.2, # applied between stacked layers
bidirectional=False
)
x = torch.randn(32, 50, 64) # (batch=32, seq_len=50, input=64)
output, (h_n, c_n) = lstm(x)
print(output.shape) # (32, 50, 128) — all time-step hidden states
print(h_n.shape) # (2, 32, 128) — final hidden state, both layers
print(c_n.shape) # (2, 32, 128) — final cell state, both layers
# GRU: simplified LSTM with only 2 gates — often comparable quality
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
out_gru, h_gru = gru(x)
# For classification, use the LAST hidden state:
last_h = output[:, -1, :] # (32, 128) — last time step
classifier = nn.Linear(128, 5)
logits = classifier(last_h)
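The additive cell-state update can be traced with a scalar LSTM cell written by hand. In the sketch below all recurrent weights are set to zero so that the gates do not depend on the hidden state, which makes the gradient along the direct cell path exactly the product of the forget gates. The parameter values, including the forget-gate bias of 4, are illustrative assumptions chosen to keep the gate near 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, b)."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)
    i = gate("input", sigmoid)
    o = gate("output", sigmoid)
    c_tilde = gate("candidate", math.tanh)
    c = f * c_prev + i * c_tilde      # additive cell-state update
    h = o * math.tanh(c)
    return h, c, f


params = {
    "forget": (0.0, 0.0, 4.0),        # large bias keeps the forget gate near 1
    "input": (1.0, 0.0, 0.0),
    "output": (1.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
cell_path_gradient = 1.0
for x in [0.0] * 50:
    h, c, f = lstm_cell_step(x, h, c, params)
    cell_path_gradient *= f           # d c_t / d c_{t-1} along the direct path is f_t

rnn_like_gradient = 0.5 ** 50         # a per-step factor of 0.5, as a vanilla RNN might have
print(cell_path_gradient, rnn_like_gradient)
```

After 50 steps the cell path retains a usable fraction of the gradient, while a constant per-step factor of 0.5 leaves essentially nothing.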
Self-attention computes a weighted sum of all input vectors, where the weight between positions i and j reflects how much position i should 'attend to' position j. Concretely, input vectors are linearly projected into queries (Q), keys (K), and values (V), and the attention output is: Attention(Q, K, V) = softmax(QKᵀ/√dₖ) · V. The division by √dₖ prevents the dot products from growing large in high-dimensional spaces, which would push softmax into saturation.
Multi-head attention runs H parallel attention heads with different Q/K/V projections, then concatenates and projects their outputs, so each head can learn to attend to different types of relationships simultaneously. The critical advantage over RNNs: self-attention connects any two positions in the sequence through a path of length O(1) regardless of their distance, while RNNs need O(n) sequential steps to connect positions n apart. The price is O(n²) compute and memory in the sequence length, but all positions are processed in parallel, which makes transformers efficient to train and has enabled training on vastly larger datasets.
import torch
import torch.nn as nn
import math
class ScaledDotProductAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        d_k = Q.shape[-1]
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = torch.softmax(scores, dim=-1)
        return weights @ V, weights

# PyTorch's built-in multi-head attention
mha = nn.MultiheadAttention(
embed_dim=512,
num_heads=8, # 8 heads, each with dim=64
dropout=0.1,
batch_first=True
)
seq_len, batch, d_model = 20, 4, 512
x = torch.randn(batch, seq_len, d_model)
out, attn_weights = mha(x, x, x) # Q=K=V=x for self-attention
print(out.shape) # (4, 20, 512)
print(attn_weights.shape)# (4, 20, 20) — weight of each position pair
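The effect of dividing by √dₖ can be seen on a single query. The sketch below draws a random query and eight random keys with dₖ = 64 in plain Python and compares the softmax of the raw dot products with the softmax of the scaled ones. The dimensions and seed are arbitrary; with unit-variance components the raw scores have a standard deviation of about 8, which tends to push the softmax towards a near one-hot output.

```python
import math
import random


def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
d_k, n_keys = 64, 8
q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[random.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]

raw_scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print(f"largest weight without scaling: {max(unscaled):.3f}")
print(f"largest weight with scaling:    {max(scaled):.3f}")
```

A flatter distribution keeps the softmax away from its saturated region, where gradients with respect to the scores are close to zero.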
The choice of loss function should match the output type and the probabilistic assumption about the data-generating process — it is the mathematical link between model predictions and the training signal.
| Task | Loss | PyTorch class | Notes |
|---|---|---|---|
| Binary classification | Binary cross-entropy | nn.BCEWithLogitsLoss | Takes logits (pre-sigmoid); numerically stable |
| Multi-class classification | Cross-entropy | nn.CrossEntropyLoss | Takes logits; combines log-softmax + NLLLoss |
| Regression | MSE | nn.MSELoss | Sensitive to outliers |
| Regression (robust) | MAE / Huber | nn.L1Loss / nn.HuberLoss | Huber blends L1+L2; robust to outliers |
| Multi-label classification | BCE per label | nn.BCEWithLogitsLoss | Each label independent — not mutually exclusive |
| Contrastive / metric learning | Triplet margin | nn.TripletMarginLoss | Learns embeddings |
import torch
import torch.nn as nn
# Binary classification — output is a single logit (no sigmoid)
bce = nn.BCEWithLogitsLoss() # applies sigmoid internally
logit = torch.tensor([2.0, -1.0, 0.5])
label = torch.tensor([1.0, 0.0, 1.0])
loss = bce(logit, label)
# Multi-class — outputs are raw logits per class (no softmax)
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 10) # batch of 8, 10 classes
targets = torch.randint(0, 10, (8,)) # class indices 0-9
loss = ce(logits, targets)
# Class-weighted cross-entropy — for imbalanced datasets
weights = torch.tensor([1.0]*9 + [10.0]) # up-weight class 9
ce_weighted = nn.CrossEntropyLoss(weight=weights)
# Regression
mse = nn.MSELoss()
huber = nn.HuberLoss(delta=1.0) # L2 for |error|<1, L1 for |error|>1
pred = torch.randn(32, 1)
true = torch.randn(32, 1)
print(mse(pred, true), huber(pred, true))
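The "takes logits; numerically stable" notes in the table refer to evaluating the loss in log space instead of exponentiating first. The sketch below shows the standard log-sum-exp form of cross-entropy and a stable form of binary cross-entropy on logits, in plain Python. The function names are illustrative, and the sketch covers a single example, with no batch reduction.

```python
import math


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def bce_with_logits(z, y):
    """Binary cross-entropy on a raw logit z, without computing sigmoid(z) explicitly."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


ce = cross_entropy([2.0, 1.0, 0.1], target=0)
bce = bce_with_logits(2.0, 1.0)
extreme = cross_entropy([1000.0, 0.0], target=1)   # a naive exp(1000) would overflow

print(ce, bce, extreme)
```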
Transfer learning reuses a model trained on a large dataset (typically ImageNet for vision, or a large text corpus for NLP) as a starting point for a related task with less data. The pretrained model has already learned general features (edges, textures, shapes for images; grammar, semantics for text) — fine-tuning adapts these features to the target task without needing to learn them from scratch.
Two common strategies: (1) Feature extraction: freeze all pretrained layers and train only a new task-specific head; (2) Full fine-tuning: unfreeze some or all pretrained layers and train end-to-end with a small learning rate to avoid overwriting the useful pretrained representations. A common practical pattern is to first train only the head for a few epochs, so that the large, noisy gradients produced by a randomly initialised head are not backpropagated into the pretrained backbone, then unfreeze and fine-tune everything together with a smaller lr.
import torch
import torch.nn as nn
import torchvision.models as models
# Load pretrained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# --- Strategy 1: Feature extraction ---
# Freeze ALL pretrained parameters
for param in backbone.parameters():
    param.requires_grad = False
# Replace the final FC layer for our task (e.g. 5 classes)
in_features = backbone.fc.in_features # 2048 for ResNet-50
backbone.fc = nn.Linear(in_features, 5)
# Only backbone.fc.parameters() have requires_grad=True
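# --- Strategy 2: Full fine-tuning ---
backbone2 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone2.fc = nn.Linear(backbone2.fc.in_features, 5)
# Use layer-wise lr: smaller lr for early layers.
# Only parameters passed to the optimizer are updated, so every block is listed
# (the intermediate lr values are illustrative).
stem = list(backbone2.conv1.parameters()) + list(backbone2.bn1.parameters())
optimizer = torch.optim.AdamW([
    {'params': stem, 'lr': 1e-5},
    {'params': backbone2.layer1.parameters(), 'lr': 1e-5},
    {'params': backbone2.layer2.parameters(), 'lr': 3e-5},
    {'params': backbone2.layer3.parameters(), 'lr': 3e-5},
    {'params': backbone2.layer4.parameters(), 'lr': 1e-4},
    {'params': backbone2.fc.parameters(), 'lr': 1e-3},
], weight_decay=1e-2)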
# Verify which parameters will be updated
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f'Trainable: {trainable:,} / Total: {total:,}')
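The two-phase pattern (train the head first, then unfreeze) can be sketched without downloading any weights. In the example below a small `nn.Sequential` stands in for the pretrained backbone; the layer sizes, learning rates, and the helper names `set_trainable` and `count_trainable` are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-ins: `backbone` plays the pretrained network, `head` the new task layer
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 5)
model = nn.Sequential(backbone, head)


def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag


def count_trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


x, y = torch.randn(8, 16), torch.randint(0, 5, (8,))

# Phase 1: freeze the backbone and train only the head
set_trainable(backbone, False)
phase1_trainable = count_trainable(model)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
backbone_before = backbone[0].weight.detach().clone()
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
backbone_unchanged = torch.equal(backbone_before, backbone[0].weight)

# Phase 2: unfreeze everything and continue with a smaller lr for the backbone
set_trainable(backbone, True)
phase2_trainable = count_trainable(model)
optimizer = torch.optim.AdamW([
    {'params': backbone.parameters(), 'lr': 1e-5},
    {'params': head.parameters(), 'lr': 1e-4},
])

print(phase1_trainable, phase2_trainable, backbone_unchanged)
```

Creating a fresh optimizer for phase 2 is a simple way to make sure the newly unfrozen parameters are registered with it.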
PyTorch's data loading follows a clean two-class design: Dataset encapsulates how to access a single sample (index → (X, y)), and DataLoader wraps a Dataset to handle batching, shuffling, and parallel data loading. Separating these responsibilities makes it easy to write dataset-specific logic once and reuse the same efficient loading infrastructure.
The most critical performance consideration is that the data loading pipeline must keep the GPU continuously fed — the GPU should never sit idle waiting for the next batch. Key knobs: num_workers launches subprocesses that prefetch batches in parallel with the GPU computation; pin_memory=True allocates batch tensors in pinned (non-pageable) CPU memory, enabling faster CPU→GPU transfers via DMA; prefetch_factor controls how many batches each worker prefetches ahead.
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
class TabularDataset(Dataset):
    def __init__(self, X: np.ndarray, y: np.ndarray):
        # Convert to tensors once at construction (not per __getitem__)
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.X) # required: the sampler uses this to generate indices

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx] # single sample

# Placeholder data so the example runs; replace with real arrays
X_train = np.random.rand(10_000, 20).astype(np.float32)
y_train = np.random.randint(0, 3, size=10_000)
dataset = TabularDataset(X_train, y_train)
loader = DataLoader(
dataset,
batch_size=256,
shuffle=True, # shuffle each epoch
num_workers=4, # parallel data loading
pin_memory=True, # faster CPU->GPU transfer
drop_last=True, # drop incomplete final batch
persistent_workers=True, # keep workers alive between epochs
)
# Training loop
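device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
for X_batch, y_batch in loader:
    X_batch = X_batch.to(device, non_blocking=True) # async transfer when memory is pinned
    y_batch = y_batch.to(device, non_blocking=True)
    # ... forward, backward, step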
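The division of labour between Dataset and DataLoader becomes clearer when the batching logic is written out. The sketch below is a simplified, single-process stand-in for what a DataLoader does with `shuffle`, `batch_size`, and `drop_last`. It is not PyTorch's implementation, and the function name `iterate_batches` is hypothetical. Any object with `__len__` and `__getitem__` works as the dataset.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=True, drop_last=False, seed=0):
    """Yield (inputs, labels) batches from an indexable dataset of (x, y) pairs."""
    indices = list(range(len(dataset)))          # uses __len__
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        if drop_last and len(batch_idx) < batch_size:
            break                                # skip the incomplete final batch
        samples = [dataset[i] for i in batch_idx]  # uses __getitem__
        xs, ys = zip(*samples)                   # "collate": regroup per-sample pairs by field
        yield list(xs), list(ys)


dataset = [(float(i), i % 2) for i in range(10)]  # 10 samples of (feature, label)
batches = list(iterate_batches(dataset, batch_size=4, drop_last=True))
all_batches = list(iterate_batches(dataset, batch_size=4, drop_last=False))

print(len(batches), len(all_batches))
print(batches[0])
```

The real DataLoader adds worker processes, pinned memory, and tensor collation on top of this loop, but the index-then-collate structure is the same idea.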
A fixed learning rate is a poor choice for most training runs: a rate high enough for fast early progress keeps the parameters bouncing around a minimum late in training, while a rate low enough for precise final convergence makes early progress slow (and a rate that is too high causes instability from the start). Learning rate schedulers systematically vary the lr during training to get the best of both worlds: fast progress early, precise convergence later.
| Schedule | Behaviour | Best for |
|---|---|---|
| StepLR | Multiply lr by γ every N epochs | Quick experiments; baseline |
| CosineAnnealingLR | lr follows cosine curve from η_max to η_min | Most DL tasks; smooth decay |
| OneCycleLR | Warmup from low to high lr, then decay — all in one cycle | Fast training (super-convergence) |
| ReduceLROnPlateau | Reduce lr when validation metric stops improving | Unknown training time; auto-adapts |
| CyclicLR | Cycle between base_lr and max_lr repeatedly | Escaping sharp minima |
| Warmup then decay | Linear warmup then cosine decay; no single built-in class, typically composed with LambdaLR or SequentialLR | Large transformers, LLMs |
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
# CosineAnnealingLR — smooth decay from max to min lr
scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=100, eta_min=1e-6
)
# OneCycleLR — requires total_steps at init
n_epochs, steps_per_epoch = 10, 100
scheduler_1cycle = optim.lr_scheduler.OneCycleLR(
optimizer,
max_lr=1e-2,
total_steps=n_epochs * steps_per_epoch,
pct_start=0.3, # 30% of steps for warmup
)
# ReduceLROnPlateau — triggered by validation metric
scheduler_plateau = optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode='min', factor=0.5, patience=5
)
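# Training loop (train_one_epoch, validate, loader and val_loader are user-defined placeholders).
# The three schedulers above share one optimizer only to show the API; in practice pick one.
# OneCycleLR is stepped after every batch, not once per epoch.
for epoch in range(100):
    train_one_epoch(model, optimizer, loader)
    val_loss = validate(model, val_loader)
    scheduler_cos.step()               # epoch-based scheduler
    # scheduler_plateau.step(val_loss) # metric-based alternative
    print(f'LR: {optimizer.param_groups[0]["lr"]:.6f}')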
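The warmup-then-cosine row in the table has no dedicated scheduler class in the code above, but the schedule itself is a short closed-form function. The sketch below is one possible formulation in plain Python; the function name `warmup_cosine_lr`, the step counts, and the learning rates are illustrative assumptions. A function of this shape could be adapted for use with `LambdaLR`, which expects a multiplier of the base lr rather than an absolute value.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay from max_lr to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [
    warmup_cosine_lr(s, total_steps=100, warmup_steps=10, max_lr=1e-3, min_lr=1e-6)
    for s in range(101)
]

for s in (0, 9, 10, 55, 100):
    print(f"step {s:3d}: lr = {schedule[s]:.6f}")
```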
Deep neural networks have millions of parameters and can trivially memorise training data. Classical regularisation (L1/L2 on weights) still applies, but modern deep learning has developed additional techniques that often work better or are used in combination.
| Technique | How it works | Best applied to |
|---|---|---|
| L2 (weight decay) | Penalises large weights: adds λ‖w‖² to loss | All DL models; use AdamW for correct implementation |
| Dropout | Randomly zero neurons during training | Fully-connected layers; less common in conv/transformer |
| Data augmentation | Artificially increase diversity of training set | Vision (flips, crop, colour jitter, mixup, cutmix) |
| Early stopping | Stop training when val loss stops improving | Any model; simple and effective baseline |
| Label smoothing | Mix one-hot labels with a uniform distribution: 1-ε+ε/k for the true class, ε/k for the others | Classification; improves calibration |
| Stochastic depth | Randomly drop entire residual blocks during training | Very deep networks (ResNets, ViTs) |
import torch
import torch.nn as nn
import torchvision.transforms as T
# Data augmentation for images
train_transform = T.Compose([
T.RandomHorizontalFlip(p=0.5),
T.RandomCrop(32, padding=4),
T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
T.ToTensor(),
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
# Label smoothing: penalises overconfident predictions
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# A 10-class example: true label 3 becomes
# [0.01, 0.01, 0.01, 0.91, 0.01, ...] instead of [0,0,0,1,0,...]
# Mixup augmentation (manual implementation)
def mixup_batch(x, y, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_a, y_b = y, y[idx]
    return x_mix, y_a, y_b, lam

# Early stopping — track best val loss, restore best weights
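max_epochs, patience = 100, 10 # illustrative values
best_val_loss = float('inf')
patience_count = 0
for epoch in range(max_epochs):
    # ... train for one epoch ...
    val_loss = validate(model, val_loader) # user-defined validation routine
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')
        patience_count = 0
    else:
        patience_count += 1
        if patience_count >= patience:
            break
model.load_state_dict(torch.load('best_model.pt')) # restore best weights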
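The label-smoothing comment above (0.91 for the true class, 0.01 elsewhere) corresponds to mixing the one-hot target with a uniform distribution over all k classes. The sketch below builds that target by hand and evaluates cross-entropy against it for a very confident prediction. The function names and logits are illustrative assumptions.

```python
import math


def smooth_labels(target, num_classes, eps):
    """(1 - eps) * one_hot + eps * uniform over all classes."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == target else uniform for k in range(num_classes)]


def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return sum(t * (log_sum_exp - z) for t, z in zip(target_probs, logits))


num_classes, true_class = 10, 3
hard = smooth_labels(true_class, num_classes, eps=0.0)      # plain one-hot target
smoothed = smooth_labels(true_class, num_classes, eps=0.1)

confident_logits = [10.0 if k == true_class else 0.0 for k in range(num_classes)]
loss_hard = soft_cross_entropy(confident_logits, hard)
loss_smoothed = soft_cross_entropy(confident_logits, smoothed)

print([round(p, 2) for p in smoothed])
print(loss_hard, loss_smoothed)
```

With hard labels the loss for this prediction is nearly zero, so the model is rewarded for pushing logits ever higher; with smoothed labels the same prediction incurs a noticeable loss, which discourages overconfidence.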
An embedding layer is a learnable lookup table that maps discrete tokens (words, categories, user IDs) to dense, low-dimensional real-valued vectors. It is mathematically a matrix E ∈ ℝ^{V×d} (vocabulary size × embedding dimension), and looking up token i simply retrieves row i — equivalent to multiplying a one-hot vector by E, but implemented as an O(1) table lookup rather than an O(V) matrix multiply.
The key advantage over one-hot encoding is that embeddings are learned — similar tokens (synonyms, related categories) naturally end up with similar embedding vectors because they appear in similar contexts during training. This gives embeddings semantic meaning and enables generalisation: the model can leverage the fact that 'Paris' and 'Berlin' are semantically similar even if 'Berlin' was rare in training data, because their embedding vectors will be nearby.
import torch
import torch.nn as nn
vocab_size = 10000
embed_dim = 128
embedding = nn.Embedding(
num_embeddings=vocab_size,
embedding_dim=embed_dim,
padding_idx=0 # token 0 gets a fixed zero vector (PAD token)
)
# Input: integer token IDs
token_ids = torch.tensor([[1, 5, 23, 0], [42, 7, 0, 0]]) # (2, 4)
embedded = embedding(token_ids)
print(embedded.shape) # (2, 4, 128) — each token -> 128-dim vector
# Pre-trained embeddings (e.g. GloVe, Word2Vec)
pretrained = torch.randn(vocab_size, embed_dim) # replace with real vectors
embedding.weight.data.copy_(pretrained)
# Freeze pretrained embeddings:
# embedding.weight.requires_grad = False
# In a text model:
class TextClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, 256, batch_first=True)
        self.fc = nn.Linear(256, 5)

    def forward(self, x):
        e = self.embed(x) # (B, L, 128)
        _, (h, _) = self.lstm(e)
        return self.fc(h[-1])
PyTorch provides two main ways to persist a model: saving the full model object (convenient but fragile to class definition changes) or saving only the state dictionary (recommended for production and reproducibility). The state dict is a Python OrderedDict mapping names to the model's parameter tensors and registered buffers (such as BatchNorm running statistics), which is everything needed to recreate the model's learned state.
A proper training checkpoint includes more than just model weights — it must also save the optimizer state (which contains momentum buffers and adaptive learning rate accumulators in Adam), the current epoch and step, the best validation metric, and the random number generator state, so that training can be resumed exactly where it left off without any change in behaviour.
import torch
import torch.nn as nn
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ── Recommended: save/load state_dict ──
torch.save(model.state_dict(), 'model_weights.pt')
model_new = nn.Linear(10, 5) # same architecture
model_new.load_state_dict(torch.load('model_weights.pt'))
model_new.eval() # ALWAYS call eval() for inference
# ── Full training checkpoint ──
def save_checkpoint(path, epoch, model, optimizer, best_val_loss):
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'best_val_loss': best_val_loss,
        'rng_state': torch.get_rng_state(),  # CPU RNG state only
    }, path)
def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    optimizer.load_state_dict(ckpt['optimizer_state'])
    torch.set_rng_state(ckpt['rng_state'])  # restore the saved CPU RNG state
    return ckpt['epoch'], ckpt['best_val_loss']
# Loading on a different device: load to CPU first, then move to the target
# device (avoids errors when the GPU the file was saved from is unavailable)
model.load_state_dict(
    torch.load('model_weights.pt', map_location='cpu')
)
model = model.to('cuda')
Modern NVIDIA GPUs have dedicated hardware (Tensor Cores) for 16-bit floating-point operations: FP16 from the Volta generation onwards and BFloat16 from Ampere onwards. These can be 2–8× faster than FP32 for matrix multiplications. Mixed precision training runs the forward pass and gradient computations in FP16 (or BF16) for speed, while maintaining a master copy of the weights in FP32 for numerical precision during the optimizer update.
Loss scaling addresses a key challenge: FP16's limited dynamic range (smallest positive ≈ 6×10⁻⁸) can cause small gradient values to underflow to zero. The scaler multiplies the loss by a large scalar before backward (inflating gradients into FP16's representable range), then divides the gradients back before the optimizer step. PyTorch's GradScaler automates this and dynamically adjusts the scale factor.
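The underflow problem and its fix can be reproduced without a GPU. The sketch below rounds Python floats to half precision with the standard library's `struct` format code `'e'`; the helper `to_fp16`, the gradient value 1e-8 and the scale 2^16 are illustrative assumptions.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8            # a small gradient value, below FP16's smallest positive number
scale = 2.0 ** 16      # illustrative loss-scale factor

unscaled_fp16 = to_fp16(grad)            # stored directly in FP16: underflows
scaled_fp16 = to_fp16(grad * scale)      # scaled first: lands in FP16's normal range
recovered = scaled_fp16 / scale          # unscaled again in higher precision

print(unscaled_fp16, scaled_fp16, recovered)
```

The unscaled value is flushed to zero, while the scaled value survives the FP16 round trip and is recovered to within FP16's relative precision after dividing by the scale.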
import torch
import torch.nn as nn
from torch import autocast  # torch.autocast accepts device_type
from torch.cuda.amp import GradScaler
model = nn.Linear(1024, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler() # manages loss scaling automatically
x = torch.randn(256, 1024).cuda()
y = torch.randn(256, 512).cuda()
for step in range(100):
    optimizer.zero_grad()
    # autocast: runs eligible ops in FP16 automatically
    with autocast(device_type='cuda', dtype=torch.float16):
        y_hat = model(x) # FP16 matrix multiply
        loss = nn.MSELoss()(y_hat, y)
    # Scale loss -> backward in FP16 -> unscale gradients -> optimizer step
    scaler.scale(loss).backward() # inflate loss to prevent underflow
    scaler.unscale_(optimizer) # restore original gradient magnitudes
    nn.utils.clip_grad_norm_(model.parameters(), 1.0) # clip after unscale
    scaler.step(optimizer) # skip step if gradients are inf/NaN
    scaler.update() # adjust scale factor for next step
# BFloat16 (bfloat16): available on Ampere (A100) and newer GPUs
# - Same exponent range as FP32 (no underflow problem -> no scaler needed)
# - Less precision (7-bit mantissa vs 10-bit for FP16)
with autocast(device_type='cuda', dtype=torch.bfloat16):
    y_hat = model(x) # no scaler needed with BF16
model.eval() (with its counterpart model.train()), torch.no_grad() and torch.inference_mode() serve different but complementary purposes that are often confused. Understanding the distinction prevents subtle bugs in training, validation, and inference code.
| Mechanism | What it controls | Effect |
|---|---|---|
| model.eval() | Layer behaviour (Dropout, BatchNorm) | Disables Dropout; BatchNorm uses running stats instead of batch stats |
| model.train() | Layer behaviour (Dropout, BatchNorm) | Enables Dropout; BatchNorm uses current batch stats |
| torch.no_grad() | Gradient tracking | Stops building the computation graph; saves memory; results have requires_grad=False, so .backward() cannot flow through them |
| torch.inference_mode() | Gradient tracking + view/version tracking | Stricter than no_grad; usually somewhat faster; returned tensors cannot be used in autograd at all |
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(10, 64), nn.ReLU(),
nn.BatchNorm1d(64), nn.Dropout(0.5),
nn.Linear(64, 1)
)
# ─── Training ───────────────────────────────────────────────────────
model.train() # Dropout ACTIVE, BatchNorm uses batch stats
x = torch.randn(32, 10)
out1 = model(x)
out2 = model(x) # DIFFERENT — Dropout randomly drops each call
# ─── Validation (compute val loss, need backward later? No) ─────────
model.eval()
with torch.no_grad():
# Dropout OFF, BatchNorm uses running stats, no computation graph
val_out = model(x)
val_loss = nn.MSELoss()(val_out, torch.zeros(32, 1))
# ─── Inference / deployment ─────────────────────────────────────────
model.eval()
with torch.inference_mode(): # fastest; cannot go back to autograd
pred = model(torch.randn(1, 10))
# COMMON BUG: forgetting model.eval() at inference
# model.eval() and torch.no_grad() are INDEPENDENT — you need BOTH:
# - model.eval() alone: still builds graph (memory waste)
# - torch.no_grad() alone: Dropout still active (wrong predictions)
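The independence of the two switches can be verified on a toy model. This is a minimal sketch; the layer sizes and variable names are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))
x = torch.ones(8, 4)

# eval() alone: Dropout is off, but autograd still records the graph
net.eval()
out_eval = net(x)
eval_tracks_grad = out_eval.requires_grad

# no_grad() alone: no graph is recorded, but Dropout is still random
net.train()
with torch.no_grad():
    a, b = net(x), net(x)
no_grad_tracks_grad = a.requires_grad
dropout_still_random = not torch.equal(a, b)

print(eval_tracks_grad, no_grad_tracks_grad, dropout_still_random)
```

The first flag shows that eval mode does not stop graph construction; the last shows that no_grad does not change layer behaviour.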
PyTorch's device abstraction allows the same code to run on CPU, single GPU, or multiple GPUs with minimal changes. The fundamental operations are moving tensors to a device with .to(device) or .cuda(), and ensuring model and data tensors always reside on the same device before any computation.
A critical performance concept: CPU–GPU data transfers are expensive (PCIe bandwidth is limited vs. GPU memory bandwidth). Minimise them by loading data onto the GPU once per batch, pre-computing dataset statistics on CPU, and avoiding frequent tensor transfers inside the training loop.
import torch
import torch.nn as nn
# Device-agnostic code pattern
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using: {device}') # cuda or cpu
# Move model to device
model = nn.Linear(10, 5).to(device)
# Move data to device in the training loop
for X_batch, y_batch in loader:
X_batch = X_batch.to(device, non_blocking=True)
y_batch = y_batch.to(device, non_blocking=True)
y_hat = model(X_batch)
# ...
# Check which device a tensor is on
t = torch.randn(3)
print(t.device) # cpu
t_gpu = t.cuda() # or t.to('cuda:0')
print(t_gpu.device) # cuda:0
# Apple Silicon
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
# Memory diagnostics
print(torch.cuda.memory_allocated() / 1e9, 'GB allocated')
print(torch.cuda.max_memory_allocated() / 1e9, 'GB peak')
torch.cuda.empty_cache() # release unused cached GPU memory
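# Multi-GPU: DistributedDataParallel (DDP) preferred over DataParallel.
# DDP runs one process per GPU (e.g. launched with torchrun); after
# torch.distributed.init_process_group(...) each process wraps its own replica.
# local_rank below stands for this process's GPU index (an assumed variable):
# model_ddp = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])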
All normalisation variants compute mean and variance and apply the same transformation (x-μ)/√(σ²+ε) — they differ only in which dimensions the mean and variance are computed over. This seemingly small difference has large practical consequences depending on the architecture and batch size.
| Method | Normalises over | Best for | Key limitation |
|---|---|---|---|
| BatchNorm | Batch + spatial dims per channel | CNNs, large batch MLP | Breaks with batch_size=1; train/eval difference |
| LayerNorm | All features per sample | Transformers, NLP, RNNs | Slower than BN on large spatial dims |
| InstanceNorm | Spatial dims per channel per sample | Style transfer, GAN | Loses channel statistics |
| GroupNorm | Spatial dims per group of channels per sample | Object detection, small batch | Requires choosing n_groups |
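To make "which dimensions the statistics are computed over" concrete, the sketch below recomputes BatchNorm and LayerNorm by hand on a 2-D input and compares against the built-in layers. Affine parameters are disabled so only the normalisation itself is compared; the tensor sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 8)   # (batch N, features C)

# BatchNorm1d in training mode: mean/variance over the batch, per feature
bn = nn.BatchNorm1d(8, affine=False)
bn.train()
bn_manual = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + bn.eps)
bn_matches = torch.allclose(bn(x), bn_manual, atol=1e-5)

# LayerNorm: mean/variance over the features, per sample
ln = nn.LayerNorm(8, elementwise_affine=False)
ln_manual = (x - x.mean(dim=1, keepdim=True)) / torch.sqrt(
    x.var(dim=1, unbiased=False, keepdim=True) + ln.eps
)
ln_matches = torch.allclose(ln(x), ln_manual, atol=1e-5)

print(bn_matches, ln_matches)
```

The only difference between the two manual formulas is `dim=0` versus `dim=1`, which is exactly the distinction the table draws.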
import torch
import torch.nn as nn
# BatchNorm1d: normalise over batch for FC layers
# Input: (N, C) or (N, C, L)
bn = nn.BatchNorm1d(num_features=128)
# LayerNorm: normalise over feature dim(s) — no dependency on batch
# Input: (*, normalized_shape) — last dims are normalised
ln = nn.LayerNorm(normalized_shape=128) # used in transformers
ln_2d = nn.LayerNorm([128, 8, 8]) # can normalise spatial too
# GroupNorm: split channels into groups, normalise per group per sample
# Input: (N, C, *) — C must be divisible by num_groups
gn = nn.GroupNorm(num_groups=8, num_channels=128)
# InstanceNorm: each sample, each channel independently
inst = nn.InstanceNorm2d(num_features=128)
# Example: why LayerNorm is used in transformers
d_model = 512
x = torch.randn(4, 20, d_model) # (batch, seq_len, d_model)
# BatchNorm would normalise over batch and seq_len per feature dim, so its
# statistics depend on batch composition, padding and sequence length
# (noisy for small batches and variable-length sequences)
# LayerNorm normalises over d_model for each (batch, seq) position independently
print(ln(x).shape) # (4, 20, 512) — each position normalised independently
An autoencoder is a neural network trained to reconstruct its input through a bottleneck. The encoder f: X → Z maps inputs to a lower-dimensional latent space Z, and the decoder g: Z → X̂ reconstructs the input. Training minimises the reconstruction loss (e.g. MSE for continuous inputs, binary cross-entropy for binary) without any labels — it is an unsupervised learning technique.
The bottleneck forces the encoder to learn a compressed, information-dense representation. A well-trained latent space can be used for: (1) dimensionality reduction and visualisation (better than PCA for non-linear data); (2) anomaly detection (normal samples reconstruct well; anomalies have high reconstruction error); (3) de-noising (train with noisy input, clean target — denoising autoencoders); (4) generative modelling (Variational Autoencoders / VAEs impose a probabilistic structure on Z that enables generation).
import torch
import torch.nn as nn
class Autoencoder(nn.Module):
def __init__(self, input_dim=784, latent_dim=32):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(input_dim, 256), nn.ReLU(),
nn.Linear(256, latent_dim)
)
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 256), nn.ReLU(),
nn.Linear(256, input_dim), nn.Sigmoid() # pixel values in [0,1]
)
def forward(self, x):
z = self.encoder(x)
return self.decoder(z)
ae = Autoencoder()
optimizer = torch.optim.Adam(ae.parameters(), lr=1e-3)
criterion = nn.MSELoss()
for X_batch, _ in loader: # labels not used!
X_flat = X_batch.view(X_batch.size(0), -1) # flatten images
X_hat = ae(X_flat)
loss = criterion(X_hat, X_flat)
optimizer.zero_grad(); loss.backward(); optimizer.step()
# Anomaly detection at inference:
ae.eval()
with torch.no_grad():
X_hat = ae(test_samples)
recon_error = ((test_samples - X_hat) ** 2).mean(dim=1)
# High recon_error => anomalous sample
Reading loss curves is one of the most important practical skills in deep learning. The shape of the training and validation loss over time reveals the failure mode and guides the fix.
| Loss curve shape | Diagnosis | Likely fix |
|---|---|---|
| Loss is NaN from the start | Exploding gradients or bad init | Gradient clipping, lower lr, check data for inf/NaN |
| Loss doesn't decrease at all | Vanishing gradient, lr too low, dead neurons | Check activations, raise lr, use He init + ReLU |
| Loss decreases then plateaus early | Learning rate too high or model too small | Reduce lr / lr schedule, increase capacity |
| Train loss low, val loss high (large gap) | Overfitting | More regularisation: dropout, weight decay, augmentation, early stopping |
| Both losses plateau at high value | Underfitting (high bias) | Increase model capacity, train longer, reduce regularisation |
| Loss oscillates wildly | Learning rate too high | Reduce lr, use lr schedule, check batch size |
import torch
import torch.nn as nn
# Checking for gradient issues
model = nn.Linear(10, 5)
criterion = nn.MSELoss() # assumed loss; substitute the task's criterion
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step, (X, y) in enumerate(loader):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    # Check for NaN/Inf in loss before backpropagating
    if not torch.isfinite(loss):
        print(f'Step {step}: non-finite loss = {loss.item()}')
        break
    loss.backward()
    # Monitor gradient norms
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.norm(2).item() ** 2
    total_norm = total_norm ** 0.5
    if step % 100 == 0:
        print(f'Step {step}: loss={loss.item():.4f} grad_norm={total_norm:.4f}')
    optimizer.step()
# Check dead ReLU neurons: units whose output is zero for EVERY sample in X
def count_dead_neurons(model, X):
    dead_fractions = []
    def hook(m, inp, out):
        # reduce over the batch dim first, then average over units
        dead_fractions.append((out <= 0).all(dim=0).float().mean().item())
    handles = [l.register_forward_hook(hook)
               for l in model.modules() if isinstance(l, nn.ReLU)]
    with torch.no_grad():
        model(X)
    for h in handles:
        h.remove()
    return dead_fractions # fraction of dead neurons per ReLU layer
A GAN consists of two competing networks: a generator G that maps random noise z ~ p(z) to fake data samples, and a discriminator D that classifies inputs as real or fake. They play a minimax game with objective: min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]. At the Nash equilibrium, G produces samples from the true data distribution and D outputs 0.5 for every input (cannot distinguish real from fake).
In practice, GANs suffer from several well-known training challenges: mode collapse (G learns to produce only a subset of modes of the data distribution); training instability (the minimax game does not converge reliably); and vanishing generator gradient (when D becomes too good early on, it correctly classifies fake samples with near-certainty, giving G near-zero gradient signal). These led to many GAN variants — DCGAN (convolutional architecture), WGAN (Wasserstein distance instead of JS divergence), and progressive growing GANs.
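The vanishing generator gradient can be seen from the derivative of the loss with respect to the discriminator's logit a = logit of D(G(z)). The sketch below compares the original minimax term log(1 - D) with the commonly used non-saturating alternative -log D; the function names and the logit value -8 are illustrative assumptions.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a: float) -> float:
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a: float) -> float:
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

logit = -8.0   # discriminator is almost certain the sample is fake
g_sat = grad_saturating(logit)
g_non_sat = grad_non_saturating(logit)
print(g_sat, g_non_sat)
```

When the discriminator confidently rejects a fake sample, the minimax gradient is close to zero while the non-saturating gradient stays close to -1. Training the generator with BCE against "real" labels, a common practical choice, corresponds to the non-saturating form.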
import torch
import torch.nn as nn
latent_dim, img_dim = 100, 784
# Generator: noise -> fake image
generator = nn.Sequential(
nn.Linear(latent_dim, 256), nn.ReLU(),
nn.Linear(256, 512), nn.ReLU(),
nn.Linear(512, img_dim), nn.Tanh()
)
# Discriminator: image -> real (1) or fake (0)
discriminator = nn.Sequential(
nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
nn.Linear(512, 256), nn.LeakyReLU(0.2),
nn.Linear(256, 1) # raw logit; use BCEWithLogitsLoss
)
criterion = nn.BCEWithLogitsLoss()
opt_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
for real_imgs, _ in loader:
real_imgs = real_imgs.view(-1, img_dim)
bs = real_imgs.size(0)
# Train Discriminator
z = torch.randn(bs, latent_dim)
fake_imgs = generator(z).detach() # detach: don't update G here
loss_D = (criterion(discriminator(real_imgs), torch.ones(bs, 1))
+ criterion(discriminator(fake_imgs), torch.zeros(bs, 1)))
opt_D.zero_grad(); loss_D.backward(); opt_D.step()
# Train Generator
z = torch.randn(bs, latent_dim)
loss_G = criterion(discriminator(generator(z)), torch.ones(bs, 1))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
Introduced in PyTorch 2.0, torch.compile applies just-in-time (JIT) compilation to a PyTorch model or function. Rather than executing each operation eagerly (PyTorch's default), it captures the computation as a graph the first time the code runs, optimises it (fusing operations, eliminating redundant memory reads/writes), and compiles it to efficient machine code using a backend (TorchInductor by default, which generates Triton kernels for GPUs and C++ code for CPUs).
The primary benefit is kernel fusion: instead of launching a separate GPU kernel for each operation (e.g. separate kernels for matrix multiply, add bias, and ReLU), the compiler fuses them into a single kernel that reads and writes GPU memory once. GPU memory bandwidth is often the bottleneck for transformer-style models, so reducing memory round-trips directly translates to throughput gains — typically 10–50% speedup for training and inference on modern hardware.
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(1024, 1024), nn.GELU(),
nn.Linear(1024, 512), nn.GELU(),
nn.Linear(512, 10)
)
# Compile the model — first call triggers compilation (may take 30s+)
compiled_model = torch.compile(model)
# Usage is identical to a regular model
x = torch.randn(256, 1024).cuda()
compiled_model = compiled_model.cuda()
out = compiled_model(x) # warm-up: triggers compilation
out = compiled_model(x) # subsequent calls use compiled kernels
# Compilation modes (trade-off: compile time vs runtime speed)
model_default = torch.compile(model) # balanced default
model_reduce = torch.compile(model, mode='reduce-overhead') # uses CUDA graphs to cut per-call overhead
model_max = torch.compile(model, mode='max-autotune') # slowest to compile, usually fastest to run
# Measure speedup (CUDA calls are asynchronous: synchronize before reading the clock)
import time
x = torch.randn(512, 1024, device='cuda')
for _ in range(5): model(x) # warm-up
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100): model(x)
torch.cuda.synchronize()
print('Eager:', time.perf_counter() - t0)
for _ in range(5): compiled_model(x) # warm-up (may recompile for the new batch size)
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100): compiled_model(x)
torch.cuda.synchronize()
print('Compiled:', time.perf_counter() - t0)
Self-attention is permutation equivariant — swapping two positions in the input produces the same output with those two positions swapped, because attention treats all positions symmetrically. Without positional information, a transformer cannot distinguish 'The dog bit the man' from 'The man bit the dog'. Positional encodings inject sequence order information into the token embeddings before they enter the transformer.
The original 'Attention is All You Need' paper uses sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^{2i/d}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d}), where pos is the position and i is the dimension index. Each dimension pair oscillates at a different frequency, giving a unique fingerprint to every position. The key properties: (1) each position has a unique encoding; (2) for any fixed offset k, the encoding of position pos+k is a linear function (a rotation) of the encoding of position pos, allowing the model to reason about relative distances; (3) it is defined for every position, so it can be evaluated at sequence lengths unseen during training.
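The relative-position property follows from the angle-addition identities: shifting by k rotates each (sin, cos) pair by a fixed angle that depends on k but not on pos. The sketch below checks this numerically for one frequency; the helper names and the chosen values of d_model, i, pos and k are illustrative assumptions.

```python
import math

def pe_pair(pos: int, omega: float) -> tuple:
    """(sin, cos) pair of the encoding at one frequency."""
    return (math.sin(omega * pos), math.cos(omega * pos))

def shift(pair: tuple, k: int, omega: float) -> tuple:
    """Apply the fixed 2x2 rotation for offset k; it does not depend on pos."""
    s, c = pair
    ck, sk = math.cos(omega * k), math.sin(omega * k)
    return (s * ck + c * sk, -s * sk + c * ck)

d_model, i = 16, 3
omega = 1.0 / (10000 ** (2 * i / d_model))
pos, k = 7, 5

shifted = shift(pe_pair(pos, omega), k, omega)   # rotate PE(pos) by offset k
direct = pe_pair(pos + k, omega)                 # compute PE(pos + k) directly
print(shifted, direct)
```

Both routes give the same pair, so a single linear map per offset k relates the encodings of any two positions that are k apart.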
import torch
import math
def sinusoidal_positional_encoding(max_len, d_model):
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(
torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
)
pe[:, 0::2] = torch.sin(position * div_term) # even dims: sin
pe[:, 1::2] = torch.cos(position * div_term) # odd dims: cos
return pe # (max_len, d_model)
import torch.nn as nn
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=512, dropout=0.1):
super().__init__()
self.dropout = nn.Dropout(dropout)
pe = sinusoidal_positional_encoding(max_len, d_model)
self.register_buffer('pe', pe) # not a parameter; saved with model
def forward(self, x): # x: (batch, seq_len, d_model)
x = x + self.pe[:x.size(1)] # add pos encoding to each embedding
return self.dropout(x)
# Modern alternative: Rotary Position Embeddings (RoPE)
# Used in LLaMA, Mistral — encodes relative rather than absolute position
# Applied directly to Q and K matrices before attention computation
Deep learning has many hyperparameters, but they are not equally important. Empirical research and practitioner experience have established a rough hierarchy of impact. Tuning in the wrong order wastes compute: finding the optimal dropout rate is pointless if the learning rate is still wildly off.
| Priority | Hyperparameter | Typical search range |
|---|---|---|
| 1 (highest) | Learning rate | Log-uniform: 1e-5 to 1e-1 |
| 1 | Batch size | 32, 64, 128, 256, 512 |
| 2 | Model architecture (depth, width) | Task-specific; start from established baselines |
| 2 | Optimizer (Adam vs SGD + momentum) | Usually Adam/AdamW first |
| 3 | Weight decay / L2 penalty | Log-uniform: 1e-5 to 1e-1 |
| 3 | LR schedule and warmup | Cosine with 5-10% warmup steps |
| 4 (lower) | Dropout rate | 0.0, 0.1, 0.2, 0.5 |
| 4 | Batch norm epsilon / momentum | Rarely tuned; defaults usually fine |
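The "log-uniform" ranges in the table mean sampling the exponent uniformly, so every decade gets equal attention. A minimal standard-library sketch (the helper name `sample_log_uniform` and the sample count are assumptions):

```python
import math
import random

random.seed(0)

def sample_log_uniform(low: float, high: float) -> float:
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    return 10 ** random.uniform(math.log10(low), math.log10(high))

samples = [sample_log_uniform(1e-5, 1e-1) for _ in range(10000)]

# Half of the four decades lie below 1e-3, so about half the samples should too
frac_below_1e3 = sum(s < 1e-3 for s in samples) / len(samples)
print(frac_below_1e3)
```

A plain uniform draw on [1e-5, 1e-1] would place only about 1% of samples below 1e-3, leaving the small learning rates almost unexplored.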
import optuna
import torch
import torch.nn as nn
def objective(trial):
# Optuna suggests hyperparameters — log-uniform search for lr
lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
wd = trial.suggest_float('weight_decay', 1e-5, 1e-1, log=True)
n_layers = trial.suggest_int('n_layers', 2, 6)
hidden_dim = trial.suggest_categorical('hidden_dim', [128, 256, 512])
dropout = trial.suggest_float('dropout', 0.0, 0.5)
layers = []
in_dim = 784
for _ in range(n_layers):
layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(),
nn.Dropout(dropout)]
in_dim = hidden_dim
model = nn.Sequential(*layers, nn.Linear(in_dim, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
val_acc = train_and_evaluate(model, optimizer, n_epochs=10)
return val_acc
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best params:', study.best_params)
Encoder-decoder (seq2seq) architectures handle tasks where the input and output are sequences of potentially different lengths — machine translation, summarisation, speech recognition, image captioning. The encoder processes the full input sequence and produces a context representation; the decoder generates the output sequence token by token, conditioning each prediction on the context and all previously generated tokens.
In transformer-based seq2seq, the encoder uses bidirectional self-attention (each position attends to all input positions), while the decoder uses two attention mechanisms: masked self-attention (each output position can only attend to previous output positions, preserving the autoregressive property) and cross-attention (each decoder position attends to all encoder output positions to draw relevant information from the input).
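The effect of the causal mask in masked self-attention can be shown on a tiny score matrix. The sketch builds the mask by hand with `torch.triu`; the sequence length and the all-zero scores are assumptions chosen so the result is easy to read.

```python
import torch

L = 4
# -inf above the diagonal blocks attention to future positions; 0 elsewhere
mask = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)

scores = torch.zeros(L, L)                    # toy attention scores (all equal)
attn = torch.softmax(scores + mask, dim=-1)   # each row is one query position

print(attn)
```

Row i spreads its attention evenly over positions 0..i and assigns exactly zero weight to later positions, which is what preserves the autoregressive property.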
import torch
import torch.nn as nn
# PyTorch's built-in Transformer (encoder-decoder)
transformer = nn.Transformer(
d_model=512,
nhead=8,
num_encoder_layers=6,
num_decoder_layers=6,
dim_feedforward=2048,
dropout=0.1,
batch_first=True
)
# Source and target sequences
src = torch.randn(4, 20, 512) # (batch, src_len, d_model)
tgt = torch.randn(4, 15, 512) # (batch, tgt_len, d_model)
# Causal mask: prevent decoder from attending to future target tokens
tgt_len = tgt.size(1)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_len)
out = transformer(src, tgt, tgt_mask=tgt_mask)
print(out.shape) # (4, 15, 512)
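# Teacher forcing: at training time, feed ground-truth previous tokens
# to the decoder (not its own previous predictions)
# At inference: autoregressive decoding using the model's own previous output.
# nn.Transformer has no embedding or output layer, so this sketch assumes two
# helper modules: tgt_embed (token IDs -> d_model vectors) and
# out_proj (d_model -> vocabulary logits). Batch size 1 is assumed.
def greedy_decode(model, src, max_len, sos_idx, eos_idx, tgt_embed, out_proj):
    memory = model.encoder(src) # src: (1, src_len, d_model)
    ys = torch.tensor([[sos_idx]]) # (1, 1) token IDs generated so far
    for _ in range(max_len):
        mask = nn.Transformer.generate_square_subsequent_mask(ys.size(1))
        out = model.decoder(tgt_embed(ys), memory, tgt_mask=mask)
        next_token = out_proj(out[:, -1]).argmax(dim=-1) # (1,)
        ys = torch.cat([ys, next_token.unsqueeze(0)], dim=1)
        if next_token.item() == eos_idx:
            break
    return ys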
Quantization reduces model size and inference latency by representing weights and activations in lower-precision integer formats (INT8, INT4, INT2) rather than FP32 or FP16. A 32-bit float weight is replaced by an 8-bit integer plus a scale factor and zero-point: x_float = scale × (x_int - zero_point). This yields 4× memory reduction for INT8, enabling larger models to fit on limited hardware and significantly faster integer arithmetic on CPUs and mobile accelerators.
Three main approaches: (1) Post-Training Quantization (PTQ) — quantize a trained FP32 model without retraining, using a small calibration dataset to determine optimal scale factors; (2) Quantization-Aware Training (QAT) — simulate quantization noise during training (fake quantization), allowing the model to adapt and typically recovering the accuracy lost by PTQ; (3) Dynamic quantization — weights are quantized ahead of time, activations quantized dynamically at inference (simplest, good baseline for RNNs).
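The affine mapping x_float = scale × (x_int - zero_point) can be worked through in plain Python. The sketch below derives scale and zero-point from an assumed value range [-1.0, 3.0] and round-trips a few numbers through INT8; all function names are illustrative and not part of any library.

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Scale and zero-point that map [x_min, x_max] onto [qmin, qmax]."""
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))      # clamp to the INT8 range

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)     # x_float = scale * (x_int - zero_point)

scale, zp = quant_params(-1.0, 3.0)
values = [-1.0, -0.2, 0.5, 1.7, 3.0]
round_trip = [dequantize(quantize(v, scale, zp), scale, zp) for v in values]
max_error = max(abs(v - r) for v, r in zip(values, round_trip))
print(scale, zp, max_error)
```

For values inside the calibrated range the round-trip error is bounded by half a quantization step (scale / 2), which is why choosing the range well during calibration matters.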
import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic, prepare, convert
# ─── Dynamic Quantization (simplest: weights stored as INT8, activations quantized on the fly) ───
model_fp32 = nn.LSTM(input_size=64, hidden_size=128)
model_int8 = quantize_dynamic(
model_fp32,
qconfig_spec={nn.Linear, nn.LSTM},
dtype=torch.qint8
)
print('FP32 size:', sum(p.numel() * 4 for p in model_fp32.parameters()), 'bytes')
# INT8 model is ~4x smaller
# ─── Post-Training Static Quantization ───
# QuantStub/DeQuantStub mark where tensors enter and leave the INT8 region
model = nn.Sequential(
    torch.quantization.QuantStub(),
    nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10),
    torch.quantization.DeQuantStub(),
)
model.eval() # static quantization is applied to a model in eval mode
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = prepare(model) # insert observer modules
# Calibrate with representative data
with torch.no_grad():
    for X_cal, _ in calibration_loader:
        model_prepared(X_cal)
model_int8 = convert(model_prepared) # convert to INT8
# ─── Modern approach: bitsandbytes / llm.int8() for LLMs ───
# 8-bit quantization of LLM weights with minimal accuracy loss
# Allows running 7B+ parameter models on consumer GPUs
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained('gpt2', load_in_8bit=True)
A well-structured training loop separates concerns cleanly: data loading, forward pass, loss computation, backpropagation, gradient management, metric tracking, and model persistence. Each step has specific pitfalls that silently degrade results.
import torch
import torch.nn as nn
from torch import autocast  # torch.autocast accepts device_type
from torch.cuda.amp import GradScaler
from torch.utils.data import DataLoader
def train_epoch(model, loader, optimizer, criterion, device, scaler):
    model.train()
    total_loss, n_correct, n_total = 0.0, 0, 0
    for X, y in loader:
        X, y = X.to(device, non_blocking=True), y.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True) # frees grads instead of zero-filling them
        with autocast(device_type='cuda', dtype=torch.float16):
            logits = model(X)
            loss = criterion(logits, y)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item() * X.size(0)
        n_correct += (logits.argmax(1) == y).sum().item()
        n_total += X.size(0)
    return total_loss / n_total, n_correct / n_total
@torch.no_grad()
def eval_epoch(model, loader, criterion, device):
    model.eval()
    total_loss, n_correct, n_total = 0.0, 0, 0
    for X, y in loader:
        X, y = X.to(device, non_blocking=True), y.to(device, non_blocking=True)
        logits = model(X)
        loss = criterion(logits, y)
        total_loss += loss.item() * X.size(0)
        n_correct += (logits.argmax(1) == y).sum().item()
        n_total += X.size(0)
    return total_loss / n_total, n_correct / n_total
# Main training loop
best_val_acc = 0
for epoch in range(n_epochs):
    tr_loss, tr_acc = train_epoch(model, train_loader, optimizer,
                                  criterion, device, scaler)
    vl_loss, vl_acc = eval_epoch(model, val_loader, criterion, device)
    scheduler.step()
    if vl_acc > best_val_acc:
        best_val_acc = vl_acc
        torch.save(model.state_dict(), 'best.pt')
    print(f'Epoch {epoch:3d}: tr={tr_loss:.4f}/{tr_acc:.3f} '
          f'val={vl_loss:.4f}/{vl_acc:.3f}')
Batch size controls the trade-off between gradient estimate quality and training speed. With batch size B, the gradient is estimated as the average of the per-sample loss gradients over B samples, and the variance of this estimate is σ²/B, where σ² is the per-sample gradient variance. Larger batches give lower-variance (more accurate) gradient estimates, but with diminishing returns: doubling the batch size halves the variance, which shrinks the gradient noise (standard deviation) by only a factor of √2, while the compute cost per step doubles.
Generalisation effect: empirically, large batches often lead to sharper minima that generalise worse than the flatter minima found by small batches. The noise in small-batch SGD acts as implicit regularisation: the stochastic gradient trajectory tends to find broader minima, which are more robust to small perturbations. This is the 'large batch training problem'. Mitigations: the linear scaling rule (scale lr proportionally with batch size) and learning-rate warmup. Gradient accumulation is a separate tool: it reproduces the gradient of a large batch when that batch does not fit in memory, so it inherits large-batch behaviour rather than small-batch noise.
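The σ²/B relationship can be checked with a small simulation in which each "per-sample gradient" is a standard normal draw, so σ² = 1. The helper name, seed and trial count below are assumptions for illustration only.

```python
import random
import statistics

random.seed(0)

def batch_mean_variance(batch_size: int, trials: int = 4000, sigma: float = 1.0) -> float:
    """Empirical variance of the mean of `batch_size` noisy per-sample gradients."""
    means = [
        statistics.fmean(random.gauss(0.0, sigma) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

var_8 = batch_mean_variance(8)     # theory: 1/8
var_32 = batch_mean_variance(32)   # theory: 1/32
ratio = var_8 / var_32             # theory: 4
print(var_8, var_32, ratio)
```

Quadrupling the batch size cuts the variance by roughly four, but the noise level (its square root) only halves, which is the diminishing-returns effect in numbers.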
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
# Gradient accumulation: simulate batch_size=1024 with micro_batch=32
accumulation_steps = 32 # effective_batch_size = 32 * 32 = 1024
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
optimizer.zero_grad()
for step, (X, y) in enumerate(loader):
# Forward and backward every micro-batch
loss = criterion(model(X), y) / accumulation_steps # scale by 1/K
loss.backward() # gradients accumulate, not cleared
if (step + 1) % accumulation_steps == 0:
# Clip and step only after accumulating K micro-batches
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
# Linear scaling rule: if you double batch size, double the lr
base_lr = 1e-3
base_batch = 256
new_batch = 1024
new_lr = base_lr * (new_batch / base_batch) # 4e-3
# But use warmup to stabilise the larger lr at the start
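That dividing each micro-batch loss by K reproduces the full-batch gradient can be confirmed on a toy problem. This sketch assumes equally sized micro-batches and a mean-reduced loss; the tensor shapes and names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(8, 3), torch.randn(8, 1)
model = nn.Linear(3, 1)
criterion = nn.MSELoss()   # mean reduction

# Reference: one backward pass over the full batch of 8
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()
model.zero_grad()

# Accumulation: K micro-batches of 2, each loss divided by K
K = 4
for X_mb, y_mb in zip(X.chunk(K), y.chunk(K)):
    (criterion(model(X_mb), y_mb) / K).backward()   # gradients add up in .grad
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print(grads_match)
```

Without the division by K the accumulated gradient would be K times too large, which silently acts like a K-times larger learning rate.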
Each layer type encodes different structural assumptions (inductive biases) about the data. Using a layer whose assumptions match the data's structure allows the model to learn faster and with less data than a generic alternative.
| Data type | Structure | Recommended layer | Reason |
|---|---|---|---|
| Tabular | No spatial/sequential structure | Linear (MLP) | Column order carries no meaning; no locality or weight sharing to exploit |
| Images | 2D spatial locality + translation equivariance | Conv2d | Same pattern anywhere in image; fewer params than FC |
| Text/sequences | Long-range dependencies, variable length | Transformer (self-attention) | O(1) path length between any two positions |
| Short sequences / time series | Local temporal patterns | Conv1d or LSTM | Local: Conv1d; long-range: LSTM |
| Graphs | Irregular node connectivity | Graph Conv (GCN/GAT) | Aggregates neighbor information per node |
| Point clouds | Permutation invariant 3D | PointNet / sparse conv | Must handle unordered sets |
import torch
import torch.nn as nn
# Tabular data: simple MLP
mlp = nn.Sequential(
nn.Linear(30, 128), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(128, 64), nn.ReLU(),
nn.Linear(64, 1)
)
# Image: CNN
cnn = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
nn.AdaptiveAvgPool2d(1), # global average pooling -> (B, 64, 1, 1)
nn.Flatten(),
nn.Linear(64, 10)
)
# Text: embedding + transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
d_model=256, nhead=4, dim_feedforward=512,
dropout=0.1, batch_first=True
)
text_model = nn.Sequential(
nn.Embedding(10000, 256),
nn.TransformerEncoder(encoder_layer, num_layers=4)
)
# Time series: Conv1d (local patterns) or LSTM (sequential patterns)
ts_cnn = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=5, padding=2)
ts_rnn = nn.LSTM(input_size=1, hidden_size=64, batch_first=True)
The choice of evaluation metric should match the task's real-world objective, not just be the easiest to compute. The training loss and the evaluation metric are often different — models are trained with cross-entropy but evaluated with accuracy, F1, mAP, or BLEU depending on the application.
| Task | Primary metric | When it falls short |
|---|---|---|
| Classification (balanced) | Accuracy | Misleading on imbalanced classes |
| Classification (imbalanced) | F1 / AUC-ROC / PR-AUC | PR-AUC better than ROC-AUC for severe imbalance |
| Object detection | mAP (mean Average Precision) | Doesn't account for localisation precision at all scales |
| Regression | MAE / RMSE / R² | RMSE sensitive to outliers; R² can be negative |
| Machine translation | BLEU score | Doesn't capture semantic similarity |
| Language generation | Perplexity / ROUGE / BERTScore | Perplexity measures next-token prediction, not task quality or factual correctness |
| Segmentation | Intersection over Union (IoU / mIoU) | Sensitive to class imbalance |
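Why accuracy misleads on imbalanced classes is easiest to see with a degenerate classifier. The sketch below uses a made-up dataset with 5% positives and a model that always predicts the negative class; the data and variable names are assumptions for illustration.

```python
labels = [1] * 5 + [0] * 95    # 5% positive class
preds = [0] * 100              # classifier that always predicts negative

tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))

accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, recall, f1)
```

Accuracy looks strong even though the model never finds a single positive example; recall and F1 on the positive class expose the failure immediately.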
import torch
from torchmetrics import Accuracy, F1Score, AUROC, MeanSquaredError
# torchmetrics: handles accumulation across batches correctly
n_classes = 5
acc = Accuracy(task='multiclass', num_classes=n_classes)
f1 = F1Score(task='multiclass', num_classes=n_classes, average='macro')
auroc = AUROC(task='multiclass', num_classes=n_classes)
model.eval()
with torch.no_grad():
    for X, y in val_loader:
        logits = model(X)
        preds = logits.argmax(dim=1)
        probs = torch.softmax(logits, dim=1)
        acc.update(preds, y)
        f1.update(preds, y)
        auroc.update(probs, y)
print(f'Val Acc: {acc.compute():.4f}')
print(f'Val F1: {f1.compute():.4f}')
print(f'Val AUROC:{auroc.compute():.4f}')
# Manual accuracy (without torchmetrics)
all_preds, all_labels = [], []
with torch.no_grad():
    for X, y in val_loader:
        preds = model(X).argmax(1)
        all_preds.append(preds.cpu())
        all_labels.append(y.cpu())
preds = torch.cat(all_preds)
labels = torch.cat(all_labels)
accuracy = (preds == labels).float().mean()
print(f'Accuracy: {accuracy:.4f}')
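To make the "misleading on imbalanced classes" row of the table concrete, the sketch below scores a degenerate classifier that always predicts the majority class. The label counts (95 negatives, 5 positives) and the helper names are illustrative assumptions, and the metrics are computed by hand rather than with torchmetrics so that the arithmetic stays visible.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    # One-vs-rest F1 for a single class
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes):
    # Unweighted mean of per-class F1, so rare classes count equally
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical imbalanced validation set: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # classifier that always predicts the majority class

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, classes=[0, 1])
print(f"accuracy={acc:.3f}  macro-F1={mf1:.3f}")
```

Accuracy comes out at 0.95 even though the classifier never finds a positive, while macro-F1 drops to roughly 0.49 because the minority class contributes an F1 of zero.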
Research-time PyTorch models depend on Python's interpreter and PyTorch's eager execution mode, which add latency overhead and a dependency footprint that many production targets (C++ services, mobile, embedded devices) cannot accommodate. Two standard serialisation formats allow deploying PyTorch models without Python: TorchScript (PyTorch-native, supports dynamic shapes better) and ONNX (framework-agnostic, runs on TensorRT, OpenVINO, CoreML, ONNX Runtime across many hardware targets).
TorchScript compiles a PyTorch model into an intermediate representation (IR) that can run in C++ via LibTorch, without any Python dependency. It is created either via torch.jit.trace (records operations from a concrete example — doesn't handle data-dependent control flow) or torch.jit.script (analyzes Python source — handles control flow but requires type annotations and a subset of Python). ONNX export traces the model similarly and serialises it to the ONNX protobuf format, which can then be run on any ONNX-compatible runtime.
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 5)
).eval()
example_input = torch.randn(1, 10)
# ─── TorchScript: tracing ───────────────────────────────────────────
traced = torch.jit.trace(model, example_input)
traced.save('model_traced.pt')
# Load and run without original Python class:
loaded = torch.jit.load('model_traced.pt')
output = loaded(torch.randn(4, 10))
# ─── TorchScript: scripting (handles if/for in forward) ────────────
@torch.jit.script
def activation(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:
        return torch.relu(x)
    return torch.tanh(x)
# ─── ONNX export ────────────────────────────────────────────────────
torch.onnx.export(
model,
example_input,
'model.onnx',
input_names=['features'],
output_names=['logits'],
dynamic_axes={'features': {0: 'batch_size'}, # variable batch
'logits': {0: 'batch_size'}},
opset_version=17,
)
# Validate exported ONNX model
import onnx, onnxruntime as ort
onnx.checker.check_model('model.onnx')
sess = ort.InferenceSession('model.onnx')
result = sess.run(None, {'features': example_input.numpy()})
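The tracing limitation described above (no data-dependent control flow) is easiest to see by tracing a function that branches on its input. The following is a minimal sketch; the function name `branchy` and the two example tensors are illustrative assumptions. Tracing with a positive-sum input records only the ReLU path, so the traced function disagrees with eager execution on a negative-sum input.

```python
import warnings
import torch

def branchy(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent branch: the path taken depends on the values in x
    if x.sum() > 0:
        return torch.relu(x)
    return torch.tanh(x)

pos = torch.tensor([1.0, 2.0])    # sum > 0, takes the relu branch
neg = torch.tensor([-1.0, -2.0])  # sum < 0, takes the tanh branch in eager mode

with warnings.catch_warnings():
    # Tracing warns that the branch condition is being frozen; silenced for brevity
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(branchy, pos)

eager_out = branchy(neg)   # tanh applied
traced_out = traced(neg)   # relu applied, because only that path was recorded

print(eager_out)
print(traced_out)
```

The traced output is all zeros (ReLU of negative values), whereas eager execution applies tanh. This is the situation where torch.jit.script is the usual choice, since it compiles both branches.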
Knowledge distillation (Hinton et al., 2015) trains a small student network to mimic the output distribution of a large, accurate teacher network. Instead of training only on hard labels (the correct class as a one-hot vector), the student is also trained to match the teacher's soft probabilities — the full output distribution including small probabilities assigned to incorrect classes.
The soft probabilities carry richer information than hard labels: if the teacher assigns 0.7 to 'cat' and 0.25 to 'dog', this communicates that the image looks somewhat cat-like but also dog-like — a nuanced signal the student can learn from. A temperature parameter T sharpens or softens this distribution: p_i = exp(z_i/T) / Σ exp(z_j/T). Higher T produces a softer, more uniform distribution that exposes the teacher's confidence relationships across all classes, giving the student a richer gradient signal. The distillation loss combines the cross-entropy with hard labels and the KL divergence with the teacher's soft targets.
import torch
import torch.nn as nn
import torch.nn.functional as F
teacher = BigModel().eval() # pretrained, frozen
student = SmallModel() # to be trained
T = 3.0 # temperature — soften the distributions
alpha = 0.7 # weight for distillation vs hard-label loss
ce_loss = nn.CrossEntropyLoss()
kl_div_loss = nn.KLDivLoss(reduction='batchmean')
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
for X, y_hard in loader:
    # Teacher forward (no grad)
    with torch.no_grad():
        teacher_logits = teacher(X)
    # Student forward
    student_logits = student(X)
    # Hard-label cross-entropy
    loss_hard = ce_loss(student_logits, y_hard)
    # Soft-target KL divergence (temperature-scaled)
    student_soft = F.log_softmax(student_logits / T, dim=1)
    teacher_soft = F.softmax(teacher_logits / T, dim=1)
    loss_kl = kl_div_loss(student_soft, teacher_soft) * (T ** 2)
    # T^2 scaling: compensates for the T-scaled gradients
    loss = alpha * loss_kl + (1 - alpha) * loss_hard
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
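To see what the temperature does to the teacher's distribution, the sketch below evaluates p_i = exp(z_i/T) / Σ exp(z_j/T) by hand at T = 1 and T = 3. The three logits are hypothetical values chosen for illustration (think of them as scores for cat, dog and car), and the helper name is an assumption, not part of any library.

```python
import math

def softmax_with_temperature(logits, T):
    # p_i = exp(z_i / T) / sum_j exp(z_j / T), with max-subtraction for stability
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes: cat, dog, car
logits = [6.0, 3.0, 0.0]

p_sharp = softmax_with_temperature(logits, T=1.0)
p_soft = softmax_with_temperature(logits, T=3.0)

print([round(p, 3) for p in p_sharp])
print([round(p, 3) for p in p_soft])
```

At T = 1 the top class takes about 95% of the mass and the third class is nearly invisible (about 0.2%). At T = 3 the same logits give roughly 67%, 24% and 9%, so the ranking among the incorrect classes becomes a usable training signal for the student.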
Self-supervised learning (SSL) is a form of unsupervised learning where the model is trained on a pretext task defined entirely from the data itself — no human-provided labels. The learned representations can then be transferred to downstream tasks with few or no labels (linear probe, fine-tuning).
Contrastive methods like SimCLR define a pretext task based on augmentation invariance: for each input, create two random augmented views (crops, colour jitter, flips) and train the model so that representations of the two views of the same image are similar (positive pair), while representations of views from different images are dissimilar (negative pairs). The NT-Xent loss (normalised temperature-scaled cross-entropy) implements this: for a batch of N images (2N views), each view must identify its matching view among the 2N − 1 other views, of which one is the positive and 2(N − 1) are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T
# Augmentation pipeline: two random views of the same image
augment = T.Compose([
T.RandomResizedCrop(224),
T.RandomHorizontalFlip(),
T.ColorJitter(0.4, 0.4, 0.4, 0.1),
T.RandomGrayscale(p=0.2),
T.GaussianBlur(kernel_size=23),
T.ToTensor(),
])
class SimCLRLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.tau = temperature
    def forward(self, z1, z2):
        # L2-normalise projections to unit sphere
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        # All 2N representations as rows
        z = torch.cat([z1, z2], dim=0)  # (2N, d)
        # Pairwise cosine similarities / temperature
        sim_matrix = z @ z.T / self.tau  # (2N, 2N)
        # Positive for row i is the other view of the same image: i + N or i - N
        n = z1.size(0)
        labels = torch.cat([torch.arange(n, 2*n), torch.arange(n)]).to(z.device)
        # Exclude self-similarity by setting the diagonal to -inf.
        # Dropping the diagonal entries instead would shift the column
        # indices and leave the labels pointing at the wrong columns.
        self_mask = torch.eye(2*n, dtype=torch.bool, device=z.device)
        sim_matrix = sim_matrix.masked_fill(self_mask, float('-inf'))
        return F.cross_entropy(sim_matrix, labels)
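# After pretraining: linear evaluation
# Freeze backbone, train linear head on downstream task
backbone = resnet50_pretrained  # placeholder: pretrained encoder producing 2048-d features
backbone.eval()  # keep BatchNorm running statistics fixed during linear evaluation
for p in backbone.parameters():
    p.requires_grad = False
linear_head = nn.Linear(2048, num_classes)  # num_classes: set for the downstream task
optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)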
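A quick sanity check for any NT-Xent implementation is that well-aligned views should score a much lower loss than unrelated ones. The following is a minimal functional sketch under stated assumptions: the function name `nt_xent`, the batch size of 8, the embedding width of 16 and the noise scale are all illustrative, and self-similarities are excluded by filling the diagonal with -inf so that the positive indices stay valid.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.07):
    # Stack both views and L2-normalise so dot products are cosine similarities
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d)
    n = z1.size(0)
    sim = z @ z.T / tau                                  # (2N, 2N)
    # A view must never be matched with itself
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    # Row i's positive is the other view of the same image
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

torch.manual_seed(0)
z1 = torch.randn(8, 16)

# Second view almost identical to the first: positives are easy to find
loss_aligned = nt_xent(z1, z1 + 0.01 * torch.randn(8, 16))
# Second view unrelated to the first: positives look like any other negative
loss_random = nt_xent(z1, torch.randn(8, 16))

print(f"aligned: {loss_aligned.item():.4f}  random: {loss_random.item():.4f}")
```

If the aligned loss is not clearly below the random one, the positive indices are probably misaligned with the similarity matrix.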
This question tests whether you understand the full PyTorch workflow: defining a custom nn.Module, implementing forward, and running the standard train loop. It is a common practical screen in ML engineering interviews.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# ─── 1. Define the model ────────────────────────────────────────────
class FeedForwardNet(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int,
                 dropout: float = 0.1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, out_dim)
        self._init_weights()
    def _init_weights(self):
        nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
        nn.init.zeros_(self.fc1.bias)
        nn.init.xavier_uniform_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.fc1(x)))
        x = self.drop(x)
        return self.fc2(x)
# ─── 2. Create data ──────────────────────────────────────────────────
torch.manual_seed(42)
X = torch.randn(1000, 20)
y = (X[:, 0] + X[:, 1] > 0).long() # binary label
ds = TensorDataset(X, y)
loader = DataLoader(ds, batch_size=64, shuffle=True)
# ─── 3. Instantiate model, loss, optimizer ───────────────────────────
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FeedForwardNet(20, 64, 2, dropout=0.1).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
# ─── 4. Training loop ─────────────────────────────────────────────────
for epoch in range(30):
    model.train()
    epoch_loss = 0.0
    for X_b, y_b in loader:
        X_b, y_b = X_b.to(device), y_b.to(device)
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(X_b), y_b)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch % 5 == 0:
        print(f'Epoch {epoch:3d}: loss={epoch_loss / len(loader):.4f}')
Key interview checkpoints: (1) subclass nn.Module and call super().__init__(); (2) define all layers as attributes in __init__; (3) implement forward; (4) follow the zero-grad → forward → loss → backward → step order; (5) call model.train() before training and model.eval() before evaluation.
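Checkpoint (5) matters because Dropout and BatchNorm behave differently in the two modes. The sketch below illustrates this with a small stand-in network; the layer sizes, the dropout rate of 0.5 and the variable names are illustrative assumptions rather than the model defined above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in network with the two mode-dependent layers
net = nn.Sequential(
    nn.Linear(20, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(64, 2),
)
x = torch.randn(16, 20)

# Training mode: dropout samples a fresh mask on every forward pass
net.train()
train_a, train_b = net(x), net(x)

# Evaluation mode: dropout is a no-op and BatchNorm uses its running statistics
net.eval()
with torch.no_grad():
    eval_a, eval_b = net(x), net(x)

train_differs = not torch.equal(train_a, train_b)
eval_matches = torch.equal(eval_a, eval_b)
print(train_differs, eval_matches)
```

Forgetting model.eval() therefore makes validation numbers noisy and batch-dependent, which is why interviewers tend to look for it explicitly.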
