
Python / PyTorch Fundamentals Interview Questions

1. What is PyTorch and what are its key advantages over other deep learning frameworks?
2. What is a PyTorch tensor and how does it differ from a NumPy array?
3. What are the most important tensor operations in PyTorch?
4. What are tensor data types (dtypes) in PyTorch and why do they matter?
5. How does broadcasting work in PyTorch and what are the rules?
6. What is autograd in PyTorch and how does it compute gradients?
7. What is the computation graph in PyTorch and how does the dynamic graph differ from a static graph?
8. How do torch.no_grad() and tensor.detach() differ, and when do you use each?
9. What is nn.Module and how do you build a custom neural network in PyTorch?
10. What are nn.Sequential and other container modules in PyTorch?
11. What built-in layers does PyTorch's nn module provide and how do you use the most common ones?
12. What are activation functions in PyTorch and how do you apply them?
13. What are the most important loss functions in PyTorch and when do you use each?
14. What optimizers does PyTorch provide and how do you configure them?
15. What are learning rate schedulers in PyTorch and how do you use them?
16. What are the most common built-in layers in torch.nn and what do they do?
17. How do you initialise weights in a PyTorch model?
18. What loss functions does PyTorch provide and when do you use each?
19. What optimizers does PyTorch provide and how do you choose between them?
20. What are learning rate schedulers in PyTorch and how do you use them?
21. What activation functions are commonly used in PyTorch and how do you choose between them?
22. What loss functions does PyTorch provide and how do you choose the right one?
23. What optimizers does PyTorch provide and what is the difference between SGD, Adam, and AdamW?
24. What is the standard PyTorch training loop and what does each step do?
25. What are Dataset and DataLoader in PyTorch and how do they work together?
26. How do you move tensors and models between CPU and GPU in PyTorch?
27. What is the difference between model.parameters() and model.state_dict() in PyTorch?
28. How do you save and load PyTorch models correctly, including full training checkpoints?
29. What is overfitting and what regularization techniques does PyTorch support to address it?
30. What is the vanishing/exploding gradient problem and how do you detect and fix it in PyTorch?
31. What is weight initialization in PyTorch and why does it matter?
32. What is the difference between nn.Parameter and a regular tensor attribute in nn.Module?
33. How do you implement and use learning rate schedulers in PyTorch?
34. How do you debug a PyTorch training loop where the loss is not decreasing or is NaN?
35. What is the difference between torch.tensor() and torch.Tensor() (capital T) for creating tensors?
36. How does gradient accumulation work in PyTorch and when would you use it?
37. What is mixed precision training in PyTorch and how do you implement it with torch.cuda.amp?
38. What is torch.compile() and how does it speed up PyTorch model execution?
39. What is the difference between batch size, epoch, and iteration in PyTorch training?
40. How do you compute and track evaluation metrics like accuracy during PyTorch training?
41. What is the purpose of torch.manual_seed() and how do you ensure reproducibility in PyTorch?
42. How does PyTorch handle multi-dimensional indexing and slicing of tensors?
43. What is the difference between .view(), .reshape(), and .contiguous() in PyTorch, and why does it matter?
44. How do you freeze layers and perform transfer learning / fine-tuning in PyTorch?
45. What is the purpose of torch.utils.data.random_split() and how do you create train/validation/test splits in PyTorch?
46. What is Batch Normalization in PyTorch and how does it differ from Layer Normalization?
47. How do you implement and use a custom loss function in PyTorch?
48. What is torch.compile() vs TorchScript and how do you export a PyTorch model for production deployment?

1. What is PyTorch and what are its key advantages over other deep learning frameworks?

PyTorch is an open-source deep learning framework developed by Meta AI (Facebook), released in 2016. It is built around two core ideas: tensor computation with GPU acceleration (similar to NumPy but on the GPU) and automatic differentiation via a dynamic computation graph (called define-by-run or eager execution).

PyTorch vs TensorFlow comparison

| Feature | PyTorch | TensorFlow 2.x |
|---|---|---|
| Graph style | Dynamic (eager by default) | Eager by default (was static in v1) |
| Debugging | Native Python debugger (pdb, print) | More complex because of graph abstractions |
| Research adoption | Dominant in academia | Strong in production |
| Deployment | TorchScript, ONNX, TorchServe | TensorFlow Serving, TFLite, TF.js |
| API feel | Pythonic, NumPy-like | More verbose historically |
| Community | Fast-growing, most ML papers | Large, enterprise-focused |

Key advantages of PyTorch:

  • Dynamic computation graph — the graph is built at runtime, making debugging with standard Python tools natural
  • Pythonic API — feels like writing NumPy code; easy to mix with standard Python control flow
  • Strong GPU support: .cuda() / .to(device) moves tensors to GPU with one call
  • Rich ecosystem — torchvision, torchaudio, torchtext, HuggingFace Transformers, PyTorch Lightning
  • Production path — TorchScript, torch.compile, and ONNX export for deployment
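
The first three advantages can be seen together in a few lines. The following is a minimal sketch, not a canonical recipe: the tensor shapes and the branch condition are arbitrary choices made for illustration.

```python
import torch

# Pick a GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 3).to(device)                          # move data to the chosen device
w = torch.randn(3, 2, device=device, requires_grad=True)  # create directly on the device

# Ordinary Python control flow decides which ops enter the graph on this call.
h = x @ w
if h.mean() > 0:
    h = torch.relu(h)
else:
    h = torch.tanh(h)

loss = h.sum()
loss.backward()              # gradients flow through whichever branch actually ran
print(w.grad.shape)          # torch.Size([3, 2])
```

Because the graph is rebuilt on every call, a breakpoint or print statement placed inside either branch behaves like it would in any other Python program.
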
What type of computation graph does PyTorch use by default?
Which organisation originally developed and open-sourced PyTorch?

2. What is a PyTorch tensor and how does it differ from a NumPy array?

A tensor is PyTorch's core data structure — an n-dimensional array similar to NumPy's ndarray, but with two critical extra capabilities: it can live on a GPU for accelerated computation, and it supports automatic differentiation (autograd) for computing gradients during backpropagation.

import torch
import numpy as np

# Creating tensors
t1 = torch.tensor([1.0, 2.0, 3.0])          # from Python list
t2 = torch.zeros(3, 4)                       # 3×4 zeros
t3 = torch.ones(2, 3)                        # 2×3 ones
t4 = torch.rand(2, 3)                        # uniform random [0,1)
t5 = torch.randn(2, 3)                       # standard normal
t6 = torch.arange(0, 10, 2)                  # [0, 2, 4, 6, 8]
t7 = torch.linspace(0, 1, 5)                 # 5 evenly spaced pts

# Shape, dtype, device
print(t2.shape)     # torch.Size([3, 4])
print(t1.dtype)     # torch.float32
print(t1.device)    # cpu

# NumPy ↔ PyTorch bridge (shares memory on CPU!)
np_array = np.array([1.0, 2.0, 3.0])
torch_from_np = torch.from_numpy(np_array)   # shares memory
np_from_torch = t1.numpy()                   # shares memory

np_array[0] = 99
print(torch_from_np[0])  # tensor(99.) — memory is shared!

Tensor vs NumPy ndarray

| Feature | PyTorch Tensor | NumPy ndarray |
|---|---|---|
| GPU support | Yes, via .to('cuda') | No |
| Autograd | Yes, with requires_grad=True | No |
| Memory sharing | Yes (CPU tensors) | Yes (via from_numpy) |
| Default dtype | float32 | float64 |
| Broadcasting | Yes (same rules) | Yes |

What two capabilities does a PyTorch tensor have that a NumPy array does not?
What happens to memory when you call torch.from_numpy(arr) on a NumPy array?

3. What are the most important tensor operations in PyTorch?

PyTorch provides a rich set of tensor operations covering arithmetic, shape manipulation, reduction, and linear algebra. Most have both a functional form (torch.add) and a method form (tensor.add), plus in-place variants with a trailing underscore (tensor.add_).

import torch

a = torch.tensor([[1.,2.,3.],[4.,5.,6.]])
b = torch.tensor([[7.,8.,9.],[10.,11.,12.]])

# ── Arithmetic
print(a + b)          # element-wise add
print(a * b)          # element-wise multiply (Hadamard)
print(torch.matmul(a, b.T))  # matrix multiply  (2×3) @ (3×2) → (2×2)
print(a @ b.T)        # same with @ operator

# ── Shape manipulation
print(a.shape)                    # torch.Size([2, 3])
print(a.reshape(3, 2))            # (3, 2) — new view if possible
print(a.view(6))                  # (6,)   — must be contiguous
print(a.unsqueeze(0).shape)       # (1, 2, 3) — add dim
print(a.unsqueeze(0).squeeze(0).shape)  # (2, 3): squeeze removes only dims of size 1
print(torch.cat([a, b], dim=0))   # (4, 3) — concatenate rows
print(torch.stack([a, b], dim=0)) # (2, 2, 3) — new dim
print(a.permute(1, 0))            # (3, 2) — transpose

# ── Reduction
print(a.sum())           # scalar sum
print(a.sum(dim=1))      # sum along rows → (2,)
print(a.mean(dim=0))     # mean along columns → (3,)
print(a.max(), a.min())
print(a.argmax())        # index of max (flattened)

# ── In-place (modifies tensor, avoids memory allocation)
a.add_(1)   # a += 1
a.mul_(2)   # a *= 2
# Warning: in-place ops on tensors requiring grad can cause issues!

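Key distinction: view never copies. It returns a view of the same storage and raises a RuntimeError when the requested shape is incompatible with the tensor's memory layout, which in practice usually means the tensor is not contiguous. reshape returns a view when it can and silently falls back to a copy otherwise. Use contiguous().view() or just reshape() to be safe.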
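
The distinction is easiest to see on a transposed tensor, which shares storage with the original but is no longer contiguous. This is a small illustrative sketch; the variable names are made up for the example.

```python
import torch

a = torch.arange(6.0).reshape(2, 3)
t = a.T                                  # transpose: same storage, different strides
print(t.is_contiguous())                 # False

view_failed = False
try:
    t.view(6)                            # view cannot reinterpret this layout
except RuntimeError:
    view_failed = True

flat_copy = t.reshape(6)                 # reshape falls back to a copy here
flat_view = t.contiguous().view(6)       # explicit copy, then a view of that copy

# On a contiguous tensor, reshape is a view: it shares storage with `a`.
same_storage = a.reshape(6).data_ptr() == a.data_ptr()
copied = flat_copy.data_ptr() != a.data_ptr()
print(view_failed, same_storage, copied)  # True True True
```
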

What is the difference between torch.cat and torch.stack?
What does the trailing underscore in PyTorch method names like tensor.add_() signify?

4. What are tensor data types (dtypes) in PyTorch and why do they matter?

Every tensor has a dtype that determines the numeric type and precision of its elements. Choosing the right dtype affects memory usage, computation speed, and numeric precision — a critical consideration when training on GPUs.

Common PyTorch dtypes

| dtype | Alias | Bits | Use case |
|---|---|---|---|
| torch.float32 | torch.float | 32 | Default for model weights and activations |
| torch.float64 | torch.double | 64 | High-precision numerical work |
| torch.float16 | torch.half | 16 | Mixed-precision training (GPU) |
| torch.bfloat16 | (none) | 16 | Modern GPUs (A100+); wider exponent than float16 |
| torch.int64 | torch.long | 64 | Indices, class labels, sequence lengths |
| torch.int32 | torch.int | 32 | General integer computation |
| torch.bool | (none) | 8 | Masks, boolean indexing |
| torch.uint8 | (none) | 8 | Image pixel values (0–255) |

import torch

# Creating tensors with specific dtypes
x = torch.tensor([1.0, 2.0], dtype=torch.float32)
y = torch.tensor([1, 2, 3], dtype=torch.long)     # class labels
m = torch.tensor([True, False, True], dtype=torch.bool)

# Casting between dtypes
print(x.dtype)             # torch.float32
x64 = x.double()           # → float64
x16 = x.half()             # → float16
xi  = x.to(torch.int32)   # → int32

# Default dtype (float32 for floats, int64 for ints)
print(torch.tensor([1.0]).dtype)   # torch.float32
print(torch.tensor([1]).dtype)     # torch.int64

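# Change global default (rarely needed). Left commented out here because it
# would make every float tensor created below float64 instead of float32.
# torch.set_default_dtype(torch.float64)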

# Why dtype matters for loss computation:
# CrossEntropyLoss expects:
#   input:  float32  (logits)
#   target: int64    (class indices)
loss_fn = torch.nn.CrossEntropyLoss()
logits = torch.randn(4, 10)                 # float32
targets = torch.randint(0, 10, (4,))        # int64
loss = loss_fn(logits, targets)             # works!
# targets_wrong = targets.float()           # would error!

Most common dtype errors: passing float64 weights into a model expecting float32, or passing float targets to a loss function expecting long (e.g. CrossEntropyLoss).
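
Both errors are easy to reproduce. The sketch below assumes a toy nn.Linear layer and made-up data; it shows the float64-versus-float32 mismatch and the cast that resolves it, followed by the cast needed for integer targets.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)                          # parameters are float32 by default
x64 = torch.randn(4, 3, dtype=torch.float64)     # e.g. data converted from NumPy

mismatch_raised = False
try:
    layer(x64)                                   # float64 input vs float32 weights
except RuntimeError as err:
    mismatch_raised = True
    print("dtype mismatch:", err)

out = layer(x64.float())                         # cast the input to float32
print(out.dtype)                                 # torch.float32

# Class-index targets should be int64 (long) for CrossEntropyLoss.
targets = torch.tensor([0, 1, 1, 0], dtype=torch.int32)
loss = nn.CrossEntropyLoss()(out, targets.long())
```
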

What dtype should class label targets be for PyTorch's CrossEntropyLoss?
What is the advantage of torch.bfloat16 over torch.float16 for training on modern GPUs?

5. How does broadcasting work in PyTorch and what are the rules?

Broadcasting allows PyTorch to perform arithmetic between tensors of different shapes without explicit copying. PyTorch follows the same broadcasting rules as NumPy. Understanding broadcasting is essential to avoid subtle shape bugs.

import torch

# Rule: align shapes from the RIGHT, expand dims of size 1
a = torch.ones(3, 4)     # shape (3, 4)
b = torch.ones(4)        # shape    (4) → treated as (1, 4) → broadcast to (3, 4)
c = a + b                # works! c.shape = (3, 4)

# Adding a bias vector to a batch of activations
batch = torch.randn(32, 128)   # (batch=32, features=128)
bias  = torch.randn(128)       # (128,) broadcasts across the batch dim
out   = batch + bias           # (32, 128) ✓

# Adding column and row vectors → 2D result
col = torch.arange(3).reshape(3, 1)  # (3, 1)
row = torch.arange(4).reshape(1, 4)  # (1, 4)
grid = col + row                      # (3, 4) — outer-sum
print(grid)
# tensor([[0, 1, 2, 3],
#         [1, 2, 3, 4],
#         [2, 3, 4, 5]])

# Common broadcasting errors:
# a = torch.ones(3, 4)
# b = torch.ones(3)        # (3,) aligns to (1, 3) NOT (3, 1)
# a + b  → ERROR: size 4 != size 3 in dimension 1
# Fix: b.reshape(3, 1) to make it (3, 1)

Broadcasting rules (step by step)

| Step | Rule |
|---|---|
| 1. Align right | Pad missing leading dimensions with 1 |
| 2. Check compatibility | Each dim must be equal, or one of them must be 1 |
| 3. Expand size-1 dims | Dimension of size 1 is stretched to match the other tensor |
| 4. Error if incompatible | Raises RuntimeError if no dim is 1 and sizes differ |
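
To check a shape combination without allocating any tensors, torch.broadcast_shapes applies the same rules to the shapes alone. A short sketch, using the shapes from the examples in this section:

```python
import torch

print(torch.broadcast_shapes((3, 1), (1, 4)))      # torch.Size([3, 4])
print(torch.broadcast_shapes((32, 128), (128,)))   # torch.Size([32, 128])

incompatible = False
try:
    torch.broadcast_shapes((3, 4), (3,))           # trailing dims 4 and 3 clash
except RuntimeError:
    incompatible = True

# Fix: give the vector an explicit trailing dim of size 1.
b = torch.ones(3)
out = torch.ones(3, 4) + b.unsqueeze(1)            # (3, 4) + (3, 1) -> (3, 4)
print(incompatible, out.shape)
```
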
Tensors of shape (3,1) and (1,4) are added together. What is the output shape?
Tensors of shape (32,128) and (128,) are added. Why does this work?

6. What is autograd in PyTorch and how does it compute gradients?

PyTorch's autograd engine implements automatic differentiation. When you perform operations on tensors with requires_grad=True, PyTorch records every operation in a dynamic computation graph. Calling .backward() on a scalar loss traverses this graph in reverse using the chain rule, accumulating gradients in each tensor's .grad attribute.

import torch

# requires_grad=True tells PyTorch to track this tensor
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Forward pass — operations are recorded
y = x ** 2           # y = [4.0, 9.0]
z = y.sum()          # z = 13.0  (scalar)

# Backward pass — computes dz/dx using chain rule
z.backward()
print(x.grad)        # tensor([4., 6.])  dz/dx = 2x

# Verify: dz/d(x[0]) = d(x[0]^2)/d(x[0]) = 2*x[0] = 4 ✓

# Gradients ACCUMULATE — always zero before next backward!
x.grad.zero_()   # or optimizer.zero_grad()

# Non-leaf tensors (created by ops) have grad_fn
a = torch.tensor(3.0, requires_grad=True)
b = a * 2
print(b.grad_fn)       # <MulBackward0 object>
print(b.requires_grad) # True — inherited from a

# Detach: stop tracking a tensor
c = b.detach()         # c shares data with b but no grad history
print(c.requires_grad) # False

# torch.no_grad(): context manager to disable gradient tracking
with torch.no_grad():
    inference = a * 2   # faster, no graph built
    print(inference.requires_grad)  # False

Key autograd concepts

| Concept | What it is |
|---|---|
| requires_grad=True | Tells autograd to track operations on this tensor |
| .grad | Accumulated gradient after .backward(); lives on leaf tensors |
| grad_fn | Reference to the function that created a non-leaf tensor |
| .backward() | Traverses graph backwards, fills .grad via chain rule |
| .detach() | Returns tensor with same data but no gradient history |
| torch.no_grad() | Context: disables gradient tracking (inference, validation) |
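
The accumulation behaviour of .grad is worth seeing once in isolation. In this small sketch the same loss is back-propagated twice without zeroing, then once more after a reset:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

(x ** 2).sum().backward()
first = x.grad.clone()          # tensor([4., 6.])

(x ** 2).sum().backward()       # no zeroing in between
second = x.grad.clone()         # tensor([8., 12.]): added to the previous value

x.grad.zero_()                  # reset, as optimizer.zero_grad() would
(x ** 2).sum().backward()
third = x.grad.clone()          # tensor([4., 6.]) again

print(first, second, third)
```
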
What method do you call to trigger gradient computation in PyTorch?
Why must you call optimizer.zero_grad() (or tensor.grad.zero_()) before each backward pass?

7. What is the computation graph in PyTorch and how does the dynamic graph differ from a static graph?

PyTorch builds a dynamic computation graph (also called eager execution or define-by-run). Every time you run the forward pass, a new graph is constructed on-the-fly based on the actual Python code paths executed. This is in contrast to TensorFlow 1.x's static graph, which is compiled once and then executed repeatedly.

import torch

# Dynamic graph: Python control flow works naturally
def dynamic_model(x, use_relu=True):
    h = x @ torch.randn(4, 4)
    if use_relu:           # real Python if — changes the graph!
        h = torch.relu(h)
    else:
        h = torch.tanh(h)
    return h.sum()

x = torch.randn(2, 4, requires_grad=True)

# Each call may build a DIFFERENT graph depending on use_relu
loss1 = dynamic_model(x, use_relu=True)
loss1.backward()   # graph includes ReLU nodes

x.grad.zero_()
loss2 = dynamic_model(x, use_relu=False)
loss2.backward()   # graph includes Tanh nodes

# The graph is discarded after backward() by default
# retain_graph=True keeps it for multiple backward calls
y = (x ** 2).sum()
y.backward(retain_graph=True)   # graph kept
y.backward()                    # can call again

# Inspecting the graph
z = x ** 3
print(z.grad_fn)                # <PowBackward0>
print(z.grad_fn.next_functions) # upstream functions

Dynamic vs Static computation graph

| Aspect | Dynamic (PyTorch eager) | Static (TF1 / torch.compile) |
|---|---|---|
| When built | At runtime, every forward pass | Once, then reused |
| Python control flow | Works natively (if/for/while) | Must use special graph ops |
| Debugging | Use pdb, print anywhere | Harder; graph is opaque |
| Performance | Slight overhead from graph construction | Faster after compilation |
| Flexibility | High; easy to change architectures | Low; recompile to change |

What happens to the computation graph by default after calling loss.backward()?
What is the main debugging advantage of PyTorch's dynamic computation graph over a static graph?

8. How do torch.no_grad() and tensor.detach() differ, and when do you use each?

Both torch.no_grad() and .detach() stop gradient tracking, but they work at different levels and serve different purposes.

import torch

model_param = torch.tensor(2.0, requires_grad=True)

# ── torch.no_grad(): context manager — disables ALL grad tracking
# Use for inference and validation loops
with torch.no_grad():
    out = model_param * 3       # no graph built
    print(out.requires_grad)    # False
    print(out.grad_fn)          # None
# Faster + less memory — standard pattern for eval

# ── .detach(): detaches a SPECIFIC tensor from the graph
# The result shares memory with the original but has requires_grad=False and no history
a = model_param * 4
print(a.requires_grad)          # True  (still tracking)
b = a.detach()                  # b shares data with a
print(b.requires_grad)          # False (disconnected)
print(b.data_ptr() == a.data_ptr())  # True — SAME memory!

# Common use case: compute a "stop gradient" target
# in actor-critic / target networks
target = a.detach()             # stop gradient through target
loss = (a - target) ** 2       # gradient only flows through a, not target

# ── @torch.no_grad() decorator variant
@torch.no_grad()
def predict(x):
    return model_param * x      # no grad even without with block

# Validation loop pattern
def validate(model, loader):
    model.eval()                # turns off dropout, batchnorm train mode
    with torch.no_grad():       # no gradient computation
        for x, y in loader:
            pred = model(x)
            # compute metrics...

no_grad vs detach comparison

| Feature | torch.no_grad() | tensor.detach() |
|---|---|---|
| Scope | All ops within the context block | One specific tensor |
| Memory saved | Yes; no graph built | Partial; graph still exists upstream |
| Typical use | Inference, validation loops | Target networks, stop-gradient |
| Output requires_grad | False | False |

When should you use torch.no_grad() during model training?
What is the key difference between tensor.detach() and torch.no_grad()?

9. What is nn.Module and how do you build a custom neural network in PyTorch?

nn.Module is the base class for all neural network components in PyTorch. Subclassing it gives you parameter management, device placement, train/eval mode toggling, state dict serialisation, and hooks — all for free.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int):
        super().__init__()   # MUST call this first!

        # Layers defined as attributes are auto-registered as sub-modules
        self.fc1  = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p=0.3)
        self.fc2  = nn.Linear(hidden, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Define the forward computation."""
        x = self.fc1(x)
        x = self.relu(x)
        x = self.drop(x)
        x = self.fc2(x)
        return x

# Instantiate and inspect
model = MLP(in_features=784, hidden=256, out_features=10)

# Forward pass — calls forward() via __call__
x = torch.randn(32, 784)   # batch of 32
out = model(x)              # shape (32, 10)

# Parameter inspection
for name, param in model.named_parameters():
    print(name, param.shape, param.requires_grad)
# fc1.weight  torch.Size([256, 784])  True
# fc1.bias    torch.Size([256])       True
# fc2.weight  torch.Size([10, 256])   True
# fc2.bias    torch.Size([10])        True

total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")

Critical rules:

  • Always call super().__init__() in __init__
  • Define layers as attributes (not local variables) so PyTorch registers them
  • Implement the forward() method to define the computation
  • Call model(x), not model.forward(x), so that registered pre- and post-forward hooks fire
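
The last rule can be checked with a forward hook. The sketch below uses a bare nn.Linear and a hypothetical `calls` list purely to record when the hook runs:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
calls = []

# A forward hook runs after forward() whenever the module is invoked via __call__.
handle = layer.register_forward_hook(
    lambda module, inputs, output: calls.append(tuple(output.shape))
)

x = torch.randn(3, 4)
layer(x)              # goes through __call__, so the hook fires
layer.forward(x)      # bypasses __call__, so the hook is skipped
print(calls)          # [(3, 2)]

handle.remove()       # detach the hook when it is no longer needed
```
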
Why must you call super().__init__() at the start of an nn.Module subclass's __init__?
What is the difference between calling model.forward(x) and model(x)?

10. What are nn.Sequential and other container modules in PyTorch?

PyTorch provides several container modules that compose layers without requiring a custom nn.Module subclass. They are convenient for simple feedforward architectures but less flexible than full subclassing.

import torch
import torch.nn as nn

# ── nn.Sequential: layers applied in order
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
out = model(torch.randn(32, 784))  # (32, 10)

# Named layers in Sequential (pass an OrderedDict; bare tuples are not accepted)
from collections import OrderedDict
model_named = nn.Sequential(OrderedDict([
    ("fc1",  nn.Linear(784, 256)),
    ("relu", nn.ReLU()),
    ("fc2",  nn.Linear(256, 10)),
]))
print(model_named.fc1.weight.shape)   # torch.Size([256, 784])

# ── nn.ModuleList: list of modules (for dynamic use)
class ResNet(nn.Module):
    def __init__(self, n_blocks: int):
        super().__init__()
        # ModuleList properly registers all contained modules
        self.blocks = nn.ModuleList([
            nn.Linear(64, 64) for _ in range(n_blocks)
        ])
    def forward(self, x):
        for block in self.blocks:
            x = torch.relu(block(x)) + x  # residual
        return x

# ── nn.ModuleDict: dict of modules (for conditional routing)
class MultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleDict({
            "sentiment": nn.Linear(128, 2),
            "topic":     nn.Linear(128, 10),
        })
    def forward(self, x, task: str):
        return self.heads[task](x)

Container modules

| Container | When to use |
|---|---|
| nn.Sequential | Simple feedforward chains; no branching |
| nn.ModuleList | Dynamic or variable-length list of modules in a loop |
| nn.ModuleDict | Named modules selected conditionally (e.g. multi-task) |
| nn.ParameterList | List of nn.Parameter objects (rare) |
| nn.ParameterDict | Dict of nn.Parameter objects (rare) |
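
Registration is the practical reason to prefer nn.ModuleList over a plain Python list. In this sketch, with two hypothetical classes that differ only in the container, the plain list hides its layers from parameters() and therefore from any optimizer built on it:

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the layers are NOT registered as sub-modules.
        self.layers = [nn.Linear(8, 8) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList: each layer is registered and tracked.
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])

n_plain = sum(p.numel() for p in PlainList().parameters())
n_registered = sum(p.numel() for p in Registered().parameters())
print(n_plain, n_registered)    # 0 216  (3 layers x (64 weights + 8 biases))
```

The same applies to .to(device) and state_dict(): unregistered layers are skipped by both.
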
Why use nn.ModuleList instead of a plain Python list to store layers?
When is nn.Sequential NOT a suitable choice for building a model?

11. What built-in layers does PyTorch's nn module provide and how do you use the most common ones?

PyTorch's torch.nn module contains all the standard neural network building blocks. Understanding what each layer does mathematically helps you choose the right component and configure it correctly.

Most common nn layers

| Layer | Formula / purpose | Key parameters |
|---|---|---|
| nn.Linear | y = xW^T + b (fully connected) | in_features, out_features, bias=True |
| nn.Conv2d | 2D cross-correlation (feature extraction) | in_channels, out_channels, kernel_size, stride, padding |
| nn.BatchNorm1d/2d | Normalise over batch; learnable γ, β | num_features, eps, momentum |
| nn.Dropout | Zero random neurons with prob p during training | p (dropout probability) |
| nn.Embedding | Learnable lookup table for integer tokens | num_embeddings, embedding_dim |
| nn.LSTM | Long Short-Term Memory recurrent layer | input_size, hidden_size, num_layers |
| nn.MultiheadAttention | Multi-head scaled dot-product attention | embed_dim, num_heads |
| nn.LayerNorm | Normalise over feature dims per sample | normalized_shape |

import torch, torch.nn as nn

# nn.Linear
fc = nn.Linear(128, 64)        # (batch, 128) → (batch, 64)
print(fc.weight.shape)          # (64, 128)  — transposed internally
print(fc.bias.shape)            # (64,)

# nn.Conv2d
conv = nn.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=3,
    stride=1,
    padding=1,           # "same" padding preserves H, W
)
x_img = torch.randn(8, 3, 32, 32)   # (batch, C, H, W)
print(conv(x_img).shape)             # (8, 32, 32, 32)

# nn.BatchNorm2d
bn = nn.BatchNorm2d(32)       # num_features = channels
# In train mode: normalises over (N, H, W) per channel
# In eval mode:  uses running mean/var from training

# nn.Embedding
emb = nn.Embedding(num_embeddings=10000, embedding_dim=128)
tokens = torch.tensor([5, 23, 100])  # integer token ids
print(emb(tokens).shape)              # (3, 128)

# nn.Dropout — active only in train mode
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 8)
print(drop(x))   # train mode: ~half zeros, survivors scaled to 2.0; all ones after drop.eval()
What input shape does nn.Conv2d expect in PyTorch?
When does nn.Dropout zero out neurons?

12. What are activation functions in PyTorch and how do you apply them?

Activation functions introduce non-linearity into neural networks, enabling them to learn complex mappings. PyTorch provides them both as nn.Module classes (usable as layers) and as functional forms in torch.nn.functional.

Common activation functions

| Function | Formula | Typical use |
|---|---|---|
| ReLU | max(0, x) | Default for hidden layers (fast, avoids vanishing grad) |
| LeakyReLU | max(αx, x), α≈0.01 | When dying ReLU is a problem |
| Sigmoid | 1/(1+e^−x) → (0,1) | Binary classification output |
| Tanh | (e^x−e^−x)/(e^x+e^−x) → (−1,1) | RNNs, zero-centred alternative to sigmoid |
| Softmax | e^xᵢ/Σe^xⱼ → sums to 1 | Multi-class probabilities at inference (for training, pass raw logits to CrossEntropyLoss, or LogSoftmax output to NLLLoss) |
| GELU | x·Φ(x), smooth | Transformers (BERT, GPT) |
| SiLU/Swish | x·sigmoid(x) | Modern architectures (EfficientNet) |

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2., -1., 0., 1., 2.])

# ── As nn.Module (use inside nn.Sequential or __init__)
relu = nn.ReLU()
print(relu(x))          # [0, 0, 0, 1, 2]

sigmoid = nn.Sigmoid()
print(sigmoid(x))       # [0.12, 0.27, 0.50, 0.73, 0.88]

# ── As functional (use inside forward())
print(F.relu(x))        # same as nn.ReLU()(x)
print(F.gelu(x))        # smooth approximation

# Softmax: dim must be specified!
logits = torch.randn(4, 10)   # (batch=4, classes=10)
probs = F.softmax(logits, dim=1)   # dim=1 (classes)
print(probs.sum(dim=1))            # tensor([1., 1., 1., 1.])

# !! Never apply Softmax before CrossEntropyLoss !!
# CrossEntropyLoss = LogSoftmax + NLLLoss internally
# Applying softmax first → double-softmax = wrong!
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, torch.randint(0, 10, (4,)))  # pass raw logits!
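
The warning about double softmax can be verified numerically. A small sketch, with a fixed seed chosen only to make the run repeatable:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

ce = nn.CrossEntropyLoss()(logits, labels)
# The same computation spelled out: log-softmax followed by negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
# Mistake: softmax applied first, so CrossEntropyLoss applies it a second time.
double = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), labels)

print(torch.allclose(ce, manual))   # True
print(ce.item(), double.item())     # the two values differ
```

The double-softmax version still runs without error, which is why this bug tends to show up as poor training rather than as a crash.
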
Why should you NOT apply nn.Softmax before nn.CrossEntropyLoss?
What is the key advantage of ReLU over sigmoid/tanh as an activation function in deep networks?

13. What are the most important loss functions in PyTorch and when do you use each?

Choosing the right loss function is critical — it defines what the model is optimising for. PyTorch provides loss functions in torch.nn as modules and in torch.nn.functional as functions.

Common PyTorch loss functions

| Loss | Use case | Input / Target |
|---|---|---|
| nn.MSELoss | Regression: minimise squared error | pred: float, target: float |
| nn.L1Loss (MAE) | Regression: mean absolute error, robust to outliers | pred: float, target: float |
| nn.CrossEntropyLoss | Multi-class classification | pred: (N,C) logits, target: (N,) long |
| nn.BCEWithLogitsLoss | Binary classification (numerically stable) | pred: (N,) logits, target: (N,) float 0/1 |
| nn.NLLLoss | Used with log-softmax output | pred: (N,C) log-probs, target: (N,) long |
| nn.KLDivLoss | Distribution divergence (VAE, distillation) | pred: log-probs, target: probs |
| nn.HuberLoss | Regression robust to outliers (quadratic near zero, linear for large errors) | pred: float, target: float |

import torch, torch.nn as nn

batch = 8

# ── Regression
pred = torch.randn(batch, 1)
target = torch.randn(batch, 1)
mse  = nn.MSELoss()(pred, target)
mae  = nn.L1Loss()(pred, target)
print(mse, mae)

# ── Multi-class classification
logits  = torch.randn(batch, 10)        # raw scores, NOT softmax
labels  = torch.randint(0, 10, (batch,)) # class indices, dtype=long
ce_loss = nn.CrossEntropyLoss()(logits, labels)
print(ce_loss)

# Class-weighted cross entropy (handle imbalance)
weights = torch.tensor([1.0]*9 + [5.0])  # upweight class 9
ce_weighted = nn.CrossEntropyLoss(weight=weights)(logits, labels)

# ── Binary classification (single output neuron)
bin_logits = torch.randn(batch)           # single score
bin_labels = torch.randint(0, 2, (batch,)).float()  # 0 or 1, float!
bce_loss   = nn.BCEWithLogitsLoss()(bin_logits, bin_labels)
# BCEWithLogitsLoss = sigmoid + BCE in one numerically stable op

# ── Label smoothing (reduces overconfidence)
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, labels)

# reduction parameter
nn.MSELoss(reduction="mean")    # default: mean over batch
nn.MSELoss(reduction="sum")     # sum over batch
nn.MSELoss(reduction="none")    # per-sample loss (no reduction)
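
The numerical-stability claim for BCEWithLogitsLoss can be illustrated with one extreme logit. This is a sketch under the assumption of float32 tensors; the value 200 is arbitrary and only needs to be large enough for sigmoid to saturate.

```python
import torch
import torch.nn as nn

logit = torch.tensor([200.0])    # a very confident, wrong prediction
target = torch.tensor([0.0])

# Works on the logit directly, so the magnitude of the error is preserved.
stable = nn.BCEWithLogitsLoss()(logit, target)
# sigmoid(200) rounds to exactly 1.0 in float32, so BCELoss sees log(0).
naive = nn.BCELoss()(torch.sigmoid(logit), target)

print(stable.item())   # 200.0
print(naive.item())    # no longer 200: the information was lost when sigmoid saturated
```
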
What input format does nn.CrossEntropyLoss expect?
Why use nn.BCEWithLogitsLoss instead of applying sigmoid first and then nn.BCELoss?

14. What optimizers does PyTorch provide and how do you configure them?

Optimizers update model parameters based on computed gradients. PyTorch provides all major optimizers in torch.optim. Choosing and configuring the right optimizer significantly affects training speed and final performance.

PyTorch optimizers

| Optimizer | Key feature | Typical use |
|---|---|---|
| SGD | Simple, supports momentum and weight decay | Computer vision with lr scheduling |
| Adam | Adaptive lr per param; momentum + RMSProp | Default for NLP, general purpose |
| AdamW | Adam with decoupled weight decay | Transformers, fine-tuning (recommended over Adam) |
| RMSprop | Adaptive lr without momentum | RNNs |
| Adagrad | Accumulates squared gradients; rare today | Sparse features |
| LBFGS | Second-order quasi-Newton; very slow | Small networks, physics-informed NNs |

import torch, torch.nn as nn, torch.optim as optim

model = nn.Linear(128, 10)

# ── SGD with momentum and weight decay
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,       # classical momentum (velocity) term
    weight_decay=1e-4,  # L2 regularisation
    nesterov=True,      # switch to Nesterov-style look-ahead momentum
)

# ── Adam
opt_adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999), # (β1, β2) — momentum terms
    eps=1e-8,
    weight_decay=0,     # Adam + L2 is suboptimal — use AdamW!
)

# ── AdamW (preferred for transformers)
opt_adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01,  # decoupled from gradient update
)

# ── Per-layer learning rates
opt_layerwise = optim.Adam([
    {"params": model.weight, "lr": 1e-4},  # slower for weight
    {"params": model.bias,   "lr": 1e-3},  # faster for bias
])

# ── Standard training step
opt = optim.AdamW(model.parameters(), lr=1e-3)
for x, y in [(torch.randn(32,128), torch.randint(0,10,(32,)))]:
    opt.zero_grad()               # 1. clear old gradients
    loss = nn.CrossEntropyLoss()(model(x), y)  # 2. forward
    loss.backward()              # 3. backward
    opt.step()                   # 4. update parameters
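
The difference between coupled and decoupled weight decay shows up even in a single step. The sketch below is a contrived setup, not a benchmark: one scalar weight whose loss gradient is zero, so weight decay is the only thing moving it. The helper name `one_step` is made up for this example.

```python
import torch
import torch.optim as optim

def one_step(opt_cls):
    # Single weight with a zero loss gradient, so only weight decay acts on it.
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = opt_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)
    opt.step()
    return w.item()

w_adam = one_step(optim.Adam)    # decay is added to the gradient, then rescaled by Adam
w_adamw = one_step(optim.AdamW)  # decay shrinks the weight directly: w *= 1 - lr * wd
print(w_adam, w_adamw)           # approximately 0.9 and 0.99
```

With Adam the decay term passes through the adaptive normalisation and produces a full lr-sized step; with AdamW the shrinkage is exactly lr × weight_decay × w regardless of gradient statistics.
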
What is the key difference between Adam and AdamW?
What is the correct order of operations in a PyTorch training step?

15. What are learning rate schedulers in PyTorch and how do you use them?

A learning rate scheduler adjusts the learning rate during training — typically starting high for fast initial progress and decaying for fine-grained convergence. Schedulers wrap an optimizer and must be stepped after each epoch (or each batch for some schedulers).

Common LR schedulers

| Scheduler | Behaviour | Step |
|---|---|---|
| StepLR | Multiply lr by gamma every step_size epochs | Per epoch |
| MultiStepLR | Decay at specified milestone epochs | Per epoch |
| ExponentialLR | lr *= gamma every epoch | Per epoch |
| CosineAnnealingLR | Cosine decay from lr to eta_min | Per epoch |
| OneCycleLR | Warmup then cosine decay (superconvergence) | Per batch |
| ReduceLROnPlateau | Reduce lr when metric stops improving | Per epoch (with metric) |
| CosineAnnealingWarmRestarts | Cosine with periodic restarts | Per epoch |

import torch, torch.optim as optim
import torch.nn as nn

model = nn.Linear(128, 10)

def new_optimizer():
    # A scheduler rewrites its optimizer's lr, so each example gets its own optimizer.
    return optim.SGD(model.parameters(), lr=0.1)

# ── StepLR: multiply lr by 0.1 every 30 epochs
step_sched = optim.lr_scheduler.StepLR(new_optimizer(), step_size=30, gamma=0.1)

# ── CosineAnnealingLR: smooth cosine decay
cosine_sched = optim.lr_scheduler.CosineAnnealingLR(
    new_optimizer(), T_max=100, eta_min=1e-6
)

# ── OneCycleLR: requires the total number of batches at init
epochs, batches_per_epoch = 100, 1000
onecycle_sched = optim.lr_scheduler.OneCycleLR(
    new_optimizer(),
    max_lr=0.1,
    total_steps=epochs * batches_per_epoch,
)

# ── ReduceLROnPlateau: triggered by validation loss
plateau_sched = optim.lr_scheduler.ReduceLROnPlateau(
    new_optimizer(), mode="min", factor=0.5, patience=5
)

# ── Training loop integration (one optimizer, one scheduler)
optimizer = new_optimizer()
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    train_loss = 0.0  # ... train ...
    optimizer.step()  # stands in for the per-batch updates of the epoch
    val_loss   = 0.0  # ... validate ...

    # Most schedulers: step once after each epoch
    scheduler.step()              # for StepLR, CosineAnnealingLR etc.
    # scheduler.step(val_loss)    # ReduceLROnPlateau must be given the metric

    # Check current lr
    current_lr = optimizer.param_groups[0]["lr"]
    print(f"Epoch {epoch}: lr={current_lr:.6f}")
When should you call scheduler.step() for most epoch-based learning rate schedulers?
Which PyTorch scheduler is most suitable when you want to reduce learning rate only if validation loss stops improving?

16. What are the most common built-in layers in torch.nn and what do they do?

PyTorch's torch.nn module provides all the standard building blocks for neural networks. Understanding what each layer does mathematically and when to use it is fundamental to building effective models.
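
What a layer does mathematically is often easiest to confirm by inspecting its output. As a minimal sketch (the input shape and the scale/shift applied to the random input are arbitrary choices made for illustration), the snippet below checks which axis nn.BatchNorm1d and nn.LayerNorm normalise over:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 64) * 3 + 5        # (batch N=16, features=64), deliberately not zero-mean

bn = nn.BatchNorm1d(64)                # training mode by default: uses batch statistics
ln = nn.LayerNorm(64)

bn_out = bn(x)
ln_out = ln(x)

# BatchNorm: each feature column has roughly zero mean across the batch
bn_col_means = bn_out.mean(dim=0)      # shape (64,)
# LayerNorm: each sample row has roughly zero mean across its features
ln_row_means = ln_out.mean(dim=1)      # shape (16,)

print(bn_col_means.abs().max().item(), ln_row_means.abs().max().item())
```

Both printed values should be close to zero, which is why LayerNorm does not depend on the batch size while BatchNorm in training mode does.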

Common nn layers

| Layer | Formula / behaviour | Typical use |
|---|---|---|
| nn.Linear(in, out) | y = xW^T + b | Fully connected / dense layer |
| nn.Conv2d(in, out, k) | 2D convolution with kernel k×k | Image feature extraction |
| nn.BatchNorm1d/2d | Normalise per feature/channel over batch | After linear/conv, before activation |
| nn.LayerNorm | Normalise over feature dim per sample | Transformers, NLP |
| nn.Dropout(p) | Zeros random fraction p during train | Regularisation |
| nn.Embedding(V,d) | Lookup table V vocab × d dim | Word/token embeddings |
| nn.ReLU/GELU/Tanh | Element-wise activations | After linear/conv layers |
| nn.Softmax(dim) | exp(x)/Σexp(x) along dim | Output probabilities (for training, use LogSoftmax+NLLLoss or CrossEntropyLoss directly) |
| nn.MaxPool2d | Takes max over kernel window | Spatial downsampling in CNNs |
| nn.LSTM/GRU | Gated recurrent cells | Sequence modelling |
import torch, torch.nn as nn

# Linear layer internals
fc = nn.Linear(4, 8)
print(fc.weight.shape)   # (8, 4) — note: output × input
print(fc.bias.shape)     # (8,)

# Embedding
emb = nn.Embedding(num_embeddings=10000, embedding_dim=128,
                   padding_idx=0)   # index 0 gets a zero vector
tokens = torch.tensor([1, 42, 7])   # shape (3,)
out = emb(tokens)                   # shape (3, 128)

# BatchNorm vs LayerNorm
bn  = nn.BatchNorm1d(64)   # input (N, 64) — normalises across N
ln  = nn.LayerNorm(64)     # input (N, 64) — normalises across 64 features

x = torch.randn(16, 64)
print(bn(x).shape)   # (16, 64)
print(ln(x).shape)   # (16, 64)

# Dropout only active during training
drop = nn.Dropout(p=0.5)
model = nn.Sequential(nn.Linear(32,32), drop, nn.ReLU())
model.train();  x_tr = model(torch.randn(4,32))  # Dropout zeroes ~50% of activations, scales the rest by 1/(1-p)
model.eval();   x_ev = model(torch.randn(4,32))  # Dropout is a no-op in eval mode
What are the weight dimensions of nn.Linear(in_features=4, out_features=8)?
What is the key difference between BatchNorm and LayerNorm?

17. How do you initialise weights in a PyTorch model?

PyTorch uses sensible default initialisations (Kaiming uniform for Linear and Conv layers), but custom initialisation is often needed to match a paper or improve convergence. The torch.nn.init module provides all standard schemes.
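
The schemes in torch.nn.init differ mainly in how they scale the sampling range to the layer's fan-in and fan-out. The following sketch (the 64×128 weight shape is an arbitrary assumption) compares the documented bounds of Xavier uniform and Kaiming uniform for ReLU with the values actually drawn:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_out, fan_in = 64, 128
w = torch.empty(fan_out, fan_in)       # same layout as nn.Linear(128, 64).weight

# Xavier/Glorot uniform: U(-b, b) with b = sqrt(6 / (fan_in + fan_out))
nn.init.xavier_uniform_(w)
xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))
xavier_max = w.abs().max().item()

# Kaiming/He uniform for ReLU: U(-b, b) with b = sqrt(6 / fan_in)
nn.init.kaiming_uniform_(w, mode="fan_in", nonlinearity="relu")
kaiming_bound = math.sqrt(6.0 / fan_in)
kaiming_max = w.abs().max().item()

print(f"Xavier : max |w| = {xavier_max:.4f}, bound = {xavier_bound:.4f}")
print(f"Kaiming: max |w| = {kaiming_max:.4f}, bound = {kaiming_bound:.4f}")
```

For the same layer, the Kaiming bound is wider, which compensates for ReLU zeroing roughly half of the activations.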

import torch, torch.nn as nn

# Default initialisation:
# nn.Linear  → Kaiming uniform (He init) for weight, uniform for bias
# nn.Conv2d  → Kaiming uniform
# nn.Embedding → Normal(0, 1)

# Custom initialisation using apply()
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)   # Xavier/Glorot
        nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight,
                                mode="fan_out",
                                nonlinearity="relu")  # He init
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)  # GPT-style

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64,  10)
)
model.apply(init_weights)   # recursively applies to all sub-modules

# Direct initialisation with torch.no_grad()
with torch.no_grad():
    model[0].weight.fill_(0.01)
    model[0].bias.zero_()
Common init schemes

| Scheme | API | Best for |
|---|---|---|
| Xavier / Glorot uniform | nn.init.xavier_uniform_() | Sigmoid / Tanh activations |
| Xavier / Glorot normal | nn.init.xavier_normal_() | Sigmoid / Tanh activations |
| Kaiming / He uniform | nn.init.kaiming_uniform_() | ReLU (PyTorch default) |
| Kaiming / He normal | nn.init.kaiming_normal_() | ReLU (often better than uniform) |
| Normal | nn.init.normal_(mean, std) | Embeddings (std=0.02 GPT-style) |
| Zeros / Ones | nn.init.zeros_() / ones_() | Biases, gates |
| Orthogonal | nn.init.orthogonal_() | RNNs |
What initialisation scheme does PyTorch use by default for nn.Linear weights?
Why is Kaiming (He) initialisation preferred over Xavier when using ReLU activations?

18. What loss functions does PyTorch provide and when do you use each?

Loss functions (criteria) measure the difference between predictions and targets. PyTorch provides them in torch.nn. Choosing the right one for your task is critical — using the wrong loss gives poor training signal even if the architecture is correct.

Common PyTorch loss functions

| Loss | Class | Task | Target dtype |
|---|---|---|---|
| Cross-entropy | nn.CrossEntropyLoss | Multi-class classification | Long (class indices) |
| Binary cross-entropy + logits | nn.BCEWithLogitsLoss | Binary / multi-label | Float |
| MSE | nn.MSELoss | Regression | Float |
| MAE / L1 | nn.L1Loss | Robust regression | Float |
| Huber / Smooth L1 | nn.HuberLoss / nn.SmoothL1Loss | Robust regression | Float |
| NLL Loss | nn.NLLLoss | After log-softmax | Long |
| KL Divergence | nn.KLDivLoss | Distribution matching | Float |
| Triplet Margin | nn.TripletMarginLoss | Metric learning | Float |
import torch, torch.nn as nn

# Multi-class classification: CrossEntropyLoss
# Input: (N, C) logits — raw, before softmax
# Target: (N,) class indices — dtype=long
ce = nn.CrossEntropyLoss()
logits  = torch.randn(4, 10)              # 4 samples, 10 classes
targets = torch.tensor([2, 5, 0, 9])      # true class indices
loss = ce(logits, targets)

# Binary classification: BCEWithLogitsLoss
# Numerically stable (fuses sigmoid + BCE)
bce = nn.BCEWithLogitsLoss()
preds = torch.randn(4)       # logits, NOT sigmoid output
true  = torch.tensor([1.,0.,1.,0.])
loss_b = bce(preds, true)

# Class weighting for imbalanced datasets
weights = torch.tensor([1.0]*9 + [10.0])   # class 9 is rare
ce_w = nn.CrossEntropyLoss(weight=weights)

# Label smoothing (reduces overconfidence)
ce_ls = nn.CrossEntropyLoss(label_smoothing=0.1)

# Regression: MSE vs Huber
mseLoss   = nn.MSELoss()
huberLoss = nn.HuberLoss(delta=1.0)   # L2 near 0, L1 for large errors
pred_r = torch.randn(4)
true_r = torch.randn(4)
print(mseLoss(pred_r, true_r))
print(huberLoss(pred_r, true_r))

Critical gotcha: nn.CrossEntropyLoss expects raw logits (before softmax), not probabilities. It internally applies log-softmax, so applying softmax first leads to double-softmax and incorrect training.
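
The gotcha can be checked numerically. The sketch below (random logits with a fixed seed, chosen only for illustration) compares the correct call, the equivalent log-softmax + NLL form, and the mistaken softmax-first call:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
targets = torch.tensor([2, 5, 0, 9])

ce = nn.CrossEntropyLoss()
loss_correct = ce(logits, targets)

# Equivalent two-step form: log-softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# The mistake: softmax applied first, so the loss treats probabilities as logits
loss_double = ce(F.softmax(logits, dim=1), targets)

print(loss_correct.item(), loss_manual.item(), loss_double.item())
```

Because probabilities lie in [0, 1], the softmax-first loss is squeezed into a narrow band around log(C) (here log 10 ≈ 2.30), so it barely reacts to how wrong the model is.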

What dtype must the target tensor be for nn.CrossEntropyLoss?
Why is nn.BCEWithLogitsLoss preferred over applying sigmoid then nn.BCELoss?

19. What optimizers does PyTorch provide and how do you choose between them?

An optimizer updates model parameters based on computed gradients. PyTorch provides all major optimizers in torch.optim. Choosing the right optimizer and tuning its hyperparameters has a large impact on training speed and final performance.

Common PyTorch optimizers

| Optimizer | Class | Key parameters | Best for |
|---|---|---|---|
| SGD | optim.SGD | lr, momentum, weight_decay, nesterov | Image classification (with momentum); can generalise better than Adam |
| SGD + Momentum | optim.SGD(momentum=0.9) | momentum=0.9 standard | Most vision tasks |
| Adam | optim.Adam | lr=1e-3, betas=(0.9,0.999), eps=1e-8 | Default choice; fast convergence |
| AdamW | optim.AdamW | lr, weight_decay (decoupled) | Fine-tuning transformers; weight decay applied directly to the weights |
| RMSprop | optim.RMSprop | lr, alpha=0.99 | RNNs |
| Adagrad | optim.Adagrad | lr | Sparse features, NLP |
import torch, torch.nn as nn, torch.optim as optim

model = nn.Linear(10, 1)

# SGD with momentum (common for vision)
sgd = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4,   # L2 regularisation
    nesterov=True,
)

# Adam (default for most tasks)
adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,      # NOTE: Adam's weight_decay is coupled L2 (added to the gradient)
)

# AdamW: decoupled weight decay (applied directly to the weights)
adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01,   # decoupled from gradient update
)

# Per-group learning rates (the same mechanism is used for fine-tuning)
optimizer = optim.AdamW([
    {"params": model.weight, "lr": 1e-4},   # lower lr, as you would use for pretrained layers
    {"params": model.bias,   "lr": 1e-3},   # higher lr, as you would use for a new head
], weight_decay=0.01)

# Standard training step
optimizer.zero_grad()
loss = nn.MSELoss()(model(torch.randn(8,10)), torch.randn(8,1))
loss.backward()
optimizer.step()

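Adam vs AdamW: In standard Adam, weight_decay is implemented as L2 regularisation: the term weight_decay × w is added to the gradient and is then rescaled by the adaptive per-parameter step size, which distorts the intended amount of decay. AdamW instead shrinks the weights directly, separately from the gradient-based update. With adaptive optimizers this decoupled weight decay is not equivalent to L2 regularisation; it behaves more predictably and is now the standard for transformer fine-tuning.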
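
The difference shows up in a single optimizer step. In the sketch below, the loss gradient is set to zero by hand so that only weight decay moves the parameters; the values lr=0.1, weight_decay=0.1 and the two starting weights are assumptions picked to make the arithmetic easy to follow:

```python
import torch

def one_step(opt_cls):
    """Run one optimizer step in which weight decay is the only active force."""
    p = torch.nn.Parameter(torch.tensor([1.0, 100.0]))
    opt = opt_cls([p], lr=0.1, weight_decay=0.1)
    p.grad = torch.zeros_like(p)       # zero loss gradient
    opt.step()
    return p.detach().clone()

adam_p = one_step(torch.optim.Adam)    # decay goes through Adam's normalisation
adamw_p = one_step(torch.optim.AdamW)  # decay multiplies weights by (1 - lr * weight_decay)

print("Adam :", adam_p)    # both weights move by about lr = 0.1, whatever their size
print("AdamW:", adamw_p)   # both weights shrink by the same 1% factor
```

With Adam the small and the large weight lose the same absolute amount, because the decay term is normalised like any other gradient; with AdamW each weight shrinks in proportion to its size.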

What is the key difference between Adam and AdamW?
When might SGD with momentum outperform Adam for a vision model?

20. What are learning rate schedulers in PyTorch and how do you use them?

A learning rate (LR) scheduler adjusts the learning rate during training. Starting with a high LR enables fast early progress; decaying it later allows finer convergence. PyTorch provides many schedulers in torch.optim.lr_scheduler.

Common LR schedulers

| Scheduler | Behaviour | Use case |
|---|---|---|
| StepLR | Multiply lr by gamma every step_size epochs | Simple decay; quick experiments |
| MultiStepLR | Decay at specific epoch milestones | ResNet training schedules |
| CosineAnnealingLR | Cosine curve from lr to eta_min | Most modern training runs |
| OneCycleLR | Warmup to max_lr then cosine decay | Super-convergence; fast training |
| ReduceLROnPlateau | Reduce lr when metric stops improving | When training time is unknown |
| LinearLR | Linear warm-up | Transformer fine-tuning |
| CosineAnnealingWarmRestarts | Cosine + periodic restarts (SGDR) | Ensemble-style training |
import torch, torch.nn as nn, torch.optim as optim
from torch.optim import lr_scheduler

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

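# Constructor examples. In real code, attach ONE scheduler per optimizer,
# so each example below gets its own optimizer.
def new_optimizer():
    return optim.AdamW(model.parameters(), lr=1e-3)

# CosineAnnealingLR: a popular modern choice
scheduler_cos = lr_scheduler.CosineAnnealingLR(
    new_optimizer(), T_max=100, eta_min=1e-6
)

# OneCycleLR: great for fast training, stepped once per batch
scheduler_1c = lr_scheduler.OneCycleLR(
    new_optimizer(),
    max_lr=1e-2,
    steps_per_epoch=100,   # batches per epoch
    epochs=10,
)

# ReduceLROnPlateau: metric-driven
scheduler_plat = lr_scheduler.ReduceLROnPlateau(
    new_optimizer(), mode="min", factor=0.5, patience=5
)

# Standard usage in a training loop.
# train_one_epoch and validate are placeholders for your own functions.
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
for epoch in range(100):
    train_one_epoch(model, optimizer)   # forward + backward + optimizer.step()
    val_loss = validate(model)

    # --- Epoch-based scheduler: call AFTER the epoch's optimizer.step() calls ---
    scheduler.step()

    # --- A metric-based scheduler is stepped with the metric instead ---
    # scheduler_plat.step(val_loss)

    # --- OneCycleLR is stepped per batch, inside the batch loop ---
    # for batch in dataloader:
    #     optimizer.step()
    #     scheduler_1c.step()

    print(f"lr: {optimizer.param_groups[0]['lr']:.6f}")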

Key rule: call scheduler.step() after optimizer.step(). For OneCycleLR and other per-batch schedulers, call scheduler.step() inside the batch loop, not the epoch loop.
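
A small sketch of that ordering, using StepLR on a dummy parameter (the parameter, lr=1.0, step_size=2 and gamma=0.1 are illustrative assumptions), records the learning rate seen by each epoch:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))   # dummy parameter so the optimizer has something to hold
opt = torch.optim.SGD([param], lr=1.0)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.1)

lrs = []
for epoch in range(5):
    lrs.append(opt.param_groups[0]["lr"])    # lr used during this epoch
    opt.step()      # parameter update first (a no-op here, since no gradients exist)
    sched.step()    # then advance the schedule

print(lrs)   # approximately [1.0, 1.0, 0.1, 0.1, 0.01]
```

The learning rate drops by a factor of 10 after every second epoch; calling sched.step() before opt.step() would trigger a PyTorch warning and shift the schedule by one step.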

When should scheduler.step() be called relative to optimizer.step()?
Which scheduler is well-suited when you don't know how many epochs you'll train for?

21. What activation functions are commonly used in PyTorch and how do you choose between them?

Activation functions introduce non-linearity, allowing networks to model complex functions. PyTorch provides them as both nn.Module classes (for use in nn.Sequential) and functional calls in torch.nn.functional.

Common PyTorch activations

| Activation | nn class | Range | Typical use |
|---|---|---|---|
| ReLU | nn.ReLU() | [0, ∞) | Default for hidden layers: fast, avoids vanishing gradient for x>0 |
| LeakyReLU | nn.LeakyReLU(0.01) | (-∞, ∞) | Fixes ReLU's dying neuron problem |
| Sigmoid | nn.Sigmoid() | (0, 1) | Binary classification output layer |
| Tanh | nn.Tanh() | (-1, 1) | RNN hidden states (zero-centred) |
| Softmax | nn.Softmax(dim=-1) | (0,1), sums to 1 | Turning logits into probabilities at inference (for training, pass raw logits to CrossEntropyLoss, or LogSoftmax outputs to NLLLoss) |
| GELU | nn.GELU() | (-∞, ∞) | Transformers (BERT, GPT) |
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Module form — for use inside nn.Sequential / __init__
relu = nn.ReLU()
print(relu(x))   # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])

# Functional form — for use directly inside forward()
print(F.relu(x))
print(F.leaky_relu(x, negative_slope=0.01))
print(F.gelu(x))

# Using inside a model
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))   # functional — common in forward()
        return torch.sigmoid(self.fc2(x))  # binary output

# IMPORTANT: never apply softmax before CrossEntropyLoss
# CrossEntropyLoss = LogSoftmax + NLLLoss internally
logits = torch.randn(4, 10)             # raw scores, NOT softmaxed
loss_fn = nn.CrossEntropyLoss()
targets = torch.randint(0, 10, (4,))
loss = loss_fn(logits, targets)         # correct — pass raw logits!

Common mistake: applying Softmax before CrossEntropyLoss — the loss function already applies LogSoftmax internally, so double-softmaxing produces incorrect gradients and degraded training.

Why should you NOT apply nn.Softmax before passing logits to nn.CrossEntropyLoss?
Which activation function commonly used in transformer models like BERT and GPT is smoother than ReLU?

22. What loss functions does PyTorch provide and how do you choose the right one?

The loss function defines the training objective. PyTorch's torch.nn module provides loss classes for classification, regression, and more specialised tasks. Choosing the wrong loss for your task is one of the most common beginner mistakes.
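
One choice that is easy to get wrong is sigmoid + BCELoss versus BCEWithLogitsLoss. The sketch below uses a deliberately extreme logit (200.0, an assumption chosen to force saturation) to show where the two diverge; it relies on the documented behaviour that nn.BCELoss clamps its log outputs at -100:

```python
import torch
import torch.nn as nn

logit = torch.tensor([200.0])    # extremely confident "positive" prediction
target = torch.tensor([0.0])     # but the true label is negative

# Stable: works on the logit directly
stable = nn.BCEWithLogitsLoss()(logit, target)

# Naive: sigmoid(200) rounds to exactly 1.0 in float32, so log(1 - p) = log(0)
naive = nn.BCELoss()(torch.sigmoid(logit), target)

print(stable.item())   # 200.0: the loss keeps growing with the logit
print(naive.item())    # 100.0: log(0) is clamped to -100 inside BCELoss
```

For moderate logits the two agree; once the sigmoid saturates, the naive version loses both the true loss value and a useful gradient.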

Common PyTorch loss functions

| Loss | Class | Input shape | Use case |
|---|---|---|---|
| MSELoss | nn.MSELoss() | pred & target same shape | Regression |
| L1Loss | nn.L1Loss() | pred & target same shape | Regression, robust to outliers |
| CrossEntropyLoss | nn.CrossEntropyLoss() | logits (N,C), target (N,) int64 | Multi-class classification |
| BCELoss | nn.BCELoss() | probabilities (N,), target (N,) float | Binary classification (after sigmoid) |
| BCEWithLogitsLoss | nn.BCEWithLogitsLoss() | raw logits (N,), target (N,) float | Binary classification (numerically stable) |
| NLLLoss | nn.NLLLoss() | log-probabilities (N,C) | Used after LogSoftmax manually |
import torch
import torch.nn as nn

# ── Regression: MSE
mse = nn.MSELoss()
pred = torch.tensor([2.5, 3.0, 4.1])
target = torch.tensor([3.0, 3.0, 4.0])
loss = mse(pred, target)   # mean((pred-target)^2)

# ── Multi-class classification: CrossEntropyLoss
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 5)         # batch=8, 5 classes — RAW logits
targets = torch.randint(0, 5, (8,))  # class indices, dtype long
loss = ce(logits, targets)

# ── Binary classification: BCEWithLogitsLoss (preferred over BCELoss)
bce = nn.BCEWithLogitsLoss()       # combines Sigmoid + BCE, numerically stable
logits_binary = torch.randn(8, 1)
targets_binary = torch.randint(0, 2, (8, 1)).float()
loss = bce(logits_binary, targets_binary)

# ── Class-weighted CrossEntropy for imbalanced data
class_weights = torch.tensor([1.0, 1.0, 5.0, 1.0, 1.0])  # upweight class 2
ce_weighted = nn.CrossEntropyLoss(weight=class_weights)

# ── Custom loss function
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma
        self.ce = nn.CrossEntropyLoss(reduction="none")

    def forward(self, logits, targets):
        ce_loss = self.ce(logits, targets)
        pt = torch.exp(-ce_loss)
        focal = ((1 - pt) ** self.gamma * ce_loss).mean()
        return focal
Why is nn.BCEWithLogitsLoss preferred over manually applying nn.Sigmoid() followed by nn.BCELoss()?
What dtype and shape does CrossEntropyLoss expect for the target tensor in a 10-class classification problem with batch size 16?

23. What optimizers does PyTorch provide and what is the difference between SGD, Adam, and AdamW?

Optimizers update model parameters based on computed gradients. PyTorch's torch.optim module provides many algorithms; understanding their differences helps you choose the right one and tune hyperparameters effectively.

Common PyTorch optimizers

| Optimizer | Key idea | Typical lr | Best for |
|---|---|---|---|
| SGD | Plain gradient descent, optional momentum | 0.01–0.1 | Image classification (with momentum + schedule) |
| SGD + momentum | Accumulates velocity to smooth updates | 0.01–0.1 | Often best final generalisation |
| Adam | Adaptive per-parameter learning rates + momentum | 1e-3 | Fast convergence, good default |
| AdamW | Adam with decoupled weight decay | 1e-3 to 5e-5 | Fine-tuning transformers, modern default |
| RMSprop | Adaptive lr based on recent gradient magnitude | 1e-3 | RNNs (historically popular) |
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# ── SGD with momentum
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,        # accelerates in consistent gradient directions
    weight_decay=1e-4,   # L2 regularisation
)

# ── Adam — adaptive learning rate per parameter
opt_adam = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),  # decay rates of the first- and second-moment estimates
    eps=1e-8,
)

# ── AdamW — decoupled weight decay (recommended for fine-tuning)
opt_adamw = optim.AdamW(
    model.parameters(),
    lr=2e-5,              # typical for fine-tuning pretrained models
    weight_decay=0.01,
)

# ── Standard training step
x, y = torch.randn(16, 10), torch.randn(16, 1)
loss_fn = nn.MSELoss()

opt_adamw.zero_grad()       # 1. clear old gradients
pred = model(x)              # 2. forward pass
loss = loss_fn(pred, y)      # 3. compute loss
loss.backward()              # 4. backpropagate
opt_adamw.step()             # 5. update parameters

# ── Different learning rates per parameter group
optimizer = optim.AdamW([
    {"params": model.weight, "lr": 1e-3},
    {"params": model.bias,   "lr": 1e-4},
])
What is the key difference between Adam and AdamW?
What is the correct order of operations in a standard PyTorch training step?

24. What is the standard PyTorch training loop and what does each step do?

The PyTorch training loop follows a fixed five-step pattern repeated for every batch. Understanding exactly what each line does — and what happens if you skip or reorder a step — is essential for debugging training issues.
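
The most common consequence of skipping a step is gradient accumulation when zero_grad() is omitted. A minimal sketch on a two-element tensor (the tensor and the toy loss are illustrative assumptions) shows .grad growing across backward() calls:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()        # d/dw sum(w^2) = 2w
first = w.grad.clone()           # [2, 4]

(w ** 2).sum().backward()        # no zeroing in between: gradients are ADDED
second = w.grad.clone()          # [4, 8]

w.grad.zero_()                   # what optimizer.zero_grad() achieves for each parameter
(w ** 2).sum().backward()
third = w.grad.clone()           # back to [2, 4]

print(first, second, third)
```

Without the reset, every optimizer step would use the sum of all past gradients, which usually makes training diverge.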

import torch
import torch.nn as nn
import torch.optim as optim

model     = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
loss_fn   = nn.CrossEntropyLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def train_one_epoch(model, loader, optimizer, loss_fn, device):
    model.train()                       # 0. enables Dropout, BatchNorm train mode
    total_loss = 0.0

    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        optimizer.zero_grad()           # 1. clear gradients from previous step
        logits = model(X_batch)         # 2. forward pass
        loss   = loss_fn(logits, y_batch)  # 3. compute loss
        loss.backward()                 # 4. backpropagate — fills .grad
        optimizer.step()                # 5. update weights using gradients

        total_loss += loss.item() * X_batch.size(0)

    return total_loss / len(loader.dataset)

@torch.no_grad()                       # disable gradient tracking for eval
def validate(model, loader, loss_fn, device):
    model.eval()                        # disables Dropout, BatchNorm uses running stats
    total_loss, correct = 0.0, 0

    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        logits = model(X_batch)
        loss   = loss_fn(logits, y_batch)
        total_loss += loss.item() * X_batch.size(0)
        correct    += (logits.argmax(1) == y_batch).sum().item()

    return total_loss / len(loader.dataset), correct / len(loader.dataset)

# Full training loop (train_loader and val_loader are DataLoader objects built for your data)
for epoch in range(10):
    train_loss        = train_one_epoch(model, train_loader, optimizer, loss_fn, device)
    val_loss, val_acc  = validate(model, val_loader, loss_fn, device)
    print(f"Epoch {epoch}: train_loss={train_loss:.4f} val_loss={val_loss:.4f} val_acc={val_acc:.4f}")
Training loop steps

| Step | Call | Purpose |
|---|---|---|
| 0 | model.train() | Enable Dropout, set BatchNorm to use batch statistics |
| 1 | optimizer.zero_grad() | Clear accumulated gradients from the previous step |
| 2 | model(x) | Forward pass: compute predictions |
| 3 | loss_fn(pred, target) | Compute scalar loss |
| 4 | loss.backward() | Backpropagate: populate .grad on each parameter |
| 5 | optimizer.step() | Update parameters using gradients and the optimizer's rule |
What would happen if you forgot to call optimizer.zero_grad() before loss.backward() in every training iteration?
What is the difference between model.train() and model.eval()?

25. What are Dataset and DataLoader in PyTorch and how do they work together?

PyTorch's data pipeline follows a clean two-class design: Dataset defines how to access a single sample (index → data), and DataLoader wraps a Dataset to handle batching, shuffling, and parallel loading.
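
As a minimal sketch of that division of labour, the snippet below uses the built-in TensorDataset on ten made-up samples (the sizes are assumptions for illustration) and shows how the DataLoader alone decides the batching:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.arange(20, dtype=torch.float32).reshape(10, 2)   # 10 samples, 2 features
labels = torch.arange(10)
dataset = TensorDataset(features, labels)      # Dataset: index -> (features, label)

keep_last = DataLoader(dataset, batch_size=4, shuffle=False)
drop_last = DataLoader(dataset, batch_size=4, shuffle=False, drop_last=True)

sizes_keep = [xb.shape[0] for xb, yb in keep_last]
sizes_drop = [xb.shape[0] for xb, yb in drop_last]

print(len(dataset), sizes_keep, sizes_drop)   # 10 [4, 4, 2] [4, 4]
```

The Dataset never changes; only the DataLoader arguments determine how many batches there are and whether the incomplete final batch is kept.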

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

class TabularDataset(Dataset):
    def __init__(self, X: np.ndarray, y: np.ndarray):
        # Convert once at construction — not inside __getitem__!
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self) -> int:
        """Required — tells DataLoader how many samples exist."""
        return len(self.X)

    def __getitem__(self, idx: int):
        """Required — return a single (features, label) sample."""
        return self.X[idx], self.y[idx]

# Synthetic data
X = np.random.randn(1000, 20).astype(np.float32)
y = np.random.randint(0, 3, size=1000)

dataset = TabularDataset(X, y)
print(len(dataset))          # 1000
print(dataset[0])            # (tensor of 20 features, tensor scalar label)

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,            # shuffle each epoch — essential for training
    num_workers=4,           # parallel data loading subprocesses
    pin_memory=True,         # faster CPU→GPU transfer
    drop_last=True,          # drop incomplete final batch
)

# Iterate over batches
for X_batch, y_batch in loader:
    print(X_batch.shape, y_batch.shape)  # (32, 20) (32,)
    break

# torchvision pre-built datasets
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
mnist_loader = DataLoader(mnist, batch_size=64, shuffle=True)
Dataset and DataLoader responsibilities

| Component | Responsibility | Required methods |
|---|---|---|
| Dataset | Defines how to access ONE sample by index | __len__, __getitem__ |
| DataLoader | Batches samples, shuffles, parallelises loading | Wraps any Dataset object |
What two methods must a custom PyTorch Dataset class implement?
Why is shuffle=True important when creating a DataLoader for training (but typically False for validation)?

26. How do you move tensors and models between CPU and GPU in PyTorch?

PyTorch's device abstraction allows the same code to run on CPU or GPU with minimal changes. The fundamental rule: a model and its input tensors must reside on the same device before any computation, or PyTorch raises a RuntimeError.

import torch
import torch.nn as nn

# Device-agnostic pattern — always write code this way
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move a model to the device
model = nn.Linear(10, 1).to(device)

# Move data to the same device, every batch, inside the loop (loader is your DataLoader)
for X_batch, y_batch in loader:
    X_batch = X_batch.to(device, non_blocking=True)
    y_batch = y_batch.to(device, non_blocking=True)
    pred = model(X_batch)   # works — both on same device

# WRONG — mismatched devices raises RuntimeError
# model_cpu = nn.Linear(10, 1)              # stays on CPU
# x_gpu = torch.randn(4, 10).to("cuda")
# model_cpu(x_gpu)  # RuntimeError: Expected all tensors on same device

# Checking tensor device
t = torch.randn(3)
print(t.device)            # cpu
t_gpu = t.cuda()           # or t.to("cuda:0")
print(t_gpu.device)        # cuda:0

# GPU memory diagnostics
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
    print(torch.cuda.max_memory_allocated() / 1e9, "GB peak")
    torch.cuda.empty_cache()   # release unused cached memory

# Moving a tensor back to CPU (required before .numpy())
result = t_gpu.cpu().numpy()   # numpy() requires a CPU tensor

# Apple Silicon (M1/M2/M3) GPU support
mps_device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
Device transfer methods

| Method | Effect |
|---|---|
| tensor.to(device) | Moves to specified device; most flexible, recommended |
| tensor.cuda() | Shorthand for .to('cuda') |
| tensor.cpu() | Moves back to CPU (required before .numpy()) |
| model.to(device) | Moves all model parameters and buffers |
| non_blocking=True | Allows async transfer when paired with pin_memory=True |
What happens if you try to run a forward pass with a model on the GPU but input tensors still on the CPU?
Why must you call .cpu() on a tensor before calling .numpy() on it?

27. What is the difference between model.parameters() and model.state_dict() in PyTorch?

Both expose a model's tensors, but they serve different purposes. parameters() returns an iterator over the model's nn.Parameter objects (this is what you pass to the optimizer); state_dict() returns an OrderedDict mapping names to tensors, covering the parameters and also persistent buffers such as BatchNorm running statistics (used for saving/loading and inspection).
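
One practical difference is that state_dict() also contains buffers, which are not parameters and are never seen by the optimizer. A short sketch on a single BatchNorm layer (chosen here only because it has buffers) makes this visible:

```python
import torch.nn as nn

bn = nn.BatchNorm1d(4)

param_names = [name for name, _ in bn.named_parameters()]
buffer_names = [name for name, _ in bn.named_buffers()]
state_keys = list(bn.state_dict().keys())

print(param_names)    # ['weight', 'bias']
print(buffer_names)   # ['running_mean', 'running_var', 'num_batches_tracked']
print(state_keys)     # parameters first, then buffers
```

This is why a checkpoint built only from parameters() would lose the running statistics that BatchNorm needs in eval mode.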

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 1),
)

# ── parameters(): iterator of Parameter tensors (no names)
for p in model.parameters():
    print(p.shape, p.requires_grad)
# torch.Size([20, 10]) True
# torch.Size([20])     True
# torch.Size([1, 20])  True
# torch.Size([1])      True

# Used to construct optimizers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── named_parameters(): iterator of (name, Parameter) tuples
for name, p in model.named_parameters():
    print(name, p.shape)
# 0.weight torch.Size([20, 10])
# 0.bias   torch.Size([20])
# 2.weight torch.Size([1, 20])
# 2.bias   torch.Size([1])

# ── state_dict(): OrderedDict for save/load
sd = model.state_dict()
print(type(sd))           # <class 'collections.OrderedDict'>
print(sd.keys())          # odict_keys(['0.weight', '0.bias', '2.weight', '2.bias'])

# Saving and loading via state_dict (the recommended pattern)
torch.save(model.state_dict(), "model_weights.pt")

new_model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
new_model.load_state_dict(torch.load("model_weights.pt"))
new_model.eval()           # ALWAYS call after loading for inference

# Total parameter count
total_params = sum(p.numel() for p in model.parameters())
trainable    = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params}, Trainable: {trainable}")
What is the primary use of model.state_dict() compared to model.parameters()?
What should you always call after loading weights into a model with load_state_dict(), before running inference?

28. How do you save and load PyTorch models correctly, including full training checkpoints?

PyTorch supports saving either the full model object or just its weights (state_dict). Saving only the state_dict is the recommended approach because it decouples weights from the Python class definition. A full training checkpoint includes the optimizer state too, so training can resume exactly where it left off.
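
A quick way to convince yourself that a state_dict round trip is lossless is to compare outputs before and after. This sketch saves to an in-memory io.BytesIO buffer instead of a file (an assumption made only to keep the example free of disk side effects):

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)

buffer = io.BytesIO()                       # in-memory stand-in for a file on disk
torch.save(model.state_dict(), buffer)
buffer.seek(0)                              # rewind before reading

restored = nn.Linear(10, 5)                 # same architecture, different random weights
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
restored.eval()

x = torch.randn(3, 10)
with torch.no_grad():
    same_output = torch.equal(model(x), restored(x))

print(same_output)   # True
```

The restored model reproduces the original outputs exactly, even though it was constructed with different random weights.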

import torch
import torch.nn as nn

model     = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ── RECOMMENDED: save/load state_dict only
torch.save(model.state_dict(), "weights.pt")

model_new = nn.Linear(10, 5)        # must define the SAME architecture first
model_new.load_state_dict(torch.load("weights.pt"))
model_new.eval()                     # always call before inference

# ── NOT recommended: save the entire model object
# Fragile — breaks if the class definition moves or changes
torch.save(model, "full_model.pt")
loaded_model = torch.load("full_model.pt", weights_only=False)

# ── Full training checkpoint — for resuming training
def save_checkpoint(path, epoch, model, optimizer, best_val_loss):
    torch.save({
        "epoch":            epoch,
        "model_state":      model.state_dict(),
        "optimizer_state":  optimizer.state_dict(),  # Adam momentum buffers etc.
        "best_val_loss":    best_val_loss,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")  # always load to CPU first
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"], ckpt["best_val_loss"]

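save_checkpoint("ckpt.pt", epoch=5, model=model, optimizer=optimizer, best_val_loss=0.42)
# Build the optimizer over the NEW model's parameters before restoring its state
optimizer_new = torch.optim.Adam(model_new.parameters(), lr=1e-3)
epoch, best_loss = load_checkpoint("ckpt.pt", model_new, optimizer_new)

# ── Loading on a different device than it was saved
model.load_state_dict(
    torch.load("weights.pt", map_location="cpu")  # works even if the original GPU is unavailable
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # then move to the desired device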
Why is saving model.state_dict() preferred over saving the entire model object with torch.save(model)?
Why does a full training checkpoint include the optimizer's state_dict, not just the model's?

29. What is overfitting and what regularization techniques does PyTorch support to address it?

Overfitting occurs when a model memorises the training data instead of learning generalisable patterns — visible as low training loss but high validation loss. PyTorch provides several built-in tools to combat overfitting.
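
Dropout is the technique whose train/eval behaviour most often surprises people. As a small sketch (p=0.5 and an all-ones input are assumptions chosen so the scaling is easy to read), the snippet below shows that surviving activations are scaled by 1/(1-p) in training and left untouched in eval mode:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)      # each element is either 0 or 1/(1-p) = 2.0

drop.eval()
eval_out = drop(x)       # identity in eval mode

train_values = set(train_out.unique().tolist())
print(train_values)                    # a subset of {0.0, 2.0}
print(torch.equal(eval_out, x))        # True
```

The rescaling keeps the expected activation the same in both modes, which is why no extra correction is needed at inference time.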

PyTorch regularization techniques

| Technique | How to apply | Effect |
|---|---|---|
| Dropout | nn.Dropout(p=0.5) layer | Randomly zeroes activations during training, preventing co-adaptation |
| Weight decay (L2) | optimizer weight_decay= parameter | Penalises large weights, encourages simpler models |
| Early stopping | Manual: track val_loss, stop when it plateaus | Prevents training past the point of generalisation |
| Data augmentation | torchvision.transforms | Increases effective dataset size and diversity |
| Batch Normalization | nn.BatchNorm1d/2d | Stabilises training; has a mild regularising side effect |
| Label smoothing | CrossEntropyLoss(label_smoothing=0.1) | Prevents overconfident predictions |
import torch
import torch.nn as nn

class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1  = nn.Linear(784, 256)
        self.bn1  = nn.BatchNorm1d(256)
        self.drop = nn.Dropout(p=0.5)        # 50% dropout
        self.fc2  = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        x = self.drop(x)                      # active in train(), off in eval()
        return self.fc2(x)

model = RegularizedNet()

# Weight decay: AdamW applies it as decoupled decay directly on the weights
# (SGD and plain Adam instead add it to the gradient as an L2 penalty)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2,    # penalise large weights
)

# Label smoothing — softens hard one-hot targets
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Early stopping pattern
best_val_loss = float("inf")
patience, patience_counter = 5, 0

for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, _ = validate(model, val_loader, criterion, device)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), "best_model.pt")  # save best checkpoint
    else:
        patience_counter += 1

    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch}")
        break
How does Dropout help prevent overfitting?
What is the purpose of weight_decay in an optimizer like AdamW?
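The train/eval behaviour of Dropout described in this section can be checked directly. The following is a minimal sketch on a vector of ones (the tensor size and seed are arbitrary choices for illustration): in training mode roughly half of the activations are zeroed and the survivors are rescaled by 1/(1-p), while in evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                          # training mode: dropout is active
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()
kept_values = y_train[y_train != 0]   # survivors are scaled by 1/(1-p) = 2.0

drop.eval()                           # evaluation mode: dropout is a no-op
y_eval = drop(x)

print(f"zeroed in train mode: {zero_fraction:.2f}")
print(f"value of kept units:  {kept_values.unique().tolist()}")
print(f"eval output unchanged: {torch.equal(y_eval, x)}")
```

The rescaling keeps the expected activation the same in both modes, which is why no extra correction is needed at inference time.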

30. What is the vanishing/exploding gradient problem and how do you detect and fix it in PyTorch?

During backpropagation, gradients are computed via repeated multiplication through the chain rule. In deep networks, this can cause gradients to shrink toward zero (vanishing) or grow toward infinity (exploding) as they propagate backward through many layers, preventing effective training.

import torch
import torch.nn as nn

model     = nn.LSTM(input_size=10, hidden_size=128, num_layers=3, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 20, 10)
output, _ = model(x)
loss = output.sum()

optimizer.zero_grad()
loss.backward()

# ── Detect: monitor gradient norms
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.norm(2).item() ** 2   # .data is legacy; .grad works directly
total_norm = total_norm ** 0.5
print(f"Gradient norm: {total_norm:.4f}")
# Very small (~1e-6) → vanishing; very large (~1e3+) → exploding

# ── Fix 1: Gradient clipping — caps the gradient norm before the step
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# ── Fix 2: Better weight initialisation (He init for ReLU networks)
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

mlp = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
mlp.apply(init_weights)

# ── Fix 3: Batch Normalization — stabilises layer input distributions
class StableNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

# ── Fix 4: Residual / skip connections — gradient highway
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return x + self.net(x)   # gradient flows through x directly
What does nn.utils.clip_grad_norm_() do and when should it be called?
How do residual (skip) connections help mitigate vanishing gradients in very deep networks?
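To make the effect of skip connections concrete, here is a small experiment, offered as a sketch rather than a benchmark. The helper name `first_layer_grad_norm`, the depth of 30, and the width of 32 are assumptions chosen for illustration. It builds the same stack of sigmoid layers twice, once plain and once with an identity shortcut around each layer, and compares the gradient norm that reaches the first layer.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(use_residual: bool, depth: int = 30, dim: int = 32) -> float:
    torch.manual_seed(0)                       # same weights and inputs for both runs
    layers = [nn.Linear(dim, dim) for _ in range(depth)]
    h = torch.randn(16, dim)
    for layer in layers:
        out = torch.sigmoid(layer(h))
        h = h + out if use_residual else out   # skip connection adds an identity path
    h.sum().backward()
    return layers[0].weight.grad.norm().item()

plain_norm = first_layer_grad_norm(use_residual=False)
residual_norm = first_layer_grad_norm(use_residual=True)
print(f"plain sigmoid stack  : {plain_norm:.3e}")
print(f"with skip connections: {residual_norm:.3e}")
```

In the plain stack every layer multiplies the gradient by a small Jacobian, so the first layer receives almost nothing; with the shortcut, each layer's Jacobian contains an identity term that lets the gradient pass through largely intact.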

31. What is weight initialization in PyTorch and why does it matter?

How a network's weights are initialised at the start of training significantly affects whether training converges quickly, slowly, or not at all. PyTorch's default initialisation for Linear/Conv layers (Kaiming uniform called with a=sqrt(5), which works out to U(-1/sqrt(fan_in), 1/sqrt(fan_in))) works well in most cases, but understanding the principles helps when debugging training issues.

import torch
import torch.nn as nn

# PyTorch default: Linear layers use kaiming_uniform_ with a=sqrt(5),
# i.e. weights ~ U(-1/sqrt(fan_in), 1/sqrt(fan_in))
layer = nn.Linear(256, 128)
print(layer.weight.std().item())   # approximately 1/sqrt(3*256) ≈ 0.036

# Explicit initialisation methods
def init_weights(m):
    if isinstance(m, nn.Linear):
        # Xavier/Glorot — good for Tanh/Sigmoid activations
        nn.init.xavier_uniform_(m.weight)

        # He/Kaiming: good for ReLU-family activations (explicit ReLU gain; not identical to the a=sqrt(5) default)
        # nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")

        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
model.apply(init_weights)   # applies init_weights to every sub-module

# Why initialisation matters: too small → vanishing activations
# too large → exploding activations, especially in deep nets
x = torch.randn(100, 784)
for m in model:                      # use a new name so `layer` above is not shadowed
    x = m(x)
    if hasattr(m, "weight"):
        print(f"{m}: activation std={x.std().item():.4f}")
# With good init, std should stay roughly stable across layers

# Custom initialisation from scratch
with torch.no_grad():
    layer.weight.normal_(mean=0.0, std=0.02)  # common for transformer init
    layer.bias.zero_()
Initialization strategies
Method | Formula (roughly) | Best for
Xavier/Glorot | Var = 2/(fan_in+fan_out) | Tanh, Sigmoid activations
Kaiming/He | Var = 2/fan_in | ReLU, LeakyReLU activations
PyTorch default for Linear (kaiming_uniform_ with a=sqrt(5)) | Var = 1/(3·fan_in) | Reasonable starting point when nothing else is specified
Zero init | All weights = 0 | NEVER for weights (fails to break symmetry); OK for biases
Small normal (std≈0.02) | N(0, 0.02²) | Transformer architectures (BERT, GPT)
Why should weights never be initialised to all zeros in a neural network?
Why does Kaiming/He initialisation use variance 2/fan_in specifically, tuned for ReLU?
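The variance formulas in the table can be verified empirically. The sketch below (layer sizes are arbitrary) measures the standard deviation of the weights produced by the built-in default, by explicit Kaiming initialisation with the ReLU gain, and by Xavier initialisation, and compares each with its closed-form value.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 256, 128

default_layer = nn.Linear(fan_in, fan_out)          # PyTorch's built-in default init
kaiming_layer = nn.Linear(fan_in, fan_out)
nn.init.kaiming_uniform_(kaiming_layer.weight, nonlinearity="relu")
xavier_layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_uniform_(xavier_layer.weight)

measured = {
    "default": default_layer.weight.std().item(),
    "kaiming": kaiming_layer.weight.std().item(),
    "xavier":  xavier_layer.weight.std().item(),
}
expected = {
    "default": 1 / math.sqrt(3 * fan_in),            # U(-1/sqrt(fan_in), 1/sqrt(fan_in))
    "kaiming": math.sqrt(2 / fan_in),                # Var = 2/fan_in
    "xavier":  math.sqrt(2 / (fan_in + fan_out)),    # Var = 2/(fan_in+fan_out)
}
for name in measured:
    print(f"{name:8s} measured={measured[name]:.4f} expected={expected[name]:.4f}")
```

Note that the built-in default is noticeably smaller than explicit He initialisation for ReLU, so calling `kaiming_uniform_(..., nonlinearity="relu")` is a real change, not a no-op.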

32. What is the difference between nn.Parameter and a regular tensor attribute in nn.Module?

nn.Parameter is a special tensor subclass that, when assigned as an attribute of an nn.Module, is automatically registered in the module's parameter list — meaning it appears in model.parameters(), gets moved by .to(device), and is saved in state_dict(). A plain tensor attribute does none of this.

import torch
import torch.nn as nn

class CustomLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()

        # nn.Parameter — automatically registered, tracked, trained
        self.weight = nn.Parameter(torch.randn(dim, dim))
        self.bias   = nn.Parameter(torch.zeros(dim))

        # Plain tensor — NOT registered, NOT trained, invisible to optimizer
        self.scale  = torch.tensor(2.0)  # WRONG if meant to be learnable!

        # register_buffer — for non-trainable state that SHOULD move with
        # the model and be saved (e.g. BatchNorm running mean/var)
        self.register_buffer("running_mean", torch.zeros(dim))

    def forward(self, x):
        return x @ self.weight + self.bias

layer = CustomLayer(10)

# Check what appears in parameters()
for name, p in layer.named_parameters():
    print(name, p.shape)
# weight torch.Size([10, 10])
# bias   torch.Size([10])
# scale and running_mean do NOT appear here!

# Check state_dict — includes parameters AND buffers, but not plain tensors
print(layer.state_dict().keys())
# odict_keys(['weight', 'bias', 'running_mean'])

# .to(device) moves Parameters and registered buffers, but NOT plain tensor attrs
if torch.cuda.is_available():
    layer.to("cuda")
# layer.scale would STILL be on CPU — a common silent bug!
Attribute types in nn.Module
Attribute type | In parameters()? | In state_dict()? | Moved by .to(device)? | Trained by optimizer?
nn.Parameter | Yes | Yes | Yes | Yes
register_buffer tensor | No | Yes | Yes | No
Plain tensor attribute | No | No | No (silent bug risk!) | No
What happens if you assign a plain torch.Tensor (not nn.Parameter) as an attribute of an nn.Module meant to be learnable?
When should you use register_buffer() instead of nn.Parameter?

33. How do you implement and use learning rate schedulers in PyTorch?

A fixed learning rate throughout training is rarely optimal — too high late in training prevents fine convergence, while too low early on wastes time. PyTorch's torch.optim.lr_scheduler module adjusts the learning rate systematically as training progresses.

import torch
import torch.nn as nn
import torch.optim as optim

model     = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# ── StepLR: multiply lr by gamma every step_size epochs
scheduler_step = optim.lr_scheduler.StepLR(
    optimizer, step_size=10, gamma=0.1
)

# ── CosineAnnealingLR: smooth decay following a cosine curve
scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6
)

# ── ReduceLROnPlateau: reduce lr when a metric stops improving
scheduler_plateau = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

# ── OneCycleLR: warmup then decay — fast convergence ("super-convergence")
n_epochs, steps_per_epoch = 10, 100
scheduler_1cycle = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    total_steps=n_epochs * steps_per_epoch,
    pct_start=0.3,    # 30% of steps used for warmup
)

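# ── Training loop with scheduler
# NOTE: the schedulers above share one optimizer for illustration only;
# in real code attach exactly ONE of them, otherwise they overwrite each other's lr
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer, loss_fn, device)
    val_loss, _ = validate(model, val_loader, loss_fn, device)

    scheduler_cos.step()                # epoch-based scheduler: call once per epoch
    # scheduler_plateau.step(val_loss)  # metric-based alternative: pass the metric value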

    current_lr = optimizer.param_groups[0]["lr"]
    print(f"Epoch {epoch}: lr={current_lr:.6f}")

# Note: OneCycleLR and some schedulers are called PER BATCH, not per epoch
# for step in range(total_steps):
#     train_step(...)
#     scheduler_1cycle.step()  # called inside the batch loop
Common LR schedulers
Scheduler | Behaviour | Call frequency
StepLR | Multiply lr by gamma every N epochs | Per epoch
CosineAnnealingLR | Smooth cosine decay | Per epoch
ReduceLROnPlateau | Reduce lr when validation metric plateaus | Per epoch, after computing metric
OneCycleLR | Warmup then decay in one cycle | Per batch/step
LinearLR / warmup schedules | Linear ramp from low to target lr | Per step, common for transformers
What is the key difference between ReduceLROnPlateau and most other PyTorch schedulers?
Why is OneCycleLR typically stepped once per batch rather than once per epoch?
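A quick way to understand any scheduler is to trace the learning rate it produces without doing real training. The sketch below does this for StepLR with deliberately small, illustrative settings (step_size=2, gamma=0.1); the bare `optimizer.step()` call merely stands in for an epoch of training so that the optimizer and scheduler are stepped in the expected order.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])  # lr used during this epoch
    optimizer.step()      # stands in for a full epoch of training
    scheduler.step()      # advance the schedule once per epoch

print(lrs)
```

The trace shows the learning rate holding for two epochs and then dropping by a factor of ten. The same loop, with the scheduler swapped out, is a cheap sanity check before committing to a long run.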

34. How do you debug a PyTorch training loop where the loss is not decreasing or is NaN?

Diagnosing a stuck or diverging training loop is one of the most valuable practical PyTorch skills. The shape of the loss curve and a few targeted checks usually reveal the root cause.

Common training failure modes
Symptom | Likely cause | Fix
Loss is NaN from step 1 | Exploding gradients, bad data (inf/NaN inputs), lr too high | Check input data, add gradient clipping, lower lr
Loss never decreases | Vanishing gradients, lr too low, forgot optimizer.step() | Check gradient norms, raise lr, verify training loop order
Loss decreases then plateaus high | Model too small, lr too high for fine convergence | Increase capacity, add lr scheduler
Train loss low, val loss high | Overfitting | Add dropout, weight decay, more data, early stopping
Loss oscillates wildly | lr too high, batch size too small | Lower lr, increase batch size, use lr warmup
import torch
import torch.nn as nn

model     = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step, (X, y) in enumerate(loader):
    optimizer.zero_grad()
    logits = model(X)
    loss = criterion(logits, y)

    # ── Check 1: is the loss finite?
    if not torch.isfinite(loss):
        print(f"Step {step}: non-finite loss = {loss.item()}")
        print("Input contains NaN:", torch.isnan(X).any().item())
        print("Input contains Inf:", torch.isinf(X).any().item())
        break

    loss.backward()

    # ── Check 2: gradient norms — are gradients flowing at all?
    total_norm = sum(
        p.grad.norm().item() ** 2 for p in model.parameters() if p.grad is not None
    ) ** 0.5
    if step % 50 == 0:
        print(f"Step {step}: loss={loss.item():.4f} grad_norm={total_norm:.4f}")

    # ── Check 3: are any gradients None? (means that param was unused!)
    for name, p in model.named_parameters():
        if p.grad is None:
            print(f"WARNING: {name} has no gradient — is it used in forward()?")

    optimizer.step()

# ── Check 4: verify model output shape and range make sense
with torch.no_grad():
    sample_out = model(X[:1])
    print("Output range:", sample_out.min().item(), sample_out.max().item())

# ── Check 5: overfit a tiny batch — sanity check the architecture
# If the model cannot drive loss near zero on 5 examples, there is a bug
tiny_X, tiny_y = X[:5], y[:5]
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(tiny_X), tiny_y)
    loss.backward()
    optimizer.step()
print(f"Tiny-batch overfit loss: {loss.item():.6f}")  # should approach 0
If a training loss is NaN starting from the very first step, what should you check first?
What does the 'overfit a tiny batch' sanity check (training on 5 examples until loss ≈ 0) verify?
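A finite loss does not guarantee finite gradients. As a complement to the checks above (and not part of the original checklist), the sketch below uses PyTorch's anomaly-detection mode, `torch.autograd.detect_anomaly()`, on a deliberately contrived expression whose forward value is fine but whose backward pass evaluates 0/0. Anomaly mode is slow, so it is best treated as a temporary debugging aid.

```python
import torch

x = torch.tensor([0.0], requires_grad=True)

# Forward pass looks healthy: sqrt(0) * 0 = 0, a finite loss
loss = (torch.sqrt(x) * 0.0).sum()
loss.backward()
loss_value = loss.detach().clone()
nan_grad = x.grad.clone()              # backward of sqrt at 0 evaluates 0/0
print(loss_value.item(), nan_grad)

# Anomaly mode raises at the backward op that produced the NaN
caught = None
x.grad = None
try:
    with torch.autograd.detect_anomaly():
        loss = (torch.sqrt(x) * 0.0).sum()
        loss.backward()
except RuntimeError as err:
    caught = str(err)
print("anomaly mode raised:", caught is not None)
```

The error message names the backward function responsible, which usually points straight at the offending operation in forward().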

35. What is the difference between torch.tensor() and torch.Tensor() (capital T) for creating tensors?

This is a subtle but important PyTorch gotcha. torch.tensor() (lowercase, a function) infers dtype from the input data and copies it — the recommended way to create tensors from data. torch.Tensor() (uppercase, a class constructor) is an alias for torch.FloatTensor and behaves inconsistently depending on the argument type.

import torch

# ── torch.tensor() — RECOMMENDED, infers dtype, copies data
a = torch.tensor([1, 2, 3])
print(a.dtype)   # torch.int64 — inferred from Python ints

b = torch.tensor([1.0, 2.0, 3.0])
print(b.dtype)   # torch.float32 — inferred from Python floats

c = torch.tensor([1, 2, 3], dtype=torch.float32)  # explicit override
print(c.dtype)   # torch.float32

# ── torch.Tensor() — confusing, AVOID for creating tensors from data
d = torch.Tensor([1, 2, 3])
print(d.dtype)   # torch.float32: the default dtype (float32 unless changed), int input is NOT preserved!

e = torch.Tensor(3, 4)   # interprets ints as a SHAPE, not data!
print(e.shape)    # torch.Size([3, 4]) — uninitialised memory, random values

# Common gotcha: these look similar but behave VERY differently
f1 = torch.tensor(3)      # scalar tensor with value 3
f2 = torch.Tensor(3)      # tensor of SHAPE (3,) with garbage/uninitialised values!
print(f1)   # tensor(3)
print(f2)   # tensor([4.6e-41, 0.0, 1.4e-45])  — random uninitialised memory!

# Recommended explicit constructors for empty/typed tensors:
g = torch.empty(3, 4)               # uninitialised, explicit intent
h = torch.zeros(3, 4, dtype=torch.float32)
i = torch.ones(3, 4, dtype=torch.int64)

Rule of thumb: always use lowercase torch.tensor() when creating a tensor from existing data (a list, NumPy array, or scalar). Use torch.zeros(), torch.ones(), torch.empty(), or torch.rand() when you want a new tensor of a given shape. Avoid torch.Tensor() entirely in new code.

What is the critical difference between torch.tensor(3) and torch.Tensor(3)?
Which function should you use to create a PyTorch tensor from an existing Python list or NumPy array?
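The copy semantics of torch.tensor() matter most when the source is a NumPy array. The sketch below contrasts it with torch.from_numpy(), which shares memory with the array instead of copying; the variable names are illustrative only.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])        # NumPy defaults to float64

copied = torch.tensor(arr)             # independent copy; dtype follows the array
shared = torch.from_numpy(arr)         # zero-copy view of the same buffer

arr[0] = 100.0                         # mutate the NumPy array in place

print(copied.dtype)                    # torch.float64 (not float32)
print(copied[0].item(), shared[0].item())
```

Only the shared tensor sees the mutation. Also note that the float64 dtype of the array is preserved, so an explicit `dtype=torch.float32` is often wanted when feeding NumPy data to a model.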

36. How does gradient accumulation work in PyTorch and when would you use it?

Gradient accumulation simulates a larger effective batch size than fits in GPU memory by summing gradients over several smaller forward/backward passes before calling optimizer.step(). This is useful when training large models on limited GPU memory.

import torch
import torch.nn as nn

model     = nn.Linear(100, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Simulate effective batch_size=128 using micro_batch=32 (4 accumulation steps)
accumulation_steps = 4

optimizer.zero_grad()
for step, (X_micro, y_micro) in enumerate(loader):  # loader yields micro-batches
    logits = model(X_micro)
    loss = criterion(logits, y_micro)

    # CRITICAL: scale loss by 1/accumulation_steps before backward
    # so the accumulated gradient matches what a single large-batch
    # backward pass would have produced
    loss = loss / accumulation_steps
    loss.backward()           # gradients ACCUMULATE (not cleared)

    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()                          # update only every N micro-batches
        optimizer.zero_grad(set_to_none=True)     # clear for next accumulation cycle

# Effective batch size = micro_batch_size * accumulation_steps
# This trades more sequential forward/backward passes per update (same total compute) for lower peak memory usage
Gradient accumulation trade-offs
Aspect | Effect
GPU memory | Stays at micro-batch level; much lower peak usage
Wall-clock time | Slightly slower than one large batch (more Python overhead)
Effective batch size | micro_batch_size × accumulation_steps
BatchNorm caveat | Statistics computed per micro-batch, not the full effective batch, so behaviour can differ from true large-batch training
Why must the loss be divided by accumulation_steps before calling backward() in gradient accumulation?
What is the main motivation for using gradient accumulation in PyTorch?
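The claim that scaling the loss by 1/accumulation_steps reproduces the large-batch gradient can be checked numerically. This sketch assumes a mean-reduced loss and equally sized micro-batches (the sizes are arbitrary) and compares the gradient from one full-batch backward pass with the gradient accumulated over four micro-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 5)
criterion = nn.CrossEntropyLoss()          # default reduction="mean"
X = torch.randn(32, 20)
y = torch.randint(0, 5, (32,))

# Reference: one backward pass over the full batch of 32
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 8, each loss scaled by 1/4
accumulation_steps = 4
model.zero_grad()
for X_micro, y_micro in zip(X.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(X_micro), y_micro) / accumulation_steps
    loss.backward()                         # gradients add up in .grad
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Without the division the accumulated gradient would be four times larger, which is equivalent to silently multiplying the learning rate by four.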

37. What is mixed precision training in PyTorch and how do you implement it with torch.cuda.amp?

Mixed precision training runs most operations in FP16 (or BF16) for speed while keeping a master copy of weights in FP32 for numerical stability. Modern GPUs (Volta and later) have dedicated hardware (Tensor Cores) that make FP16 matrix multiplication significantly faster than FP32.

import torch
import torch.nn as nn
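from torch import autocast              # torch.autocast takes device_type=...; torch.cuda.amp.autocast does not
from torch.cuda.amp import GradScaler   # recent releases also provide torch.amp.GradScaler("cuda")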

model     = nn.Linear(1024, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler    = GradScaler()   # manages loss scaling to prevent FP16 underflow

x = torch.randn(256, 1024).cuda()
y = torch.randn(256, 512).cuda()

for step in range(100):
    optimizer.zero_grad()

    # autocast: automatically runs eligible ops in FP16/BF16
    with autocast(device_type="cuda", dtype=torch.float16):
        y_hat = model(x)                  # matmul runs in FP16 — faster!
        loss  = nn.MSELoss()(y_hat, y)

    # Loss scaling: inflate loss before backward to prevent small
    # gradients from underflowing to zero in FP16's limited range
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)            # restore original gradient magnitudes
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                # skips the step if grads are inf/NaN
    scaler.update()                       # adjusts scale factor for next iteration

# BFloat16: no GradScaler needed (same exponent range as FP32)
with autocast(device_type="cuda", dtype=torch.bfloat16):
    y_hat = model(x)   # no underflow risk — scaling unnecessary
What problem does GradScaler solve in FP16 mixed precision training?
Why doesn't BFloat16 require a GradScaler while Float16 does?
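The underflow problem that loss scaling addresses can be reproduced on the CPU with plain dtype conversions. In this sketch the value 1e-8 and the scale factor 65536 are illustrative assumptions, not values taken from GradScaler itself; the point is only that a number too small for FP16 survives when it is multiplied up before the conversion, and that BF16 can represent it without any scaling.

```python
import torch

tiny_grad = 1e-8                                   # a plausible very small gradient value

as_fp16 = torch.tensor(tiny_grad, dtype=torch.float16)
as_bf16 = torch.tensor(tiny_grad, dtype=torch.bfloat16)

# Multiplying in FP32 before the cast mimics the effect of scaling the loss
scale = 65536.0
scaled_fp16 = (torch.tensor(tiny_grad) * scale).to(torch.float16)
recovered = scaled_fp16.float() / scale            # unscale again in FP32

print("fp16:", as_fp16.item())                     # flushes to zero
print("bf16:", as_bf16.item())                     # survives, with reduced precision
print("fp16 after scaling, then unscaled:", recovered.item())
```

BF16 spends its bits on exponent range rather than precision, which is why it tolerates small gradients but rounds more coarsely.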

38. What is torch.compile() and how does it speed up PyTorch model execution?

Introduced in PyTorch 2.0, torch.compile() performs just-in-time compilation of a model. Instead of executing each tensor operation eagerly (PyTorch's default), it captures the computation graph, fuses operations, and generates optimised kernels — primarily reducing GPU memory round-trips.

import torch
import torch.nn as nn
import time

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.GELU(),
    nn.Linear(1024, 512),  nn.GELU(),
    nn.Linear(512, 10),
).cuda()

# Compile the model — wraps it, does NOT change the API
compiled_model = torch.compile(model)

x = torch.randn(256, 1024).cuda()

# First call triggers compilation (slow — may take 10-60 seconds)
out = compiled_model(x)

# Subsequent calls use the compiled, optimised kernels (fast)
for _ in range(5):
    out = compiled_model(x)

# Compilation modes — trade compile time for runtime speed
model_default = torch.compile(model)                              # balanced
model_reduce  = torch.compile(model, mode="reduce-overhead")      # less Python overhead
model_max     = torch.compile(model, mode="max-autotune")         # slowest compile, fastest run

# Benchmark comparison
def benchmark(fn, x, n=100):
    for _ in range(5): fn(x)            # warmup
    torch.cuda.synchronize()
    start = time.perf_counter()         # monotonic, high-resolution timer
    for _ in range(n): fn(x)
    torch.cuda.synchronize()
    return time.perf_counter() - start

eager_time    = benchmark(model, x)
compiled_time = benchmark(compiled_model, x)
print(f"Eager: {eager_time:.3f}s, Compiled: {compiled_time:.3f}s")
What is the primary technique torch.compile() uses to speed up model execution?
Why does the first call to a torch.compile()'d model take significantly longer than subsequent calls?

39. What is the difference between batch size, epoch, and iteration in PyTorch training?

These three terms are fundamental to understanding any training loop, and confusing them is a common source of bugs when computing metrics or setting up learning rate schedules.

Training terminology
Term | Definition | Example
Batch size | Number of samples processed together in one forward/backward pass | 32
Iteration (step) | One forward + backward + optimizer.step() call; processes one batch | 1 step = 1 batch processed
Epoch | One complete pass through the entire training dataset | 1 epoch = ceil(dataset_size / batch_size) iterations
import torch
from torch.utils.data import DataLoader, TensorDataset

# Example: 1000 training samples, batch size 32
X = torch.randn(1000, 20)
y = torch.randint(0, 5, (1000,))
dataset = TensorDataset(X, y)
loader  = DataLoader(dataset, batch_size=32, shuffle=True)

iterations_per_epoch = len(loader)   # = ceil(1000 / 32) = 32
print(f"Iterations per epoch: {iterations_per_epoch}")

n_epochs = 10
total_iterations = n_epochs * iterations_per_epoch
print(f"Total training iterations: {total_iterations}")  # 320

global_step = 0
for epoch in range(n_epochs):
    for batch_idx, (X_batch, y_batch) in enumerate(loader):
        # This inner loop body executes once PER ITERATION
        # X_batch.shape[0] == batch_size (32, except possibly the last batch)
        global_step += 1
        if global_step % 10 == 0:
            print(f"Epoch {epoch}, iteration {batch_idx}, global step {global_step}")

    print(f"--- Completed epoch {epoch} ---")  # runs once PER EPOCH

# Common pitfall: confusing scheduler.step() granularity
# Some schedulers (StepLR) expect ONE call per epoch
# Others (OneCycleLR) expect ONE call per iteration/step
# Mixing these up silently breaks the intended learning rate schedule
If a dataset has 10,000 samples and the batch size is 50, how many iterations occur in one epoch?
Why is it important to know whether a learning rate scheduler should be stepped per epoch or per iteration?
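The number of iterations per epoch depends on how the final, incomplete batch is handled. The sketch below uses the same 1000-sample, batch-size-32 setup as the example above and compares the default DataLoader with one created with `drop_last=True`.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 20), torch.randint(0, 5, (1000,)))

keep_last = DataLoader(dataset, batch_size=32)                  # drop_last=False (default)
drop_last = DataLoader(dataset, batch_size=32, drop_last=True)

print(len(keep_last), math.ceil(1000 / 32))    # 32 32
print(len(drop_last), 1000 // 32)              # 31 31

last_batch_size = [xb.size(0) for xb, _ in keep_last][-1]
print(last_batch_size)                          # 8 = 1000 - 31 * 32
```

This matters when computing total_steps for a per-step scheduler such as OneCycleLR: using `len(loader)` rather than a hand-computed value avoids off-by-one-batch mistakes.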

40. How do you compute and track evaluation metrics like accuracy during PyTorch training?

Tracking metrics correctly requires accumulating values across all batches (not just averaging per-batch metrics naively, which can be biased if the last batch has a different size) and ensuring computations happen without gradient tracking.

import torch
import torch.nn as nn

@torch.no_grad()   # disable gradient tracking for the entire evaluation function
def evaluate(model, loader, criterion, device):
    model.eval()                       # disable dropout, use BN running stats

    total_loss    = 0.0
    total_correct = 0
    total_samples = 0

    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        batch_size = X_batch.size(0)

        logits = model(X_batch)
        loss   = criterion(logits, y_batch)

        # Weight by batch_size — correct even if the last batch is smaller
        total_loss += loss.item() * batch_size

        preds = logits.argmax(dim=1)
        total_correct += (preds == y_batch).sum().item()
        total_samples += batch_size

    avg_loss = total_loss / total_samples
    accuracy = total_correct / total_samples
    return avg_loss, accuracy

# WRONG pattern — naively averaging per-batch averages
# is biased if batch sizes are unequal (e.g. last batch is smaller)
def evaluate_wrong(model, loader, criterion):
    losses = []
    for X_batch, y_batch in loader:
        loss = criterion(model(X_batch), y_batch)
        losses.append(loss.item())   # all batches weighted EQUALLY — wrong!
    return sum(losses) / len(losses)  # biased if last batch has fewer samples

# Using torchmetrics for more complex metrics (F1, precision, AUROC)
# pip install torchmetrics
from torchmetrics import Accuracy, F1Score

acc_metric = Accuracy(task="multiclass", num_classes=5).to(device)
f1_metric  = F1Score(task="multiclass", num_classes=5, average="macro").to(device)

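model.eval()
with torch.no_grad():
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)  # same device as the metrics
        preds = model(X_batch).argmax(dim=1)
        acc_metric.update(preds, y_batch)   # accumulates internally across batches
        f1_metric.update(preds, y_batch)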

print(f"Accuracy: {acc_metric.compute():.4f}")  # final correct aggregate
print(f"F1: {f1_metric.compute():.4f}")
Why is naively averaging per-batch loss values across an epoch potentially biased?
Why is the @torch.no_grad() decorator applied to an evaluation function?
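The size of the bias from averaging per-batch means is easy to see on a toy example. The per-sample losses below are made up for illustration: eight easy samples followed by two hard ones, split into batches of 4, 4, and 2.

```python
import torch

# Hypothetical per-sample losses: 8 easy samples followed by 2 hard ones
per_sample_loss = torch.tensor([1.0] * 8 + [5.0] * 2)
batches = per_sample_loss.split(4)              # batch sizes 4, 4, 2

# Naive: average the per-batch means, every batch counts equally
naive = sum(b.mean().item() for b in batches) / len(batches)

# Correct: weight each batch mean by its size, then divide by the sample count
weighted = sum(b.mean().item() * b.numel() for b in batches) / per_sample_loss.numel()

print(f"naive={naive:.3f} weighted={weighted:.3f} true={per_sample_loss.mean().item():.3f}")
```

The small last batch is over-weighted by the naive average (about 2.33 versus the true 1.8). With shuffled data the effect is usually much smaller, but the weighted form is correct in every case.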

41. What is the purpose of torch.manual_seed() and how do you ensure reproducibility in PyTorch?

PyTorch uses pseudo-random number generators for weight initialisation, dropout masks, data shuffling, and more. Setting seeds explicitly ensures experiments are reproducible — critical for debugging, comparing model variants fairly, and scientific rigor.

import torch
import numpy as np
import random
import os

def set_seed(seed: int = 42):
    """Set all relevant seeds for full reproducibility."""
    random.seed(seed)                       # Python's random module
    np.random.seed(seed)                    # NumPy
    torch.manual_seed(seed)                 # PyTorch CPU
    torch.cuda.manual_seed_all(seed)        # PyTorch all GPUs

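    # Force deterministic cuDNN kernels (may be slower!)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False   # disable auto-tuner (non-deterministic)

    # Only affects child processes started later; to control hashing in THIS
    # process, PYTHONHASHSEED must be set before the interpreter starts
    os.environ["PYTHONHASHSEED"] = str(seed)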

set_seed(42)

# Verify reproducibility
model1 = torch.nn.Linear(10, 5)
set_seed(42)
model2 = torch.nn.Linear(10, 5)
print(torch.equal(model1.weight, model2.weight))  # True — identical init

# DataLoader reproducibility — also needs a worker_init_fn for num_workers > 0
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

generator = torch.Generator()
generator.manual_seed(42)

from torch.utils.data import DataLoader
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,   # seeds each worker process
    generator=generator,           # seeds the shuffling order
)
Reproducibility checklist
Source of randomness | How to control it
Weight initialisation | torch.manual_seed(seed)
Dropout masks | Covered by torch.manual_seed (same RNG stream)
Data shuffling | DataLoader(generator=torch.Generator().manual_seed(seed))
Multi-worker DataLoader | worker_init_fn to seed each subprocess
GPU non-determinism | torch.backends.cudnn.deterministic = True; torch.use_deterministic_algorithms(True) for ops outside cuDNN
cuDNN auto-tuner | torch.backends.cudnn.benchmark = False
Why is torch.backends.cudnn.deterministic = True sometimes necessary in addition to torch.manual_seed()?
Why does a multi-worker DataLoader (num_workers > 0) require special handling for reproducibility beyond just calling torch.manual_seed()?
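The role of the dedicated torch.Generator passed to the DataLoader can be illustrated in isolation. In this sketch the helper `shuffled_order` is a hypothetical stand-in for the shuffling a DataLoader performs; it shows that a locally seeded generator yields the same permutation regardless of what the global RNG has been used for in between.

```python
import torch

def shuffled_order(seed: int, n: int = 10) -> list:
    g = torch.Generator()
    g.manual_seed(seed)                         # local RNG, independent of the global seed
    return torch.randperm(n, generator=g).tolist()

run_a = shuffled_order(42)
torch.rand(3)                                   # unrelated global RNG use in between
run_b = shuffled_order(42)
run_c = shuffled_order(7)

print(run_a == run_b)    # True: same seed, same order
print(run_a)
print(run_c)
```

Because the generator is isolated, adding or removing other random calls elsewhere in the program does not change the data order, which makes experiments easier to compare.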

42. How does PyTorch handle multi-dimensional indexing and slicing of tensors?

PyTorch tensor indexing follows NumPy-style conventions, including basic slicing, advanced (fancy) indexing with integer/boolean tensors, and the powerful ... (ellipsis) operator for indexing high-dimensional tensors concisely.

import torch

x = torch.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4)

# ── Basic slicing — same as Python lists/NumPy
print(x[0])           # shape (3, 4) — first "batch"
print(x[0, 1])         # shape (4,)  — first batch, second row
print(x[0, 1, 2])      # scalar — single element
print(x[:, 0, :])      # shape (2, 4) — all batches, first row, all cols
print(x[..., 0])       # shape (2, 3) — ellipsis: all leading dims, last dim index 0
print(x[0:1, :, -1])    # shape (1, 3) — slice + negative index

# ── Boolean (mask) indexing
mask = x > 10
print(x[mask])         # 1D tensor of all elements > 10
x_clamped = x.clone()
x_clamped[x_clamped > 10] = 0   # zero out values > 10

# ── Fancy (advanced) integer indexing
idx = torch.tensor([0, 2])
print(x[:, idx, :])    # shape (2, 2, 4) — select specific indices along dim 1

# ── torch.gather: select elements using an index tensor
scores = torch.tensor([[0.1, 0.7, 0.2], [0.3, 0.3, 0.4]])  # (2, 3)
top_idx = scores.argmax(dim=1, keepdim=True)                 # (2, 1)
top_val = scores.gather(dim=1, index=top_idx)                 # (2, 1)
print(top_val)   # tensor([[0.7], [0.4]])

# ── torch.where: conditional element selection
result = torch.where(x > 10, x, torch.zeros_like(x))   # keep if >10, else 0

# ── Important: most slicing returns a VIEW, not a copy!
y = x[0]
y[0, 0] = 999
print(x[0, 0, 0])      # 999 — x was modified too! (shared memory)
# Use x[0].clone() to get an independent copy
Indexing patterns
Pattern | Example | Returns
Basic slicing | x[:, 0] | View (shares memory)
Boolean mask | x[x > 0] | Copy (1D, new memory)
Fancy indexing | x[:, [0,2]] | Copy (new memory)
Ellipsis | x[..., 0] | View; skips middle dims
gather | x.gather(dim, index) | Copy; selects per index
What happens when you modify a slice obtained via basic indexing, like y = x[0]; y[0] = 999?
What does the ellipsis (...) do in PyTorch tensor indexing like x[..., 0]?

43. What is the difference between .view(), .reshape(), and .contiguous() in PyTorch, and why does it matter?

These three methods deal with how a tensor's underlying memory is interpreted as a different shape. Understanding the difference prevents a class of confusing runtime errors related to tensor memory layout.

import torch

x = torch.arange(12).reshape(3, 4)    # shape (3, 4), contiguous memory

# ── .view(): ALWAYS returns a view (no copy), but requires contiguous memory
y = x.view(4, 3)     # works — x is contiguous
print(y.shape)        # (4, 3)

# ── Transpose breaks contiguity — the data is NOT rearranged in memory,
# only the strides describing how to read it change
xt = x.t()             # transpose — x.t() is a VIEW with different strides
print(xt.is_contiguous())  # False!

# This FAILS — view() cannot reinterpret non-contiguous memory
try:
    xt.view(3, 4)
except RuntimeError as e:
    print(f"Error: {e}")
# RuntimeError: view size is not compatible with input tensor's size and stride

# ── .reshape(): tries view() first; falls back to copying if needed
z = xt.reshape(3, 4)   # WORKS — automatically copies if necessary
print(z.shape)          # (3, 4)

# ── .contiguous(): explicitly forces a contiguous copy in memory
xt_contig = xt.contiguous()
print(xt_contig.is_contiguous())  # True
xt_contig.view(3, 4)   # now works, since it is contiguous

# Strides explain WHY this happens
print(x.stride())   # (4, 1): next row = 4 elements ahead, next column = 1 element ahead
print(xt.stride())  # (1, 4): transposed, strides reflect the swap, no copy made
.view() vs .reshape() vs .contiguous()
MethodCopies data?Requires contiguous input?Safety
.view()Never — always a viewYes — raises RuntimeError otherwiseFails loudly on non-contiguous tensors
.reshape()Only if necessaryNo — handles either case automaticallySafer general-purpose choice
.contiguous()Yes, if not already contiguousN/A — this is what fixes itUse before .view() on a transposed/permuted tensor
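The copy-versus-view distinction in the table can be checked directly by comparing storage addresses. The following is a minimal sketch; the variable names are illustrative, and `data_ptr()` is used here only as a convenient way to see whether two tensors start at the same memory location.

```python
import torch

x = torch.arange(12).reshape(3, 4)      # contiguous, strides (4, 1)

# A view shares storage with its source: writing through it changes x
v = x.view(4, 3)
v[0, 0] = 100
view_shares_memory = v.data_ptr() == x.data_ptr()

# reshape() on a contiguous tensor is also just a view
r_view = x.reshape(2, 6)
reshape_is_view = r_view.data_ptr() == x.data_ptr()

# reshape() on a transposed tensor cannot be expressed with strides alone,
# so it silently falls back to a copy with its own storage
xt = x.t()
r_copy = xt.reshape(3, 4)
reshape_copied = r_copy.data_ptr() != x.data_ptr()

print(x[0, 0].item(), view_shares_memory, reshape_is_view, reshape_copied)
```

The practical consequence is that an in-place write to the result of `.reshape()` may or may not affect the original tensor, depending on its layout, so code that relies on shared memory should use `.view()` and let it fail loudly.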
Why does calling .view() on a transposed tensor often raise a RuntimeError?
What is the practical advantage of .reshape() over .view() in most code?

44. How do you freeze layers and perform transfer learning / fine-tuning in PyTorch?

Transfer learning reuses a model pretrained on a large dataset and adapts it to a new task. Freezing layers (setting requires_grad=False) prevents their weights from updating during backpropagation — useful when you want to keep pretrained features fixed and only train a new task-specific head.

import torch
import torch.nn as nn
import torchvision.models as models

# Load a pretrained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# ── Strategy 1: Feature extraction — freeze ALL pretrained layers
for param in backbone.parameters():
    param.requires_grad = False     # excluded from gradient computation

# Replace the final classification layer for our task (e.g. 5 classes)
in_features = backbone.fc.in_features    # 2048 for ResNet-50
backbone.fc = nn.Linear(in_features, 5)  # NEW layer — requires_grad=True by default

# Only backbone.fc parameters will be updated by the optimizer
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, backbone.parameters()),  # only trainable params
    lr=1e-3,
)

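# ── Strategy 2: Full fine-tuning with layer-wise (discriminative) learning rates
backbone2 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone2.fc = nn.Linear(backbone2.fc.in_features, 5)

# Every block that should train needs a parameter group: parameters left out
# of the optimizer are never updated, even though requires_grad is True.
# The exact learning rates below are illustrative.
stem_params = list(backbone2.conv1.parameters()) + list(backbone2.bn1.parameters())
optimizer2 = torch.optim.AdamW([
    {"params": stem_params,                   "lr": 1e-5},  # stem: smallest lr
    {"params": backbone2.layer1.parameters(), "lr": 1e-5},  # earliest layers: smallest lr
    {"params": backbone2.layer2.parameters(), "lr": 3e-5},
    {"params": backbone2.layer3.parameters(), "lr": 3e-5},
    {"params": backbone2.layer4.parameters(), "lr": 1e-4},  # later layers: bigger lr
    {"params": backbone2.fc.parameters(),     "lr": 1e-3},  # new head: largest lr
])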

# ── Verify which parameters are trainable
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total     = sum(p.numel() for p in backbone.parameters())
print(f"Trainable: {trainable:,} / Total: {total:,} ({100*trainable/total:.1f}%)")

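# ── Common pattern: train head first, then unfreeze and fine-tune everything
# Phase 1: only train backbone.fc for a few epochs
# Phase 2: unfreeze all layers, train with a small lr to fine-tune end-to-end
for param in backbone.parameters():
    param.requires_grad = True   # unfreeze for phase 2

# The phase-1 optimizer only holds the backbone.fc parameters, so build a new
# one that covers the unfrozen layers (the lr here is illustrative)
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-5)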
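To see what freezing does in practice without downloading pretrained weights, the sketch below uses a tiny stand-in model (the `backbone` and `head` modules here are hypothetical, not a real ResNet). It checks that frozen weights receive no gradient and do not move after an optimizer step, and that the optimizer has to be rebuilt after unfreezing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for a pretrained backbone plus a new task head (both hypothetical)
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 3)
model = nn.Sequential(backbone, head)

# Phase 1: freeze the backbone and optimise only the trainable parameters
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1
)

frozen_before = backbone[0].weight.clone()
head_bias_before = head.bias.clone()

x = torch.randn(4, 8)
y = torch.randint(0, 3, (4,))
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

frozen_grad_is_none = backbone[0].weight.grad is None
frozen_unchanged = torch.equal(backbone[0].weight, frozen_before)
head_changed = not torch.equal(head.bias, head_bias_before)

# Phase 2: unfreeze, then rebuild the optimizer so it sees all parameters
for p in backbone.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
n_tensors_in_optimizer = sum(len(g["params"]) for g in optimizer.param_groups)

print(frozen_grad_is_none, frozen_unchanged, head_changed, n_tensors_in_optimizer)
```

Note that freezing only stops weight updates. Layers such as BatchNorm still update their running statistics while the model is in train mode, so a frozen backbone is often also switched to eval mode.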
What does setting param.requires_grad = False accomplish for a layer in a pretrained model?
Why might you use a smaller learning rate for early (pretrained) layers and a larger one for the new task-specific head during fine-tuning?

45. What is the purpose of torch.utils.data.random_split() and how do you create train/validation/test splits in PyTorch?

Splitting a dataset into training, validation, and test subsets is a fundamental step before training. PyTorch's random_split() creates non-overlapping random subsets from a single Dataset, while preserving the lazy-loading behaviour of the original Dataset.

import torch
from torch.utils.data import Dataset, DataLoader, random_split

class MyDataset(Dataset):
    def __init__(self, n=1000):
        self.data   = torch.randn(n, 20)
        self.labels = torch.randint(0, 3, (n,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

full_dataset = MyDataset(n=1000)

# Split: 70% train, 15% val, 15% test
train_size = int(0.7 * len(full_dataset))
val_size   = int(0.15 * len(full_dataset))
test_size  = len(full_dataset) - train_size - val_size  # remainder, avoids rounding loss

# Use a generator for reproducible splits
generator = torch.Generator().manual_seed(42)
train_ds, val_ds, test_ds = random_split(
    full_dataset,
    [train_size, val_size, test_size],
    generator=generator,
)

print(len(train_ds), len(val_ds), len(test_ds))   # 700 150 150

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=32, shuffle=False)  # no shuffle needed
test_loader  = DataLoader(test_ds,  batch_size=32, shuffle=False)

# IMPORTANT GOTCHA: if your Dataset applies different transforms
# (e.g. data augmentation only for training), random_split alone
# does NOT let you apply different transforms per split, because
# all splits reference the SAME underlying Dataset object.
# Common workaround: split INDICES, then wrap with two separate
# Dataset instances using different transforms
from torch.utils.data import Subset
indices = torch.randperm(len(full_dataset), generator=generator).tolist()
train_idx = indices[:train_size]
val_idx   = indices[train_size:train_size+val_size]
# train_dataset_aug = Subset(MyDatasetWithAugmentation(...), train_idx)
# val_dataset_plain = Subset(MyDatasetPlain(...), val_idx)
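One way to realise the index-splitting workaround is a small wrapper that applies its own transform on top of a shared base dataset. The sketch below is an assumption-laden illustration: `TransformedSubset` and `add_one` are hypothetical names, and `add_one` merely stands in for real augmentation so that the effect is easy to verify.

```python
import torch
from torch.utils.data import Dataset, TensorDataset


class TransformedSubset(Dataset):
    """Hypothetical helper: restricts `base` to `indices` and applies its own transform."""

    def __init__(self, base, indices, transform=None):
        self.base = base
        self.indices = list(indices)
        self.transform = transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        x, y = self.base[self.indices[i]]
        if self.transform is not None:
            x = self.transform(x)
        return x, y


def add_one(x):
    # Stand-in for a real augmentation pipeline
    return x + 1.0


base = TensorDataset(torch.zeros(100, 20), torch.randint(0, 3, (100,)))

# Split indices once, reproducibly, then wrap each part separately
generator = torch.Generator().manual_seed(42)
perm = torch.randperm(len(base), generator=generator).tolist()
train_idx, val_idx = perm[:80], perm[80:]

train_ds = TransformedSubset(base, train_idx, transform=add_one)  # augmented
val_ds = TransformedSubset(base, val_idx)                         # untouched

train_x, _ = train_ds[0]
val_x, _ = val_ds[0]
print(len(train_ds), len(val_ds), train_x[0].item(), val_x[0].item())
```

Both wrappers read from the same underlying data, so nothing is duplicated in memory; only the transform differs per split.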
Why is shuffle=False typically used for validation and test DataLoaders, while shuffle=True is used for training?
What is a key limitation of torch.utils.data.random_split() when you want different data augmentation for the train vs validation split?

46. What is Batch Normalization in PyTorch and how does it differ from Layer Normalization?

Normalization layers stabilise training by re-centring and re-scaling activations. PyTorch provides several variants; Batch Normalization (BatchNorm) and Layer Normalization (LayerNorm) are the two most widely used, but they normalise over different dimensions and suit different architectures.

BatchNorm vs LayerNorm

| Feature | BatchNorm (nn.BatchNorm1d/2d) | LayerNorm (nn.LayerNorm) |
|---|---|---|
| Normalises over | Batch dimension (per-feature statistics) | Feature dimension (per-sample statistics) |
| Statistics at train | Computed from current mini-batch | Computed from current sample's features |
| Statistics at eval | Uses running mean/var accumulated during training | Always computed fresh from current input |
| Batch size dependency | Noisy with very small batches (< 8) | Independent of batch size; works with batch=1 |
| Best for | CNNs (image models) | Transformers, RNNs, NLP models |
| Parameters | gamma (scale), beta (shift) per feature/channel | gamma (scale), beta (shift) per element of normalized_shape |
import torch
import torch.nn as nn

# ── BatchNorm — for feedforward / CNN models
class BNModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.bn1 = nn.BatchNorm1d(64)   # 64 features
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        return self.fc2(x)

# BatchNorm behaves differently in train vs eval mode!
# train: normalise using batch mean/var, update running stats
# eval:  use accumulated running_mean / running_var
model = BNModel()
model.train()   # must be in train mode during training!

# ── LayerNorm — for transformers and sequence models
class LNModel(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.fc1 = nn.Linear(20, d_model)
        self.ln1 = nn.LayerNorm(d_model)   # normalise over last dim
        self.fc2 = nn.Linear(d_model, 10)

    def forward(self, x):
        x = torch.relu(self.ln1(self.fc1(x)))
        return self.fc2(x)

# LayerNorm produces the SAME result at train and eval
ln_model = LNModel()
ln_model.train()
x = torch.randn(8, 20)
out_train = ln_model(x)
ln_model.eval()
out_eval  = ln_model(x)
print(torch.allclose(out_train, out_eval))  # True — LayerNorm is mode-independent!

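Common bug: forgetting to switch modes when the model contains BatchNorm. Call model.train() before training and model.eval() before validation or inference. If the model stays in train mode during evaluation, BatchNorm normalises with the statistics of each evaluation batch (and keeps updating its running averages), so predictions depend on what else is in the batch. If the model is never put in train mode, the running statistics stay at their initial values (mean 0, variance 1) and eval-mode outputs will be incorrect.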
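The difference in normalisation axis, and the train/eval behaviour of BatchNorm, can be confirmed numerically. This is a minimal sketch on random data; the shapes and the shift/scale applied to `x` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 6) * 3 + 5          # 32 samples, 6 features, far from N(0, 1)

bn = nn.BatchNorm1d(6)
ln = nn.LayerNorm(6)

bn.train()
bn_train_out = bn(x)                    # normalised with this batch's statistics
ln_out = ln(x)

# BatchNorm: each FEATURE has mean ~0 across the batch (dim=0)
bn_feature_means = bn_train_out.mean(dim=0)
# LayerNorm: each SAMPLE has mean ~0 across its features (dim=1)
ln_sample_means = ln_out.mean(dim=1)

# In eval mode BatchNorm switches to its running statistics, which after a
# single training step are still close to their initial values
bn.eval()
bn_eval_out = bn(x)
bn_mode_dependent = not torch.allclose(bn_train_out, bn_eval_out, atol=1e-3)

print(bn_feature_means.abs().max().item(),
      ln_sample_means.abs().max().item(),
      bn_mode_dependent)
```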

Why does BatchNorm produce different outputs depending on whether the model is in train() or eval() mode?
For which type of architecture is LayerNorm preferred over BatchNorm, and why?

47. How do you implement and use a custom loss function in PyTorch?

When built-in loss functions do not fit your task, you can write a custom loss as either a plain function or an nn.Module subclass. As long as the loss is computed from PyTorch tensor operations with requires_grad=True parameters, autograd handles differentiation automatically.

import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Option 1: Plain function (simple, no learnable parameters)
def smooth_l1_custom(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth L1 loss: quadratic where |pred - target| < beta, linear elsewhere (beta > 0)."""
    diff = torch.abs(pred - target)
    loss = torch.where(
        diff < beta,
        0.5 * diff ** 2 / beta,       # quadratic region
        diff - 0.5 * beta,            # linear region
    )
    return loss.mean()

# ── Option 2: nn.Module subclass (recommended when loss has hyper-parameters
# or learnable parameters you want saved in state_dict)
class FocalLoss(nn.Module):
    """Focal loss for class-imbalanced multi-class problems."""
    def __init__(self, gamma: float = 2.0, weight: torch.Tensor | None = None):
        super().__init__()
        self.gamma = gamma
        # Registered as a buffer so the class weights follow .to(device) and
        # are saved in state_dict; None is allowed (no class weighting)
        self.register_buffer("weight", weight)

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (N, C)  targets: (N,) int64
        # Unweighted cross-entropy equals -log(pt), so exp(-ce) recovers pt
        ce_loss = F.cross_entropy(logits, targets, reduction="none")
        pt      = torch.exp(-ce_loss)         # probability of correct class
        focal   = (1 - pt) ** self.gamma * ce_loss
        if self.weight is not None:
            focal = focal * self.weight[targets]   # apply class weights after computing pt
        return focal.mean()

# Usage — identical to built-in loss functions
model     = nn.Linear(10, 5)
focal_fn  = FocalLoss(gamma=2.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X      = torch.randn(16, 10)
target = torch.randint(0, 5, (16,))

optimizer.zero_grad()
logits = model(X)
loss   = focal_fn(logits, target)   # custom loss used exactly like nn.CrossEntropyLoss
loss.backward()                      # autograd differentiates through our custom ops
optimizer.step()

print(f"Focal loss: {loss.item():.4f}")

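# ── Combining multiple losses (VAE-style example; the tensors below are random
# stand-ins for a decoder output, its target image, and the encoder's mu / log_var)
output     = torch.rand(16, 1, 28, 28, requires_grad=True)
target_img = torch.rand(16, 1, 28, 28)
mu         = torch.randn(16, 8, requires_grad=True)
log_var    = torch.randn(16, 8, requires_grad=True)

rec_loss   = F.mse_loss(output, target_img)                       # reconstruction
kl_loss    = -0.5 * (1 + log_var - mu**2 - log_var.exp()).mean()  # KL divergence
total_loss = rec_loss + 0.001 * kl_loss   # weighted combination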

Key insight: any PyTorch computation graph built from differentiable operations is automatically differentiable via autograd — you do not need to manually derive or implement gradients for custom losses. If you use standard PyTorch operations (torch.*, F.*), autograd takes care of the rest.
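A quick sanity check for any hand-written loss is to compare its value and its autograd gradient against a built-in equivalent when one exists. The sketch below does this for a Smooth L1 implementation against `F.smooth_l1_loss`; the input sizes and `beta=0.5` are arbitrary.

```python
import torch
import torch.nn.functional as F


def smooth_l1_custom(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth L1: quadratic where |diff| < beta, linear elsewhere."""
    diff = torch.abs(pred - target)
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()


torch.manual_seed(0)
pred = torch.randn(64, requires_grad=True)
target = torch.randn(64)

custom = smooth_l1_custom(pred, target, beta=0.5)
builtin = F.smooth_l1_loss(pred, target, beta=0.5)

# No backward() was written by hand: autograd derives both gradients
(grad_custom,) = torch.autograd.grad(custom, pred)
(grad_builtin,) = torch.autograd.grad(builtin, pred)

values_match = torch.allclose(custom, builtin, atol=1e-6)
grads_match = torch.allclose(grad_custom, grad_builtin, atol=1e-6)
print(values_match, grads_match)
```

When no built-in reference exists, `torch.autograd.gradcheck` on double-precision inputs is a common alternative for validating gradients numerically.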

Do you need to manually implement the backward() gradient computation for a custom PyTorch loss function built from standard tensor operations?
What is the advantage of implementing a custom loss as an nn.Module subclass instead of a plain function?

48. What is torch.compile() vs TorchScript and how do you export a PyTorch model for production deployment?

PyTorch offers several paths for optimising or deploying a trained model. torch.compile() JIT-compiles the model for speed but still runs inside Python, while TorchScript serialises the model to an intermediate representation that can be loaded without the Python interpreter (for example from C++). For cross-language/cross-framework deployment, ONNX export is also widely used.

PyTorch deployment options

| Method | Best for | Requires Python runtime? | Portable across languages? |
|---|---|---|---|
| torch.compile() | Faster inference (and training) inside Python; minimal code changes | Yes | No |
| TorchScript (trace) | Production servers; models with fixed control flow | No | Yes (C++ API) |
| TorchScript (script) | Models with data-dependent control flow (if/loops) | No | Yes |
| ONNX export | Cross-framework deployment (TensorRT, ONNX Runtime, CoreML) | No | Yes (many runtimes) |
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
model.eval()   # ALWAYS call eval() before exporting

# ── 1. torch.compile() — fast Python-based inference (PyTorch 2.0+)
compiled = torch.compile(model)
with torch.no_grad():
    out = compiled(torch.randn(4, 10))

# ── 2. TorchScript trace — captures a concrete execution trace
#       Works best when control flow does NOT depend on input data
example_input = torch.randn(1, 10)
traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "model_traced.pt")

# Load and run without the original Python class
loaded_traced = torch.jit.load("model_traced.pt")
out = loaded_traced(torch.randn(4, 10))

# ── 3. TorchScript script — handles dynamic control flow
class DynamicModel(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.mean() > 0:        # data-dependent branch — trace would miss this!
            return torch.relu(x)
        return torch.tanh(x)

scripted = torch.jit.script(DynamicModel())
torch.jit.save(scripted, "model_scripted.pt")

# ── 4. ONNX export — deploy with ONNX Runtime, TensorRT, CoreML
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch_size"},   # variable batch size
                  "logits":   {0: "batch_size"}},  # output batch dim varies too
    opset_version=17,
)

# Inference with ONNX Runtime (no PyTorch dependency on deployment host!)
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx")
# out  = sess.run(["logits"], {"features": x.numpy()})
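The warning that tracing "would miss" a data-dependent branch can be demonstrated concretely. In this sketch the model is traced with an input whose mean is positive, so only the relu path is recorded; the inputs `pos` and `neg` are illustrative, and the tracer warning is silenced only to keep the output clean.

```python
import warnings

import torch
import torch.nn as nn


class DynamicModel(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.mean() > 0:            # data-dependent branch
            return torch.relu(x)
        return torch.tanh(x)


model = DynamicModel().eval()
pos = torch.ones(2, 3)              # mean > 0: takes the relu branch
neg = -torch.ones(2, 3)             # mean < 0: should take the tanh branch

# Tracing records only the operations executed for `pos`
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # tracing warns about the tensor-to-bool conversion
    traced = torch.jit.trace(model, pos)

# Scripting compiles the Python source, so the if/else is preserved
scripted = torch.jit.script(model)

eager_out = model(neg)              # tanh(-1)
traced_out = traced(neg)            # relu(-1) = 0: the wrong branch was baked in
scripted_out = scripted(neg)        # tanh(-1): matches eager mode

print(traced_out[0, 0].item(), scripted_out[0, 0].item(), eager_out[0, 0].item())
```

A cheap safeguard after any export is to compare the exported model's output with the eager model on several inputs that exercise different branches, not just the example input used for tracing.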
What is the key difference between torch.jit.trace() and torch.jit.script() for TorchScript export?
Why should you always call model.eval() before exporting or scripting a PyTorch model for production?