
Python Mathematical Intuition and Scikit Learn Interview Questions

1. Why does linear regression minimise the sum of squared errors instead of, say, absolute errors?
2. Explain the mathematical intuition behind gradient descent and why the learning rate matters.
3. Why do you need to scale features before using gradient descent-based models or distance-based algorithms like KNN?
4. Explain the bias-variance tradeoff mathematically and how it relates to model complexity.
5. What is the mathematical difference between L1 (Lasso) and L2 (Ridge) regularization, and why does L1 produce sparse solutions?
6. How does maximum likelihood estimation connect to the logistic regression cost function?
7. How do decision trees decide which feature and threshold to split on? Explain Gini impurity and entropy.
8. Why does a random forest reduce variance compared to a single decision tree, and what role does feature randomness play?
9. What is the mathematical intuition behind gradient boosting? How does it differ from random forests?
10. Explain the mathematical foundation of PCA. What do eigenvectors and eigenvalues represent in this context?
11. What is the mathematical concept of the margin in Support Vector Machines, and why does maximizing it improve generalization?
12. What is the kernel trick in SVMs and why does it avoid explicitly computing high-dimensional feature mappings?
13. Why does K-Nearest Neighbors suffer from the curse of dimensionality, mathematically?
14. What is the mathematical objective function K-Means optimises, and why can it converge to a local minimum?
15. What is the statistical rationale behind k-fold cross-validation, and why are k=5 or k=10 commonly used?
16. What does the ROC-AUC score mathematically represent, and why is it threshold-independent?
17. Explain the mathematical tradeoff between precision and recall, and why F1 score is the harmonic mean rather than the arithmetic mean.
18. What is the 'naive' independence assumption in Naive Bayes, and why does it still work well in practice despite being unrealistic?
19. Why is a log transformation commonly applied to skewed numerical features before modeling, mathematically?
20. What is multicollinearity, mathematically, and how does the Variance Inflation Factor (VIF) detect it?
21. Why must features be standardized before applying Ridge or Lasso regularization, mathematically?
22. What is the mathematical relationship between learning_rate and n_estimators in gradient boosting?
23. How does the softmax function generalize logistic regression to multiclass classification, mathematically?
24. Why does fitting a scaler or transformer on the entire dataset (before train/test split) cause data leakage, mathematically?
25. How does the class_weight parameter mathematically address class imbalance in scikit-learn classifiers?
26. Why does using simple label encoding (integers) for nominal categorical features mislead most machine learning models, mathematically?
27. What is the difference between a single train/validation/test split and k-fold cross-validation for hyperparameter tuning, statistically?
28. Why is PCA sensitive to feature scaling while decision tree feature importance is not, mathematically?
29. Why is the decision boundary of standard logistic regression always a straight line (or hyperplane), mathematically?
30. Why can R-squared be a misleading metric for model comparison, and how does adjusted R-squared address this?
31. Derive mathematically why bagging (bootstrap aggregating) reduces variance, and under what condition it does NOT help.
32. Why does convexity of the loss function matter for optimization algorithms like gradient descent, mathematically?
33. Mathematically, why does RobustScaler handle outliers better than StandardScaler?
34. What does it mean for a classifier's predicted probabilities to be 'well-calibrated', and why don't all models produce calibrated probabilities naturally?
35. Mathematically, why does stochastic gradient descent (SGD) scale to large datasets better than batch gradient descent?
36. Beyond scaling, why must feature selection methods also be included inside a cross-validation pipeline rather than applied beforehand?

1. Why does linear regression minimise the sum of squared errors instead of, say, absolute errors?

Ordinary least squares (OLS) minimises squared residuals for both mathematical and statistical reasons. Mathematically, the squared error loss L(β) = Σ(yᵢ - Xᵢβ)² is smooth and differentiable everywhere, so its minimum can be found analytically by setting the gradient to zero — this gives the closed-form normal equation β = (XᵀX)⁻¹Xᵀy. Absolute error |yᵢ - Xᵢβ| has a non-differentiable kink at zero, which prevents a clean closed-form solution and requires iterative optimisation (like linear programming) instead.

Statistically, minimising squared error is equivalent to maximum likelihood estimation under the assumption that residuals are independent and normally distributed with constant variance (homoscedastic). The squared loss heavily penalises large errors (a residual twice as large contributes four times the loss), which makes OLS very sensitive to outliers. This is precisely why Mean Absolute Error (MAE) or Huber loss are preferred when the data contains outliers: for large errors they grow linearly rather than quadratically.

import numpy as np
from sklearn.linear_model import LinearRegression

# Closed-form normal equation
X = np.array([[1, 1], [1, 2], [1, 3]])  # design matrix with intercept column
y = np.array([2, 4, 5])
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)  # [intercept, slope]

# scikit-learn solves the same least-squares problem (with a least-squares solver, not an explicit inverse)
model = LinearRegression().fit(X[:, 1:], y)
print(model.intercept_, model.coef_)

Geometric intuition: minimising squared error is equivalent to finding the orthogonal projection of y onto the column space of X. This projection is unique and has a clean geometric interpretation — the residual vector is perpendicular to every column of X, which is exactly what XᵀXβ = Xᵀy encodes.
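A quick numerical check of this orthogonality, as a minimal sketch that reuses the three-point toy data from the example above. The variable names `residual` and `orthogonality` are illustrative, and `np.linalg.solve` is used here in place of the explicit inverse purely as a numerical convenience.

```python
import numpy as np

# Toy design matrix with an intercept column (same numbers as the example above)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 5.0])

# Solve the normal equations X^T X beta = X^T y without forming an inverse
beta = np.linalg.solve(X.T @ X, X.T @ y)

# The fitted values X @ beta are the projection of y onto the column space of X
residual = y - X @ beta

# Each entry is the dot product of the residual with one column of X
orthogonality = X.T @ residual
print(orthogonality)  # both entries are zero up to floating-point error
```

If either entry were clearly non-zero, the residual would still contain a component that a better choice of β could explain.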

Why does squared error loss allow a closed-form solution while absolute error does not?
Why is OLS particularly sensitive to outliers compared to a model trained with MAE?
2. Explain the mathematical intuition behind gradient descent and why the learning rate matters.

Gradient descent is an iterative optimisation algorithm that finds a local minimum of a differentiable function by repeatedly stepping in the direction of steepest descent — the negative gradient. The gradient ∇L(θ) points in the direction of steepest increase, so subtracting a scaled version of it moves the parameters toward lower loss: θ_{t+1} = θ_t - η∇L(θ_t), where η is the learning rate.

The learning rate controls the step size. If η is too small, convergence is correct but painfully slow, requiring many iterations to reach the minimum. If η is too large, the update can overshoot the minimum and oscillate or even diverge — the loss increases instead of decreasing, because the linear approximation the gradient provides is only locally accurate.

import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        predictions = X @ theta
        error = predictions - y
        gradient = (2 / n) * X.T @ error   # gradient of MSE w.r.t. theta
        theta -= lr * gradient             # step toward lower loss
    return theta

# Demonstrating learning rate sensitivity
X = np.column_stack([np.ones(100), np.linspace(0, 10, 100)])
y = 3 + 2 * X[:, 1] + np.random.randn(100) * 0.5

theta_good = gradient_descent(X, y, lr=0.01)   # converges smoothly
theta_bad  = gradient_descent(X, y, lr=0.9)    # likely diverges or oscillates
print(theta_good)

Why this matters in scikit-learn: estimators like SGDRegressor and SGDClassifier expose learning_rate and eta0 parameters precisely because of this trade-off. Schedules such as learning_rate='invscaling' shrink the step size as training proceeds, while learning_rate='adaptive' keeps it at eta0 and divides it by 5 each time the training loss stops improving, balancing fast initial progress against fine-grained convergence near the minimum.
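For the quadratic MSE loss, "too large" can be made concrete: fixed-step gradient descent is stable only when η < 2/λ_max, where λ_max is the largest eigenvalue of the Hessian (2/n)XᵀX. The sketch below checks this on synthetic data. The data, the seed, and the helper name `run_gd` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), np.linspace(0, 10, 100)])
y = 3 + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)
n = len(y)

# The Hessian of the MSE loss (1/n) * ||X theta - y||^2 is constant: (2/n) X^T X
hessian = (2 / n) * X.T @ X
lam_max = np.linalg.eigvalsh(hessian).max()
eta_limit = 2 / lam_max  # fixed-step gradient descent diverges above this value

def run_gd(lr, n_iters=200):
    """Run plain gradient descent from theta = 0 and return the final MSE."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= lr * (2 / n) * X.T @ (X @ theta - y)
    return np.mean((X @ theta - y) ** 2)

start_mse = np.mean(y ** 2)            # loss at the starting point theta = 0
mse_below = run_gd(0.9 * eta_limit)    # just inside the stable region
mse_above = run_gd(1.1 * eta_limit)    # just outside: the loss grows every step

print(f"step-size limit: {eta_limit:.4f}")
print(f"MSE with 0.9 * limit: {mse_below:.3f}")
print(f"MSE with 1.1 * limit: {mse_above:.3e}")
```

Because the unscaled feature ranges from 0 to 10, the limit on this data is only a few hundredths, which is why a step size such as 0.9 is far outside the stable region.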

What happens if the learning rate in gradient descent is set too high?
Why does the gradient point in the direction of steepest increase rather than decrease?
3. Why do you need to scale features before using gradient descent-based models or distance-based algorithms like KNN?

Feature scaling matters for two distinct mathematical reasons depending on the algorithm family. For gradient-based optimisation (logistic regression, SVM, neural networks), features with very different scales create an elongated, elliptical loss surface. Gradient descent then zig-zags inefficiently across the narrow dimension instead of taking a direct path to the minimum, slowing convergence dramatically. Scaling features to similar ranges makes the loss surface more circular, so gradient descent converges in far fewer iterations.

For distance-based algorithms (KNN, K-Means, SVM with RBF kernel), the Euclidean distance formula √Σ(xᵢ - yᵢ)² is dominated by whichever feature has the largest numeric range. A feature measured in thousands (like income) would completely swamp a feature measured in single digits (like age) when computing distances, even if age is more predictive.
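A small worked example of this effect, as a sketch with made-up numbers: the three hypothetical people and the helper `nearest_to_first` are illustrative and not part of any dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: columns are [age in years, income in dollars]
people = np.array([
    [25, 50_000],   # row 0: the query person
    [60, 51_000],   # row 1: very different age, similar income
    [26, 51_500],   # row 2: almost the same age, slightly larger income gap
], dtype=float)

def nearest_to_first(data):
    """Index (1 or 2) of the row closest to row 0 in Euclidean distance."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

nearest_raw = nearest_to_first(people)
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(people))

print("nearest neighbour on raw features:   row", nearest_raw)
print("nearest neighbour on scaled features: row", nearest_scaled)
```

On the raw features the 60-year-old (row 1) is the nearest neighbour, because a 1,000-dollar income gap outweighs a 35-year age gap. After standardisation the 26-year-old (row 2) becomes the nearest.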

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Without scaling: income (range ~50000) dominates age (range ~80)
X = [[25, 45000], [30, 52000], [45, 110000]]

# StandardScaler: (x - mean) / std  -> mean 0, unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Always fit scaler on train, transform both train and test
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_train, y_train)

# MinMaxScaler: rescales to [0, 1] range — sensitive to outliers
mm = MinMaxScaler()
X_mm = mm.fit_transform(X)

When Scaling Matters

Algorithm family | Sensitive to scale? | Why
Linear/Logistic Regression (gradient descent) | Yes | Elongated loss surface slows convergence
KNN, K-Means, SVM (RBF) | Yes | Distance metric dominated by large-range features
Decision Trees, Random Forest | No | Splits are based on feature thresholds, not magnitudes
Gradient Boosting (tree-based) | No | Same reason: split-based, scale-invariant

Why does feature scaling speed up convergence for gradient descent-based models?
Which type of model is generally NOT sensitive to feature scaling?
4. Explain the bias-variance tradeoff mathematically and how it relates to model complexity.

The expected test error of a model can be decomposed into three components: E[(y - f̂(x))²] = Bias(f̂(x))² + Var(f̂(x)) + σ², where σ² is irreducible noise. Bias measures how far the average prediction (over many training sets) is from the true function — high bias means the model is too simple to capture the underlying pattern (underfitting). Variance measures how much the prediction would change if trained on a different sample — high variance means the model is overly sensitive to the specific training data (overfitting).

As model complexity increases (more polynomial terms, deeper trees, more parameters), bias decreases because the model can fit more intricate patterns, but variance increases because the model has more freedom to fit noise in the training data. The goal is to find the complexity level that minimises the sum, not either component alone.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Demonstrating bias-variance via polynomial degree
degrees = [1, 4, 15]  # underfit, good fit, overfit
for d in degrees:
    pipe = make_pipeline(
        PolynomialFeatures(degree=d),
        Ridge(alpha=0.1)
    )
    scores = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f'Degree {d}: CV MSE = {-scores.mean():.3f} (+/- {scores.std():.3f})')
    # degree=1: high bias, low variance (underfit)
    # degree=4: balanced
    # degree=15: low bias, high variance (overfit — CV scores vary widely)

Practical signal: high variance shows up as a large gap between training and validation error (model memorises training data); high bias shows up as both training and validation error being similarly poor (model can't even fit the training data well). Regularisation (Ridge, Lasso) deliberately introduces a small amount of bias to substantially reduce variance.
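The two terms of the decomposition can also be estimated directly by simulation. The sketch below is a rough illustration, not a rigorous experiment. It assumes a known true function sin(2πx), a fixed set of training inputs, and Gaussian noise, and it refits a polynomial on many freshly drawn noisy training sets. The function names and settings are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)   # fixed design: only the noise is resampled
noise_sd, n_repeats = 0.3, 300

def bias2_and_variance(degree):
    """Estimate average squared bias and variance of a polynomial fit."""
    preds = np.empty((n_repeats, x_train.size))
    for r in range(n_repeats):
        # Draw a new noisy training set from the same data-generating process
        y = true_f(x_train) + rng.normal(scale=noise_sd, size=x_train.size)
        coefs = np.polyfit(x_train, y, deg=degree)
        preds[r] = np.polyval(coefs, x_train)
    mean_pred = preds.mean(axis=0)                       # average model
    bias2 = np.mean((mean_pred - true_f(x_train)) ** 2)  # squared bias
    variance = np.mean(preds.var(axis=0))                # spread across fits
    return bias2, variance

bias2_simple, var_simple = bias2_and_variance(degree=1)
bias2_flexible, var_flexible = bias2_and_variance(degree=7)

print(f"degree 1: bias^2 = {bias2_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 7: bias^2 = {bias2_flexible:.4f}, variance = {var_flexible:.4f}")
```

The straight line has the larger squared bias, since it cannot follow the sine curve, while the degree-7 polynomial has the larger variance, since it has more coefficients that react to noise.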

What does high variance in the bias-variance decomposition indicate about a model?
How does increasing model complexity typically affect bias and variance?
5. What is the mathematical difference between L1 (Lasso) and L2 (Ridge) regularization, and why does L1 produce sparse solutions?

Both methods add a penalty term to the loss function to discourage large coefficients. Ridge (L2) adds λΣβᵢ²; Lasso (L1) adds λΣ|βᵢ|. The mathematical consequence of this difference is profound: L1 regularization can drive coefficients to exactly zero (producing sparse, interpretable models with automatic feature selection), while L2 shrinks coefficients toward zero but almost never makes them exactly zero.

The geometric explanation: think of the regularization term as a constraint region — L2's constraint Σβᵢ² ≤ t is a smooth circle/sphere, while L1's constraint Σ|βᵢ| ≤ t is a diamond/polytope with sharp corners on the axes. When you find the point where the elliptical loss contours first touch the constraint region, the circular L2 region rarely touches exactly on an axis (rarely zeroing a coefficient), but the diamond-shaped L1 region has corners precisely on the axes — the loss contours are statistically much more likely to first touch at one of these corners, zeroing out one or more coefficients.
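The corner effect has a simple closed form in one special case. Assume an orthonormal design (XᵀX = I) and the objectives ½‖y − Xβ‖² + λ‖β‖₁ for Lasso and ½‖y − Xβ‖² + (λ/2)‖β‖² for Ridge. This scaling convention is chosen here for convenience and differs by constant factors from the penalties written above. Under it, Lasso soft-thresholds each OLS coefficient while Ridge rescales it. The coefficient values below are made up for illustration.

```python
import numpy as np

beta_ols = np.array([2.0, -0.3, 0.8, 0.1])  # hypothetical OLS coefficients
lam = 0.5

# Lasso: move every coefficient toward zero by lam, stopping at zero
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# Ridge: multiply every coefficient by the same factor 1 / (1 + lam)
beta_ridge = beta_ols / (1.0 + lam)

print("lasso:", beta_lasso)
print("ridge:", beta_ridge)
print("exact zeros (lasso):", int(np.sum(beta_lasso == 0)))
print("exact zeros (ridge):", int(np.sum(beta_ridge == 0)))
```

Any coefficient whose magnitude is below λ is set exactly to zero by the Lasso rule, whereas the Ridge rule can only shrink a non-zero coefficient, never remove it.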

from sklearn.linear_model import Ridge, Lasso
import numpy as np

# Ridge: shrinks coefficients but rarely to exactly zero
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print('Ridge coefficients:', ridge.coef_)
# e.g. [0.31, 0.02, 0.18, 0.04]  — small but non-zero

# Lasso: drives some coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print('Lasso coefficients:', lasso.coef_)
# e.g. [0.45, 0.0, 0.22, 0.0]   — automatic feature selection!

selected_features = np.where(lasso.coef_ != 0)[0]
print(f'Lasso selected {len(selected_features)} of {len(lasso.coef_)} features')

# ElasticNet: combines both penalties
from sklearn.linear_model import ElasticNet
en = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)
Why does L1 regularization tend to produce exactly zero coefficients while L2 does not?
What is the practical benefit of Lasso's tendency to zero out coefficients?
6. How does maximum likelihood estimation connect to the logistic regression cost function?

Logistic regression models the probability of a binary outcome using the sigmoid function: p(y=1|x) = σ(xᵀβ) = 1/(1+e^(-xᵀβ)). To fit β, we use maximum likelihood estimation (MLE) — finding the parameters that make the observed labels most probable under the model.

For a single example, the likelihood is p^y · (1-p)^(1-y) (this equals p when y=1 and 1-p when y=0, a compact way to write both cases in one expression). Taking the log of the product of likelihoods across all n examples gives the log-likelihood, and negating and averaging it (since we minimise rather than maximise) yields the binary cross-entropy loss: L = -1/n Σ[yᵢ log(pᵢ) + (1-yᵢ) log(1-pᵢ)]. This is the data-fit term that scikit-learn's LogisticRegression minimises; by default it also adds an L2 penalty on β, whose strength is controlled by C.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# log_loss computes exactly the negative log-likelihood (cross-entropy)
loss = log_loss(y_test, probs)
print(f'Cross-entropy loss: {loss:.4f}')

# Manual implementation of the cross-entropy formula
def manual_log_loss(y_true, p_pred, eps=1e-15):
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(manual_log_loss(y_test, probs))  # matches log_loss above

Why log instead of raw likelihood: multiplying many small probabilities together causes numerical underflow; taking the log converts the product into a sum, which is numerically stable and also turns the optimisation into a convex problem that gradient-based methods can solve reliably.
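The underflow problem is easy to reproduce. This sketch uses 2000 hypothetical per-example likelihoods, each equal to 0.5, purely for illustration.

```python
import numpy as np

# 2000 hypothetical per-example likelihoods, each equal to 0.5
probs = np.full(2000, 0.5)

# 0.5 ** 2000 is smaller than the smallest positive float64, so it becomes 0.0
raw_likelihood = np.prod(probs)

# The sum of logs carries the same information and stays representable
log_likelihood = np.sum(np.log(probs))

print("raw likelihood:", raw_likelihood)
print("log-likelihood:", log_likelihood)
```

Once the raw likelihood has collapsed to zero, no optimiser can compare parameter settings with it, while the log-likelihood still distinguishes them.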

What loss function does maximum likelihood estimation lead to for logistic regression?
Why is the log-likelihood used instead of the raw likelihood for optimisation?
7. How do decision trees decide which feature and threshold to split on? Explain Gini impurity and entropy.

At each node, a decision tree evaluates every feature and every possible threshold, and selects the split that produces the greatest reduction in impurity between the parent node and the weighted average impurity of the two child nodes. Two common impurity measures are Gini impurity and entropy.

Gini impurity for a node is G = 1 - Σpᵢ², where pᵢ is the proportion of class i in the node. It measures the probability that two randomly selected samples from the node would have different class labels. Entropy is H = -Σpᵢ log₂(pᵢ), borrowed from information theory, measuring the average uncertainty (bits needed to encode the class label). Both reach their maximum when classes are perfectly mixed (50/50 for binary) and zero when a node is pure (all one class).

from sklearn.tree import DecisionTreeClassifier, plot_tree
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * np.log2(1 / p))  # p > 0 here: np.unique only returns observed classes

y_pure  = np.array([1, 1, 1, 1])       # gini=0.0, entropy=0.0
y_mixed = np.array([1, 1, 0, 0])       # gini=0.5, entropy=1.0

print(gini(y_pure), gini(y_mixed))      # 0.0  0.5
print(entropy(y_pure), entropy(y_mixed))# 0.0  1.0

# scikit-learn: choose criterion explicitly
tree_gini    = DecisionTreeClassifier(criterion='gini').fit(X, y)
tree_entropy = DecisionTreeClassifier(criterion='entropy').fit(X, y)

Practical note: Gini and entropy almost always produce very similar trees — Gini is slightly faster to compute (no logarithm) and is scikit-learn's default. The bigger lever for tree quality is usually max_depth, min_samples_split, and min_samples_leaf, which control overfitting rather than the choice of impurity measure.
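The split search described above can be sketched for a single numeric feature. This is a simplified illustration, not scikit-learn's implementation: a full tree would repeat it for every feature and recurse into the children. The helper names `gini` and `best_split` and the toy data are assumptions.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (threshold, impurity_decrease) for the best split on one feature."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    parent = gini(y)
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # identical values cannot be separated by a threshold
        threshold = (x[i] + x[i - 1]) / 2  # midpoint between sorted neighbours
        left, right = y[:i], y[i:]
        # Weighted average impurity of the two child nodes
        children = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        gain = parent - children
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, gain = best_split(x, y)
print(threshold, gain)
```

On this toy data the best threshold is 6.5, the midpoint of the gap between the two classes, and the impurity decrease is 0.5 because both children are pure.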

What does a Gini impurity of 0 indicate about a node in a decision tree?
How does a decision tree decide where to split at each node?

8. Why does a random forest reduce variance compared to a single decision tree, and what role does feature randomness play?

A random forest builds many decision trees, each trained on a bootstrap sample of the data (bagging), and averages their predictions. The variance of the average of n independent, identically distributed random variables each with variance σ² is σ²/n, so averaging reduces variance in inverse proportion to the number of estimators, provided the trees are independent.

In practice, trees trained on bootstrap samples of the same dataset are correlated, not independent, because they share much of the same underlying data. The variance of the average of n correlated variables with pairwise correlation ρ is ρσ² + (1-ρ)σ²/n — as n grows large, this approaches ρσ², not zero. This is why random forests also randomly restrict the features considered at each split (max_features): this decorrelates the trees from each other, reducing ρ and allowing variance reduction to continue benefiting from larger n.
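The correlated-average formula can be checked with a short simulation. As a stand-in for tree predictions, the sketch below draws unit-variance Gaussian variables that share a common component, which gives them pairwise correlation ρ. That modelling choice and all the settings are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, rho, n_trials = 50, 0.3, 20_000

# Equicorrelated unit-variance variables: a shared component plus private noise
shared = rng.normal(size=(n_trials, 1))
private = rng.normal(size=(n_trials, n_trees))
preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private

# Variance of the ensemble average, estimated across trials
empirical = preds.mean(axis=1).var()

# rho * sigma^2 + (1 - rho) * sigma^2 / n with sigma^2 = 1
theoretical = rho + (1 - rho) / n_trees

print(f"empirical variance of the average: {empirical:.4f}")
print(f"formula value:                     {theoretical:.4f}")
print(f"floor as n grows (rho * sigma^2):  {rho:.4f}")
```

With 50 correlated "trees" the variance is already close to the floor ρσ², which is why lowering ρ through feature subsampling matters more than adding further trees.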

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

single_tree = DecisionTreeClassifier(max_depth=None)
forest      = RandomForestClassifier(
    n_estimators=200,
    max_features='sqrt',   # randomly consider sqrt(n_features) per split
    bootstrap=True,        # sample with replacement
)

tree_scores   = cross_val_score(single_tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f'Single tree: {tree_scores.mean():.3f} (+/- {tree_scores.std():.3f})')
print(f'Forest:      {forest_scores.mean():.3f} (+/- {forest_scores.std():.3f})')
# Forest typically has a higher mean score and a smaller spread across folds

Out-of-bag (OOB) error: because each tree is trained on roughly 63% of samples (bootstrap sampling), the remaining ~37% can be used to validate that specific tree — giving a free, built-in validation estimate without needing a separate holdout set. Set oob_score=True to access this.
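The 63% figure follows from the bootstrap itself: the chance that a given row is drawn at least once in n draws with replacement is 1 − (1 − 1/n)ⁿ, which tends to 1 − 1/e ≈ 0.632. A minimal sketch with an arbitrary sample size and seed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: draw n row indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
in_bag_fraction = np.unique(bootstrap_idx).size / n_samples

# Probability that a given row is drawn at least once
expected = 1 - (1 - 1 / n_samples) ** n_samples

print(f"in-bag fraction (simulated): {in_bag_fraction:.3f}")
print(f"in-bag fraction (formula):   {expected:.3f}")
print(f"out-of-bag fraction:         {1 - in_bag_fraction:.3f}")
```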

Why does randomly restricting features per split (max_features) help random forests reduce variance further?
What does the out-of-bag (OOB) score in a random forest represent?
9. What is the mathematical intuition behind gradient boosting? How does it differ from random forests?

Gradient boosting builds an ensemble of weak learners (typically shallow decision trees) sequentially, where each new tree is trained to predict the negative gradient of the loss function with respect to the current ensemble's predictions — essentially, each tree learns to correct the errors (residuals, for squared loss) of all previous trees combined. The final prediction is F(x) = F₀(x) + η·Σ hₘ(x), where each hₘ is a tree fit to the current residuals and η is a shrinkage/learning rate.

This is fundamentally different from random forests, which build trees independently and in parallel on bootstrap samples and average them to reduce variance. Gradient boosting builds trees sequentially and dependently — each tree depends on all previous ones — primarily to reduce bias by progressively fitting the parts of the function the ensemble has gotten wrong so far.

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Conceptual implementation of gradient boosting for regression (squared loss)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boost(X, y, n_estimators=50, lr=0.1, max_depth=3):
    F = np.zeros(len(y))             # initial prediction: all zeros
    trees = []
    for m in range(n_estimators):
        residuals = y - F            # negative gradient for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)       # fit tree to the residuals
        F += lr * tree.predict(X)    # update ensemble prediction
        trees.append(tree)
    return trees, F

# scikit-learn's production implementation
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,   # shrinkage — smaller values need more trees but generalise better
    max_depth=3,
)
gb.fit(X_train, y_train)
What does each new tree in a gradient boosting ensemble learn to predict?
What is the key structural difference between random forests and gradient boosting?
10. Explain the mathematical foundation of PCA. What do eigenvectors and eigenvalues represent in this context?

Principal Component Analysis (PCA) finds a new orthogonal coordinate system where the axes (principal components) are ordered by the amount of variance they capture in the data. Mathematically, PCA computes the eigenvectors and eigenvalues of the data's covariance matrix Σ = (1/(n-1))XᵀX (after centering X to zero mean). Some texts divide by n instead; this rescales every eigenvalue by the same factor and leaves the eigenvectors unchanged.

Each eigenvector of the covariance matrix points in a direction in the original feature space; the corresponding eigenvalue equals the variance of the data when projected onto that eigenvector's direction. The eigenvector with the largest eigenvalue is the first principal component — the direction of maximum variance in the data. PCA sorts eigenvectors by descending eigenvalue and keeps the top k to reduce dimensionality while retaining as much variance (information) as possible.
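The claim that an eigenvalue equals the variance of the projected data can be verified numerically. The sketch below uses synthetic correlated 2-D data, and the mixing matrix is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2-D data (illustrative only)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)            # divides by n - 1
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order

top_value = eigenvalues[-1]
top_vector = eigenvectors[:, -1]                  # first principal component

# Variance of the data after projecting onto the first principal component
projected_variance = (X_centered @ top_vector).var(ddof=1)

# Variance along an arbitrary other unit direction, for comparison
other = np.array([1.0, 0.0])
other_variance = (X_centered @ other).var(ddof=1)

print(f"largest eigenvalue:          {top_value:.4f}")
print(f"variance along first PC:     {projected_variance:.4f}")
print(f"variance along the x-axis:   {other_variance:.4f}")
```

The first two numbers agree, and no other unit direction yields a larger projected variance.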

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# ALWAYS standardize before PCA — PCA is scale-sensitive
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print('Explained variance ratio:', pca.explained_variance_ratio_)
# e.g. [0.45, 0.23] — first PC explains 45% of variance, second 23%

print('Principal components (eigenvectors):', pca.components_)
print('Eigenvalues:', pca.explained_variance_)

# Manual computation via covariance matrix eigendecomposition
cov_matrix = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort descending — eigh returns ascending order
idx = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[idx], eigenvectors[:, idx]

Choosing k: a common heuristic is to plot cumulative explained variance ratio and pick the smallest k that retains, say, 90-95% of total variance — this is exactly what PCA(n_components=0.95) automates in scikit-learn.

What does the eigenvalue associated with a principal component represent?
Why should you standardize features before applying PCA?
11. What is the mathematical concept of the margin in Support Vector Machines, and why does maximizing it improve generalization?

A linear SVM finds the hyperplane wᵀx + b = 0 that separates two classes while maximising the margin, the gap between the hyperplane and the nearest points from either class. With the hyperplane scaled so that the nearest points satisfy |wᵀx + b| = 1, each nearest point lies at distance 1/‖w‖ from the hyperplane, so the full margin width between the two classes is 2/‖w‖. Maximising the margin is therefore equivalent to minimising ‖w‖² subject to the constraint that all points are correctly classified with a functional margin of at least 1: yᵢ(wᵀxᵢ + b) ≥ 1.

The points that lie exactly on the margin boundary are called support vectors — they are the only points that determine the position of the decision boundary; all other points could be moved or removed without changing the solution. Maximising the margin is a form of structural risk minimisation: a wider margin means the decision boundary has more room before misclassifying nearby unseen points, which translates to better generalisation (related to VC-dimension theory bounding generalisation error by margin size).

from sklearn.svm import SVC
import numpy as np

svm = SVC(kernel='linear', C=1.0)
svm.fit(X_train, y_train)

# Support vectors — the points that define the margin
print('Number of support vectors:', len(svm.support_vectors_))
print('Support vector indices:', svm.support_)

# The decision boundary: w.x + b = 0
w = svm.coef_[0]
b = svm.intercept_[0]
margin_width = 2 / np.linalg.norm(w)
print(f'Margin width: {margin_width:.4f}')

# C controls the tradeoff between margin width and misclassification:
# Small C: wider margin, more tolerance for misclassified points (soft margin)
# Large C: narrower margin, less tolerance — can overfit
What are 'support vectors' in an SVM?
Why does maximizing the margin tend to improve generalization in SVMs?
12. What is the kernel trick in SVMs and why does it avoid explicitly computing high-dimensional feature mappings?

Many datasets are not linearly separable in their original feature space but become separable after mapping to a higher-dimensional space via some function φ(x). Computing this mapping explicitly can be computationally infeasible, and is impossible when the mapping is infinite-dimensional, as it is for the RBF kernel. The kernel trick exploits the fact that SVM's optimisation and prediction only ever require the dot product φ(xᵢ)ᵀφ(xⱼ) between mapped points, never the mapped vectors themselves.

A kernel function K(xᵢ, xⱼ) computes this dot product directly in the original space, without ever materialising φ(x). The RBF (Gaussian) kernel K(xᵢ, xⱼ) = exp(-γ‖xᵢ - xⱼ‖²) implicitly corresponds to an infinite-dimensional feature mapping, yet evaluating it costs the same as a simple distance computation — this is the mathematical magic of the kernel trick.
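The RBF map cannot be written out, but the same identity can be checked by hand for a kernel with a finite feature map. The sketch below uses the degree-2 polynomial kernel K(x, z) = (xᵀz)² in two dimensions, whose explicit map is φ(x) = (x₁², √2·x₁x₂, x₂²). This kernel is swapped in for illustration and is not the one discussed above; the function names are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel (x . z)^2 in two dimensions."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Same quantity computed in the original 2-D space."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # map to 3-D first, then take the dot product
implicit = poly2_kernel(x, z)       # never leaves the 2-D space

print(explicit, implicit)
```

Both routes give the same number, but the kernel route never builds the 3-D vectors. For higher degrees or the RBF kernel, that saving is what makes the computation feasible.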

from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Data that is NOT linearly separable in 2D
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3)

# Linear kernel fails on this data
linear_svm = SVC(kernel='linear').fit(X, y)
print('Linear SVM accuracy:', linear_svm.score(X, y))  # poor

# RBF kernel implicitly maps to higher dimensions, separates easily
rbf_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)
print('RBF SVM accuracy:', rbf_svm.score(X, y))  # much better

# gamma controls the 'reach' of each training example's influence
# Small gamma: smoother decision boundary (far-reaching influence)
# Large gamma: tighter decision boundary around each point (can overfit)
What does the kernel trick allow SVMs to avoid computing explicitly?
What does a small gamma value in an RBF kernel correspond to?
13. Why does K-Nearest Neighbors suffer from the curse of dimensionality, mathematically?

KNN relies on the assumption that nearby points in feature space share similar labels — its entire predictive power comes from local neighborhoods being meaningful. As the number of dimensions d increases, two related mathematical phenomena destroy this assumption.

First, the volume of a hypersphere relative to its bounding hypercube shrinks rapidly as d increases — most of the volume of a high-dimensional cube concentrates near its corners, far from the center. Second, and more critically, distances between randomly distributed points become increasingly similar to each other as d grows: the ratio of the distance to the nearest neighbor versus the farthest neighbor approaches 1. This means in high dimensions, the concept of 'nearest' neighbor becomes statistically meaningless — every point is approximately equidistant from every other point.
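The first effect can be computed exactly. The volume of a unit-radius ball in d dimensions is π^(d/2) / Γ(d/2 + 1), and its bounding cube [−1, 1]^d has volume 2^d. The sketch below evaluates the ratio for a few arbitrarily chosen dimensions; the function name is illustrative.

```python
import math

def ball_to_cube_volume_ratio(d):
    """Volume of the unit-radius d-ball divided by its bounding cube [-1, 1]^d."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube = 2.0 ** d
    return ball / cube

ratios = {d: ball_to_cube_volume_ratio(d) for d in (2, 3, 10, 20)}
for d, ratio in ratios.items():
    print(f"d={d:2d}: ball/cube volume ratio = {ratio:.2e}")
```

In two dimensions the inscribed circle covers π/4 ≈ 79% of the square. By ten dimensions the ball covers well under 1% of the cube, so almost all uniformly drawn points lie in the corners.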

import numpy as np

def distance_ratio_demo(n_dims_list, n_points=1000):
    for d in n_dims_list:
        points = np.random.uniform(0, 1, size=(n_points, d))
        query = np.random.uniform(0, 1, size=d)
        dists = np.linalg.norm(points - query, axis=1)
        ratio = dists.min() / dists.max()
        print(f'd={d:4d}: nearest/farthest distance ratio = {ratio:.4f}')

distance_ratio_demo([2, 10, 50, 200, 1000])
# Output shows the ratio approaching 1.0 as d grows —
# nearest and farthest neighbors become almost equidistant!

# Mitigation: dimensionality reduction before KNN
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_train, y_train)
What happens to the ratio between the nearest and farthest neighbor distances as dimensionality increases?
What is a common mitigation strategy for the curse of dimensionality in KNN?
14. What is the mathematical objective function K-Means optimises, and why can it converge to a local minimum?

K-Means seeks to partition n points into k clusters by minimising the within-cluster sum of squares (WCSS), also called inertia: J = Σₖ Σ_{x ∈ Cₖ} ‖x - μₖ‖², where μₖ is the centroid (mean) of cluster k. This is a non-convex combinatorial optimisation problem: it is NP-hard in general, and exhaustively checking every partition of n points into k groups is computationally infeasible for any realistic dataset size.

Lloyd's algorithm (the standard K-Means algorithm) solves this approximately through alternating optimisation: (1) assign each point to its nearest centroid, (2) recompute each centroid as the mean of its assigned points, repeat until convergence. Each step is guaranteed to never increase J, so the algorithm converges — but only to a local minimum that depends heavily on the random initial centroid positions.
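The two alternating steps can be sketched in a few lines of NumPy. This is a bare-bones illustration rather than scikit-learn's implementation: it uses plain random initialisation, a fixed iteration count, and keeps the old centroid if a cluster becomes empty. The function name `lloyd_kmeans` and the synthetic blobs are assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns centroids, labels, and inertia history."""
    rng = np.random.default_rng(seed)
    # Random initialisation: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = []
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        history.append(d2[np.arange(len(X)), labels].sum())  # current inertia J
        # Step 2: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels, np.array(history)

rng = np.random.default_rng(1)
blob_centers = ((0, 0), (4, 0), (2, 3))
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in blob_centers])

centroids, labels, history = lloyd_kmeans(X, k=3)
print("inertia per iteration:", np.round(history[:6], 2))
```

The recorded inertia never increases from one iteration to the next. Re-running with different `seed` values can end at different final inertias, which is the local-minimum behaviour described above.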

from sklearn.cluster import KMeans
import numpy as np

# k-means++ initialization (default) spreads initial centroids apart
# to reduce the chance of poor local minima — much better than random init
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
km.fit(X)

print('Inertia (WCSS):', km.inertia_)
print('Cluster centers:', km.cluster_centers_)

# n_init=10 runs the algorithm 10 times with different initializations
# and keeps the result with lowest inertia — mitigates local minima

# Elbow method to choose k: plot inertia vs k, look for the 'elbow'
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
# inertia always decreases with more clusters — look for diminishing returns
What objective function does K-Means minimise?
Why does K-Means only guarantee convergence to a local minimum rather than the global minimum?
15. What is the statistical rationale behind k-fold cross-validation, and why are k=5 or k=10 commonly used?

Cross-validation estimates how well a model generalises to unseen data by repeatedly splitting the training data into a training fold and a validation fold, training on the former and evaluating on the latter, then averaging the results. K-fold CV divides data into k equal partitions, using each partition once as validation while training on the remaining k-1 folds, giving k separate performance estimates that are then averaged.
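
The procedure described above can be written as an explicit loop and compared with `cross_val_score`. This is a small sketch on the iris dataset; the dataset and the `LogisticRegression` model are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in kf.split(X_demo):
    # Train a fresh copy on k-1 folds, evaluate on the held-out fold
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_demo, y_demo, cv=kf)
print('Manual :', manual_scores, 'mean =', manual_scores.mean())
print('Library:', library_scores, 'mean =', library_scores.mean())
```

Because the splitter uses a fixed `random_state`, both approaches see the same folds and should report the same per-fold accuracies.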

This addresses a fundamental statistical tension: using more folds (larger k) means each training set is larger and closer to using all the data, which reduces bias in the performance estimate, but the k estimates become more correlated with each other (since training sets overlap heavily), which can increase variance of the final averaged estimate. The extreme case, k=n (leave-one-out CV), has very low bias but high variance and is computationally expensive. Empirically, k=5 or k=10 has been found to offer a good bias-variance balance for the estimate itself, while remaining computationally tractable.

from sklearn.model_selection import KFold, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Standard k-fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print(f'Mean: {scores.mean():.3f}, Std: {scores.std():.3f}')

# StratifiedKFold preserves class proportions in each fold —
# CRITICAL for imbalanced classification problems
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf)

# Leave-one-out: k=n, very low bias but high variance, expensive
from sklearn.model_selection import LeaveOneOut
# loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # slow!
Why does increasing k in k-fold cross-validation reduce bias in the performance estimate?
Why might leave-one-out cross-validation (k=n) have higher variance than k=10 despite having lower bias?
16. What does the ROC-AUC score mathematically represent, and why is it threshold-independent?

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (TPR/recall) against the False Positive Rate (FPR) as the classification decision threshold is varied from 0 to 1. The Area Under this Curve (AUC) has an elegant probabilistic interpretation: AUC equals the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative example: AUC = P(score(positive) > score(negative)).

This is why AUC is threshold-independent — it measures the model's ability to rank positive examples above negative examples across all possible thresholds simultaneously, rather than evaluating performance at one specific cutoff. A perfect classifier achieves AUC=1.0 (every positive ranked above every negative); random guessing achieves AUC=0.5 (equivalent to a coin flip ranking).
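
Since AUC depends only on how the scores rank the examples, any strictly increasing transformation of the scores should leave it unchanged. A quick check with small made-up scores (the two transformations below are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9])

auc_raw = roc_auc_score(y_true, y_score)
# Both transformations are strictly increasing on (0, 1), so rankings are preserved
auc_squared = roc_auc_score(y_true, y_score ** 2)
auc_logit = roc_auc_score(y_true, np.log(y_score / (1 - y_score)))

print(auc_raw, auc_squared, auc_logit)  # all three equal 11/12
```

The scores change substantially under each transformation, yet 11 of the 12 positive/negative pairs remain correctly ordered in every case.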

from sklearn.metrics import roc_auc_score, roc_curve
import numpy as np

y_true  = np.array([0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9])

auc = roc_auc_score(y_true, y_score)
print(f'AUC: {auc:.3f}')

# Manual verification: count concordant pairs (Mann-Whitney U statistic)
pos_scores = y_score[y_true == 1]
neg_scores = y_score[y_true == 0]
concordant = sum(p > n for p in pos_scores for n in neg_scores)
total_pairs = len(pos_scores) * len(neg_scores)
print(f'Manual AUC: {concordant / total_pairs:.3f}')  # matches roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# Each point on the curve corresponds to a different threshold

Caution with imbalanced classes: AUC can be misleadingly optimistic on highly imbalanced datasets because the False Positive Rate denominator (total negatives) is large, making even a meaningful number of false positives look small. In such cases, Precision-Recall AUC is usually more informative.

What probability does the AUC score represent?
Why might ROC-AUC give a misleadingly optimistic picture on highly imbalanced datasets?
17. Explain the mathematical tradeoff between precision and recall, and why F1 score is the harmonic mean rather than the arithmetic mean.

Precision is TP/(TP+FP): of everything predicted positive, what fraction was actually positive. Recall is TP/(TP+FN): of everything that was actually positive, what fraction did the model find. Adjusting the classification threshold creates a tradeoff: lowering the threshold can only keep or raise recall, because more true positives are captured, but it typically admits more false positives too, which tends to lower precision, and vice versa.
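
To make the tradeoff concrete, precision and recall can be computed by hand at a few thresholds. The helper `precision_recall_at` and the three thresholds are hypothetical choices for illustration; the labels and scores are small made-up values.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_scores = np.array([0.9, 0.2, 0.4, 0.85, 0.1, 0.7, 0.55, 0.3])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    y_pred = y_scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(t)
    print(f'threshold={t:.2f}: precision={p:.2f}, recall={r:.2f}')
```

On this data, moving the threshold from 0.8 to 0.5 to 0.25 raises recall from 0.50 to 0.75 to 1.00 while precision falls from 1.00 to 0.75 to about 0.67.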

F1 score is the harmonic mean of precision and recall: F1 = 2·(P·R)/(P+R), rather than the simple arithmetic mean (P+R)/2. The harmonic mean punishes extreme imbalance between the two values much more heavily — if precision is 1.0 and recall is 0.0, the arithmetic mean gives a deceptively decent 0.5, while the harmonic mean correctly gives 0.0, since a model with zero recall is useless regardless of how precise its rare positive predictions are.

from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_curve
import numpy as np

# Demonstrating why harmonic mean is preferred
precision, recall = 1.0, 0.01  # extreme imbalance
arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * (precision * recall) / (precision + recall)
print(f'Arithmetic: {arithmetic_mean:.3f}')  # 0.505 — misleadingly good
print(f'Harmonic:   {harmonic_mean:.3f}')    # 0.020 — correctly reflects uselessness

# Using scikit-learn metrics
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print('Precision:', precision_score(y_true, y_pred))
print('Recall:',    recall_score(y_true, y_pred))
print('F1:',        f1_score(y_true, y_pred))

# Tuning threshold to favor one metric over the other
y_scores = [0.9, 0.2, 0.4, 0.85, 0.1, 0.7, 0.55, 0.3]
precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
Why is the F1 score computed as a harmonic mean rather than an arithmetic mean of precision and recall?
What happens to precision when you lower the classification threshold to increase recall?
18. What is the 'naive' independence assumption in Naive Bayes, and why does it still work well in practice despite being unrealistic?

Naive Bayes applies Bayes' theorem to classify: P(y|x₁,...,xₙ) ∝ P(y)·P(x₁,...,xₙ|y). Computing the joint likelihood P(x₁,...,xₙ|y) exactly would require modelling all interactions between features — infeasible with limited data. The 'naive' simplification assumes all features are conditionally independent given the class: P(x₁,...,xₙ|y) = ∏ P(xᵢ|y), reducing the problem to estimating n simple univariate distributions instead of one complex n-dimensional joint distribution.

This independence assumption is almost always technically false (features usually correlate), yet Naive Bayes frequently performs well because classification only requires getting the relative ranking of class probabilities correct, not their exact values. Even with a biased probability estimate, if the bias affects all classes similarly, the argmax decision (which class has highest probability) often remains correct — a well-known result is that NB's classification accuracy can be good even when its probability estimates are poorly calibrated.

from sklearn.naive_bayes import GaussianNB, MultinomialNB
import numpy as np

# GaussianNB: assumes each feature is normally distributed within each class
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Manual demonstration of the independence factorization
def naive_bayes_predict(x, class_priors, feature_likelihoods):
    scores = {}
    for c in class_priors:
        # log space to avoid numerical underflow from many small probabilities
        log_prob = np.log(class_priors[c])
        for i, xi in enumerate(x):
            log_prob += np.log(feature_likelihoods[c][i](xi) + 1e-10)
        scores[c] = log_prob
    return max(scores, key=scores.get)

# MultinomialNB: common for text classification (word counts)
mnb = MultinomialNB()
mnb.fit(X_train_counts, y_train)
What does the 'naive' independence assumption in Naive Bayes simplify?
Why can Naive Bayes still classify accurately even though its independence assumption is usually false?
19. Why is a log transformation commonly applied to skewed numerical features before modeling, mathematically?

Many real-world quantities — income, population, word frequencies, prices — follow a right-skewed (long right tail) distribution, often approximately log-normal. The mathematical property of the logarithm that makes it useful here is that it compresses large values much more than small ones: log(1000) - log(100) ≈ 2.3 while log(100) - log(10) ≈ 2.3 as well — equal ratios become equal differences after a log transform. This converts multiplicative relationships into additive ones and pulls in the long tail, making the distribution closer to symmetric/normal.
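
A two-line numeric check of the "equal ratios become equal differences" property, using arbitrary powers of ten as illustrative values:

```python
import numpy as np

values = np.array([10.0, 100.0, 1000.0, 10000.0])

raw_gaps = np.diff(values)           # gaps grow tenfold each step: 90, 900, 9000
log_gaps = np.diff(np.log(values))   # every gap equals ln(10), about 2.303

print('Raw gaps:', raw_gaps)
print('Log gaps:', np.round(log_gaps, 3))
```

Each step multiplies the value by 10, so on the log scale each step adds the same constant, which is what pulls in the long right tail.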

This matters for linear models because classical OLS inference assumes errors with constant variance (homoscedasticity) and, for exact confidence intervals and tests, normally distributed errors. A strongly skewed target often violates this, producing heteroscedastic residuals where prediction error grows with the magnitude of the target. It also matters for distance-based and gradient-based methods, where a few extreme values in the raw scale would otherwise dominate the loss or distance calculations.

import numpy as np
from sklearn.preprocessing import FunctionTransformer, PowerTransformer
import pandas as pd

# Simulating right-skewed income data
income = np.random.lognormal(mean=10, sigma=1, size=1000)
print('Skewness before:', pd.Series(income).skew())     # highly positive

log_income = np.log1p(income)  # log1p handles zero values safely: log(1+x)
print('Skewness after:', pd.Series(log_income).skew())  # close to 0

# Integrate into a scikit-learn pipeline
log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.fit_transform(X[['income']])

# Box-Cox / Yeo-Johnson: more general power transforms that
# find the optimal transformation parameter automatically
pt = PowerTransformer(method='yeo-johnson')  # handles negative values too
X_transformed = pt.fit_transform(X)
What mathematical property of the logarithm makes it useful for right-skewed data?
Why is np.log1p often preferred over np.log when transforming features that may contain zero?
20. What is multicollinearity, mathematically, and how does the Variance Inflation Factor (VIF) detect it?

Multicollinearity occurs when two or more predictor features are highly linearly correlated with each other. Mathematically, this means the design matrix X approaches rank deficiency — the columns become nearly linearly dependent, causing XᵀX to become nearly singular (its determinant approaches zero). Since OLS requires inverting XᵀX, near-singularity makes (XᵀX)⁻¹ numerically unstable: small changes in the data produce wildly different coefficient estimates, and standard errors of the coefficients inflate dramatically.

The Variance Inflation Factor for feature j is computed by regressing feature j against all other features and measuring VIF_j = 1/(1-R²_j), where R²_j is the R-squared of that auxiliary regression. If feature j is well-predicted by the others (high R²_j), VIF is large — a VIF of 10 means the variance of that coefficient's estimate is 10 times what it would be if the feature were uncorrelated with the others. A common rule of thumb flags VIF > 5 or 10 as concerning.
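
The definition can be reproduced with an explicit auxiliary regression. This is a sketch on synthetic data, where `x3` is constructed to be nearly a copy of `x1`; the helper name `vif` and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # independent of the other features
x3 = x1 + rng.normal(scale=0.1, size=n)    # nearly a copy of x1
X_demo = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing feature j on the rest."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X_demo, j) for j in range(X_demo.shape[1])]
for name, value in zip(['x1', 'x2', 'x3'], vifs):
    print(f'{name}: VIF = {value:.1f}')
```

The two nearly collinear features receive very large VIFs, while the independent feature stays close to 1.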

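from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import pandas as pd
import numpy as np

def calculate_vif(X_df):
    # statsmodels does not add an intercept to the auxiliary regressions,
    # so include a constant column and report VIF only for the real features
    X_const = add_constant(X_df)
    vif_data = pd.DataFrame()
    vif_data['feature'] = X_df.columns
    vif_data['VIF'] = [
        variance_inflation_factor(X_const.values, i)
        for i in range(1, X_const.shape[1])  # index 0 is the constant
    ]
    return vif_data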

print(calculate_vif(X_train_df))
# feature      VIF
# age          1.2
# income       8.7   <- concerning, correlated with other features
# debt_ratio   9.1   <- concerning

# Mitigation: use Ridge regression, which handles multicollinearity
# gracefully by shrinking correlated coefficients together
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
What mathematical problem does severe multicollinearity cause for OLS regression?
How is the Variance Inflation Factor for a feature computed?
21. Why must features be standardized before applying Ridge or Lasso regularization, mathematically?

Ridge and Lasso add a penalty proportional to coefficient magnitude — λΣβⱼ² or λΣ|βⱼ| respectively. The magnitude of a coefficient βⱼ is inversely related to the scale of its corresponding feature: if feature j is measured in millions (e.g. company revenue) its coefficient will naturally be tiny, while a feature measured in single digits (e.g. years of experience) will need a much larger coefficient to have comparable predictive impact. Without standardization, the penalty term unfairly penalises features on small scales (which need large coefficients) far more than features on large scales (which need small coefficients), regardless of their actual importance to the prediction.

After standardizing all features to have mean 0 and standard deviation 1, every coefficient represents "effect per one standard deviation change" on a comparable scale, so the regularization penalty treats all features fairly based on their actual predictive contribution rather than an arbitrary measurement unit.
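
One way to see the scale dependence is to change the unit of a single feature and refit. Unpenalised OLS absorbs the change into the coefficient and predicts identically, while Ridge does not. The sketch below uses synthetic data; the factor of 1000 and `alpha=1.0` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Same information, but feature 0 is expressed in a unit 1000x larger
# (e.g. metres -> kilometres), so its values are 1000x smaller
X_rescaled = X_demo.copy()
X_rescaled[:, 0] /= 1000.0

ols_same = np.allclose(
    LinearRegression().fit(X_demo, y_demo).predict(X_demo),
    LinearRegression().fit(X_rescaled, y_demo).predict(X_rescaled),
)
ridge_same = np.allclose(
    Ridge(alpha=1.0).fit(X_demo, y_demo).predict(X_demo),
    Ridge(alpha=1.0).fit(X_rescaled, y_demo).predict(X_rescaled),
)
print('OLS predictions unchanged by rescaling:  ', ols_same)
print('Ridge predictions unchanged by rescaling:', ridge_same)
```

After rescaling, feature 0 would need a coefficient about 1000 times larger to keep the same effect, and the L2 penalty makes that prohibitively expensive, so Ridge effectively discards the feature.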

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# WRONG: regularizing on raw, unscaled features
ridge_unscaled = Ridge(alpha=1.0).fit(X_train, y_train)
print('Unscaled coefficients:', ridge_unscaled.coef_)
# A revenue-in-dollars feature might get coef ~0.00001
# A years-of-experience feature might get coef ~500
# The penalty unfairly shrinks the experience coefficient much more

# CORRECT: always scale before regularized linear models
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0)),
])
pipeline.fit(X_train, y_train)

scaled_coefs = pipeline.named_steps['ridge'].coef_
print('Scaled coefficients (comparable):', scaled_coefs)

# Note: scikit-learn's LinearRegression and tree models don't
# need this — only penalized linear models (Ridge, Lasso, ElasticNet)
Why does the regularization penalty in Ridge/Lasso unfairly affect unscaled features?
After standardizing features, what does each Ridge/Lasso coefficient represent?
22. What is the mathematical relationship between learning_rate and n_estimators in gradient boosting?

In gradient boosting, the final ensemble prediction is F(x) = F₀(x) + η · Σₘ hₘ(x), where η is the learning rate (also called shrinkage) and the sum runs over n_estimators trees. The learning rate scales down the contribution of each individual tree. A smaller η means each tree contributes less to the final prediction, so more trees (larger n_estimators) are needed to reach the same training fit: the two parameters trade off roughly inversely, so that lowering η calls for a correspondingly larger n_estimators.
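
The additive form F(x) = F₀(x) + η · Σₘ hₘ(x) can be sketched as a short loop for squared-error loss. This is a simplified illustration, not scikit-learn's implementation; the function `boost`, the sine-wave data, and the two learning rates are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

def boost(X, y, learning_rate, n_estimators):
    """Squared-error gradient boosting; returns training MSE after each tree."""
    prediction = np.full(len(y), y.mean())          # F0: constant initial model
    mse_path = [np.mean((y - prediction) ** 2)]
    for _ in range(n_estimators):
        residuals = y - prediction                  # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)   # F <- F + eta * h_m
        mse_path.append(np.mean((y - prediction) ** 2))
    return mse_path

fast = boost(X_demo, y_demo, learning_rate=0.3, n_estimators=50)
slow = boost(X_demo, y_demo, learning_rate=0.03, n_estimators=50)
print(f'Training MSE after 50 trees, lr=0.3 : {fast[-1]:.4f}')
print(f'Training MSE after 50 trees, lr=0.03: {slow[-1]:.4f}')
```

With the same tree budget, the smaller learning rate has made less progress on the training set, which is why it needs a larger n_estimators to reach a comparable fit.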

The reason smaller learning rates with more estimators usually generalise better, despite requiring more compute, is regularisation through gradual fitting: taking many small steps allows the ensemble to average out noise in individual trees' residual-fitting, similar to how a smaller step size in gradient descent finds a more precise minimum. A large learning rate with few trees can overfit aggressively to the training residuals in just a few steps.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

configs = [
    {'learning_rate': 0.3, 'n_estimators': 50},    # fast, fewer trees
    {'learning_rate': 0.1, 'n_estimators': 150},   # balanced
    {'learning_rate': 0.01, 'n_estimators': 1500}, # slow, many trees
]

for cfg in configs:
    gb = GradientBoostingClassifier(**cfg, max_depth=3, random_state=42)
    scores = cross_val_score(gb, X, y, cv=5)
    print(f"lr={cfg['learning_rate']}, n_est={cfg['n_estimators']}: "
          f"{scores.mean():.3f}")
    # Smaller lr + more trees often generalizes slightly better,
    # at the cost of significantly longer training time

# Practical rule of thumb: lower the learning rate, increase n_estimators
# proportionally, and use early stopping (validation_fraction,
# n_iter_no_change) to find the right number of trees automatically
gb_early_stop = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=1000,
    validation_fraction=0.1,
    n_iter_no_change=10,   # stop if no improvement for 10 rounds
)
What does the learning_rate (shrinkage) parameter control in gradient boosting?
Why does a smaller learning rate combined with more trees often generalize better than a large learning rate with few trees?
23. How does the softmax function generalize logistic regression to multiclass classification, mathematically?

Binary logistic regression uses the sigmoid function to convert a single linear score into a probability between 0 and 1. For k classes, the softmax function generalises this: given k linear scores (logits) z₁,...,zₖ, softmax computes p_i = e^{z_i} / Σⱼ e^{z_j} for each class i. This produces a valid probability distribution — all values are positive and sum to exactly 1 — by exponentiating each score (making them positive) and normalising by the sum of all exponentials.

Softmax reduces exactly to the sigmoid function in the binary case: with two classes and scores z₁, z₂, p_1 = e^{z_1}/(e^{z_1}+e^{z_2}) = 1/(1+e^{-(z_1-z_2)}) — the sigmoid of the score difference. The training objective generalises from binary cross-entropy to categorical cross-entropy: L = -Σᵢ yᵢ log(pᵢ), summed over all classes for each example, which reduces to the negative log of the predicted probability for the true class.
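
The binary reduction is easy to confirm numerically. In this small check the two logits are arbitrary example values:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))   # shift for numerical stability
    return exp_z / exp_z.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z1, z2 = 1.7, -0.4                  # arbitrary logits for two classes
p1_softmax = softmax(np.array([z1, z2]))[0]
p1_sigmoid = sigmoid(z1 - z2)

print(p1_softmax, p1_sigmoid)       # the two values agree
```

Only the difference z₁ - z₂ matters, which is why binary logistic regression needs a single score rather than one per class.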

import numpy as np
from sklearn.linear_model import LogisticRegression

def softmax(z):
    z_shifted = z - np.max(z)         # numerical stability trick
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # [0.659, 0.242, 0.099] — sums to 1.0
print(probs.sum())  # 1.0

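# scikit-learn fits a multinomial (softmax) model by default for multiclass
# targets with the lbfgs solver; the multi_class argument is deprecated
# in recent releases, so it is omitted here
model = LogisticRegression(solver='lbfgs')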
model.fit(X_train, y_train)
probs_sklearn = model.predict_proba(X_test)
print(probs_sklearn.sum(axis=1))  # each row sums to 1.0

Why subtract the max before exponentiating: e^z can overflow to infinity for large z; subtracting the maximum logit first makes every exponent ≤ 0, so every exponential lies in (0, 1], while leaving the final softmax output mathematically unchanged (since the common factor e^{-max} cancels between numerator and denominator).

What property does the softmax function guarantee for its output values?
Why is the maximum logit subtracted from all logits before applying the exponential in softmax?
24. Why does fitting a scaler or transformer on the entire dataset (before train/test split) cause data leakage, mathematically?

Data leakage occurs when information from outside the training set improperly influences the model. If you fit a StandardScaler on the full dataset before splitting, the computed mean and standard deviation incorporate statistics from the test set. The scaled training data therefore implicitly contains information about the test set's distribution — even though no test labels are involved, the model's effective input distribution has been informed by test data it should never have seen.

This violates the assumption underlying generalisation estimates: the test set should represent a complete simulation of unseen future data, where you have no access to its statistics at training time. Although the leakage from this specific mistake is often small in magnitude, it systematically biases test performance to look better than true generalisation performance — and the bias compounds when more elaborate preprocessing or feature engineering is involved.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression  # used in the Pipeline below

# WRONG: fit scaler on all data, THEN split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # leakage! mean/std include test rows
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# CORRECT: split first, fit scaler ONLY on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled  = scaler.transform(X_test)        # transform test with train's stats

# BEST PRACTICE: use a Pipeline — guarantees correct fit/transform separation
# automatically, especially important inside cross-validation loops
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipeline.fit(X_train, y_train)  # scaler only sees X_train internally
What statistical information leaks into the model when you fit a scaler before splitting train and test sets?
Why is using a scikit-learn Pipeline recommended for preventing this kind of leakage during cross-validation?
25. How does the class_weight parameter mathematically address class imbalance in scikit-learn classifiers?

When classes are imbalanced (e.g. 95% negative, 5% positive), a model trained with the standard loss function will naturally lean toward predicting the majority class, since doing so minimises average loss across the imbalanced training set even while completely ignoring the minority class. The class_weight='balanced' option modifies the loss function to multiply each sample's contribution by a weight inversely proportional to its class frequency: weight_c = n_samples / (n_classes × n_samples_c).

This re-weighting effectively makes errors on minority-class examples count more in the total loss, forcing the optimisation to pay attention to them despite their rarity. For logistic regression, this directly modifies the cross-entropy loss term per sample; for SVMs, it modifies the penalty for margin violations on each class; for trees, it modifies the impurity calculation to weight minority-class samples more heavily when computing splits.
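
For logistic regression, the re-weighting amounts to per-sample weights in the loss, so `class_weight='balanced'` should give the same fit as passing the equivalent `sample_weight`. The sketch below uses synthetic imbalanced data; the dataset parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.95, 0.05], random_state=0
)

# Option 1: let the estimator derive the weights from class frequencies
by_class = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_demo, y_demo)

# Option 2: compute the same weights per sample and pass them explicitly
per_sample = compute_sample_weight('balanced', y_demo)
by_sample = LogisticRegression(max_iter=1000).fit(X_demo, y_demo, sample_weight=per_sample)

print('Distinct sample weights:', np.unique(np.round(per_sample, 3)))
print('Coefficients match:', np.allclose(by_class.coef_, by_sample.coef_, atol=1e-4))
```

Every minority-class sample carries the larger of the two weights, so each of its errors contributes proportionally more to the total loss.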

from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

y = np.array([0]*950 + [1]*50)  # 95%/5% imbalance

weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(weights)  # [0.526, 10.0] — minority class weighted 19x more

# weight_0 = 1000 / (2 * 950) = 0.526
# weight_1 = 1000 / (2 * 50)  = 10.0

# Apply during training
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Custom weights also supported
model_custom = LogisticRegression(class_weight={0: 1, 1: 15})

# Tree-based models support the same parameter
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight='balanced')
How is the 'balanced' class weight computed for a given class c?
What practical problem does class_weight='balanced' address during model training?
26. Why does using simple label encoding (integers) for nominal categorical features mislead most machine learning models, mathematically?

Label encoding assigns each category an arbitrary integer: e.g. Red=0, Green=1, Blue=2. The problem is that most models — linear regression, logistic regression, distance-based methods, and even many tree splitting algorithms that treat features as ordered — implicitly assume numeric features have a meaningful order and magnitude. A linear model would learn a single coefficient β for this feature, implying Blue (2β) is "twice as much" of something as Green (1β), and the effect of going from Red to Green is identical in size to going from Green to Blue. For a nominal (unordered) category like colour, this numeric relationship is meaningless and introduces a false signal.

One-hot encoding solves this by representing each category as a separate binary indicator column, removing any implied ordering or magnitude relationship: the model learns an independent coefficient for each category, with no false constraint linking them. The mathematical price is increased dimensionality — k categories become k (or k-1, with drop='first' to avoid the dummy variable trap) separate columns instead of one.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import numpy as np

colors = np.array(['Red', 'Green', 'Blue', 'Green']).reshape(-1, 1)

# Label encoding: implies a false ordering (LabelEncoder sorts the
# categories alphabetically, so Red=2 > Green=1 > Blue=0)
le = LabelEncoder()
label_encoded = le.fit_transform(colors.ravel())
print(label_encoded)  # [2, 1, 0, 1] — numerically meaningless order

# One-hot encoding — no implied order, each category is independent
ohe = OneHotEncoder(sparse_output=False, drop='first')
one_hot = ohe.fit_transform(colors)
print(one_hot)
# Columns are [Green, Red]; Blue (first alphabetically) is the dropped baseline
# [[0., 1.],   # Red
#  [1., 0.],   # Green
#  [0., 0.],   # Blue   (baseline: all zeros)
#  [1., 0.]]   # Green

# EXCEPTION: tree-based models can sometimes handle label-encoded
# nominal features reasonably well since they split on thresholds
# rather than assuming linear magnitude relationships, but one-hot
# encoding (or target encoding) is still generally safer
What false assumption does label encoding introduce for a nominal categorical feature?
What is the mathematical tradeoff introduced by one-hot encoding compared to label encoding?
27. What is the difference between a single train/validation/test split and k-fold cross-validation for hyperparameter tuning, statistically?

A single validation split estimates a hyperparameter's performance using just one specific subset of data — this estimate has high variance because it depends entirely on which particular samples happened to land in the validation fold. If you tune hyperparameters against this single estimate, you risk overfitting to the quirks of that specific split (sometimes called "validation set overfitting").

K-fold cross-validation produces k separate performance estimates by rotating which fold serves as validation, then averages them. The variance of this average is typically lower than the variance of a single estimate (how much lower depends on the correlation between folds, as discussed in the bias-variance tradeoff of CV), giving a more statistically reliable signal for comparing hyperparameter choices, at the cost of k times the computation.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Single validation split approach (faster, noisier)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
best_score, best_C = 0, None
for C in [0.1, 1, 10, 100]:
    model = SVC(C=C).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_C = score, C

# K-fold cross-validation approach (slower, more reliable signal)
param_grid = {'C': [0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_full, y_train_full)
print('Best C:', grid_search.best_params_)
print('CV score:', grid_search.best_score_)

# Nested CV for unbiased performance estimate after tuning:
# outer loop estimates generalization, inner loop tunes hyperparameters
from sklearn.model_selection import cross_val_score
nested_scores = cross_val_score(grid_search, X, y, cv=5)
Why does a single train/validation split have higher variance than k-fold cross-validation for hyperparameter selection?
What is the purpose of nested cross-validation when both tuning hyperparameters and estimating final model performance?
28. Why is PCA sensitive to feature scaling while decision tree feature importance is not, mathematically?

PCA's objective is to find directions of maximum variance in the data, computed from the covariance matrix. Variance is measured in squared units of the original feature, so a feature measured in large units (e.g. salary in dollars, variance in the millions) will dominate the covariance matrix and consequently the principal components, regardless of whether that feature is actually more informative than a feature measured in small units (e.g. age in years, variance in the tens). This makes PCA fundamentally scale-dependent.

Decision trees, by contrast, choose splits based on threshold comparisons (feature ≤ t) and evaluate the resulting impurity reduction — neither the comparison nor the impurity calculation depends on the numeric scale of the feature, only its relative ordering and how well a split separates classes/reduces variance. Multiplying a feature by 1000 doesn't change which split point achieves the best separation, so tree-based feature importance (computed from total impurity reduction attributable to a feature across all trees/splits) is naturally scale-invariant.
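
The scale-invariance claim can be checked by refitting a tree after changing the unit of one feature. This is a sketch on the iris dataset; the dataset and the factor of 1000 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_scaled_up = X_demo.copy()
X_scaled_up[:, 0] *= 1000.0      # change the unit of the first feature only

# Same random_state so both trees break ties between candidate splits identically
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
tree_big = DecisionTreeClassifier(random_state=0).fit(X_scaled_up, y_demo)

same_predictions = np.array_equal(
    tree_raw.predict(X_demo), tree_big.predict(X_scaled_up)
)
same_importances = np.allclose(
    tree_raw.feature_importances_, tree_big.feature_importances_
)
print('Same predictions:', same_predictions)
print('Same feature importances:', same_importances)
```

Rescaling moves the split thresholds by the same factor but leaves the ordering of the samples, and therefore every impurity calculation, untouched.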

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Demonstrating PCA's scale sensitivity
X_unscaled = np.column_stack([
    np.random.randn(100) * 1,        # small variance feature
    np.random.randn(100) * 1000,     # huge variance feature (different units)
])

pca_unscaled = PCA(n_components=2).fit(X_unscaled)
print(pca_unscaled.explained_variance_ratio_)
# Almost entirely dominated by the large-variance feature!

pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X_unscaled))
print(pca_scaled.explained_variance_ratio_)
# Roughly 50/50: after standardization each feature has unit variance,
# so neither dominates purely because of its measurement unit

# Tree-based feature importance is scale-invariant — no scaling needed
rf = RandomForestClassifier(n_estimators=100).fit(X_unscaled, y)
print(rf.feature_importances_)  # unaffected by the artificial scale difference
Why is decision tree feature importance unaffected by feature scaling?
Why can an unscaled feature with artificially large variance dominate the first principal component in PCA?
29. Why is the decision boundary of standard logistic regression always a straight line (or hyperplane), mathematically?

Logistic regression predicts class 1 when p(y=1|x) = σ(wᵀx + b) ≥ 0.5. Since the sigmoid function σ is monotonically increasing and equals exactly 0.5 when its input is 0, this condition simplifies to wᵀx + b ≥ 0 — a linear inequality in x. The boundary where the model is exactly undecided (p=0.5) is therefore the set of points satisfying wᵀx + b = 0, which is precisely the equation of a hyperplane (a line in 2D, a plane in 3D, and so on).
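
The equivalence between the probability rule and the linear rule can be verified on a fitted model. This is a small sketch using synthetic two-feature data; the names `rule_pred` and `proba_pred` are introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
clf = LogisticRegression().fit(X_demo, y_demo)

w, b = clf.coef_[0], clf.intercept_[0]
linear_score = X_demo @ w + b                      # w^T x + b for every sample

rule_pred = (linear_score > 0).astype(int)         # linear rule
proba_pred = (clf.predict_proba(X_demo)[:, 1] > 0.5).astype(int)   # probability rule

print('Linear rule matches predict():      ', np.array_equal(rule_pred, clf.predict(X_demo)))
print('Linear rule matches p > 0.5 rule:   ', np.array_equal(rule_pred, proba_pred))
```

The predicted class is determined entirely by the sign of wᵀx + b, so the boundary is the hyperplane where that expression equals zero.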

This is mathematically guaranteed regardless of how the weights w are learned — the sigmoid transformation only reshapes the probability output, it never changes the fact that the underlying decision rule depends linearly on x. To capture non-linear decision boundaries, you must either engineer non-linear features (e.g. polynomial terms x², x₁x₂) before applying logistic regression, or switch to inherently non-linear models like kernel SVMs, trees, or neural networks.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4)

# Plain logistic regression: linear boundary, FAILS on circular data
plain_logreg = LogisticRegression().fit(X, y)
print('Plain accuracy:', plain_logreg.score(X, y))  # poor, ~50%

# Add polynomial features to create a non-linear boundary
# in the ORIGINAL space (still linear in the TRANSFORMED space)
poly_logreg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression()
)
poly_logreg.fit(X, y)
print('Polynomial accuracy:', poly_logreg.score(X, y))  # much better
# The model is STILL linear in the transformed feature space
# (x1, x2, x1^2, x1*x2, x2^2), but the boundary curves in original space
Why does logistic regression always produce a linear decision boundary in its original input space?
How can you make logistic regression produce a non-linear decision boundary?
30. Why can R-squared be a misleading metric for model comparison, and how does adjusted R-squared address this?

R-squared is defined as R² = 1 - SS_res/SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares Σ(yᵢ - ȳ)², which is proportional to the variance of y. It represents the proportion of variance in y explained by the model. The mathematical issue is that, on the training data, R² is monotonically non-decreasing as you add more features to a linear model. Even a completely random, uninformative feature cannot decrease it, because OLS can always set that feature's coefficient to exactly zero and recover the smaller model's fit, and will otherwise pick a coefficient that fits the training data at least marginally better.

This makes R² unsuitable for comparing models with different numbers of features, since it will always favor the more complex model even if the added complexity is just overfitting noise. Adjusted R² corrects for this by introducing a penalty for the number of predictors p: R²_adj = 1 - (1-R²)(n-1)/(n-p-1). This formula can actually decrease when an added feature doesn't improve the fit enough to outweigh the penalty for the added degree of freedom, making it a fairer basis for comparing models with different feature counts.
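To make the penalty concrete, here is a worked example with hypothetical numbers (chosen for illustration, not taken from a real fit): n = 100 samples, a 3-feature model with R² = 0.900, and a 13-feature model with R² = 0.905.

```latex
\begin{aligned}
R^2_{\text{adj}}(p=3)  &= 1 - (1 - 0.900)\,\frac{100 - 1}{100 - 3 - 1}
                        = 1 - 0.100 \cdot \frac{99}{96} \approx 0.8969 \\
R^2_{\text{adj}}(p=13) &= 1 - (1 - 0.905)\,\frac{100 - 1}{100 - 13 - 1}
                        = 1 - 0.095 \cdot \frac{99}{86} \approx 0.8906
\end{aligned}
```

Plain R² rises by 0.005 when the ten extra features are added, yet adjusted R² falls by about 0.006, because the small gain in fit does not cover the cost of ten additional degrees of freedom.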

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Demonstrating R^2 always increasing with more (even random) features
n_samples = 100
X_real = np.random.randn(n_samples, 3)
y = X_real @ [2, -1, 0.5] + np.random.randn(n_samples) * 0.5

model1 = LinearRegression().fit(X_real, y)
r2_1 = r2_score(y, model1.predict(X_real))

# Add 10 completely random, uninformative columns
X_noisy = np.column_stack([X_real, np.random.randn(n_samples, 10)])
model2 = LinearRegression().fit(X_noisy, y)
r2_2 = r2_score(y, model2.predict(X_noisy))

print(f'R^2 with 3 features:  {r2_1:.4f}')
print(f'R^2 with 13 features: {r2_2:.4f}')  # always >= r2_1, even with noise!

adj_r2_1 = adjusted_r2(r2_1, n_samples, 3)
adj_r2_2 = adjusted_r2(r2_2, n_samples, 13)
print(f'Adjusted R^2 (3 features):  {adj_r2_1:.4f}')
print(f'Adjusted R^2 (13 features): {adj_r2_2:.4f}')  # often lower!
Why can R-squared be misleading when comparing linear models with different numbers of features?
How does adjusted R-squared correct for the bias of plain R-squared?
31. Derive mathematically why bagging (bootstrap aggregating) reduces variance, and under what condition it does NOT help.

Suppose you have B models, each with the same variance σ² and the same expected prediction (so averaging leaves the bias unchanged). If their predictions were truly independent, the variance of their average would be Var(average) = σ²/B. Variance would shrink in inverse proportion to the number of models, approaching zero as B grows. This is the textbook justification for averaging predictions.

In practice, bagged models are trained on bootstrap samples drawn from the same original dataset, so their predictions are correlated with some pairwise correlation ρ, not independent. The correct formula for the variance of an average of B equally correlated variables is Var(average) = ρσ² + (1-ρ)σ²/B. As B → ∞, this converges to ρσ², not zero, meaning bagging's benefit is capped by how correlated the base models are. If the base models are highly correlated (ρ close to 1), bagging provides little benefit. This happens with stable, low-variance learners such as linear regression or heavily pruned trees, which barely change from one bootstrap sample to the next. Bagging also leaves bias untouched, so it cannot rescue a base model that underfits. This is exactly why random forests add feature-level randomness on top of bagging: to drive ρ down further.
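The variance formula above can be checked numerically. The sketch below is a minimal illustration under assumed values (B = 25, ρ = 0.3, σ² = 4, all chosen arbitrarily): it builds an equicorrelated covariance matrix, computes the variance of the average exactly, and compares it with both the formula and a Monte Carlo simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma2 = 25, 0.3, 4.0   # illustrative values, not from any real model

# Covariance of B predictions: sigma2 on the diagonal, rho*sigma2 elsewhere
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))

# Exact variance of the average: (1/B^2) * 1^T Cov 1
ones = np.ones(B)
exact_var = ones @ cov @ ones / B**2

# Closed-form expression from the text
formula_var = rho * sigma2 + (1 - rho) * sigma2 / B

# Monte Carlo check: simulate correlated predictions and average them
preds = rng.multivariate_normal(np.zeros(B), cov, size=100_000)
empirical_var = preds.mean(axis=1).var()

floor = rho * sigma2   # limit as B grows without bound

print(f'exact:     {exact_var:.4f}')
print(f'formula:   {formula_var:.4f}')
print(f'simulated: {empirical_var:.4f}')
print(f'floor as B grows: {floor:.4f}')
```

Raising B in this sketch shrinks only the (1-ρ)σ²/B term; the ρσ² floor stays put, which is the part that feature randomness in a random forest is meant to lower.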

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data so the snippet runs on its own (illustrative settings)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Plain bagging: bootstrap samples only, no feature randomness
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    max_features=1.0,   # every tree sees ALL features, so trees stay more correlated
)

# Random forest: bootstrap samples AND feature randomness
forest = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',  # only sqrt(n_features) candidates per split, so lower correlation
)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
forest_scores  = cross_val_score(forest, X, y, cv=5)

print('Bagging mean/std:', bagging_scores.mean(), bagging_scores.std())
print('Forest mean/std: ', forest_scores.mean(), forest_scores.std())
# The spread of fold scores is only a rough proxy for model variance;
# decorrelated trees often, but not always, give the forest the edge
In the variance formula for an average of B correlated estimators, Var = ρσ² + (1-ρ)σ²/B, what happens as B grows very large?
Why does bagging provide little benefit when the base models are highly correlated with each other?
32. Why does convexity of the loss function matter for optimization algorithms like gradient descent, mathematically?

A function is convex if the line segment connecting any two points on its graph lies on or above the graph. For a twice-differentiable function this is equivalent to its second derivative being non-negative everywhere (in multiple dimensions, the Hessian being positive semi-definite). The critical property of a convex function is that any local minimum is also a global minimum, so there are no inferior low points for the optimisation to get stuck in.

This is why gradient descent on convex losses like OLS's squared error or logistic regression's cross-entropy converges to a globally optimal solution (given a small enough learning rate, and provided a minimum exists), regardless of where the parameters are initialised. Non-convex loss landscapes, like those of neural networks, can have many local minima and saddle points, so gradient descent's final result depends on initialisation and may not be the best possible solution. This is also why neural network training leans on heuristics (multiple initialisations, momentum, learning rate schedules) that convex problems do not need in order to reach the global optimum.
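The dependence on initialisation can be seen directly with a few lines of plain gradient descent. The sketch below uses two hypothetical one-dimensional losses chosen for illustration: a parabola (convex) and the double-well polynomial θ⁴ - 4θ² + θ (non-convex). The learning rates and step counts are likewise arbitrary choices.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Plain gradient descent on a scalar parameter."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Convex loss (theta - 3)^2 + 1 has gradient 2*(theta - 3)
def convex_grad(theta):
    return 2 * (theta - 3)

# Non-convex loss theta^4 - 4*theta^2 + theta has gradient 4*theta^3 - 8*theta + 1
def nonconvex_grad(theta):
    return 4 * theta**3 - 8 * theta + 1

starts = [-2.0, -0.5, 0.5, 2.0]

convex_finals = [gradient_descent(convex_grad, s, lr=0.1, steps=200) for s in starts]
nonconvex_finals = [gradient_descent(nonconvex_grad, s, lr=0.01, steps=2000) for s in starts]

print('Convex end points:    ', [round(t, 3) for t in convex_finals])
print('Non-convex end points:', [round(t, 3) for t in nonconvex_finals])
```

Every start reaches the same point on the parabola, while on the double well the negative starts and the positive starts settle in different basins, only one of which is the global minimum.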

import numpy as np

# Convex function: a simple parabola has exactly one minimum
def convex_loss(theta):
    return (theta - 3) ** 2 + 1

# Non-convex function: multiple local minima
def nonconvex_loss(theta):
    return np.sin(theta) * theta**0.5 if theta > 0 else theta**2

# Verify convexity numerically via second derivative sign
def second_derivative_check(f, x, h=1e-5):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

thetas = np.linspace(0.1, 10, 50)
second_derivs = [second_derivative_check(convex_loss, t) for t in thetas]
print('All non-negative (convex)?', all(d >= 0 for d in second_derivs))

# scikit-learn's LinearRegression, LogisticRegression, and Ridge/Lasso
# all minimise convex objectives, so any solver that converges reaches
# the same optimal loss regardless of how it is initialised internally
What guarantee does convexity of a loss function provide for gradient-based optimization?
Why does neural network training rely on techniques like multiple initializations and momentum, while linear/logistic regression typically don't need them?
33. Mathematically, why does RobustScaler handle outliers better than StandardScaler?

StandardScaler transforms features using the mean and standard deviation: z = (x - μ)/σ. Both the mean and standard deviation are heavily influenced by extreme values. A single huge outlier can shift the mean substantially and dramatically inflate the standard deviation (since it involves squared deviations from the mean). This means StandardScaler can compress the majority of "normal" data points into a very narrow band of z-values while the outlier dominates the scale, distorting the relative spacing among typical values.

RobustScaler instead uses the median and the interquartile range (IQR = Q3 - Q1): z = (x - median)/IQR. The median and IQR are robust statistics — by definition, they depend only on the middle portion of the sorted data and are unaffected by how extreme the tail values are (moving the maximum value further out doesn't change the median or IQR at all, as long as it stays in the same tail). This makes RobustScaler's transformation insensitive to the presence of outliers, preserving meaningful relative differences among the bulk of the data.
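The robustness claim can be tested by pushing the outlier further out and watching which statistics move. The sketch below uses a small made-up sample (six typical values plus one outlier) and compares an outlier of 1,000 with one of 1,000,000.

```python
import numpy as np

typical = np.array([10, 12, 11, 13, 12, 11], dtype=float)  # made-up "normal" values

summary = {}
for outlier in (1_000.0, 1_000_000.0):
    data = np.append(typical, outlier)
    q1, q3 = np.percentile(data, [25, 75])
    summary[outlier] = {
        'mean': data.mean(),
        'std': data.std(),
        'median': np.median(data),
        'iqr': q3 - q1,
    }
    print(outlier, summary[outlier])

# Mean and std change by orders of magnitude between the two runs,
# while median and IQR are identical in both.
```

Since RobustScaler centres on the median and divides by the IQR, the scaled values of the six typical points are the same in both scenarios; only the outlier's own scaled value changes.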

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Data with one extreme outlier
data = np.array([10, 12, 11, 13, 12, 11, 1000]).reshape(-1, 1)  # 1000 is an outlier

ss = StandardScaler()
rs = RobustScaler()

print('StandardScaler:', ss.fit_transform(data).ravel())
# The outlier dominates: the six typical values get squashed into a narrow band around -0.41

print('RobustScaler:', rs.fit_transform(data).ravel())
# Normal values retain meaningful spread; outlier is scaled but doesn't
# distort the relative positions of the typical values

print('Mean:', data.mean(), 'Std:', data.std())     # both skewed by outlier
print('Median:', np.median(data))                   # robust to the outlier
q75, q25 = np.percentile(data, [75, 25])
print('IQR:', q75 - q25)                             # also robust
Why is the standard deviation in StandardScaler sensitive to outliers?
Why are the median and interquartile range considered 'robust statistics'?
34. What does it mean for a classifier's predicted probabilities to be 'well-calibrated', and why don't all models produce calibrated probabilities naturally?

A classifier is well-calibrated if, among all the examples it assigns a predicted probability of (say) 0.7 to belonging to the positive class, approximately 70% of them actually are positive. Mathematically, calibration requires P(y=1 | p̂(x)=p) ≈ p for all probability values p the model outputs. This is a stronger requirement than just having good ranking ability (which is what AUC measures) — a model can perfectly rank examples (always score true positives higher than true negatives) while being badly calibrated (e.g., consistently outputting 0.9 for examples that are only 60% likely to be positive).

Models trained by directly optimising a proper probabilistic loss (like logistic regression's cross-entropy) tend to be naturally well-calibrated, because the loss function itself rewards accurate probability estimates, not just correct rankings. Models like SVMs (which optimise margin, not probability) or unregularised tree ensembles can produce poorly calibrated scores even when their predictions and rankings are good, because their training objective never explicitly targets calibrated probability output.
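The gap between ranking and calibration can be illustrated without training any model. The sketch below is a simulation under stated assumptions: labels are drawn from hypothetical true probabilities, and an "overconfident" score is produced by applying a monotone power transform (an arbitrary choice) to those probabilities. AUC depends only on the ordering of scores, so it is unchanged, while the Brier score, which penalises miscalibrated probabilities, gets worse.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)

# Hypothetical true probabilities and labels sampled from them
true_p = rng.uniform(0.05, 0.95, size=20_000)
labels = rng.binomial(1, true_p)

calibrated = true_p               # perfectly calibrated by construction
overconfident = true_p ** 0.25    # monotone transform: same ranking, inflated values

auc_cal = roc_auc_score(labels, calibrated)
auc_over = roc_auc_score(labels, overconfident)
brier_cal = brier_score_loss(labels, calibrated)
brier_over = brier_score_loss(labels, overconfident)

print(f'AUC   calibrated={auc_cal:.4f}  overconfident={auc_over:.4f}')
print(f'Brier calibrated={brier_cal:.4f}  overconfident={brier_over:.4f}')
```

A calibration method such as isotonic regression works by learning a monotone map from scores back to probabilities, which is why it can repair calibration without changing the ranking.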

from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

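# Note: SVC(probability=True) already applies Platt scaling internally,
# so predict_proba is not the raw margin score; X_train, X_test, y_train
# and y_test are assumed to come from an earlier train/test split
svm = SVC(probability=True).fit(X_train, y_train)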
logreg = LogisticRegression().fit(X_train, y_train)

for name, model in [('SVM', svm), ('LogReg', logreg)]:
    probs = model.predict_proba(X_test)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    # Well-calibrated: frac_pos should closely track mean_pred
    print(f'{name}: predicted vs actual', list(zip(mean_pred, frac_pos)))

# Fix poor calibration with a calibration wrapper
calibrated_svm = CalibratedClassifierCV(svm, method='isotonic', cv=5)
calibrated_svm.fit(X_train, y_train)
calibrated_probs = calibrated_svm.predict_proba(X_test)[:, 1]
What does it mean for a classifier to be 'well-calibrated'?
Why can a model with excellent ranking ability (high AUC) still have poorly calibrated probability outputs?
35. Mathematically, why does stochastic gradient descent (SGD) scale to large datasets better than batch gradient descent?

Batch gradient descent computes the exact gradient of the loss using all n training examples before taking a single parameter update step: ∇L(θ) = (1/n)Σᵢ ∇Lᵢ(θ). This requires O(n) computation per update — for datasets with millions of examples, even one update step becomes expensive, and you typically need many updates to converge.

SGD instead estimates the gradient using a single randomly sampled example (or a small mini-batch): ∇Lᵢ(θ) for a random i. This is an unbiased estimator of the true gradient, meaning its expected value equals the true gradient, but it carries added noise/variance. The key insight is that SGD can take many more update steps in the same amount of computation (since each step is O(1) or O(batch_size) instead of O(n)). Despite the noisier individual steps, the overall trajectory still converges because the noise averages out over many iterations, typically with the help of a learning rate that decays over time. For very large datasets, this tradeoff strongly favours SGD: you converge faster in wall-clock time even though each individual step is less precise.
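The unbiasedness claim is easy to verify numerically. The sketch below assumes a squared-error loss Lᵢ(θ) = ½(xᵢᵀθ - yᵢ)² on synthetic data (both are illustrative choices), computes every per-example gradient at a fixed θ, and shows that randomly sampled single-example gradients average out to the full-batch gradient even though each one is individually noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 3

# Synthetic regression data, for illustration only
X_data = rng.normal(size=(n, d))
y_data = X_data @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
theta = np.zeros(d)   # point at which gradients are evaluated

# Gradient of L_i = 0.5 * (x_i . theta - y_i)^2 is (x_i . theta - y_i) * x_i
residuals = X_data @ theta - y_data
per_example_grads = residuals[:, None] * X_data      # shape (n, d)

# Full-batch gradient: average over all n examples
full_grad = per_example_grads.mean(axis=0)

# Average of many uniformly sampled single-example gradients
idx = rng.integers(0, n, size=50_000)
sgd_estimate = per_example_grads[idx].mean(axis=0)

# Spread of individual single-example gradients (the noise SGD has to tolerate)
single_grad_std = per_example_grads.std(axis=0)

print('full-batch gradient:    ', full_grad)
print('mean of sampled grads:  ', sgd_estimate)
print('std of a single sample: ', single_grad_std)
```

Each sampled gradient costs O(d) to compute instead of O(n·d), which is the source of SGD's per-step saving.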

from sklearn.linear_model import SGDRegressor, LinearRegression
import numpy as np
import time

# Simulating a large dataset
n_samples = 1_000_000
X = np.random.randn(n_samples, 10)
y = X @ np.random.randn(10) + np.random.randn(n_samples) * 0.1

# SGDRegressor updates the weights one sample at a time, so it scales to large n
start = time.time()
sgd = SGDRegressor(max_iter=5, tol=1e-3)
sgd.fit(X, y)
print(f'SGD time: {time.time() - start:.3f}s')

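# LinearRegression solves the least-squares problem directly; the solution
# equals (X^T X)^-1 X^T y mathematically, though scikit-learn calls a
# least-squares solver instead of forming the inverse explicitly.
# Cost grows roughly as O(n*d^2 + d^3): fine here, but costly when d is
# very large or the data do not fit in memory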
start = time.time()
lr = LinearRegression().fit(X, y)
print(f'Closed-form time: {time.time() - start:.3f}s')

# partial_fit allows incremental learning on streaming/chunked data —
# impossible with the closed-form or full-batch approach
sgd2 = SGDRegressor()
for chunk_start in range(0, n_samples, 10000):
    chunk = slice(chunk_start, chunk_start + 10000)
    sgd2.partial_fit(X[chunk], y[chunk])
Why is a single-example gradient estimate in SGD still useful despite being noisier than the full batch gradient?
What capability does SGD's partial_fit provide that closed-form OLS cannot offer?
36. Beyond scaling, why must feature selection methods also be included inside a cross-validation pipeline rather than applied beforehand?

Feature selection methods like SelectKBest choose features based on a statistical test (e.g. ANOVA F-value, mutual information) computed between each feature and the target across the available data. If you perform feature selection on the entire dataset before cross-validation, the selected features were chosen using information from what will later become both training and validation folds — even though no model has been fit yet, the choice of which features matter already encodes information about the validation fold's relationship between X and y.

This is a particularly insidious form of leakage because it doesn't involve fitting a predictive model — yet it still systematically inflates cross-validated performance estimates, since the selected feature subset is implicitly tuned to perform well on the data used to select it, including the validation folds. The correct procedure performs feature selection independently within each CV fold, using only that fold's training data, exactly mirroring how a Pipeline correctly handles scaling.
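One way to see the size of this effect is to run both procedures on pure noise, where the honest answer is chance-level accuracy. The sketch below is an illustration under assumed settings (100 samples, 2,000 random features, k = 20, all arbitrary): any accuracy well above 0.5 in this setup can only come from leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Pure noise: features carry no information about the labels
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.integers(0, 2, size=100)

# Leaky: pick the 20 "best" features using every label, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaked = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: selection is refit inside each fold on that fold's training data only
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest = cross_val_score(honest_model, X_noise, y_noise, cv=5).mean()

print(f'Leaked estimate: {leaked:.3f}')
print(f'Honest estimate: {honest:.3f}')
```

With thousands of candidate features and few samples, some noise columns correlate with the labels by chance; selecting them with the validation labels in view is what inflates the leaked score.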

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# WRONG: select features using ALL data, then cross-validate
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)  # leakage: uses all of y
wrong_scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5)
# This score is optimistically biased!

# CORRECT: feature selection inside the pipeline, refit per fold
pipeline = Pipeline([
    ('selector', SelectKBest(score_func=f_classif, k=10)),
    ('classifier', LogisticRegression()),
])
correct_scores = cross_val_score(pipeline, X, y, cv=5)
# Each fold independently selects its own top-10 features
# using only that fold's training data

print('Leaked estimate:  ', wrong_scores.mean())   # often higher
print('Honest estimate: ', correct_scores.mean())  # more realistic
Why does selecting features using the entire dataset before cross-validation cause leakage, even without fitting a predictive model first?
What is the correct way to perform feature selection within a cross-validation workflow?