// TABLE OF CONTENTS

UNIT 1 — Math Foundations & ML Basics
  Q1: Eigenvalues & Eigenvectors | Q2: PCA Steps | Q3: PMF / Probability | Q4: CNN Math Overview | Q5: Eigen Decomposition A=[[3,1],[0,2]] | Q6: Supervised vs Unsupervised + Bayesian | Q7: Eigenvalues A=[[2,1],[1,2]] & B (diagonal) | Q8: Overfitting vs Underfitting | Q9: House Price Model (Bias-Variance) | Q10: Model Complexity | Q11: Compare Overfitting vs Underfitting (5 aspects) | Q12: Batch vs SGD vs Mini-batch GD | Q13: Gradient Descent Iterations
UNIT 2 — ANNs, CNNs, RNNs, LSTMs
  Q14: CNN for MNIST | Q15: Biological Neuron + Activation Functions | Q16: LSTM Gates (Theory) | Q17: RNN Unfolding + BPTT | Q18: Perceptron Learning Algorithm | Q19: CNN Terminologies | Q20: Convolution Numeric (4×4 image)
UNIT 3 — Autoencoders & VAE
  Q21: VAE Full Explanation | Q22: Deep Autoencoder vs VAE | Q23: VAE as a Generative Model

UNIT 1 — Mathematical Foundations & Machine Learning Basics

Q1 Eigenvalues & Eigenvectors — Concept + Importance in Deep Learning

// WHAT ARE EIGENVALUES & EIGENVECTORS?

For a square matrix A, if multiplying it by a vector v only scales it (doesn't rotate it), then v is called an eigenvector and the scaling factor λ is called an eigenvalue.

A · v = λ · v
where:
  A = square matrix
  v = eigenvector (non-zero vector)
  λ = eigenvalue (a scalar number)

// HOW TO FIND EIGENVALUES?

STEP 1 — Characteristic Equation det(A − λI) = 0 ← Solve this polynomial for λ
STEP 2 — Find Eigenvectors For each λ, solve: (A − λI) · v = 0

// EXAMPLE: A = [[4, 1], [2, 3]]

det([[4-λ, 1], [2, 3-λ]]) = 0
(4-λ)(3-λ) - (1)(2) = 0
12 - 7λ + λ² - 2 = 0
λ² - 7λ + 10 = 0
(λ - 5)(λ - 2) = 0  →  λ₁ = 5, λ₂ = 2

For λ₁ = 5: (A - 5I)v = 0 → [[-1,1],[2,-2]]v = 0 → v₁ = [1, 1]
For λ₂ = 2: (A - 2I)v = 0 → [[2,1],[2,1]]v = 0 → v₂ = [1, -2]
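A quick NumPy sanity check of this worked example (a minimal sketch; note that `np.linalg.eig` returns unit-norm eigenvectors in no guaranteed order):

```python
import numpy as np

A = np.array([[4, 1], [2, 3]])
vals, vecs = np.linalg.eig(A)   # solves det(A - λI) = 0 numerically
print(vals)   # eigenvalues 5 and 2 (order not guaranteed)
print(vecs)   # columns are eigenvectors ∝ [1, 1] and [1, -2], unit-normalized
```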

// IMPORTANCE IN DEEP LEARNING

Application                    | How Eigenvalues Help
PCA (Dimensionality Reduction) | Eigenvectors = principal components; eigenvalues = importance/variance
Gradient Descent               | Eigenvalues of the Hessian tell us if a point is a min/max/saddle
Vanishing/Exploding Gradients  | Eigenvalues of weight matrix <1 → vanishing; >1 → exploding
Weight Initialization          | Keep eigenvalues ≈ 1 to stabilize training
Q2 PCA Steps for X = [[2,0], [0,2], [3,1]] — Mean Centering, Covariance, Eigenvalues

// STEP 1 — MEAN CENTERING

Data X:
  x₁ = [2, 0, 3]  (feature/column 1)
  x₂ = [0, 2, 1]  (feature/column 2)
Mean of x₁ = (2 + 0 + 3)/3 = 5/3 ≈ 1.667
Mean of x₂ = (0 + 2 + 1)/3 = 3/3 = 1.000
Centered data X̄:
  Row 1: [2 - 1.667, 0 - 1] = [ 0.333, -1]
  Row 2: [0 - 1.667, 2 - 1] = [-1.667,  1]
  Row 3: [3 - 1.667, 1 - 1] = [ 1.333,  0]

// STEP 2 — COVARIANCE MATRIX

C = (1/(n-1)) × X̄ᵀ × X̄, where n = 3
Compute sums:
  Σ(x₁)²  = 0.333² + 1.667² + 1.333² = 0.111 + 2.779 + 1.777 = 4.667
  Σ(x₂)²  = (-1)² + 1² + 0² = 1 + 1 + 0 = 2
  Σ(x₁x₂) = 0.333×(-1) + (-1.667)×1 + 1.333×0 = -0.333 - 1.667 + 0 = -2
C = (1/2) × [[4.667, -2], [-2, 2]] = [[2.333, -1], [-1, 1]]

// STEP 3 — EIGENVALUES

det(C - λI) = 0
|(2.333 - λ)    -1    |
|    -1      (1 - λ)  | = 0
(2.333 - λ)(1 - λ) - (-1)(-1) = 0
2.333 - 2.333λ - λ + λ² - 1 = 0
λ² - 3.333λ + 1.333 = 0
Using the quadratic formula:
λ = (3.333 ± √(3.333² - 4×1.333)) / 2
  = (3.333 ± √(11.11 - 5.33)) / 2
  = (3.333 ± √5.78) / 2
  = (3.333 ± 2.404) / 2
λ₁ = (3.333 + 2.404)/2 ≈ 2.869  ← larger eigenvalue (1st principal component)
λ₂ = (3.333 - 2.404)/2 ≈ 0.465  ← smaller eigenvalue (2nd principal component)
✅ Key Insight: The larger eigenvalue (2.869) corresponds to the direction of maximum variance in the data. In PCA, we keep only the top eigenvectors, reducing dimensions.
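All three PCA steps can be verified in a few lines of NumPy (a minimal sketch; `np.linalg.eigh` is used because the covariance matrix is symmetric, and it returns eigenvalues in ascending order):

```python
import numpy as np

X = np.array([[2, 0], [0, 2], [3, 1]], dtype=float)
Xc = X - X.mean(axis=0)            # Step 1: mean centering
C = Xc.T @ Xc / (len(X) - 1)       # Step 2: covariance matrix
vals, vecs = np.linalg.eigh(C)     # Step 3: eigenvalues/eigenvectors
print(C)       # [[2.333, -1.], [-1., 1.]]
print(vals)    # [0.465, 2.868]; the larger one is the 1st principal component
```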
Q3 PMF — Find a, P(X<3), Mean, Variance

Given: P(X) = a, 3a, 5a, 7a, 9a, 11a, 13a, 15a, 17a for X = 0,1,2,...,8

// (i) FIND VALUE OF a

Sum of all probabilities = 1
a + 3a + 5a + 7a + 9a + 11a + 13a + 15a + 17a = 1
81a = 1
a = 1/81 ≈ 0.01235

// (ii) P(X < 3)

P(X < 3) = P(X=0) + P(X=1) + P(X=2)
         = a + 3a + 5a = 9a = 9/81 = 1/9 ≈ 0.111

// (iii) MEAN = E(X) = Σ x·P(x)

E(X) = 0·a + 1·3a + 2·5a + 3·7a + 4·9a + 5·11a + 6·13a + 7·15a + 8·17a
     = a(0 + 3 + 10 + 21 + 36 + 55 + 78 + 105 + 136)
     = a × 444 = 444/81 = 148/27 ≈ 5.481

// (iv) VARIANCE = E(X²) − [E(X)]²

E(X²) = 0²·a + 1²·3a + 2²·5a + 3²·7a + 4²·9a + 5²·11a + 6²·13a + 7²·15a + 8²·17a
      = a(0 + 3 + 20 + 63 + 144 + 275 + 468 + 735 + 1088)
      = a × 2796 = 2796/81
Variance = 2796/81 - (444/81)²
         = 226476/6561 - 197136/6561
         = 29340/6561 ≈ 4.47
Answers: a = 1/81 | P(X<3) = 1/9 | Mean ≈ 5.48 | Variance ≈ 4.47
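The same answers drop out of a short NumPy script (a minimal sketch; the variable names are ours):

```python
import numpy as np

x = np.arange(9)              # X = 0, 1, ..., 8
coeff = 2 * x + 1             # PMF coefficients 1, 3, 5, ..., 17
a = 1 / coeff.sum()           # Σp = 1  →  a = 1/81
p = a * coeff
print(p[x < 3].sum())         # P(X < 3) ≈ 0.111 (= 1/9)
mean = (x * p).sum()          # E(X) ≈ 5.481
var = (x**2 * p).sum() - mean**2
print(mean, var)              # ≈ 5.481, ≈ 4.472
```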
Q4 CNN Math — Filters, Stride, Pooling, Dropout vs Fully Connected ANNs

// CONVOLUTION FORMULA

Output(i, j) = Σ Σ Input(i+m, j+n) × Filter(m, n)

Output feature map size:
  H_out = (H_in - F + 2P) / S + 1
  W_out = (W_in - F + 2P) / S + 1
where F = filter size, P = padding, S = stride
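The size formula is easy to slip on under exam pressure, so here is a tiny helper (a sketch; the function name `conv_out` is ours) that also reproduces the layer sizes used in Q14:

```python
def conv_out(size, f, p=0, s=1):
    """Output spatial dimension: (size - f + 2p) // s + 1."""
    return (size - f + 2 * p) // s + 1

print(conv_out(28, 3))        # 26: 3×3 conv on MNIST, no padding, stride 1
print(conv_out(26, 2, s=2))   # 13: 2×2 max pool, stride 2
```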

// KEY COMPONENTS

Component         | What it does                               | Why it helps
Filter/Kernel     | Slides over image, extracts features       | Detects edges, textures, shapes
Stride            | Step size when sliding filter              | Controls output size; larger stride = smaller output
Padding           | Adds zeros around input                    | Preserves border info, controls output size
Pooling (Max/Avg) | Reduces spatial dimensions                 | Reduces parameters, adds translation invariance
Dropout           | Randomly turns off neurons during training | Prevents overfitting
Deep Layers       | Stack many conv layers                     | Learn hierarchical features (edge→shape→object)

// CNN vs FULLY CONNECTED ANN

Aspect                 | CNN                            | ANN (Fully Connected)
Parameters             | Few (shared weights)           | Huge (28×28 image = 784 inputs × all neurons)
Spatial Awareness      | Yes — preserves 2D structure   | No — flattens image, loses position info
Translation Invariance | Yes — detects feature anywhere | No
Overfitting Risk       | Low (weight sharing)           | High (too many params)
Q5 Eigen Decomposition of A = [[3,1],[0,2]]

// STEP 1 — EIGENVALUES

det(A - λI) = 0
|(3-λ)   1  |
|  0   (2-λ)| = 0
(3-λ)(2-λ) - 0 = 0
λ² - 5λ + 6 = 0
(λ - 3)(λ - 2) = 0  →  λ₁ = 3, λ₂ = 2

// STEP 2 — EIGENVECTORS

For λ₁ = 3: (A - 3I)v = [[0, 1],[0,-1]] · v = 0 → v₂ = 0, v₁ free → eigenvector v₁ = [1, 0]
For λ₂ = 2: (A - 2I)v = [[1, 1],[0, 0]] · v = 0 → v₁ + v₂ = 0 → v₁ = -v₂ → eigenvector v₂ = [-1, 1] (or [1, -1])

// EIGEN DECOMPOSITION

A matrix A can be written as: A = P · D · P⁻¹

P = [[1, -1],      D = [[3, 0],
     [0,  1]]           [0, 2]]

Condition to diagonalize: A must have n linearly independent eigenvectors.
✅ Result: λ₁ = 3 with v₁ = [1,0] | λ₂ = 2 with v₂ = [-1, 1]
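The decomposition can be confirmed by multiplying the factors back together (a minimal NumPy sketch):

```python
import numpy as np

A = np.array([[3, 1], [0, 2]])
P = np.array([[1, -1], [0, 1]])   # eigenvectors as columns
D = np.diag([3, 2])               # eigenvalues on the diagonal
print(P @ D @ np.linalg.inv(P))   # recovers [[3, 1], [0, 2]] = A
```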
Q6 Supervised vs Unsupervised Learning + Bayesian Classification + ML Steps

// SUPERVISED vs UNSUPERVISED

Aspect         | Supervised                                  | Unsupervised
Labels         | Uses labeled data (input + output)          | No labels — finds patterns itself
Goal           | Learn a mapping f(x) → y                    | Find structure/clusters in data
Examples       | Classification, Regression                  | Clustering, PCA, Autoencoders
Generalization | Measured by test error                      | Measured by reconstruction or cluster quality
Capacity       | Model complexity must match data complexity | Same — risk of overfitting latent space

// BAYESIAN CLASSIFICATION

Based on Bayes' Theorem: update prior belief using observed evidence.

                  P(Data | Class) × P(Class)
P(Class | Data) = ──────────────────────────
                           P(Data)

In simple terms: Posterior ∝ Likelihood × Prior
Assign x to the class C that maximizes P(C|x).

// STEPS TO BUILD AN ML ALGORITHM

  1. Collect & clean data — Handle missing values, outliers
  2. Exploratory Data Analysis — Understand distributions
  3. Feature Engineering — Select/transform useful features
  4. Choose Model — Based on problem type (classification, regression)
  5. Train the Model — Fit on training data
  6. Validate — Use validation set to tune hyperparameters
  7. Test — Evaluate on unseen test data
  8. Deploy & Monitor
Q7 Eigenvalues & Eigenvectors of A=[[2,1],[1,2]] and B=[[1,0,0],[0,5,0],[0,0,9]]

// MATRIX A = [[2,1],[1,2]]

det(A - λI) = (2-λ)² - 1 = 0
λ² - 4λ + 4 - 1 = 0
λ² - 4λ + 3 = 0
(λ-1)(λ-3) = 0  →  λ₁ = 1, λ₂ = 3
For λ₁ = 1: (A-I)v = 0 → [[1,1],[1,1]]v = 0 → v₁ + v₂ = 0 → v = [1,-1]/√2
For λ₂ = 3: (A-3I)v = 0 → [[-1,1],[1,-1]]v = 0 → v₁ = v₂ → v = [1,1]/√2

// MATRIX B = [[1,0,0],[0,5,0],[0,0,9]] (Diagonal Matrix)

SHORTCUT: for diagonal matrices, eigenvalues = diagonal entries!
λ₁ = 1, λ₂ = 5, λ₃ = 9
Eigenvectors = standard basis vectors:
  v₁ = [1,0,0] for λ₁ = 1
  v₂ = [0,1,0] for λ₂ = 5
  v₃ = [0,0,1] for λ₃ = 9
💡 Exam Tip: Always check if a matrix is diagonal or triangular — eigenvalues are just the diagonal entries!
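Both results check out numerically (a sketch; `np.linalg.eigvalsh` applies since both matrices are symmetric, and returns eigenvalues in ascending order):

```python
import numpy as np

A = np.array([[2, 1], [1, 2]])
print(np.linalg.eigvalsh(A))   # [1. 3.]
B = np.diag([1, 5, 9])
print(np.linalg.eigvalsh(B))   # [1. 5. 9.] (just the diagonal entries)
```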
Q8 Overfitting vs Underfitting — Concept, Graphs, Bias-Variance Tradeoff

// DEFINITIONS

Aspect                | Overfitting                                       | Underfitting
Definition            | Model memorizes training data, fails on new data | Model too simple, can't even learn training data
Training Error        | Very Low                                          | High
Test/Validation Error | Very High                                         | High
Bias                  | Low                                               | High
Variance              | High                                              | Low
Model Complexity      | Too complex                                       | Too simple

// BIAS-VARIANCE TRADEOFF (ASCII Graph)

[Graph: x-axis = Model Complexity (simple → complex), y-axis = Error.
 Bias² decreases as complexity increases; Variance increases as complexity increases.
 Total Error = Bias² + Variance + Irreducible Noise traces a U-shape;
 the "sweet spot" is at the bottom of the U.]

// 4 TECHNIQUES TO REDUCE OVERFITTING

  1. Dropout — Randomly disable neurons during training. Forces the network to not rely on any single neuron → improves generalization.
  2. Regularization (L1/L2) — Adds a penalty term to the loss function for large weights. Keeps weights small → smoother model.
  3. Early Stopping — Stop training when validation error starts increasing, even if training error is still decreasing.
  4. Data Augmentation — Create more training samples by flipping, rotating, cropping images.
Q9 House Price Model — Identify Overfitting/Underfitting
Model   | Training Error | Validation Error | Problem
Model 1 | 4% (very low)  | 22% (very high)  | 🔴 OVERFITTING
Model 2 | 18% (high)     | 20% (similar)    | 🟡 UNDERFITTING

// JUSTIFICATION

Model 1 — Overfitting: The model learned the training data too well (memorized noise). It performs great on training (4%) but poorly on unseen data (22%). Large gap = high variance.

Model 2 — Underfitting: Both training and validation errors are high (18%, 20%), meaning the model is too simple to capture the underlying pattern. Small gap but both errors high = high bias.

// BIAS-VARIANCE IN CONTEXT

Model 1 sits on the high-variance side of the tradeoff (large train/validation gap), so the fixes are regularization, dropout, or more data. Model 2 sits on the high-bias side, so the fix is a more expressive model (more features, more layers) or less regularization.

Q10 Model Complexity vs Training & Validation Error
[Graph: x-axis = Model Complexity (low → optimal → high), y-axis = Error.
 Training Error decreases monotonically as complexity grows; Validation Error
 is U-shaped, and past the optimum the train/validation gap widens — the
 overfitting zone.]

Training Error always decreases as complexity increases (more complex model = better fit to training data).

Validation Error first decreases, then increases, forming a U-shape: its minimum marks the optimal model complexity, and beyond that point the model starts overfitting.

Q11 Compare Overfitting vs Underfitting — 5 Aspects
Aspect                           | Overfitting                                              | Underfitting
a. Definition                    | Model learns training data too well, including noise     | Model is too simple to capture the true pattern
b. Identifying                   | Low train error, high test error (big gap)               | High train error AND high test error
c. Common Causes                 | Too many parameters, too little data, no regularization  | Too few layers, too few neurons, underpowered model
d. Train/Test Error & Complexity | Low train error, high test error, complex model          | High train error, high test error, simple model
e. Fixing                        | Dropout, regularization, more data, early stopping       | Add more layers/neurons, train longer, reduce regularization
Q12 Batch GD vs Stochastic GD vs Mini-Batch GD
Aspect               | Batch GD                 | Stochastic GD (SGD) | Mini-Batch GD
Data Used Per Update | All N samples            | 1 sample at a time  | Small batch (e.g., 32, 64)
Update Frequency     | Once per epoch           | N times per epoch   | N/batch_size times
Speed                | Slow (large computation) | Fast per update     | Fast + stable
Convergence          | Smooth, stable           | Noisy/oscillating   | Balanced
Memory               | Needs all data in RAM    | Very low memory     | Moderate
GPU Efficiency       | Good                     | Poor                | Best (vectorized)
Used In Practice?    | Rarely                   | Sometimes           | ✅ Most Common
Update rule (same for all, just changes what data is used): w = w - η × ∇J(w)
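One way to see that only the data per update changes is a single parameterized step function (a hedged sketch, assuming a linear model with MSE loss; the names `gd_step`, `batch_size` are ours):

```python
import numpy as np

def gd_step(w, X, y, lr, batch_size=None):
    """One update w = w - lr * grad on a subset of the data.

    batch_size=None → Batch GD (all N samples)
    batch_size=1    → Stochastic GD
    batch_size=32   → Mini-batch GD
    """
    n = len(X)
    idx = np.random.choice(n, batch_size, replace=False) if batch_size else np.arange(n)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)   # ∇ of MSE for a linear model
    return w - lr * grad
```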
Q13 Gradient Descent on J(w) = (w-3)², w₀=6, η=0.1, 0.5, 1.2
J(w) = (w - 3)²
dJ/dw = 2(w - 3)   ← gradient
Update rule: w_new = w - η × 2(w - 3)
Minimum is at w = 3.

// η = 0.1 (Small — Slow Convergence)

w₀ = 6    Grad₀ = 2(6-3)    = 6      w₁ = 6 - 0.1×6       = 5.4
          Grad₁ = 2(5.4-3)  = 4.8    w₂ = 5.4 - 0.1×4.8   = 4.92
          Grad₂ = 2(4.92-3) = 3.84   w₃ = 4.92 - 0.1×3.84 = 4.536
→ Moving toward 3 slowly ✓

// η = 0.5 (Perfect — Converges Immediately!)

w₀ = 6    Grad₀ = 2(6-3) = 6    w₁ = 6 - 0.5×6 = 3.0  ✅ Minimum reached!
          Grad₁ = 2(3-3) = 0    w₂ = 3.0 (no change)    w₃ = 3.0

// η = 1.2 (Too Large — Diverges!)

w₀ = 6    Grad₀ = 2(6-3)    = 6       w₁ = 6 - 1.2×6         = -1.2
          Grad₁ = 2(-1.2-3) = -8.4    w₂ = -1.2 - 1.2×(-8.4) = 8.88
          Grad₂ = 2(8.88-3) = 11.76   w₃ = 8.88 - 1.2×11.76  = -5.232
→ Oscillating and DIVERGING ✗
⚠️ Conclusion:
• η too small → converges but very slowly
• η = 0.5 → converges perfectly in 1 step (lucky for this function)
• η too large (1.2) → overshoots minimum, weights diverge
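The three trajectories are easy to reproduce (a minimal sketch; `run_gd` is our name):

```python
def run_gd(lr, w=6.0, steps=4):
    """Iterate w ← w - lr · 2(w - 3) and print the trajectory."""
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
        print(round(w, 3), end="  ")
    print()

run_gd(0.1)   # 5.4  4.92  4.536 ...   → slow convergence toward 3
run_gd(0.5)   # 3.0  3.0   3.0   ...   → minimum in one step
run_gd(1.2)   # -1.2 8.88 -5.232 ...   → oscillates and diverges
```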

UNIT 2 — Neural Networks, CNNs, RNNs & LSTMs

Q14 CNN for MNIST — Architecture & Layer Functions

// CNN ARCHITECTURE FOR MNIST (28×28 grayscale, 10 classes)

Input (28×28×1)
  ↓ Conv Layer 1 (32 filters, 3×3, ReLU)  → 26×26×32
  ↓ Max Pooling (2×2)                     → 13×13×32
  ↓ Conv Layer 2 (64 filters, 3×3, ReLU)  → 11×11×64
  ↓ Max Pooling (2×2)                     → 5×5×64
  ↓ Flatten                               → 1600 neurons
  ↓ Dense (128, ReLU)
  ↓ Dropout (0.5)
  ↓ Dense (10, Softmax)                   → 10 class probabilities

// LAYER FUNCTIONALITIES

Layer                   | What It Does
Convolution             | Applies filters/kernels to detect features like edges, curves, textures. Learns spatial patterns using shared weights.
Max Pooling             | Takes the maximum value in each region. Reduces size, keeps important features, provides translation invariance.
Flatten                 | Converts the 2D feature map into a 1D vector so it can be fed into a fully connected layer.
Dense (Fully Connected) | Every neuron connects to every neuron in the next layer. Makes the final classification decision.
Softmax                 | Converts final scores to probabilities (all sum to 1). Picks the most likely class.
Q15 Biological Neuron + ANN Components + Activation Functions + Vanishing Gradient

// A. BIOLOGICAL NEURON vs ANN

BIOLOGICAL NEURON             | ANN EQUIVALENT
Dendrites (receive signals)   | Inputs (x₁, x₂, ...)
Synapse (connection strength) | Weights (w₁, w₂, ...)
Cell Body (processes signal)  | Summation: z = Σwᵢxᵢ + b
Axon (sends output)           | Activation function: output = f(z)
Threshold firing              | Activation threshold

// B. COMPONENTS OF ANN

Inputs (features) · weights (connection strengths) · bias · summation unit z = Σwᵢxᵢ + b · activation function f(z) · output. Neurons are organized into input, hidden, and output layers.

// C. SINGLE-LAYER vs MULTI-LAYER

Aspect        | Single-Layer (Perceptron)        | Multi-Layer (Deep Network)
Hidden Layers | None                             | One or more
Solves        | Only linearly separable problems | Non-linear, complex problems
Example       | AND, OR gates                    | XOR, image recognition

// D. VANISHING GRADIENT PROBLEM (in Sigmoid)

Sigmoid output is between 0 and 1. Its gradient is: σ'(x) = σ(x)(1−σ(x)) — max value is 0.25.

In backpropagation, gradients multiply through each layer:
gradient = dL/dw ∝ σ'(x₁) × σ'(x₂) × σ'(x₃) × ... ≤ 0.25 × 0.25 × 0.25 × ... → approaches ZERO
In deep networks, gradients become so tiny that early layers learn almost nothing — this is the VANISHING GRADIENT PROBLEM.

Solution: Use ReLU instead of Sigmoid in hidden layers.

// E. ACTIVATION FUNCTIONS

Function   | Formula                     | Range            | Use Case
Sigmoid    | σ(x) = 1/(1+e⁻ˣ)            | (0, 1)           | Binary classification output
Tanh       | tanh(x) = (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1)          | Hidden layers (better than sigmoid)
ReLU       | max(0, x)                   | [0, ∞)           | Hidden layers (most popular)
Leaky ReLU | max(0.01x, x)               | (-∞, ∞)          | Fixes "dying ReLU" problem
Softmax    | e^(xᵢ) / Σⱼ e^(xⱼ)          | (0,1), sums to 1 | Multi-class output layer
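All five functions fit in a few lines of NumPy (a minimal sketch; the max-subtraction in softmax is a standard numerical-stability trick):

```python
import numpy as np

def sigmoid(x):    return 1 / (1 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.maximum(0.01 * x, x)
def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(softmax(z))   # probabilities that sum to 1
```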
Q16 LSTM Architecture — Input Gate, Forget Gate, Output Gate

// WHY LSTM?

Traditional RNNs suffer from vanishing gradients — they forget long-term information. LSTM solves this using a special memory cell and gates.

// LSTM ARCHITECTURE OVERVIEW

           ┌─────────────────────────────────────────┐
 x(t)  ──▶ │ Forget Gate → Input Gate → Cell Update  │ ──▶ h(t)
 h(t-1)──▶ │             → Output Gate               │
           └─────────────────────────────────────────┘
                               │
                     c(t) (Cell State = long-term memory)

// THE THREE GATES

1. FORGET GATE — "What to forget from old memory?"
   f(t) = σ(Wf · [h(t-1), x(t)] + bf)
   Output between 0 and 1: 0 = completely forget | 1 = completely keep
2. INPUT GATE — "What new info to store?"
   i(t) = σ(Wi · [h(t-1), x(t)] + bi)       ← how much to add
   C̃(t) = tanh(Wc · [h(t-1), x(t)] + bc)    ← candidate values
3. CELL STATE UPDATE — "Update long-term memory"
   C(t) = f(t) × C(t-1) + i(t) × C̃(t)
          ↑ forget old     ↑ add new info
4. OUTPUT GATE — "What to output right now?"
   o(t) = σ(Wo · [h(t-1), x(t)] + bo)
   h(t) = o(t) × tanh(C(t))

// HOW LSTM SOLVES VANISHING GRADIENT

The Cell State C(t) flows through time with only addition (not multiplication), preserving gradients. This is called the "constant error carousel" — gradients can flow backward without vanishing.
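The four equations above translate directly into one time step of computation (a hedged sketch, assuming the four gate weight matrices are stacked row-wise into a single matrix `W`; all names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, H + input_dim); b has shape (4H,)."""
    z = W @ np.concatenate([h_prev, x]) + b   # all 4 gate pre-activations
    H = len(h_prev)
    f = sigmoid(z[0:H])             # forget gate
    i = sigmoid(z[H:2*H])           # input gate
    c_tilde = np.tanh(z[2*H:3*H])   # candidate values
    o = sigmoid(z[3*H:4*H])         # output gate
    c = f * c_prev + i * c_tilde    # additive cell-state update
    h = o * np.tanh(c)              # new hidden state
    return h, c
```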

// APPLICATIONS

Speech recognition, machine translation, text generation, time-series forecasting (stock prices, weather), and video captioning: any task where long-range context in a sequence matters.

Q17 RNN Unfolding + BPTT + Vanishing/Exploding Gradients

// RNN — HOW IT WORKS

An RNN processes sequences. At each time step, it takes current input x(t) AND the previous hidden state h(t-1).

h(t) = tanh(Wh · h(t-1) + Wx · x(t) + b)
y(t) = Wy · h(t)   ← output at each step

// UNFOLDING ACROSS TIME

x(1) ──▶ [RNN] ──▶ h(1) ──▶ [RNN] ──▶ h(2) ──▶ [RNN] ──▶ h(3)
           ↓                  ↓                  ↓
          y(1)               y(2)               y(3)
(The same weights W are reused at each time step — weight sharing!)

// BPTT — BACKPROPAGATION THROUGH TIME

Like regular backprop, but the gradient flows backward through time steps.

Total loss: L = Σ L(t)
Gradient:   ∂L/∂Wh = Σ ∂L(t)/∂Wh
At each step, the gradient gets multiplied by Wh and σ'(z):
∂h(t)/∂h(k) = Π (Wh · diag(σ'(h(i))))   ← product for i = k to t

// VANISHING & EXPLODING GRADIENTS

Aspect | Vanishing Gradient                               | Exploding Gradient
Cause  | |Wh| < 1 → product goes to 0                     | |Wh| > 1 → product grows to ∞
Effect | Early layers don't learn (long-term memory lost) | Weights blow up, NaN values
Fix    | Use LSTM/GRU                                     | Gradient Clipping
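A two-line experiment makes the cause concrete (a sketch treating the recurrent weight as a scalar `w`):

```python
# Repeatedly multiplying by the recurrent weight over 50 time steps:
for w in (0.9, 1.1):
    g = 1.0
    for _ in range(50):
        g *= w
    print(w, g)   # 0.9 → ≈0.005 (vanishes), 1.1 → ≈117 (explodes)
```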
Q18 Perceptron Learning — Student Performance Dataset (η=0.6)
Step activation: f(net) = 1 if net ≥ 0, else 0
Update rule: w_new = w_old + η × (d - ŷ) × x
             b_new = b_old + η × (d - ŷ)
η = 0.6; initial weights w1 = 0, w2 = 0, b = 0 (assumed, as not given)
Data: [4,6] → d=0 | [3,4] → d=1 | [7,6] → d=1 | [6,7] → d=1

// EPOCH 1 — TRAINING EXAMPLE BY EXAMPLE

Example 1: x=[4,6], d=0
  net = 0×4 + 0×6 + 0 = 0 → ŷ = f(0) = 1 (since 0 ≥ 0)
  Error = d - ŷ = 0 - 1 = -1
  w1 = 0 + 0.6×(-1)×4 = -2.4 | w2 = 0 + 0.6×(-1)×6 = -3.6 | b = 0 + 0.6×(-1) = -0.6
Example 2: x=[3,4], d=1
  net = -2.4×3 + (-3.6)×4 + (-0.6) = -7.2 - 14.4 - 0.6 = -22.2 → ŷ = 0
  Error = 1 - 0 = 1
  w1 = -2.4 + 0.6×1×3 = -0.6 | w2 = -3.6 + 0.6×1×4 = -1.2 | b = -0.6 + 0.6×1 = 0.0
Example 3: x=[7,6], d=1
  net = -0.6×7 + (-1.2)×6 + 0 = -4.2 - 7.2 = -11.4 → ŷ = 0
  Error = 1 - 0 = 1
  w1 = -0.6 + 0.6×1×7 = 3.6 | w2 = -1.2 + 0.6×1×6 = 2.4 | b = 0.0 + 0.6×1 = 0.6
Example 4: x=[6,7], d=1
  net = 3.6×6 + 2.4×7 + 0.6 = 21.6 + 16.8 + 0.6 = 39 → ŷ = 1
  Error = 1 - 1 = 0 (no update needed!)
  w1 = 3.6, w2 = 2.4, b = 0.6
After 1 epoch: w1=3.6, w2=2.4, b=0.6
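The whole epoch is a four-iteration loop (a minimal sketch reproducing the hand calculation above; names are ours):

```python
import numpy as np

X = np.array([[4, 6], [3, 4], [7, 6], [6, 7]], dtype=float)
d = np.array([0, 1, 1, 1])
w, b, eta = np.zeros(2), 0.0, 0.6

for x, target in zip(X, d):            # one epoch, example by example
    y = 1 if w @ x + b >= 0 else 0     # step activation
    w += eta * (target - y) * x        # perceptron update rule
    b += eta * (target - y)

print(w, b)   # [3.6 2.4] 0.6, matches the hand calculation
```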
Q19 CNN Terminologies — Why CNN? Layers, Convolution, Feature Map

// A. WHY CNNs OVER TRADITIONAL ANN FOR IMAGES?

Weight sharing gives far fewer parameters; the 2D spatial structure of the image is preserved instead of being flattened; features are detected anywhere in the image (translation invariance); and the reduced parameter count lowers the risk of overfitting (see the CNN vs ANN table in Q4).

// B. MAIN LAYERS IN CNN

  1. Convolutional Layer — Extracts features using filters
  2. Activation Layer (ReLU) — Adds non-linearity
  3. Pooling Layer — Reduces spatial size
  4. Flatten Layer — Converts to 1D
  5. Fully Connected (Dense) Layer — Classification

// C. CONVOLUTION OPERATION

A filter (small matrix) slides over the input image. At each position, element-wise multiplication happens, and the results are summed to produce one output value.

Output(i,j) = Σm Σn Input(i+m, j+n) × Filter(m, n)

// D. FEATURE MAP

The output produced after applying one filter to the entire input is called a feature map (also called an activation map). It represents where certain features (edges, curves, etc.) were detected in the image.

Q20 Convolution Numeric — 4×4 Image, 3×3 Filter, ReLU, Max Pooling

Input (4×4):

1 2 0 1
3 1 2 2
0 1 3 1
2 2 1 0

Filter (3×3):

1 0 -1
1 0 -1
1 0 -1

// a) CONVOLUTION OUTPUT

Output size = (4-3)/1 + 1 = 2 → 2×2 feature map

Position (0,0) — top-left 3×3 of input: [[1,2,0],[3,1,2],[0,1,3]] ⊙ [[1,0,-1],[1,0,-1],[1,0,-1]]
  = 1×1 + 2×0 + 0×(-1) + 3×1 + 1×0 + 2×(-1) + 0×1 + 1×0 + 3×(-1)
  = 1 + 0 + 0 + 3 + 0 - 2 + 0 + 0 - 3 = -1
Position (0,1) — rows 0-2, cols 1-3: [[2,0,1],[1,2,2],[1,3,1]] ⊙ filter
  = 2×1 + 0×0 + 1×(-1) + 1×1 + 2×0 + 2×(-1) + 1×1 + 3×0 + 1×(-1)
  = 2 + 0 - 1 + 1 + 0 - 2 + 1 + 0 - 1 = 0
Position (1,0) — rows 1-3, cols 0-2: [[3,1,2],[0,1,3],[2,2,1]] ⊙ filter
  = 3×1 + 1×0 + 2×(-1) + 0×1 + 1×0 + 3×(-1) + 2×1 + 2×0 + 1×(-1)
  = 3 + 0 - 2 + 0 + 0 - 3 + 2 + 0 - 1 = -1
Position (1,1) — rows 1-3, cols 1-3: [[1,2,2],[1,3,1],[2,1,0]] ⊙ filter
  = 1×1 + 2×0 + 2×(-1) + 1×1 + 3×0 + 1×(-1) + 2×1 + 1×0 + 0×(-1)
  = 1 + 0 - 2 + 1 + 0 - 1 + 2 + 0 + 0 = 1

Feature Map = [[-1, 0], [-1, 1]]

// b) OUTPUT SIZE

Height = (4 - 3)/1 + 1 = 2
Width  = (4 - 3)/1 + 1 = 2
→ Output size: 2 × 2

// c) ReLU ACTIVATION: max(0, x)

Input: [[-1, 0], [-1, 1]]
ReLU:  [[ 0, 0], [ 0, 1]]   (negative → 0, zero/positive → unchanged)

// d) 2×2 MAX POOLING (stride=1)

Applying a 2×2 pool with stride 1 to the 2×2 ReLU output, only one position fits:
max(0, 0, 0, 1) = 1
Pooled feature map: [[1]], size 1 × 1
Feature Map = [[-1,0],[-1,1]] | ReLU = [[0,0],[0,1]] | Max Pool = [[1]] (size 1×1)
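The whole pipeline can be checked in a few lines (a minimal NumPy sketch; the double loop implements valid convolution with stride 1):

```python
import numpy as np

img = np.array([[1, 2, 0, 1], [3, 1, 2, 2], [0, 1, 3, 1], [2, 2, 1, 0]])
k = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])

fmap = np.array([[(img[i:i+3, j:j+3] * k).sum() for j in range(2)]
                 for i in range(2)])   # valid convolution, stride 1
print(fmap)                            # [[-1  0] [-1  1]]

relu = np.maximum(0, fmap)             # [[0 0] [0 1]]
pooled = relu.max()                    # single 2×2 pooling window → 1
print(relu, pooled)
```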

// HOW CONV + ACTIVATION + POOLING WORK TOGETHER

Convolution extracts local features (here, vertical-edge responses), ReLU keeps only the positive responses and adds non-linearity, and max pooling condenses the map to its strongest activation, shrinking the representation while keeping the most informative feature.

UNIT 3 — Autoencoders & Variational Autoencoders (VAE)

Q21 Variational Autoencoder (VAE) — Full Explanation

// TRADITIONAL AUTOENCODER vs VAE

Aspect                 | Traditional Autoencoder     | VAE
Latent Space           | Fixed point (deterministic) | Probability distribution (mean μ, std σ)
Can Generate?          | No (not a generative model) | Yes (sample from latent space)
Latent Space Structure | Irregular, gaps possible    | Continuous, smooth, structured
Loss Function          | Reconstruction loss only    | Reconstruction loss + KL Divergence

// VAE ARCHITECTURE

Input x
  ↓
[ENCODER] ──▶ μ (mean), σ (std deviation)   ← learns a distribution, not a point
  ↓
Latent space: z = μ + ε·σ   (ε ~ N(0,1) — reparameterization trick)
  ↓
[DECODER] ──▶ Reconstructed x̂

// COMPONENTS

ENCODER: Maps input x to a distribution in latent space. Outputs μ (mean vector) and log(σ²) (log variance). Does NOT output a single point; it outputs the parameters of a Gaussian distribution.
LATENT SPACE: A compressed, continuous probabilistic space. Each dimension represents a meaningful feature. Similar inputs map to nearby regions. Sampling from it enables GENERATION of new data.
DECODER: Takes a sampled latent vector z and reconstructs the output x̂. Learns to reverse the encoding.

// VAE LOSS FUNCTION

Total Loss = Reconstruction Loss + KL Divergence

1. Reconstruction Loss: L_recon = ||x - x̂||²  (or Binary Cross-Entropy for images)
   → How well does the decoder recreate the input?
2. KL Divergence: L_KL = -0.5 × Σ(1 + log(σ²) - μ² - σ²)
   → Forces the learned distribution to stay close to N(0,1)
   → Keeps the latent space structured and continuous

Total: L = L_recon + β × L_KL

// REPARAMETERIZATION TRICK

To allow backpropagation through the random sampling step:

Instead of sampling  z ~ N(μ, σ²)                    [not differentiable]
we write             z = μ + ε × σ, with ε ~ N(0,1)  [differentiable!]
Now gradients can flow back through μ and σ.
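In code, the trick just moves the randomness into ε (a minimal sketch; the encoder outputs are stand-in toy values and `sample_z` is our name):

```python
import numpy as np

def sample_z(mu, log_var):
    """Reparameterization trick: z = μ + ε·σ with ε ~ N(0, 1)."""
    eps = np.random.randn(*mu.shape)          # randomness isolated in ε
    return mu + eps * np.exp(0.5 * log_var)   # σ = exp(log σ² / 2)

mu, log_var = np.zeros(2), np.zeros(2)        # toy encoder outputs
print(sample_z(mu, log_var))                  # z is differentiable w.r.t. μ, log σ²
```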

// APPLICATIONS

Image generation and editing, anomaly detection, data augmentation, and molecule design; see the application table under Q23 for details.

Q22 Deep Autoencoder (DAE) vs VAE — Critical Comparison

// DEEP AUTOENCODER ARCHITECTURE

Input → [Dense] → [Dense] → [Bottleneck z] → [Dense] → [Dense] → Reconstructed Output
         Encoder (many layers)               Decoder (many layers)

Uses multiple hidden layers for non-linear dimensionality reduction. Similar to PCA but non-linear.

// COMPARISON TABLE

Aspect                | Deep Autoencoder (DAE)                | VAE
Latent Space          | Fixed point — irregular, no structure | Probability distribution — smooth & continuous
Reconstruction Loss   | MSE or Binary Cross-Entropy only      | Reconstruction Loss + KL Divergence
Generative Capability | ❌ Cannot generate new samples         | ✅ Can generate new samples
Generalization        | May not generalize well               | Better generalization due to regularized latent space
Feature Extraction    | Good non-linear features              | Also good, plus interpretable (mean/variance)
Use Case              | Dimensionality reduction, denoising   | Generation, interpolation, anomaly detection
Interpolation         | May produce invalid outputs           | Smooth, meaningful interpolation possible
Q23 VAE as a Generative Model — Difference, Loss Function, Applications

// 1. AUTOENCODER vs VAE (Key Difference)

Aspect       | Autoencoder             | VAE
Encoding     | z = f(x) ← single point | z ~ N(μ(x), σ²(x)) ← distribution
Generative?  | No                      | Yes — sample any z from latent space
Latent Space | Unstructured            | Regularized (Gaussian prior)

// 2. VAE LOSS FUNCTION

L_VAE = L_reconstruction + L_KL

Reconstruction Loss: measures how well x̂ matches x
  → For images: Binary Cross-Entropy or MSE
KL Divergence: D_KL(q(z|x) || p(z)) = -0.5 × Σ[1 + log(σ²) - μ² - σ²]
  → Regularizes the encoder to produce distributions close to N(0,1)
  → This ensures the latent space is smooth and continuous

The balance between the two terms controls:
  High reconstruction weight → better fidelity
  High KL weight → more structured/generalizable latent space
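Both terms combine into one scalar loss (a minimal NumPy sketch using MSE for reconstruction; a real VAE would compute this on framework tensors so gradients can flow, and `vae_loss` is our name):

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction (MSE here; BCE is common for images) + β·KL."""
    recon = np.sum((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return recon + beta * kl
```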

// 3. APPLICATIONS

Domain                  | Application
Computer Vision         | Face generation, image editing (change hair color, add glasses)
Drug Discovery          | Generate new molecule structures with desired properties
Anomaly Detection       | Normal data has low reconstruction error; anomalies have high error
Data Augmentation       | Generate synthetic training samples
Representation Learning | Learn compact, meaningful features for downstream tasks