// TABLE OF CONTENTS

UNIT 1 — Math Foundations & ML Basics
  Q1: Eigenvalues & Eigenvectors | Q2: PCA Steps | Q3: PMF / Probability | Q4: CNN Math Overview | Q5: Eigen Decomposition A=[[3,1],[0,2]] | Q6: Supervised vs Unsupervised + Bayesian | Q7: Eigenvalues A=[[2,1],[1,2]] & B (diagonal) | Q8: Overfitting vs Underfitting | Q9: House Price Model (Bias-Variance) | Q10: Model Complexity | Q11: Compare Overfitting vs Underfitting (5 aspects) | Q12: Batch vs SGD vs Mini-batch GD | Q13: Gradient Descent Iterations
UNIT 2 — ANNs, CNNs, RNNs, LSTMs
  Q14: CNN for MNIST | Q15: Biological Neuron + Activation Functions | Q16: LSTM Gates (Theory) | Q17: RNN Unfolding + BPTT | Q18: Perceptron Learning Algorithm | Q19: CNN Terminologies | Q20: Convolution Numeric (4×4 image)
UNIT 3 — Autoencoders & VAE
  Q21: VAE Full Explanation | Q22: Deep Autoencoder vs VAE | Q23: VAE as a Generative Model

UNIT 1 — Mathematical Foundations & Machine Learning Basics

Q1 Eigenvalues & Eigenvectors — Concept + Importance in Deep Learning

// WHAT ARE EIGENVALUES & EIGENVECTORS?

For a square matrix A, if multiplying it by a vector v only scales it (doesn't rotate it), then v is called an eigenvector and the scaling factor λ is called an eigenvalue.

A · v = λ · v
where:
  A = square matrix
  v = eigenvector (non-zero vector)
  λ = eigenvalue (a scalar number)

// HOW TO FIND EIGENVALUES?

STEP 1 — Characteristic Equation det(A − λI) = 0 ← Solve this polynomial for λ
STEP 2 — Find Eigenvectors For each λ, solve: (A − λI) · v = 0

// EXAMPLE: A = [[4, 1], [2, 3]]

det([[4-λ, 1], [2, 3-λ]]) = 0
(4-λ)(3-λ) - (1)(2) = 0
12 - 7λ + λ² - 2 = 0
λ² - 7λ + 10 = 0
(λ - 5)(λ - 2) = 0  →  λ₁ = 5, λ₂ = 2

For λ₁ = 5: (A - 5I)v = 0 → [[-1,1],[2,-2]]v = 0 → v₁ = [1, 1]
For λ₂ = 2: (A - 2I)v = 0 → [[2,1],[2,1]]v = 0 → v₂ = [1, -2]
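A quick NumPy sanity check of this worked example (a minimal sketch; note that `np.linalg.eig` returns unit-norm eigenvectors in no guaranteed order):

```python
import numpy as np

A = np.array([[4, 1], [2, 3]])
vals, vecs = np.linalg.eig(A)   # solves det(A - λI) = 0 numerically
print(vals)   # eigenvalues 5 and 2 (order not guaranteed)
print(vecs)   # columns are eigenvectors ∝ [1, 1] and [1, -2], unit-normalized
```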

// IMPORTANCE IN DEEP LEARNING

Application                    | How Eigenvalues Help
PCA (Dimensionality Reduction) | Eigenvectors = principal components; eigenvalues = importance/variance
Gradient Descent               | Eigenvalues of the Hessian tell us if a point is a min/max/saddle
Vanishing/Exploding Gradients  | Eigenvalues of weight matrix <1 → vanishing; >1 → exploding
Weight Initialization          | Keep eigenvalues ≈ 1 to stabilize training
Q2 PCA Steps for X = [[2,0], [0,2], [3,1]] — Mean Centering, Covariance, Eigenvalues

// STEP 1 — MEAN CENTERING

Data X:
  x₁ = [2, 0, 3]  (feature/column 1)
  x₂ = [0, 2, 1]  (feature/column 2)
Mean of x₁ = (2 + 0 + 3)/3 = 5/3 ≈ 1.667
Mean of x₂ = (0 + 2 + 1)/3 = 3/3 = 1.000
Centered data X̄:
  Row 1: [2 - 1.667, 0 - 1] = [ 0.333, -1]
  Row 2: [0 - 1.667, 2 - 1] = [-1.667,  1]
  Row 3: [3 - 1.667, 1 - 1] = [ 1.333,  0]

// STEP 2 — COVARIANCE MATRIX

C = (1/(n-1)) × X̄ᵀ × X̄, where n = 3
Compute sums:
  Σ(x₁)²  = 0.333² + 1.667² + 1.333² = 0.111 + 2.779 + 1.777 = 4.667
  Σ(x₂)²  = (-1)² + 1² + 0² = 1 + 1 + 0 = 2
  Σ(x₁x₂) = 0.333×(-1) + (-1.667)×1 + 1.333×0 = -0.333 - 1.667 + 0 = -2
C = (1/2) × [[4.667, -2], [-2, 2]] = [[2.333, -1], [-1, 1]]

// STEP 3 — EIGENVALUES

det(C - λI) = 0
|(2.333 - λ)    -1    |
|    -1      (1 - λ)  | = 0
(2.333 - λ)(1 - λ) - (-1)(-1) = 0
2.333 - 2.333λ - λ + λ² - 1 = 0
λ² - 3.333λ + 1.333 = 0
Using the quadratic formula:
λ = (3.333 ± √(3.333² - 4×1.333)) / 2
  = (3.333 ± √(11.11 - 5.33)) / 2
  = (3.333 ± √5.78) / 2
  = (3.333 ± 2.404) / 2
λ₁ = (3.333 + 2.404)/2 ≈ 2.869  ← larger eigenvalue (1st principal component)
λ₂ = (3.333 - 2.404)/2 ≈ 0.465  ← smaller eigenvalue (2nd principal component)
✅ Key Insight: The larger eigenvalue (2.869) corresponds to the direction of maximum variance in the data. In PCA, we keep only the top eigenvectors, reducing dimensions.
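All three PCA steps can be verified in a few lines of NumPy (a minimal sketch; `np.linalg.eigh` is used because the covariance matrix is symmetric, and it returns eigenvalues in ascending order):

```python
import numpy as np

X = np.array([[2, 0], [0, 2], [3, 1]], dtype=float)
Xc = X - X.mean(axis=0)            # Step 1: mean centering
C = Xc.T @ Xc / (len(X) - 1)       # Step 2: covariance matrix
vals, vecs = np.linalg.eigh(C)     # Step 3: eigenvalues/eigenvectors
print(C)       # [[2.333, -1.], [-1., 1.]]
print(vals)    # [0.465, 2.868]; the larger one is the 1st principal component
```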
Q3 PMF — Find a, P(X<3), Mean, Variance

Given: P(X) = a, 3a, 5a, 7a, 9a, 11a, 13a, 15a, 17a for X = 0,1,2,...,8

// (i) FIND VALUE OF a

Sum of all probabilities = 1
a + 3a + 5a + 7a + 9a + 11a + 13a + 15a + 17a = 1
81a = 1
a = 1/81 ≈ 0.01235

// (ii) P(X < 3)

P(X < 3) = P(X=0) + P(X=1) + P(X=2)
         = a + 3a + 5a = 9a = 9/81 = 1/9 ≈ 0.111

// (iii) MEAN = E(X) = Σ x·P(x)

E(X) = 0·a + 1·3a + 2·5a + 3·7a + 4·9a + 5·11a + 6·13a + 7·15a + 8·17a
     = a(0 + 3 + 10 + 21 + 36 + 55 + 78 + 105 + 136)
     = a × 444 = 444/81 = 148/27 ≈ 5.481

// (iv) VARIANCE = E(X²) − [E(X)]²

E(X²) = 0²·a + 1²·3a + 2²·5a + 3²·7a + 4²·9a + 5²·11a + 6²·13a + 7²·15a + 8²·17a
      = a(0 + 3 + 20 + 63 + 144 + 275 + 468 + 735 + 1088)
      = a × 2796 = 2796/81
Variance = 2796/81 - (444/81)²
         = 226476/6561 - 197136/6561
         = 29340/6561 ≈ 4.47
Answers: a = 1/81 | P(X<3) = 1/9 | Mean ≈ 5.48 | Variance ≈ 4.47
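The same answers drop out of a short NumPy script (a minimal sketch; the variable names are ours):

```python
import numpy as np

x = np.arange(9)              # X = 0, 1, ..., 8
coeff = 2 * x + 1             # PMF coefficients 1, 3, 5, ..., 17
a = 1 / coeff.sum()           # Σp = 1  →  a = 1/81
p = a * coeff
print(p[x < 3].sum())         # P(X < 3) ≈ 0.111 (= 1/9)
mean = (x * p).sum()          # E(X) ≈ 5.481
var = (x**2 * p).sum() - mean**2
print(mean, var)              # ≈ 5.481, ≈ 4.472
```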
Q4 CNN Math — Filters, Stride, Pooling, Dropout vs Fully Connected ANNs

// CONVOLUTION FORMULA

Output(i, j) = Σ Σ Input(i+m, j+n) × Filter(m, n)

Output feature map size:
  H_out = (H_in - F + 2P) / S + 1
  W_out = (W_in - F + 2P) / S + 1
where F = filter size, P = padding, S = stride
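The size formula is easy to slip on under exam pressure, so here is a tiny helper (a sketch; the function name `conv_out` is ours) that also reproduces the layer sizes used in Q14:

```python
def conv_out(size, f, p=0, s=1):
    """Output spatial dimension: (size - f + 2p) // s + 1."""
    return (size - f + 2 * p) // s + 1

print(conv_out(28, 3))        # 26: 3×3 conv on MNIST, no padding, stride 1
print(conv_out(26, 2, s=2))   # 13: 2×2 max pool, stride 2
```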

// KEY COMPONENTS

Component         | What it does                               | Why it helps
Filter/Kernel     | Slides over image, extracts features       | Detects edges, textures, shapes
Stride            | Step size when sliding filter              | Controls output size; larger stride = smaller output
Padding           | Adds zeros around input                    | Preserves border info, controls output size
Pooling (Max/Avg) | Reduces spatial dimensions                 | Reduces parameters, adds translation invariance
Dropout           | Randomly turns off neurons during training | Prevents overfitting
Deep Layers       | Stack many conv layers                     | Learn hierarchical features (edge→shape→object)

// CNN vs FULLY CONNECTED ANN

Aspect                 | CNN                            | ANN (Fully Connected)
Parameters             | Few (shared weights)           | Huge (28×28 image = 784 inputs × all neurons)
Spatial Awareness      | Yes — preserves 2D structure   | No — flattens image, loses position info
Translation Invariance | Yes — detects feature anywhere | No
Overfitting Risk       | Low (weight sharing)           | High (too many params)
Q5 Eigen Decomposition of A = [[3,1],[0,2]]

// STEP 1 — EIGENVALUES

det(A - λI) = 0
|(3-λ)   1  |
|  0   (2-λ)| = 0
(3-λ)(2-λ) - 0 = 0
λ² - 5λ + 6 = 0
(λ - 3)(λ - 2) = 0  →  λ₁ = 3, λ₂ = 2

// STEP 2 — EIGENVECTORS

For λ₁ = 3: (A - 3I)v = [[0, 1],[0,-1]] · v = 0 → v₂ = 0, v₁ free → eigenvector v₁ = [1, 0]
For λ₂ = 2: (A - 2I)v = [[1, 1],[0, 0]] · v = 0 → v₁ + v₂ = 0 → v₁ = -v₂ → eigenvector v₂ = [-1, 1] (or [1, -1])

// EIGEN DECOMPOSITION

A matrix A can be written as: A = P · D · P⁻¹

P = [[1, -1],      D = [[3, 0],
     [0,  1]]           [0, 2]]

Condition to diagonalize: A must have n linearly independent eigenvectors.
✅ Result: λ₁ = 3 with v₁ = [1,0] | λ₂ = 2 with v₂ = [-1, 1]
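The decomposition can be confirmed by multiplying the factors back together (a minimal NumPy sketch):

```python
import numpy as np

A = np.array([[3, 1], [0, 2]])
P = np.array([[1, -1], [0, 1]])   # eigenvectors as columns
D = np.diag([3, 2])               # eigenvalues on the diagonal
print(P @ D @ np.linalg.inv(P))   # recovers [[3, 1], [0, 2]] = A
```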
Q6 Supervised vs Unsupervised Learning + Bayesian Classification + ML Steps

// SUPERVISED vs UNSUPERVISED

Aspect         | Supervised                                  | Unsupervised
Labels         | Uses labeled data (input + output)          | No labels — finds patterns itself
Goal           | Learn a mapping f(x) → y                    | Find structure/clusters in data
Examples       | Classification, Regression                  | Clustering, PCA, Autoencoders
Generalization | Measured by test error                      | Measured by reconstruction or cluster quality
Capacity       | Model complexity must match data complexity | Same — risk of overfitting latent space

// BAYESIAN CLASSIFICATION

Based on Bayes' Theorem: update prior belief using observed evidence.

                  P(Data | Class) × P(Class)
P(Class | Data) = ──────────────────────────
                           P(Data)

In simple terms: Posterior ∝ Likelihood × Prior
Assign x to the class C that maximizes P(C|x).

// STEPS TO BUILD AN ML ALGORITHM

  1. Collect & clean data — Handle missing values, outliers
  2. Exploratory Data Analysis — Understand distributions
  3. Feature Engineering — Select/transform useful features
  4. Choose Model — Based on problem type (classification, regression)
  5. Train the Model — Fit on training data
  6. Validate — Use validation set to tune hyperparameters
  7. Test — Evaluate on unseen test data
  8. Deploy & Monitor
Q7 Eigenvalues & Eigenvectors of A=[[2,1],[1,2]] and B=[[1,0,0],[0,5,0],[0,0,9]]

// MATRIX A = [[2,1],[1,2]]

det(A - λI) = (2-λ)² - 1 = 0
λ² - 4λ + 4 - 1 = 0
λ² - 4λ + 3 = 0
(λ-1)(λ-3) = 0  →  λ₁ = 1, λ₂ = 3
For λ₁ = 1: (A-I)v = 0 → [[1,1],[1,1]]v = 0 → v₁ + v₂ = 0 → v = [1,-1]/√2
For λ₂ = 3: (A-3I)v = 0 → [[-1,1],[1,-1]]v = 0 → v₁ = v₂ → v = [1,1]/√2

// MATRIX B = [[1,0,0],[0,5,0],[0,0,9]] (Diagonal Matrix)

SHORTCUT: for diagonal matrices, eigenvalues = diagonal entries!
λ₁ = 1, λ₂ = 5, λ₃ = 9
Eigenvectors = standard basis vectors:
  v₁ = [1,0,0] for λ₁ = 1
  v₂ = [0,1,0] for λ₂ = 5
  v₃ = [0,0,1] for λ₃ = 9
💡 Exam Tip: Always check if a matrix is diagonal or triangular — eigenvalues are just the diagonal entries!
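Both results check out numerically (a sketch; `np.linalg.eigvalsh` applies since both matrices are symmetric, and returns eigenvalues in ascending order):

```python
import numpy as np

A = np.array([[2, 1], [1, 2]])
print(np.linalg.eigvalsh(A))   # [1. 3.]
B = np.diag([1, 5, 9])
print(np.linalg.eigvalsh(B))   # [1. 5. 9.] (just the diagonal entries)
```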
Q8 Overfitting vs Underfitting — Concept, Graphs, Bias-Variance Tradeoff

// DEFINITIONS

Aspect                | Overfitting                                       | Underfitting
Definition            | Model memorizes training data, fails on new data | Model too simple, can't even learn training data
Training Error        | Very Low                                          | High
Test/Validation Error | Very High                                         | High
Bias                  | Low                                               | High
Variance              | High                                              | Low
Model Complexity      | Too complex                                       | Too simple

// BIAS-VARIANCE TRADEOFF (ASCII Graph)

[Graph: x-axis = Model Complexity (simple → complex), y-axis = Error.
 Bias² decreases as complexity increases; Variance increases as complexity increases.
 Total Error = Bias² + Variance + Irreducible Noise traces a U-shape;
 the "sweet spot" is at the bottom of the U.]

// 4 TECHNIQUES TO REDUCE OVERFITTING

  1. Dropout — Randomly disable neurons during training. Forces the network to not rely on any single neuron → improves generalization.
  2. Regularization (L1/L2) — Adds a penalty term to the loss function for large weights. Keeps weights small → smoother model.
  3. Early Stopping — Stop training when validation error starts increasing, even if training error is still decreasing.
  4. Data Augmentation — Create more training samples by flipping, rotating, cropping images.
Q9 House Price Model — Identify Overfitting/Underfitting
Model   | Training Error | Validation Error | Problem
Model 1 | 4% (very low)  | 22% (very high)  | 🔴 OVERFITTING
Model 2 | 18% (high)     | 20% (similar)    | 🟡 UNDERFITTING

// JUSTIFICATION

Model 1 — Overfitting: The model learned the training data too well (memorized noise). It performs great on training (4%) but poorly on unseen data (22%). Large gap = high variance.

Model 2 — Underfitting: Both training and validation errors are high (18%, 20%), meaning the model is too simple to capture the underlying pattern. Small gap but both errors high = high bias.

// BIAS-VARIANCE IN CONTEXT

Model 1 sits on the high-variance side of the tradeoff (large train/validation gap), so the fixes are regularization, dropout, or more data. Model 2 sits on the high-bias side, so the fix is a more expressive model (more features, more layers) or less regularization.

Q10 Model Complexity vs Training & Validation Error
[Graph: x-axis = Model Complexity (low → optimal → high), y-axis = Error.
 Training Error decreases monotonically as complexity grows; Validation Error
 is U-shaped, and past the optimum the train/validation gap widens — the
 overfitting zone.]

Training Error always decreases as complexity increases (more complex model = better fit to training data).

Validation Error first decreases, then increases, forming a U-shape: its minimum marks the optimal model complexity, and beyond that point the model starts overfitting.

Q11 Compare Overfitting vs Underfitting — 5 Aspects
Aspect                           | Overfitting                                              | Underfitting
a. Definition                    | Model learns training data too well, including noise     | Model is too simple to capture the true pattern
b. Identifying                   | Low train error, high test error (big gap)               | High train error AND high test error
c. Common Causes                 | Too many parameters, too little data, no regularization  | Too few layers, too few neurons, underpowered model
d. Train/Test Error & Complexity | Low train error, high test error, complex model          | High train error, high test error, simple model
e. Fixing                        | Dropout, regularization, more data, early stopping       | Add more layers/neurons, train longer, reduce regularization
Q12 Batch GD vs Stochastic GD vs Mini-Batch GD
Aspect               | Batch GD                 | Stochastic GD (SGD) | Mini-Batch GD
Data Used Per Update | All N samples            | 1 sample at a time  | Small batch (e.g., 32, 64)
Update Frequency     | Once per epoch           | N times per epoch   | N/batch_size times
Speed                | Slow (large computation) | Fast per update     | Fast + stable
Convergence          | Smooth, stable           | Noisy/oscillating   | Balanced
Memory               | Needs all data in RAM    | Very low memory     | Moderate
GPU Efficiency       | Good                     | Poor                | Best (vectorized)
Used In Practice?    | Rarely                   | Sometimes           | ✅ Most Common
Update rule (same for all, just changes what data is used): w = w - η × ∇J(w)
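One way to see that only the data per update changes is a single parameterized step function (a hedged sketch, assuming a linear model with MSE loss; the names `gd_step`, `batch_size` are ours):

```python
import numpy as np

def gd_step(w, X, y, lr, batch_size=None):
    """One update w = w - lr * grad on a subset of the data.

    batch_size=None → Batch GD (all N samples)
    batch_size=1    → Stochastic GD
    batch_size=32   → Mini-batch GD
    """
    n = len(X)
    idx = np.random.choice(n, batch_size, replace=False) if batch_size else np.arange(n)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)   # ∇ of MSE for a linear model
    return w - lr * grad
```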
Q13 Gradient Descent on J(w) = (w-3)², w₀=6, η=0.1, 0.5, 1.2
J(w) = (w - 3)²
dJ/dw = 2(w - 3)   ← gradient
Update rule: w_new = w - η × 2(w - 3)
Minimum is at w = 3.

// η = 0.1 (Small — Slow Convergence)

w₀ = 6    Grad₀ = 2(6-3)    = 6      w₁ = 6 - 0.1×6       = 5.4
          Grad₁ = 2(5.4-3)  = 4.8    w₂ = 5.4 - 0.1×4.8   = 4.92
          Grad₂ = 2(4.92-3) = 3.84   w₃ = 4.92 - 0.1×3.84 = 4.536
→ Moving toward 3 slowly ✓

// η = 0.5 (Perfect — Converges Immediately!)

w₀ = 6    Grad₀ = 2(6-3) = 6    w₁ = 6 - 0.5×6 = 3.0  ✅ Minimum reached!
          Grad₁ = 2(3-3) = 0    w₂ = 3.0 (no change)    w₃ = 3.0

// η = 1.2 (Too Large — Diverges!)

w₀ = 6    Grad₀ = 2(6-3)    = 6       w₁ = 6 - 1.2×6         = -1.2
          Grad₁ = 2(-1.2-3) = -8.4    w₂ = -1.2 - 1.2×(-8.4) = 8.88
          Grad₂ = 2(8.88-3) = 11.76   w₃ = 8.88 - 1.2×11.76  = -5.232
→ Oscillating and DIVERGING ✗
⚠️ Conclusion:
• η too small → converges but very slowly
• η = 0.5 → converges perfectly in 1 step (lucky for this function)
• η too large (1.2) → overshoots minimum, weights diverge
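The three trajectories are easy to reproduce (a minimal sketch; `run_gd` is our name):

```python
def run_gd(lr, w=6.0, steps=4):
    """Iterate w ← w - lr · 2(w - 3) and print the trajectory."""
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
        print(round(w, 3), end="  ")
    print()

run_gd(0.1)   # 5.4  4.92  4.536 ...   → slow convergence toward 3
run_gd(0.5)   # 3.0  3.0   3.0   ...   → minimum in one step
run_gd(1.2)   # -1.2 8.88 -5.232 ...   → oscillates and diverges
```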

UNIT 2 — Neural Networks, CNNs, RNNs & LSTMs

Q14 CNN for MNIST — Architecture & Layer Functions

// CNN ARCHITECTURE FOR MNIST (28×28 grayscale, 10 classes)

Input (28×28×1)
  ↓ Conv Layer 1 (32 filters, 3×3, ReLU)  → 26×26×32
  ↓ Max Pooling (2×2)                     → 13×13×32
  ↓ Conv Layer 2 (64 filters, 3×3, ReLU)  → 11×11×64
  ↓ Max Pooling (2×2)                     → 5×5×64
  ↓ Flatten                               → 1600 neurons
  ↓ Dense (128, ReLU)
  ↓ Dropout (0.5)
  ↓ Dense (10, Softmax)                   → 10 class probabilities

// LAYER FUNCTIONALITIES

Layer                   | What It Does
Convolution             | Applies filters/kernels to detect features like edges, curves, textures. Learns spatial patterns using shared weights.
Max Pooling             | Takes the maximum value in each region. Reduces size, keeps important features, provides translation invariance.
Flatten                 | Converts the 2D feature map into a 1D vector so it can be fed into a fully connected layer.
Dense (Fully Connected) | Every neuron connects to every neuron in the next layer. Makes the final classification decision.
Softmax                 | Converts final scores to probabilities (all sum to 1). Picks the most likely class.
Q15 Biological Neuron + ANN Components + Activation Functions + Vanishing Gradient

// A. BIOLOGICAL NEURON vs ANN

BIOLOGICAL NEURON             | ANN EQUIVALENT
Dendrites (receive signals)   | Inputs (x₁, x₂, ...)
Synapse (connection strength) | Weights (w₁, w₂, ...)
Cell Body (processes signal)  | Summation: z = Σwᵢxᵢ + b
Axon (sends output)           | Activation function: output = f(z)
Threshold firing              | Activation threshold

// B. COMPONENTS OF ANN

Inputs (features) · weights (connection strengths) · bias · summation unit z = Σwᵢxᵢ + b · activation function f(z) · output. Neurons are organized into input, hidden, and output layers.

// C. SINGLE-LAYER vs MULTI-LAYER

Aspect        | Single-Layer (Perceptron)        | Multi-Layer (Deep Network)
Hidden Layers | None                             | One or more
Solves        | Only linearly separable problems | Non-linear, complex problems
Example       | AND, OR gates                    | XOR, image recognition

// D. VANISHING GRADIENT PROBLEM (in Sigmoid)

Sigmoid output is between 0 and 1. Its gradient is: σ'(x) = σ(x)(1−σ(x)) — max value is 0.25.

In backpropagation, gradients multiply through each layer:
gradient = dL/dw ∝ σ'(x₁) × σ'(x₂) × σ'(x₃) × ... ≤ 0.25 × 0.25 × 0.25 × ... → approaches ZERO
In deep networks, gradients become so tiny that early layers learn almost nothing — this is the VANISHING GRADIENT PROBLEM.

Solution: Use ReLU instead of Sigmoid in hidden layers.

// E. ACTIVATION FUNCTIONS

Function   | Formula                     | Range            | Use Case
Sigmoid    | σ(x) = 1/(1+e⁻ˣ)            | (0, 1)           | Binary classification output
Tanh       | tanh(x) = (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1)          | Hidden layers (better than sigmoid)
ReLU       | max(0, x)                   | [0, ∞)           | Hidden layers (most popular)
Leaky ReLU | max(0.01x, x)               | (-∞, ∞)          | Fixes "dying ReLU" problem
Softmax    | e^(xᵢ) / Σⱼ e^(xⱼ)          | (0,1), sums to 1 | Multi-class output layer
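All five functions fit in a few lines of NumPy (a minimal sketch; the max-subtraction in softmax is a standard numerical-stability trick):

```python
import numpy as np

def sigmoid(x):    return 1 / (1 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.maximum(0.01 * x, x)
def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(softmax(z))   # probabilities that sum to 1
```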
Q16 LSTM Architecture — Input Gate, Forget Gate, Output Gate

// WHY LSTM?

Traditional RNNs suffer from vanishing gradients — they forget long-term information. LSTM solves this using a special memory cell and gates.

// LSTM ARCHITECTURE OVERVIEW

           ┌─────────────────────────────────────────┐
 x(t)  ──▶ │ Forget Gate → Input Gate → Cell Update  │ ──▶ h(t)
 h(t-1)──▶ │             → Output Gate               │
           └─────────────────────────────────────────┘
                               │
                     c(t) (Cell State = long-term memory)

// THE THREE GATES

1. FORGET GATE — "What to forget from old memory?"
   f(t) = σ(Wf · [h(t-1), x(t)] + bf)
   Output between 0 and 1: 0 = completely forget | 1 = completely keep
2. INPUT GATE — "What new info to store?"
   i(t) = σ(Wi · [h(t-1), x(t)] + bi)       ← how much to add
   C̃(t) = tanh(Wc · [h(t-1), x(t)] + bc)    ← candidate values
3. CELL STATE UPDATE — "Update long-term memory"
   C(t) = f(t) × C(t-1) + i(t) × C̃(t)
          ↑ forget old     ↑ add new info
4. OUTPUT GATE — "What to output right now?"
   o(t) = σ(Wo · [h(t-1), x(t)] + bo)
   h(t) = o(t) × tanh(C(t))

// HOW LSTM SOLVES VANISHING GRADIENT

The Cell State C(t) flows through time with only addition (not multiplication), preserving gradients. This is called the "constant error carousel" — gradients can flow backward without vanishing.
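The four equations above translate directly into one time step of computation (a hedged sketch, assuming the four gate weight matrices are stacked row-wise into a single matrix `W`; all names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, H + input_dim); b has shape (4H,)."""
    z = W @ np.concatenate([h_prev, x]) + b   # all 4 gate pre-activations
    H = len(h_prev)
    f = sigmoid(z[0:H])             # forget gate
    i = sigmoid(z[H:2*H])           # input gate
    c_tilde = np.tanh(z[2*H:3*H])   # candidate values
    o = sigmoid(z[3*H:4*H])         # output gate
    c = f * c_prev + i * c_tilde    # additive cell-state update
    h = o * np.tanh(c)              # new hidden state
    return h, c
```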

// APPLICATIONS

Speech recognition, machine translation, text generation, time-series forecasting (stock prices, weather), and video captioning: any task where long-range context in a sequence matters.

Q17 RNN Unfolding + BPTT + Vanishing/Exploding Gradients

// RNN — HOW IT WORKS

An RNN processes sequences. At each time step, it takes current input x(t) AND the previous hidden state h(t-1).

h(t) = tanh(Wh · h(t-1) + Wx · x(t) + b)
y(t) = Wy · h(t)   ← output at each step

// UNFOLDING ACROSS TIME

x(1) ──▶ [RNN] ──▶ h(1) ──▶ [RNN] ──▶ h(2) ──▶ [RNN] ──▶ h(3)
           ↓                  ↓                  ↓
          y(1)               y(2)               y(3)
(The same weights W are reused at each time step — weight sharing!)

// BPTT — BACKPROPAGATION THROUGH TIME

Like regular backprop, but the gradient flows backward through time steps.

Total loss: L = Σ L(t)
Gradient:   ∂L/∂Wh = Σ ∂L(t)/∂Wh
At each step, the gradient gets multiplied by Wh and σ'(z):
∂h(t)/∂h(k) = Π (Wh · diag(σ'(h(i))))   ← product for i = k to t

// VANISHING & EXPLODING GRADIENTS

Aspect | Vanishing Gradient                               | Exploding Gradient
Cause  | |Wh| < 1 → product goes to 0                     | |Wh| > 1 → product grows to ∞
Effect | Early layers don't learn (long-term memory lost) | Weights blow up, NaN values
Fix    | Use LSTM/GRU                                     | Gradient Clipping
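A two-line experiment makes the cause concrete (a sketch treating the recurrent weight as a scalar `w`):

```python
# Repeatedly multiplying by the recurrent weight over 50 time steps:
for w in (0.9, 1.1):
    g = 1.0
    for _ in range(50):
        g *= w
    print(w, g)   # 0.9 → ≈0.005 (vanishes), 1.1 → ≈117 (explodes)
```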
Q18 Perceptron Learning — Student Performance Dataset (η=0.6)
Step activation: f(net) = 1 if net ≥ 0, else 0
Update rule: w_new = w_old + η × (d - ŷ) × x
             b_new = b_old + η × (d - ŷ)
η = 0.6; initial weights w1 = 0, w2 = 0, b = 0 (assumed, as not given)
Data: [4,6] → d=0 | [3,4] → d=1 | [7,6] → d=1 | [6,7] → d=1

// EPOCH 1 — TRAINING EXAMPLE BY EXAMPLE

Example 1: x=[4,6], d=0
  net = 0×4 + 0×6 + 0 = 0 → ŷ = f(0) = 1 (since 0 ≥ 0)
  Error = d - ŷ = 0 - 1 = -1
  w1 = 0 + 0.6×(-1)×4 = -2.4 | w2 = 0 + 0.6×(-1)×6 = -3.6 | b = 0 + 0.6×(-1) = -0.6
Example 2: x=[3,4], d=1
  net = -2.4×3 + (-3.6)×4 + (-0.6) = -7.2 - 14.4 - 0.6 = -22.2 → ŷ = 0
  Error = 1 - 0 = 1
  w1 = -2.4 + 0.6×1×3 = -0.6 | w2 = -3.6 + 0.6×1×4 = -1.2 | b = -0.6 + 0.6×1 = 0.0
Example 3: x=[7,6], d=1
  net = -0.6×7 + (-1.2)×6 + 0 = -4.2 - 7.2 = -11.4 → ŷ = 0
  Error = 1 - 0 = 1
  w1 = -0.6 + 0.6×1×7 = 3.6 | w2 = -1.2 + 0.6×1×6 = 2.4 | b = 0.0 + 0.6×1 = 0.6
Example 4: x=[6,7], d=1
  net = 3.6×6 + 2.4×7 + 0.6 = 21.6 + 16.8 + 0.6 = 39 → ŷ = 1
  Error = 1 - 1 = 0 (no update needed!)
  w1 = 3.6, w2 = 2.4, b = 0.6
After 1 epoch: w1=3.6, w2=2.4, b=0.6
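The whole epoch is a four-iteration loop (a minimal sketch reproducing the hand calculation above; names are ours):

```python
import numpy as np

X = np.array([[4, 6], [3, 4], [7, 6], [6, 7]], dtype=float)
d = np.array([0, 1, 1, 1])
w, b, eta = np.zeros(2), 0.0, 0.6

for x, target in zip(X, d):            # one epoch, example by example
    y = 1 if w @ x + b >= 0 else 0     # step activation
    w += eta * (target - y) * x        # perceptron update rule
    b += eta * (target - y)

print(w, b)   # [3.6 2.4] 0.6, matches the hand calculation
```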
Q19 CNN Terminologies — Why CNN? Layers, Convolution, Feature Map

// A. WHY CNNs OVER TRADITIONAL ANN FOR IMAGES?

Weight sharing gives far fewer parameters; the 2D spatial structure of the image is preserved instead of being flattened; features are detected anywhere in the image (translation invariance); and the reduced parameter count lowers the risk of overfitting (see the CNN vs ANN table in Q4).

// B. MAIN LAYERS IN CNN

  1. Convolutional Layer — Extracts features using filters
  2. Activation Layer (ReLU) — Adds non-linearity
  3. Pooling Layer — Reduces spatial size
  4. Flatten Layer — Converts to 1D
  5. Fully Connected (Dense) Layer — Classification

// C. CONVOLUTION OPERATION

A filter (small matrix) slides over the input image. At each position, element-wise multiplication happens, and the results are summed to produce one output value.

Output(i,j) = Σm Σn Input(i+m, j+n) × Filter(m, n)

// D. FEATURE MAP

The output produced after applying one filter to the entire input is called a feature map (also called an activation map). It represents where certain features (edges, curves, etc.) were detected in the image.

Q20 Convolution Numeric — 4×4 Image, 3×3 Filter, ReLU, Max Pooling

Input (4×4):

1 2 0 1
3 1 2 2
0 1 3 1
2 2 1 0

Filter (3×3):

1 0 -1
1 0 -1
1 0 -1

// a) CONVOLUTION OUTPUT

Output size = (4-3)/1 + 1 = 2 → 2×2 feature map

Position (0,0) — top-left 3×3 of input: [[1,2,0],[3,1,2],[0,1,3]] ⊙ [[1,0,-1],[1,0,-1],[1,0,-1]]
  = 1×1 + 2×0 + 0×(-1) + 3×1 + 1×0 + 2×(-1) + 0×1 + 1×0 + 3×(-1)
  = 1 + 0 + 0 + 3 + 0 - 2 + 0 + 0 - 3 = -1
Position (0,1) — rows 0-2, cols 1-3: [[2,0,1],[1,2,2],[1,3,1]] ⊙ filter
  = 2×1 + 0×0 + 1×(-1) + 1×1 + 2×0 + 2×(-1) + 1×1 + 3×0 + 1×(-1)
  = 2 + 0 - 1 + 1 + 0 - 2 + 1 + 0 - 1 = 0
Position (1,0) — rows 1-3, cols 0-2: [[3,1,2],[0,1,3],[2,2,1]] ⊙ filter
  = 3×1 + 1×0 + 2×(-1) + 0×1 + 1×0 + 3×(-1) + 2×1 + 2×0 + 1×(-1)
  = 3 + 0 - 2 + 0 + 0 - 3 + 2 + 0 - 1 = -1
Position (1,1) — rows 1-3, cols 1-3: [[1,2,2],[1,3,1],[2,1,0]] ⊙ filter
  = 1×1 + 2×0 + 2×(-1) + 1×1 + 3×0 + 1×(-1) + 2×1 + 1×0 + 0×(-1)
  = 1 + 0 - 2 + 1 + 0 - 1 + 2 + 0 + 0 = 1

Feature Map = [[-1, 0], [-1, 1]]

// b) OUTPUT SIZE

Height = (4 - 3)/1 + 1 = 2
Width  = (4 - 3)/1 + 1 = 2
→ Output size: 2 × 2

// c) ReLU ACTIVATION: max(0, x)

Input: [[-1, 0], [-1, 1]]
ReLU:  [[ 0, 0], [ 0, 1]]   (negative → 0, zero/positive → unchanged)

// d) 2×2 MAX POOLING (stride=1)

Applying a 2×2 pool with stride 1 to the 2×2 ReLU output, only one position fits:
max(0, 0, 0, 1) = 1
Pooled feature map: [[1]], size 1 × 1
Feature Map = [[-1,0],[-1,1]] | ReLU = [[0,0],[0,1]] | Max Pool = [[1]] (size 1×1)
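The whole pipeline can be checked in a few lines (a minimal NumPy sketch; the double loop implements valid convolution with stride 1):

```python
import numpy as np

img = np.array([[1, 2, 0, 1], [3, 1, 2, 2], [0, 1, 3, 1], [2, 2, 1, 0]])
k = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])

fmap = np.array([[(img[i:i+3, j:j+3] * k).sum() for j in range(2)]
                 for i in range(2)])   # valid convolution, stride 1
print(fmap)                            # [[-1  0] [-1  1]]

relu = np.maximum(0, fmap)             # [[0 0] [0 1]]
pooled = relu.max()                    # single 2×2 pooling window → 1
print(relu, pooled)
```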

// HOW CONV + ACTIVATION + POOLING WORK TOGETHER

Convolution extracts local features (here, vertical-edge responses), ReLU keeps only the positive responses and adds non-linearity, and max pooling condenses the map to its strongest activation, shrinking the representation while keeping the most informative feature.

UNIT 3 — Autoencoders & Variational Autoencoders (VAE)

Q21 Variational Autoencoder (VAE) — Full Explanation

// TRADITIONAL AUTOENCODER vs VAE

Aspect                 | Traditional Autoencoder     | VAE
Latent Space           | Fixed point (deterministic) | Probability distribution (mean μ, std σ)
Can Generate?          | No (not a generative model) | Yes (sample from latent space)
Latent Space Structure | Irregular, gaps possible    | Continuous, smooth, structured
Loss Function          | Reconstruction loss only    | Reconstruction loss + KL Divergence

// VAE ARCHITECTURE

Input x
  ↓
[ENCODER] ──▶ μ (mean), σ (std deviation)   ← learns a distribution, not a point
  ↓
Latent space: z = μ + ε·σ   (ε ~ N(0,1) — reparameterization trick)
  ↓
[DECODER] ──▶ Reconstructed x̂

// COMPONENTS

ENCODER: Maps input x to a distribution in latent space. Outputs μ (mean vector) and log(σ²) (log variance). Does NOT output a single point; it outputs the parameters of a Gaussian distribution.
LATENT SPACE: A compressed, continuous probabilistic space. Each dimension represents a meaningful feature. Similar inputs map to nearby regions. Sampling from it enables GENERATION of new data.
DECODER: Takes a sampled latent vector z and reconstructs the output x̂. Learns to reverse the encoding.

// VAE LOSS FUNCTION

Total Loss = Reconstruction Loss + KL Divergence

1. Reconstruction Loss: L_recon = ||x - x̂||²  (or Binary Cross-Entropy for images)
   → How well does the decoder recreate the input?
2. KL Divergence: L_KL = -0.5 × Σ(1 + log(σ²) - μ² - σ²)
   → Forces the learned distribution to stay close to N(0,1)
   → Keeps the latent space structured and continuous

Total: L = L_recon + β × L_KL

// REPARAMETERIZATION TRICK

To allow backpropagation through the random sampling step:

Instead of sampling  z ~ N(μ, σ²)                    [not differentiable]
we write             z = μ + ε × σ, with ε ~ N(0,1)  [differentiable!]
Now gradients can flow back through μ and σ.
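In code, the trick just moves the randomness into ε (a minimal sketch; the encoder outputs are stand-in toy values and `sample_z` is our name):

```python
import numpy as np

def sample_z(mu, log_var):
    """Reparameterization trick: z = μ + ε·σ with ε ~ N(0, 1)."""
    eps = np.random.randn(*mu.shape)          # randomness isolated in ε
    return mu + eps * np.exp(0.5 * log_var)   # σ = exp(log σ² / 2)

mu, log_var = np.zeros(2), np.zeros(2)        # toy encoder outputs
print(sample_z(mu, log_var))                  # z is differentiable w.r.t. μ, log σ²
```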

// APPLICATIONS

Image generation and editing, anomaly detection, data augmentation, and molecule design; see the application table under Q23 for details.

Q22 Deep Autoencoder (DAE) vs VAE — Critical Comparison

// DEEP AUTOENCODER ARCHITECTURE

Input → [Dense] → [Dense] → [Bottleneck z] → [Dense] → [Dense] → Reconstructed Output
         Encoder (many layers)               Decoder (many layers)

Uses multiple hidden layers for non-linear dimensionality reduction. Similar to PCA but non-linear.

// COMPARISON TABLE

Aspect                | Deep Autoencoder (DAE)                | VAE
Latent Space          | Fixed point — irregular, no structure | Probability distribution — smooth & continuous
Reconstruction Loss   | MSE or Binary Cross-Entropy only      | Reconstruction Loss + KL Divergence
Generative Capability | ❌ Cannot generate new samples         | ✅ Can generate new samples
Generalization        | May not generalize well               | Better generalization due to regularized latent space
Feature Extraction    | Good non-linear features              | Also good, plus interpretable (mean/variance)
Use Case              | Dimensionality reduction, denoising   | Generation, interpolation, anomaly detection
Interpolation         | May produce invalid outputs           | Smooth, meaningful interpolation possible
Q23 VAE as a Generative Model — Difference, Loss Function, Applications

// 1. AUTOENCODER vs VAE (Key Difference)

Aspect       | Autoencoder             | VAE
Encoding     | z = f(x) ← single point | z ~ N(μ(x), σ²(x)) ← distribution
Generative?  | No                      | Yes — sample any z from latent space
Latent Space | Unstructured            | Regularized (Gaussian prior)

// 2. VAE LOSS FUNCTION

L_VAE = L_reconstruction + L_KL

Reconstruction Loss: measures how well x̂ matches x
  → For images: Binary Cross-Entropy or MSE
KL Divergence: D_KL(q(z|x) || p(z)) = -0.5 × Σ[1 + log(σ²) - μ² - σ²]
  → Regularizes the encoder to produce distributions close to N(0,1)
  → This ensures the latent space is smooth and continuous

The balance between the two terms controls:
  High reconstruction weight → better fidelity
  High KL weight → more structured/generalizable latent space
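Both terms combine into one scalar loss (a minimal NumPy sketch using MSE for reconstruction; a real VAE would compute this on framework tensors so gradients can flow, and `vae_loss` is our name):

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction (MSE here; BCE is common for images) + β·KL."""
    recon = np.sum((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return recon + beta * kl
```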

// 3. APPLICATIONS

Domain                  | Application
Computer Vision         | Face generation, image editing (change hair color, add glasses)
Drug Discovery          | Generate new molecule structures with desired properties
Anomaly Detection       | Normal data has low reconstruction error; anomalies have high error
Data Augmentation       | Generate synthetic training samples
Representation Learning | Learn compact, meaningful features for downstream tasks