For a square matrix A, a nonzero vector v satisfying Av = λv is called an eigenvector: multiplying by A only scales v (it doesn't rotate it), and the scaling factor λ is called an eigenvalue.
| Application | How Eigenvalues Help |
|---|---|
| PCA (Dimensionality Reduction) | Eigenvectors = principal components; Eigenvalues = importance/variance |
| Gradient Descent | Eigenvalues of Hessian tell us if a point is min/max/saddle |
| Vanishing/Exploding Gradients | Eigenvalues of weight matrix <1 → vanishing; >1 → exploding |
| Weight Initialization | Keep eigenvalues ≈ 1 to stabilize training |
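A quick NumPy check of the definition Av = λv (the matrix here is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # example symmetric matrix

eigvals, eigvecs = np.linalg.eig(A)  # columns of eigvecs are eigenvectors

v = eigvecs[:, 0]    # first eigenvector
lam = eigvals[0]     # matching eigenvalue
print(np.allclose(A @ v, lam * v))  # True: A only scales v, no rotation
```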
Given: P(X) = a, 3a, 5a, 7a, 9a, 11a, 13a, 15a, 17a for X = 0, 1, 2, ..., 8. Since all probabilities must sum to 1: a(1 + 3 + 5 + ... + 17) = 81a = 1, so a = 1/81.
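A one-line sanity check of the arithmetic in Python:

```python
a = 1 / 81
probs = [(2 * x + 1) * a for x in range(9)]  # a, 3a, 5a, ..., 17a
print(sum(probs))  # 1.0 (up to floating-point rounding)
```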
| Component | What it does | Why it helps |
|---|---|---|
| Filter/Kernel | Slides over image, extracts features | Detects edges, textures, shapes |
| Stride | Step size when sliding filter | Controls output size; larger stride = smaller output |
| Padding | Adds zeros around input | Preserves border info, controls output size |
| Pooling (Max/Avg) | Reduces spatial dimensions | Reduces parameters, adds translation invariance |
| Dropout | Randomly turns off neurons during training | Prevents overfitting |
| Deep Layers | Stack many conv layers | Learn hierarchical features (edge→shape→object) |
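Kernel size, stride, and padding combine in the standard output-size formula, output = (input + 2·padding − kernel) / stride + 1. A small sketch, with sizes chosen purely for illustration:

```python
def conv_output_size(n, k, p=0, s=1):
    """Spatial output size of a convolution: (n + 2p - k) // s + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(28, k=3, p=0, s=1))  # 26: no padding shrinks the map
print(conv_output_size(28, k=3, p=1, s=1))  # 28: "same" padding preserves size
print(conv_output_size(28, k=3, p=1, s=2))  # 14: stride 2 halves the output
```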
| Aspect | CNN | ANN (Fully Connected) |
|---|---|---|
| Parameters | Few (shared weights) | Huge (28×28 image = 784 inputs × all neurons) |
| Spatial Awareness | Yes — preserves 2D structure | No — flattens image, loses position info |
| Translation Invariance | Yes — detects feature anywhere | No |
| Overfitting Risk | Low (weight sharing) | High (too many params) |
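To make the parameter gap concrete, here is a rough count for a 28×28 grayscale input; the layer sizes (128 neurons, 32 filters) are illustrative assumptions, not from the original:

```python
# Dense: flatten 28x28 = 784 inputs, fully connect to 128 neurons
dense_params = 784 * 128 + 128        # weights + biases = 100,480

# Conv: 32 filters of size 3x3 over 1 input channel (weights are shared
# across all spatial positions, so the image size doesn't matter)
conv_params = 32 * (3 * 3 * 1) + 32   # weights + biases = 320

print(dense_params, conv_params)      # 100480 vs 320
```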
A diagonalizable matrix A can be written as A = P · D · P⁻¹, where the columns of P are the eigenvectors of A and D is a diagonal matrix holding the corresponding eigenvalues.
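A NumPy check of the decomposition, with P built from the eigenvectors and D from the eigenvalues (the matrix is an arbitrary diagonalizable example):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, P = np.linalg.eig(A)   # P's columns are eigenvectors of A
D = np.diag(eigvals)            # eigenvalues on the diagonal

print(np.allclose(P @ D @ np.linalg.inv(P), A))  # True: A = P·D·P⁻¹
```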
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Labels | Uses labeled data (input + output) | No labels — finds patterns itself |
| Goal | Learn a mapping f(x) → y | Find structure/clusters in data |
| Examples | Classification, Regression | Clustering, PCA, Autoencoders |
| Generalization | Measured by test error | Measured by reconstruction or cluster quality |
| Capacity | Model complexity must match data complexity | Same — risk of overfitting latent space |
Based on Bayes' Theorem: update a prior belief using observed evidence, P(H|E) = P(E|H) · P(H) / P(E).
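A tiny numeric illustration with made-up numbers: a positive result from an imperfect medical test updates a 1% prior.

```python
p_disease = 0.01              # prior P(H)
p_pos_given_disease = 0.95    # likelihood P(E|H)
p_pos_given_healthy = 0.05    # false-positive rate

# Evidence P(E): total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(H|E): updated belief after seeing a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161: the 1% prior rises to ~16%
```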
| Aspect | Overfitting | Underfitting |
|---|---|---|
| Definition | Model memorizes training data, fails on new data | Model too simple, can't even learn training data |
| Training Error | Very Low | High |
| Test/Validation Error | Very High | High |
| Bias | Low | High |
| Variance | High | Low |
| Model Complexity | Too complex | Too simple |
| Model | Training Error | Validation Error | Problem |
|---|---|---|---|
| Model 1 | 4% (very low) | 22% (very high) | 🔴 OVERFITTING |
| Model 2 | 18% (high) | 20% (similar) | 🟡 UNDERFITTING |
Model 1 — Overfitting: The model learned the training data too well (memorized noise). It performs great on training (4%) but poorly on unseen data (22%). Large gap = high variance.
Model 2 — Underfitting: Both training and validation errors are high (18%, 20%), meaning the model is too simple to capture the underlying pattern. Small gap but both errors high = high bias.
Training Error always decreases as complexity increases (more complex model = better fit to training data).
Validation Error first decreases, then increases, forming a U-shape: the bottom of the U marks the best model complexity (underfitting to its left, overfitting to its right).
| Aspect | Overfitting | Underfitting |
|---|---|---|
| a. Definition | Model learns training data too well, including noise | Model is too simple to capture the true pattern |
| b. Identifying | Low train error, high test error (big gap) | High train error AND high test error |
| c. Common Causes | Too many parameters, too little data, no regularization | Too few layers, too few neurons, underpowered model |
| d. Train/Test Error & Complexity | Low train error, high test error, complex model | High train error, high test error, simple model |
| e. Fixing | Dropout, regularization, more data, early stopping | Add more layers/neurons, train longer, reduce regularization |
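A minimal PyTorch sketch of two fixes from row (e), dropout and L2 regularization via weight decay; the layer sizes and hyperparameters are illustrative assumptions:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights to the loss
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```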
| Aspect | Batch GD | Stochastic GD (SGD) | Mini-Batch GD |
|---|---|---|---|
| Data Used Per Update | All N samples | 1 sample at a time | Small batch (e.g., 32, 64) |
| Update Frequency | Once per epoch | N times per epoch | N/batch_size times per epoch |
| Speed | Slow (large computation) | Fast per update | Fast + stable |
| Convergence | Smooth, stable | Noisy/oscillating | Balanced |
| Memory | Needs all data in RAM | Very low memory | Moderate |
| GPU Efficiency | Good | Poor | Best (vectorized) |
| Used In Practice? | Rarely | Sometimes | ✅ Most Common |
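A minimal NumPy sketch of one epoch of mini-batch gradient descent on linear regression; the data, learning rate, and batch size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))           # N = 1000 samples, 5 features
y = X @ np.array([1., 2., 3., 4., 5.])   # targets from a known linear model

w, lr, batch_size = np.zeros(5), 0.1, 32

idx = rng.permutation(len(X))            # shuffle once per epoch
for start in range(0, len(X), batch_size):
    b = idx[start:start + batch_size]    # one mini-batch of indices
    grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the batch
    w -= lr * grad                       # one update per mini-batch

print(np.round(w, 2))                    # approaches [1, 2, 3, 4, 5]
```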
| Layer | What It Does |
|---|---|
| Convolution | Applies filters/kernels to detect features like edges, curves, textures. Learns spatial patterns using shared weights. |
| Max Pooling | Takes the maximum value in each region. Reduces size, keeps important features, provides translation invariance. |
| Flatten | Converts the 2D feature map into a 1D vector so it can be fed into a fully connected layer. |
| Dense (Fully Connected) | Every neuron connects to every neuron in the next layer. Makes the final classification decision. |
| Softmax | Converts final scores to probabilities (all sum to 1); the class with the highest probability is the prediction. |
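The five layers map directly onto a minimal PyTorch model; the channel counts assume a 28×28 grayscale input and are illustrative:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # Convolution: detect local features
    nn.ReLU(),
    nn.MaxPool2d(2),                             # Max Pooling: 28x28 -> 14x14
    nn.Flatten(),                                # Flatten: 16x14x14 -> 3136 vector
    nn.Linear(16 * 14 * 14, 10),                 # Dense: class scores
    nn.Softmax(dim=1),                           # Softmax: scores -> probabilities
)
```

(In practice the final softmax is usually omitted from the model and folded into nn.CrossEntropyLoss during training.)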
| Aspect | Single-Layer (Perceptron) | Multi-Layer (Deep Network) |
|---|---|---|
| Hidden Layers | None | One or more |
| Solves | Only linearly separable problems | Non-linear, complex problems |
| Example | AND, OR gates | XOR, image recognition |
Sigmoid output is between 0 and 1. Its gradient is σ'(x) = σ(x)(1−σ(x)), whose maximum value is 0.25 (at x = 0). During backpropagation every sigmoid layer multiplies the gradient by a factor of at most 0.25, so in a deep stack the gradient shrinks exponentially and vanishes.
✅ Solution: Use ReLU instead of Sigmoid in hidden layers.
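A quick NumPy check of both claims above: the 0.25 cap, and how small a product of such factors gets after, say, 20 layers:

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))
x = np.linspace(-10, 10, 10001)
grad = sigmoid(x) * (1 - sigmoid(x))
print(grad.max())   # ~0.25, attained at x = 0

print(0.25 ** 20)   # ~9.1e-13: best-case gradient factor after 20 sigmoid layers
```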
| Function | Formula | Range | Use Case |
|---|---|---|---|
| Sigmoid | σ(x) = 1/(1+e⁻ˣ) | (0, 1) | Binary classification output |
| Tanh | tanh(x) = (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Hidden layers (better than sigmoid) |
| ReLU | max(0, x) | [0, ∞) | Hidden layers (most popular) |
| Leaky ReLU | max(0.01x, x) | (-∞, ∞) | Fixes "dying ReLU" problem |
| Softmax | exp(xᵢ) / Σⱼ exp(xⱼ) | (0, 1), sums to 1 | Multi-class output layer |
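The same five functions transcribed into NumPy:

```python
import numpy as np

def sigmoid(x):    return 1 / (1 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.maximum(0.01 * x, x)
def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```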
Traditional RNNs suffer from vanishing gradients — they forget long-term information. LSTM solves this using a special memory cell and gates.
The Cell State C(t) flows through time with only addition (not multiplication), preserving gradients. This is called the "constant error carousel" — gradients can flow backward without vanishing.
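The additive update is visible in the cell-state equation itself. A single-step NumPy sketch with made-up gate values (in a real LSTM they come from sigmoid/tanh layers over x(t) and h(t-1)):

```python
import numpy as np

c_prev = np.array([0.8, -0.3])   # previous cell state C(t-1)
f = np.array([0.9, 0.9])         # forget gate output (keep most of the past)
i = np.array([0.1, 0.2])         # input gate output
g = np.array([0.5, -0.5])        # candidate values

c_t = f * c_prev + i * g         # addition carries C(t-1) forward almost intact
print(c_t)                       # ~[0.77, -0.37]
```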
An RNN processes sequences. At each time step, it takes the current input x(t) AND the previous hidden state h(t-1), typically combining them as h(t) = tanh(Wx·x(t) + Wh·h(t-1) + b).
Backpropagation Through Time (BPTT): like regular backprop, but the gradient flows backward through the time steps, so the shared recurrent weights accumulate gradient contributions from every step.
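A minimal NumPy sketch of the forward pass that BPTT unrolls; the dimensions and random weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden
Wh = rng.normal(scale=0.1, size=(4, 4))   # hidden -> hidden (shared across steps)
b  = np.zeros(4)

h = np.zeros(4)                            # h(0)
for x_t in rng.normal(size=(5, 3)):        # sequence of 5 inputs
    h = np.tanh(Wx @ x_t + Wh @ h + b)     # h(t) depends on x(t) and h(t-1)

print(h)  # final hidden state summarizes the whole sequence
```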
| Aspect | Vanishing Gradient | Exploding Gradient |
|---|---|---|
| Cause | ‖Wh‖ < 1 → product goes to 0 | ‖Wh‖ > 1 → product grows to ∞ |
| Effect | Early layers don't learn (long-term memory lost) | Weights blow up, NaN values |
| Fix | Use LSTM/GRU | Gradient Clipping |
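In PyTorch, gradient clipping is a single call between backward() and step(); the tiny model and data here are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)

loss.backward()                        # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if too big
optimizer.step()                       # apply the (clipped) update
```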
A filter (small matrix) slides over the input image. At each position, element-wise multiplication happens, and the results are summed to produce one output value.
The output produced after applying one filter to the entire input is called a feature map (also called an activation map). It represents where certain features (edges, curves, etc.) were detected in the image.
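The sliding-window mechanics in plain NumPy (technically cross-correlation, which is what CNN layers compute); the input and filter are toy examples:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide kernel over image; multiply element-wise and sum at each position."""
    h, w = kernel.shape
    out_h = image.shape[0] - h + 1
    out_w = image.shape[1] - w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_filter = np.array([[1., -1.]])   # simple horizontal edge detector
print(conv2d(image, edge_filter))     # 4x3 feature map; all -1 for this ramp image
```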
| Aspect | Traditional Autoencoder | VAE |
|---|---|---|
| Latent Space | Fixed point (deterministic) | Probability distribution (mean μ, std σ) |
| Can Generate? | No (not a generative model) | Yes (sample from latent space) |
| Latent Space Structure | Irregular, gaps possible | Continuous, smooth, structured |
| Loss Function | Reconstruction loss only | Reconstruction loss + KL Divergence |
To allow backpropagation through the random sampling step, the reparameterization trick rewrites the sample as z = μ + σ ⊙ ε with ε ~ N(0, I): the randomness is isolated in ε, so gradients can flow through μ and σ.
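A minimal sketch of the reparameterization step in PyTorch; mu and log_var would come from the encoder, and are stand-in tensors here:

```python
import torch

mu = torch.zeros(2, requires_grad=True)       # encoder's mean output (stand-in)
log_var = torch.zeros(2, requires_grad=True)  # encoder's log-variance (stand-in)

eps = torch.randn_like(mu)               # random noise, no gradient needed
z = mu + torch.exp(0.5 * log_var) * eps  # z ~ N(mu, sigma^2), differentiable

z.sum().backward()                       # gradients flow into mu and log_var
print(mu.grad, log_var.grad)             # both non-None: sampling is differentiable
```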
A deep autoencoder uses multiple hidden layers for non-linear dimensionality reduction: similar to PCA, but able to capture non-linear structure.
| Aspect | Deep Autoencoder (DAE) | VAE |
|---|---|---|
| Latent Space | Fixed point — irregular, no structure | Probability distribution — smooth & continuous |
| Reconstruction Loss | MSE or Binary Cross-Entropy only | Reconstruction Loss + KL Divergence |
| Generative Capability | ❌ Cannot generate new samples | ✅ Can generate new samples |
| Generalization | May not generalize well | Better generalization due to regularized latent space |
| Feature Extraction | Good non-linear features | Also good, plus interpretable (mean/variance) |
| Use Case | Dimensionality reduction, denoising | Generation, interpolation, anomaly detection |
| Interpolation | May produce invalid outputs | Smooth, meaningful interpolation possible |
| Aspect | Autoencoder | VAE |
|---|---|---|
| Encoding | z = f(x) ← single point | z ~ N(μ(x), σ²(x)) ← distribution |
| Generative? | No | Yes — sample any z from latent space |
| Latent Space | Unstructured | Regularized (Gaussian prior) |
| Domain | Application |
|---|---|
| Computer Vision | Face generation, image editing (change hair color, add glasses) |
| Drug Discovery | Generate new molecule structures with desired properties |
| Anomaly Detection | Normal data has low reconstruction error; anomalies have high error |
| Data Augmentation | Generate synthetic training samples |
| Representation Learning | Learn compact, meaningful features for downstream tasks |