🔍 Convolutional Layers
Conv2D (Convolutional Layer)
The foundation of any CNN, Conv2D layers perform feature extraction through convolution operations.
How it works:
- Applies learnable filters (kernels) across the input image
- Each filter detects specific patterns (edges, textures, shapes)
- Preserves spatial relationships between pixels
- Shares parameters across the entire input, reducing overfitting
Parameters:
- Filters: Number of feature detectors (commonly a power of two: 8, 16, 32, 64, 128, 256, 512)
- Kernel Size: Size of the convolution window (3x3, 5x5, 7x7)
- Strides: Step size for filter movement (1, 2)
- Padding: Border handling ('valid' or 'same')
- Activation: Built-in activation function (optional)
Best Practices:
- Start with fewer filters (8-16) in early layers
- Increase filter count in deeper layers for complex pattern recognition
- Use 3x3 kernels for most applications (a well-established default in modern CNN architectures)
- Apply padding='same' to preserve spatial dimensions
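A minimal sketch of a single convolutional block in Keras (assuming TensorFlow as the backend; the 16-filter count and the 28×28×1 MNIST input shape are illustrative choices following the guidelines above):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),      # MNIST: 28x28 grayscale images
    layers.Conv2D(filters=16,             # start small (8-16 filters)
                  kernel_size=(3, 3),     # 3x3 is the usual default
                  strides=1,
                  padding='same',         # preserve the 28x28 spatial size
                  activation='relu'),     # built-in activation
])
print(model.output_shape)  # (None, 28, 28, 16) -- spatial size preserved
```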
MaxPooling2D (Pooling Layer)
Reduces spatial dimensions while retaining the most important features.
How it works:
- Divides input into non-overlapping regions
- Selects the maximum value from each region
- Reduces computational complexity and overfitting
- Provides a degree of local translation invariance (small shifts in position have little effect on the output)
Parameters:
- Pool Size: Dimensions of pooling window (2x2, 3x3)
- Strides: Step size for pooling operation
- Padding: Border handling strategy
Benefits:
- Reduces memory usage and computation time
- Makes features more robust to small translations
- Helps prevent overfitting by shrinking feature maps, which reduces the parameter count of later layers
- Increases receptive field of subsequent layers
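A short sketch, again assuming TensorFlow/Keras, showing how a 2×2 pooling window halves each spatial dimension:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),  # strides default to the pool size
])
print(model.output_shape)  # (None, 14, 14, 16) -- 28x28 halved to 14x14
```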
⚡ Activation Layers
ReLU (Rectified Linear Unit)
The most popular activation function in modern deep learning.
How it works:
- Outputs the input if positive, zero otherwise: f(x) = max(0, x)
- Introduces non-linearity essential for learning complex patterns
- Mitigates the vanishing gradient problem better than sigmoid/tanh
- Computationally efficient (simple thresholding operation)
Advantages:
- Fast computation and gradient calculation
- Sparse activation (many neurons output zero)
- No saturation for positive values
- Biological plausibility (similar to neuron activation)
When to use:
- After every Conv2D layer
- In hidden Dense layers
- Default choice for most CNN architectures
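ReLU can be attached through a layer's activation argument or added as a standalone layer; a brief sketch (assuming Keras and NumPy) illustrating both, plus the function itself:

```python
import numpy as np
from tensorflow.keras import layers

# f(x) = max(0, x): negatives become zero, positives pass through unchanged
x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0, x))  # [0.  0.  0.  1.5 3. ]

# Two equivalent ways to apply ReLU in a model:
conv_with_relu = layers.Conv2D(32, (3, 3), activation='relu')  # built in
standalone_relu = layers.ReLU()                                # separate layer
```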
Softmax
Converts raw prediction scores into probability distributions.
How it works:
- Exponentiates each input and normalizes by the sum
- Ensures outputs sum to 1.0 (valid probability distribution)
- Emphasizes the largest values while suppressing smaller ones
- Essential for multi-class classification problems
softmax(x_i) = exp(x_i) / Σ(exp(x_j)) for all j
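A quick numeric check of the formula (a sketch using NumPy; the three logit values are arbitrary):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])               # raw scores for 3 classes
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax
print(probs.round(3))  # [0.659 0.242 0.099] -- largest score dominates
print(probs.sum())     # 1.0 (up to floating-point rounding)
```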
Usage:
- The standard final layer for multi-class classification tasks
- Typically applied to the output of the final Dense layer
- Perfect for MNIST's 10-class digit classification
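In Keras this is usually expressed as the activation of the final Dense layer; a minimal sketch assuming 10 classes as in MNIST:

```python
from tensorflow.keras import layers

# Final classification layer: 10 raw scores -> 10 probabilities summing to 1
output_layer = layers.Dense(10, activation='softmax')
```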
🛡️ Regularization Layers
Dropout
Prevents overfitting by randomly setting neurons to zero during training.
How it works:
- Randomly "drops out" (sets to zero) a percentage of neurons
- Forces the network to not rely on specific neurons
- Creates ensemble effect with multiple sub-networks
- Only active during training, disabled during inference
Parameters:
- Rate: Fraction of neurons to drop (0.1 to 0.5 typical)
Best Practices:
- Use 0.25-0.5 for Dense layers
- Place before final classification layer
- Higher rates for larger networks
- Often unnecessary after convolutional layers in small networks
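A minimal sketch assuming Keras, with Dropout placed between the hidden Dense layer and the classification layer; the 0.5 rate is illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(784,)),              # flattened 28x28 input
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),                     # drop 50% of activations during training
    layers.Dense(10, activation='softmax'),
])
# Dropout only fires during training; model.predict() runs with it disabled.
```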
BatchNormalization
Normalizes layer inputs to improve training stability and speed.
How it works:
- Normalizes each feature to zero mean and unit variance across the batch
- Adds learnable scale and shift parameters
- Reduces internal covariate shift
- Acts as implicit regularization
Benefits:
- Faster training convergence
- Higher learning rates possible
- Less sensitive to weight initialization
- Can reduce the need for other regularization techniques
Placement:
- Typically after Conv2D layers, before activation
- Can be used after Dense layers
- Experiment with placement for best results
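A sketch of the Conv2D → BatchNormalization → activation ordering described above, assuming Keras; other placements are also used in practice:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), padding='same'),  # no activation here
    layers.BatchNormalization(),                # normalize, then scale and shift
    layers.ReLU(),                              # activation applied after BN
])
```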
🏗️ Structural Layers
Flatten
Converts multi-dimensional feature maps to 1D vectors for Dense layers.
How it works:
- Reshapes 3D tensor (height × width × channels) to 1D
- Preserves all information, just changes organization
- Required transition between convolutional and dense layers
- No learnable parameters
Usage:
- Place exactly once between Conv2D and Dense layers
- Essential bridge in CNN architecture
- Typically after final pooling layer
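A sketch showing the shape change Flatten performs, assuming Keras:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),  # -> (14, 14, 16)
    layers.Flatten(),             # -> 14 * 14 * 16 = 3136 values, no parameters
])
print(model.output_shape)  # (None, 3136)
```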
Dense (Fully Connected)
Traditional neural network layer where each neuron connects to all previous neurons.
How it works:
- Computes weighted sum of all inputs plus bias
- Applies activation function to the result
- Learns global patterns across the entire flattened feature map
- High parameter count but powerful pattern recognition
Parameters:
- Units: Number of neurons in the layer
- Activation: Activation function to apply
Architecture Guidelines:
- Start with 64-128 units for hidden layers
- Final layer must have 10 units for MNIST (one per digit)
- Can stack multiple Dense layers
- Consider Dropout between Dense layers
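Putting the pieces together, a minimal end-to-end MNIST classifier following the guidelines above (a sketch assuming TensorFlow/Keras; the filter and unit counts are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),    # 64-128 hidden units
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),  # one unit per digit class
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```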