Lecture 05: CNN

Lecture 05 · 36 slides

Section 1: Convolutional Neural Networks (CNN) & Transfer Learning

This section covers convolutional neural networks from the ground up -- starting with the mathematical operation of convolution, building up to CNN architectures, and finishing with normalization techniques. We'll also touch on transfer learning and how pre-trained models can be leveraged for new tasks.

Section 1 Questions

The key questions for this section: What is a CNN? Can you explain padding and strides? What are some popular CNN architectures? What is residual learning, and what two problems arise when training deep neural networks? Why use normalization layers? And on the transfer learning side -- what are learned feature representations, how does transfer learning work, what is a pre-trained model, and how is it useful?

Convolutions

Convolution is a mathematical operator on two functions. In the continuous case it's an integral; in the discrete case it's a summation. Visually, you're taking one function and sliding it over the other from negative infinity to positive infinity, computing the overlap at each position. The diagrams here show two functions, one of which is a rectangular pulse that's zero everywhere except for a fixed region. As you slide one function across the other, the output traces the convolution result. This is a foundational operation in signal processing, and it's the core building block of CNNs.
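For reference, the integral and the summation mentioned above are usually written as follows. The symbols f, g, t, n, tau, and m are notation chosen here for illustration; the slides may use different names.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
\qquad
(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]
```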

Finite Length Discrete Convolutions

In practice, one of the functions is small and fixed in length -- this is called the kernel or filter. It's typically 3, 5, or 7 elements wide. The other function is the input signal, which in the context of CNNs is usually an image. You slide the kernel over the input, computing a dot product at each position, and the result is called the output signal or feature response. The diagram shows a filter [1, 2, 1] sliding over an input signal to produce the first output value. This finite, discrete version of convolution is what actually gets computed in neural networks.

1D Convolution Example – Zero Padding

Here's a concrete 1D example. The filter is [1, 2, 1] and the input is [1, 2, 3, 4, 5], padded with one zero on each end to give [0, 1, 2, 3, 4, 5, 0]. At each position, you compute the dot product: 1 times 0, plus 2 times 1, plus 1 times 2 gives 4. Shift one step, and you get 1 times 1, plus 2 times 2, plus 1 times 3, which is 8. You keep sliding the window across the entire signal. The zeros on either end are "zero padding" -- they let the filter be centered on the edge elements of the input. With "same" padding, the output has the same length as the input. Without padding ("valid"), the output shrinks because the filter can't extend past the boundaries; in that case, the output length equals the input length minus the kernel width plus one (here, 5 - 3 + 1 = 3).
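The arithmetic in this example can be checked with a minimal sketch in plain Python. It treats [1, 2, 3, 4, 5] as the unpadded signal, which is one reading of the example, and the function name `conv1d` and its `padding` argument are hypothetical names chosen for illustration.

```python
def conv1d(signal, kernel, padding="valid"):
    """Slide `kernel` over `signal` and take a dot product at each position.

    Like most deep learning libraries, this does not flip the kernel
    (strictly a cross-correlation). For a symmetric kernel such as
    [1, 2, 1] the result is identical to a true convolution.
    """
    k = len(kernel)
    if padding == "same":
        # Assumes an odd kernel width so the zero padding is symmetric.
        pad = (k - 1) // 2
        signal = [0] * pad + list(signal) + [0] * pad
    elif padding != "valid":
        raise ValueError("padding must be 'valid' or 'same'")
    out_len = len(signal) - k + 1  # input length - kernel width + 1
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(out_len)
    ]


same = conv1d([1, 2, 3, 4, 5], [1, 2, 1], padding="same")
valid = conv1d([1, 2, 3, 4, 5], [1, 2, 1], padding="valid")
print(same)   # [4, 8, 12, 16, 14]
print(valid)  # [8, 12, 16]
```

The first two "same" outputs, 4 and 8, match the hand computation, and the "valid" output has length 5 - 3 + 1 = 3.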

1D Convolution Example – Zero Padding and Strides

We don't have to move the filter one step at a time -- we can jump more than one position, and that's the idea of strides. A stride of 2 means we skip every other position, roughly halving the output length. This example also shows "full" padding, which is the maximum number of zeros you can add (kernel width minus one on each side) such that the filter still overlaps with at least one element of the input signal. Adding more zeros beyond that would only add outputs that are exactly zero. Between padding and strides, you have two knobs that control the spatial dimensions of the output: padding preserves or increases the size, strides reduce it.
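As a rough sketch of how the two knobs interact, the snippet below applies "full" padding and then compares stride 1 with stride 2. The signal [1, 2, 3, 4, 5] and filter [1, 2, 1] are assumed example values, and `conv1d_strided` is a hypothetical helper name.

```python
def conv1d_strided(signal, kernel, pad=0, stride=1):
    """1D sliding dot product with `pad` zeros per side and a given stride."""
    k = len(kernel)
    padded = [0] * pad + list(signal) + [0] * pad
    # Output length: floor((n + 2*pad - k) / stride) + 1
    out_len = (len(padded) - k) // stride + 1
    return [
        sum(kernel[j] * padded[i * stride + j] for j in range(k))
        for i in range(out_len)
    ]


x = [1, 2, 3, 4, 5]
w = [1, 2, 1]
full_pad = len(w) - 1  # "full" padding: kernel width - 1 zeros on each side

full = conv1d_strided(x, w, pad=full_pad, stride=1)
full_s2 = conv1d_strided(x, w, pad=full_pad, stride=2)
print(full)     # [1, 4, 8, 12, 16, 14, 5]
print(full_s2)  # [1, 8, 16, 5]
```

With full padding the output (length 7) is longer than the input (length 5), and a stride of 2 keeps every other position.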

2D Convolutions

Images are two-dimensional, so we need 2D convolutions. The equation looks more complicated, but the idea is the same -- instead of sliding a 1D filter along a signal, we slide a 2D kernel across rows and columns. At each position, we compute the element-wise product of the kernel and the overlapping patch of the input, then sum everything up. The kernel here is a small grid of values (like ones and zeros), and we scan it across the full image in both dimensions to produce the output feature map.
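The 2D case can be sketched the same way, with two extra loops. The 4x4 image and 2x2 kernel of ones and zeros below are made-up values for illustration, not the ones on the slide, and `conv2d` is a hypothetical helper with no padding and a stride of 1.

```python
def conv2d(image, kernel):
    """Slide a 2D kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Element-wise product of the kernel and the overlapping patch,
            # summed into a single output value.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        out.append(row)
    return out


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]
feature_map = conv2d(image, kernel)
print(feature_map)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```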

2D Convolutions with Padding and Strides

These visualizations show the effect of different padding and stride combinations on 2D convolutions. The bottom grid is the input, the shaded region is the filter, and the top grid is the output. With no padding and stride 1, the output shrinks. Adding "same" padding (for a 3x3 filter, one layer of zeros around the input) keeps the output the same size. Increasing the stride to 2 or 3 reduces the output dimensions further because the filter jumps multiple positions between computations. These two parameters -- padding and stride -- give you precise control over the spatial dimensions at each layer.
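The output size along each spatial dimension follows a standard formula, sketched below. The helper name `conv_output_size` and the input sizes (5 and 7) are assumptions chosen for illustration and may not match the grids in the visualizations.

```python
def conv_output_size(n, k, pad=0, stride=1):
    """Output length along one dimension: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1


# 3x3 kernel on a 5x5 input (apply the same formula to rows and columns).
no_pad = conv_output_size(5, 3)                       # output shrinks
same_pad = conv_output_size(5, 3, pad=1)              # output size preserved
same_pad_s2 = conv_output_size(5, 3, pad=1, stride=2)
# 3x3 kernel on a 7x7 input with stride 3 and no padding.
stride3 = conv_output_size(7, 3, stride=3)

print(no_pad, same_pad, same_pad_s2, stride3)  # 3 5 3 2
```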

2D Transposed Convolutions

Transposed convolutions go in the opposite direction -- from a small input to a larger output. In the diagram, the blue grid is still the input, but now a 2x2 input can generate a larger output through this operation. The filter still exists (the shaded area), but it's applied in a way that upsamples rather than downsamples. This is useful in architectures that need to produce spatial outputs larger than their inputs, like image segmentation or generative models. While strides in regular convolutions reduce dimensions, transposed convolutions increase them. You'll sometimes hear these called "deconvolutions," though that's technically a misnomer, because the operation does not invert a convolution.
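One way to picture a transposed convolution is as a scatter: each input value stamps a scaled copy of the kernel into the output, and overlapping stamps add up. The sketch below uses that view with no padding. The 2x2 input values and the all-ones 3x3 kernel are made up for illustration, and `conv_transpose2d` is a hypothetical helper name.

```python
def conv_transpose2d(x, kernel, stride=1):
    """Transposed convolution as scatter-add (no padding)."""
    n_h, n_w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    # Output size per dimension: (n - 1) * stride + k
    out_h = (n_h - 1) * stride + kh
    out_w = (n_w - 1) * stride + kw
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(n_h):
        for c in range(n_w):
            for i in range(kh):
                for j in range(kw):
                    out[r * stride + i][c * stride + j] += x[r][c] * kernel[i][j]
    return out


x = [
    [1, 2],
    [3, 4],
]
ones = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
]
up = conv_transpose2d(x, ones)
for row in up:
    print(row)
# [1, 3, 3, 2]
# [4, 10, 10, 6]
# [4, 10, 10, 6]
# [3, 7, 7, 4]

up_s2 = conv_transpose2d(x, ones, stride=2)
print(len(up_s2), len(up_s2[0]))  # 5 5
```

Here a 2x2 input becomes a 4x4 output with stride 1, and a 5x5 output with stride 2, so a larger stride increases the output size rather than reducing it.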

Image Processing with Convolutional Kernels

Before deep learning, people hand-designed convolutional kernels for image processing tasks. These kernels can perform edge detection, sharpening, blurring, and other operations depending on the weight values. For example, a Sobel kernel detects edges, a Gaussian kernel blurs, and a sharpening kernel enhances details. You might have used some of these in Photoshop. In the old days, these manually defined kernels were used to generate features for supervised learning -- for instance, using an edge map as input features to classify handwritten digits. But the obvious question is: why manually define kernel weights when we could learn them? That insight is the foundation of convolutional neural networks.
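As a small illustration of a hand-designed kernel, the sketch below applies the horizontal Sobel kernel to a tiny made-up image containing a single vertical edge. The image values and the helper name `conv2d` are assumptions for illustration. As in most deep learning code, the kernel is not flipped.

```python
def conv2d(image, kernel):
    """Slide a 2D kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Horizontal Sobel kernel: responds to left-to-right intensity changes.
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

# Dark region on the left, bright region on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

edges = conv2d(image, sobel_x)
print(edges)  # [[0, 4, 4, 0]]
```

The response is zero over the flat regions and nonzero only where the window straddles the edge.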

Convolutional (Neural Network) Layers

Instead of hand-picking kernel values, we replace them with learnable weights -- w1,1, w1,2, etc. The convolution operation is the same: take a 2D input, slide the kernel over it, and produce a feature map (also called an activation map). The key advantage is that convolutional layers share weights across all spatial positions, making them far more parameter-efficient than fully connected layers. They're also translation-equivariant (often loosely described as spatially invariant) -- the same filter detects a pattern whether it appears in the top-left or bottom-right of the image, and the response simply shows up at the corresponding position in the feature map. Activation functions are applied on the feature maps to provide non-linearity, just like in fully connected networks.
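To get a feel for the parameter savings from weight sharing, here is a back-of-the-envelope count. The 28x28 single-channel input and the helper names `conv_params` and `dense_params` are hypothetical choices for illustration.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Learnable parameters in a convolutional layer."""
    per_filter = kernel_h * kernel_w * in_channels + (1 if bias else 0)
    return per_filter * out_channels


def dense_params(in_features, out_features, bias=True):
    """Learnable parameters in a fully connected layer."""
    return (in_features + (1 if bias else 0)) * out_features


# Map a 28x28 single-channel image to one 28x28 output map.
conv_count = conv_params(3, 3, in_channels=1, out_channels=1)
dense_count = dense_params(28 * 28, 28 * 28)

print(conv_count)   # 10
print(dense_count)  # 615440
```

The convolutional count does not depend on the image size, because the same 3x3 weights are reused at every position.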

Convolutions with Input Depth

Real images aren't 2D -- they have depth. A color image has three channels (RGB: red, green, blue), so a 7x7 image is actually 7x7x3. The convolutional kernel must match this depth, so instead of a 3x3 filter, you have a 3x3x3 filter. The operation works the same way: slide the 3D kernel across the spatial dimensions, computing element-wise products across all three channels and summing everything into a single value. The output is still a 2D feature map because the depth dimension gets collapsed during the dot product. This is an important detail -- the kernel always spans the full depth of the input.
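A minimal sketch of the depth-collapsing step, using the 7x7x3 input and 3x3x3 kernel sizes from the text. The pixel values, the all-ones kernel, and the function name `conv2d_depth` are assumptions for illustration.

```python
def conv2d_depth(image, kernel):
    """image: H x W x C nested lists; kernel: kh x kw x C nested lists.

    Returns a 2D feature map: the channel dimension is summed away.
    """
    kh, kw, depth = len(kernel), len(kernel[0]), len(kernel[0][0])
    if len(image[0][0]) != depth:
        raise ValueError("kernel depth must match input depth")
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                kernel[i][j][d] * image[r + i][c + j][d]
                for i in range(kh)
                for j in range(kw)
                for d in range(depth)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 7x7x3 image where every pixel is (R, G, B) = (1, 2, 3).
image = [[[1, 2, 3] for _ in range(7)] for _ in range(7)]
# 3x3x3 kernel with every weight equal to 1.
kernel = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]

fmap = conv2d_depth(image, kernel)
print(len(fmap), len(fmap[0]))  # 5 5
print(fmap[0][0])               # 54
```

Each output value sums 3 x 3 x 3 = 27 products, and the result is a plain 5x5 grid with no channel axis.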

Convolutions with Multiple Filters

To capture multiple patterns, we use multiple filters. Each filter is a full 3D kernel (matching the input depth) and produces its own 2D feature map. If I have four filters, I get four feature maps stacked together as the output. Each filter operates independently on the same input, and each one learns to detect different patterns. The output depth equals the number of filters -- so a layer with 64 filters produces an output with depth 64. This is how convolutional layers build up rich representations: each filter specializes in a different feature.
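The stacking of feature maps can be sketched as below, using four filters as in the text. The filter weights, the 7x7x3 input values, and the name `conv_layer` are hypothetical and chosen only so the output is easy to check by hand.

```python
def conv_layer(image, filters):
    """Apply each kh x kw x C filter to the same H x W x C input.

    Returns an H' x W' x F output, where F is the number of filters.
    """
    kh, kw = len(filters[0]), len(filters[0][0])
    depth = len(image[0][0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[[0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for f, kernel in enumerate(filters):
        # Each filter works independently and fills its own output channel.
        for r in range(out_h):
            for c in range(out_w):
                out[r][c][f] = sum(
                    kernel[i][j][d] * image[r + i][c + j][d]
                    for i in range(kh)
                    for j in range(kw)
                    for d in range(depth)
                )
    return out


# 7x7x3 image where every pixel is (1, 2, 3).
image = [[[1, 2, 3] for _ in range(7)] for _ in range(7)]
# Four 3x3x3 filters; filter f has every weight equal to f.
filters = [[[[f] * 3 for _ in range(3)] for _ in range(3)] for f in range(4)]

out = conv_layer(image, filters)
print(len(out), len(out[0]), len(out[0][0]))  # 5 5 4
print(out[0][0])                              # [0, 54, 108, 162]
```

The output depth (4) equals the number of filters, regardless of the input depth (3).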