While other computer vision architectures exist, CNNs are still widely used. If you’re curious about another increasingly popular architecture, vision transformers, please see our optional content. Here is a brief overview of how CNNs work.
Parts of a CNN
We will start with a presentation on CNNs.
Now, let’s look at the key components of an image classification CNN and start exploring how these are implemented in code.
- Convolutional layers: The core building blocks of CNNs. They perform a mathematical operation called convolution. This operation involves sliding a filter (also referred to as a kernel) over the input image and computing the dot product between the filter and local regions of the image.
The typical hyperparameters for a convolutional layer are:
- Kernel size: The area, in pixels, that the filter examines at each position. While the kernel size is a tunable hyperparameter, a 3-by-3-pixel filter is frequently used. Larger kernels have a wider field of view and can capture broader context, which may help with larger input images, but they increase the computational cost during training and at inference time.
- Stride: The step size, in pixels, by which the filter moves across the image. Larger strides reduce computational load but may miss fine-grained features that smaller strides would capture. Strides typically range from one up to the dimension of the kernel.
- Padding: Padding adds a border of extra pixels around the outside edge of the image so that the filter can be applied to every pixel in the input, including those at the edges. The smaller the input image, the more difference padding makes. The border can be made up of zeros (zero-padding, the most commonly used) or of values produced by extending the edge pixels into the border.
Each filter produces a feature map representing specific features or patterns in the input image (vertical or horizontal lines, curves, etc.).
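To make these settings concrete, here is a minimal sketch of a single convolutional layer. It assumes the Keras API from TensorFlow (one common choice; the course does not prescribe a framework, and the layer sizes and input shape below are illustrative):

```python
import tensorflow as tf

# One convolutional layer: 32 filters, each a 3x3 kernel, moving one
# pixel at a time; "same" padding zero-pads the border so the output
# keeps the input's width and height.
conv = tf.keras.layers.Conv2D(
    filters=32,
    kernel_size=(3, 3),
    strides=(1, 1),
    padding="same",
)

# A batch of one 28x28 single-channel (grayscale) image.
images = tf.random.normal((1, 28, 28, 1))
features = conv(images)
print(features.shape)  # (1, 28, 28, 32): one 28x28 feature map per filter
```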
- Activation functions: After the convolution operation, an activation function such as the Rectified Linear Unit (ReLU) is applied to introduce non-linearity into the model. Non-linearity allows the model to learn more complex patterns.
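ReLU itself is simply max(0, x): negative values become zero, and positive values pass through unchanged. A quick sketch, again assuming TensorFlow:

```python
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 1.5, 3.0])
print(tf.nn.relu(x).numpy())  # [0.  0.  0.  1.5 3. ] -> ReLU(x) = max(0, x)
```

In Keras, the activation is usually fused into the layer itself by passing activation="relu" to Conv2D or Dense.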
- Pooling layers: These layers reduce the spatial dimensions (width and height) of the input for the next convolutional layer. As in the convolution operation, a window of defined size is passed over the input (sometimes referred to as the feature map), sliding across and down with each step. However, there are no parameters to learn; instead, a fixed operation is applied at each step. The most common form is max pooling, where the maximum element is selected from the input region covered by the window. Average pooling is another choice. The type of pooling and the window size are both hyperparameters of the pooling layer.
The pooling window size determines the extent of downsampling: larger windows reduce the spatial dimensions of the feature maps more significantly, compressing the input and reducing the computational load for subsequent layers.
In addition to dimensionality reduction, pooling layers increase the field of view for deeper layers. Because the outputs of previous layers have been pooled, each neuron receives an aggregate signal derived from many input pixels. With many convolution and pooling layers stacked, early layers have a local focus while deeper layers have a progressively wider one.
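Continuing the sketch above (same assumed framework, same illustrative shapes), max pooling with a 2-by-2 window halves each spatial dimension:

```python
import tensorflow as tf

# Max pooling with a 2x2 window; Keras defaults the stride to the window
# size, so each spatial dimension is halved.
pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))

feature_maps = tf.random.normal((1, 28, 28, 32))
pooled = pool(feature_maps)
print(pooled.shape)  # (1, 14, 14, 32): width and height halved, channels kept
```

AveragePooling2D is the drop-in alternative when average pooling is preferred.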
- Flatten layer: The flatten layer bridges convolutional layers and fully connected (dense) layers. It reshapes the output of the convolutional part of the network into a simple one-dimensional vector. This vector serves as the input to the subsequent dense layers for classification.
Imagine the output of your convolutional layers as a 3D tensor (e.g., image width * image height * number of filters). The flatten layer unrolls this tensor into a single long 1D vector.
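In the running sketch, flattening the 14x14x32 output from the pooling stage produces a vector of 6,272 values:

```python
import tensorflow as tf

flatten = tf.keras.layers.Flatten()

# Pooled feature maps: 32 filters, each now 14x14.
pooled = tf.random.normal((1, 14, 14, 32))
vector = flatten(pooled)
print(vector.shape)  # (1, 6272): 14 * 14 * 32 values unrolled into one vector
```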
- Dense layers: Dense layers are sometimes used to perform the final classification task in CNNs. They take the flattened representation of the image features and learn complex non-linear combinations of them for the final classification. Each neuron in a dense layer receives input from every neuron in the previous layer. During training, the weights and biases within the dense layers are adjusted to discover patterns that best discriminate among the different classes.
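Because every neuron connects to every input, dense layers carry most of a small CNN's parameters. A sketch with the same illustrative sizes as above:

```python
import tensorflow as tf

# A dense layer: each of its 128 neurons connects to all 6,272 inputs,
# so it learns a 6272x128 weight matrix plus 128 biases.
dense = tf.keras.layers.Dense(units=128, activation="relu")

vector = tf.random.normal((1, 6272))
out = dense(vector)
print(out.shape)             # (1, 128)
print(dense.count_params())  # 6272 * 128 + 128 = 802,944 parameters
```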
- Fully connected output layer: At the end of a CNN, fully connected layers use the features learned during the convolutional and pooling phases to classify the input image into various classes.
The softmax function is often used in classification tasks to convert the output layer's raw prediction scores into probabilities, ensuring they lie between zero and one and sum to one, so that each value indicates the likelihood that the corresponding class is the correct classification.
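Softmax exponentiates each score and divides by the total, so larger scores receive larger shares of the probability. Putting every piece together, here is a minimal end-to-end sketch (again assuming Keras; the layer sizes and the 10-class output are illustrative, not prescribed):

```python
import tensorflow as tf

# Softmax turns raw scores into probabilities that sum to one.
logits = tf.constant([[2.0, 1.0, 0.1]])
print(tf.nn.softmax(logits).numpy())  # approximately [[0.659 0.242 0.099]]

# A small image classification CNN assembled from the parts above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # class probabilities
])
model.summary()  # prints each layer's output shape and parameter count
```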