ViTs are based on the same Transformer architecture that powers Large Language Models. Unlike CNNs, which process images through localized filters, ViTs apply the Transformer architecture to capture long-range spatial dependencies across the entire image. Here’s a detailed breakdown:
Key Components of ViTs
- Image Tokenization: ViTs begin by splitting an image into fixed-size patches, which are then flattened and linearly transformed into a series of vectors, known as tokens. These tokens are akin to words in a sentence for Natural Language Processing (NLP) transformers.
- Positional Encoding: Since self-attention does not inherently encode the order of its inputs, positional encodings are added to the tokens to retain the positional information of each patch. This step is crucial for the model to understand the spatial arrangement of patches in the image.
- Transformer Encoder: The core of a ViT is the transformer encoder, which consists of multiple layers of self-attention and feed-forward neural networks. The self-attention mechanism allows the model to weigh the importance of different patches relative to each other, enabling it to learn contextual relationships within the image.
- Classification Head: After processing through the transformer encoder layers, the output is passed to a classification head, typically a simple feed-forward network, to make the final prediction. A minimal code sketch of all four components follows this list.
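To make these four components concrete, here is a minimal ViT sketch in PyTorch. The framework choice, the class name (MiniViT), and all hyperparameters (patch size, embedding dimension, depth) are illustrative assumptions and do not necessarily match the course notebook.

```python
# Minimal Vision Transformer sketch (illustrative hyperparameters, not the course's).
import torch
import torch.nn as nn


class MiniViT(nn.Module):
    def __init__(self, image_size=32, patch_size=8, in_channels=3,
                 embed_dim=64, depth=4, num_heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # 1. Image tokenization: a strided convolution splits the image into
        #    fixed-size patches and linearly projects each one into a token.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # 2. Positional encoding: learned embeddings added to the tokens
        #    (plus a [CLS] token) so the model retains each patch's position.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # 3. Transformer encoder: stacked self-attention + feed-forward layers.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # 4. Classification head: a simple linear layer on the [CLS] token.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                           # x: (batch, channels, H, W)
        tokens = self.patch_embed(x)                # (batch, embed_dim, H/P, W/P)
        tokens = tokens.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])              # predict from the [CLS] token


# Quick shape check on a batch of fake 32x32 RGB images.
model = MiniViT()
logits = model(torch.randn(8, 3, 32, 32))
print(logits.shape)  # torch.Size([8, 10])
```

This sketch omits training details (dropout, weight initialization, learning-rate schedules) to keep the mapping from the four components above to code as direct as possible.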
Vision Transformer Exercise
Now that we’ve explored the basics of computer vision, let’s get our hands dirty and build a ViT model ourselves. The following exercise will show you what a ViT looks like in code!
The notebooks for the Computer Vision course are located at https://github.com/PracticumAI/computer_vision, and the ViT notebook is 01.5_ViT_optional_tutorial.ipynb.