Few-shot learning is increasingly popular because it can handle machine learning tasks with just a few training examples. It is also more biologically plausible and closer to what we observe in nature: when learning a new task, one normally does not start from a randomly initialised neural network and present hundreds of thousands of examples over thousands of epochs.
When you are told to remember a person from a picture, you are able to distinguish this person from others even when you see them in different positions or environments. In machine learning, this is called one-shot learning. The task of one-shot learning is to learn new classes given only one instance of each class; three-way five-shot learning means learning three classes given five training instances of each. You do not learn classifiers from scratch; instead, you typically use neural networks trained on similar tasks with much more data. This also reflects the natural situation: your visual perception is already well trained on similar tasks by the time you try to remember a new person from a picture.
This process can also be called meta-learning or transfer learning, as one uses a pretrained neural network called a backbone network.
Also, in a few-shot learning scenario, you can often utilise unlabelled instances in addition to the few labelled samples that are available for the task.
In the above example, this means using pictures of many people without any label saying whether a particular person is present in the picture. How can this data help your few-shot classifiers get better? Before we answer this question, we should explain how to measure the similarity of images using their neural embeddings.
Neural image embeddings
Backbone networks are well pretrained to perform feature extraction: layers of the network extract increasingly abstract patterns from the picture. Early layers perceive simple patterns such as background noise or texture, while layers deeper in the network recognize whole objects. You might refer to our blogpost and open source project investigating how a convolutional network processes images.
Convolutional networks can be used not only to recognize objects, but also to compare two or more images. When different images are presented to a single pretrained network, you can compare the outputs of individual convolutional layers; these (latent) output vectors or matrices are called image embeddings. When the outputs for two images are similar in the first layers, the images share similar low-level features such as textures. Similarity of image embeddings in deep layers means that the images contain similar objects. When the output of the network is similar, the images are classified into the same class. The final output, however, is just a class label, so it cannot represent the image in all its complexity. Image embeddings are therefore typically taken from the last convolutional or fully connected layers just before the output.
A backbone neural network is therefore capable of projecting all images into an abstract space of their neural embeddings, where individual dimensions have no particular meaning. When you measure the cosine similarity of two embeddings, it should correlate with the similarity of the original images.
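As an illustration, cosine similarity between embedding vectors can be computed with a few lines of NumPy. The embedding values below are made up for the example; real embeddings would come from a backbone network.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for backbone outputs (hypothetical values).
emb_cat_1 = np.array([0.9, 0.1, 0.4])
emb_cat_2 = np.array([0.8, 0.2, 0.5])
emb_car = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(emb_cat_1, emb_cat_2))  # close to 1: similar images
print(cosine_similarity(emb_cat_1, emb_car))    # much lower: dissimilar images
```

Cosine similarity ignores the magnitude of the embeddings and compares only their direction, which is why it is a common choice for comparing neural embeddings.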
You can project this multidimensional image embedding into two dimensions using a linear (PCA) or nonlinear (t-SNE) mapping and display a thumbnail of the original image at the obtained coordinates. This way, you can verify whether the backbone network works well for your images.
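Such a 2D projection can be sketched with scikit-learn (an assumption; any PCA/t-SNE implementation works). The random array below just stands in for real backbone embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for backbone embeddings of 100 images, 512-dimensional.
embeddings = rng.normal(size=(100, 512))

# Linear projection to 2D.
coords_pca = PCA(n_components=2).fit_transform(embeddings)

# Nonlinear projection to 2D; perplexity must be smaller than the sample count.
coords_tsne = TSNE(n_components=2, perplexity=30, init="pca",
                   random_state=0).fit_transform(embeddings)

print(coords_pca.shape, coords_tsne.shape)  # (100, 2) (100, 2)
```

The resulting 2D coordinates can then be used to place image thumbnails on a scatter plot for visual inspection.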
Ideally, all images of a single class (e.g. a particular person) should be projected into one compact cluster that keeps its distance from the clusters of other people. Reality, however, often falls short of this ideal, and pictures of different people frequently end up mixed together even for a perfect classifier. There are several reasons for this:
- the backbone classifier is typically pretrained on a large corpus of images that may not contain pictures similar to your use case
- the distribution of images of a single person does not follow a Gaussian distribution in the latent space of image embeddings
- even the nonlinear projection into two dimensions is far from perfect: it can project images close together even when they are quite distant in the multidimensional latent space.
Even though this technique is far from perfect, it enables us to study the quality of neural embeddings and the distribution of images in the latent space.
Improving the latent space for better classification of unseen classes
Many recent studies indicate that classification performance in the few-shot scenario can be improved by further preprocessing the image embeddings obtained from a backbone network. The approach we introduced in our paper is called the Latent Space Transform, and we obtained state-of-the-art performance using this transform on several tasks (see paperswithcode).
The latent space transform consists of three preprocessing steps.
- The first step is a power transform combined with a semi-normalization of each point in the latent space. This operation makes the distribution of images in the latent space more Gaussian-like, so that classes better resemble Gaussian clusters and are better separated.
- The second step reduces the dimensionality of the latent space while maintaining useful information (we use a QR decomposition and remove unimportant dimensions).
- The third step is a further centering and semi-normalization of the latent space.
These three steps improve the separation of data clusters and remove correlations between latent dimensions, leading to higher-quality image embeddings.
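The three steps above can be sketched numerically as follows. This is a rough illustration, not the exact implementation from the paper: the power exponent, the number of kept dimensions, and the way unimportant dimensions are dropped after the QR decomposition are all simplifying assumptions.

```python
import numpy as np

def latent_space_transform(X, beta=0.5, keep_dims=64):
    """Illustrative sketch of the three preprocessing steps
    (hyperparameters are assumptions, not the paper's settings)."""
    # Step 1: power transform + semi-normalization. Assumes non-negative
    # features (e.g. post-ReLU embeddings); pushes the distribution
    # towards a more Gaussian-like shape.
    X = np.power(X, beta)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Step 2: dimensionality reduction. Here we build an orthonormal basis
    # via QR decomposition and crudely keep only the first dimensions;
    # the paper identifies unimportant dimensions more carefully.
    Q, _ = np.linalg.qr(X.T)        # columns of Q span the embedding space
    X = X @ Q[:, :keep_dims]

    # Step 3: center the data and semi-normalize again.
    X = X - X.mean(axis=0, keepdims=True)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X

rng = np.random.default_rng(0)
embeddings = np.abs(rng.normal(size=(200, 512)))  # toy non-negative embeddings
transformed = latent_space_transform(embeddings)
print(transformed.shape)  # (200, 64)
```

After the transform, every point lies on the unit sphere of the reduced space and the data cloud is centered, which is what makes the Gaussian-cluster assumption in the next section more plausible.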
Optimal transport mapping with unlabeled data
Returning to the question of how unlabeled data can help us recognize new classes, imagine the distribution of images projected by a backbone network into the latent space and then improved by our latent space transform.
In this space, the embeddings of the new few-shot classes are accompanied by the embeddings of unlabeled images. Because we can assume that after our transform the class distributions are Gaussian, we iteratively adjust the centroids of the few-shot classes to better reflect the distribution of the unlabeled data. Class centroids are likely to be more robust after this adjustment, and classification performance further improves.
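The centroid adjustment can be sketched as follows. This is a simplified stand-in for the optimal transport mapping: it uses a Sinkhorn-style row/column normalization of soft assignments, the hyperparameters are illustrative, and it assumes roughly balanced classes among the unlabeled points.

```python
import numpy as np

def sinkhorn_adjust_centroids(centroids, unlabeled, n_iter=20, temperature=0.1):
    """Refine few-shot class centroids using unlabeled embeddings via a
    Sinkhorn-normalized soft assignment (a simplified sketch of the
    optimal transport adjustment; hyperparameters are assumptions)."""
    for _ in range(n_iter):
        # Squared distances from each unlabeled point to each centroid.
        d = ((unlabeled[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        P = np.exp(-d / temperature)
        # Sinkhorn normalization: rows sum to 1, columns balanced, so no
        # class can absorb all unlabeled points.
        for _ in range(10):
            P = P / P.sum(axis=1, keepdims=True)
            P = P / P.sum(axis=0, keepdims=True)
        P = P / P.sum(axis=1, keepdims=True)
        # Move each centroid to the weighted mean of the unlabeled data.
        centroids = (P.T @ unlabeled) / P.sum(axis=0)[:, None]
    return centroids

rng = np.random.default_rng(0)
# Two true clusters of unlabeled embeddings (toy 2D data).
cluster_a = rng.normal(loc=0.0, scale=0.2, size=(50, 2))
cluster_b = rng.normal(loc=2.0, scale=0.2, size=(50, 2))
unlabeled = np.vstack([cluster_a, cluster_b])
# Noisy centroids estimated from only a few labelled shots.
init = np.array([[0.3, -0.2], [1.6, 2.3]])
refined = sinkhorn_adjust_centroids(init, unlabeled)
print(refined)  # centroids pulled towards the true cluster centers
```

In the toy example, the centroids estimated from a few noisy shots are pulled towards the true cluster centers of the unlabeled data, which is exactly the robustness gain described above.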
In our experiments, we show that our approach significantly outperforms previously introduced methods on both datasets we experimented with.
Implications for machine learning
Improving few-shot classification accuracy is important not only for computer vision tasks such as perception in self-driving cars; it should also carry over to other areas.
Improving the properties of neural image embeddings is an important step towards better robustness of machine learning methods. Thanks to our method, one should be able to find similar images more robustly; for example, in recommender systems based on image embeddings or in visual search, this technique can reduce the number of dissimilar items returned.
Also, few-shot learning is designed to reuse the knowledge that machine learning systems gain when solving similar tasks. Imagine a recommender system processing product images along with user interactions, trained on large e-shops with millions of items and users. Few-shot learning can be applied to extract knowledge from this backbone recommender, improving understanding and recommendation performance on smaller e-shops.