Introduction

Neural networks now perform as well as, or nearly as well as, humans in a variety of tasks and domains. On the surface, modern neural networks may seem sophisticated: they can have billions of parameters. Inside, however, they are organized as a sequence of manageable functions, or 'layers'. These layers transform the input data into internal representations, which are themselves eventually transformed into a solution to the task. The behavior of these layers is complex and high-dimensional, and the internal representations are not directly comparable across layers. How can we make them more readily visible? This article assumes some familiarity with the common operations in convolutional neural networks (CNNs) and a basic intuition for, and motivation behind, dimensionality reduction techniques. For readers in need of a gentle introduction or a refresher, there is a great video series on neural networks by Grant Sanderson (aka 3blue1brown) and a 5-minute introduction to PCA by Josh Starmer (StatQuest).

Taking image classification as an example: ImageNet is a dataset created to benchmark computer vision applications on a 1000-category classification task. The inputs to a neural network are simply the images given to it. Here, we use the popular UMAP algorithm to arrange a set of input images on the screen. These are some of the ImageNet validation samples; it is not hard to see that the images are arranged mainly by color. As we will explain later, however, it is hard for a 2D arrangement to capture the full structure of a high-dimensional space.

This typeset denotes an explanation of the interactive features in this essay.
If you are viewing this article in portrait mode (e.g. on a tablet or smartphone), you will be able to use only a portion of the interactive features in this article.
Otherwise, feel free to explore the image space by panning and zooming on the right half of the screen. To restore the default scale, click the orange button on the bottom right.

Note that the overall color already hints at the class of the image. For example, an image with a very blue background often contains ocean animals,

outdoor scenes, airplanes, or birds,

but others may not have a blue background.

You are seeing these images in a 2D layout: positions on the screen. Such layouts often do not have enough capacity to show the full structure of a high-dimensional space. Even if we were to summarize an entire image as a single color, that color has three dimensions: red, green and blue; if we consider individual pixels, the dimensionality quickly reaches the hundreds of thousands. In ImageNet, the input has 150,528 dimensions, coming from images that are 224 x 224 x 3. Clearly, 2D will not be enough, but 150,528 dimensions are likely too many (there are "only" 1000 classes in the task, after all).

As it turns out, dimensionality reduction techniques are not restricted to computing 2D or 3D layouts. If we give UMAP more room than 3 dimensions, it has a chance to preserve more of the structure in the data. In this example, we project the (224 x 224 x 3)-dimensional image data to 15 dimensions with UMAP. Up to this point, you have only been seeing the first two of those dimensions.
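To make this concrete, here is a minimal sketch of how such a higher-dimensional projection could be computed with the umap-learn Python package. The variable names, preprocessing, and parameter choices below are our own illustrative assumptions, not the exact pipeline behind the figures in this article.

    # A minimal sketch, assuming `images` is a NumPy array of ImageNet samples
    # with shape (n_samples, 224, 224, 3).
    import numpy as np
    import umap

    X = np.asarray(images, dtype=np.float32).reshape(len(images), -1)  # flatten to (n_samples, 150528)

    reducer = umap.UMAP(n_components=15)      # ask for 15 dimensions instead of the default 2
    embedding_15d = reducer.fit_transform(X)  # shape: (n_samples, 15)

    # The layout you have seen so far corresponds to the first two of these coordinates.
    first_two = embedding_15d[:, :2]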

Admittedly, 15-dimensional space is not exactly intuitive. To overcome that, we use the Grand Tour, a visualization which smoothly rotates the 15D embedding and linearly projects it down to the 2D screen.
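The essence of the Grand Tour can be sketched in a few lines: pick a smoothly varying rotation of the 15-dimensional space and keep the first two coordinates of the rotated points as screen positions. The snippet below only illustrates this idea under our own simplifying assumptions; it is not the implementation behind the interactive figures.

    # A sketch of a Grand-Tour-style frame: rotate the 15D embedding by an
    # orthogonal matrix that varies smoothly with time t, then project to 2D.
    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    d = 15
    A = rng.standard_normal((d, d))
    A = (A - A.T) / 2                      # skew-symmetric generator of rotations

    def grand_tour_frame(embedding_15d, t):
        """Return the 2D view of the 15D embedding at animation time t."""
        R = expm(t * A)                    # orthogonal: the exponential of a skew-symmetric matrix
        return (embedding_15d @ R)[:, :2]  # screen coordinates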

We let you control the speed of the Grand Tour rotation:

The rotation of the Grand Tour is random but thorough: the sequence of projections depends on the initial (random) state, but any animation will ultimately go through every projection. As the rotation happens, you can probably already spot interesting clusters, and it would be useful to nudge the projection this way or that. To make projections more controllable, we provide you with "handles" to grab and move around.

Now dragging, for example, the "Ocean Animals" handle will move the region around.

By doing this, we can explore structures in our data in a more controllable way. In this example and the others that follow, we use UMAP to project our data to 15D and the Grand Tour to display it.

A Journey Into the Layers

The internal representations of the images as they pass through a deep neural network's layers are known as neuron activations. Just like the input images, neuron activations are high-dimensional vectors. Projecting the neuron activations of a chosen layer to more than three dimensions preserves interesting structure and reveals how that layer sorts images and patterns in its feature space. Using image examples as probes and the Grand Tour as a lens, we can get a glimpse of the internal feature space of these neural networks.
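For readers who want to try this themselves, one plausible way to collect such activations is with a forward hook in PyTorch. The snippet below uses torchvision's GoogLeNet with our own illustrative assumptions about preprocessing and model loading; it is a sketch, not the exact pipeline used for this article.

    # A sketch: record the activations of GoogLeNet's inception4d layer with a
    # forward hook. `images` is assumed to be a preprocessed tensor of shape
    # (n_samples, 3, 224, 224).
    import torch
    import torchvision

    model = torchvision.models.googlenet(weights="IMAGENET1K_V1").eval()

    activations = []

    def save_activation(module, inputs, output):
        # Flatten each sample's activation volume into one high-dimensional vector.
        activations.append(output.flatten(start_dim=1).detach())

    hook = model.inception4d.register_forward_hook(save_activation)

    with torch.no_grad():
        model(images)                       # the hook records activations as a side effect
    hook.remove()

    layer_vectors = torch.cat(activations)  # (n_samples, d): the input to UMAP for this layer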

Sometimes, we may want to hide the fine details of individual images and focus on the overall distribution of points. In such cases, we represent images as dots and color them by a coarse labeling derived from their true classes. The ImageNet training set has 1000 fine-grained classes, including, for example, 119 breeds of dogs. We color the data points based on a coarse grouping of these 1000 classes into 67 classes; even so, we have to reuse the 10-category color scheme for multiple labels. The coarse labels are due to Noam Eshed.

Hover over dots on the bottom right to filter data by label.

As an example of a real-world deep neural network, consider GoogLeNet (also known as Inception-V1), a well-known high-performance DNN for image classification.

You can toggle between showing images and dots by pressing [i] on your keyboard.

One of its internal layers, Inception-4d, shows some striking patterns: it encodes the orientation of animals,

clusters various animal faces,

and despite there being no 'human' class in the 1000 labels, the network recognizes human faces.

Again, let's take a Grand Tour and look at the intricate structure of the 15-dimensional space, one 2D projection at a time.

As with the input data, one way of changing projections in the Grand Tour is to drag the handles provided around some given set of data points. We can also create custom handles.

Clicking or pressing [d] on your keyboard will add a handle at the centroid of your next brush selection of data points.
You can rename a custom handle by editing the corresponding text.
Clicking or pressing [x] will delete custom handles one at a time, starting with the most recent.

We have seen two layers of GoogLeNet: the input layer and the Inception-4d layer. Now, let's tour the other layers.

Starting off from the input layer,

diving down to a max-pooling layer,

then Inception-4d,

Inception-5b,

and finally, the softmax classification output.

Let's pay attention to a technical detail for a moment: notice that the layer-to-layer transitions are consistent with one another, with the overall orientation of the layers staying the same. Later, we will use the same technique to compare different neural network architectures.

For now, feel free to explore the other layers of GoogLeNet here. To switch layers, click any layer indicator on the right.

Comparing Architectures

In the previous section, we saw how individual layers work by looking at their 15-dimensional embeddings. However, we should not expect the embedding algorithm to give us directly comparable coordinates for two different layers, given that the layers have completely different neuron activation patterns and the embedding algorithm (for example, the stochastic gradient descent in UMAP) is not deterministic. Without handling this problem, we can easily lose track of individual data points when switching from the embedding of one layer to another.

We handle the misalignment of embeddings with a combination of two techniques (we handled a similar misalignment issue when dealing with linear dimensionality reduction in ):

  1. We initialize the embedding at the computed embedding of the previous layer. In practice, we simply set the 'init' argument in UMAP's Python API to the embedding of the previous layer, similar to Rauber, Paulo E. et al. (A minimal sketch appears below this list.)
  2. We compute the optimal rotation that aligns one layer's embedding to another's, using the closed-form solution to the Orthogonal Procrustes Problem.
Solving the misalignment issue also allows us to compare different DNN models side by side.
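A minimal sketch of the first technique, using UMAP's Python API, might look like the following; `activations_by_layer` and the parameter choices are our own illustrative assumptions.

    # Technique 1: seed each layer's embedding with the previous layer's result
    # via UMAP's `init` argument (which accepts an array of initial positions).
    import umap

    embeddings = []
    previous = "spectral"                           # UMAP's default initialization, used for the first layer
    for layer_activations in activations_by_layer:  # each entry: an (n_samples, d_layer) array
        reducer = umap.UMAP(n_components=15, init=previous)
        embeddings.append(reducer.fit_transform(layer_activations))
        previous = embeddings[-1]                   # initialize the next layer at this embedding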

The second technique deserves some more detail. Given a pair of embeddings of different layers, the Orthogonal Procrustes Problem finds an optimal orthogonal matrix that rotates one embedding to align with the other. Formally, the Orthogonal Procrustes Problem solves $Q^* = \arg\min_{Q \in O(p)} \|X_1 Q - X_2\|_2$, where $X_1$ and $X_2$ are two sets of embeddings and $Q$ is an orthogonal matrix.
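In code, the closed-form solution is available off the shelf, for instance in SciPy; the sketch below assumes `emb_a` and `emb_b` are two (n_samples, 15) embeddings of the same set of images.

    # Technique 2: find the orthogonal matrix Q that best rotates emb_a onto
    # emb_b, then apply it.
    from scipy.linalg import orthogonal_procrustes

    Q, _ = orthogonal_procrustes(emb_a, emb_b)  # closed-form solution (computed via an SVD)
    aligned_a = emb_a @ Q                       # emb_a expressed in emb_b's orientation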

By enabling alignment, we rotate the view on the left to align with the one on the right.
By moving the handles around, you should be able to see that most clusters are matched after we perform the Procrustean alignment.

Note that the two embeddings do not have to come from the same model. On the right, which we've seen before, we have our old friend GoogLeNet. On the left we have ResNet50, another well-known, high-performance deep classification architecture.

Move the handles to see how well their embeddings in later layers are aligned. With this way of comparing different models, we can easily find commonalities between different layers. For instance, the human cluster in GoogLeNet is also present in the 18-BottleNeck layer of ResNet50; and within GoogLeNet, the human cluster in the 12-Inception (Inception-4d) layer is broken up in the next layer (13-Inception), then reassembled later in 15-Inception.
What other interesting behaviors can you find in these two models?

Resources

See the UMAP Tour of other models:

  1. VGG-16
  2. ResNet50
  3. ResNet101
  4. GoogLeNet
  5. Inception V3
  6. Compare ResNet50 vs GoogLeNet