Deep Visualization Toolbox

By: Jason Yosinski


Uploaded on 07/07/2015


Comments (9):

By trevyn    2017-09-20

Fair. How about an excellent 4-minute YouTube video to get a basic understanding? :)

Original Thread

By anonymous    2017-09-20

Each of the kernels learned by the CNN is a filter that creates those features (lines, corners, and so on).

Let's take Sobel as an example. Sobel uses a specific kernel to convolve the image, and with this kernel we can recover the image's gradients in X and Y, which is used as an edge detector. But what if those features (lines) are not the only thing we would like to recover, or are not the ideal features for our specific problem? In that case, we can learn the kernels themselves and create different images.
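To make the Sobel example concrete, here is a minimal NumPy sketch (not from the original post; the toy image and helper function are invented for illustration). It cross-correlates a tiny image containing one vertical edge with the Sobel kernels, so only the X-gradient filter responds:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Cross-correlate `kernel` over `img` (valid mode), as CNN layers do."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernels for X (vertical edges) and Y (horizontal edges).
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
sobel_y = sobel_x.T

# Toy 8x8 image: dark left half, bright right half -> one vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

gx = conv2d_valid(image, sobel_x)  # fires along the vertical edge
gy = conv2d_valid(image, sobel_y)  # nothing to fire on: no horizontal edges
magnitude = np.sqrt(gx ** 2 + gy ** 2)
```

Here `gx` is nonzero only in the columns straddling the edge, while `gy` is zero everywhere, which is exactly the "each filter recovers one kind of feature" idea.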

The images created from the kernels are called feature maps and can be visualized with different techniques. I really recommend watching that video, since it will help you better understand what features the CNN is learning; also take a look at this course.

One way to learn these filters, given that you know the expected output, is to convolve the image with random values (the initial filter weights) and then update those values until you get something that better predicts your training set. Instead of using the typical Sobel kernel, you are then learning whatever kernel/filter best recovers the features that represent your image.

So those filters are, in the end, just the weights of the network that you learned.
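The "start from random values and refine them" process can be sketched in NumPy. Here a known target kernel (Sobel-x, purely as a stand-in for whatever filter best fits the data) produces the "expected output", and gradient descent on a mean-squared error recovers it from a random initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(img, kernel):
    """Cross-correlate `kernel` over `img` (valid mode), as CNN layers do."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Pretend the "right" filter for our data happens to be Sobel-x; in a real
# network we would not know it, only the desired outputs.
true_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])
image = rng.normal(size=(16, 16))
target = conv2d_valid(image, true_kernel)   # the "expected output"

# Random initial filter weights, refined by gradient descent on the MSE.
kernel = 0.1 * rng.normal(size=(3, 3))
lr = 0.2
for _ in range(1000):
    resid = conv2d_valid(image, kernel) - target
    grad = np.zeros_like(kernel)
    h, w = resid.shape
    for i in range(h):
        for j in range(w):
            # Convolution is linear in the kernel, so the MSE gradient is
            # just the residual-weighted sum of image patches.
            grad += resid[i, j] * image[i:i + 3, j:j + 3]
    kernel -= lr * grad / (h * w)
```

After training, `kernel` has converged to (approximately) the Sobel-x weights, i.e. the learned weights are the filter.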

Original Thread

By anonymous    2017-09-20

Preface: A convolutional network is a collection of filters applied to sections of an image (the strides seen in the gif). Each filter produces an activation indicating how strongly a given sub-section of the image matches it.

Striding CNN
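The striding described in the preface can be written as a plain sliding-window loop. This is a minimal NumPy illustration, not the toolbox's code; the averaging filter is just a placeholder for a learned one:

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    """Slide `kernel` across `img` with the given stride (cross-correlation,
    the operation CNN libraries actually implement) and return the
    activation map."""
    kh, kw = kernel.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + kh,
                        j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0            # a simple averaging filter
act_s1 = conv2d(image, kernel, stride=1)  # 4x4 map: every position
act_s2 = conv2d(image, kernel, stride=2)  # 2x2 map: every other position
```

A larger stride visits fewer positions, so the activation map gets coarser, which is what the gif animates.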

What you're seeing in the images you provided is not, in my opinion, the best illustration of how these visualizations work: they visualize how the CNN perceives the whole image at each layer, and only a series of simple filters is used there. This makes all of them look very similar.

Here is a better representation of what the basic filters of a network might look like. Some of them will trigger on vertical lines, others on horizontal lines. That is also what the image you linked shows, except it does so for the whole image, on a visually simple object, which makes it a bit more difficult to understand. When you get to more complex filters, which build on top of these basic ones, you might be better off visualizing the entire image.

Simple convolutional filters

There is also a concept called transfer learning, where you take an existing, well-regarded general model and apply it to your specific problem. These models often need to be tuned, which might mean removing layers that are not needed (each layer you keep usually makes training more time-consuming) and/or adding new layers.

A researcher will be better able to interpret how each layer in the network builds on the previous layers and how they contribute to solving the problem at hand. This is often based on gut feeling (which can be made easier by good visualizations such as this deep visualization toolbox video).

As an example, let's say I'm using VGG16, which is the name of a general model trained on ImageNet. I want to change it to classify distinct categories of furniture instead of the 1000 classes of completely different things it was originally intended to classify. Because it is such a general model, it can recognize a lot of different things, from humans to animals to cars to furniture. But it doesn't make sense for me to incur a performance penalty for a lot of these things, since they don't really help me classify my furniture.

Since a lot of the most important discoveries we make about these classes happen at different layers in the network, I can then move back up the convolutional layers and remove everything that seems too complex for the task I'm doing. This might mean removing some layers that seem to have specialized in categorizing human features such as ears, mouths, eyes, and faces.
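As a rough illustration of the head-replacement part of transfer learning, here is a NumPy sketch. Everything in it is invented for the demo: a frozen random projection stands in for the pretrained backbone (in practice you would load e.g. VGG16's actual convolutional weights in a deep-learning framework), and only a new, smaller classification head is trained:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a pretrained backbone: a frozen projection plus
# ReLU mapping raw inputs to feature vectors.  Its weights are never updated.
W_frozen = rng.normal(size=(64, 16)) / 8.0

def backbone(x):
    return np.maximum(x @ W_frozen, 0.0)

# A toy "furniture" task with 3 classes instead of the original 1000.
n, num_classes = 300, 3
X = rng.normal(size=(n, 64))
feats = backbone(X)                     # computed once: the backbone is frozen
teacher = rng.normal(size=(16, num_classes))
y = np.argmax(feats @ teacher, axis=1)  # synthetic labels for the demo

# Replace the old 1000-way head with a fresh, trainable 3-way head and train
# only it (plain softmax regression on the frozen features).
W_head = np.zeros((16, num_classes))
for _ in range(500):
    logits = feats @ W_head
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(n), y] -= 1.0                 # d(cross-entropy)/d(logits)
    W_head -= 0.5 * feats.T @ probs / n

accuracy = (np.argmax(feats @ W_head, axis=1) == y).mean()
```

The point is the shape of the workflow: expensive, general features are reused as-is, and only the small task-specific head is trained, which is why removing unneeded layers saves so much time.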

As far as I know, people visualize as many layers as they find useful, and then usually make a judgement call based on instinct as to which layers to keep or throw away after that.

Images borrowed from:

Visualizing what ConvNets learn

An Intuitive Explanation of Convolutional Neural Networks

Original Thread

By anonymous    2017-09-20

There are several ways you might define "difference" here, which will probably lead to different solutions. One simple approach might be to look at the last feature vector in the network before classification and compare it to some sort of "ideal" feature vector for Tom Cruise's face (though you'd probably have to normalize that vector as well, so the values don't grow without bound). Then you'd have some vector difference.
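A minimal sketch of that vector comparison, with made-up feature vectors (any real "ideal" vector would come from the trained network):

```python
import numpy as np

def cosine_distance(a, b):
    """Compare two feature vectors after L2 normalization, so raw
    magnitudes don't dominate the comparison (the normalization
    mentioned above)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

# Hypothetical feature vectors from the last layer before classification.
ideal = np.array([0.9, 0.1, 0.4, 0.0])    # the "ideal" target vector
face_a = np.array([0.8, 0.2, 0.5, 0.1])   # a similar face
face_b = np.array([0.0, 1.0, 0.0, 0.9])   # a dissimilar face

d_a = cosine_distance(ideal, face_a)
d_b = cosine_distance(ideal, face_b)
```

The smaller distance identifies the face whose internal representation is closer to the target.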

However, I'm guessing what you're actually looking for is seeing the difference in image form. But again, how you define "difference" might be a problem. I suspect a good way of showing the difference would be to show what should be changed in the image to make it look more like Tom Cruise. This is actually what the "Deep Dreaming" networks are (partially) about. The idea is that you compute the gradients, but then rather than looking at the gradients with respect to the weights, you look at the gradients with respect to the inputs. These are the values which, if you changed the image in that direction, would help you produce an image that looks more like your target class. It's as though training is being used to update the image rather than the weights of the network (once you already have a trained network, that is).
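The gradients-to-the-inputs idea can be sketched with a toy linear "network", where the gradient of the class score with respect to the input is available in closed form (a real network would obtain it by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained network: score(x) says how strongly the
# "network" responds to input x for the target class.
w = rng.normal(size=(16,))

def score(x):
    return float(w @ x)

x = rng.normal(size=(16,))   # the starting image, flattened
before = score(x)

# Gradient ascent on the INPUT: for this linear model d(score)/dx is just w,
# so each step nudges the image in the direction the class responds to.
for _ in range(50):
    x += 0.1 * w

after = score(x)
```

The network's weights never change here; only the image does, which is the Deep-Dream-style inversion of ordinary training.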

A little more than halfway through this short video, you can see something similar to what I'm suggesting. They show what should change in the image for the network to respond more strongly to the given image.

Original Thread

By anonymous    2018-03-12

I saw the github issue below asking the same question, you might want to follow it for future updates.

I don't speak for the developers who made this decision, but I would surmise that they do this by default because it is indeed used often, and for most applications where you aren't backpropagating into the labels, the labels are a constant anyway and won't be adversely affected.

Two common use cases for backpropagating into labels are:

  • Creating adversarial examples

There is a whole field of study around building adversarial examples that fool a neural network. Many of the approaches involve training a network, then holding the network fixed and backpropagating into the input (the original image) to tweak it (usually under some constraints) to produce a result that fools the network into misclassifying the image.

  • Visualizing the internals of a neural network.

I also recommend watching the deepviz toolkit video on YouTube; you'll learn a ton about the internal representations learned by a neural network.

If you continue digging into that and find the original paper, you'll see that they also backpropagate into the input to generate images that highly activate certain filters in the network, in order to understand them.
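The adversarial-example recipe in the first bullet can be sketched with a linear stand-in for the trained network. Everything here is a toy: real attacks such as FGSM use a small fixed pixel budget and backpropagation, whereas this epsilon is chosen just large enough to flip the toy prediction:

```python
import numpy as np

rng = np.random.default_rng(2)

# A hypothetical "trained" classifier, now held fixed: a linear two-class
# model standing in for a full network.
W = rng.normal(size=(10, 2))

def logits(x):
    return x @ W

x = rng.normal(size=(10,))               # the original input ("image")
true_class = int(np.argmax(logits(x)))   # the network's current prediction
other = 1 - true_class

# Gradient of the prediction margin with respect to the INPUT.  For this
# linear model it is just a difference of weight columns; for a real
# network you would backpropagate to obtain it.
margin = float(logits(x)[true_class] - logits(x)[other])
grad = W[:, true_class] - W[:, other]

# Sign-of-gradient step that shrinks the margin under an L-infinity budget.
eps = 1.5 * margin / np.abs(grad).sum()
x_adv = x - eps * np.sign(grad)
```

After the step, the classifier's prediction on `x_adv` flips even though the perturbation per component is bounded by `eps`, which is the essence of the attack.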

Original Thread
