Deep Visualization Toolbox
Code and more info: http://yosinski.com/deepvis
By trevyn 2017-09-20
Fair. How about an excellent 4-minute YouTube video to get a basic understanding? :)
By anonymous 2017-09-20
Each of the kernels learned by the CNN is a filter that produces those features (lines, corners, and so on).
Take Sobel as an example: Sobel uses a specific kernel to convolve the image, and with that kernel we can recover the image's gradients in X and Y, which serves as an edge detector. But what if lines are not the only feature we would like to recover, or not the ideal feature for our specific problem? In that case we can learn those kernels instead and produce different images.
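To make the Sobel example concrete, here is a minimal NumPy sketch (the tiny 5x5 image and the hand-rolled convolution are just for illustration): the x-kernel responds strongly at a vertical edge, while the y-kernel stays silent because there is no horizontal edge.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution written out directly in NumPy."""
    kh, kw = kernel.shape
    h, w = image.shape
    flipped = kernel[::-1, ::-1]          # convolution flips the kernel
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * flipped)
    return out

# Sobel kernels for gradients in x and y
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# A tiny image with one vertical edge: dark left half, bright right half
image = np.zeros((5, 5))
image[:, 3:] = 1.0

gx = convolve2d(image, sobel_x)   # large magnitude at the vertical edge
gy = convolve2d(image, sobel_y)   # all zeros: no horizontal edge present
```

Swapping in a different kernel recovers a different feature, which is exactly what a CNN ends up doing when it learns the kernel values itself.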
The images produced by these kernels are called feature maps, and they can be visualised with different techniques. I really recommend watching that video, since it will help you understand which features the CNN is learning, and take a look at this course.
Well, one way to learn these filters is when you know the expected output. You convolve the image with random values (the initial values of the filters/weights) and then update those values until you get something that predicts your training set better. So instead of using the standard Sobel kernel, you are learning whichever kernel/filter best recovers the features that represent your images.
Those filters are, in the end, simply the weights of the network that you have just learned.
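That learning procedure can be sketched in NumPy. This is a toy setup, not a real CNN: I pretend the Sobel kernel is the unknown "best" filter, generate targets with it, and then recover it from random starting weights by gradient descent on the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" filter we pretend not to know -- the training targets were made with it
true_kernel = np.array([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

def patches(image, k=3):
    """Collect every k x k patch of an image as a row (a tiny im2col)."""
    h, w = image.shape
    return np.array([image[i:i+k, j:j+k].ravel()
                     for i in range(h - k + 1) for j in range(w - k + 1)])

# Training data: random images, targets produced by the true filter
images = [rng.normal(size=(8, 8)) for _ in range(20)]
X = np.vstack([patches(im) for im in images])     # (N, 9) patch matrix
y = X @ true_kernel.ravel()                       # (N,) target responses

# Start from a random kernel and learn it by gradient descent on MSE
w = rng.normal(size=9)
lr = 0.01
for _ in range(500):
    err = X @ w - y
    w -= lr * (2 / len(y)) * (X.T @ err)   # gradient of mean squared error

learned = w.reshape(3, 3)   # ends up very close to true_kernel
```

The point is that nothing in the loop knows about Sobel: the data alone pulls the random weights toward whatever filter explains the outputs.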
By anonymous 2017-09-20
Preface: A convolutional network is a collection of filters applied to sections of an image (the strides seen in the gif). For each sub-section of the image, they produce a response indicating how strongly it matches the filter.
What you're seeing in the images you provided is, in my opinion, not the best illustration of how these visualizations work: they show how the CNN perceives the whole image at each layer, and only a series of simple filters is used there. That is why they all look very similar.
Here is a better illustration of what the basic filters of a network might look like. Some of them will trigger on vertical lines, others on horizontal lines. That is also what the image you linked shows, except it does so for the whole image, on a visually simple object, which makes it a bit harder to interpret. Once you get to more complex filters that build on top of these basic ones, you may be better off visualizing the entire image.
There is also a concept called transfer learning, where you take an existing, highly regarded general-purpose model and try to apply it to your specific problem. These models often need to be tuned, which might mean removing layers that are not needed (each layer we keep usually makes training more time-consuming) and/or adding new ones.
A researcher will be better able to interpret how each layer in the network builds on the previous layers, and how they contribute to solving the problem at hand. This is often based on gut feeling (which good visualizations, such as the Deep Visualization Toolbox video, can make easier).
As an example, let's say I'm using VGG16, which is the name of a general model trained on ImageNet. I want to change it to classify distinct categories of furniture, instead of the 1000 classes of completely different things it was originally intended to classify. Because it is such a general model, it can recognize a lot of different things, from humans to animals, to cars, to furniture. But for many of those it doesn't make sense for me to incur a performance penalty, since they don't really help me classify my furniture.
Since many of the most important discoveries about these classes happen at different layers in the network, I can then move back up the convolutional layers and remove everything that seems too complex for the task at hand. This might mean removing layers that seem to have specialized in categorising human features such as ears, mouths, eyes and faces.
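As a toy sketch of that idea (random frozen "layers" stand in for VGG16's convolutional blocks, and made-up data stands in for the furniture set — none of this is real VGG16 code): keep only the early layers, freeze them, and train just a new head on top.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained network: a stack of fixed (frozen) layers.
# In practice these would be VGG16's convolutional blocks; here each
# "layer" is just a frozen random linear map + ReLU, for illustration.
frozen_layers = [rng.normal(size=(16, 16)) * 0.3 for _ in range(4)]

def features(x, n_layers):
    """Run the input through only the first n_layers frozen layers."""
    for W in frozen_layers[:n_layers]:
        x = np.maximum(W @ x, 0.0)        # linear map + ReLU
    return x

# "Transfer learning": drop the last two layers (too specialized for our
# task) and train only a new linear head on the kept features.
kept = 2
X = rng.normal(size=(200, 16))                 # toy stand-in dataset
labels = (X[:, 0] > 0).astype(float)           # toy binary labels
F = np.array([features(x, kept) for x in X])   # frozen feature extractor

w = np.zeros(16)                               # the ONLY trainable weights
for _ in range(300):                           # logistic-regression head
    p = 1 / (1 + np.exp(-(F @ w)))
    w -= 0.1 * F.T @ (p - labels) / len(labels)
```

Only the 16 head weights are updated; the frozen layers never change, which is what makes this cheap compared with training the whole network.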
As far as I know, people visualize as many layers as they find useful, and then usually make a judgement call based on instinct as to which layers to keep or throw away after that.
Images borrowed from:
By anonymous 2017-09-20
There are several ways you might define "difference" here, which will probably lead to different solutions. One simple approach is to take the last feature vector in the network before the classification layer and compare it to some sort of "ideal" feature vector for Tom Cruise's face (though you'd probably have to normalize that vector as well, so the values don't grow without bound). Then you'd have a vector difference.
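A minimal sketch of that first idea, with made-up 4-dimensional vectors standing in for the network's penultimate-layer features (a real network would produce these by running each face through the model):

```python
import numpy as np

def cosine_distance(a, b):
    """Compare two feature vectors after length-normalizing them,
    so overall activation magnitude is ignored."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)   # 0 = same direction, 2 = opposite

# Hypothetical penultimate-layer features (illustrative values only)
ideal_tom_cruise = np.array([0.9, 0.1, 0.4, 0.0])
my_face          = np.array([0.5, 0.6, 0.1, 0.3])

d = cosine_distance(my_face, ideal_tom_cruise)
```

Normalizing before comparing is the cheap version of the normalization caveat above; Euclidean distance on the normalized vectors would work similarly.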
However, I'm guessing what you're actually looking for is the difference in image form. But again, how you define "difference" might be a problem. I suspect a good way of showing the difference would be to show what should be changed in the image to make it look more like Tom Cruise. This is actually what the "Deep Dreaming" networks are (partially) about. The idea is that you compute the gradients, but instead of looking at the gradients with respect to the weights, you look at the gradients with respect to the inputs. These are the values which, if you changed the image in that direction, would help you produce an image that looks more like your target class. It's as though the training procedure is being used to update the image rather than the weights of the network (once you already have a trained network, that is).
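A toy version of that input-gradient idea, with a single linear "score" standing in for the network's Tom Cruise logit (a real network would need backpropagation to get the input gradient; here it can be written down by hand):

```python
import numpy as np

# Stand-in for a trained network's class score: s(x) = w . x.
w = np.array([1.0, -2.0, 0.5, 3.0])   # frozen "network" weights
x = np.zeros(4)                        # the "image" we will modify

score = lambda x: float(w @ x)

for _ in range(10):
    grad_wrt_input = w          # d(w.x)/dx = w: the gradient w.r.t. the INPUT
    x += 0.1 * grad_wrt_input   # nudge the image toward a higher class score
```

Note that `w` never changes: it is the image that gets updated, which is exactly the weights-vs-inputs swap described above.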
A little more than halfway through this short video, you can see something similar to what I'm suggesting. They show what should change in the image for the network to respond more strongly to the given image.