12. Learning: Neural Nets, Back Propagation
1174 16 177481
MIT 6.034 Artificial Intelligence, Fall 2010
View the complete course: http://ocw.mit.edu/6-034F10
Instructor: Patrick Winston
How do we model neurons? In the neural net problem, we want a set of weights that makes the actual output match the desired output. We use a simple neural net to work out the back propagation algorithm, and show that it is a local computation.
License: Creative Commons BY-NC-SA
More information at http://ocw.mit.edu/terms
More courses at http://ocw.mit.edu
By anonymous 2017-09-20
TLDR: Word2Vec is building word projections (embeddings) in a latent space of N dimensions, (N being the size of the word vectors obtained). The float values represents the coordinates of the words in this N dimensional space.
The major idea behind latent space projections, putting objects in a different and continuous dimensional space, is that your objects will have a representation (a vector) that has more interesting calculus characteristics than basic objects.
Word2Vec algorithms do this:
Imagine that you have a sentence:
The dog has to go ___ for a walk in the park.
You obviously want to fill the blank with the word "outside" but you could also have "out". The w2v algorithms are inspired by this idea. You'd like all words that fill in the blanks near, because they belong together. Therefore the words "out" and "outside" will be closer together whereas a word like "carrot" would be farther away.
This is sort of the "intuition" behind word2vec. For a real mathematical explanation of what's going on i'd suggest reading:
- GloVe: Global Vectors for Word Representation
- Linguistic Regularities in Sparse and Explicit Word Representations
- Neural Word Embedding as Implicit Matrix Factorization
For paragraph vectors, the idea is the same as in w2v. Each paragraph can be represented by its words. Two models are presented in the paper.
- In a "Bag of Word" way (the pv-dbow model) where one fixed length paragraph vector is used to predict its words.
- By adding a fixed length paragraph token in word contexts (the pv-dm model). By retropropagating the gradient they get "a sense" of what's missing, bringing paragraph with the same words/topic "missing" close together.
The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. [...] The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph
For full understanding on how these vectors are built you'll need to learn how neural nets are built and how the backpropagation algorithm works. (i'd suggest starting by this video and Andrew NG's Coursera class)
NB: Softmax is just a fancy way of saying classification, each word in w2v algorithms is considered as a class. Hierarchical softmax/negative sampling are tricks to speed up softmax and handle a lot of classes.