Text Mining: GloVe

First of all, GloVe was invented by Jeffrey Pennington, Richard Socher, and Christopher D. Manning. The original paper, which I used as my main reference, can be found here.


About this Project

This project was part of a text mining course, which ran from October 2017 to March 2018. The course was outlined in 17 topics. The topics start easy with a brief survey of text mining and general classification methods, followed by an introduction to some machine learning topics.

After that introduction, the text mining topics start with text classification and text similarity, followed by sentiment analysis. Next came my topic, GloVe. The last, more advanced topics cover, for example, convolutional neural networks for sentence classification and generative adversarial text to image synthesis. You can take a look at the full list of topics in the GitHub Readme.

In this post I want to present the work I have done to explain GloVe. It starts with the general idea, followed by the mathematics behind the model. After that, I present the R package text2vec and my attempts at creating my own word embeddings from the full Wikipedia dump. Finally, I show how to validate the word vectors using my results.

The last thing to say here is: have fun reading and exploring GloVe! If you have comments or issues, I would be very happy to receive them in the issue tracker of the repository.


The Idea of GloVe

The creators of GloVe try to preserve the linear structure of the word vector space. They intend to use this linearity to define similarities between words. If you take a look at the animation below, you can see what this means. For instance, if we ask which word relates to Germany as Paris relates to France, we expect the model to "say" Berlin. Looking at the illustration, you might notice that the purple points (France and Paris) seem to have the same structure relative to each other as the orange points (Germany and Berlin). It is possible to describe these analogies by calculating the vector \(w_{germany} + w_{paris} - w_{france}\). Ideally, this new vector should point to \(w_{berlin}\). Here \(w\) denotes the vector of a word.
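
To make the analogy concrete, here is a minimal R sketch of this calculation. It assumes a matrix `word_vectors` whose rows are trained word embeddings and whose row names are the words (for example, an embedding matrix of the kind text2vec produces, shown later in this post); `find_closest` is a hypothetical helper defined here only for illustration.

```r
# Assumption: word_vectors is a matrix with one row per word and the
# words as row names. find_closest is a hypothetical helper that ranks
# all words by cosine similarity to a query vector.
find_closest <- function(word_vectors, query, n = 5) {
  # cosine similarity between the query vector and every word vector
  sims <- (word_vectors %*% query) /
    (sqrt(rowSums(word_vectors^2)) * sqrt(sum(query^2)))
  head(sort(sims[, 1], decreasing = TRUE), n)
}

# "Which word relates to Germany as Paris relates to France?"
query <- word_vectors["germany", ] + word_vectors["paris", ] - word_vectors["france", ]
find_closest(word_vectors, query)  # ideally "berlin" ranks near the top
```

With well-trained vectors, the query word itself and the expected answer usually occupy the top ranks, which is why analogy tasks are often evaluated on the top few neighbours rather than only the single closest word.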

Another point is that we want related words to be close to each other. Take a look at the two red bubbles: one contains animals, while the other contains Italian cities. We can also think of many more relationships, such as singular and plural forms of words, and so on. But here we have a problem, at least in 3 dimensions: we are not able to model all of these relationships in such a low-dimensional space. Therefore, the dimension of a word vector is normally chosen quite high (e.g. 500). A small similarity check along these lines is sketched below.
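
As a small, hedged illustration of this closeness, one can compare a few words pairwise by cosine similarity. The word choices below are assumptions picked to mirror the two red bubbles, and `word_vectors` is the assumed embedding matrix from the sketch above.

```r
# Pairwise cosine similarities for a few example words (assumed to be in
# the vocabulary). With good embeddings, the animals and the cities
# should show up as two clearly separated clusters.
words <- c("dog", "cat", "rome", "milan")
vecs  <- word_vectors[words, ]
norms <- sqrt(rowSums(vecs^2))
round((vecs %*% t(vecs)) / (norms %o% norms), 2)
```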

Basically, the relationship between two words seems to be "stored" in the difference of their vectors. This is what is meant by preserving the linear structure. If you think the example below was just made up to illustrate the idea, I have to disappoint you: the animation was actually created from trained 3-dimensional word vectors, and even there we can see some correct relationships.
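
One way to check this "stored in the difference" claim numerically is to compare the two difference vectors directly; if the structure really is linear, they should point in roughly the same direction. The snippet below is again only a sketch that reuses the assumed `word_vectors` matrix.

```r
# Cosine similarity between the two analogy directions. A value close to 1
# means (paris - france) and (berlin - germany) point in nearly the same
# direction, i.e. the relation is captured by the difference vector.
d1 <- word_vectors["paris", ]  - word_vectors["france", ]
d2 <- word_vectors["berlin", ] - word_vectors["germany", ]
sum(d1 * d2) / (sqrt(sum(d1^2)) * sqrt(sum(d2^2)))
```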


The Presentation

Most of the text I have written here comes from this presentation. It additionally contains more mathematics explaining the model and more about the R package. Of course, the presentation also covers some material I have not mentioned here. Nevertheless, it is a presentation, so some images, code snippets, and other content are not explained as thoroughly as in this post. I think the post and the presentation complement each other quite well.