
Word Embeddings - Methods for representing words in algorithms and neural networks

If words and entire texts are to be used as input for an algorithm or a neural network, the following problem is quickly encountered:

How can words be represented by numbers in a meaningful way so that different programs can process them further?

Which approach makes sense depends, of course, on the problem. Below we present some common methods, especially those that have proven effective for working with neural networks.

Bag-Of-Words (BOW)

The simplest approach is the Bag-of-Words (BOW) approach: after an initial cleanup of the entire text corpus (removing special characters and the like), a dictionary of all occurring words is created. This dictionary is usually sorted in descending order of occurrence count, and each word is assigned a number. The number zero is not used, because it is later reserved for words that are not in the dictionary (out-of-vocabulary, OOV). In the BOW approach, the number 1 therefore represents the most frequent word, the number 2 the second most frequent, and so on.
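A minimal sketch of how such a dictionary can be built in Python - the tiny corpus, the variable names and the cleanup are only illustrative placeholders:

```python
from collections import Counter

# Placeholder corpus - in practice this is the whole, already cleaned text corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Count how often each word occurs in the corpus
counts = Counter(word for sentence in corpus for word in sentence.split())

# Sort by descending frequency and assign indices starting at 1;
# index 0 stays reserved for out-of-vocabulary (OOV) words
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(sentence):
    """Replace every word with its dictionary index (0 for unknown words)."""
    return [word_index.get(word, 0) for word in sentence.split()]

print(word_index)                # e.g. {'the': 1, 'sat': 2, 'on': 3, ...}
print(encode("the cat barked"))  # 'barked' is not in the dictionary -> 0
```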

With the help of this dictionary, a number is assigned to each word in the text corpus. This can be enough for certain simple tasks. However, the bag-of-words approach has the following drawback:

The magnitude of the number is unrelated to the meaning of the word. Therefore, each word can only be fed into a neural network as a one-hot-encoded vector whose length equals the number of words in the dictionary. This quickly creates vectors with a length of over 100,000, each containing a single 1 and otherwise only zeros. Such so-called sparse vectors can make learning very difficult for neural networks and create a high level of complexity. This can be avoided with other methods.
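For illustration, a one-hot encoding of a single word - the dictionary size and the word's index are example values:

```python
import numpy as np

vocab_size = 100_000          # example dictionary size
cat_index = 2731              # hypothetical dictionary index of the word "cat"

one_hot = np.zeros(vocab_size)
one_hot[cat_index] = 1.0      # a single 1 among 100,000 zeros

print(one_hot.shape, int(one_hot.sum()))  # (100000,) 1  -> extremely sparse
```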

Unsupervised Word Vectors - Word2Vec, GloVe

Ideally, the mathematical representation of a word should also represent the meaning of that word in the context of other words.

A widely used approach to achieve this is to have an algorithm arrange all words of the text corpus in a multidimensional space, based on how often they occur and which words surround them, in such a way that words that often appear in similar contexts also receive similar vectors. Such algorithms use so-called unsupervised learning: without external labels, they search on their own for the representation that minimizes the distance between related words.
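A sketch of such unsupervised training with the gensim library (parameter names follow gensim 4.x); the tiny corpus is a placeholder - in practice millions of tokenized sentences are used:

```python
from gensim.models import Word2Vec

# Placeholder corpus: each sentence is already tokenized into a list of words
sentences = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# Unsupervised training: words that appear in similar contexts
# end up with similar vectors
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

print(model.wv["king"].shape)                # (100,) - a dense 100-dimensional vector
print(model.wv.similarity("king", "queen"))  # similarity between two word vectors
```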

The result is a vector representation for each word contained in the text corpus. Depending on the model, this vector has between 100 and 300 dimensions and is therefore significantly less complex than, for example, a one-hot-encoded BOW vector with 100,000 dimensions. Above all, each word vector also captures at least a rudimentary semantic meaning of the word in its context. With pre-trained Word2Vec or GloVe models you can, for example, do arithmetic with words:

The vector for the word KING minus the vector for the word MAN plus the vector for the word WOMAN actually yields the vector for QUEEN. Not exactly, of course, but the vector for QUEEN is the one closest to the result and can therefore be clearly identified as the answer.

KING - MAN + WOMAN = QUEEN

The following also works, for example:

BERLIN - GERMANY + FRANCE = PARIS
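Both analogies can be tried out with pre-trained GloVe vectors via gensim's downloader - the model name is one of gensim's pre-packaged downloads (several hundred MB):

```python
import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors
glove = api.load("glove-wiki-gigaword-100")

# KING - MAN + WOMAN: the word whose vector is closest to the result
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# BERLIN - GERMANY + FRANCE
print(glove.most_similar(positive=["berlin", "france"], negative=["germany"], topn=1))
```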

So the models actually contain a relatively meaningful representation of the meaning of words. Of course, this meaning is not comprehensive and only applies to individual words, not to their roles in a sentence or to entire passages. The practical vector format, however, can be processed well by neural networks of all kinds, and Word2Vec, GloVe and other pre-trained models are therefore the basis for many current applications.

The disadvantage of unsupervised word vectors is that you need a very large and as clean a text corpus as possible to generate a meaningful vector representation. In addition, the algorithms are very computationally intensive. For individual projects, this is usually not feasible, because the amount of data required for meaningful training is simply not available.

When using pre-trained models such as Word2Vec and GloVe, a further problem arises: these models only know the words from their training corpus - all words that were not present during training receive a zero vector. This can be particularly problematic in specialized fields with their own jargon. If, for example, a project in the field of health insurance is to be implemented and the pre-trained models do not know important terms such as daily sickness allowance or additional dental insurance and assign them zero vectors, even the best neural network architecture will find it very difficult to learn the desired content.
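A small sketch of this OOV problem with the pre-trained GloVe vectors from above; the helper function and the example words are illustrative:

```python
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # same pre-trained vectors as above

def vector_for(word, model, dim=100):
    """Return the pre-trained vector, or a zero vector for unknown (OOV) words."""
    return model[word] if word in model else np.zeros(dim)

print(vector_for("insurance", glove)[:3])          # common word -> a real, learned vector
print(vector_for("krankentagegeld", glove).sum())  # domain jargon, likely OOV -> 0.0
```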

Another disadvantage is the amount of main memory that pre-trained models require: to have quick access to the word vectors, the complete dictionary including the associated vectors must be held in memory. This quickly takes up 1-2 GB of RAM just for the model. Some libraries such as spaCy have therefore switched to having the word vectors computed downstream by a convolutional neural network (CNN) - then only the CNN model has to be loaded, and it creates the desired vector for each word on request. Supervised word vectors are an alternative to Word2Vec and co.

Supervised Word Vectors - Tensorflow Embedding

An alternative to the models trained with unsupervised learning for representing words in a vector space are supervised word vectors. Even with a relatively small corpus, very good results can be achieved, especially in the area of text classification. As with all supervised learning methods, a pre-classified training data set is required. A so-called embedding layer of a neural network can then be trained to choose the vector for each word so that it lies close to the classes in which the word occurs frequently and further away from classes in which it occurs rarely or not at all. The open-source library TensorFlow from Google, for example, offers such a trainable embedding layer.
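A minimal sketch of such a trainable embedding layer with TensorFlow/Keras; the dictionary size, vector length and word indices are placeholder values:

```python
import tensorflow as tf

vocab_size = 10_000   # number of words in the dictionary (index 0 = OOV)
embedding_dim = 64    # length of the learned word vectors

# Trainable lookup table: word index in, word vector out; the vectors are
# adjusted during supervised training of the surrounding network
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

word_indices = tf.constant([[1, 42, 7]])  # a short text as dictionary indices
print(embedding(word_indices).shape)      # (1, 3, 64) - one 64-dim vector per word
```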

The word vector learned in this way therefore contains information about how likely it is that the word is used in the context of a class. For classification, you can, for example, simply add the vectors of all words in a text and divide by the number of words - this approach is surprisingly effective and robust. Of course, the vectors generated this way can also be fed into a recurrent neural network (RNN) such as an LSTM, which reads in each word individually and in the correct order and thereby learns the classification.
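Both variants sketched in Keras - averaging the word vectors versus reading them with an LSTM; sizes, class count and the (commented-out) training data are assumptions:

```python
import tensorflow as tf

vocab_size, embedding_dim, num_classes = 10_000, 64, 5   # placeholder values

# Variant 1: average all word vectors of a text, then classify
averaging_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),   # sum of word vectors / number of words
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# Variant 2: feed the word vectors into an LSTM word by word, in order
lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

for model in (averaging_model, lstm_model):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(padded_word_indices, class_labels, epochs=10)  # pre-classified data
```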

Outlook - text processing at the character level

Research is working intensively on using individual characters instead of words as input for neural networks. In some areas this already works very well. For example, new texts can be generated dynamically: a neural network reads text character by character and generates the continuation character by character. Such sequence-to-sequence models are used, for example, in the field of language translation, but in the future they will also appear in chat systems.

by Roland Becker