The goal of this analysis is to give a brief overview of the data used to build a predictive language model. We will be working with three English text corpora, consisting of Twitter, blog and news posts. To form valid input for a prediction algorithm, the most widespread representation is frequency-based n-grams. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a “unigram”, size 2 as a “bigram” (or, less commonly, a “digram”) and size 3 as a “trigram”. Larger sizes are usually referred to by the value of n, e.g. “four-gram”, “five-gram”, and so on. In our case an n-gram is a sequence of n words, and each corpus will be represented in a data structure called a term-document matrix: a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a term-document matrix, columns correspond to documents in the collection and rows correspond to terms. There are various schemes for determining the value that each entry in the matrix should take; in our case the entries are the absolute frequencies of the terms within each text corpus, and they will serve as the input for our predictive language model.
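To make these two concepts concrete, the following small base-R sketch builds bigram frequencies for two made-up toy “documents” and arranges them in a term-document matrix (the sentences and object names are purely illustrative):

```r
# Two toy documents
docs <- c(blog  = "i really like this new phone",
          tweet = "i like this phone so much i like it")

# Split each document into lowercase word tokens
tokens <- lapply(docs, function(d) strsplit(tolower(d), "\\s+")[[1]])

# Build bigrams by pasting each word together with its successor
bigrams <- lapply(tokens, function(w) paste(head(w, -1), tail(w, -1)))

# Term-document matrix: rows = bigrams, columns = documents,
# entries = absolute frequency of the bigram in that document
terms <- sort(unique(unlist(bigrams)))
tdm   <- sapply(bigrams, function(b) table(factor(b, levels = terms)))
tdm
```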
First, we give an overview of the original input data: file sizes, line counts and word counts.
## file size lines words
## 1 en_US.blogs.txt 210 MB 899288 37334114
## 2 en_US.news.txt 210 MB 1010242 34365936
## 3 en_US.twitter.txt 160 MB 2360148 30341028
We can see that the dataset requires about 580 MB in total and is fairly balanced at around 30-37 million words per corpus. The number of lines ranges from about 900,000 for the blog corpus to about 2.4 million for the Twitter corpus. This is not surprising, as a tweet is limited to 140 characters and therefore contains significantly fewer words than the average blog or news post. Even though the file sizes do not seem very large, representing the corpora in a term-document matrix can be demanding on RAM.
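The counts above can be gathered with a few lines of R; the following is a minimal sketch, assuming the raw files sit in a local `final/en_US/` directory (the path is an assumption):

```r
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
paths <- file.path("final", "en_US", files)   # assumed location of the raw files

stats <- t(sapply(paths, function(p) {
  lines <- readLines(p, encoding = "UTF-8", skipNul = TRUE)
  c(size_MB = round(file.info(p)$size / 1024^2),
    lines   = length(lines),
    words   = sum(lengths(strsplit(lines, "\\s+"))))
}))
stats
```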
For a proof of concept we have therefore randomly sampled 50,000 lines each of the blog and news data and 150,000 lines of the Twitter data.
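One way such a sample can be drawn is sketched below; the seed, helper name and output directory are chosen for illustration:

```r
set.seed(1234)                               # make the sample reproducible
dir.create("sample", showWarnings = FALSE)   # output directory for the samples

sample_corpus <- function(infile, outfile, n) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, n), outfile)
}

sample_corpus("final/en_US/en_US.blogs.txt",   "sample/en_US.blogs.txt",    50000)
sample_corpus("final/en_US/en_US.news.txt",    "sample/en_US.news.txt",     50000)
sample_corpus("final/en_US/en_US.twitter.txt", "sample/en_US.twitter.txt", 150000)
```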
## file size lines words
## 1 en_US.blogs.txt 12 MB 50000 2064541
## 2 en_US.news.txt 10 MB 50000 1700165
## 3 en_US.twitter.txt 10 MB 150000 1930854
The file size of this sample is limited to about 32 MB, making it possible to store it in a term-document matrix on an average PC. The number of words is still balanced at around 2 million per corpus, such that none of the corpora is over-represented. A comparison of the word counts in the different corpora can be seen in the following figure.
In this section we take a more detailed look at the n-gram representation of the input data. For this purpose, term-document matrices have been computed for uni-, bi-, tri- and 4-grams. The underlying corpora have been adjusted for stopwords, additional whitespace has been stripped, and numbers and punctuation have been removed. As stated above, these matrices can be quite large but are sparse, meaning that they include many n-gram frequencies equal to 0, which do not contribute to the performance of a predictive model. We have therefore removed all terms that have empty (i.e. zero-frequency) entries in at least 30% of the documents, so the resulting matrices only contain terms with a sparsity below 30%. It is worth mentioning that this significantly reduces the size of the term-document matrix, making it more feasible to store in memory. A sketch of this preprocessing pipeline is given below, followed by the summary statistics of the n-gram occurrences in each of the corpora.
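The pipeline can be sketched with the `tm` and `RWeka` packages as follows (shown for bigrams; lower-casing is added so that stop-word removal matches, the directory refers to the sampled files from above, and the details may differ from the actual code used):

```r
library(tm)
library(RWeka)

# Load the three sampled files as one corpus (one document per file)
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"))

# Clean-up: lower-case, remove numbers, punctuation and stopwords, strip whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Bigram tokenizer; changing min/max to 1, 3 or 4 yields the other n-grams
bigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm_bi <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))

# Drop terms that have zero counts in at least 30% of the documents
tdm_bi <- removeSparseTerms(tdm_bi, 0.3)
```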
Unigrams:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 3.00 1st Qu.: 2.00 1st Qu.: 2.00
## Median : 7.00 Median : 7.00 Median : 5.00
## Mean : 39.58 Mean : 35.57 Mean : 38.59
## 3rd Qu.: 22.00 3rd Qu.: 21.00 3rd Qu.: 16.00
## Max. :10707.00 Max. :12594.00 Max. :9548.00
Bigrams:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.000 Median : 1.000 Median : 1.000
## Mean : 2.299 Mean : 1.617 Mean : 2.334
## 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 2.000
## Max. :1318.000 Max. :688.000 Max. :2017.000
Trigrams:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 2.000 Median : 1.000 Median : 2.000
## Mean : 3.877 Mean : 2.139 Mean : 4.096
## 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.: 4.000
## Max. :188.000 Max. :104.000 Max. :239.000
4-grams:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 2.000 Median : 1.000 Median : 1.000
## Mean : 2.687 Mean : 1.744 Mean : 2.692
## 3rd Qu.: 3.000 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :36.000 Max. :18.000 Max. :46.000
Not surprisingly, the maximum number of n-gram occurrences decreases for larger sequences. On the other hand, the average number of occurrences for bi-, tri- and 4-grams shows no downward trend, which is unexpected. The above statistics, however, do not tell us which n-grams are most strongly represented in the different corpora. The 15 most frequent n-grams can be seen in the following figures.
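For reference, such top-15 frequencies can be read directly off a term-document matrix; a minimal sketch, assuming the bigram matrix `tdm_bi` from the pipeline sketch above and that the documents keep their file names:

```r
m <- as.matrix(tdm_bi)

# Overall frequency of each bigram, summed over the three corpora
total_freq <- sort(rowSums(m), decreasing = TRUE)
head(total_freq, 15)

# Top 15 for a single corpus, e.g. the Twitter sample
twitter_freq <- sort(m[, "en_US.twitter.txt"], decreasing = TRUE)
head(twitter_freq, 15)
```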
(Three figures, each a 2 x 2 grid showing the 15 most frequent uni-, bi-, tri- and 4-grams for one of the corpora.)
It can be seen that most people on Twitter write about what they do, feel and love right at this moment. This shows mostly in the high frequencies of “just”, “like” and “love”, and even more so in the 4-gram “I feel like I’m”. This is to be expected, as the 140-character limit invites quick status updates about people’s lives. A surprising result is that many people seem to send their wishes for “happy mother’s day” on May 5th (“cinco de mayo”). Mainly because Mother’s Day is on May 10th in the US, but also because people should rather congratulate in person instead of using a medium that their mother most likely is not proficient with. It is more likely, however, that “cinco de mayo” refers to a celebration day of Mexican Americans, who see this day as a source of pride and one way to honor their ethnicity.
A similar picture is drawn by the n-gram analysis of the blog corpus. According to it, people use this medium mostly to write about what they think and what they know about. This is to be expected, as such thoughts usually cannot be limited to 140 characters as on Twitter. It can mostly be inferred from the high frequencies of “I think”, “I know” and “I can”.
In contrast, news revolves around who said what, while at the same time referring mostly to “The New York Times” and “The Dow Jones Average”. While the high number of occurrences of “New York” is expected, the frequency of “I think” is a bit surprising; after all, news should be objective rather than subjective.
In conclusion, the n-grams show that the random sample represents the “tone” of each medium on the web quite well, which is a good indication that it can indeed be used for word prediction. A representative random sample of the corpora, consisting of about 6 million words, has been created. This sample includes frequencies of terms up to 4-grams, which can be used for various predictive language models. It could also be seen that, after the removal of sparse terms, the term-document matrix can be stored far more memory-efficiently. For a more representative performance measure, we plan to load the data in batches and to compute the term-document matrices in parallel. It should then be possible to combine the individual matrices, each with sparse terms removed, and store the whole dataset in a memory-efficient way.

Given these matrices it is now possible to build a conventional language model, such as an HMM or Katz’s back-off model. However, such frequency-count-derived models usually have severe problems when confronted with n-grams that have not been part of the training dataset, which makes them unfit for a word prediction model. According to the current literature, this is even the case if smoothing (assigning some of the total probability mass to unseen words or n-grams) is applied. A promising solution might be neural network language models, which use a different data structure called word embeddings. A word embedding is a function that maps the words of a language into a high-dimensional, real-valued vector space. These word vectors can then be clustered by similarity; similar words being close together allows us to generalize from one n-gram to a class of similar n-grams. Given the respective (n-1)-grams, it would then be possible to predict synonyms for the last word in a sequence of words.

The next steps will therefore include further clean-up of the input data, Good-Turing frequency estimation of unseen n-grams, computation of word embeddings, using the Brown clustering approach for a word and n-gram similarity measure, and the definition of proper input and output layers of an RNNLM (Recurrent Neural Network Language Model).
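As a first step towards the Good-Turing estimation mentioned above, the following sketch computes the adjusted counts r* = (r + 1) * N(r+1) / N(r) and the probability mass reserved for unseen n-grams from a vector of observed n-gram counts (the object `tdm_bi` again refers to the pipeline sketch above):

```r
# Observed n-gram counts, e.g. taken from the bigram term-document matrix
counts <- rowSums(as.matrix(tdm_bi))

# N_r: number of distinct n-grams occurring exactly r times
Nr <- table(counts)
r  <- as.numeric(names(Nr))

# Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r
# (NA where N_{r+1} = 0; large counts are usually left unadjusted)
r_star <- (r + 1) * as.numeric(Nr[as.character(r + 1)]) / as.numeric(Nr)

# Probability mass reserved for unseen n-grams: N_1 / N
p_unseen <- as.numeric(Nr["1"]) / sum(counts)
```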