Predictive analysis of text data requires specific treatment due to the intrinsic characteristics of this type of data (such as high dimensionality and sparsity). Here we explore a corpus made up of three data sets as a preliminary step towards building a text prediction algorithm.
More specifically, the data comes from a corpus called HC Corpora (www.corpora.heliohost.org; the page is down at the time of writing this report). It consists of one data set from Twitter, another from news sites, and a last one from blog sites.
Due to memory and computing-time constraints (my computer has limited capacity), a small sample is used in this analysis, with a size of 5% of the original data.
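For reference, the sampling can be done along these lines (the file names and the `sample_lines` helper are illustrative assumptions, not the exact code used):

```r
set.seed(1234)                                   # for reproducibility
sample_lines <- function(path, prop = 0.05) {
  lines <- readLines(path, encoding = "UTF-8")
  lines[rbinom(length(lines), 1, prop) == 1]     # keep roughly 5% of the lines
}

twitter_sample <- sample_lines("en_US.twitter.txt")
news_sample    <- sample_lines("en_US.news.txt")
blogs_sample   <- sample_lines("en_US.blogs.txt")
```

To give a first impression of what the data looks like, the heads of these three document samples are shown below.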
## [1] "isn't it ironic that i don't follow you? Tagging me in tweets sure shows me"
## [2] "well said"
## [3] "Chen wow! 2 k's in the top of the first including Lord Hamilton"
## [4] "I'm a manly muppet!!!"
## [5] "are you located in the uptown or CBD area?"
## [1] "The stoppage was seen as a test of whether unions have the support to stage a full-blown general strike over labor market reforms the Socialist government says it will impose by decree very soon if the unions do not reach agreement on their own with management. The reforms are deemed critical to resurrecting Spain's moribund economy and reassuring jittery investors who have sent the government's borrowing costs soaring."
## [2] "So what could possibly balance out these horror-show headlines? For all the mistakes made by voters and their elected officials, in 2008 they also proved they could adapt to worsening conditions. It was a year of historic change that raises high hopes that the future will be better than the present."
## [3] "\"I think after you get here, you set that goal for yourself every year,\" Aldridge said Friday. \"Not coming wouldn't be good for me now.\""
## [4] "The answer depends on how the land is used and the outcome of additional soil testing in the works. For example, the most problematic area is the southeast corner of the parcel, where the Vikings' plan calls for a parking lot. State pollution officials said if more contamination is found, they could require the developer to remove soil to a depth of two feet before it lays asphalt."
## [5] "That he agreed to work with Stupor is another way he came to our community. This big filmmaker working on this little zine? That's amazing."
## [1] "Alderman Griffiths was local-born and headmaster of nearby primary school, Fir Tree Lane. Mr.Walker lived in Hamstead Garden Suburb. Peter Griffiths lost his seat in March 1966 to Andrew Faulds (Labour) who lived in Stratford âupon-Avon!"
## [2] "Sometimes they could obtain pork, which made them feel quite special."
## [3] "Last I smelled, both Baldwin and Sarandon were still polluting the ozone layer above Hollywood."
## [4] "Children in the Wind (Japanâ¦Hiroshi Shimizu)"
## [5] "faith is given to you to extinguish all the fiery darts of the enemy."
In order to obtain a final distribution of words (or tokens), extensive preprocessing was performed on the data. After dividing the data into a training set and a test set, the tokens were obtained following a top-down analysis: in a first step, the text was divided into phrases using punctuation marks (period, comma, colon, question mark…) as separators; in a second step, tokens were obtained using blank spaces and slashes as separators.
A dictionary was used to remove profanity from the data set. The dictionary can be found at http://www.bannedwordlist.com/swearwordresources.html
Numbers encountered in the data were labeled as <number>.
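A minimal sketch of this preprocessing pipeline, assuming the sampled corpus is available as a character vector (e.g. `twitter_sample`) and the profanity list has been saved as `swearWords.txt` (both names are assumptions):

```r
profanity <- readLines("swearWords.txt")             # downloaded profanity list

tokenize <- function(docs) {
  # Step 1: split each line into phrases at punctuation marks
  phrases <- unlist(strsplit(docs, "[.,:;?!]+"))
  # Step 2: split phrases into tokens at blank spaces and slashes
  tokens <- tolower(unlist(strsplit(phrases, "[ /]+")))
  tokens <- tokens[tokens != ""]
  tokens[grepl("^[0-9]+$", tokens)] <- "<number>"    # label numbers
  tokens[!tokens %in% profanity]                     # remove profanity
}

twitter_tokens <- tokenize(twitter_sample)
```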
Tiny data sets based on tokens and token distributions are the basis for this exploratory analysis. A table with the number of words and lines in each corpus is presented below, as well as a bar plot for easier comparison.
| Corpus | Words | Lines | Words per line |
|---|---|---|---|
| News | 133,158 | 3,863 | 34.47 |
| Twitter | 1,507,272 | 118,008 | 12.77 |
| Blogs | 1,882,228 | 44,965 | 41.86 |
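A sketch of how these counts can be obtained from the samples (the `corpus_stats` helper and the sample object names are assumptions):

```r
corpus_stats <- function(lines) {
  n_lines <- length(lines)
  n_words <- sum(sapply(strsplit(lines, "\\s+"), length))
  c(Words = n_words, Lines = n_lines, `Words per line` = round(n_words / n_lines, 2))
}

rbind(News    = corpus_stats(news_sample),
      Twitter = corpus_stats(twitter_sample),
      Blogs   = corpus_stats(blogs_sample))
```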
“Blogs” is the corpus with the largest number of words, followed by the Twitter data set, which however has a much lower number of words per line. The news corpus is the smallest, with only about 4,000 lines (keeping in mind that this is a 5% sample of the original data).
Below, three tables present the 10 most frequent words in each corpus, sorted by frequency in decreasing order.
| Twitter token | Frequency | Relative frequency |
|---|---|---|
| the | 92771 | 0.05 |
| and | 54549 | 0.03 |
| to | 53502 | 0.03 |
| a | 44950 | 0.02 |
| of | 44237 | 0.02 |
| i | 39063 | 0.02 |
| in | 29720 | 0.02 |
| that | 23036 | 0.01 |
| is | 21531 | 0.01 |
| it | 20259 | 0.01 |
| News token | Frequency | Relative frequency |
|---|---|---|
| the | 7829 | 0.06 |
| to | 3583 | 0.03 |
| a | 3428 | 0.03 |
| and | 3423 | 0.03 |
| of | 3006 | 0.02 |
| <number> | 2882 | 0.02 |
| in | 2572 | 0.02 |
| for | 1348 | 0.01 |
| that | 1341 | 0.01 |
| is | 1068 | 0.01 |
| Blogs token | Frequency | Relative frequency |
|---|---|---|
| the | 92771 | 0.05 |
| and | 54549 | 0.03 |
| to | 53502 | 0.03 |
| a | 44950 | 0.02 |
| of | 44237 | 0.02 |
| i | 39063 | 0.02 |
| in | 29720 | 0.02 |
| that | 23036 | 0.01 |
| is | 21531 | 0.01 |
| it | 20259 | 0.01 |
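These tables can be reproduced with a simple frequency count over the token vectors; a sketch (the `freq_table` helper and the token object names are assumptions):

```r
freq_table <- function(tokens, n = 10) {
  tab <- sort(table(tokens), decreasing = TRUE)[1:n]
  data.frame(token = names(tab),
             freq  = as.integer(tab),
             rel   = round(as.integer(tab) / length(tokens), 2))
}

freq_table(twitter_tokens)
```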
From these tables one can get a feeling for what the token distribution will look like: essentially, a low percentage of words covers a huge part of the text, and the majority of these most frequent words are function words (stop words): prepositions, conjunctions, determiners and pronouns. These words are not very useful for deriving meaning from a text, but they have a huge value when it comes to predicting the next word based on the previous ones.
What follows is the log distribution of token frequency, showing the characteristics of the data commented on in the previous paragraph from another perspective: notice the extreme skewness, meaning that there are many words that appear just a few times (left side of the histogram) and a limited number of words with a high frequency (right side; also see the tables above), which cover an important part of the whole corpus.
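The histogram can be reproduced roughly as follows (the token object name is an assumption):

```r
freqs <- table(news_tokens)
hist(log(as.integer(freqs)),
     main = "Log distribution of token frequency (news corpus)",
     xlab = "log(token frequency)")
```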
The frequencies of 2-grams and 3-grams follow a similar distribution to that of 1-grams, with the function words occupying the highest positions, although the distribution is not as skewed. What follows is a set of plots showing the distribution of the n-grams in relation to the proportion of text they cover.
It can be seen that with a few tokens we cover a large share of the text. This skewness tends to decrease as the n-gram size increases. As an example, we present next a table with the proportion of tokens needed to cover 50%, 80% and 90% of the text for 1-, 2- and 3-grams in the news corpus (a sketch of this coverage computation follows the table).
| Text covered | Prop. of 1-gram tokens | Prop. of 2-gram tokens | Prop. of 3-gram tokens |
|---|---|---|---|
| 50% | 0.01 | 0.21 | 0.46 |
| 80% | 0.14 | 0.69 | 0.81 |
| 90% | 0.30 | 0.93 | 0.90 |
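The coverage figures can be computed from the sorted frequency tables; a sketch (the `coverage` helper and the token object name are assumptions):

```r
coverage <- function(freqs, target = 0.5) {
  freqs   <- sort(freqs, decreasing = TRUE)
  covered <- cumsum(freqs) / sum(freqs)          # share of text covered so far
  which(covered >= target)[1] / length(freqs)    # proportion of token types needed
}

sapply(c(0.5, 0.8, 0.9), function(p) coverage(table(news_tokens), p))
```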
There is a large number of tokens with a very low frequency; the predictive value of these instances will be practically null. We can use this information to establish a prediction criterion for when an unseen word (a word that is not in the training set) is encountered.
A pruning of the data, consisting of removing the lowest-frequency tokens, will be carried out. Before that, a sample of the words with a count equal to 1 will be relabeled as <UNK> (for unknown word). This information will later be used by the prediction algorithm when confronted with unseen words.
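A sketch of the pruning and <UNK> relabeling step (the object names and the fraction of singletons relabeled are assumptions):

```r
freqs      <- table(news_tokens)
singletons <- names(freqs)[freqs == 1]

# Relabel a sample of the count-1 words as <UNK>
unk_words <- sample(singletons, length(singletons) %/% 2)
news_tokens[news_tokens %in% unk_words] <- "<UNK>"

# Prune: recompute frequencies and drop the remaining lowest-frequency tokens
freqs  <- table(news_tokens)
pruned <- freqs[freqs > 1 | names(freqs) == "<UNK>"]
```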
With the remaining data, an n-gram model for prediction will be built. As can be seen from the tables, the frequency of a given token has been computed and will be used as an estimate of its probability. The prediction will consist of picking the word whose preceding n-gram (a large one, if possible) has the highest joint frequency. In case the predicted value is <UNK>, the prediction will come from the (n-1)-gram with the highest frequency, and so on.
What follows is the head of the 3-gram news frequency table, just to show that the 3-gram joint frequencies have also been computed.
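A sketch of this back-off idea, assuming the n-gram frequency tables are data frames with `token` and `freq` columns and the n-grams stored as space-separated strings (as in the table shown below); the `predict_next` helper is an illustration, not the final algorithm:

```r
predict_next <- function(context, ngram3, ngram2) {
  words <- strsplit(tolower(context), " ")[[1]]
  last2 <- paste(tail(words, 2), collapse = " ")

  # Candidate trigrams whose first two words match the end of the context
  cand <- ngram3[grepl(paste0("^", last2, " "), ngram3$token), ]
  best <- if (nrow(cand)) sub(".* ", "", cand$token[which.max(cand$freq)]) else "<UNK>"

  if (best == "<UNK>") {
    # Back off to the bigram table, keyed on the last word only
    last1 <- tail(words, 1)
    cand <- ngram2[grepl(paste0("^", last1, " "), ngram2$token), ]
    best <- if (nrow(cand)) sub(".* ", "", cand$token[which.max(cand$freq)]) else "<UNK>"
  }
  best
}
```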
## token freq rel
## 53685 one of the 60 0.0006144
## 47507 more than <number> 43 0.0004403
## 2285 a lot of 41 0.0004198
## 723 <number> percent of 37 0.0003789
## 904 <number> to <number> 36 0.0003686
## 38177 in the first 35 0.0003584
## 30757 going to be 30 0.0003072
## 3994 according to the 29 0.0002970
## 55655 part of the 27 0.0002765
## 51561 of the <number> 26 0.0002662
In order to make the model even more efficient, the main data structure will be turned into a trie with three levels based on the first three letters of the n-grams.
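A nested-list sketch of such a three-level structure (the actual implementation may differ; `ngrams` is assumed to be a frequency data frame with a `token` column):

```r
build_trie <- function(ngrams) {
  trie <- list()
  for (i in seq_len(nrow(ngrams))) {
    # First three letters of the n-gram, padded so very short tokens still work
    tok    <- sprintf("%-3s", as.character(ngrams$token[i]))
    key    <- substring(tok, 1:3, 1:3)
    bucket <- trie[[key[1]]][[key[2]]][[key[3]]]   # NULL until the first insert
    trie[[key[1]]][[key[2]]][[key[3]]] <- rbind(bucket, ngrams[i, ])
  }
  trie
}
# A lookup then only scans the small bucket trie[[l1]][[l2]][[l3]]
# instead of the whole frequency table.
```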
An application using the shiny package for R will be developed. It will consist of a text input field and a set of options (three or more) from which the user can select the next word. These options will update as letters are typed.
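A minimal shiny sketch of the planned interface (the widget names and the `predict_top3` helper are assumptions):

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type your text:"),
  radioButtons("choice", "Suggested next word:",
               choices = c("the", "to", "and"))   # placeholders until the model answers
)

server <- function(input, output, session) {
  observe({
    # Recompute the suggestions every time the typed text changes
    suggestions <- predict_top3(input$phrase)     # assumed helper returning 3 candidate words
    updateRadioButtons(session, "choice", choices = suggestions)
  })
}

shinyApp(ui = ui, server = server)
```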