Predictive analysis of text data requires specific treatment due to the intrinsic characteristics of this type of data (such as high dimensionality and sparsity). Here we explore a corpus made of three data sets as a preliminary step toward building a predictive text algorithm.

More specifically, the data comes from a corpus called HC Corpora (www.corpora.heliohost.org; the page is down at the moment of writing this report). It consists of one data set from Twitter, another from news sites and a last one from blog sites.


Due to memory and computing time constraints (my computer has limited capacity), a small sample is analyzed here, with a size of 5% of the original data.

To give a first impression of what the data looks like, the head of each of the three document samples follows.

Head of the Twitter sample data
## [1] "isn't it ironic that i don't follow you? Tagging me in tweets sure shows me"
## [2] "well said"                                                                  
## [3] "Chen wow! 2 k's in the top of the first including Lord Hamilton"            
## [4] "I'm a manly muppet!!!"                                                      
## [5] "are you located in the uptown or CBD area?"
Head of the News sample data
## [1] "The stoppage was seen as a test of whether unions have the support to stage a full-blown general strike over labor market reforms the Socialist government says it will impose by decree very soon if the unions do not reach agreement on their own with management. The reforms are deemed critical to resurrecting Spain's moribund economy and reassuring jittery investors who have sent the government's borrowing costs soaring."
## [2] "So what could possibly balance out these horror-show headlines? For all the mistakes made by voters and their elected officials, in 2008 they also proved they could adapt to worsening conditions. It was a year of historic change that raises high hopes that the future will be better than the present."                                                                                                                           
## [3] "\"I think after you get here, you set that goal for yourself every year,\" Aldridge said Friday. \"Not coming wouldn't be good for me now.\""                                                                                                                                                                                                                                                                                           
## [4] "The answer depends on how the land is used and the outcome of additional soil testing in the works. For example, the most problematic area is the southeast corner of the parcel, where the Vikings' plan calls for a parking lot. State pollution officials said if more contamination is found, they could require the developer to remove soil to a depth of two feet before it lays asphalt."                                       
## [5] "That he agreed to work with Stupor is another way he came to our community. This big filmmaker working on this little zine? That's amazing."
Head of the Blog sample data
## [1] "Alderman Griffiths was local-born and headmaster of nearby primary school, Fir Tree Lane. Mr.Walker lived in Hamstead Garden Suburb. Peter Griffiths lost his seat in March 1966 to Andrew Faulds (Labour) who lived in Stratford –upon-Avon!"
## [2] "Sometimes they could obtain pork, which made them feel quite special."                                                                                                                                                                          
## [3] "Last I smelled, both Baldwin and Sarandon were still polluting the ozone layer above Hollywood."                                                                                                                                                
## [4] "Children in the Wind (Japan…Hiroshi Shimizu)"                                                                                                                                                                                                 
## [5] "faith is given to you to extinguish all the fiery darts of the enemy."

Preliminary cleaning and filtering

In order to obtain a final distribution of words (or tokens), extensive preprocessing was performed on the data. After dividing the data into a training set and a test set, the tokens were obtained following a top-down analysis: in a first step, the text was split into phrases using punctuation marks (period, comma, colon, question mark…) as separators; in a second step, tokens were obtained using blank spaces and slashes as separators.

A dictionary was used to remove profanity from the data set. The word list can be found at http://www.bannedwordlist.com/swearwordresources.html

Numbers encountered in the data were labeled as <number>.
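For reference, the following is a minimal sketch of this preprocessing in base R; the `lines` vector and the profanity file name are assumptions for illustration, not the exact code used in this report.

```r
## Minimal preprocessing sketch (assumes `lines` holds the raw sampled text and
## "swearWords.txt" is the downloaded profanity list; file name hypothetical).

# Step 1: split each line into phrases at punctuation marks.
phrases <- unlist(strsplit(lines, "[.,:;!?]+"))

# Step 2: split phrases into tokens at blank spaces and slashes.
tokens <- tolower(unlist(strsplit(phrases, "[[:space:]/]+")))
tokens <- tokens[tokens != ""]

# Step 3: label numbers as <number>.
tokens[grepl("^[0-9]+([.,][0-9]+)?$", tokens)] <- "<number>"

# Step 4: drop tokens found in the profanity dictionary.
profanity <- readLines("swearWords.txt", warn = FALSE)
tokens    <- tokens[!(tokens %in% tolower(profanity))]
```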

Basic figures

Small data sets based on tokens and token distributions are the basis for this exploratory analysis. A table with the number of words and lines in each corpus is presented below, as well as a bar plot for a better comparison.

| Corpus  | Words     | Lines   | Words per line |
|---------|-----------|---------|----------------|
| News    | 133,158   | 3,863   | 34.47          |
| Twitter | 1,507,272 | 118,008 | 12.77          |
| Blogs   | 1,882,228 | 44,965  | 41.86          |

[Bar plot: number of words and lines per corpus]

“Blogs” is the largest corpus in terms of words, while the Twitter data set has the highest number of lines but the lowest number of words per line. The news corpus is the smallest, with only about 4,000 lines (keeping in mind that this is a 5% sample of the original data).
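Counts like those in the table can be obtained along these lines (a base R sketch; the sample object names are assumptions):

```r
## Sketch: word and line counts per corpus (sample object names assumed).
count_corpus <- function(x) {
  words <- sum(vapply(strsplit(x, "[[:space:]]+"), length, integer(1)))
  c(Words = words, Lines = length(x), Words.per.line = round(words / length(x), 2))
}

samples <- list(News = news_sample, Twitter = twitter_sample, Blogs = blogs_sample)
t(sapply(samples, count_corpus))
```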

Token distribution

Below, three tables present the 10 most frequent tokens in each corpus, sorted by frequency in decreasing order.

| Twitter token | Frequency | Relative frequency |
|---------------|-----------|--------------------|
| the           | 92771     | 0.05               |
| and           | 54549     | 0.03               |
| to            | 53502     | 0.03               |
| a             | 44950     | 0.02               |
| of            | 44237     | 0.02               |
| i             | 39063     | 0.02               |
| in            | 29720     | 0.02               |
| that          | 23036     | 0.01               |
| is            | 21531     | 0.01               |
| it            | 20259     | 0.01               |
| News token | Frequency | Relative frequency |
|------------|-----------|--------------------|
| the        | 7829      | 0.06               |
| to         | 3583      | 0.03               |
| a          | 3428      | 0.03               |
| and        | 3423      | 0.03               |
| of         | 3006      | 0.02               |
| <number>   | 2882      | 0.02               |
| in         | 2572      | 0.02               |
| for        | 1348      | 0.01               |
| that       | 1341      | 0.01               |
| is         | 1068      | 0.01               |
| Blogs token | Frequency | Relative frequency |
|-------------|-----------|--------------------|
| the         | 92771     | 0.05               |
| and         | 54549     | 0.03               |
| to          | 53502     | 0.03               |
| a           | 44950     | 0.02               |
| of          | 44237     | 0.02               |
| i           | 39063     | 0.02               |
| in          | 29720     | 0.02               |
| that        | 23036     | 0.01               |
| is          | 21531     | 0.01               |
| it          | 20259     | 0.01               |

From these tables one can get a feeling for how the token distribution will look: essentially, a low percentage of words covers a huge part of the text, and the majority of these most frequent words are function words (stop words): prepositions, conjunctions, determiners and pronouns. These words are not very useful for deriving meaning from a text, but they have huge value when it comes to predicting the next word based on the previous ones.
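A unigram frequency table like the ones above can be built along these lines (a sketch assuming the `tokens` vector from the preprocessing step; column names chosen to match the tables in this report):

```r
## Sketch: unigram frequency table with relative frequencies.
freqs    <- sort(table(tokens), decreasing = TRUE)
unigrams <- data.frame(token = names(freqs),
                       freq  = as.integer(freqs),
                       rel   = round(as.integer(freqs) / sum(freqs), 2),
                       stringsAsFactors = FALSE)
head(unigrams, 10)   # the ten most frequent tokens
```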

What follows is the log distribution of token frequency, showing the characteristics of the data commented on in the previous paragraph from another perspective: notice the extreme skewness, meaning that there are a lot of words that appear just a few times (left side of the histogram) and a limited number of words with a high frequency (right side; see also the tables above) which cover an important part of the whole corpus.

[Histograms: log distribution of token frequency per corpus]
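The histogram can be reproduced roughly as follows (a base graphics sketch, assuming the `unigrams` table from the previous snippet):

```r
## Sketch: histogram of log token frequency (uses the `unigrams` table above).
hist(log(unigrams$freq),
     breaks = 50,
     main   = "Log frequency of unigram tokens",
     xlab   = "log(frequency)")
```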

2-grams and 3-grams

The frequencies of 2-grams and 3-grams follow a distribution similar to that of 1-grams, with function words occupying the highest positions, although the distribution is not as skewed. What follows is a set of plots showing the distribution of the n-grams in relation to the portion of text covered by them.

[Plots: n-gram frequency distributions versus text covered, for 1-, 2- and 3-grams]

It can be seen that a few tokens cover a sizeable portion of the text. This skewness tends to decrease as the n-gram size increases. As an example, we present next a table with the proportion of tokens needed to cover 50%, 80% and 90% of the text for 1-, 2- and 3-grams in the news corpus.

| Text covered | Prop. of 1-gram tokens | Prop. of 2-gram tokens | Prop. of 3-gram tokens |
|--------------|------------------------|------------------------|------------------------|
| 50%          | 0.01                   | 0.21                   | 0.46                   |
| 80%          | 0.14                   | 0.69                   | 0.81                   |
| 90%          | 0.30                   | 0.93                   | 0.90                   |
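The n-gram counts and coverage figures behind this table can be sketched with RWeka's NGramTokenizer; the snippet below is illustrative and assumes the `phrases` vector from the preprocessing step, not this report's exact chunk code.

```r
## Sketch: trigram frequencies and text coverage (assumes `phrases` from the
## preprocessing step; illustrative, not the exact chunk code of this report).
library(RWeka)

trigrams <- NGramTokenizer(phrases, Weka_control(min = 3, max = 3))
tri_freq <- sort(table(trigrams), decreasing = TRUE)

# Cumulative share of trigram instances covered by the most frequent types.
coverage <- cumsum(tri_freq) / sum(tri_freq)

# Proportion of distinct trigrams needed to cover 50%, 80% and 90% of the text.
sapply(c(0.5, 0.8, 0.9), function(p) which(coverage >= p)[1] / length(tri_freq))
```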

Low frequency tokens

There is a large number of tokens with very low frequency; the predictive value of these instances will be negligible. We can use this information to establish a prediction criterion for when an unseen word (a word that is not in the training set) is encountered.

Modeling plans

The data will be pruned by removing the lowest-frequency tokens. Before that, a sample of the words with a count equal to 1 will be labeled as <UNK> (for unknown word). This information will later be used by the prediction algorithm when it is confronted with unseen words.

With the remaining data, an n-gram model for prediction will be built. As can be seen from the tables, the frequency of a given token has been computed and will be used as an estimate of its probability. The prediction will consist of picking the word whose preceding n-gram (the largest one, if possible) has the highest joint frequency. In case the predicted value is <UNK>, the prediction will come from the (n-1)-gram with the highest frequency, and so on.
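A rough sketch of the planned back-off lookup is shown below; the function and table names are hypothetical, the tables are assumed to be sorted by decreasing frequency with a `token` column of space-separated n-grams (as in the 3-gram table further below), and <UNK> handling is omitted for brevity.

```r
## Sketch of the planned back-off prediction (names hypothetical; `tri`, `bi`
## and `uni` are n-gram frequency tables sorted by decreasing freq).
predict_next <- function(w1, w2, tri, bi, uni) {
  # Trigram level: rows whose first two words match the input.
  hit <- tri$token[grepl(paste0("^", w1, " ", w2, " "), tri$token)]
  if (length(hit) > 0) return(sub(".* ", "", hit[1]))

  # Back off to bigrams starting with the last word.
  hit <- bi$token[grepl(paste0("^", w2, " "), bi$token)]
  if (length(hit) > 0) return(sub(".* ", "", hit[1]))

  # Last resort: the most frequent unigram.
  uni$token[1]
}

# predict_next("one", "of", tri, bi, uni)   # should return "the" given the table below
```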

What follows is the head of the 3-gram news frequency table, just to show that 3-gram joint frequencies have also been computed.

##                      token freq       rel
## 53685           one of the   60 0.0006144
## 47507   more than <number>   43 0.0004403
## 2285              a lot of   41 0.0004198
## 723    <number> percent of   37 0.0003789
## 904   <number> to <number>   36 0.0003686
## 38177         in the first   35 0.0003584
## 30757          going to be   30 0.0003072
## 3994      according to the   29 0.0002970
## 55655          part of the   27 0.0002765
## 51561      of the <number>   26 0.0002662

In order to make the model even more efficient, the main data structure will be turned into a trie with three levels based on the first three letters of the n-grams.
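As a simplified stand-in for that structure, the n-grams can be bucketed by their three-letter prefix so that a lookup only scans a small subset; the planned trie would split these letters over three levels.

```r
## Sketch: bucket trigrams by their first three letters so a lookup only scans
## a small subset (a simplified stand-in for the planned three-level trie).
prefix <- substr(tri$token, 1, 3)
index  <- split(tri$token, prefix)

# index[["one"]]   # all stored trigrams starting with "one"
```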

Final product

An application using the shiny package for R will be developed. It will consist of a text input field and a set of options (three or more) from which the user can select the next word. These options will update according to the letters typed.
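A minimal shiny skeleton for such an interface could look like the following; the `predict_candidates()` helper and the widget choices are hypothetical placeholders.

```r
## Minimal shiny sketch (predict_candidates() is a hypothetical helper
## returning the most likely next words for the text typed so far).
library(shiny)

ui <- fluidPage(
  textInput("text", "Type your text:"),
  radioButtons("choice", "Suggested next word:", choices = c("the", "a", "to"))
)

server <- function(input, output, session) {
  observeEvent(input$text, {
    words <- predict_candidates(input$text)
    updateRadioButtons(session, "choice", choices = words)
  })
}

shinyApp(ui, server)
```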