Beginning to form a predictive text model

An exploratory analysis

A word cloud of the most frequent words, generated from the data (see the Appendix for the code).

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data and to prepare for building the first linguistic models.


Loading the data and preliminary exploration

We were given three datasets containing text lines from publicly available web sources: blogs, Twitter, and news sites. More information about the corpora is available from the data provider. After loading the three datasets, cleaning them up, removing profanity, and tokenizing each text, I explored the data.
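
A minimal sketch of the loading step is shown below; the file names and paths are assumptions, and the cleaning, profanity-filtering, and tokenization steps are omitted.

# Read the three raw text files line by line; skipNul = TRUE guards against
# embedded NUL characters. The paths here are assumed for illustration.
blogs   <- readLines("data/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("data/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)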

Blogs

lines: 899,288
unique words: 29,793

Twitter

lines: 2,360,148
unique words: 15,013

News

lines: 1,010,242
unique words: 29,981

The top 10 words used in each dataset, with their relative frequencies:

Blogs

##   the   and  that   for   you   was  with  this  have   but 
## 0.122 0.072 0.030 0.024 0.020 0.019 0.019 0.018 0.015 0.014

Twitter

##   the   you   and   for  that  this  your  with  have   are 
## 0.098 0.056 0.044 0.041 0.024 0.019 0.018 0.017 0.017 0.017

News

##   the   and   for  that  with  said   was  from   his   but 
## 0.158 0.071 0.028 0.028 0.021 0.020 0.018 0.012 0.012 0.012

Exploring the data at large

After the initial analysis and cleaning, the three datasets were combined into one large corpus, which will be used to form the predictive algorithm. Exploring this combined dataset can help determine how best to proceed towards a predictive model.

We can observe the frequencies of various n-grams in the data. The following table shows the frequencies of the top 10 words.

The following two tables show the 2-gram and 3-gram frequencies for the top 10 word pairs and triplets.
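
As a rough illustration of how such n-gram counts can be produced, here is a base-R sketch; the tokens vector below is a toy placeholder for the words of the combined corpus.

# Count bigrams by pairing each token with its successor (this sketch ignores
# sentence boundaries; 'tokens' stands in for the real tokenized corpus).
tokens  <- c("the", "cat", "sat", "on", "the", "mat")
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
head(sort(table(bigrams), decreasing = TRUE), 10)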

Most of the unique words in the data are not observed very often. In fact, while sorting the data it becomes clear that removing sparse terms is almost necessary for even a basic exploratory analysis. But how much pruning is appropriate? There is a natural tension between increasing efficiency via pruning and increasing accuracy by not pruning.
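
One way to prune is sketched below with the tm package; the combined vector is a toy stand-in for the cleaned text lines, and the 0.99 threshold is only an example of where the efficiency/accuracy knob could be set.

library(tm)

# Build a document-term matrix from the (toy) combined text and drop terms
# missing from more than 99% of documents; lowering the threshold prunes more.
combined   <- c("the cat sat on the mat", "a dog sat on the rug")
dtm        <- DocumentTermMatrix(VCorpus(VectorSource(combined)))
dtm_pruned <- removeSparseTerms(dtm, sparse = 0.99)
dim(dtm_pruned)   # documents x terms after pruning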


Word Frequency

Looking at the data, we can use the word frequencies to determine how many of the highest frequency words are needed to cover most of the data.

For 50% coverage, the top 324 most frequent words are needed, and for 90% coverage the top 9658 most frequent words are needed.
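
A minimal sketch of that calculation follows; word_freqs is a toy stand-in for the named vector of word counts from the combined corpus.

# How many of the most frequent words cover a given share of all word
# occurrences? ('word_freqs' is a placeholder for the real count vector.)
coverage_count <- function(word_freqs, coverage = 0.5) {
  sorted <- sort(word_freqs, decreasing = TRUE)
  which(cumsum(sorted) / sum(sorted) >= coverage)[1]
}
word_freqs <- table(c("the", "the", "the", "cat", "sat", "on", "the", "mat"))
coverage_count(word_freqs, 0.5)   # with the real data: 324 words
coverage_count(word_freqs, 0.9)   # with the real data: 9658 words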


Handling Different Languages

Although the data were filtered for English content, there may be parts where other languages have slipped in or passed the filter. The occurrence of non-English words may not be entirely erroneous, since overlap between languages is inevitable; however, when forming an English prediction model it is vital that we have a way of assessing how clean our data is.
Of the many packages related to language identification in R, I found the “textcat” package to suit the needs of this project. In particular, its language profiles can easily be modified to include as many or as few languages as desired. Since we were instructed to assess for English, Finnish, German, and Russian, these languages served as the restricting parameter for the package.
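
A sketch of how the profiles can be restricted is shown below; the use of TC_byte_profiles and the prefix match for the Russian profile names are assumptions about the package's default profile database.

library(textcat)

# Keep only the built-in profiles for the four languages of interest; the
# Russian profiles are matched by prefix since their names are assumed to
# carry an encoding suffix in the default database.
keep        <- grepl("^(english|finnish|german|russian)", names(TC_byte_profiles))
my_profiles <- TC_byte_profiles[keep]

# 'tokens' is a toy placeholder for the words from the cleaned data.
tokens <- c("house", "haus", "talo")
table(textcat(tokens, p = my_profiles))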

The following is a table of the total number of words by language seen in the data, as designated by the “textcat” package.

##   languagesFilt  Freq
## 1       english 34341
## 2       finnish  6803
## 3        german  8963

As you can see, there are no “Russian” words. I suspect this is due to my cleaning process, in which I removed non-English letters from the dataset. In the future, I may want to remove that filter.
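
For illustration, a filter along the following lines would strip Cyrillic text entirely, which would explain the absence of Russian words (the exact cleaning code used is not reproduced here).

# Stripping characters outside the basic Latin alphabet removes Cyrillic
# words entirely (a sketch of the kind of filter described above).
line <- "hello \u043f\u0440\u0438\u0432\u0435\u0442 world"
gsub("[^a-zA-Z' ]", "", line)   # "hello  world"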

As an example, here are some of the terms from the data that were designated as non-english.

Sample of “Finnish” words found in data

##  [1] "muriels"    "pills"      "virtuosity" "mannequin"  "sake"      
##  [6] "kentuck"    "ima"        "amidst"     "jokingly"   "sells"

Sample of “German” words found in data

##  [1] "sider"     "lightenup" "shaker"    "wretch"    "leonte"   
##  [6] "nischeta"  "district"  "archer"    "lawmaker"  "shuffles"

Concerns and building a predictive model

One major concern is memory and efficiency. If possible, it would be best to reduce the number of terms involved while keeping model accuracy high. To conserve space, I could use regular expressions that match “like” expressions, reducing the number of distinct terms needed for high coverage. This might reduce accuracy slightly, but the overall gain in efficiency should be worth it.
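
As a rough illustration of that idea, the sketch below collapses spelling variants onto one representative term; the pattern and tokens are purely illustrative.

# Map variant spellings to a single representative term with a regular
# expression, so they share one entry in the frequency table.
tokens    <- c("color", "colour", "colors", "colours", "colorful")
canonical <- ifelse(grepl("^colou?rs?$", tokens), "color", tokens)
table(canonical)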

Forming the predictive n-gram model would include the following steps:
1. Form 2, 3, and 4-gram corpora from the large dataset.
2. Calculate the frequencies of the terms in each.
3. “Smooth” the frequencies in each, so that terms unseen in the data (zero frequency) still receive some probability.
4. Create simple n-gram models using the 2, 3, and 4-gram corpora from the large dataset.
5. Using a Markov chain approach, a word (or more) can be predicted based on the word(s) currently provided to these models.
6. However, each model may exhibit unique characteristics and term frequencies. These frequencies should be weighted properly when predicting, so a more accurate model can be formed by combining the three simple n-gram models into one “back-off” model, which gives priority to the highest-order (most specific) model and falls back to the lower-order models when needed (a toy sketch of this idea follows the list).
7. In order to account for words not seen in our data, regular expressions could be used to recognize likeness.
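
A toy sketch of the back-off lookup in step 6 is shown below; the tables tri, bi, and uni, along with their columns prefix, word, and prob, are hypothetical placeholders for the smoothed frequency tables built in steps 1-3.

# Back-off prediction: try the trigram table first, then the bigram table,
# then fall back to the most frequent unigrams. All names are placeholders.
predict_word <- function(phrase, tri, bi, uni) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  hit <- tri[tri$prefix == paste(words, collapse = " "), ]       # two-word prefix
  if (nrow(hit) == 0) hit <- bi[bi$prefix == tail(words, 1), ]   # one-word prefix
  if (nrow(hit) == 0) hit <- uni                                 # no prefix match
  hit$word[which.max(hit$prob)]
}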


Appendix

library(wordcloud)
# Word cloud of the terms in 'singles' (assumed to be the unigram term matrix).
wordcloud(singles$dimnames$Terms, min.freq = .0001, scale = c(5, .25), random.color = TRUE,
          max.words = 400, colors = c("blue", "red", "black", "orange"))