This picture was generated from the data (See Appendix).
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this report is to explore those basic relationships in the data and to prepare for building a first linguistic model.
We were given three datasets containing text lines from publicly available web sources: blogs, Twitter, and news sites. Please check here for more information about the corpora. After loading the three datasets, cleaning them up, removing profanity, and tokenizing each text, I explored the data.
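As a rough sketch of those preparation steps (the file names and the profanity list used here are assumptions for illustration, not the exact code behind this report):

```r
# Load the three datasets; file names are assumed from the course corpus.
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

# A profanity list is assumed to be available, one word per line.
profanity <- readLines("profanity.txt")

# Lower-case, drop non-English characters, and collapse whitespace.
clean_lines <- function(lines) {
  lines <- tolower(lines)
  lines <- gsub("[^a-z' ]", " ", lines)
  gsub("\\s+", " ", trimws(lines))
}

# Split cleaned lines into word tokens and drop profanity.
tokenize <- function(lines) {
  tokens <- unlist(strsplit(clean_lines(lines), " ", fixed = TRUE))
  tokens[nchar(tokens) > 0 & !(tokens %in% profanity)]
}

blog_tokens    <- tokenize(blogs)
twitter_tokens <- tokenize(twitter)
news_tokens    <- tokenize(news)
```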
| Dataset | Lines     | Unique words |
|---------|-----------|--------------|
| Blogs   | 899,288   | 29,793       |
| Twitter | 2,360,148 | 15,013       |
| News    | 1,010,242 | 29,981       |
Blogs:

##   the   and  that   for   you   was  with  this  have   but
## 0.122 0.072 0.030 0.024 0.020 0.019 0.019 0.018 0.015 0.014

Twitter:

##   the   you   and   for  that  this  your  with  have   are
## 0.098 0.056 0.044 0.041 0.024 0.019 0.018 0.017 0.017 0.017

News:

##   the   and   for  that  with  said   was  from   his   but
## 0.158 0.071 0.028 0.028 0.021 0.020 0.018 0.012 0.012 0.012
After the initial analysis and cleaning, the three datasets were combined into one large corpus, which will be used to build the predictive algorithm. Exploring this combined dataset helps determine how best to proceed toward a predictive model.
We can observe the frequencies of various n-grams in the combined data. The following table shows the frequencies of the top 10 words.
The next two tables show the 2-gram and 3-gram frequencies for the top 10 word pairs and triplets.
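For reference, a minimal base-R sketch of how such n-gram frequencies can be tabulated, assuming `tokens` is the combined token vector (e.g. `c(blog_tokens, twitter_tokens, news_tokens)` from the sketch above); this illustrates the idea rather than the exact code used to produce the tables:

```r
# Count n-grams by pasting together sliding windows of n consecutive tokens.
ngram_freq <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(tokens) - n + 1),
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}

head(ngram_freq(tokens, 2), 10)  # top 10 word pairs
head(ngram_freq(tokens, 3), 10)  # top 10 triplets
```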
Most of the unique words in the data are observed only rarely. In fact, when sorting the term frequencies it becomes clear that removing sparse terms is practically necessary even for a basic exploratory analysis. But how much pruning is appropriate? There is a natural tension between pruning for efficiency and keeping rare terms for accuracy.
Looking at the data, we can use the word frequencies to determine how many of the highest frequency words are needed to cover most of the data.
For 50% coverage, the top 324 most frequent words are needed, and for 90% coverage the top 9658 most frequent words are needed.
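Coverage like this can be computed from the cumulative word frequencies; a sketch, assuming `word_freq` is a single-word frequency table sorted in decreasing order:

```r
# Number of top-frequency words needed to reach a target share of all
# word occurrences (e.g. 0.5 for 50% coverage).
coverage_count <- function(word_freq, target) {
  cum_share <- cumsum(as.numeric(word_freq)) / sum(word_freq)
  which(cum_share >= target)[1]
}

coverage_count(word_freq, 0.5)  # words needed for 50% coverage
coverage_count(word_freq, 0.9)  # words needed for 90% coverage
```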
Although the data were filtered for English content, some text in other languages may have slipped in or passed the filter. The occurrence of non-English words is not necessarily erroneous, since overlap between languages is inevitable; however, when building an English prediction model, it is vital to have a way of assessing how clean the data are.
Of the many packages related to language assessment in R, I found the “textcat” package to suit the needs of this project. In particular, its language profiles can easily be restricted to as many or as few languages as desired. Since we were instructed to assess English, Finnish, German, and Russian, these four languages served as the restricted profile set for the package.
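A sketch of that restriction, assuming a character vector `words` to classify; the exact profile names (especially the encoding-specific Russian ones) are assumptions based on the profiles shipped with “textcat”:

```r
library(textcat)

# Keep only the profiles for the four languages of interest.
keep <- c("english", "finnish", "german",
          "russian-koi8_r", "russian-windows1251")
my_profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% keep]

# Classify each word against the restricted profile set and tabulate.
languagesFilt <- textcat(words, p = my_profiles)
table(languagesFilt)
```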
The following is a table of the total number of words by language seen in the data, as designated by the “textcat” package.
## languagesFilt Freq
## 1 english 34341
## 2 finnish 6803
## 3 german 8963
As you can see, no words were classified as Russian. I suspect this is due to my cleaning process, which removed non-English characters from the dataset. In the future, I may want to relax that filter.
As an example, here are some of the terms from the data that were designated as non-English.
## [1] "muriels" "pills" "virtuosity" "mannequin" "sake"
## [6] "kentuck" "ima" "amidst" "jokingly" "sells"
## [1] "sider" "lightenup" "shaker" "wretch" "leonte"
## [6] "nischeta" "district" "archer" "lawmaker" "shuffles"
One major concern is memory and efficiency. If possible, it would be best to reduce the number of terms involved while keeping model accuracy high. To conserve space, I could use regular expressions to collapse similar word forms into a single term, reducing the number of terms needed for high coverage. This might cost some accuracy, but the savings in memory and lookup time should make the overall model more practical.
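As an illustration of the idea (not a final design), related word forms can be collapsed with a regular expression so that they share one entry in the frequency table; `word_freq` is again assumed to be the single-word frequency table:

```r
# Crudely strip common English suffixes so that related forms
# ("walk", "walks", "walked", "walking") collapse into one term.
collapse_terms <- function(word_freq) {
  stems <- gsub("(ing|ed|es|s)$", "", names(word_freq))
  tapply(as.numeric(word_freq), stems, sum)
}

reduced <- collapse_terms(word_freq)
length(word_freq)  # number of terms before collapsing
length(reduced)    # fewer terms afterwards
```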
Forming the predictive n-gram model would include the following steps:
1. Form 2-, 3-, and 4-gram corpora from the large dataset.
2. Calculate the frequencies of the terms in each.
3. “Smooth” the frequencies in each, so that unseen (zero-frequency) terms still receive some probability.
4. Create simple n-gram models from the 2-, 3-, and 4-gram corpora.
5. Use a Markov-chain approach to predict the next word (or words) from the word(s) currently provided to these models.
6. Combine the three simple n-gram models into a single “back-off” model. Each model has its own characteristics and term frequencies, so their predictions must be weighted properly, giving precedence to the model expected to be most accurate (a minimal sketch of this idea follows the list).
7. To account for words not seen in the data, use regular expressions to recognize similar word forms.
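A minimal sketch of the back-off idea in step 6, assuming `gram2`, `gram3`, and `gram4` are frequency tables of space-separated n-grams (for example, as produced by the `ngram_freq()` sketch above) built from cleaned, lower-case tokens; this is a simple “stupid back-off” illustration and omits the smoothing from step 3:

```r
# Predict the next word: try the longest available prefix first, then
# back off to shorter prefixes if no matching n-gram is found.
predict_next <- function(phrase) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 3)
  for (n in seq(length(words), 1)) {
    prefix <- paste(tail(words, n), collapse = " ")
    counts <- list(gram2, gram3, gram4)[[n]]
    hits   <- counts[grepl(paste0("^", prefix, " "), names(counts))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))  # last word of the best-matching n-gram
    }
  }
  "the"  # fall back to the single most frequent word
}

predict_next("thanks for the")
```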
```r
library(wordcloud)

# `singles` is assumed to be a TermDocumentMatrix of single-word terms;
# relative term frequencies are taken from its row sums.
term_freq <- slam::row_sums(singles) / sum(slam::row_sums(singles))

wordcloud(singles$dimnames$Terms, term_freq, min.freq = .0001, scale = c(5, .25),
          random.color = TRUE, max.words = 400,
          colors = c("blue", "red", "black", "orange"))
```