The goal of this milestone report is to conduct an exploratory analysis of the given dataset and to build a simple model of the relationships between words. These are the first steps in building a predictive text-mining application.
We have been supplied with a dataset (the Capstone Dataset) containing Twitter, news and blog data for four locales: en_US, de_DE, ru_RU and fi_FI. The data comes from a corpus called HC Corpora.
In this report we focus solely on the English corpus. If the final prediction model works well and there is sufficient time, we will try to incorporate the other languages into our model.
The main pre-processing tasks are:
- Tokenization: the process of identifying appropriate tokens such as words, punctuation and numbers.
- Filtering: the process of removing profanity and other words we do not want to predict.
We start our analysis by providing some summary statistics for each data file and in total before we apply any manipulations:
| Source | Lines | Words | Characters | Size (MB) |
|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334117 | 208623081 | 200.42 |
| en_US.news.txt | 1010242 | 34365936 | 205243643 | 196.28 |
| en_US.twitter.txt | 2360148 | 30373559 | 166816544 | 159.36 |
| Total counts | 4269678 | 102073612 | 580683268 | 556.06 |
We observe that although the files contain different numbers of lines, the word counts are roughly on the same scale (30-37M). This is to be expected: tweets are shorter than news lines, while blog lines can be longer than news lines.
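For reference, the counts above can be obtained with a short R script along the following lines (a sketch only; the stringi package and the final/en_US directory layout are assumptions, not something specified in this report):

```r
library(stringi)

# Summarise one file: line, word and character counts plus size on disk.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    Source     = basename(path),
    Lines      = length(lines),
    Words      = sum(stri_count_words(lines)),
    Characters = sum(nchar(lines)),
    Size_MB    = round(file.size(path) / 1024^2, 2)
  )
}

files <- file.path("final", "en_US",   # assumed location of the unzipped data
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, summarise_file))
```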
Since our dataset is quite large, we sample 20% of the lines from the blogs, 15% of the lines from the news and 10% of the lines from the tweets. We merge these samples into one dataset that contains lines from all three types of available data, and we use this sample for the rest of our analysis.
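A minimal sketch of the sampling step, assuming the three files have been read into the (hypothetical) character vectors blogs, news and tweets with one line of text per element:

```r
set.seed(1234)  # for reproducibility

# Keep each line independently with probability `fraction`.
sample_lines <- function(lines, fraction) {
  lines[rbinom(length(lines), size = 1, prob = fraction) == 1]
}

sample_text <- c(
  sample_lines(blogs,  0.20),   # 20% of the blog lines
  sample_lines(news,   0.15),   # 15% of the news lines
  sample_lines(tweets, 0.10)    # 10% of the tweets
)
```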
We apply tokenization to the sample to obtain the list of words it contains. With similar tokenization we obtain pairs of words (bigrams) and triples of words (trigrams). For the tokenization process we used the methods described in the book Text Mining with R: A Tidy Approach by Julia Silge and David Robinson, which is freely available online.
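A sketch of this step with the tidytext package, following that approach (sample_text is the merged sample from the previous sketch):

```r
library(dplyr)
library(tidytext)

sample_df <- tibble(line = seq_along(sample_text), text = sample_text)

# One row per token; unnest_tokens also lowercases and strips punctuation.
unigrams <- sample_df %>% unnest_tokens(word, text)
bigrams  <- sample_df %>% unnest_tokens(bigram,  text, token = "ngrams", n = 2)
trigrams <- sample_df %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
```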
We then apply the following filters (a code sketch follows the list):
- We downloaded a publicly available profanity list from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words and filtered out the matching words.
- We also used the datasets profanity_alvarez, profanity_arr_bad, profanity_banned, profanity_racist and profanity_zac_anger from the lexicon package.
- We filtered out words longer than 20 characters.
- We filtered out words containing characters other than English letters or the apostrophe.
- We filtered out common Twitter acronyms such as lol and rt.
- We omitted the stopwords (high-frequency words like ‘the’ that do not add meaning to a sentence).
The last filter was not applied to bigrams and trigrams.
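A sketch of how these filters could be applied to the unigrams (profanity_list is assumed to hold the downloaded LDNOOBW words, and unigrams is the tokenized sample from the sketch above; both names are illustrative):

```r
library(dplyr)
library(stringr)
library(tidytext)   # provides the stop_words dataset
library(lexicon)    # provides the profanity_* datasets

profanity <- unique(c(
  profanity_list,               # the downloaded LDNOOBW list
  lexicon::profanity_alvarez,
  lexicon::profanity_arr_bad,
  lexicon::profanity_banned,
  lexicon::profanity_racist,
  lexicon::profanity_zac_anger
))

clean_unigrams <- unigrams %>%
  filter(!word %in% profanity) %>%            # profanity lists
  filter(nchar(word) <= 20) %>%               # drop overly long "words"
  filter(str_detect(word, "^[a-z']+$")) %>%   # English letters and apostrophes only
  filter(!word %in% c("lol", "rt")) %>%       # twitter acronyms
  anti_join(stop_words, by = "word")          # stopwords (unigrams only)
```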
After filtering we can count the occurrences of each word, bigram and trigram in the sample dataset. Plotting the most common words gives a feeling for the most common subjects, and plotting the most common bigrams and trigrams gives a sense of the way people write.
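For example, the word counts and a plot of the most frequent words could be produced as follows (clean_unigrams is the filtered set from the sketch above):

```r
library(dplyr)
library(ggplot2)

clean_unigrams %>%
  count(word, sort = TRUE) %>%   # frequency of each word
  slice_head(n = 20) %>%         # keep the 20 most common words
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL, title = "Most common words in the sample")
```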
Of course, all of the above frequency distributions have very long tails because of the many rare words, bigrams and trigrams that appear in blogs, news and tweets.
We will use these sets of n-grams to create predictive models based on the book Speech and Language Processing (3rd ed. draft) by Daniel Jurafsky and James Martin, which is freely available online. We will keep the best model based on cross-validation tests.
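As an illustration of the kind of model we have in mind, the sketch below ranks next-word candidates using simple maximum-likelihood bigram probabilities; the final model will likely add smoothing or backoff as described by Jurafsky and Martin:

```r
library(dplyr)
library(tidyr)

# Count how often each ordered word pair appears in the sample.
bigram_counts <- bigrams %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  count(w1, w2, sort = TRUE)

# Return the k most likely words to follow `previous_word`.
predict_next <- function(previous_word, k = 3) {
  bigram_counts %>%
    filter(w1 == previous_word) %>%
    mutate(prob = n / sum(n)) %>%   # P(w2 | w1) = count(w1 w2) / count(w1)
    slice_head(n = k) %>%
    pull(w2)
}

predict_next("happy")   # e.g. the three most likely words after "happy"
```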
We will also build a Shiny application as a user interface to our predictive model. The application will try to predict the next word based on the text the user has entered, and will present the three most probable next words (much like the SwiftKey app).
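A minimal sketch of such an interface, assuming the predict_next() function from the previous sketch (the final layout will differ):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("text", "Type your text:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    words <- strsplit(trimws(tolower(input$text)), "\\s+")[[1]]
    if (length(words) == 0) return(NULL)
    data.frame(Suggestion = predict_next(tail(words, 1)))
  })
}

shinyApp(ui, server)
```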
We will also create a slide deck presenting our application to a general audience.