Introduction

The goal of this milestone report is to conduct an exploratory analysis of the given dataset and to build a simple model of the relationships between words. These are the first steps in building a predictive text-mining application.

We have been supplied with the Capstone Dataset, which contains Twitter, news and blog data for four locales: en_US, de_DE, ru_RU and fi_FI. The data comes from a corpus called HC Corpora.

In this report we focus solely on the English corpus. If the final prediction model works well and there is sufficient time, we will try to incorporate the other languages as well.

Pre-processing

The main pre-processing tasks are:

  1. Tokenization: the process of identifying appropriate tokens such as words, punctuation and numbers.

  2. Filtering: the process of removing profanity and other words we do not want to predict.

Summary statistics

We start our analysis with some summary statistics for each data file, and in total, before applying any manipulations:

| Source            |     Lines |       Words |  Characters | Size (MB) |
|-------------------|----------:|------------:|------------:|----------:|
| en_US.blogs.txt   |   899,288 |  37,334,117 | 208,623,081 |    200.42 |
| en_US.news.txt    | 1,010,242 |  34,365,936 | 205,243,643 |    196.28 |
| en_US.twitter.txt | 2,360,148 |  30,373,559 | 166,816,544 |    159.36 |
| Total counts      | 4,269,678 | 102,073,612 | 580,683,268 |    556.06 |

We observe that although the files contain different numbers of lines, their word counts are roughly on the same scale (30-37M words). That is to be expected: tweets are shorter than news lines, while blog lines can be longer than news lines.
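
The report does not show how these counts were produced; the following is a minimal sketch of one way to compute them in R (the choice of stringi for word counting is an assumption, not necessarily what was used for the table above):

```r
library(stringi)

# Hypothetical helper: line, word and character counts for one file.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    source     = basename(path),
    lines      = length(lines),
    words      = sum(stri_count_words(lines)),
    characters = sum(nchar(lines)),
    size_mb    = round(file.size(path) / 1024^2, 2)
  )
}

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
do.call(rbind, lapply(files, summarise_file))
```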

Sampling

Since our dataset is quite large, we sample 20% of the blog lines, 15% of the news lines and 10% of the tweets. We merge these samples into one dataset containing lines from all three sources and use it for the rest of our analysis.
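
A minimal sketch of this sampling step, assuming the raw lines of the three files have already been read into the character vectors blogs, news and twitter (the object names and the seed are illustrative):

```r
set.seed(1234)  # illustrative seed, used only for reproducibility

# Sample 20% of blog lines, 15% of news lines and 10% of tweets,
# then merge them into a single sample corpus.
sample_lines <- c(
  sample(blogs,   size = round(0.20 * length(blogs))),
  sample(news,    size = round(0.15 * length(news))),
  sample(twitter, size = round(0.10 * length(twitter)))
)
```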

Tokenization and Filtering

We apply tokenization to the sample to obtain the list of words it contains. With similar tokenization we obtain pairs of words (bigrams) and triples of words (trigrams). For the tokenization process we used the methods described in the book Text Mining with R: A Tidy Approach by Julia Silge and David Robinson, available online for free.
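
A sketch of the tokenization with tidytext, following that book; sample_lines is the merged sample from the previous step, and the column names are illustrative:

```r
library(dplyr)
library(tidytext)

sample_df <- tibble(line = seq_along(sample_lines), text = sample_lines)

# Single words (unnest_tokens lowercases and strips punctuation by default).
words <- sample_df %>%
  unnest_tokens(word, text)

# Bigrams and trigrams.
bigrams <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
trigrams <- sample_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
```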

We then apply the following filters:

  1. We downloaded a publicly available profanity list from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words and filtered out the matching words.

  2. We used the datasets profanity_alvarez, profanity_arr_bad, profanity_banned, profanity_racist and profanity_zac_anger from the lexicon package.

  3. We filtered out words longer than 20 characters.

  4. We filtered out words that do not contain any English letters or apostrophes.

  5. We filtered out some Twitter acronyms, such as lol and rt.

  6. We omitted stop words (high-frequency words like ‘the’ that add little meaning to a sentence).

The last filter was not applied to bigrams and trigrams. A code sketch of these filtering steps is shown below.
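
The sketch below applies the filters to the word tokens, assuming the words table from the tokenization sketch above; the local LDNOOBW file name and the exact pattern for filter 4 are illustrative assumptions:

```r
library(dplyr)
library(stringr)
library(tidytext)   # for the stop_words data set
library(lexicon)

# Local copy of the LDNOOBW list (file name illustrative), combined with
# the profanity data sets from the lexicon package.
ldnoobw   <- readLines("ldnoobw_en.txt", encoding = "UTF-8")
profanity <- unique(c(ldnoobw,
                      profanity_alvarez, profanity_arr_bad, profanity_banned,
                      profanity_racist, profanity_zac_anger))

clean_words <- words %>%
  filter(!word %in% profanity,           # filters 1-2: profanity lists
         nchar(word) <= 20,              # filter 3: overly long tokens
         str_detect(word, "[a-z']"),     # filter 4: must contain English letters or an apostrophe
         !word %in% c("lol", "rt")) %>%  # filter 5: Twitter acronyms
  anti_join(stop_words, by = "word")     # filter 6: stop words (unigrams only)

# Bigrams and trigrams get the same treatment except the stop-word removal.
```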

Exploratory analysis

After filtering we can count the occurrences of each word, bigram and trigram in the sample dataset. We plot the most common words to get a feeling for the most common subjects, and the most common bigrams and trigrams to get a sense of the way people write.
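
A sketch of this counting and plotting step, using the clean_words table from the filtering sketch above (the same pattern applies to the bigram and trigram tables):

```r
library(dplyr)
library(ggplot2)

# Count word frequencies and plot the 20 most common words in the sample.
clean_words %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "occurrences in sample", y = NULL,
       title = "Most common words in the sample")
```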

Of course, all of the above frequency distributions have very long tails, because of the many rare words, bigrams and trigrams that appear in blogs, news and tweets.
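
One way to make the long-tail claim concrete is to compute how many distinct words are needed to cover a given share of all word occurrences; this cumulative-frequency sketch is illustrative and not a figure from the report:

```r
library(dplyr)

# Cumulative share of all word occurrences covered by the most frequent words.
word_coverage <- clean_words %>%
  count(word, sort = TRUE) %>%
  mutate(coverage = cumsum(n) / sum(n))

# How many distinct words cover 50% and 90% of all occurrences?
sum(word_coverage$coverage <= 0.5)
sum(word_coverage$coverage <= 0.9)
```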

Future goals