Synopsis

The aim of this capstone project is to apply data science techniques to Natural Language Processing using R.
The starting point for this project is a large corpus of documents, which will be used to build a simple but effective predictive text model. The data can be found here.

This milestone report explains the basic steps taken to load and sample the data, clean it, and organize it into a form suitable for the subsequent modeling tasks.

Getting the data

After downloading and unzipping the data in the data/ directory, we can check which files we have.
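
For reproducibility, the download and unzip step can be scripted as well. A minimal sketch, assuming the standard Coursera-SwiftKey archive URL used by the capstone:

data_url <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("./data")) dir.create("./data")
download.file(data_url, destfile = "./data/Coursera-SwiftKey.zip", mode = "wb")
unzip("./data/Coursera-SwiftKey.zip", exdir = "./data")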

list.files("./data/final/", recursive = T)
##  [1] "de_DE/de_DE.blogs.txt"   "de_DE/de_DE.news.txt"   
##  [3] "de_DE/de_DE.twitter.txt" "en_US/en_US.blogs.txt"  
##  [5] "en_US/en_US.news.txt"    "en_US/en_US.twitter.txt"
##  [7] "fi_FI/fi_FI.blogs.txt"   "fi_FI/fi_FI.news.txt"   
##  [9] "fi_FI/fi_FI.twitter.txt" "ru_RU/ru_RU.blogs.txt"  
## [11] "ru_RU/ru_RU.news.txt"    "ru_RU/ru_RU.twitter.txt"

There are three files, blogs.txt, news.txt and twitter.txt, for each of the four provided locales: de_DE (German), en_US (English), fi_FI (Finnish) and ru_RU (Russian). We will use the en_US dataset for this project.

File statistics

Let’s get some basic statistics about these files; specifically, their size in MB, number of lines and number of words.
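
The variables used below (blogs_size, blogs_len, blogs_words and so on) are computed in chunks not shown here. A minimal sketch of how they might be obtained, shown for the blogs file and analogous for the other two (word counting via stringi is an assumption):

library(stringi)

blogs_lines <- readLines("./data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
blogs_size  <- file.size("./data/final/en_US/en_US.blogs.txt") / 1024^2  # size in MB
blogs_len   <- length(blogs_lines)                                       # number of lines
blogs_words <- sum(stri_count_words(blogs_lines))                        # number of words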

corp_stats <- tibble(corpus = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"), 
                     size_MB = c(blogs_size, news_size, twitter_size), 
                     num_lines = c(blogs_len, news_len, twitter_len), 
                     num_words = c(blogs_words, news_words, twitter_words))
corp_stats %>% 
    kable()
corpus              size_MB    num_lines   num_words
en_US.blogs.txt     200.4242      899288    37546239
en_US.news.txt      196.2775     1010242    34762395
en_US.twitter.txt   159.3641     2360148    30093413

Sampling the data

From what we have seen, these datasets are quite large, so we will select a small sample of them for the sake of simplicity. This might of course weaken our analysis, but it should be fine for the purposes of this project.
We will select 1% of the lines in each document and create a cumulative sample by merging these three samples together.
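
The data frames sampled below (blogs_df, news_df and twitter_df) are assumed to simply wrap the raw lines of each file, one line per row; a minimal sketch:

blogs_df   <- tibble(text = readLines("./data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE))
news_df    <- tibble(text = readLines("./data/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE))
twitter_df <- tibble(text = readLines("./data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE))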

set.seed(420)
blogs_sample <- sample_frac(blogs_df, size = .01, replace = T)
news_sample <- sample_frac(news_df, size = .01, replace = T)
twitter_sample <- sample_frac(twitter_df, size = .01, replace = T)
cum_sample <- bind_rows(blogs_sample, news_sample, twitter_sample)

Sample statistics

Now we can compute the same statistics seen above, but for our reduced samples.
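
As above, the size, line and word counts referenced below come from hidden chunks. A sketch for the blogs sample, analogous for the others (in-memory size via object.size and word counts via stringi, as in the earlier sketch, are assumptions):

blogs_sample_size  <- format(object.size(blogs_sample), units = "Mb")  # in-memory size, e.g. "2.5 Mb"
blogs_sample_len   <- nrow(blogs_sample)                               # number of sampled lines
blogs_sample_words <- sum(stri_count_words(blogs_sample$text))         # number of words in the sample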

sample_stats <- tibble(sample = c("blogs", "news", "twitter", "cumulative"), 
                     size = c(blogs_sample_size, news_sample_size, twitter_sample_size, cum_sample_size), 
                     num_lines = c(blogs_sample_len, news_sample_len, twitter_sample_len, cum_sample_len), 
                     num_words = c(blogs_sample_words, news_sample_words, twitter_sample_words, cum_sample_words))
sample_stats %>% 
    kable()
sample       size     num_lines   num_words
blogs        2.5 Mb        8993      375790
news         2.6 Mb       10102      344597
twitter      3.2 Mb       23601      300564
cumulative   8.3 Mb       42696     1020951

After this sampling, the data should be much easier to analyse and manipulate, so we can proceed to clean and prepare it for the rest of the project.

Cleaning the data

The data we have so far needs some cleaning: removing common stop words (“the”, “of”, “in”, etc.), stripping punctuation and extra whitespace, and converting every word to lowercase; we will also remove profanity, based on a list available on this GitHub repository.
This will of course hurt the accuracy of our future predictions, but it ensures that the data is coherent and easy to work with.
We are going to use the tm package for this purpose.
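
For reference, these are the packages assumed to be loaded throughout this report (the exact library calls are not shown in the original chunks):

library(dplyr)    # tibble, sample_frac, bind_rows, the pipe
library(knitr)    # kable
library(tm)       # VCorpus, tm_map and the cleaning transformations
library(RWeka)    # NGramTokenizer for the n-gram analysis
library(ggplot2)  # bar charts of the n-gram frequencies
library(plotly)   # ggplotly for interactive plots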

First we will create a so-called corpus, which contains our data in a suitable format.

corpus <- VCorpus(VectorSource(c(cum_sample)))

Now we can apply all the desired transformations and cleaning steps.

# collapse repeated whitespace and convert everything to lowercase
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
# remove English stop words and punctuation (keeping intra-word contractions)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_contractions = T)
# replace typographic characters not handled by removePunctuation with spaces
tospace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
corpus <- tm_map(corpus, tospace, "–")
corpus <- tm_map(corpus, tospace, "”")
corpus <- tm_map(corpus, tospace, "“")
corpus <- tm_map(corpus, tospace, "…")
# drop numbers, URLs and Twitter handles
corpus <- tm_map(corpus, removeNumbers)
remove_url <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", " ", x)
corpus <- tm_map(corpus, content_transformer(remove_url))
remove_handles <- function(x) gsub("@[^\\s]+", " ", x)
corpus <- tm_map(corpus, content_transformer(remove_handles))
# remove profanity based on the external word list, then clean up whitespace again
profanity <- readLines("./data/profanity.txt")
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, stripWhitespace)
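
The profanity.txt file read above is assumed to have been downloaded beforehand from the GitHub list mentioned earlier. A minimal sketch with a placeholder URL (replace it with the raw URL of the chosen word list):

# hypothetical URL: point this at the raw text version of the chosen profanity list
profanity_url <- "https://raw.githubusercontent.com/<user>/<repo>/master/profanity.txt"
download.file(profanity_url, destfile = "./data/profanity.txt")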

Now we are ready to perform some analysis and gain a few interesting insights from our data.

n-gram statistics

The clean corpus we generated can be used to analyse word frequencies as well as n-gram frequencies. An n-gram is a contiguous sequence of n items (in this case, words) from a given corpus; we will explore bi-grams (n = 2) and tri-grams (n = 3).
In order to do this, we must first create a document-term matrix, which records the frequency of each term (here, each n-gram) in our corpus.
We are going to use the RWeka library to tokenize the text into n-grams.

1-grams

dtm1 <- DocumentTermMatrix(corpus)
top_1gram <- findMostFreqTerms(dtm1, n = 30)$`1`
top_1gram
##   will   said   just    one   like    can    get   time    new   good 
##   3077   3039   2953   2894   2618   2493   2307   2200   1948   1826 
##    now    day   love   know people    see  first   back  going   also 
##   1750   1665   1637   1600   1501   1399   1386   1348   1337   1285 
##  great   make  think   last   year   much    two really   work    got 
##   1248   1242   1238   1229   1227   1191   1178   1168   1133   1115
df_1gram <- tibble(word = names(top_1gram), freq = top_1gram)
p_1gram <- df_1gram %>% 
    ggplot(aes(x = reorder(word, -freq), y = freq, label = word)) + 
    geom_col(fill = "steelblue") + 
    labs(x = "", y = "Frequency", title = "Distribution of top 30 words") + 
    theme_light() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
ggplotly(p_1gram, tooltip = c("freq", "word"))

2-grams

bi_gram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = bi_gram_tokenizer))
top_2gram <- findMostFreqTerms(dtm2, n = 30)$`1`
top_2gram
##       right now       last year        new york     high school 
##             248             214             198             149 
##      last night       years ago      first time         can get 
##             140             137             133             123 
##       feel like looking forward       last week     even though 
##             122             116             108             103 
##        st louis       make sure      looks like       next week 
##             100              98              95              94 
##    good morning  happy birthday   united states         one day 
##              93              89              86              85 
##         can see       look like      new jersey       every day 
##              83              81              81              80 
##       two years       just like           s day        let know 
##              79              76              76              72 
##         go back        just got 
##              70              70
df_2gram <- tibble(word = names(top_2gram), freq = top_2gram)
p_2gram <- df_2gram %>% 
    ggplot(aes(x = reorder(word, -freq), y = freq, label = word)) + 
    geom_col(fill = "seagreen") + 
    labs(x = "", y = "Frequency", title = "Distribution of top 30 bigrams") + 
    theme_light() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p_2gram, tooltip = c("freq", "word"))

3-grams

tri_gram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = tri_gram_tokenizer))
top_3gram <- findMostFreqTerms(dtm3, n = 30)$`1`
top_3gram
##              fake fake fake                mother s day 
##                          33                          29 
##          none repeat scroll        repeat scroll yellow 
##                          25                          25 
## stylebackground none repeat              happy new year 
##                          25                          24 
##               new york city                 let us know 
##                          23                          21 
##      president barack obama               two years ago 
##                          20                          18 
##             valentine s day              cake cake cake 
##                          17                          15 
##              happy mother s               cinco de mayo 
##                          15                          14 
##           happy mothers day                 last year s 
##                          14                          14 
##                  new year s              new york times 
##                          13                          13 
##          martin luther king                 come see us 
##                          12                          11 
##            county sheriff s                 g protein g 
##                          11                          11 
##                 rock n roll            g carbohydrate g 
##                          11                          10 
##            sheriff s office             st louis county 
##                          10                          10 
##             three years ago         wall street journal 
##                          10                          10 
##           attorney s office             fat g saturated 
##                           9                           9
df_3gram <- tibble(word = names(top_3gram), freq = top_3gram)
p_3gram <- df_3gram %>% 
    ggplot(aes(x = reorder(word, -freq), y = freq, label = word)) + 
    geom_col(fill = "darksalmon") + 
    labs(x = "", y = "Frequency", title = "Distribution of top 30 trigrams") + 
    theme_light() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p_3gram, tooltip = c("freq", "word"))

Discussion and remarks

The provided dataset is composed of three different documents: blog posts, news and tweets. Our aim was to understand their structure and peculiarities in order to build a text prediction model, similar to those running every day on our smartphones.
I could have chosen to use only the blog and news text, since tweets usually contain more mistakes and contractions due to the 140-character restriction; however, this so-called Twitter slang has become part of everyday English, so for the sake of completeness I included the tweets in the analysis as well.
In order to provide fast but reliable predictions, I had to sample the data; 1% of the original dataset seemed a reasonable amount to keep computations fast while remaining representative of the whole corpus, and it should be enough for the purposes of this project.
Stop words were removed, so we may not be able to predict some very common words such as “and”, “or”, “in” and so on. An additional cleaning step would be word stemming, but this could have trimmed our already-reduced dataset a bit too much, so I chose to avoid it (a reference one-liner is shown below).
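
Should stemming be needed later, it would be a single tm transformation (not applied in this report; the SnowballC package is assumed to be installed):

# not applied here, shown only for reference
library(SnowballC)
corpus <- tm_map(corpus, stemDocument)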

All these choices might impair the accuracy of the prediction model, but hopefully the final application will still be able to perform quite well with the given amount of data.

Future plans

The n-gram plots showed that some words are far more common than others, and the 2- and 3-gram plots make it evident that some of these words frequently appear together. This will be exploited when building the prediction algorithm, which will probably rely on a back-off model: the model first looks at the most common 3-grams (and possibly 4-grams) to predict the next word; if no match is found, it backs off to the 2-gram model, and finally to the single-word model. A rough sketch of this idea is shown below.
The final prediction application will be implemented as a Shiny app.
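
To illustrate the back-off idea, here is a toy sketch built on the df_3gram, df_2gram and df_1gram tables computed above. It is only an illustration: those tables hold just the top 30 n-grams each, while the real model will use full frequency tables and proper smoothing.

predict_next_word <- function(phrase) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(words)
    if (n == 0) return(df_1gram$word[which.max(df_1gram$freq)])
    # 1. try the trigram table: entries starting with the last two words
    if (n >= 2) {
        hits <- df_3gram[startsWith(df_3gram$word, paste(words[n - 1], words[n], "")), ]
        if (nrow(hits) > 0) return(sub(".* ", "", hits$word[which.max(hits$freq)]))
    }
    # 2. back off to the bigram table: entries starting with the last word
    hits <- df_2gram[startsWith(df_2gram$word, paste0(words[n], " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[which.max(hits$freq)]))
    # 3. final fallback: the single most frequent word overall
    df_1gram$word[which.max(df_1gram$freq)]
}

predict_next_word("happy new")  # returns "year" given the trigram counts shown above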

All the code for this report as well as for the future development of the app will be available on this GitHub repository.