Executive Summary

This is the milestone report for the second week of the course ‘Data Science Capstone’ in the Coursera Data Science Specialisation.

As part of the project, this report demonstrates the process of:

* Getting and cleaning the data
* Exploratory analysis
* An overview of what the prediction algorithm will be like

The dataset is available here.

The dataset

Load data

connect.blogs <- file('final/en_US/en_US.blogs.txt','rb')
connect.news <- file('final/en_US/en_US.news.txt','rb')
connect.twitter <- file('final/en_US/en_US.twitter.txt','rb')

blogs = readLines(connect.blogs, skipNul = TRUE, encoding="UTF-8")
news = readLines(connect.news, skipNul = TRUE, encoding="UTF-8")
twitter = readLines(connect.twitter, skipNul = TRUE, encoding="UTF-8")

close(connect.blogs)
close(connect.news)
close(connect.twitter)

Data description

##      File File.Size.MB   Lines    Words
## 1   blogs     248.4935  899288 37570839
## 2 twitter     249.6329 1010242 34494539
## 3    news     301.3969 2360148 30451170

Together, the three files total about 799.52 MB, 4,269,678 lines, and 102,516,548 words.
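The summary above was produced by a code chunk that is not shown in the report. A sketch of one way to compute such a table is given below; stringi::stri_count_words is an assumption, and the original may have measured in-memory object size rather than file size on disk.

library(stringi)

file_summary <- data.frame(
        File = c("blogs", "twitter", "news"),
        File.Size.MB = file.size(c("final/en_US/en_US.blogs.txt",
                                   "final/en_US/en_US.twitter.txt",
                                   "final/en_US/en_US.news.txt")) / 1024^2,
        Lines = c(length(blogs), length(twitter), length(news)),
        Words = c(sum(stri_count_words(blogs)),
                  sum(stri_count_words(twitter)),
                  sum(stri_count_words(news)))
)
file_summary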

Data sampling

Since the dataset consists of more than 4 million lines, sampling is needed to keep processing time reasonable. Here we sample 10% of each source to work with.

data.sample <- sample(c(sample(blogs, length(blogs) * .1),
                        sample(twitter, length(twitter) * .1),
                        sample(news, length(news) * .1)))
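For reproducibility, a random seed would normally be fixed before the sampling step above; the original report does not show one, so the value below is arbitrary.

set.seed(2024)  # arbitrary seed, not taken from the original report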

Pre-processing

The data come from three different sources, each with a slightly different format, so it is worth tidying up the text to make visualisation and modelling easier.
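The cleaning and tokenisation steps below rely on several packages. The report does not show its setup chunk, so the following load calls are an assumption based on the functions used.

library(textclean)   # replace_html(), replace_contraction(), mgsub(), ...
library(stringr)     # str_to_lower()
library(quanteda)    # tokens(), dfm(), topfeatures()
library(magrittr)    # the %>% pipe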

# Clean data.sample and build a tokenised version of it

# Profanity filter: drop any word containing an entry from the bad-words list.
# The list is applied in three chunks to keep each regular expression manageable.
bad_words = readLines("bad_words.txt")
data.sample = gsub(paste0('\\s*\\w*', bad_words[1:300], '\\w*\\s*', collapse = '|'), ' ', data.sample)
data.sample = gsub(paste0('\\s*\\w*', bad_words[301:600], '\\w*\\s*', collapse = '|'), ' ', data.sample)
data.sample = gsub(paste0('\\s*\\w*', bad_words[601:933], '\\w*\\s*', collapse = '|'), ' ', data.sample)

data.sample = replace_html(data.sample)          # strip HTML tags and symbols
data.sample = replace_non_ascii(data.sample)     # strip non-ASCII characters
data.sample = replace_contraction(data.sample)   # expand contractions (e.g. can't)
data.sample = replace_ordinal(data.sample)       # spell out ordinal numbers (1st, 2nd, ...)
data.sample = replace_number(data.sample, remove = TRUE)  # remove numerals
data.sample = replace_names(data.sample)         # remove common first and last names
data.sample = str_to_lower(data.sample)          # convert to lower case
# Expand USA/US abbreviations and replace a lone "u" with "you"
data.sample = mgsub(data.sample, c("\\b[Uu]\\.*[Ss]\\.*[Aa]\\.*\\b", "\\b[Uu]\\.+[Ss]\\.*\\b", "\\b[Uu]\\b"),
                    c('United States of America', 'United States', "you"), fixed = F)
data.sample = replace_incomplete(data.sample, ' ')  # replace incomplete-sentence marks (e.g. "...") with a space
data.sample = replace_rating(data.sample)           # replace ratings (e.g. "10 out of 10") with word equivalents
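Before running the full pipeline, the cleaning functions can be tried on a toy vector to check their behaviour. This is a quick illustration only, not part of the original report, and its output is omitted.

toy = c("Visit <b>our</b> site!", "I can't wait for the 2nd game in the U.S.")
toy = replace_html(toy)
toy = replace_contraction(toy)
toy = replace_ordinal(toy)
str_to_lower(toy)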

# Tokenise a character vector and build n-grams of order n
make_token = function(n, text_vector){
        toks = tokens(text_vector, remove_numbers = TRUE, remove_url = TRUE,
                      remove_separators = TRUE, remove_punct = TRUE)
        # Drop non-printable, punctuation-only and whitespace tokens before forming n-grams
        toks = tokens_remove(toks, c('[^[:print:]]', '[[:punct:]]', '\\s+'),
                             valuetype = "regex")
        tokens_ngrams(toks, n = n)
}

N-gram tokens

tokens_one_grams = make_token(1, data.sample) %>% tokens_keep(., min_nchar = 2)
tokens_two_grams = make_token(2, data.sample)
tokens_three_grams = make_token(3, data.sample)
tokens_four_grams = make_token(4, data.sample)

Document-feature matrix

With a clean dataset at hand, let’s build some document-feature matrices (DFM). According to Wikipedia, such a matrix (also known as a document-term matrix) is a “mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.”

# Create one DFM per n-gram order
data.sample.dfm.1.gram = dfm(tokens_one_grams) %>% dfm_remove(stopwords("english"))
data.sample.dfm.2.gram = dfm(tokens_two_grams)
data.sample.dfm.3.gram = dfm(tokens_three_grams)
data.sample.dfm.4.gram = dfm(tokens_four_grams)

Since removing stopwords would make many higher-order n-grams ungrammatical, only the unigrams have stopwords removed.
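As a quick sanity check (not in the original report), the size of each matrix can be inspected, for example with quanteda::nfeat():

nfeat(data.sample.dfm.1.gram)  # number of distinct unigrams
nfeat(data.sample.dfm.4.gram)  # number of distinct four-grams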

Exploratory Analysis

Let’s take a look at some of the most frequent n-gram features.
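The frequency listings below come from a chunk not shown in the report; quanteda::topfeatures() is one way to produce them, sketched here using the DFMs built above.

topfeatures(data.sample.dfm.1.gram, 10)  # top 10 unigrams
topfeatures(data.sample.dfm.2.gram, 10)  # top 10 bigrams
topfeatures(data.sample.dfm.3.gram, 10)  # top 10 trigrams
topfeatures(data.sample.dfm.4.gram, 10)  # top 10 four-grams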

Unigrams

## will  can said  one like just  get time  now know 
## 4235 3202 3082 2963 2685 2559 2307 2217 1886 1700

Bigrams

##  of_the  in_the   it_is    i_am for_the  to_the  on_the  do_not   to_be 
##    4206    3942    3181    3015    2104    2086    1933    1713    1561 
##  i_have 
##    1510

Trigrams

##       i_do_not     one_of_the        it_is_a       a_lot_of       i_am_not 
##            607            354            341            313            311 
##      i_can_not thanks_for_the      it_is_not      i_did_not     there_is_a 
##            280            266            255            229            202

Four-grams

##         i_do_not_know         i_am_going_to       can_not_wait_to 
##                   129                   123                   106 
##        do_not_want_to thanks_for_the_follow        i_do_not_think 
##                    83                    81                    81 
##        the_end_of_the    for_the_first_time        is_going_to_be 
##                    80                    79                    79 
##       i_would_like_to 
##                    77

Prediction Strategy and Plans for the Shiny App

The prediction algorithm will employ an N-gram model to calculate the probability of the next word given the previous words. The Shiny app will be kept simple, consisting of an input field for the text to predict from, a submit button to initiate the prediction, and a word cloud to showcase the predicted words.
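As a rough illustration of the planned approach, the sketch below looks up the most frequent continuations of the last few words and backs off to shorter prefixes when a longer one is unseen. The names predict_next_word and ngram_freq are hypothetical, and the final model will likely add a proper smoothing or back-off scheme (e.g. Stupid Backoff).

# Hypothetical sketch of next-word prediction with simple back-off.
# `ngram_freq` is assumed to be a data frame with columns prefix, next_word
# and count, derived from the n-gram DFMs above.
predict_next_word = function(phrase, ngram_freq, max_order = 4) {
        words = tail(strsplit(tolower(phrase), "\\s+")[[1]], max_order - 1)
        if (length(words) == 0) return(c("the", "to", "and"))
        for (k in length(words):1) {
                prefix = paste(tail(words, k), collapse = "_")
                hits = ngram_freq[ngram_freq$prefix == prefix, ]
                if (nrow(hits) > 0) {
                        ordered = hits[order(hits$count, decreasing = TRUE), ]
                        return(head(ordered$next_word, 3))
                }
        }
        c("the", "to", "and")  # fall back to frequent unigrams if nothing matches
}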