Executive Summary

This is the milestone report for the second week of the course ‘Data Science Capstone’ in the Coursera Data Science Specialisation.

As part of the project, this report demonstrates the process of:

* Getting and cleaning the data
* Exploratory analysis
* An overview of what the prediction algorithm will be like

The dataset is available here.

The dataset

Load data

connect.blogs <- file('final/en_US/en_US.blogs.txt','rb')
connect.news <- file('final/en_US/en_US.news.txt','rb')
connect.twitter <- file('final/en_US/en_US.twitter.txt','rb')

blogs = readLines(connect.blogs, skipNul = TRUE, encoding="UTF-8")
news = readLines(connect.news, skipNul = TRUE, encoding="UTF-8")
twitter = readLines(connect.twitter, skipNul = TRUE, encoding="UTF-8")

close(connect.blogs)
close(connect.news)
close(connect.twitter)

Data description

##      File File.Size.MB   Lines    Words
## 1   blogs     248.4935  899288 37570839
## 2 twitter     249.6329 1010242 34494539
## 3    news     301.3969 2360148 30451170

Together, the three files total about 799.52 MB, 4,269,678 lines, and 102,516,548 words.
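The summary above was produced by a code chunk that is not shown in the report. A sketch of one way to compute such a table is given below; stringi::stri_count_words is an assumption, and the original may have measured in-memory object size rather than file size on disk.

library(stringi)

file_summary <- data.frame(
        File = c("blogs", "twitter", "news"),
        File.Size.MB = file.size(c("final/en_US/en_US.blogs.txt",
                                   "final/en_US/en_US.twitter.txt",
                                   "final/en_US/en_US.news.txt")) / 1024^2,
        Lines = c(length(blogs), length(twitter), length(news)),
        Words = c(sum(stri_count_words(blogs)),
                  sum(stri_count_words(twitter)),
                  sum(stri_count_words(news)))
)
file_summary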

Data sampling

Since the dataset consists of more than 4 million lines, sampling is needed to keep processing time reasonable. Here we sample 10% of each source to work with.

data.sample <- sample(c(sample(blogs, length(blogs) * .1),
                        sample(twitter, length(twitter) * .1),
                        sample(news, length(news) * .1)))
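For reproducibility, a random seed would normally be fixed before the sampling step above; the original report does not show one, so the value below is arbitrary.

set.seed(2024)  # arbitrary seed, not taken from the original report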

Pre-processing

The data come from three different sources, each with a slightly different format, so it is worth tidying up the text to make visualisation and modelling easier.
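The cleaning and tokenisation steps below rely on several packages. The report does not show its setup chunk, so the following load calls are an assumption based on the functions used.

library(textclean)   # replace_html(), replace_contraction(), mgsub(), ...
library(stringr)     # str_to_lower()
library(quanteda)    # tokens(), dfm(), topfeatures()
library(magrittr)    # the %>% pipe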

# Clean data.sample and build a tokenised version of it

# Profanity filter: drop any word containing an entry from the bad-words list.
# The list is applied in three chunks to keep each regular expression manageable.
bad_words = readLines("bad_words.txt")
data.sample = gsub(paste0('\\s*\\w*', bad_words[1:300], '\\w*\\s*', collapse = '|'), ' ', data.sample)
data.sample = gsub(paste0('\\s*\\w*', bad_words[301:600], '\\w*\\s*', collapse = '|'), ' ', data.sample)
data.sample = gsub(paste0('\\s*\\w*', bad_words[601:933], '\\w*\\s*', collapse = '|'), ' ', data.sample)

data.sample = replace_html(data.sample)          # strip HTML tags and symbols
data.sample = replace_non_ascii(data.sample)     # strip non-ASCII characters
data.sample = replace_contraction(data.sample)   # expand contractions (e.g. can't)
data.sample = replace_ordinal(data.sample)       # spell out ordinal numbers (1st, 2nd, ...)
data.sample = replace_number(data.sample, remove = TRUE)  # remove numerals
data.sample = replace_names(data.sample)         # remove common first and last names
data.sample = str_to_lower(data.sample)          # convert to lower case
# Expand USA/US abbreviations and replace a lone "u" with "you"
data.sample = mgsub(data.sample, c("\\b[Uu]\\.*[Ss]\\.*[Aa]\\.*\\b", "\\b[Uu]\\.+[Ss]\\.*\\b", "\\b[Uu]\\b"),
                    c('United States of America', 'United States', "you"), fixed = F)
data.sample = replace_incomplete(data.sample, ' ')  # replace incomplete-sentence marks (e.g. "...") with a space
data.sample = replace_rating(data.sample)           # replace ratings (e.g. "10 out of 10") with word equivalents
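Before running the full pipeline, the cleaning functions can be tried on a toy vector to check their behaviour. This is a quick illustration only, not part of the original report, and its output is omitted.

toy = c("Visit <b>our</b> site!", "I can't wait for the 2nd game in the U.S.")
toy = replace_html(toy)
toy = replace_contraction(toy)
toy = replace_ordinal(toy)
str_to_lower(toy)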

# Tokenise a character vector and build n-grams of order n
make_token = function(n, text_vector){
        toks = tokens(text_vector, remove_numbers = TRUE, remove_url = TRUE,
                      remove_separators = TRUE, remove_punct = TRUE)
        # Drop non-printable, punctuation-only and whitespace tokens before forming n-grams
        toks = tokens_remove(toks, c('[^[:print:]]', '[[:punct:]]', '\\s+'),
                             valuetype = "regex")
        tokens_ngrams(toks, n = n)
}

N-gram tokens

tokens_one_grams = make_token(1, data.sample) %>% tokens_keep(., min_nchar = 2)
tokens_two_grams = make_token(2, data.sample)
tokens_three_grams = make_token(3, data.sample)
tokens_four_grams = make_token(4, data.sample)

Document-feature matrix

With a clean dataset at hand, let’s build some document-feature matrices (DFM). According to Wikipedia, such a matrix (also known as a document-term matrix) is a “mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.”

# Create one DFM per n-gram order
data.sample.dfm.1.gram = dfm(tokens_one_grams) %>% dfm_remove(stopwords("english"))
data.sample.dfm.2.gram = dfm(tokens_two_grams)
data.sample.dfm.3.gram = dfm(tokens_three_grams)
data.sample.dfm.4.gram = dfm(tokens_four_grams)

Since removing stopwords would make many higher-order n-grams ungrammatical, only the unigrams have stopwords removed.
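As a quick sanity check (not in the original report), the size of each matrix can be inspected, for example with quanteda::nfeat():

nfeat(data.sample.dfm.1.gram)  # number of distinct unigrams
nfeat(data.sample.dfm.4.gram)  # number of distinct four-grams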

Exploratory Analysis

Let’s take a look at some of the most frequent n-gram features.
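The frequency listings below come from a chunk not shown in the report; quanteda::topfeatures() is one way to produce them, sketched here using the DFMs built above.

topfeatures(data.sample.dfm.1.gram, 10)  # top 10 unigrams
topfeatures(data.sample.dfm.2.gram, 10)  # top 10 bigrams
topfeatures(data.sample.dfm.3.gram, 10)  # top 10 trigrams
topfeatures(data.sample.dfm.4.gram, 10)  # top 10 four-grams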

Unigrams

## will  can said  one like just  get time  now know 
## 4235 3202 3082 2963 2685 2559 2307 2217 1886 1700

Bigrams

##  of_the  in_the   it_is    i_am for_the  to_the  on_the  do_not   to_be 
##    4206    3942    3181    3015    2104    2086    1933    1713    1561 
##  i_have 
##    1510

Trigrams

##       i_do_not     one_of_the        it_is_a       a_lot_of       i_am_not 
##            607            354            341            313            311 
##      i_can_not thanks_for_the      it_is_not      i_did_not     there_is_a 
##            280            266            255            229            202

Four-grams

##         i_do_not_know         i_am_going_to       can_not_wait_to 
##                   129                   123                   106 
##        do_not_want_to thanks_for_the_follow        i_do_not_think 
##                    83                    81                    81 
##        the_end_of_the    for_the_first_time        is_going_to_be 
##                    80                    79                    79 
##       i_would_like_to 
##                    77

Prediction Strategy and Plans for the Shiny App

The prediction algorithm will employ an N-gram model to calculate the probability of the next word given the previous words. The Shiny app will be kept simple, consisting of an input field for the text to predict from, a submit button to initiate the prediction, and a word cloud to showcase the predicted words.
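As a rough illustration of the planned approach, the sketch below looks up the most frequent continuations of the last few words and backs off to shorter prefixes when a longer one is unseen. The names predict_next_word and ngram_freq are hypothetical, and the final model will likely add a proper smoothing or back-off scheme (e.g. Stupid Backoff).

# Hypothetical sketch of next-word prediction with simple back-off.
# `ngram_freq` is assumed to be a data frame with columns prefix, next_word
# and count, derived from the n-gram DFMs above.
predict_next_word = function(phrase, ngram_freq, max_order = 4) {
        words = tail(strsplit(tolower(phrase), "\\s+")[[1]], max_order - 1)
        if (length(words) == 0) return(c("the", "to", "and"))
        for (k in length(words):1) {
                prefix = paste(tail(words, k), collapse = "_")
                hits = ngram_freq[ngram_freq$prefix == prefix, ]
                if (nrow(hits) > 0) {
                        ordered = hits[order(hits$count, decreasing = TRUE), ]
                        return(head(ordered$next_word, 3))
                }
        }
        c("the", "to", "and")  # fall back to frequent unigrams if nothing matches
}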