This is a milestone report for the second week of the course ‘Data Science Capstone’ in the Coursera Data Science Specialisation.
As part of the project, this report demonstrates:
* Getting and cleaning the data
* Exploratory analysis
* An overview of what the prediction algorithm will look like
The dataset is the Coursera-SwiftKey corpus provided in the course, made up of English-language text collected from blogs, news articles, and Twitter.
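The report relies on the following packages; the original presumably loads them in a setup chunk that is not shown, so the list here is inferred from the function calls used below:
library(quanteda)   #tokens(), tokens_ngrams(), tokens_remove(), dfm(), stopwords(), topfeatures()
library(textclean)  #replace_html(), replace_contraction(), mgsub(), and the other replace_*() helpers
library(stringr)    #str_to_lower()
library(magrittr)   #the %>% pipe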
connect.blogs <- file('final/en_US/en_US.blogs.txt','rb')
connect.news <- file('final/en_US/en_US.news.txt','rb')
connect.twitter <- file('final/en_US/en_US.twitter.txt','rb')
blogs = readLines(connect.blogs, skipNul = TRUE, encoding="UTF-8")
news = readLines(connect.news, skipNul = TRUE, encoding="UTF-8")
twitter = readLines(connect.twitter, skipNul = TRUE, encoding="UTF-8")
close(connect.blogs)
close(connect.news)
close(connect.twitter)
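The chunk that produced the summary table below is not shown in the report; a sketch of one way to compute it, assuming file.size() for file sizes and stringi::stri_count_words() for word counts:
library(stringi)
file.names <- c(blogs   = 'final/en_US/en_US.blogs.txt',
                twitter = 'final/en_US/en_US.twitter.txt',
                news    = 'final/en_US/en_US.news.txt')
data.frame(File         = names(file.names),
           File.Size.MB = file.size(file.names) / 1024^2,
           Lines        = c(length(blogs), length(twitter), length(news)),
           Words        = c(sum(stri_count_words(blogs)),
                            sum(stri_count_words(twitter)),
                            sum(stri_count_words(news))))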
## File File.Size.MB Lines Words
## 1 blogs 248.4935 899288 37570839
## 2 twitter 249.6329 1010242 34494539
## 3 news 301.3969 2360148 30451170
Together the three files amount to roughly 799.5 MB, 4,269,678 lines, and 102,516,548 words.
Since the dataset consists of more than 4 million lines, sampling is needed for reasonable processing time. Here we sample 10% of the data to work with.
#Take a 10% random sample from each source, then shuffle the combined sample
#(calling set.seed() beforehand would make the sample reproducible)
data.sample <- sample(c(sample(blogs, length(blogs) * .1),
                        sample(twitter, length(twitter) * .1),
                        sample(news, length(news) * .1)))
The data come from three different sources, each with a slightly different format, so it is worth tidying the text up before visualisation and modelling.
#Clean data.sample and make a tokenised version of it
#Profanity filter
bad_words = readLines("bad_words.txt")
#Drop any word containing a profanity term; the term list is applied in three chunks
#to keep each regular expression at a manageable length
data.sample = gsub(paste0('\\s*\\w*', bad_words[1:300], '\\w*\\s*', collapse = '|'), ' ', data.sample)
data.sample = gsub(paste0('\\s*\\w*', bad_words[301:600], '\\w*\\s*', collapse = '|'), ' ', data.sample)
data.sample = gsub(paste0('\\s*\\w*', bad_words[601:933], '\\w*\\s*', collapse = '|'), ' ', data.sample)
data.sample = replace_html(data.sample)      #strip HTML tags and symbols
data.sample = replace_non_ascii(data.sample) #strip non-ASCII characters
data.sample = replace_contraction(data.sample)           #expand contractions (e.g. "isn't" -> "is not")
data.sample = replace_ordinal(data.sample)               #spell out ordinals (e.g. "1st" -> "first")
data.sample = replace_number(data.sample, remove = TRUE) #remove numbers
data.sample = replace_names(data.sample)                 #remove common first/last names
data.sample = str_to_lower(data.sample)                  #convert to lower case
data.sample = mgsub(data.sample, c("\\b[Uu]\\.*[Ss]\\.*[Aa]\\.*\\b", "\\b[Uu]\\.+[Ss]\\.*\\b","\\b[Uu]\\b"),
                    c('United States of America', 'United States', "you"), fixed = F) #normalise USA/U.S. and "u"
data.sample = replace_incomplete(data.sample, ' ')       #replace incomplete-sentence end marks (e.g. "...")
data.sample = replace_rating(data.sample)                #replace ratings with word equivalents
#Tokenise and build n-grams (newer quanteda versions use tokens_ngrams() rather than
#an ngrams argument to tokens())
make_token = function(n, text_vector){
        data.sample.token = tokens(text_vector, remove_numbers = TRUE, remove_url = TRUE,
                                   remove_separators = TRUE, remove_punct = TRUE)
        data.sample.token = tokens_ngrams(data.sample.token, n = n)
        #Drop leftover non-printable characters, punctuation, and whitespace (regex patterns)
        tokens_remove(data.sample.token, c('[^[:print:]]', "[[:punct:]]", '\\s+'),
                      valuetype = "regex")
}
tokens_one_grams = make_token(1, data.sample) %>% tokens_keep(., min_nchar = 2)
tokens_two_grams = make_token(2, data.sample)
tokens_three_grams = make_token(3, data.sample)
tokens_four_grams = make_token(4, data.sample)
With a clean dataset at hand, let’s build some document-feature matrices (DFMs). According to Wikipedia, such a matrix (there called a document-term matrix) is a “mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.”
#Create n-gram dfm
data.sample.dfm.1.gram = dfm(tokens_remove(tokens_one_grams, stopwords("english")))
data.sample.dfm.2.gram = dfm(tokens_two_grams)
data.sample.dfm.3.gram = dfm(tokens_three_grams)
data.sample.dfm.4.gram = dfm(tokens_four_grams)
Since removing stopwords would make many n-grams ungrammatical, only the unigrams have stopwords removed.
Let’s take a look at some popular n-gram features.
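The frequency counts shown below come from the DFMs; the exact chunk is not shown in the report, but quanteda’s topfeatures() yields the top features of each matrix along these lines:
topfeatures(data.sample.dfm.1.gram, 10)
topfeatures(data.sample.dfm.2.gram, 10)
topfeatures(data.sample.dfm.3.gram, 10)
topfeatures(data.sample.dfm.4.gram, 10)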
## Top unigrams:
##   will (4235), can (3202), said (3082), one (2963), like (2685),
##   just (2559), get (2307), time (2217), now (1886), know (1700)
## Top bigrams:
##   of_the (4206), in_the (3942), it_is (3181), i_am (3015), for_the (2104),
##   to_the (2086), on_the (1933), do_not (1713), to_be (1561), i_have (1510)
## Top trigrams:
##   i_do_not (607), one_of_the (354), it_is_a (341), a_lot_of (313), i_am_not (311),
##   i_can_not (280), thanks_for_the (266), it_is_not (255), i_did_not (229), there_is_a (202)
## Top 4-grams:
##   i_do_not_know (129), i_am_going_to (123), can_not_wait_to (106), do_not_want_to (83),
##   thanks_for_the_follow (81), i_do_not_think (81), the_end_of_the (80),
##   for_the_first_time (79), is_going_to_be (79), i_would_like_to (77)
The prediction algorithm will employ an N-gram model to estimate the probability of the next word given the preceding words. The Shiny app will be simple, consisting of an input field for the text to complete, a submit button to trigger the prediction, and a word cloud to showcase the predictions.
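The model itself is left for the next stage, but a minimal sketch of the intended idea, a frequency lookup with back-off over the n-gram tables built above, might look as follows (the function name predict_next_word and its arguments are placeholders, not the final implementation):
#Named frequency vectors from the bigram and trigram DFMs built above
freq2 <- colSums(data.sample.dfm.2.gram)   #e.g. "of_the" = 4206
freq3 <- colSums(data.sample.dfm.3.gram)   #e.g. "one_of_the" = 354

predict_next_word <- function(phrase, n_suggestions = 3) {
        words <- unlist(strsplit(tolower(phrase), "\\s+"))
        #Try the trigram table first: use the last two words as a prefix
        if (length(words) >= 2) {
                prefix <- paste0("^", paste(tail(words, 2), collapse = "_"), "_")
                hits <- freq3[grepl(prefix, names(freq3))]
                if (length(hits) > 0) {
                        top <- names(sort(hits, decreasing = TRUE))[seq_len(min(n_suggestions, length(hits)))]
                        return(sub(".*_", "", top))   #keep only the predicted (final) token
                }
        }
        #Back off to the bigram table: use the last word as a prefix
        prefix <- paste0("^", tail(words, 1), "_")
        hits <- freq2[grepl(prefix, names(freq2))]
        top <- names(sort(hits, decreasing = TRUE))[seq_len(min(n_suggestions, length(hits)))]
        sub(".*_", "", top)
}

predict_next_word("one of")   #likely suggests "the" first (cf. one_of_the above)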