Data Science Capstone: Milestone Report

Maxime Vergès, 13th May 2019

Synopsis

This is the week 2 milestone report for the Data Science Capstone of the Coursera Data Science Specialization. The goal of the capstone is to develop a prediction algorithm that suggests the next word in a sequence of words. This report shows how the data were obtained, cleaned and processed, and presents an exploratory analysis of the main patterns.


Dataset

The data for this project come from the Coursera-SwiftKey corpus provided with the course; only the English (en_US) files are used here.
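A minimal sketch of how the corpus can be downloaded and unpacked (the URL is the one distributed with the course material and is assumed to still be valid):

# download and unzip the Coursera-SwiftKey corpus (skipped if already present)
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}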


Loading the datasets and the packages

We load the three files as follows:

setwd("C:/Users/maxim/Desktop/Coursera-SwiftKey/final/en_US")
blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

The packages rJava, knitr, NLP, tm, RWekajars, RWeka, ggplot2, stringi, RColorBrewer, wordcloud, ngram, slam, htmlTable, xtable and dplyr have to be loaded.
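For reference, they can be loaded as below (assuming they are already installed):

library(rJava)
library(knitr)
library(NLP)
library(tm)
library(RWekajars)
library(RWeka)
library(ggplot2)
library(stringi)
library(RColorBrewer)
library(wordcloud)
library(ngram)
library(slam)
library(htmlTable)
library(xtable)
library(dplyr)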


Overview

An overview is needed to understand the study, so key information has been summarized for the three datasets (blogs, news and twitter). The table below gives, for each dataset, the file name, the size in memory (MB), the number of lines, the number of characters and the length of the longest entry.

overview <- data.frame(
  file_name=c("blogs","news","twitter"),
  "file_size" = sapply(list(blogs, news, twitter), function(x){format(object.size(x),"MB")}),
  'number_of_lines' = sapply(list(blogs, news, twitter), function(x){length(x)}),
  'number_of_characters' = sapply(list(blogs, news, twitter), function(x){sum(nchar(x))}),
  'longest_entry' = sapply(list(blogs, news, twitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))})
)
kable(overview,caption = "the main datasets")
the main datasets

file_name   file_size   number_of_lines   number_of_characters   longest_entry
blogs       255.4 Mb            899288             206824505            40833
news        19.8 Mb              77259              15639408             5760
twitter     319 Mb             2360148             162096031              140

Data cleaning and corpus processing

Because each dataset is very large, we work on a sample of 0.5% of each dataset to create a corpus. The corpus is then cleaned by removing non-ASCII characters, punctuation, numbers and superfluous white space, converting all words to lowercase and converting the documents to plain text format.

#we make the study reproducible
set.seed(12345)
b_subset <- sample(blogs, length(blogs) * 0.005)
n_subset <- sample(news, length(news) * 0.005)
t_subset <- sample(twitter, length(twitter) * 0.005)

#non ASCII characters have to be removed
blogs_subset <- iconv(b_subset, "UTF-8", "ASCII", sub="")
news_subset <- iconv(n_subset, "UTF-8", "ASCII", sub="")
twitter_subset <- iconv(t_subset, "UTF-8", "ASCII", sub="")
data_subset <- c(blogs_subset,news_subset,twitter_subset)

# function returning the cleaned corpus built from a character vector
corpus_processing <- function(x = data_subset) {
  object <- VCorpus(VectorSource(x))          # build the corpus
  object <- tm_map(object, tolower)           # convert to lowercase
  object <- tm_map(object, stripWhitespace)   # remove extra white space
  object <- tm_map(object, removeNumbers)     # remove numbers
  object <- tm_map(object, removePunctuation) # remove punctuation
  object <- tm_map(object, PlainTextDocument) # back to plain text documents
  object
}
corpus <- corpus_processing(data_subset)

N-grams

The RWeka package provides the NGramTokenizer used to tokenize the sample, and the tm package then builds term-document matrices of unigrams, bigrams and trigrams.

# tokenizers for unigrams, bigrams and trigrams
corpus_uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
corpus_bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
corpus_tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# term-document matrices for each n-gram order
corpus_uni_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = corpus_uni_tokenizer))
corpus_bi_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = corpus_bi_tokenizer))
corpus_tri_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = corpus_tri_tokenizer))

# keep only the n-grams that appear at least 10 times
corpus_uni <- findFreqTerms(corpus_uni_matrix, lowfreq = 10)
corpus_bi <- findFreqTerms(corpus_bi_matrix, lowfreq = 10)
corpus_tri <- findFreqTerms(corpus_tri_matrix, lowfreq = 10)

# frequency data frames (n-gram and its count) used in the exploratory analysis
corpus_uni_frame <- rowSums(as.matrix(corpus_uni_matrix[corpus_uni,]))
corpus_uni_frame <- data.frame(word=names(corpus_uni_frame), frequency=corpus_uni_frame)
corpus_bi_frame <- rowSums(as.matrix(corpus_bi_matrix[corpus_bi,]))
corpus_bi_frame <- data.frame(word=names(corpus_bi_frame), frequency=corpus_bi_frame)
corpus_tri_frame <- rowSums(as.matrix(corpus_tri_matrix[corpus_tri,]))
corpus_tri_frame <- data.frame(word=names(corpus_tri_frame), frequency=corpus_tri_frame)

Exploratory data analysis

First, we plot the top 20 unigrams, 2-grams and 3-grams, which gives a good first look at the data.

graph_uni <- ggplot(data = corpus_uni_frame[order(-corpus_uni_frame$frequency),][1:20,], aes(x = reorder(word, -frequency), y = frequency))+ 
  geom_bar(stat="identity", fill = "darkred", colour = "black", width = 1.1) + 
  labs(x = "unigrams", y = "Frequency", title = "Top 20 of unigrams") + 
  theme(axis.text.x=element_text(angle=90))

graph_uni

graph_bi <- ggplot(data = corpus_bi_frame[order(-corpus_bi_frame$frequency),][1:20,], aes(x = reorder(word, -frequency), y = frequency))+ 
  geom_bar(stat="identity", fill = "darkred", colour = "black", width = 1.1) + 
  labs(x = "2-grams", y = "Frequency", title = "Top 20 of 2-grams") + 
  theme(axis.text.x=element_text(angle=90))

graph_bi

graph_tri <- ggplot(data = corpus_tri_frame[order(-corpus_tri_frame$frequency),][1:20,], aes(x = reorder(word, -frequency), y = frequency))+ 
  geom_bar(stat="identity", fill = "darkred", colour = "black", width = 1.1) + 
  labs(x = "3-grams", y = "Frequency", title = "Top 20 of 3-grams") + 
  theme(axis.text.x=element_text(angle=90))

graph_tri

Word clouds give a more visual view of the same frequencies.

wordcloud(corpus_uni_frame$word, corpus_uni_frame$frequency, scale = c(3,1), max.words = 50, random.order = FALSE, rot.per = 0, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(12, "Paired"))

wordcloud(corpus_bi_frame$word, corpus_bi_frame$frequency, scale = c(3,1), max.words = 50, random.order = FALSE, rot.per = 0, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(12, "Paired"))

wordcloud(corpus_tri_frame$word, corpus_tri_frame$frequency, scale = c(3,1), max.words = 50, random.order = FALSE, rot.per = 0, fixed.asp = TRUE, use.r.layout = FALSE, colors = brewer.pal(12, "Paired"))

Future work

  • Create a prediction algorithm to get the next word in a sequence of words (a first sketch is given after this list)
  • Create a Shiny app around that algorithm
  • Optimize the method, as it is currently time-consuming
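As a first idea for the prediction step, the frequency tables built above can feed a very simple backoff scheme: look for the most frequent trigram starting with the last two words of the input, fall back to bigrams, then to the most frequent unigram. A minimal sketch, where the function name predict_next_word is hypothetical and only illustrates the idea, not the final method:

# naive backoff: trigram table first, then bigram table, then top unigram
# (assumes the input phrase is already lowercase with no punctuation)
predict_next_word <- function(phrase,
                              tri = corpus_tri_frame,
                              bi = corpus_bi_frame,
                              uni = corpus_uni_frame) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    # trigrams starting with the last two words
    prefix <- paste(words[n - 1], words[n])
    hits <- tri[grepl(paste0("^", prefix, " "), tri$word), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$word[which.max(hits$frequency)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  if (n >= 1) {
    # bigrams starting with the last word
    hits <- bi[grepl(paste0("^", words[n], " "), bi$word), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$word[which.max(hits$frequency)])
      return(tail(unlist(strsplit(best, " ")), 1))
    }
  }
  # last resort: the most frequent unigram in the sample
  as.character(uni$word[which.max(uni$frequency)])
}

# example call; the result depends on the sampled corpus
predict_next_word("thanks for the")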