Executive Summary

The goal of this Milestone Report is to present my exploratory data analysis in the context of the Data Science Capstone project. The project itself is about building a model that predicts the next word(s) from text typed in by the user. In this project, a corpus built from three document sources is used: blog, news, and Twitter data (each of these files is very large). I have already placed these files in a working directory on my computer.

This report consists of three sections, organised as follows:

1 - Data Processing

2 - Building and sampling the Corpus

3 - Building n-gram models & displaying top features

Data processing

Loading the data from the working directory

The data for this project come from the Capstone Dataset.

The dataset contains three text files, which have been downloaded to a working directory. In the code below we load these files from the working directory.

# Read the blogs file; opening the connection in binary mode ("rb") prevents
# readLines() from stopping early at embedded control characters
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "rb")
blogtext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)

# Read the news file
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.news.txt", "rb")
newstext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)

# Read the twitter file
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "rb")
twittertext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)

Exploring the entire data set

In raw form, each of the three files is very large. Below are summary statistics for the content of these files.

##      filename Number_of_words Number_of_lines file_size_MB
## 1   blogs.txt        37546246          899288     200.4242
## 2    news.txt        34762395         1010242     196.2775
## 3 twitter.txt        30093369         2360148     159.3641
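
The code that produced this summary is not shown in the report; below is a minimal sketch of how it could be computed, assuming the stringi package is available (the per-file word-count vectors words_blogs, words_news and words_twitter are the objects removed again in the memory clean-up step later on).

library(stringi)

# Word counts per line for each source (illustrative sketch)
words_blogs   <- stri_count_words(blogtext)
words_news    <- stri_count_words(newstext)
words_twitter <- stri_count_words(twittertext)

data.frame(filename        = c("blogs.txt", "news.txt", "twitter.txt"),
           Number_of_words = c(sum(words_blogs), sum(words_news), sum(words_twitter)),
           Number_of_lines = c(length(blogtext), length(newstext), length(twittertext)),
           file_size_MB    = file.size(c("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt",
                                         "./Coursera-SwiftKey/final/en_US/en_US.news.txt",
                                         "./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")) / 1024^2)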

Building and sampling the Corpus

In this section, I build the sample corpus to be used in the project. To do this, I first build three corpora, one for each of the initial text files, then merge them into a single corpus.

# The corpus() and dfm() functions used below come from the quanteda package
library(quanteda)

# Building 3 corpora, one from each of the 3 source files
myCorpusNews <- corpus(newstext)
myCorpusTwitter <- corpus(twittertext)
myCorpusBlog <- corpus(blogtext)

# Adding the 3 corpora together
mycorpusAll <- myCorpusTwitter + myCorpusBlog + myCorpusNews

# I then take  a 20% sample of the total corpus. 

corpusSample <- mycorpusAll[sample(nrow(mycorpusAll$documents),nrow(mycorpusAll$documents)*0.2)]
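
Note that sample() is random, so the exact 20% of documents selected will differ between runs. Fixing the seed immediately before the sampling call above (not done in the original code) would make the sample, and hence the figures below, reproducible.

set.seed(1234)  # illustrative value; any fixed seed makes the 20% sample reproducible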

Building n-gram models, word clouds & displaying top features

Before building our n-gram models, we need to free up memory and prepare a plotting helper.

Cleaning the memory

Here we remove all the unneeded objects from memory. The freed space will be useful during the creation of the three n-grams.

rm(mycorpusAll, myCorpusTwitter, myCorpusBlog, myCorpusNews, blogtext, twittertext, newstext,words_blogs, words_news,words_twitter)
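
Optionally (not part of the original code), an explicit garbage collection after rm() asks R to actually release the freed memory before the memory-hungry dfm() calls that follow.

gc()  # release the memory freed by rm() before building the n-gram dfms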

Building a top features histogram plotting function

Here I build a function that will be used later to plot the top-features histograms; an example call is shown just after the definition.

library(ggplot2)

# Plot a horizontal bar chart of the top N features of a dfm
plotHistTopfeature <- function(mydfm, topNWords, fillColor, title, xlabel, ylabel) {
  # Extract the N most frequent features and put them in a data frame
  topNWords <- topfeatures(mydfm, n=topNWords)
  topNWordsDf <- data.frame(Words=names(topNWords), Frequency=topNWords)
  # Bar chart ordered by frequency, flipped so the words read horizontally
  histPlot <- ggplot(topNWordsDf, aes(x=reorder(Words, Frequency), y=Frequency)) + geom_bar(stat="identity", fill=fillColor) +
    coord_flip() + xlab(xlabel) + ylab(ylabel) + ggtitle(title)
  histPlot
}
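
For illustration, once a dfm such as myUnigrams has been built below, its 20 most frequent features could be plotted with a call along these lines (the colour and labels are only examples):

plotHistTopfeature(myUnigrams, 20, "steelblue", "Top 20 unigrams", "Unigrams", "Frequency")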

Building the unigram model

In this section, we follow the same strategy for each of the three n-grams: we build the n-gram document-feature matrix (dfm), display the corresponding word cloud, and then plot the top 20 features.

Building the 1-gram model

myUnigrams <- dfm(corpusSample,
           ngrams=1,
           toLower=T,
           removeNumbers=T,
           concatenator=" ", 
           removePunct=T,
           removeSeparators = T,
           ignoredFeatures= stopwords("english"),
           stem=F)
## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 853,935 documents
##    ... indexing features: 322,942 feature types
##    ... removed 174 features, from 174 supplied (glob) feature types
##    ... created a 853935 x 322768 sparse dfm
##    ... complete. 
## Elapsed time: 129.1 seconds.

1-gram word cloud

Plot of the 1-gram word cloud

1-gram top 20 features

Top 20 features of the 1-gram
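
The plotting code for these figures is not included in the report. A minimal sketch of the word cloud, assuming the word-cloud plotting function available in the installed quanteda version (the top-20 bar chart would come from the plotHistTopfeature() call illustrated earlier, and the 2-gram and 3-gram figures below follow the same pattern):

# Word cloud of the most frequent unigrams; the function name depends on the quanteda version
textplot_wordcloud(myUnigrams)   # newer quanteda (now in the quanteda.textplots package)
# plot(myUnigrams)               # older quanteda versions drew the word cloud via plot()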

Building the bigram model

Building the 2-gram model

myBigrams <- dfm(corpusSample,
            ngrams=2,
            toLower=T,
            removeNumbers=T,
            concatenator=" ", 
            removePunct=T,
            removeSeparators = T,
            ignoredFeatures= stopwords("english"),
            stem=F)
## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 853,935 documents
##    ... indexing features: 4,476,016 feature types
##    ... removed 1,713,045 features, from 174 supplied (glob) feature types
##    ... created a 853935 x 2762971 sparse dfm
##    ... complete. 
## Elapsed time: 356.93 seconds.

2-gram word cloud

Plot of the 2-gram word cloud

2-gram top 20 features

Top 20 features of the 2-gram

Building the 3-gram model

myTrigrams <- dfm(corpusSample,
            ngrams=3,
            toLower=T,
            removeNumbers=T,
            concatenator=" ", 
            removePunct=T,
            removeSeparators = T,
            ignoredFeatures= stopwords("english"),
            stem=F)
## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 853,935 documents
##    ... indexing features: 11,438,955 feature types
##    ... removed 9,397,718 features, from 174 supplied (glob) feature types
##    ... created a 853935 x 2041237 sparse dfm
##    ... complete. 
## Elapsed time: 655.14 seconds.

3-gram word cloud

Plot of the 3-gram word cloud

3-gram top 20 features

Top 20 features of the 3-gram

Conclusion

In this milestone report I have completed the first step toward building the next-word prediction model. I have built three n-gram models (1-grams, 2-grams and 3-grams). In the next step, I will use these n-gram models to build the algorithm that predicts the next word(s) from text typed in by the user.