The goal of this milestone report is to present my exploratory data analysis for the Data Science Capstone project. The project is about building a model that predicts the next word(s) from text a user types in. A corpus built from three document sources is used: blogs, news, and Twitter data (each of these files is very large). I have already downloaded these files to a working directory on my computer.
This report consists of three sections, organised as follows:
1 - Data Processing
2 - Building and sampling the Corpus
3 - Building n-gram models & displaying top features
The data for this project comes from the Capstone Dataset.
The data contains 3 text files, which have been downloaded to a working directory. In the code below, we load these files from that directory.
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "rb")
blogtext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.news.txt", "rb")
newstext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "rb")
twittertext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)
In their raw form, the three files are very large. Below are summary statistics of their contents.
##      filename Number_of_words Number_of_lines file_size_MB
## 1   blogs.txt        37546246          899288     200.4242
## 2    news.txt        34762395         1010242     196.2775
## 3 twitter.txt        30093369         2360148     159.3641
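The code that produced these statistics was not shown above; as a minimal sketch of how they could be reproduced (the word-count variables mirror the words_blogs, words_news and words_twitter objects removed from memory later, but the exact original code is an assumption):

# Sketch: word counts via whitespace tokenisation, line counts from the
# vectors read above, and file sizes taken from disk.
words_blogs   <- sum(sapply(strsplit(blogtext, "\\s+"), length))
words_news    <- sum(sapply(strsplit(newstext, "\\s+"), length))
words_twitter <- sum(sapply(strsplit(twittertext, "\\s+"), length))

data.frame(
  filename        = c("blogs.txt", "news.txt", "twitter.txt"),
  Number_of_words = c(words_blogs, words_news, words_twitter),
  Number_of_lines = c(length(blogtext), length(newstext), length(twittertext)),
  file_size_MB    = file.size(file.path("./Coursera-SwiftKey/final/en_US",
                      c("en_US.blogs.txt", "en_US.news.txt",
                        "en_US.twitter.txt"))) / 1024^2
)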
In this section, I build the sample corpus to be used in the project. To do this, I first build three corpora, one for each of the initial text files, and then merge them.
# Build one corpus from each of the 3 source files
myCorpusNews <- corpus(newstext)
myCorpusTwitter <- corpus(twittertext)
myCorpusBlog <- corpus(blogtext)
# Combine the 3 corpora together
mycorpusAll <- myCorpusTwitter + myCorpusBlog + myCorpusNews
# I then take a 20% sample of the total corpus.
set.seed(1234)   # seed added here for reproducibility (not in the original run)
corpusSample <- mycorpusAll[sample(nrow(mycorpusAll$documents),
                                   nrow(mycorpusAll$documents) * 0.2)]
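As an aside, newer quanteda releases provide a dedicated sampling helper; assuming quanteda >= 1.0, the same step could be written as:

# Equivalent sampling with the newer quanteda API
corpusSample <- corpus_sample(mycorpusAll, size = round(0.2 * ndoc(mycorpusAll)))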
Before building our n-gram models, we need to:
1 - clean up memory by removing unneeded data,
2 - build a function to be used for plotting top-feature histograms.
Here we remove all the unneeded datasets from memory. The freed space will be useful during the creation of the 3 n-gram matrices.
rm(mycorpusAll, myCorpusTwitter, myCorpusBlog, myCorpusNews,
   blogtext, twittertext, newstext, words_blogs, words_news, words_twitter)
gc()   # prompt R to return the freed memory
Here I build a function to be used later for plotting top-feature histograms.
# Plot a horizontal histogram of the top N features of a dfm
plotHistTopfeature <- function(mydfm, topNWords, fillColor, title, xlabel, ylabel) {
  topWords <- topfeatures(mydfm, n = topNWords)   # named vector: feature -> frequency
  topWordsDf <- data.frame(Words = names(topWords), Frequency = topWords)
  ggplot(topWordsDf, aes(x = reorder(Words, Frequency), y = Frequency)) +
    geom_bar(stat = "identity", fill = fillColor) +   # bar height = Frequency as given
    coord_flip() + xlab(xlabel) + ylab(ylabel) + ggtitle(title)
}
In this section, we follow the same strategy for each of the three n-gram models: build the n-gram document-feature matrix (dfm), display the corresponding word cloud, and plot the top 20 features.
# Unigram (1-gram) dfm; this uses the older quanteda dfm() arguments
myUnigrams <- dfm(corpusSample,
                  ngrams = 1,
                  toLower = TRUE,
                  removeNumbers = TRUE,
                  concatenator = " ",
                  removePunct = TRUE,
                  removeSeparators = TRUE,
                  ignoredFeatures = stopwords("english"),
                  stem = FALSE)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 853,935 documents
## ... indexing features: 322,942 feature types
## ... removed 174 features, from 174 supplied (glob) feature types
## ... created a 853935 x 322768 sparse dfm
## ... complete.
## Elapsed time: 129.1 seconds.
Plot of the 1-gram word cloud
Top 20 features of the 1-gram
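The plots above are generated from myUnigrams. As a sketch of how they could be produced (assuming the current quanteda.textplots package, which post-dates the quanteda version used in this report; older versions drew word clouds with plot() on a dfm), reusing the plotHistTopfeature helper defined earlier:

# Word cloud of the 100 most frequent unigrams
library(quanteda.textplots)
textplot_wordcloud(myUnigrams, max_words = 100)

# Histogram of the top 20 unigrams
plotHistTopfeature(myUnigrams, 20, "steelblue",
                   "Top 20 unigrams", "Words", "Frequency")

The same two calls, with myBigrams and myTrigrams substituted, produce the plots for the 2-gram and 3-gram sections below.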
# Bigram (2-gram) dfm, built with the same options
myBigrams <- dfm(corpusSample,
                 ngrams = 2,
                 toLower = TRUE,
                 removeNumbers = TRUE,
                 concatenator = " ",
                 removePunct = TRUE,
                 removeSeparators = TRUE,
                 ignoredFeatures = stopwords("english"),
                 stem = FALSE)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 853,935 documents
## ... indexing features: 4,476,016 feature types
## ... removed 1,713,045 features, from 174 supplied (glob) feature types
## ... created a 853935 x 2762971 sparse dfm
## ... complete.
## Elapsed time: 356.93 seconds.
Plot of the 2-gram word cloud
Top 20 features of the 2-gram
# Trigram (3-gram) dfm, built with the same options
myTrigrams <- dfm(corpusSample,
                  ngrams = 3,
                  toLower = TRUE,
                  removeNumbers = TRUE,
                  concatenator = " ",
                  removePunct = TRUE,
                  removeSeparators = TRUE,
                  ignoredFeatures = stopwords("english"),
                  stem = FALSE)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 853,935 documents
## ... indexing features: 11,438,955 feature types
## ... removed 9,397,718 features, from 174 supplied (glob) feature types
## ... created a 853935 x 2041237 sparse dfm
## ... complete.
## Elapsed time: 655.14 seconds.
Plot of the 3-gram word cloud
Top 20 features of the 3-gram
In this milestone report, I have completed the first step toward building the next-word prediction model: I have built three n-gram models (1-grams, 2-grams and 3-grams). In the next steps, I will use these n-gram models to:
1 - find a strategy to remove profanity words,
2 - build the prediction model,
3 - develop the final ShinyApp data product.
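To illustrate the direction for the prediction model, here is a minimal sketch of how the n-gram frequencies could drive next-word prediction via a simple frequency-based backoff. All names here (predictNextWord, trigramFreq, bigramFreq) are hypothetical and not part of this report's code:

# Hypothetical sketch: predict the next word from the last two typed words by
# looking them up in trigram counts, backing off to bigrams when needed.
predictNextWord <- function(lastTwoWords, trigramFreq, bigramFreq) {
  # trigramFreq / bigramFreq: named numeric vectors of n-gram frequencies,
  # e.g. as returned by topfeatures(), like c("thanks for the" = 5432, ...)
  hits <- trigramFreq[grepl(paste0("^", lastTwoWords, " "), names(trigramFreq))]
  if (length(hits) == 0) {                        # back off to bigrams
    lastWord <- tail(strsplit(lastTwoWords, " ")[[1]], 1)
    hits <- bigramFreq[grepl(paste0("^", lastWord, " "), names(bigramFreq))]
  }
  if (length(hits) == 0) return(NA_character_)    # no match at any level
  best <- names(which.max(hits))                  # most frequent continuation
  tail(strsplit(best, " ")[[1]], 1)               # return only the predicted word
}

For example, predictNextWord("thanks for", myTrigramFreq, myBigramFreq) would return the word that most often follows "thanks for" in the sampled corpus.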