The goal of this milestone report is to present my exploratory data analysis for the Data Science Capstone project. The project is about building a model that predicts the next word(s) from text a user types in. A corpus built from three document sources is used: blogs, news, and Twitter data (each of these files is very large). I have already downloaded these files to a working directory on my computer.
This report consists of three sections, organised as follows:
1 - Data Processing
2 - Building and sampling the Corpus
3 - Building n-gram models & displaying top features
The data for this project comes from the Capstone Dataset.
The data contains 3 text files, which have been downloaded to a working directory. In the code below, we load these files from that directory.
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "rb")
blogtext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.news.txt", "rb")
newstext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)
inputfile <- file("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "rb")
twittertext <- readLines(inputfile, encoding="UTF-8")
close(inputfile)
In their raw form, the three files are very large. Below are summary statistics of their contents.
##      filename Number_of_words Number_of_lines file_size_MB
## 1   blogs.txt        37546246          899288     200.4242
## 2    news.txt        34762395         1010242     196.2775
## 3 twitter.txt        30093369         2360148     159.3641
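The code that produced these statistics was not shown above; as a minimal sketch of how they could be reproduced (the word-count variables mirror the words_blogs, words_news and words_twitter objects removed from memory later, but the exact original code is an assumption):

# Sketch: word counts via whitespace tokenisation, line counts from the
# vectors read above, and file sizes taken from disk.
words_blogs   <- sum(sapply(strsplit(blogtext, "\\s+"), length))
words_news    <- sum(sapply(strsplit(newstext, "\\s+"), length))
words_twitter <- sum(sapply(strsplit(twittertext, "\\s+"), length))

data.frame(
  filename        = c("blogs.txt", "news.txt", "twitter.txt"),
  Number_of_words = c(words_blogs, words_news, words_twitter),
  Number_of_lines = c(length(blogtext), length(newstext), length(twittertext)),
  file_size_MB    = file.size(file.path("./Coursera-SwiftKey/final/en_US",
                      c("en_US.blogs.txt", "en_US.news.txt",
                        "en_US.twitter.txt"))) / 1024^2
)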
In this section, I build the sample corpus to be used in the project. To do this, I first build three corpora, one for each of the initial text files, and then merge them.
# Build one corpus from each of the 3 source files
myCorpusNews <- corpus(newstext)
myCorpusTwitter <- corpus(twittertext)
myCorpusBlog <- corpus(blogtext)
# Combine the 3 corpora together
mycorpusAll <- myCorpusTwitter + myCorpusBlog + myCorpusNews
# I then take a 20% sample of the total corpus.
set.seed(1234)   # seed added here for reproducibility (not in the original run)
corpusSample <- mycorpusAll[sample(nrow(mycorpusAll$documents),
                                   nrow(mycorpusAll$documents) * 0.2)]
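As an aside, newer quanteda releases provide a dedicated sampling helper; assuming quanteda >= 1.0, the same step could be written as:

# Equivalent sampling with the newer quanteda API
corpusSample <- corpus_sample(mycorpusAll, size = round(0.2 * ndoc(mycorpusAll)))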
Before building our n-gram models, we need to:
1 - clean up memory by removing unneeded data,
2 - build a function to be used for plotting top-feature histograms.
Here we remove all the unneeded datasets from memory. The freed space will be useful during the creation of the 3 n-gram matrices.
rm(mycorpusAll, myCorpusTwitter, myCorpusBlog, myCorpusNews,
   blogtext, twittertext, newstext, words_blogs, words_news, words_twitter)
gc()   # prompt R to return the freed memory
Here I build a function to be used later for plotting top-feature histograms.
# Plot a horizontal histogram of the top N features of a dfm
plotHistTopfeature <- function(mydfm, topNWords, fillColor, title, xlabel, ylabel) {
  topWords <- topfeatures(mydfm, n = topNWords)   # named vector: feature -> frequency
  topWordsDf <- data.frame(Words = names(topWords), Frequency = topWords)
  ggplot(topWordsDf, aes(x = reorder(Words, Frequency), y = Frequency)) +
    geom_bar(stat = "identity", fill = fillColor) +   # bar height = Frequency as given
    coord_flip() + xlab(xlabel) + ylab(ylabel) + ggtitle(title)
}
In this section, we follow the same strategy for each of the three n-gram models: build the n-gram document-feature matrix (dfm), display the corresponding word cloud, and plot the top 20 features.
# Unigram (1-gram) dfm; this uses the older quanteda dfm() arguments
myUnigrams <- dfm(corpusSample,
                  ngrams = 1,
                  toLower = TRUE,
                  removeNumbers = TRUE,
                  concatenator = " ",
                  removePunct = TRUE,
                  removeSeparators = TRUE,
                  ignoredFeatures = stopwords("english"),
                  stem = FALSE)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 853,935 documents
## ... indexing features: 322,942 feature types
## ... removed 174 features, from 174 supplied (glob) feature types
## ... created a 853935 x 322768 sparse dfm
## ... complete.
## Elapsed time: 129.1 seconds.
Plot of the 1-gram word cloud
Top 20 features of the 1-gram
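The plots above are generated from myUnigrams. As a sketch of how they could be produced (assuming the current quanteda.textplots package, which post-dates the quanteda version used in this report; older versions drew word clouds with plot() on a dfm), reusing the plotHistTopfeature helper defined earlier:

# Word cloud of the 100 most frequent unigrams
library(quanteda.textplots)
textplot_wordcloud(myUnigrams, max_words = 100)

# Histogram of the top 20 unigrams
plotHistTopfeature(myUnigrams, 20, "steelblue",
                   "Top 20 unigrams", "Words", "Frequency")

The same two calls, with myBigrams and myTrigrams substituted, produce the plots for the 2-gram and 3-gram sections below.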
# Bigram (2-gram) dfm, built with the same options
myBigrams <- dfm(corpusSample,
                 ngrams = 2,
                 toLower = TRUE,
                 removeNumbers = TRUE,
                 concatenator = " ",
                 removePunct = TRUE,
                 removeSeparators = TRUE,
                 ignoredFeatures = stopwords("english"),
                 stem = FALSE)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 853,935 documents
## ... indexing features: 4,476,016 feature types
## ... removed 1,713,045 features, from 174 supplied (glob) feature types
## ... created a 853935 x 2762971 sparse dfm
## ... complete.
## Elapsed time: 356.93 seconds.
Plot of the 2-gram word cloud
Top 20 features of the 2-gram
# Trigram (3-gram) dfm, built with the same options
myTrigrams <- dfm(corpusSample,
                  ngrams = 3,
                  toLower = TRUE,
                  removeNumbers = TRUE,
                  concatenator = " ",
                  removePunct = TRUE,
                  removeSeparators = TRUE,
                  ignoredFeatures = stopwords("english"),
                  stem = FALSE)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 853,935 documents
## ... indexing features: 11,438,955 feature types
## ... removed 9,397,718 features, from 174 supplied (glob) feature types
## ... created a 853935 x 2041237 sparse dfm
## ... complete.
## Elapsed time: 655.14 seconds.
Plot of the 3-gram word cloud
Top 20 features of the 3-gram
In this milestone report, I have completed the first step toward building the next-word prediction model: I have built three n-gram models (1-grams, 2-grams and 3-grams). In the next steps, I will use these n-gram models to:
1 - find a strategy to remove profanity words,
2 - build the prediction model,
3 - develop the final ShinyApp data product.
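To illustrate the direction for the prediction model, here is a minimal sketch of how the n-gram frequencies could drive next-word prediction via a simple frequency-based backoff. All names here (predictNextWord, trigramFreq, bigramFreq) are hypothetical and not part of this report's code:

# Hypothetical sketch: predict the next word from the last two typed words by
# looking them up in trigram counts, backing off to bigrams when needed.
predictNextWord <- function(lastTwoWords, trigramFreq, bigramFreq) {
  # trigramFreq / bigramFreq: named numeric vectors of n-gram frequencies,
  # e.g. as returned by topfeatures(), like c("thanks for the" = 5432, ...)
  hits <- trigramFreq[grepl(paste0("^", lastTwoWords, " "), names(trigramFreq))]
  if (length(hits) == 0) {                        # back off to bigrams
    lastWord <- tail(strsplit(lastTwoWords, " ")[[1]], 1)
    hits <- bigramFreq[grepl(paste0("^", lastWord, " "), names(bigramFreq))]
  }
  if (length(hits) == 0) return(NA_character_)    # no match at any level
  best <- names(which.max(hits))                  # most frequent continuation
  tail(strsplit(best, " ")[[1]], 1)               # return only the predicted word
}

For example, predictNextWord("thanks for", myTrigramFreq, myBigramFreq) would return the word that most often follows "thanks for" in the sampled corpus.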