The goal of this project is to create a prediction algorithm for the ‘Milestone Report’ assignment, which is part of the Capstone course in the Data Science Specialization.
The expectation for this milestone is to load and clean the data. In subsequent steps, we will explore the cleaned data to create a predictive algorithm that auto-suggests the next word based on what is being typed on the keyboard.
The data set bundles text from news, Twitter and blogs in Russian, US English, Finnish and German. The primary goal of this exercise is to load, clean and explore the data. The archive is 574 MB, which is a large data set, so we need to preprocess the data efficiently to ensure quick loading and fast generation of auto-suggested words.
From the given link, download the archive file only once.
library(tm)
## Loading required package: NLP
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
csFileLink = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
csDumpFileName="Coursera-SwiftKey.zip"
csBlFilePath="final/en_US/en_US.blogs.txt"
csTwFilePath="final/en_US/en_US.twitter.txt"
csNwFilePath="final/en_US/en_US.news.txt"
profanityFilePath="final/en_US/bad.txt"
csBlFile=file(csBlFilePath)
csTwFile=file(csTwFilePath)
csNwFile=file(csNwFilePath)
profanityFile=file(profanityFilePath)
# download the file only once
if (!file.exists(csDumpFileName)){
download.file(csFileLink, csDumpFileName)
# unzip the bundle
unzip(csDumpFileName)
}
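After unzipping, the English-language files should sit under final/en_US; a quick, optional check confirms they are in place.
# optional: confirm the unzipped English files are present before reading them
list.files("final/en_US")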
After downloading and unzipping the archive, load the English-language files.
bData = readLines(csBlFile, encoding="UTF-8", skipNul=TRUE)
close(csBlFile)
tData = readLines(csTwFile, encoding="UTF-8", skipNul=TRUE)
close(csTwFile)
nData = readLines(csNwFile, encoding="UTF-8", skipNul=TRUE)
close(csNwFile)
List out the file sizes
paste(csBlFilePath," = ",file.info(csBlFilePath)$size/1024^2," MB")
## [1] "final/en_US/en_US.blogs.txt = 200.424207687378 MB"
paste(csTwFilePath," = ",file.info(csTwFilePath)$size/1024^2," MB")
## [1] "final/en_US/en_US.twitter.txt = 159.364068984985 MB"
paste(csNwFilePath," = ",file.info(csNwFilePath)$size/1024^2," MB")
## [1] "final/en_US/en_US.news.txt = 196.277512550354 MB"
List the number of lines in each file
paste(csBlFilePath," = ",length(bData), "rows")
## [1] "final/en_US/en_US.blogs.txt = 899288 rows"
paste(csTwFilePath," = ",length(tData), "rows")
## [1] "final/en_US/en_US.twitter.txt = 2360148 rows"
paste(csNwFilePath," = ",length(nData), "rows")
## [1] "final/en_US/en_US.news.txt = 1010242 rows"
Exploring the entire data set would be a performance challenge, so an efficient way to handle the large data is to sample it and use the sample for the auto-suggestion model. We can create a separate sub-sample data set by reading in a random subset of the original data. A few randomly selected rows or chunks are enough to get a reasonable approximation of the results that would be obtained using all the data.
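Because the sample is random, the exact counts below will vary from run to run. A seed can be fixed before sampling to make the report reproducible (an optional step; the seed value here is arbitrary).
# optional: fix the RNG seed so the sample and the n-gram counts below are reproducible
set.seed(1234)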
# take a random sample of roughly 0.25% of the lines from each source
bDataSample <- sample(bData, NROW(bData)/400, replace=FALSE)
tDataSample <- sample(tData, NROW(tData)/400, replace=FALSE)
nDataSample <- sample(nData, NROW(nData)/400, replace=FALSE)
In this section, we clean the sampled data by removing profanity, extra whitespace, numbers, punctuation and stop words.
mergeSample <- list(bDataSample,tDataSample,nDataSample)
corpusSampleData <- VCorpus(VectorSource(mergeSample))
First, let’s download the list of profanity words
if(!file.exists(profanityFilePath))
{
download.file(url="http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", destfile=profanityFilePath, quiet=T)
}
fData = readLines(profanityFile, encoding="UTF-8", skipNul=TRUE)
close(profanityFile)
Remove profanity words
corpusSampleData <- tm_map(corpusSampleData, removeWords, fData)
Remove extra whitespace
corpusSampleData <- tm_map(corpusSampleData, stripWhitespace)
Remove numbers
corpusSampleData <- tm_map(corpusSampleData, removeNumbers)
Remove punctuation
corpusSampleData <- tm_map(corpusSampleData, removePunctuation)
Remove stop words
corpusSampleData <- tm_map(corpusSampleData, removeWords, c(stopwords('english')))
Transform the data into lower case
corpusSampleData <- tm_map(corpusSampleData, content_transformer(tolower))
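As an optional sanity check of the cleaning pipeline, we can peek at a few lines of the first document in the corpus (the exact lines depend on the random sample).
# show the first few cleaned lines of the blog sample (output varies with the sample)
head(as.character(corpusSampleData[[1]]), 3)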
Now that we have cleansed the data, it’s time to tokenize it. We identify appropriate tokens (words and word sequences) and then structure them for auto-suggestion.
As suggested in the course content, we write a function for N-grams that takes the N-gram size and returns a structured data set of term frequencies.
nGramFn <- function(ng) {
options(mc.cores=1) # run on a single core to avoid known issues with RWeka tokenizers
# tokenizer that splits the text into n-grams of size ng
nGramTokenizer <- function(nData) NGramTokenizer(nData, Weka_control(min = ng, max = ng, delimiters = " \\r\\n\\t.,;:\"()?!"))
# build the term-document matrix and sum the counts across documents
tdMatrix <- TermDocumentMatrix(corpusSampleData, control=list(tokenize=nGramTokenizer))
tdMatrix <- as.data.frame(apply(tdMatrix,1,sum))
return(tdMatrix)
}
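To illustrate what the tokenizer produces, here is a quick check on a made-up sentence (illustrative only; the real tokenization runs over the whole corpus inside nGramFn).
# example: split a toy sentence into bigrams, giving roughly "thanks for" "for the" "the follow"
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))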
Next, we create a function for plotting the graph. This sorts the data and extracts only the top 10 words to show on the graph.
plotGraph <- function(nDataFrame, gName) {
# turn the row names (the n-grams) into a Word column alongside numeric counts
nDataFrame <- data.frame(Word = rownames(nDataFrame), Frequency = nDataFrame[,1], stringsAsFactors = FALSE)
# sort numerically by frequency and keep only the top 10 n-grams
nDataFrame <- nDataFrame[order(nDataFrame$Frequency,decreasing = TRUE),]
print(head(nDataFrame))
nDataFrame <- nDataFrame[1:10,]
ggPlotData <- ggplot(nDataFrame, aes(x=Word, y=Frequency)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle(paste("Graph for",gName))
ggPlotData
}
Plot the graph for unigrams
plotGraph(nGramFn(1),"unigram")
## Word Frequency
## 23377 twitter 99
## 3238 called 98
## 12606 later 98
## 21448 stop 98
## 958 anything 97
## 12212 kids 97
Plot the graph for bigrams
plotGraph(nGramFn(2),"bigram")
## Word Frequency
## 2933 along way 9
## 5835 around world 9
## 7671 back home 9
## 7928 bad i 9
## 14584 can find 9
## 14631 can just 9
Plot the graph for trigrams
plotGraph(nGramFn(3),"trigram")
## Word Frequency
## 15258 cant wait see 9
## 31870 even though i 9
## 32260 every time i 8
## 52083 i think im 8
## 45124 happy mothers day 7
## 49783 i cant believe 7
The next step is to create a Shiny app that demonstrates the ability to predict the next word based on n-gram frequencies; the highest-frequency matches will drive the predictive model.
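As a rough illustration of the idea (not the final model), the sketch below assumes a data frame of trigram counts with character Word and numeric Frequency columns, such as the output of nGramFn(3) after naming its columns as in plotGraph, and suggests the word that most frequently follows the last two typed words.
# sketch only: look up the most frequent trigram starting with the typed bigram
# and return its third word; triDF is assumed to have Word and Frequency columns
suggestNextWord <- function(triDF, typedBigram) {
words <- as.character(triDF$Word) # trigram strings
firstTwo <- sapply(strsplit(words, " "), function(p) paste(p[1:2], collapse = " "))
matches <- which(firstTwo == tolower(typedBigram))
if (length(matches) == 0) return(NA_character_) # no match; a real model would back off to bigrams
best <- matches[which.max(triDF$Frequency[matches])]
strsplit(words[best], " ")[[1]][3] # the suggested next word
}
# hypothetical usage: suggestNextWord(triData, "cant wait") might return "see"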
https://www.coursera.org/course/nlp
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
https://www.jstatsoft.org/article/view/v025i05
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
http://www.jhsph.edu/academics/degree-programs/master-of-public-health/current-students/JHSPH-ReferencingHandbook.pdf
https://en.wikipedia.org/wiki/N-gram
http://www.cs.cmu.edu/~biglou/resources/bad-words.txt
http://www.inside-r.org/packages/cran/tm/docs/tm_map
https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf