Synopsis

The goal of this project is to create a text-prediction algorithm. This ‘Milestone Report’ assignment is part of the Capstone course in the Data Science Specialization.

The expectation for this milestone is to load and clean the data. In subsequent steps, we will explore the cleaned data and build a predictive algorithm that auto-suggests the next word based on what has been typed on the keyboard.

About the data

The data set bundles texts from news, Twitter, and blogs in four languages: US English, German, Russian, and Finnish. The primary goal of this exercise is to load, clean, and explore the data. The archive is 574 MB, which is a large data set, so the preprocessing needs to be efficient to keep loading and auto-suggestion fast.
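One common way to keep memory use down with files this large is to read them in chunks; a minimal sketch of that pattern (not used below, where sampling keeps things manageable and the English files fit in memory):

con <- file("final/en_US/en_US.blogs.txt", "r")
repeat {
  chunk <- readLines(con, n = 10000, skipNul = TRUE)  # read 10,000 lines per pass
  if (length(chunk) == 0) break                       # stop at end of file
  # ... process the chunk here ...
}
close(con)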

Load Data

Download the archive file from the given link, only once.

library(tm)
## Loading required package: NLP
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
csFileLink = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
csDumpFileName = "Coursera-SwiftKey.zip"
csBlFilePath = "final/en_US/en_US.blogs.txt"
csTwFilePath = "final/en_US/en_US.twitter.txt"
csNwFilePath = "final/en_US/en_US.news.txt"
profanityFilePath = "final/en_US/bad.txt"
# download and unzip the bundle only once
if (!file.exists(csDumpFileName)){
  download.file(csFileLink, csDumpFileName)
  unzip(csDumpFileName)
}
# connections to the English-language files (read by readLines below)
csBlFile = file(csBlFilePath)
csTwFile = file(csTwFilePath)
csNwFile = file(csNwFilePath)
profanityFile = file(profanityFilePath)

After downloading and unzipping the archive, load the English-language files.

bData = readLines(csBlFile, encoding="UTF-8", skipNul=TRUE)
close(csBlFile)
tData = readLines(csTwFile, encoding="UTF-8", skipNul=TRUE)
close(csTwFile)
nData = readLines(csNwFile, encoding="UTF-8", skipNul=TRUE)
close(csNwFile)

Meta-Data

List out the file sizes

paste(csBlFilePath," = ",file.info(csBlFilePath)$size/1024^2," MB")
## [1] "final/en_US/en_US.blogs.txt  =  200.424207687378  MB"
paste(csTwFilePath," = ",file.info(csTwFilePath)$size/1024^2," MB")
## [1] "final/en_US/en_US.twitter.txt  =  159.364068984985  MB"
paste(csNwFilePath," = ",file.info(csNwFilePath)$size/1024^2," MB")
## [1] "final/en_US/en_US.news.txt  =  196.277512550354  MB"

List out the file line counts

paste(csBlFilePath," = ",length(bData), "rows")
## [1] "final/en_US/en_US.blogs.txt  =  899288 rows"
paste(csTwFilePath," = ",length(tData), "rows")
## [1] "final/en_US/en_US.twitter.txt  =  2360148 rows"
paste(csNwFilePath," = ",length(nData), "rows")
## [1] "final/en_US/en_US.news.txt  =  1010242 rows"

Sampling

Exploring the entire data set would be a performance challenge, so an efficient way to handle the large files is to sample them and build the auto-suggestion model from the sample. We create a separate sub-sample data set by reading in a random subset of the original data. A few randomly selected rows or chunks are enough to get an accurate approximation of the results that would be obtained using all the data.

set.seed(1234)  # fix the RNG seed so the sample (and this report) is reproducible
bDataSample <- sample(bData, NROW(bData)/400, replace=FALSE)
tDataSample <- sample(tData, NROW(tData)/400, replace=FALSE)
nDataSample <- sample(nData, NROW(nData)/400, replace=FALSE)

Clean Data

In this section we clean the sampled data by removing profanity, extra whitespace, numbers, punctuation, and stop words.

# combine the three samples into a corpus of three documents (blogs, twitter, news)
mergeSample <- list(bDataSample, tDataSample, nDataSample)
corpusSampleData <- VCorpus(VectorSource(mergeSample))

First, let’s download the list of profanity words.

if (!file.exists(profanityFilePath)) {
  download.file(url="http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", destfile=profanityFilePath, quiet=TRUE)
}
fData = readLines(profanityFile, encoding="UTF-8", skipNul=TRUE)
close(profanityFile)

Remove profanity words

corpusSampleData <- tm_map(corpusSampleData, removeWords, fData)

Remove extra whitespace

corpusSampleData <- tm_map(corpusSampleData, stripWhitespace)

Remove numbers

corpusSampleData <- tm_map(corpusSampleData, removeNumbers)

Remove punctuation

corpusSampleData <- tm_map(corpusSampleData, removePunctuation)

Transform the data into lower case. This must happen before removing stop words, since stopwords('english') is all lower case and capitalized occurrences would otherwise be missed.

corpusSampleData <- tm_map(corpusSampleData, content_transformer(tolower))

Remove stop words

corpusSampleData <- tm_map(corpusSampleData, removeWords, stopwords('english'))
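To sanity-check the cleaning pipeline, we can peek at the beginning of the first cleaned document. A quick illustrative check; the exact text depends on the random sample:

# inspect the first 200 characters of the cleaned blogs sample
substr(content(corpusSampleData[[1]])[1], 1, 200)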

Explore data

Now that we have cleaned data, it’s time to tokenize it. We identify appropriate tokens (single words and short word sequences), then structure them for auto-suggestion.

As suggested in the course content, we write an N-gram function that takes the N-gram size and returns a structured data set of term frequencies.

nGramFn <- function(ng) {
  options(mc.cores=1)  # RWeka's Java-based tokenizer does not play well with parallel tm, so force a single core
  nGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng, delimiters = " \\r\\n\\t.,;:\"()?!"))
  tdMatrix <- TermDocumentMatrix(corpusSampleData, control=list(tokenize=nGramTokenizer))
  # sum each term's counts across the three documents (blogs, twitter, news)
  tdMatrix <- as.data.frame(apply(tdMatrix, 1, sum))
  colnames(tdMatrix) <- c("Frequency")
  return(tdMatrix)
}
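As a quick illustration of what the tokenizer produces (the input sentence here is made up for this report), NGramTokenizer splits text into overlapping word sequences:

NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
## expected output: "the quick"   "quick brown" "brown fox"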

Next we create a function for plotting a graph. It sorts the data and extracts only the top 10 words to show on the graph.

plotGraph <- function(nDataFrame, gName) {
  # build the data frame directly so Frequency stays numeric
  # (cbind() would coerce it to character and make the sort lexicographic)
  nDataFrame <- data.frame(Word=rownames(nDataFrame), Frequency=nDataFrame[,1])
  nDataFrame <- nDataFrame[order(nDataFrame$Frequency, decreasing = TRUE),]
  print(head(nDataFrame))
  nDataFrame <- nDataFrame[1:10,]
  # reorder(Word, -Frequency) keeps the bars in frequency order
  ggplot(nDataFrame, aes(x=reorder(Word, -Frequency), y=Frequency)) +
    geom_bar(stat="identity") + xlab("Word") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle(paste("Graph for", gName))
}

Plot graph for unigram

plotGraph(nGramFn(1),"unigram")
##           Word Frequency
## 23377  twitter        99
## 3238    called        98
## 12606    later        98
## 21448     stop        98
## 958   anything        97
## 12212     kids        97

Plot graph for bigram

plotGraph(nGramFn(2),"bigram")
##               Word Frequency
## 2933     along way         9
## 5835  around world         9
## 7671     back home         9
## 7928         bad i         9
## 14584     can find         9
## 14631     can just         9

Plot graph for trigram

plotGraph(nGramFn(3),"trigram")
##                    Word Frequency
## 15258     cant wait see         9
## 31870     even though i         9
## 32260      every time i         8
## 52083        i think im         8
## 45124 happy mothers day         7
## 49783    i cant believe         7

Conclusion

The next step is to create a Shiny app that demonstrates predicting the next word from these n-gram frequencies: the highest-frequency continuation of the words typed so far will drive the predictive model.
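A minimal sketch of how such a frequency-based lookup might work, assuming the bigram table returned by nGramFn(2) above (the helper predictNext and its details are hypothetical, not the final app's implementation):

# hypothetical helper: given the last typed word, return the most frequent
# continuations found in the bigram frequency table
predictNext <- function(word, bigramFreq, n = 3) {
  grams <- rownames(bigramFreq)
  hits <- grep(paste0("^", word, " "), grams, value = TRUE)  # bigrams starting with the word
  freqs <- bigramFreq[hits, "Frequency"]
  nextWords <- sub(paste0("^", word, " "), "", hits)         # keep only the second word
  head(nextWords[order(freqs, decreasing = TRUE)], n)
}
# e.g. predictNext("around", nGramFn(2)) should rank "world" highly,
# given the "around world" bigram seen above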

References

https://www.coursera.org/course/nlp
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
https://www.jstatsoft.org/article/view/v025i05
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
http://www.jhsph.edu/academics/degree-programs/master-of-public-health/current-students/JHSPH-ReferencingHandbook.pdf
https://en.wikipedia.org/wiki/N-gram
http://www.cs.cmu.edu/~biglou/resources/bad-words.txt
http://www.inside-r.org/packages/cran/tm/docs/tm_map
https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf