Executive Summary

This is a milestone report for the capstone project offered by Johns Hopkins University through Coursera. The principal aim of the project is to develop a data product (a Shiny app) that implements a predictive text model. The first step of the project is to download, read in, and clean the data sets. This is followed by exploratory analysis of the data sets, which informs the strategy for building the predictive text model.

Getting and Cleaning Data

The data as specified by the project can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

library(tm) #For Text Mining 
library(qdap) #For Text Mining & Corpus workings
library(RWeka) #For n-gram generation
library(stringi) #For General Stats
library(ggplot2) #For Plots and Exploratory Analysis

if(!file.exists("Coursera-SwiftKey.zip")){
  fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileURL, "Coursera-SwiftKey.zip")
  unzipData <- unzip("Coursera-SwiftKey.zip")
}

File sizes of the downloaded data files:

##                    news    blogs  twitter
## File Sizes(mb) 196.2775 200.4242 159.3641
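For reference, a minimal sketch of how these sizes can be computed in R (assuming the files were unzipped to the default final/en_US/ directory) is:

# File sizes in megabytes; paths assume the default unzip location
files <- c(news    = "final/en_US/en_US.news.txt",
           blogs   = "final/en_US/en_US.blogs.txt",
           twitter = "final/en_US/en_US.twitter.txt")
round(sapply(files, file.size) / 1024^2, 4)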

The next step is to load a sample of each file into R and provide general statistics of the file contents:

news <- readLines("final/en_US/en_US.news.txt", n=10000, encoding = "UTF-8")
blogs <- readLines("final/en_US/en_US.blogs.txt", n=10000, encoding = "UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", n=10000, encoding = "UTF-8")

SampleData <- c(news, blogs, twitter)  # combine the three samples into one character vector
##                 Blogs      News  Tweets
## Lines          10,000    10,000  10,000
## LinesNEmpty    10,000    10,000  10,000
## Chars       2,277,383 2,035,687 681,544
## CharsNWhite 1,876,763 1,701,758 563,870
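The statistics above come from the stringi package loaded earlier; a sketch of how such a summary can be produced from the sample vectors read in above is:

# Line and character statistics for each 10,000-line sample (stats as rows, samples as columns)
sapply(list(Blogs = blogs, News = news, Tweets = twitter), stri_stats_general)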

Now we create a corpus from the sample data and clean it by removing numbers, extra white space, punctuation, and English stopwords. We also remove profanity, using the list of bad words downloaded from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt

if(!file.exists("bad-words.txt")){
  fileBadWordsURL <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
  download.file(fileBadWordsURL, "bad-words.txt")
}

bad_words <- as.vector(readLines("bad-words.txt"))

corpus <- VCorpus(VectorSource(SampleData))
corpus <- tm_map(corpus, removeNumbers)  
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, bad_words)

fullData <- data.frame(text=unlist(sapply(corpus, '[', "content")), stringsAsFactors=F)
fullData[1:5,1]

After that, we can use the RWeka package to create unigram, bigram, and trigram sets:

tokenizersDelimiters <- "\"\'\\t\\r\\n ().,;!?"

oneGramsTokenizer <- NGramTokenizer(fullData$text, Weka_control(min = 1, max = 1, delimiters = tokenizersDelimiters))
biGramsTokenizer <- NGramTokenizer(fullData$text, Weka_control(min = 2, max = 2, delimiters = tokenizersDelimiters))
triGramsTokenizer <- NGramTokenizer(fullData$text, Weka_control(min = 3, max = 3, delimiters = tokenizersDelimiters))

Next, convert the n-gram tokens into frequency tables and sort them by decreasing frequency:

oneGramsTab <- data.frame(table(oneGramsTokenizer))
biGramsTab <- data.frame(table(biGramsTokenizer))
triGramsTab <- data.frame(table(triGramsTokenizer))

oneGramsSorted <- oneGramsTab[order(oneGramsTab$Freq, decreasing = TRUE),]
biGramsSorted <- biGramsTab[order(biGramsTab$Freq, decreasing = TRUE),]
triGramsSorted <- triGramsTab[order(triGramsTab$Freq, decreasing = TRUE),]

Top 20 unigrams, bigrams, and trigrams:

top20oneGrams <- oneGramsSorted[1:20,]
top20biGrams <- biGramsSorted[1:20,]
top20triGrams <- triGramsSorted[1:20,]

Exploratory Analysis

We use plots to get a more visual understanding of the data used in this report. Here we show the top 20 frequencies of unigrams, bigrams, and trigrams:

ggplot(top20oneGrams, aes(x=oneGramsTokenizer,y=Freq)) + ggtitle("Top 20 Unigrams") + labs(x="Unigrams",y="Frequency") + geom_bar(fill = "green", stat="identity") + geom_text(aes(label=Freq), vjust=-0.4)

ggplot(top20biGrams, aes(x=biGramsTokenizer,y=Freq)) + ggtitle("Top 20 Bigrams") + labs(x="Bigrams",y="Frequency") + geom_bar(fill = "blue", stat="identity") + geom_text(aes(label=Freq), vjust=-0.4) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(top20triGrams, aes(x=triGramsTokenizer,y=Freq)) + ggtitle("Top 20 Trigrams") + labs(x="Trigrams",y="Frequency") + geom_bar(fill = "orange", stat="identity") + geom_text(aes(label=Freq), vjust=-0.4) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Next steps:
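
A natural next step is to turn the sorted n-gram tables into a simple frequency-based next-word lookup, later adding backoff from trigrams to bigrams and unigrams for unseen prefixes. The helper below is only a hypothetical sketch of the idea, built on the triGramsSorted table created above:

# Hypothetical sketch: return the most frequent trigram completion of the last two words typed
predictNextWord <- function(phrase, triGrams = triGramsSorted) {
  lastTwo <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  prefix  <- paste(lastTwo, collapse = " ")
  hits    <- triGrams[grepl(paste0("^", prefix, " "), as.character(triGrams[, 1])), ]
  if (nrow(hits) == 0) return(NA_character_)   # no match: fall back to bigrams/unigrams later
  tail(strsplit(as.character(hits[1, 1]), " ")[[1]], 1)
}

Because the table is already sorted by decreasing frequency, the first matching trigram is the most likely completion.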