This is a milestone report for the capstone project offered by Johns Hopkins University through Coursera. The principal aim of the project is to develop a data product (a Shiny app) that implements a predictive text model. The first step is to download and read in the data sets, then clean them. Exploratory analysis of the cleaned data then informs the strategy for building the predictive text model.
The data set specified by the project can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
library(tm) #For Text Mining
library(qdap) #For text mining & corpus utilities
library(RWeka) #For n-gram generation
library(stringi) #For General Stats
library(ggplot2) #For Plots and Exploratory Analysis
if(!file.exists("Coursera-SwiftKey.zip")){
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, "Coursera-SwiftKey.zip")
unzipData <- unzip("Coursera-SwiftKey.zip")
}
File sizes of the downloaded data source:
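A minimal sketch of how these sizes can be computed (the paths assume the zip was extracted into final/en_US/, as in the read step below):
files <- c(news = "final/en_US/en_US.news.txt",
           blogs = "final/en_US/en_US.blogs.txt",
           twitter = "final/en_US/en_US.twitter.txt")
# File sizes in megabytes, one column per file
rbind("File Size (MB)" = file.size(files) / 1024^2)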
##                    news    blogs  twitter
## File Size (MB) 196.2775 200.4242 159.3641
The next step is to load a sample of each file into R and compute general statistics on the file contents:
# skipNul = TRUE suppresses warnings about embedded nul characters in these files
news <- readLines("final/en_US/en_US.news.txt", n = 10000, encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("final/en_US/en_US.blogs.txt", n = 10000, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", n = 10000, encoding = "UTF-8", skipNul = TRUE)
# Combine the three samples into one character vector (paste() would
# incorrectly merge unrelated news/blog/tweet lines element-wise)
SampleData <- c(news, blogs, twitter)
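The summary table below can be obtained with stringi; a sketch using stri_stats_general on each sample:
# Line and character counts per sample
sapply(list(Blogs = blogs, News = news, Tweets = twitter),
       stri_stats_general)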
##                 Blogs      News   Tweets
## Lines          10,000    10,000   10,000
## LinesNEmpty    10,000    10,000   10,000
## Chars       2,277,383 2,035,687  681,544
## CharsNWhite 1,876,763 1,701,758  563,870
Now we create a corpus from the sampled data and clean it by removing numbers, extra white space, punctuation, and English stopwords. We also remove profanity, using the word list downloaded from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt:
if(!file.exists("bad-words.txt")){
fileBadWordsURL <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(fileBadWordsURL, "bad-words.txt")
}
bad_words <- readLines("bad-words.txt")  # readLines already returns a character vector
corpus <- VCorpus(VectorSource(SampleData))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case before matching stopwords
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop English stopwords
corpus <- tm_map(corpus, removeWords, bad_words)             # drop profanity
corpus <- tm_map(corpus, stripWhitespace)                    # collapse spaces left by the removals
# Flatten the cleaned corpus back into a data frame of text lines
fullData <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
fullData[1:5, 1]
After that we can use the RWeka package to create unigram, bigram, and trigram sets:
tokenizersDelimiters <- "\"\'\\t\\r\\n ().,;!?"
oneGramsTokenizer <- NGramTokenizer(fullData$text, Weka_control(min = 1, max = 1, delimiters = tokenizersDelimiters))
biGramsTokenizer <- NGramTokenizer(fullData$text, Weka_control(min = 2, max = 2, delimiters = tokenizersDelimiters))
triGramsTokenizer <- NGramTokenizer(fullData$text, Weka_control(min = 3, max = 3, delimiters = tokenizersDelimiters))
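As a quick sanity check of what NGramTokenizer produces, running it on a toy sentence should yield every contiguous pair of words:
NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
## [1] "the quick"   "quick brown" "brown fox"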
Next, tabulate the n-gram frequencies into data frames and sort them in decreasing order of frequency:
oneGramsTab <- data.frame(table(oneGramsTokenizer))
biGramsTab <- data.frame(table(biGramsTokenizer))
triGramsTab <- data.frame(table(triGramsTokenizer))
oneGramsSorted <- oneGramsTab[order(oneGramsTab$Freq, decreasing = TRUE),]
biGramsSorted <- biGramsTab[order(biGramsTab$Freq, decreasing = TRUE),]
triGramsSorted <- triGramsTab[order(triGramsTab$Freq, decreasing = TRUE),]
Extract the top 20 unigrams, bigrams, and trigrams:
top20oneGrams <- oneGramsSorted[1:20,]
top20biGrams <- biGramsSorted[1:20,]
top20triGrams <- triGramsSorted[1:20,]
Plots give a more visual picture of the sampled data. Here we show the top 20 frequencies of unigrams, bigrams, and trigrams:
ggplot(top20oneGrams, aes(x = reorder(oneGramsTokenizer, -Freq), y = Freq)) + ggtitle("Top 20 Unigrams") + labs(x = "Unigrams", y = "Frequency") + geom_bar(fill = "green", stat = "identity") + geom_text(aes(label = Freq), vjust = -0.4)
ggplot(top20biGrams, aes(x = reorder(biGramsTokenizer, -Freq), y = Freq)) + ggtitle("Top 20 Bigrams") + labs(x = "Bigrams", y = "Frequency") + geom_bar(fill = "blue", stat = "identity") + geom_text(aes(label = Freq), vjust = -0.4) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(top20triGrams, aes(x = reorder(triGramsTokenizer, -Freq), y = Freq)) + ggtitle("Top 20 Trigrams") + labs(x = "Trigrams", y = "Frequency") + geom_bar(fill = "orange", stat = "identity") + geom_text(aes(label = Freq), vjust = -0.4) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
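These frequency tables already hint at the strategy for the eventual predictive model: given the last word(s) typed, look up the most frequent n-grams that start with them. A minimal sketch of that idea (the helper predictNext is hypothetical and not part of the pipeline above):
# Hypothetical helper: suggest the n most likely next words after `word`,
# using the sorted bigram table built above
predictNext <- function(word, n = 3) {
  grams <- as.character(biGramsSorted$biGramsTokenizer)
  hits <- grams[grepl(paste0("^", word, " "), grams)]  # bigrams starting with `word`
  head(sub("^\\S+\\s+", "", hits), n)                  # keep the second token
}
predictNext("new")  # e.g. might return "york", "year", ...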