This milestone report presents an exploratory analysis of the three English-language HC Corpora datasets provided for the project. Because the files are large, a 1% random sample of each file is used. The analysis covers the frequencies of unigrams, bigrams, and trigrams. A brief description of plans for creating the prediction algorithm and Shiny app is also provided.
library(tm)
library(RWeka)
library(qdap)
library(quanteda)
library(ggplot2)
Tasks to accomplish:
1. Download and unzip the HC Corpora files if necessary.
2. Download a list of profanity for profanity filtering.
3. Read the necessary data files into R.
4. Perform a basic dataset size assessment.
The three English-language text files were downloaded to the project folder from the Coursera website. The URL is https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The three files are:
1. blogs.txt
2. news.txt
3. twitter.txt
A list of obscene words (“badwords”) was downloaded from the website https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
##Check for necessary data files in working directory and download and unzip if necessary.
setwd("~/Documents/Mandeeps Documents/WORK Related/Courses/Data Science Capstone")
url.swiftkey.data<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
url.badwords<-"https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
if (!file.exists("coursera-swiftkey.zip")) download.file(url.swiftkey.data, destfile="coursera-swiftkey.zip")
if (!file.exists("badwords.txt")) download.file(url.badwords, destfile="badwords.txt")
english.files<-c("final/en_US/en_US.blogs.txt", "final/en_US/en_US.news.txt", "final/en_US/en_US.twitter.txt")
##Keep the archive's internal paths so the files are extracted to final/en_US/, where they are read from below
unzip("coursera-swiftkey.zip", files=english.files, overwrite=TRUE)
##Read datafiles into R
blogs.txt<-scan("final/en_US/en_US.blogs.txt", what="", sep="\n")
news.txt<-scan("final/en_US/en_US.news.txt", what="", sep="\n")
twitter.txt<-scan("final/en_US/en_US.twitter.txt", what="", sep="\n")
badwords.txt<-scan("badwords.txt", what="", sep="\n")
Here we summarize the number of lines and the in-memory size (in bytes) of the three data sets.
## Determine the size of the three data sets
file.lines<-c(length(blogs.txt), length(news.txt), length(twitter.txt))
file.bytes<-c(object.size(blogs.txt), object.size(news.txt) , object.size(twitter.txt))
data.summary<-data.frame(file.lines, file.bytes)
row.names(data.summary)<-c("Blogs", "News", "Twitter")
print(data.summary)
##         file.lines file.bytes
## Blogs       899288  260564320
## News       1010242  261759048
## Twitter    2360148  316037344
Tasks to accomplish:
1. Data subsetting: to make the exploratory analysis more practical, a 1% random sample is drawn from each of the three files and the samples are combined.
2. Data cleaning: the text is converted to lowercase; numbers, punctuation, and English stopwords are removed; and extra whitespace is collapsed.
3. Profanity filtering: profane words are removed per the Capstone project instructions.
set.seed(1)
news.subset<-sample(news.txt, round(length(news.txt)*.01))
blogs.subset<-sample(blogs.txt, round(length(blogs.txt)*.01))
twitter.subset<-sample(twitter.txt, round(length(twitter.txt)*.01))
combo.data<-c(news.subset, blogs.subset, twitter.subset)
The combined 1% data set contains 42696 lines and has a memory size of 8.401888 × 10^6 bytes.
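The subsetting step and the two figures above (line count and memory size) can be reproduced on toy data with a small helper. This is a minimal sketch in base R: the helper name `sample.fraction` and the toy vectors are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: draw a random fraction of the lines in a character vector.
sample.fraction <- function(lines, fraction = 0.01) {
  sample(lines, round(length(lines) * fraction))
}

# Toy stand-ins for the three text files (made-up data, not the real corpora).
toy.data <- list(
  news    = sprintf("news line %d", 1:1000),
  blogs   = sprintf("blog line %d", 1:500),
  twitter = sprintf("tweet %d", 1:2000)
)

set.seed(1)
# Sample 1% of each source and combine into one character vector.
combo.toy <- unlist(lapply(toy.data, sample.fraction), use.names = FALSE)

length(combo.toy)                           # number of sampled lines
format(object.size(combo.toy), units = "Kb") # in-memory size, human readable
```

Sampling each source separately, rather than sampling the combined vector, keeps each source's share of the subset equal to its share of the full data.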
#To make a volatile corpus, R needs to interpret each element in our vector of text as a document. The tm package provides Source functions to do this.
combo.data.vector<-VectorSource(combo.data)
##VCorpus(), creates our volatile corpus. The VCorpus object is a nested list. At each index of the VCorpus object, there is a PlainTextDocument object, which is essentially a list that contains the actual text data, as well as some corresponding metadata.
data.corpus<-VCorpus(combo.data.vector) ##Create a corpus
preprocessCorpus<-function(corpus) {
corpus<-tm_map(corpus, content_transformer(tolower))
corpus<-tm_map(corpus, removeNumbers)
corpus<-tm_map(corpus, removePunctuation)
corpus<-tm_map(corpus, removeWords, stopwords("english"))
corpus<-tm_map(corpus, removeWords, badwords.txt)
corpus<-tm_map(corpus, stripWhitespace)
return(corpus)
}
data.corpus.proc<-preprocessCorpus(data.corpus)
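To see what the cleaning steps do to a single line of text, the sketch below approximates them with base R regular expressions. It is an illustration only, not tm's implementation: the function name `clean.text` is hypothetical, the word list is a made-up stand-in for the stopword and profanity lists, and the final `trimws()` is an addition that `stripWhitespace` does not perform.

```r
# Hypothetical base R approximation of the cleaning pipeline for one string.
clean.text <- function(x, drop.words) {
  x <- tolower(x)                          # lowercase
  x <- gsub("[[:digit:]]+", "", x)         # remove numbers
  x <- gsub("[[:punct:]]+", "", x)         # remove punctuation
  pattern <- sprintf("\\b(%s)\\b", paste(drop.words, collapse = "|"))
  x <- gsub(pattern, "", x, perl = TRUE)   # remove whole words on the list
  x <- gsub("\\s+", " ", x)                # collapse runs of whitespace
  trimws(x)                                # extra step: trim the ends
}

clean.text("The 3 Quick brown foxes, and THE dog!", c("the", "and"))
# "quick brown foxes dog"

# Order matters: punctuation is stripped first, so a list entry that
# contains an apostrophe no longer matches the text.
clean.text("Don't stop", c("don't"))
# "dont stop"
```

The second call suggests that contractions in the stopword list may survive cleaning when punctuation is removed before stopwords, which is worth checking against the processed corpus.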
Key questions:
- How frequently do certain words appear in the data set?
- How frequently do certain pairs of words appear? How about triplets?
Tasks to accomplish:
Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
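As an illustration of the tokenization task, the sketch below builds n-grams from a cleaned string and counts them using base R only. The function name `make.ngrams` and the example sentence are hypothetical; the analysis itself relies on the tm and RWeka tokenizers, which may split text differently.

```r
# Hypothetical helper: split a string on whitespace and return its n-grams.
make.ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))  # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- make.ngrams("thanks for the follow thanks for the help", 2)

# Frequency table of bigrams, most frequent first.
sort(table(tokens), decreasing = TRUE)
```

In this toy sentence "thanks for" and "for the" each occur twice and every other bigram occurs once, which is the kind of count the term-document matrices aggregate over the whole corpus.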
#Create Unigrams
unigram_tdm<-TermDocumentMatrix(data.corpus.proc) #Create unigram TDM
unigram_tdm_sparse<-removeSparseTerms(unigram_tdm, .99)
unigram_m<-as.matrix(unigram_tdm_sparse)
#Plot most frequent 20 Unigrams
word.freq.uni<-sort(rowSums(unigram_m), decreasing=TRUE)
word.freq.uni.df<-data.frame(word=names(word.freq.uni), freq=word.freq.uni)
freq.words<-word.freq.uni.df[1:20,]
ggplot(freq.words, aes(x=reorder(word, -freq), y=freq))+geom_bar(stat="identity")+theme(axis.text=element_text(size=12), axis.text.x=element_text(angle=45, hjust=1), plot.title=element_text(size=12, face="bold"),axis.title=element_text(size=12))+labs(x="Word", y="Count")+ggtitle("Top 20 Most Frequent Unigrams")
# Set the default number of cores to 1. This appears to be necessary for the RWeka n-gram tokenizer to work with tm
options(mc.cores=1)
# Use the RWeka package to create bigram (two-word) tokens
tokenizer<-function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
bigram_tdm<-TermDocumentMatrix(data.corpus.proc, control=list(tokenize=tokenizer))
bigram_tdm_sparse<-removeSparseTerms(bigram_tdm, .999)
bigrams_m<-as.matrix(bigram_tdm_sparse)
bigrams.freq<-sort(rowSums(bigrams_m), decreasing=TRUE)
bigrams.freq.df<-data.frame(word=names(bigrams.freq), freq=bigrams.freq)
bigrams.top20.freq<-bigrams.freq.df[1:20,]
ggplot(bigrams.top20.freq, aes(x=reorder(word, -freq), y=freq))+geom_bar(stat="identity")+theme(axis.text.x=element_text(size=12, angle=90), plot.title=element_text(size=12, face="bold"),axis.title=element_text(size=12))+labs(x="Word", y="Count")+ggtitle("Top 20 Most Frequent Bigrams")
# Use the RWeka package to create trigram (three-word) tokens
tokenizer<-function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
trigram_tdm<-TermDocumentMatrix(data.corpus.proc, control=list(tokenize=tokenizer))
trigram_tdm_sparse<-removeSparseTerms(trigram_tdm, .9999)
trigrams_m<-as.matrix(trigram_tdm_sparse)
trigrams.freq<-sort(rowSums(trigrams_m), decreasing=TRUE)
trigrams.freq.df<-data.frame(word=names(trigrams.freq), freq=trigrams.freq)
trigrams.top20.freq<-trigrams.freq.df[1:20,]
ggplot(trigrams.top20.freq, aes(x=reorder(word, -freq), y=freq))+geom_bar(stat="identity")+theme(axis.text.x=element_text(size=12, angle=90, vjust=.5), plot.title=element_text(size=12, face="bold"),axis.title=element_text(size=12))+labs(x="Word", y="Count")+ggtitle("Top 20 Most Frequent Trigrams")
The goal is to build an n-gram model that predicts the next word from the user's previously entered 1, 2, or 3 words. The app will first search for matching trigrams and, if none are found, will back off to bigrams and then to unigrams. My app will probably be based on the combination of the three data sets, as above. An alternative is to let the user select which data set is used as input to the model, since n-gram frequencies likely vary between blogs, news, and Twitter. Because using all of the data at once will likely not be possible anyway due to computational constraints, having the user select a data set will likely not reduce the amount of data available to the model.
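The back-off idea can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the planned implementation: the function name `predict.next`, the layout of the `tables` list, and all counts are hypothetical. It simply returns the most frequent continuation, with no smoothing or discounting, and since only n-grams up to trigrams are built here it uses at most the last two words as context.

```r
# Toy frequency tables (made-up counts): names are n-grams, values are counts.
tables <- list(
  "1" = c(the = 10, day = 6, you = 5),
  "2" = c("happy birthday" = 4, "thank you" = 7, "thank god" = 2),
  "3" = c("happy mothers day" = 3, "let us know" = 2)
)

# Hypothetical predictor: try trigrams, then bigrams, then fall back to
# the most frequent unigram.
predict.next <- function(input, tables) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (n in c(3, 2)) {
    k <- n - 1                                  # context length for an n-gram
    if (length(words) >= k) {
      context <- paste(tail(words, k), collapse = " ")
      freq <- tables[[as.character(n)]]
      hits <- freq[startsWith(names(freq), paste0(context, " "))]
      if (length(hits) > 0) {
        best <- names(hits)[which.max(hits)]
        return(sub(".* ", "", best))            # keep only the last word
      }
    }
  }
  names(tables[["1"]])[which.max(tables[["1"]])]
}

predict.next("happy mothers", tables)    # trigram match
predict.next("I want to thank", tables)  # no trigram, backs off to bigrams
predict.next("zzz", tables)              # no match, most frequent unigram
```

With these toy tables the three calls return "day", "you", and "the" respectively. A real model would likely need a faster lookup structure than a prefix scan over all n-gram names, and some way to rank candidates across orders.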