This milestone report presents the exploratory analysis, data cleaning, and plan for further work as part of the Coursera Data Science Capstone. The project for this Capstone course involves creating a Shiny app that predicts the next word a user will want to type, given the words the user has entered so far.
This project falls under what is known as Natural Language Processing.
The training data for this project are available from HC Corpora here:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The zip file contains 3 text files with text examples from Twitter, news, and blogs.
As mentioned in the introduction, this project works to create a predictive model that takes words from the user and tries to predict what word will be typed next.
First, the necessary libraries are loaded and the data is downloaded from the URL given above. The data set is quite large, so it is best to download it once into a working directory.
# Load libraries to be used
suppressWarnings(library(tm))
suppressWarnings(library(ggplot2))
suppressWarnings(library(stringi))
suppressWarnings(library(data.table))
suppressWarnings(library(wordcloud))
suppressWarnings(library(quanteda))
# Download and unzip the data only if the archive is not already present
zipfile <- "Coursera-SwiftKey.zip"
if(!file.exists(zipfile)){
  fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  # Save under the same name that is checked above; "wb" keeps the zip intact on Windows
  download.file(fileURL, destfile = zipfile, mode = "wb")
  unzip(zipfile)
}else{
  print("Already Have Data")
}
## [1] "Already Have Data"
Next, the data is loaded into R.
# Load the three data files
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE, encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", skipNul = TRUE, encoding = "UTF-8")
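Then some summary information is presented: word count, line count, and file size for the 3 data files. Note that the News word and line counts are low relative to its file size, which suggests the file may not have been read completely.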
twitterinfo <- c(sum(stri_count_words(twitter)), length(twitter), round((file.info("final/en_US/en_US.twitter.txt")$size / (1024^2)),2))
blogsinfo <- c(sum(stri_count_words(blogs)), length(blogs), round((file.info("final/en_US/en_US.blogs.txt")$size / (1024^2)),2))
newsinfo <- c(sum(stri_count_words(news)), length(news), round((file.info("final/en_US/en_US.news.txt")$size / (1024^2)),2))
suminfo<- as.data.frame(rbind(twitterinfo,blogsinfo,newsinfo))
rownames(suminfo) <- c('Twitter', 'Blogs', 'News')
colnames(suminfo) <- c('Word Count', 'Line Count', 'File Size (MB)')
suminfo
##         Word Count Line Count File Size (MB)
## Twitter   30093410    2360148         159.36
## Blogs     37546246     899288         155.75
## News       2674536      77259         196.28
Because the data files are so large, a random sample of roughly 10% of each file was used. This makes the data easier to manipulate and should allow the eventual prediction models to run much faster.
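# Sample roughly 10% of each file: every line is kept independently with probability 0.1
set.seed(1234) # arbitrary seed, added so that the sample is reproducible
stwitter <- twitter[as.logical(rbinom(length(twitter), 1, prob = 0.1))]
sblogs <- blogs[as.logical(rbinom(length(blogs), 1, prob = 0.1))]
snews <- news[as.logical(rbinom(length(news), 1, prob = 0.1))]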
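The Bernoulli mask used above keeps each line independently, so the sample size varies a little from run to run. The following is a minimal sketch on toy data comparing it with drawing exactly 10% of the lines; the `toy_lines` vector and the seed value are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for one of the text files (hypothetical data)
toy_lines <- paste("line", 1:10000)

set.seed(42)  # arbitrary seed so the result can be reproduced

# Keep each line independently with probability 0.1 (the approach used above)
keep <- as.logical(rbinom(length(toy_lines), 1, prob = 0.1))
bernoulli_sample <- toy_lines[keep]

# Alternative: draw exactly 10% of the lines without replacement
exact_sample <- sample(toy_lines, size = round(0.1 * length(toy_lines)))

length(bernoulli_sample)  # around 1000; the exact value depends on the seed
length(exact_sample)      # always 1000
```

Either approach should work for this project; `sample()` is simply easier to reason about when a fixed sample size matters.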
Then, any “strange” or non-English characters were dealt with. These files can contain characters from other languages as well as things like emojis, so every comma-separated fragment containing a non-ASCII character was removed from the data.
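# For each sample: split into comma-separated fragments, let iconv() replace every
# non-ASCII byte with a marker string, drop the fragments that contain the marker,
# and rejoin the rest into a single string.
# grepl() with logical indexing is used because x[-grep(...)] returns an empty
# vector when nothing matches.
stwitter <- unlist(strsplit(stwitter, split = ", "))
twitterremove <- grepl("stwitter", iconv(stwitter, "latin1", "ASCII", sub = "stwitter"))
stwitter <- stwitter[!twitterremove]
stwitter <- paste(stwitter, collapse = ", ")
sblogs <- unlist(strsplit(sblogs, split = ", "))
blogsremove <- grepl("sblogs", iconv(sblogs, "latin1", "ASCII", sub = "sblogs"))
sblogs <- sblogs[!blogsremove]
sblogs <- paste(sblogs, collapse = ", ")
snews <- unlist(strsplit(snews, split = ", "))
newsremove <- grepl("snews", iconv(snews, "latin1", "ASCII", sub = "snews"))
snews <- snews[!newsremove]
snews <- paste(snews, collapse = ", ")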
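As a small illustration of the filtering step, the sketch below flags fragments containing non-ASCII characters on a toy vector. It uses `utf8ToInt()` in place of the `iconv()` marker trick, so it is a swapped-in technique rather than the method used above, and the `fragments` vector is hypothetical. It also shows why logical indexing is safer than negative indexing when nothing matches.

```r
# Hypothetical fragments; the second and third contain non-ASCII characters
fragments <- c("plain ascii text", "caf\u00e9 au lait", "smile \U0001F600", "more ascii")

# A fragment is non-ASCII if any of its code points is above 127
has_non_ascii <- vapply(fragments,
                        function(s) any(utf8ToInt(s) > 127L),
                        logical(1), USE.NAMES = FALSE)

# Logical indexing keeps every fragment that was not flagged
clean <- fragments[!has_non_ascii]
clean

# Pitfall: negative indexing with an empty match vector drops everything
x <- c("a", "b", "c")
none <- grep("z", x)             # integer(0), because nothing matches
length(x[-none])                 # 0
length(x[!grepl("z", x)])        # 3
```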
The sampled and partially cleaned data was then used to create a corpus that would allow for further cleaning and manipulation.
While there are several libraries in R that can be used for this, tm and quanteda are the two that I found most useful. quanteda was used for most of this project because it seemed to be faster and easier to use.
A document-feature matrix (DFM) was then created to find word frequencies. Further cleaning was done at this step: all words were converted to lowercase; numbers, punctuation, and Twitter characters (@ and #) were removed; stop words were removed; and the remaining words were stemmed. The list of stop words is documented in the quanteda manual at https://cran.r-project.org/web/packages/quanteda/quanteda.pdf, which links to the websites hosting the stop word lists.
# Combine the 3 sampled files into one vector
allsamples <- c(stwitter, sblogs, snews)
# Create a corpus from the 3 sampled files for easy processing with the quanteda library
samplecorpus <- corpus(allsamples)
# Create the document-feature matrix: convert to lowercase, remove numbers, punctuation,
# Twitter characters, and stop words, then stem
samplesdfm <- dfm(samplecorpus, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE,
                  removeTwitter = TRUE, ignoredFeatures = stopwords("english"), stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 148,297 feature types
## ... removed 174 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 35074 feature variants
## ... created a 3 x 113049 sparse dfm
## ... complete.
## Elapsed time: 16.39 seconds.
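To make the structure of a document-feature matrix concrete, here is a minimal base R sketch on two toy documents. It is not how quanteda builds its sparse matrix, it skips stemming, and the names `docs`, `stop_words`, `tokenize`, and `dfm_toy` are hypothetical.

```r
# Two toy documents (hypothetical data)
docs <- c(doc1 = "The cat sat. The cat ran!",
          doc2 = "A dog sat, and a dog ran 2 miles.")

# Small illustrative stop word list (an assumption, not quanteda's list)
stop_words <- c("the", "a", "and")

# Lowercase, drop digits and punctuation, split on whitespace, drop stop words
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:][:punct:]]+", " ", x)
  words <- strsplit(trimws(x), "\\s+")[[1]]
  words[!(words %in% stop_words)]
}

tokens <- lapply(docs, tokenize)
features <- sort(unique(unlist(tokens)))

# Rows are documents, columns are features, cells are counts
dfm_toy <- t(vapply(tokens,
                    function(w) as.integer(table(factor(w, levels = features))),
                    integer(length(features))))
colnames(dfm_toy) <- features
dfm_toy
```

Column sums of such a matrix give the overall word frequencies, which is the quantity reported later in this report.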
Profanity was then removed from the data. While it is my opinion that profanity is actually a meaningful and informative part of the English language, the project did suggest removing these words. I did a quick Google search for a list of profane words and used one of the smaller lists for quick processing, found at http://www.bannedwordlist.com/lists/swearWords.txt
# Create Profanity word list for filtering
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt")
profanity <- as.character(profanity[-c(37:40),])
# Remove profanity
samplesdfm <- removeFeatures(samplesdfm, profanity)
## removed 64 features, from 80 supplied (glob) feature types
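The idea behind this filter can be sketched in base R as an exact, case-insensitive match against a block list. This is only an illustration: the `tokens` and `blocklist` vectors are mild placeholders, and the exact matching used here is simpler than the glob matching reported in the quanteda output above.

```r
# Hypothetical tokens and block list (placeholders, not the real word list)
tokens <- c("this", "is", "a", "Darn", "good", "heck", "of", "a", "day")
blocklist <- c("darn", "heck")

# Keep only the tokens that do not appear in the block list
kept <- tokens[!(tolower(tokens) %in% blocklist)]
kept
```

Because the block list contains unstemmed words, a filter like this is probably best applied before stemming, or the list stemmed in the same way as the text.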
Now that the data is sufficiently cleaned, n-grams were created from it. An n-gram is a sequence of n words that appear consecutively in the text.
For example, the sentence "How are you today" contains:
2-grams: How are, are you, you today
3-grams: How are you, are you today
4-gram: How are you today
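The n-grams for the example sentence can be generated with a few lines of base R. This is a minimal sketch and not quanteda's implementation; the helper name `make_ngrams` is hypothetical, and words are joined with an underscore to match the feature names that appear later in this report.

```r
# Build all n-grams (joined with "_") from a character vector of tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens <- strsplit(tolower("How are you today"), " ", fixed = TRUE)[[1]]

make_ngrams(tokens, 2)  # "how_are" "are_you" "you_today"
make_ngrams(tokens, 3)  # "how_are_you" "are_you_today"
make_ngrams(tokens, 4)  # "how_are_you_today"
```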
Using n-grams gives more context and information on how words are combined into phrases in English, and will allow for a better prediction model. For this report, only the 2-gram matrix was created to save time, but the commented-out code for 3- and 4-grams is shown below.
# Create n-grams
twograms <- dfm(samplecorpus, ngrams = 2, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE,
                removeTwitter = TRUE, stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 1,831,233 feature types
## ... stemming features (English), trimmed 239959 feature variants
## ... created a 3 x 1591274 sparse dfm
## ... complete.
## Elapsed time: 191.58 seconds.
# threegrams <- dfm(samplecorpus, ngrams = 3, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE,
# removeTwitter = TRUE, stem = TRUE)
# fourgrams <- dfm(samplecorpus, ngrams = 4, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE,
# removeTwitter = TRUE, stem = TRUE)
The wordcloud package provides a quick and easy way to create a very cool visualization of the word frequencies in the data. The larger a word appears, the more often it occurs in the data.
A list of the top 20 most frequent words, as well as the number of times they appear, is also given.
# Create wordcloud plot of all words appearing at least 1000 times
# brewer.pal() needs n >= 3; with n = 1 it warns and returns 3 colors anyway
plot(samplesdfm, random.order=FALSE, min.freq=1000, colors=brewer.pal(3, "Set2"))
# Table of the 20 most frequent words
as.data.frame(topfeatures(samplesdfm, 20))
## topfeatures(samplesdfm, 20)
## just 22165
## get 21473
## like 20968
## will 18993
## one 18745
## go 18608
## love 17015
## time 16905
## can 16740
## day 16210
## thank 14407
## good 14014
## make 13268
## know 13249
## now 12821
## see 11696
## new 11276
## work 11198
## look 10801
## think 10759
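For readers unfamiliar with `topfeatures()`, a frequency ranking like the one above can be sketched in base R with `table()` and `sort()`. The `text` vector is toy data and the code is an illustration only, not the quanteda implementation.

```r
# Toy text (hypothetical data)
text <- c("the cat sat on the mat", "the dog sat")

# Split into lowercase words and count occurrences
tokens <- unlist(strsplit(tolower(text), "\\s+"))
freq <- sort(table(tokens), decreasing = TRUE)

# Most frequent words first
head(freq, 2)
```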
A similar approach was taken for the n-grams. Although n-grams for n = 1 to 4 will all be used for the prediction model, only the 2-gram data is shown here.
Instead of a word cloud, a list of the top 20 2-grams and a bar chart of their frequencies are shown.
topn <- as.data.frame(topfeatures(twograms, 20))
topn
## topfeatures(twograms, 20)
## of_the 20393
## in_the 20313
## for_the 12289
## to_the 11349
## on_the 11049
## to_be 9845
## at_the 7686
## go_to 6975
## i_have 6916
## i_am 6331
## and_the 6285
## i_was 6251
## want_to 6171
## is_a 6161
## have_a 6068
## and_i 5975
## it_was 5950
## in_a 5935
## if_you 5721
## for_a 5688
# Plot 2-gram frequencies as a horizontal bar chart, most frequent at the top
topndf <- data.frame(ngram = names(topfeatures(twograms, 20)), freq = topn[, 1])
ggplot(data = topndf, aes(x = reorder(ngram, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "red") +
  coord_flip() + ylab("Frequency") + xlab("N-gram") + ggtitle("2-gram Frequency - Top 20")
The plan going forward is to use the n-gram data that has been constructed to help with the predictive text model. Due to the time it takes to prepare the data, I stopped at 4-grams for this milestone but plan to go up to 10-gram combinations. I also plan to see whether the sample size needs to be increased, or can be decreased, to help with model speed.
Misspellings and instances where a word or n-gram has not been seen in the data will also need to be handled. The ideas behind Markov chains will be explored further, and Katz back-off models will need to be investigated. These ideas were mentioned in the course materials and will hopefully help with developing the prediction model.
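As a rough illustration of the back-off idea, the sketch below predicts the next word from bigram counts and falls back to the most frequent single word when the previous word has not been seen. It is a deliberately simplified back-off under stated assumptions: it does no discounting or probability redistribution, so it is not a Katz back-off model, and the training text and the function name `predict_next` are hypothetical.

```r
# Tiny hypothetical training text
corpus_text <- c("i want to go home", "i want to eat",
                 "you want to go out", "go to bed")
sentences <- strsplit(tolower(corpus_text), " ", fixed = TRUE)

# Unigram counts
unigrams <- table(unlist(sentences))

# Bigram counts, stored as "first_second"
bigram_list <- lapply(sentences, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1), sep = "_")
})
bigrams <- table(unlist(bigram_list))

# Use the most frequent bigram continuation if the previous word was seen;
# otherwise back off to the most frequent unigram
predict_next <- function(prev_word) {
  prefix <- paste0(tolower(prev_word), "_")
  hits <- bigrams[startsWith(names(bigrams), prefix)]
  if (length(hits) > 0) {
    best <- names(hits)[which.max(hits)]
    return(sub(prefix, "", best, fixed = TRUE))
  }
  names(unigrams)[which.max(unigrams)]
}

predict_next("want")   # "to"
predict_next("to")     # "go"
predict_next("zebra")  # unseen word, backs off to the top unigram "to"
```

A fuller model would start from the longest available n-gram and step down one order at a time, which is where the 3- and 4-gram tables would come in.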