Introduction

This milestone report presents the exploratory analysis, data cleaning, and plan for further work for the Coursera Data Science Capstone. The Capstone project involves creating a Shiny app that predicts the next word a user will want to type, given the words the user has already entered.

This kind of task falls under Natural Language Processing (NLP).

Data

The training data for this project are available from HC Corpora here:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The zip file contains text files in several languages. This project uses the three English (en_US) files, which hold text samples from Twitter, news, and blogs.

Objective

As mentioned in the introduction, this project works to create a predictive model that takes words from the user and tries to predict what word will be typed next.

Data Processing

First, the necessary libraries are loaded and the data is downloaded from the URL given above. The archive is quite large, so it is best to download it once into the working directory.

# Load libraries to be used
suppressWarnings(library(tm))
suppressWarnings(library(ggplot2))
suppressWarnings(library(stringi))
suppressWarnings(library(data.table))
suppressWarnings(library(wordcloud))
suppressWarnings(library(quanteda))

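# Download and unzip the data (only if the zip file is not already present)
zipfile <- "Coursera-SwiftKey.zip"
if(!file.exists(zipfile)){
        fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        # Save under the same name that is checked above; mode = "wb" keeps the zip intact on Windows
        download.file(fileURL, destfile = zipfile, mode = "wb")
        unzip(zipfile)
}else{
        print("Already Have Data")
}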
## [1] "Already Have Data"

Next, the data is loaded into R.

# Load the three data files (paths assume the zip was extracted into final/en_US/,
# the same location used for the file sizes below)
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE, encoding="UTF-8")
blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE, encoding="UTF-8")
news <- readLines("final/en_US/en_US.news.txt", skipNul = TRUE, encoding="UTF-8")

Some summary information is then computed for the three data files: word count, line count, and file size.

twitterinfo <- c(sum(stri_count_words(twitter)), length(twitter), round((file.info("final/en_US/en_US.twitter.txt")$size / (1024^2)),2))

blogsinfo <- c(sum(stri_count_words(blogs)), length(blogs), round((file.info("final/en_US/en_US.blogs.txt")$size / (1024^2)),2))

newsinfo <- c(sum(stri_count_words(news)), length(news), round((file.info("final/en_US/en_US.news.txt")$size / (1024^2)),2))

suminfo<- as.data.frame(rbind(twitterinfo,blogsinfo,newsinfo))
rownames(suminfo) <- c('Twitter', 'Blogs', 'News')
colnames(suminfo) <- c('Word Count', 'Line Count', 'File Size (MB)')

suminfo
##         Word Count Line Count File Size (MB)
## Twitter   30093410    2360148         159.36
## Blogs     37546246     899288         155.75
## News       2674536      77259         196.28

The News counts look low for a 196 MB file: 77,259 lines, compared with 899,288 lines in the smaller Blogs file. One possible cause is that readLines, reading in text mode, stopped early at an embedded control character, which is known to happen on Windows. If so, reading through a binary connection, file(path, "rb"), may return the whole file.

Because the data files are so large, a random sample of roughly 10% of the lines in each file was used. This makes the data easier to manipulate and should let the eventual prediction models run faster.

# Sample 10% of data for each file
stwitter <- twitter[as.logical(rbinom(length(twitter),1, prob=0.1))]
sblogs <- blogs[as.logical(rbinom(length(blogs),1, prob=0.1))]
snews <- news[as.logical(rbinom(length(news),1, prob=0.1))]
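
The sampling above draws a different set of lines on every run, so the counts later in this report will not reproduce exactly. A minimal sketch of a reproducible version follows. The helper name sample_lines and the seed value are assumptions made for illustration; they are not part of the original analysis.

```r
# Sketch: reproducible ~10% line sample (helper name and seed are illustrative assumptions)
sample_lines <- function(lines, prob = 0.1, seed = 1234) {
  set.seed(seed)                                   # fix the RNG so the same lines are drawn each run
  keep <- rbinom(length(lines), size = 1, prob = prob) == 1
  lines[keep]
}

toy <- paste("line", 1:1000)                       # stand-in for a real text file
s1 <- sample_lines(toy)
s2 <- sample_lines(toy)
identical(s1, s2)                                  # TRUE: same seed, same sample
length(s1) / length(toy)                           # close to 0.1, but usually not exactly 0.1
```

Because each line is kept independently with probability 0.1, the sample size varies around 10% rather than being fixed. sample(lines, size) would give an exact size instead.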

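Next, text containing non-English characters was removed. The files can contain characters from other languages as well as symbols such as emojis. Each sample was therefore split on ", ", every fragment containing a character that cannot be converted to ASCII was dropped, and the remaining fragments were pasted back together into a single string per source.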

# Flag fragments with non-ASCII characters (iconv inserts the marker string where conversion fails).
# Logical indexing is used because x[-integer(0)] would return an empty vector if nothing were flagged.
stwitter <- unlist(strsplit(stwitter, split=", "))
twitterremove <- grepl("stwitter", iconv(stwitter, "latin1", "ASCII", sub="stwitter"))
stwitter <- stwitter[!twitterremove]
stwitter <- paste(stwitter, collapse = ", ")

sblogs <- unlist(strsplit(sblogs, split=", "))
blogsremove <- grepl("sblogs", iconv(sblogs, "latin1", "ASCII", sub="sblogs"))
sblogs <- sblogs[!blogsremove]
sblogs <- paste(sblogs, collapse = ", ")

snews <- unlist(strsplit(snews, split=", "))
newsremove <- grepl("snews", iconv(snews, "latin1", "ASCII", sub="snews"))
snews <- snews[!newsremove]
snews <- paste(snews, collapse = ", ")
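
The same three steps are repeated for each source, so they could be wrapped in one helper. The sketch below is an illustration, not the code used for this report. The name drop_non_ascii is hypothetical, and it swaps the iconv marker approach for a direct check of Unicode code points with utf8ToInt.

```r
# Hypothetical helper (not in the original report): split text into fragments,
# drop every fragment that contains a non-ASCII character, and rejoin the rest.
drop_non_ascii <- function(x, split = ", ") {
  parts <- unlist(strsplit(x, split = split, fixed = TRUE))
  # A fragment is ASCII-only when all of its code points are <= 127
  ascii_only <- vapply(
    enc2utf8(parts),
    function(p) isTRUE(all(utf8ToInt(p) <= 127L)),
    logical(1),
    USE.NAMES = FALSE
  )
  paste(parts[ascii_only], collapse = split)
}

toy <- c("good morning, caf\u00e9 time", "see you soon, \u2764 it")
drop_non_ascii(toy)        # "good morning, see you soon"
drop_non_ascii("plain")    # "plain" (nothing flagged, nothing dropped)
```

Selecting with a logical vector keeps every fragment when nothing is flagged, which avoids the empty-result pitfall of negative indexing with an empty index.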

Data Corpus and Tokenization

The sampled and partially cleaned data was then used to create a corpus that would allow for further cleaning and manipulation.

While there are several libraries in R that can be used for this, tm and quanteda are the two that I found most useful. quanteda was used for most of this project because it seemed to be faster and easier to use.

A document-feature matrix (DFM) was then created to find word frequencies. Further cleaning was done at this step: all words were converted to lowercase, numbers and punctuation were removed, Twitter characters (@ and #) were removed, stop words were removed, and the remaining words were stemmed. The list of stop words is described in the quanteda documentation at https://cran.r-project.org/web/packages/quanteda/quanteda.pdf, which links to the websites housing the stop-word lists.

# Put all 3 sampled files into one vector (one string, and so one document, per source)
allsamples <- c(stwitter, sblogs, snews)

# Create Corpus of 3 sampled files, allows easy processing with quanteda library
samplecorpus <- corpus(allsamples)

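# Create Document Feature Matrix while changing to all lowercase, removing numbers, punctuation,
# Twitter characters and stop words, and stemming.
# Note: these argument names belong to the quanteda release used for this report; newer releases name them differently.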
samplesdfm <- dfm(samplecorpus, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE,
                  removeTwitter = TRUE, ignoredFeatures = stopwords("english"), stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 148,297 feature types
##    ... removed 174 features, from 174 supplied (glob) feature types
##    ... stemming features (English), trimmed 35074 feature variants
##    ... created a 3 x 113049 sparse dfm
##    ... complete. 
## Elapsed time: 16.39 seconds.
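
To make the cleaning steps concrete, here is a small base-R sketch of the same idea on toy text. It is a stand-in for illustration, not what quanteda does internally. The function name clean_tokens and the five-word stop-word list are assumptions, and stemming is left out.

```r
# Sketch only: base-R stand-in for the cleaning steps, with a tiny hypothetical stop-word list
clean_tokens <- function(text, stop_words = c("the", "a", "is", "to", "and")) {
  text <- tolower(text)                              # lowercase
  text <- gsub("[[:digit:]]+", " ", text)            # drop numbers
  text <- gsub("[[:punct:]]+", " ", text)            # drop punctuation, including @ and #
  tokens <- unlist(strsplit(text, "[[:space:]]+"))   # split on whitespace
  tokens <- tokens[nzchar(tokens)]                   # discard empty strings
  tokens[!tokens %in% stop_words]                    # remove stop words
}

toy <- c("The cat is happy!", "A happy cat, and 2 happy dogs #pets")
freq <- sort(table(clean_tokens(toy)), decreasing = TRUE)
freq    # happy appears 3 times, cat 2, dogs and pets once each
```

The resulting table plays the role of a single row of the document-feature matrix: one count per remaining word.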

Profanity was then removed from the data. While it is my opinion that profanity is a meaningful and informative part of the English language, the project suggested removing these words. After a quick Google search for lists of profane words, I used one of the smaller lists for quick processing, found at http://www.bannedwordlist.com/lists/swearWords.txt

# Create profanity word list for filtering
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt")
# Convert to a character vector, dropping entries 37 to 40 of the downloaded list
profanity <- as.character(profanity[-c(37:40),])

# Remove profanity 
samplesdfm <- removeFeatures(samplesdfm, profanity)
## removed 64 features, from 80 supplied (glob) feature types

With the data sufficiently cleaned, n-grams were created. An n-gram is a sequence of n words that appear consecutively in the text.

For example, in the sentence: How are you today

2-gram: How are
3-gram: How are you
4-gram: How are you today
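
The example lists one n-gram of each size; a sentence generally contains several. The base-R sketch below enumerates all of them. The function name make_ngrams is an assumption for illustration, and words are joined with "_" to match the n-gram labels in the tables later in this report.

```r
# Sketch: all n-grams of a sentence, lowercased and joined with "_"
make_ngrams <- function(sentence, n) {
  words <- unlist(strsplit(tolower(sentence), "[[:space:]]+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))        # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)            # position of the first word of each n-gram
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = "_"), character(1))
}

make_ngrams("How are you today", 2)   # "how_are" "are_you" "you_today"
make_ngrams("How are you today", 3)   # "how_are_you" "are_you_today"
make_ngrams("How are you today", 4)   # "how_are_you_today"
```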

Using n-grams gives more context about how words combine into phrases in English, and should allow for a better prediction model. For this report, only the 2-gram matrix was created to save time; the commented-out code for 3- and 4-grams is shown below. Note that the 2-gram matrix is built directly from the corpus, so stop words and profanity are not filtered out at this step.

# Create N-grams
twograms <- dfm(samplecorpus, ngrams = 2, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE,
                  removeTwitter = TRUE, stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 1,831,233 feature types
##    ... stemming features (English), trimmed 239959 feature variants
##    ... created a 3 x 1591274 sparse dfm
##    ... complete. 
## Elapsed time: 191.58 seconds.
# threegrams <- dfm(samplecorpus, ngrams = 3, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE,
#                removeTwitter = TRUE, stem = TRUE)
# fourgrams <- dfm(samplecorpus, ngrams = 4, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE,
#                removeTwitter = TRUE, stem = TRUE)

Exploratory Analysis

The wordcloud package offers a quick and easy way to visualize the word frequencies in the data. The larger a word appears, the more often it occurs in the data.

A list of the top 20 most frequent words, as well as the number of times they appear, is also given.

# Create wordcloud plot of all words appearing at least 1000 times
# (brewer.pal needs n >= 3; asking for 1 colour only returns the 3-colour palette with a warning)
plot(samplesdfm, random.order=FALSE, min.freq=1000, colors=brewer.pal(3, "Set2"))

# Table of the 20 most frequent words
as.data.frame(topfeatures(samplesdfm, 20))
##       topfeatures(samplesdfm, 20)
## just                        22165
## get                         21473
## like                        20968
## will                        18993
## one                         18745
## go                          18608
## love                        17015
## time                        16905
## can                         16740
## day                         16210
## thank                       14407
## good                        14014
## make                        13268
## know                        13249
## now                         12821
## see                         11696
## new                         11276
## work                        11198
## look                        10801
## think                       10759

A similar approach was taken for the n-grams. The prediction model will use values of n from 1 to 4, but only the 2-gram data is shown here.

Instead of a word cloud, the 20 most frequent 2-grams are listed and then plotted as a frequency bar chart.

topn <- as.data.frame(topfeatures(twograms, 20))
topn
##         topfeatures(twograms, 20)
## of_the                      20393
## in_the                      20313
## for_the                     12289
## to_the                      11349
## on_the                      11049
## to_be                        9845
## at_the                       7686
## go_to                        6975
## i_have                       6916
## i_am                         6331
## and_the                      6285
## i_was                        6251
## want_to                      6171
## is_a                         6161
## have_a                       6068
## and_i                        5975
## it_was                       5950
## in_a                         5935
## if_you                       5721
## for_a                        5688
# Plot 2-gram frequencies, using named columns instead of positional indexing inside aes()
topndf <- data.frame(ngram = rownames(topn), freq = topn[, 1])

ggplot(data = topndf, aes(x = reorder(ngram, freq), y = freq)) + geom_bar(stat = "identity", fill = "red") +
        coord_flip() + ylab("Frequency") + xlab("N-gram") + ggtitle("2-gram Frequency - Top 20")

Shiny App Plan

The plan going forward is to use the n-gram data constructed here to build the predictive text model. Because preparing the data takes a long time, I stopped at 4-grams for this milestone (of which only the 2-gram matrix is shown above), but I plan to go up to 10-gram combinations. I will also check whether the 10% sample needs to be larger, or can be smaller, to help with model speed.

Misspellings, and words or n-grams that have not been seen in the data, will also need to be handled. The ideas behind Markov chains will be explored further, and Katz back-off models will be investigated. These ideas were mentioned in the course materials and should help with developing the prediction model.
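
As a rough illustration of the back-off idea, the base-R sketch below looks for the longest context that has been seen and falls back to shorter ones when it has not. It is a simplified greedy lookup, not a Katz back-off model: there is no discounting and no redistribution of probability mass. The function names build_counts and predict_next and the toy sentence are assumptions made for this example.

```r
# Count all n-grams of size n in a token vector (words joined with "_")
build_counts <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = "_"), character(1))
  table(grams)
}

# Greedy back-off: counts_by_n[[n]] holds the n-gram counts; try the longest context first
predict_next <- function(history, counts_by_n) {
  for (n in rev(seq_along(counts_by_n)[-1])) {         # n = max, ..., 2
    if (length(history) < n - 1) next                  # not enough words for this context
    prefix <- paste0(paste(tail(history, n - 1), collapse = "_"), "_")
    counts <- counts_by_n[[n]]
    keys <- names(counts)
    if (is.null(keys)) next                            # no n-grams of this size at all
    hits <- counts[startsWith(keys, prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]             # most frequent continuation
      return(substring(best, nchar(prefix) + 1))       # keep only the predicted word
    }
  }
  unigrams <- counts_by_n[[1]]                         # context never seen: most frequent word
  names(unigrams)[which.max(unigrams)]
}

tokens <- c("i", "want", "to", "go", "to", "the", "store", "and", "i", "want", "to", "eat")
counts <- lapply(1:3, function(n) build_counts(tokens, n))

predict_next(c("go", "to"), counts)     # "the"   (found via the 3-gram go_to_the)
predict_next(c("xyz", "the"), counts)   # "store" (3-gram context unseen, backs off to the_store)
predict_next("xyz", counts)             # "to"    (nothing matches, most frequent single word)
```

An unseen word such as a misspelling simply triggers the fall-back to a shorter context, which is one way the unseen-data problem described above could be handled.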