Introduction

This milestone report walks through loading the Coursera SwiftKey data set, sampling and cleaning it, exploring word and n-gram frequencies, and the planned next steps toward a word prediction model.

Loading the data

Here’s the code to check if the zip file for the project is in the working directory and download it if it isn’t. It will then unzip it if it hasn’t already been unzipped.

filename <- "Coursera-SwiftKey.zip"
if(!file.exists(filename)){
        fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        download.file(fileURL, filename)
}
if(!dir.exists("/final/en_US/")){
        unzip(filename)
}

Preliminary Analysis

First we’ll look at how many lines each data set contains, along with some summary statistics for the number of characters per line.

# Summary table: line counts, string-length stats, and file size for each source
explore <- data.frame(row.names = c("Number of Strings", "Longest String", "Shortest String", "Average String", "Median String", "File Size (mb)"))

con <- file("./final/en_US/en_US.blogs.txt", "r")
tmp <- readLines(con)
explore$blogs <- c(length(tmp), max(nchar(tmp)), min(nchar(tmp)), round(mean(nchar(tmp)),0), median(nchar(tmp)), round(file.info("./final/en_US/en_US.blogs.txt")$size / 1024 ^ 2, 0))
close(con)

con <- file("./final/en_US/en_US.news.txt", "r")
tmp <- readLines(con)
explore$news <- c(length(tmp), max(nchar(tmp)), min(nchar(tmp)), round(mean(nchar(tmp)),0), median(nchar(tmp)), round(file.info("./final/en_US/en_US.news.txt")$size / 1024 ^ 2, 0))
close(con)

con <- file("./final/en_US/en_US.twitter.txt", "r")
tmp <- readLines(con)
explore$twitter <- c(length(tmp), max(nchar(tmp)), min(nchar(tmp)), round(mean(nchar(tmp)),0), median(nchar(tmp)), round(file.info("./final/en_US/en_US.twitter.txt")$size / 1024 ^ 2, 0))
close(con)
rm(tmp)

explore
                   blogs  news twitter
Number of Strings 899288 77259 2360148
Longest String     40835  5760     213
Shortest String        1     2       2
Average String       232   203      69
Median String        157   186      64
File Size (mb)       200   196     159

There are many more strings in the Twitter data set, but they are much shorter on average. At over 550 MB combined, this is far more text than we need to work with directly.

Data Subsetting

Since we don’t need all of the data, I’ll sample 10% of each file using the LaF package and write the samples to a new sub-directory.

library(NLP)
library(tm)
library(LaF)
set.seed(33)
if(!dir.exists("./sample")){
        dir.create("./sample")
}

blogsLines <- determine_nlines("./final/en_US/en_US.blogs.txt")
writeLines(sample_lines("./final/en_US/en_US.blogs.txt", round(blogsLines * 0.1), blogsLines), "./sample/blogsSample.txt")

newsLines <- determine_nlines("./final/en_US/en_US.news.txt")
writeLines(sample_lines("./final/en_US/en_US.news.txt", round(newsLines * 0.1), newsLines), "./sample/newsSample.txt")

twitterLines <- determine_nlines("./final/en_US/en_US.twitter.txt")
writeLines(sample_lines("./final/en_US/en_US.twitter.txt", round(twitterLines * 0.1), twitterLines), "./sample/twitterSample.txt")
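
As a quick, optional sanity check (not part of the original pipeline), we can confirm that the three sample files ended up in ./sample and hold roughly 10% of the original lines; determine_nlines comes from the same LaF package.

# Optional check: line counts of the sampled files
sapply(list.files("./sample", full.names = TRUE), determine_nlines)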

Preprocessing

Now that we have our sample, we can load it into a corpus and use the tm package to clean it up.

As part of the cleaning process we will do the following:

  1. Use a custom function to remove any non-ASCII text

  2. Transform all upper case letters to lower case

  3. Remove profanity, using a downloaded list of bad words

  4. Remove all punctuation

  5. Remove extra spaces (white space)

  6. Remove numbers

  7. Keep the so-called stop words for trigrams and higher, since they carry predictive value, but remove them from the text used for unigrams and digrams. To do this I’ll create a second corpus, stoppedcorp, with the stop words removed.

filename <- "badwords.txt"
if(!file.exists(filename)){
        fileURL <- "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt"
        download.file(fileURL, filename)
}
badwords <- readLines("badwords.txt")
badwords <- gsub('[])(;:#%$^*\\~{}[&+=@/"`|<>_]+', "", badwords)
corp <- Corpus(DirSource("./sample"))
# Custom transformation: convert to ASCII, dropping any characters that can't be converted
corp <- tm_map(corp, content_transformer(function(x){return(iconv(x, from = "UTF-8", to = "ASCII", sub = ""))}))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, badwords)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removeNumbers)
# Second corpus for unigrams and digrams, with stop words removed
stoppedcorp <- tm_map(corp, removeWords, stopwords("SMART"))
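
As a quick, optional sanity check, we can peek at the first few lines of one cleaned document to confirm the transformations took effect (corp here is the corpus that still contains stop words).

# Look at the first few cleaned lines of the first document in the corpus
head(content(corp[[1]]), 3)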

Exploratory Analysis

Now that we have a clean corpus, we can start digging into it. I’ll create a term-document matrix of single words from the stop-word-free corpus using the tm package.

tdm <- removeSparseTerms(TermDocumentMatrix(stoppedcorp), 0.999)
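
To get a feel for the object (purely illustrative), tm’s inspect() prints a small summary of a subset of terms across the three source documents.

# Print a few rows of the term-document matrix
inspect(tdm[1:5, ])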

Load some visualization packages

library(RColorBrewer)
library(ggplot2)
library(wordcloud)

Convert the TDM into a matrix, sum each word’s counts across the three sources, sort by frequency, and store the result as a data frame.

m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d1 <- data.frame(word = names(v),freq=v)
save(d1, file="d1.RData")

Now we can make a word cloud to visualize the top 100 words.

wordcloud(words = d1$word, freq = d1$freq, scale = c(4, 0.5), random.order = FALSE, colors = brewer.pal(9, "Blues"), max.words = 100)

Let’s take the top 30 words and plot them as a bar chart to better show their relative frequencies.

top30 <- d1[1:30,]
top30$word <- factor(top30$word, levels = top30$word[order(top30$freq)])
p <- ggplot(data=top30, aes(y=freq, x=word))
p <- p + geom_bar(stat="identity", width=1, fill="steelblue", color="blue")
p <- p + geom_text(aes(label=freq), color="white", hjust=1.5, size=3)
p <- p + theme_minimal()
p <- p + labs(y="Count in the Sample", x="Top 30 Words in the Sample")
p <- p + coord_flip()
p

Tokenization

The RWeka package adds the ability to tokenize a corpus into n-grams. I’ll build term-document matrices for digrams through quadgrams: the digrams from the stop-word-free corpus, and the trigrams and quadgrams from the corpus that still contains stop words, as planned above.

library(RWeka)
diTok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
diTdm <- TermDocumentMatrix(stoppedcorp, control = list(tokenize = diTok, bounds = list(global = c(3, Inf))))

triTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
triTdm <- TermDocumentMatrix(corp, control = list(tokenize = triTok, bounds = list(global = c(3, Inf))))

quadTok <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
quadTdm <- TermDocumentMatrix(corp, control = list(tokenize = quadTok, bounds = list(global = c(2, Inf))))

Now we can do the same matrix conversion we did for the single words to build a frequency list for each n-gram size. For each n-gram I’ll also split off the leading words (c1) and the final word (c2), since that is the form the prediction step will need, then save the tables for later use.

m2 <- as.matrix(diTdm)
v2 <- sort(rowSums(m2),decreasing=TRUE)
d2 <- data.frame(word = names(v2),freq=v2)

m3 <- as.matrix(triTdm)
v3 <- sort(rowSums(m3),decreasing=TRUE)
d3 <- data.frame(word = names(v3),freq=v3)

m4 <- as.matrix(quadTdm)
v4 <- sort(rowSums(m4),decreasing=TRUE)
d4 <- data.frame(word = names(v4),freq=v4)
library(stringr)
# Split each digram into its first word (c1) and final word (c2)
d2$c1 <- word(d2$word, 1)
d2$c2 <- word(d2$word, -1)
save(d2, file="d2.RData")

# Split each trigram into its first two words (c1) and final word (c2)
d3$c1 <- word(d3$word, 1, 2)
d3$c2 <- word(d3$word, -1)
save(d3, file="d3.RData")

# Split each quadgram into its first three words (c1) and final word (c2)
d4$c1 <- word(d4$word, 1, 3)
d4$c2 <- word(d4$word, -1)
save(d4, file="d4.RData")
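
To see why the c1/c2 split is useful, here’s a small illustrative lookup (the word "happy" is just an example): because d2 is already sorted by frequency, the first matches are the most common words that follow the given word in the sample.

# The most common words that follow "happy" in the digram table
head(d2[d2$c1 == "happy", c("c2", "freq")])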

Now we can make a word cloud to visualize the top digrams.

wordcloud(words = d2$word, freq = d2$freq, scale = c(4, 0.5), random.order = FALSE, colors = brewer.pal(9, "Blues"), max.words = 100)

And the top trigrams.

wordcloud(words = d3$word, freq = d3$freq, scale = c(4, 0.5), random.order = FALSE, colors = brewer.pal(9, "Blues"), max.words = 100)

And the top quadgrams.

wordcloud(words = d4$word, freq = d4$freq, scale = c(4, 0.5), random.order = FALSE, colors = brewer.pal(9, "Blues"), max.words = 100)

Relative frequency of the top 30 digrams

top30 <- d2[1:30,]
top30$word <- factor(top30$word, levels = top30$word[order(top30$freq)])
p <- ggplot(data=top30, aes(y=freq, x=word))
p <- p + geom_bar(stat="identity", width=1, fill="steelblue", color="blue")
p <- p + geom_text(aes(label=freq), color="white", hjust=1.5, size=3)
p <- p + theme_minimal()
p <- p + labs(y="Count in the Sample", x="Top 30 Digrams in the Sample")
p <- p + coord_flip()
p

Relative frequency of the top 30 trigrams

top30 <- d3[1:30,]
top30$word <- factor(top30$word, levels = top30$word[order(top30$freq)])
p <- ggplot(data=top30, aes(y=freq, x=word))
p <- p + geom_bar(stat="identity", width=1, fill="steelblue", color="blue")
p <- p + geom_text(aes(label=freq), color="white", hjust=1.5, size=3)
p <- p + theme_minimal()
p <- p + labs(y="Count in the Sample", x="Top 30 Trigrams in the Sample")
p <- p + coord_flip()
p

Relative frequency of the top 30 quadgrams

top30 <- d4[1:30,]
top30$word <- factor(top30$word, levels = top30$word[order(top30$freq)])
p <- ggplot(data=top30, aes(y=freq, x=word))
p <- p + geom_bar(stat="identity", width=1, fill="steelblue", color="blue")
p <- p + geom_text(aes(label=freq), color="white", hjust=1.5, size=3)
p <- p + theme_minimal()
p <- p + labs(y="Count in the Sample", x="Top 30 Trigrams in the Sample")
p <- p + coord_flip()
p

Next Steps

  1. Create a function that, when given a series of words, returns the top 5 most likely next words. I’ll take the frequency lists I just generated and try to match the last three words of the input against the quadgram table; if there is no match, back off to the last two words with the trigrams, then the last word with the digrams (a rough sketch of this backoff idea follows this list).

  2. Test the function on other text samples. We’ve only used 10% of the lines in the data set to build our corpus, so if I change the seed I can generate a fresh sample to test against.

  3. Try out different prediction techniques and pick the one with the best accuracy-to-speed trade-off.

  4. Build a Shiny app that takes input from the user and generates the prediction outcome.
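
Below is a minimal sketch of the backoff idea from step 1, assuming the d1 through d4 tables built above are in memory. predictNext() is a hypothetical helper name, not the final app code; the real function will also need to clean its input the same way the corpus was cleaned.

# Rough backoff sketch: look up the longest available context first,
# then fall back to shorter contexts, then to the most frequent unigrams.
predictNext <- function(phrase, n = 5) {
        words <- unlist(strsplit(tolower(phrase), "\\s+"))
        tables <- list(list(d = d4, k = 3), list(d = d3, k = 2), list(d = d2, k = 1))
        for (tbl in tables) {
                if (length(words) >= tbl$k) {
                        context <- paste(tail(words, tbl$k), collapse = " ")
                        hits <- tbl$d[tbl$d$c1 == context, ]
                        if (nrow(hits) > 0) {
                                # Tables are sorted by frequency, so the first n matches are the most likely
                                return(head(as.character(hits$c2), n))
                        }
                }
        }
        # No n-gram match: fall back to the most frequent single words
        head(as.character(d1$word), n)
}

predictNext("thanks for the")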