Here’s the code to check if the zip file for the project is in the working directory and download it if it isn’t. It will then unzip it if it hasn’t already been unzipped.
filename <- "Coursera-SwiftKey.zip"
if(!file.exists(filename)){
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, filename)
}
if(!dir.exists("/final/en_US/")){
unzip(filename)
}
First we’ll look at how many strings (lines) each data set contains, along with some summary statistics for the number of characters in those strings.
explore <- data.frame(row.names = c("Number of Strings", "Longest String", "Shortest String", "Average String", "Median String", "File Size (mb)"))
con <- file("./final/en_US/en_US.blogs.txt", "r")
tmp <- readLines(con)
explore$blogs <- c(length(tmp), max(nchar(tmp)), min(nchar(tmp)), round(mean(nchar(tmp)),0), median(nchar(tmp)), round(file.info("./final/en_US/en_US.blogs.txt")$size / 1024 ^ 2, 0))
close(con)
con <- file("./final/en_US/en_US.news.txt", "r")
tmp <- readLines(con)
explore$news <- c(length(tmp), max(nchar(tmp)), min(nchar(tmp)), round(mean(nchar(tmp)),0), median(nchar(tmp)), round(file.info("./final/en_US/en_US.news.txt")$size / 1024 ^ 2, 0))
close(con)
con <- file("./final/en_US/en_US.twitter.txt", "r")
tmp <- readLines(con)
explore$twitter <- c(length(tmp), max(nchar(tmp)), min(nchar(tmp)), round(mean(nchar(tmp)),0), median(nchar(tmp)), round(file.info("./final/en_US/en_US.twitter.txt")$size / 1024 ^ 2, 0))
close(con)
rm(tmp)
explore
                   blogs  news twitter
Number of Strings 899288 77259 2360148
Longest String     40835  5760     213
Shortest String        1     2       2
Average String       232   203      69
Median String        157   186      64
File Size (mb)       200   196     159
There are many more strings in the twitter data set, but they are much shorter on average. At over 550 MB in total, this is a huge amount of text.
Since we don’t need all of the data, I’ll sample 10% of each file using the LaF package and write the samples to a new sub-directory.
library(NLP)
library(tm)
library(LaF)
set.seed(33)
if(!dir.exists("./sample")){
    dir.create("./sample")
}
# Sample 10% of the lines from each file and write the samples into ./sample
nBlogs <- determine_nlines("./final/en_US/en_US.blogs.txt")
writeLines(sample_lines("./final/en_US/en_US.blogs.txt", floor(nBlogs * 0.1), nBlogs), "./sample/blogsSample.txt")
nNews <- determine_nlines("./final/en_US/en_US.news.txt")
writeLines(sample_lines("./final/en_US/en_US.news.txt", floor(nNews * 0.1), nNews), "./sample/newsSample.txt")
nTwitter <- determine_nlines("./final/en_US/en_US.twitter.txt")
writeLines(sample_lines("./final/en_US/en_US.twitter.txt", floor(nTwitter * 0.1), nTwitter), "./sample/twitterSample.txt")
Now that we have our sample we can load it into a corpus and start using the tm package to clean it up.
As part of the cleaning process we will do the following:
Use a custom function to remove any non-ASCII text
Transform all upper case letters to lower case
Remove the profanity
Remove all punctuation
Remove extra spaces (white space)
Remove numbers
For trigrams and higher I want to keep the so-called stop words in the algorithm for predictive purposes, but we’ll take them out of the unigrams and digrams, so I’m making a second corpus called stoppedcorp with the stop words removed.
filename <- "badwords.txt"
if(!file.exists(filename)){
fileURL <- "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt"
download.file(fileURL, filename)
}
badwords <- read.csv("badwords.txt")
badwords <- as.character(badwords[,1])
badwords <- gsub('[])(;:#%$^*\\~{}[&+=@/"`|<>_]+', "", badwords)
corp <- Corpus(DirSource("./sample"))
# Strip any non-ASCII characters
corp <- tm_map(corp, content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, badwords)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removeNumbers)
# Second corpus with the stop words removed, for the unigrams and digrams
stoppedcorp <- tm_map(corp, removeWords, stopwords("SMART"))
Now that we have a clean corpus we can start digging into it to see what we see. I’ll create a Term Document Matrix of single words from the stop-word-free corpus using the tm package.
tdm <- removeSparseTerms(TermDocumentMatrix(stoppedcorp), 0.999)
Load some visualization packages
library(RColorBrewer)
library(ggplot2)
library(wordcloud)
Convert the tdm to a matrix, sum each word’s count across the three sources, sort by frequency in decreasing order, and store the result as a data frame.
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d1 <- data.frame(word = names(v),freq=v)
save(d1, file="d1.RData")
Now we can make a word cloud to visualize the top 100 words.
wordcloud(words = d1$word, freq = d1$freq, scale = c(4, 0.5), random.order = FALSE, colors = brewer.pal(9, "Blues"), max.words = 100)
Let’s take the top 30 words and plot them as a bar chart to better show their relative frequencies.
top30 <- d1[1:30,]
top30$word <- factor(top30$word, levels = top30$word[order(top30$freq)])
p <- ggplot(data=top30, aes(y=freq, x=word))
p <- p + geom_bar(stat="identity", width=1, fill="steelblue", color="blue")
p <- p + geom_text(aes(label=freq), color="white", hjust=1.5, size=3)
p <- p + theme_minimal()
p <- p + labs(y="Count in the Sample", x="Top 30 Words in the Sample")
p <- p + coord_flip()
p
The RWeka package adds the ability to tokenize a corpus into n-grams. I’ll build Term Document Matrices for digrams through quadgrams: the digrams come from the stop-word-free corpus, while the trigrams and quadgrams keep the stop words for prediction.
library(RWeka)
diTok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
diTdm <- TermDocumentMatrix(stoppedcorp, control = list(tokenize = diTok, bounds = list(global = c(3, Inf))))
triTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
triTdm <- TermDocumentMatrix(corp, control = list(tokenize = triTok, bounds = list(global = c(3, Inf))))
quadTok <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
quadTdm <- TermDocumentMatrix(corp, control = list(tokenize = quadTok, bounds = list(global = c(2, Inf))))
Now we can do the same matrix conversion we did for the single words to make a frequency list for each of the n-gram matrices and save them for later use. I’ll also use the stringr package to split each n-gram into its leading words (c1) and its final word (c2), since that’s the form the prediction algorithm will need.
m2 <- as.matrix(diTdm)
v2 <- sort(rowSums(m2),decreasing=TRUE)
d2 <- data.frame(word = names(v2),freq=v2)
m3 <- as.matrix(triTdm)
v3 <- sort(rowSums(m3),decreasing=TRUE)
d3 <- data.frame(word = names(v3),freq=v3)
m4 <- as.matrix(quadTdm)
v4 <- sort(rowSums(m4),decreasing=TRUE)
d4 <- data.frame(word = names(v4),freq=v4)
library(stringr)
d2$c1 <- word(d2$word, 1)
d2$c2 <- word(d2$word, -1)
save(d2, file="d2.RData")
d3$c1 <- word(d3$word, 1, 2)
d3$c2 <- word(d3$word, -1)
save(d3, file="d3.RData")
d4$c1 <- word(d4$word, 1, 3)
d4$c2 <- word(d4$word, -1)
save(d4, file="d4.RData")
Now we can make a word cloud to visualize the top digrams.
wordcloud(words = d2$word, freq = d2$freq, scale = c(4, 0.5), random.order = FALSE, colors = brewer.pal(9, "Blues"), max.words = 100)
And the top trigrams.
wordcloud(words = d3$word, freq = d3$freq, scale = c(4, 0.5), random.order = FALSE, colors = brewer.pal(9, "Blues"), max.words = 100)
And the top quadgrams.
wordcloud(words = d4$word, freq = d4$freq, scale = c(4, 0.5), random.order = FALSE, colors = brewer.pal(9, "Blues"), max.words = 100)
Relative frequency of the top 30 digrams
top30 <- d2[1:30,]
top30$word <- factor(top30$word, levels = top30$word[order(top30$freq)])
p <- ggplot(data=top30, aes(y=freq, x=word))
p <- p + geom_bar(stat="identity", width=1, fill="steelblue", color="blue")
p <- p + geom_text(aes(label=freq), color="white", hjust=1.5, size=3)
p <- p + theme_minimal()
p <- p + labs(y="Count in the Sample", x="Top 30 Digrams in the Sample")
p <- p + coord_flip()
p
Relative frequency of the top 30 trigrams
top30 <- d3[1:30,]
top30$word <- factor(top30$word, levels = top30$word[order(top30$freq)])
p <- ggplot(data=top30, aes(y=freq, x=word))
p <- p + geom_bar(stat="identity", width=1, fill="steelblue", color="blue")
p <- p + geom_text(aes(label=freq), color="white", hjust=1.5, size=3)
p <- p + theme_minimal()
p <- p + labs(y="Count in the Sample", x="Top 30 Trigrams in the Sample")
p <- p + coord_flip()
p
Relative frequency of the top 30 quadgrams
top30 <- d4[1:30,]
top30$word <- factor(top30$word, levels = top30$word[order(top30$freq)])
p <- ggplot(data=top30, aes(y=freq, x=word))
p <- p + geom_bar(stat="identity", width=1, fill="steelblue", color="blue")
p <- p + geom_text(aes(label=freq), color="white", hjust=1.5, size=3)
p <- p + theme_minimal()
p <- p + labs(y="Count in the Sample", x="Top 30 Quadgrams in the Sample")
p <- p + coord_flip()
p
Create a function that, when given a series of words, returns the top 5 most likely next words. I’ll take the frequency lists I just generated and try to match the last 3 words of the input against the quadgrams; if that doesn’t find a match, try the last 2 words against the trigrams, and so on down to the digrams.
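As a rough illustration of that backoff idea, here is a minimal sketch. It assumes the d1 through d4 frequency tables (with their c1 and c2 columns) have been loaded back into the workspace and are still sorted by decreasing frequency; predictNext and its default of 5 suggestions are placeholder choices, not the final implementation.
# Hypothetical backoff lookup: try the quadgrams first, then back off to
# shorter n-grams, and finally to the most frequent single words.
predictNext <- function(phrase, n = 5){
    words <- tolower(unlist(strsplit(phrase, "\\s+")))
    if(length(words) >= 3){
        hits <- d4[d4$c1 == paste(tail(words, 3), collapse = " "), ]
        if(nrow(hits) > 0) return(head(hits$c2, n))
    }
    if(length(words) >= 2){
        hits <- d3[d3$c1 == paste(tail(words, 2), collapse = " "), ]
        if(nrow(hits) > 0) return(head(hits$c2, n))
    }
    if(length(words) >= 1){
        hits <- d2[d2$c1 == tail(words, 1), ]
        if(nrow(hits) > 0) return(head(hits$c2, n))
    }
    # Nothing matched: fall back to the most frequent unigrams
    head(as.character(d1$word), n)
}
predictNext("thanks for the")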
Test the function on other text samples. We’ve only used 10% of the lines in each file to build our corpus, so by changing the seed I can draw a fresh sample to compare against the first.
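For instance, a small holdout set could be drawn the same way as the training sample, just with a different seed; the ./holdout directory, the seed of 42, and the 1% size below are only illustrative choices.
# Draw a holdout sample with a different seed (repeat for the blogs and news files as needed)
set.seed(42)
if(!dir.exists("./holdout")){
    dir.create("./holdout")
}
nTwitter <- determine_nlines("./final/en_US/en_US.twitter.txt")
writeLines(sample_lines("./final/en_US/en_US.twitter.txt", floor(nTwitter * 0.01), nTwitter), "./holdout/twitterHoldout.txt")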
I’ll try out different prediction techniques and find the one that gives the best trade-off between accuracy and speed.
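One simple way to compare them, sketched here on the assumption that a predictNext()-style function and the holdout file above exist, is to hide the last word of each holdout line, check how often it lands in the top 5 suggestions, and time the run.
# Rough accuracy/speed check against the holdout lines
library(stringr)
evaluatePredictor <- function(predictFun, lines, n = 5){
    # Roughly mirror the training clean-up so the holdout words are comparable
    lines <- trimws(stripWhitespace(tolower(removePunctuation(lines))))
    lines <- lines[str_count(lines, "\\S+") >= 2]
    hits <- 0
    elapsed <- system.time(
        for(ln in lines){
            prompt <- word(ln, 1, -2)   # everything except the last word
            answer <- word(ln, -1)      # the word we try to predict
            if(answer %in% predictFun(prompt, n)) hits <- hits + 1
        }
    )["elapsed"]
    c(accuracy = hits / length(lines), seconds = as.numeric(elapsed))
}
holdout <- readLines("./holdout/twitterHoldout.txt")
evaluatePredictor(predictNext, head(holdout, 500))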
Build a Shiny app that takes input from the user and generates the prediction outcome.
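A bare-bones version of that app might look something like the following, again assuming the frequency tables and a predictNext() function along the lines sketched above are available to the app; the layout and labels are only placeholders.
# Minimal Shiny sketch: a text box in, the predicted next words out
library(shiny)
ui <- fluidPage(
    titlePanel("Next Word Prediction"),
    textInput("phrase", "Type a phrase:", value = ""),
    h4("Predicted next words:"),
    verbatimTextOutput("prediction")
)
server <- function(input, output){
    output$prediction <- renderPrint({
        if(nchar(trimws(input$phrase)) == 0){
            "Waiting for input..."
        } else {
            predictNext(input$phrase)
        }
    })
}
shinyApp(ui = ui, server = server)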