Introduction

The purpose of this report is to demonstrate my understanding of the first two weeks of the Capstone material. Specifically, it aims to show:

1. Downloading the data and successfully loading it into the R environment.
2. Creating a basic report of summary statistics about the data sets.
3. Reporting any interesting findings.
4. Outlining plans for the prediction algorithm and Shiny app.

Load Libraries

library(stringi)
library(tm)
library(magrittr)
library(RWeka)
library(ggplot2)

Load Data

# Download the file from the source, store in directory
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "training.zip")

# Unzip file into a "training" directory so the paths used below resolve
unzip("training.zip", exdir = "training")

We have 4 languages in the dataset: German, English (US), Finnish and Russian. Since I do not fully understand the semantics and syntax of the other languages, I will work only with the English files to avoid introducing mistakes.

Within the English section we have data from blogs, news and Twitter. We will load all three, but first let's get some information on the files.
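The code for this step is not shown; the file information below could have been produced with something along these lines (the paths assume the unzipped directory structure used later in this report):

# Possible way to compute file sizes in MB (illustrative sketch)
files<-c(blogs   = "training/final/en_US/en_US.blogs.txt",
         news    = "training/final/en_US/en_US.news.txt",
         twitter = "training/final/en_US/en_US.twitter.txt")
data.frame(file=names(files), file_size_MB=file.size(files)/1024^2)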

##      file file_size_MB
## 1   blogs     200.4242
## 2    news     196.2775
## 3 twitter     159.3641

The file sizes seem manageable, so we will load them in.

blogs<-readLines("training/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news<-readLines("training/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter<-readLines("training/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
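One aside worth noting: on some platforms readLines() can stop early on en_US.news.txt because of an embedded control character, which makes its line count look low relative to the other files. A possible workaround (not applied here, so all statistics below reflect whatever readLines() returned) is to read the file through a binary-mode connection:

# Possible workaround for a truncated news file (illustrative sketch)
con<-file("training/final/en_US/en_US.news.txt", open="rb")
news<-readLines(con, encoding="UTF-8", skipNul=TRUE)
close(con)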

Summary Statistics

Let's learn a bit about our data to get a sense of its type and magnitude.

# Get the line count for each dataset
blog_lines<-length(blogs)
news_lines<-length(news)
twitter_lines<-length(twitter)

# Get the word count for each dataset
blog_wordcount<-sum(stri_count_words(blogs))
news_wordcount<-sum(stri_count_words(news))
twitter_wordcount<-sum(stri_count_words(twitter))

# Get the avg words per line
blog_avg<-round(blog_wordcount/blog_lines,1)
news_avg<-round(news_wordcount/news_lines,1)
twitter_avg<-round(twitter_wordcount/twitter_lines,1)

# Output in a nice table
Stats<-data.frame(Dataset=c("blogs","news","twitter"),
                  LineCount=c(blog_lines,news_lines,twitter_lines),
                  WordCount=c(blog_wordcount,news_wordcount,twitter_wordcount),
                  AvgWordsPerLine=c(blog_avg,news_avg,twitter_avg))
Stats
##   Dataset LineCount WordCount AvgWordsPerLine
## 1   blogs    899288  37546246            41.8
## 2    news     77259   2674536            34.6
## 3 twitter   2360148  30093410            12.8

Sampling & Corpus

As the datasets are quite large, I will work with a sample of 1,000 lines per dataset from here on (matching the sample.size set below). This is to speed up the demonstration of the preprocessing stage; a more careful choice of sample size will be made for the final product.
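As a quick, optional check on how large the raw vectors actually are in memory, object.size() can be used (this is not needed for the sampling step):

# Optional: in-memory size of the loaded character vectors
format(object.size(blogs), units="MB")
format(object.size(news), units="MB")
format(object.size(twitter), units="MB")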

# Set seed for reproducibility
set.seed(1234)

# Set sample size
sample.size<-1000

# Sample the 3 datasets
blogs_sample<-sample(blogs, sample.size, replace=FALSE)
news_sample<-sample(news, sample.size, replace=FALSE)
twitter_sample<-sample(twitter, sample.size, replace=FALSE)

# Combine 3 samples into one 
sample<-c(blogs_sample,news_sample,twitter_sample)

Next we need to transform the sample into a corpus so that we can map our cleaning functions onto it. First we apply a conversion to strip all non-English (non-ASCII) characters. I have not found a way to apply this at the VCorpus level without breaking the object type, so it is applied to the raw character vector here.

sample<-iconv(sample, "latin1", "ASCII", sub="")
sample_corpus<-sample %>%
              VectorSource() %>%
              VCorpus()

Preprocessing

Now we can use a mix of custom and out-of-the-box functions to clean the data.

# Set function to sub a space in place of the matched pattern
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Remove hashtags, @-mentions and http links first, before punctuation is
# stripped, otherwise the "#", "@" and "://" these patterns rely on are gone
sample_corpus<-tm_map(sample_corpus,toSpace, "#\\w+")
sample_corpus<-tm_map(sample_corpus,toSpace, "@\\w+")
sample_corpus<-tm_map(sample_corpus, toSpace, "http[^[:space:]]*")
# Remove punctuation
sample_corpus<-tm_map(sample_corpus,removePunctuation)
# Remove numbers
sample_corpus<-tm_map(sample_corpus,removeNumbers)
# Make all lowercase
sample_corpus<-tm_map(sample_corpus,content_transformer(tolower))
# Remove stop words (before stemming, so the stop-word list still matches)
sample_corpus<-tm_map(sample_corpus, removeWords, stopwords("en"))
# Stem words
sample_corpus<-tm_map(sample_corpus,stemDocument)

N-Grams

An n-gram is a sequence of n words. We can use n-grams to look at word frequencies not only for single words but for groupings of words. For this demonstration we will be looking at unigrams (n = 1), bigrams (n = 2), trigrams (n = 3) and quadgrams (n = 4). We will visualise them in the EDA section, but for now we need to create the various n-grams.
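To make the idea concrete, here is what the RWeka tokenizer used below returns for a toy sentence (illustrative only; the output is the set of overlapping word pairs such as "to be", "be or", "or not", ...):

# Illustrative example: bigrams of a toy sentence
NGramTokenizer("to be or not to be", Weka_control(min = 2, max = 2))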

# Create Tokenizers
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
quadgramTokenizer <-function(x) NGramTokenizer(x, Weka_control(min =4, max = 4))

# Create Term Document Matrix
uniGramTDM <- TermDocumentMatrix(sample_corpus, control = list(tokenize = unigramTokenizer))
biGramTDM <- TermDocumentMatrix(sample_corpus, control = list(tokenize = bigramTokenizer))
triGramTDM <- TermDocumentMatrix(sample_corpus, control = list(tokenize = trigramTokenizer))
quadGramTDM <- TermDocumentMatrix(sample_corpus, control = list(tokenize = quadgramTokenizer))

# Convert TDM to Matrix
uniGramMatrix<-as.matrix(uniGramTDM)
biGramMatrix<-as.matrix(biGramTDM)
triGramMatrix<-as.matrix(triGramTDM)
quadGramMatrix<-as.matrix(quadGramTDM)
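One caveat: as.matrix() turns a sparse TermDocumentMatrix into a dense matrix, which is fine for this 3,000-line sample but can exhaust memory on larger ones. A possible alternative (a sketch only, not used in the rest of this report) is to sum term frequencies directly on the sparse matrix via the slam package, which tm already depends on:

# Sketch: sparse alternative to as.matrix() + rowSums() for larger samples
library(slam)
uniGramFreq<-sort(slam::row_sums(uniGramTDM), decreasing=TRUE)
head(uniGramFreq)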

Before moving on, it is worth doing a couple of spot checks on the cleaned corpus and the resulting matrices.
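For example (illustrative only; any document index and any of the TDMs would do):

# Spot check: one cleaned document and a few of the extracted bigram terms
as.character(sample_corpus[[1]])
head(Terms(biGramTDM))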

Exploratory Data Analysis

After all of the data loading, preprocessing and n-gram creation, it is finally time to visualise some of our findings. We are looking at the highest-occurring term groupings for each of our n-grams.

Unigram

termfrequency<-rowSums(uniGramMatrix) %>%
                sort(decreasing=TRUE)
term_frequency1<- data.frame(Unigram=names(termfrequency),frequency=termfrequency)

g<-ggplot(head(term_frequency1,30), aes(x=reorder(Unigram, frequency), y=frequency))+
  geom_bar(stat="identity")+
  coord_flip()+
  theme(legend.title=element_blank())+
  xlab("Unigram")+
  ylab("Frequency")+
  labs(title="Top Occurring Unigrams")
print(g)

Bigram

termfrequency<-rowSums(biGramMatrix) %>%
                sort(decreasing=TRUE)
term_frequency2<- data.frame(Bigram=names(termfrequency),frequency=termfrequency)

g<-ggplot(head(term_frequency2,30), aes(x=reorder(Bigram, frequency), y=frequency))+
  geom_bar(stat="identity")+
  coord_flip()+
  theme(legend.title=element_blank())+
  xlab("Bigram")+
  ylab("Frequency")+
  labs(title="Top Occurring Bigrams")
print(g)

Trigram

termfrequency<-rowSums(triGramMatrix) %>%
                sort(decreasing=TRUE)
term_frequency3<- data.frame(Trigram=names(termfrequency),frequency=termfrequency)

g<-ggplot(head(term_frequency3,30), aes(x=reorder(Trigram, frequency), y=frequency))+
  geom_bar(stat="identity")+
  coord_flip()+
  theme(legend.title=element_blank())+
  xlab("Trigram")+
  ylab("Frequency")+
  labs(title="Top Occurring Trigrams")
print(g)

Quadgram

termfrequency<-rowSums(quadGramMatrix) %>%
                sort(decreasing=TRUE)
term_frequency4<- data.frame(Quadgram=names(termfrequency),frequency=termfrequency)

g<-ggplot(head(term_frequency4,30), aes(x=reorder(Quadgram, frequency), y=frequency))+
  geom_bar(stat="identity")+
  coord_flip()+
  theme(legend.title=element_blank())+
  xlab("Quadgram")+
  ylab("Frequency")+
  labs(title="Top Occurring Quadgrams")
print(g)

Next Steps

From this basic EDA I understand that how the data is preprocessed is going to have a big impact on the prediction model: in the tri- and quadgrams we already see odd character strings popping up, which degrades the model's potential. These n-grams will nevertheless form the building blocks of my prediction. I still need to explore exactly how this works, but the idea is to take the last word or words the user has typed, look for matching bi-/tri-/quadgrams, and predict the most likely next word, presumably combining evidence across multiple n-gram orders. I will wrap this all in a Shiny app that allows user input of words and visualises the predicted word.
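As a very rough sketch of that idea (not the final model: predict_next is a hypothetical placeholder, it relies only on the small sampled frequency tables built above, and it uses naive prefix matching rather than a proper back-off scheme):

# Minimal sketch of the back-off idea, using the sampled frequency tables
# from the EDA above; predict_next is a hypothetical placeholder
predict_next<-function(phrase, n=3){
  words<-tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  # Try trigrams whose first two words match the end of the input
  if(length(words)==2){
    pattern<-paste0("^", paste(words, collapse=" "), " ")
    hits<-as.character(term_frequency3$Trigram)
    hits<-hits[grepl(pattern, hits)]
    if(length(hits)>0) return(head(sub(pattern, "", hits), n))
  }
  # Back off to bigrams starting with the last word
  pattern<-paste0("^", tail(words, 1), " ")
  hits<-as.character(term_frequency2$Bigram)
  hits<-hits[grepl(pattern, hits)]
  head(sub(pattern, "", hits), n)
}
# Example usage (the corpus was stemmed and stop words were removed,
# so content-word inputs work best)
predict_next("new york")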