Adrian Lim 18 March 2016
This report details the current progress of the capstone project, which is to design and implement a Shiny application for text prediction using the HC Corpora dataset. In particular, we show the data cleansing steps, the exploratory data analysis and the preliminary findings so far, and outline the plan for the remainder of the project.
Load the required R libraries.
library(tm)
library(RWeka)
library(wordcloud)
library(stringi)
library(R.utils)
library(dplyr)
library(ggplot2)
library(knitr)
The data is downloaded from the Capstone Dataset.
The data comes from the HC Corpora corpus, which is further described in the HC Corpora README. The zipped file is downloaded into the current working directory for further processing.
There are three files in the English directory of the dataset, namely the blogs, news and twitter data, which we will use for this project.
#file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#destination <- "./download/Coursera-Swiftkey.zip"
#download.file(file,destination)
#unzip("./download/Coursera-SwiftKey.zip")
blogs <- readLines("./en_US.blogs.txt",encoding='UTF-8',skipNul=TRUE)
news <- readLines("./en_US.news.txt",encoding='UTF-8',skipNul=TRUE)
twitter <- readLines("./en_US.twitter.txt",encoding='UTF-8',skipNul=TRUE)
We now gather basic file information for each of the three files: the file size, the number of lines, the number of words, and the maximum/minimum number of characters per line.
blogsFileSize <- file.info("./en_US.blogs.txt")$size/(1024*1024)
newsFileSize <- file.info("./en_US.news.txt")$size/(1024*1024)
twitterFileSize <- file.info("./en_US.twitter.txt")$size/(1024*1024)
blogsNumWords <- sum(stri_count_words(blogs))
newsNumWords <- sum(stri_count_words(news))
twitterNumWords <- sum(stri_count_words(twitter))
blogsMaxCharsLine <- max(nchar(blogs))
newsMaxCharsLine <- max(nchar(news))
twitterMaxCharsLine <- max(nchar(twitter))
blogsMinCharsLine <- min(nchar(blogs))
newsMinCharsLine <- min(nchar(news))
twitterMinCharsLine <- min(nchar(twitter))
summary <- data.frame(filename = c("blogs","news","twitter"),
filesizeMB = c(blogsFileSize, newsFileSize, twitterFileSize),
numLines = c(length(blogs),length(news),length(twitter)),
numWords = c(blogsNumWords,newsNumWords,twitterNumWords),
maxCharsLine = c(blogsMaxCharsLine,newsMaxCharsLine,twitterMaxCharsLine),
minCharsLine = c(blogsMinCharsLine,newsMinCharsLine,twitterMinCharsLine))
print(kable(summary))
## filename filesizeMB numLines numWords maxCharsLine minCharsLine
## --------- ----------- --------- --------- ------------- -------------
## blogs 200.4242 899288 37546246 40833 1
## news 196.2775 1010242 34762395 11384 1
## twitter 159.3641 2360148 30093410 140 2
As the files are large, we sample 3% of the lines from each file and combine them into a single test data set. This is to enable faster processing.
set.seed(80) #enable reproducibility
Tblogs <- blogs[sample(1:length(blogs),0.03*length(blogs))]
Tnews <- news[sample(1:length(news),0.03*length(news))]
Ttwitter <- twitter[sample(1:length(twitter),0.03*length(twitter))]
Testfile <- c(Tblogs,Tnews,Ttwitter)
writeLines(Testfile,"./data/TestFile.txt")
This results in a test file of 128089 lines and 3059625 words, with the longest line being 3985 characters and the shortest line being 2 characters.
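These figures can be reproduced directly from the combined sample with the same stringi and base functions used earlier; the following is a minimal sketch assuming the Testfile vector created above is still in memory.
# Summary statistics for the sampled test file
testNumLines <- length(Testfile)                  # number of lines
testNumWords <- sum(stri_count_words(Testfile))   # total word count
testMaxCharsLine <- max(nchar(Testfile))          # longest line in characters
testMinCharsLine <- min(nchar(Testfile))          # shortest line in characters
c(lines = testNumLines, words = testNumWords,
  maxChars = testMaxCharsLine, minChars = testMinCharsLine)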
We now clean the test file by converting all text to lower case, removing punctuation, removing numbers, stripping extra whitespace, removing profanity and removing common English stop words.
This is accomplished using the tm library.
Cleanfile <- Corpus(DirSource("./data"))
Cleanfile <- tm_map(Cleanfile,content_transformer(tolower))
Cleanfile <- tm_map(Cleanfile,removePunctuation)
Cleanfile <- tm_map(Cleanfile,removeNumbers)
Cleanfile <- tm_map(Cleanfile,stripWhitespace)
#Read in list of profanity words that we want to remove
profanity <- readLines("./download/profanity.txt",encoding='UTF-8',skipNul=TRUE)
Cleanfile <- tm_map(Cleanfile,removeWords, profanity)
Cleanfile <- tm_map(Cleanfile,removeWords, stopwords("english"))
We now use the RWeka library to tokenise the test file into 1-, 2-, 3- and 4-word sequences called N-grams. By counting the frequency of these word combinations, we obtain the basis for our prediction algorithm.
First we tokenise single words (Unigrams) and visualise the words with the highest frequency in the test file.
Unigram <- function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
UniDoc <- DocumentTermMatrix(Cleanfile,control=list(tokenize = Unigram))
UniDoc.matrix <- as.matrix(UniDoc)
frequency <- colSums(UniDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
UniGramFrequency <- data.frame(word=names(frequency),freq=frequency)
colspectrum <- brewer.pal(6, "Dark2")
wordcloud(names(frequency), frequency, max.words=50, rot.per=0.1, colors=colspectrum)
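For a quick numeric view alongside the word cloud, the most frequent single words can also be inspected directly from the UniGramFrequency data frame built above (a minimal sketch):
head(UniGramFrequency, 10)   # top 10 unigrams by frequency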
Here we tokenise word pairs (Bigrams) and create a chart of the highest frequency Bigrams.
Bigram <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
BiDoc <- DocumentTermMatrix(Cleanfile,control=list(tokenize = Bigram))
BiDoc.matrix <- as.matrix(BiDoc)
frequency <- colSums(BiDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
BiGramFrequency <- data.frame(word=names(frequency),freq=frequency)
BiGramFrequency %>%
#filter(freq > 750) %>%
ggplot(aes(word,freq)) +
geom_bar(stat="identity",colour="red",fill="blue") +
ggtitle("Bigrams with the highest frequencies") +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Here we tokenise word triplets (Trigrams) and create a chart of the highest frequency Trigrams.
Trigram <- function(x) NGramTokenizer(x,Weka_control(min=3,max=3))
TriDoc <- DocumentTermMatrix(Cleanfile,control=list(tokenize = Trigram))
TriDoc.matrix <- as.matrix(TriDoc)
frequency <- colSums(TriDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
TriGramFrequency <- data.frame(word=names(frequency),freq=frequency)
TriGramFrequency %>%
#filter(freq) %>%
ggplot(aes(word,freq)) +
geom_bar(stat="identity",colour="red",fill="blue") +
ggtitle("Trigrams with the highest frequencies") +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Here we tokenise word quadruplets (Quadrigrams) and create a chart of the highest frequency Quadrigrams.
Quadrigram <- function(x) NGramTokenizer(x,Weka_control(min=4,max=4))
QuadriDoc <- DocumentTermMatrix(Cleanfile,control=list(tokenize = Quadrigram))
QuadriDoc.matrix <- as.matrix(QuadriDoc)
frequency <- colSums(QuadriDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
QuadriGramFrequency <- data.frame(word=names(frequency),freq=frequency)
QuadriGramFrequency %>%
#filter(freq) %>%
ggplot(aes(word,freq)) +
geom_bar(stat="identity",colour="red",fill="blue") +
ggtitle("Quadrigrams with the highest frequencies") +
xlab("Quadrigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
It is clear that some more cleaning needs to be performed to account for repeated runs of the same word, for example "omg omg omg omg" above.
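One possible approach, shown below as a sketch and not yet part of the cleaning pipeline, is to collapse consecutive repeats of a word into a single occurrence before tokenising; the function name collapseRepeats is illustrative only.
# Collapse consecutive repeats of the same word into a single occurrence
collapseRepeats <- function(x) {
  gsub("\\b(\\w+)(\\s+\\1\\b)+", "\\1", x, perl = TRUE)
}
collapseRepeats("omg omg omg omg that was close")   # "omg that was close"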
The goal of the project is to build a Shiny app that accepts text entry and predicts the next word the user is likely to enter. The app will offer a short list of predicted words for the user to choose from. This can be accomplished by building a prediction model that uses the frequencies of the N-grams computed above (unigrams, bigrams, trigrams and even quadrigrams) to rank candidate next words.
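As a preview of how such a model might work, the following is a minimal sketch of a frequency-based back-off predictor. It assumes the UniGramFrequency, BiGramFrequency and TriGramFrequency data frames above are rebuilt without the head(frequency, 8) truncation so that they contain all observed N-grams, and the function name predictNextWord is purely illustrative. Note also that, because stop words were removed during cleaning, a production version would probably keep them so that natural phrases can be predicted.
# Simple frequency back-off: try trigrams, then bigrams, then top unigrams
predictNextWord <- function(phrase, n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  words <- words[words != ""]

  # Return up to n completions of N-grams that start with the given prefix
  lookup <- function(freqTable, prefix) {
    ngrams <- as.character(freqTable$word)
    key <- paste0(prefix, " ")
    hits <- ngrams[substr(ngrams, 1, nchar(key)) == key]
    head(substring(hits, nchar(key) + 1), n)
  }

  if (length(words) >= 2) {   # last two words -> trigram lookup
    pred <- lookup(TriGramFrequency, paste(tail(words, 2), collapse = " "))
    if (length(pred) > 0) return(pred)
  }
  if (length(words) >= 1) {   # back off to the last word -> bigram lookup
    pred <- lookup(BiGramFrequency, tail(words, 1))
    if (length(pred) > 0) return(pred)
  }
  head(as.character(UniGramFrequency$word), n)   # fall back to top unigrams
}

predictNextWord("happy mothers")   # e.g. might suggest "day" if that trigram occurs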