Adrian Lim 18 March 2016
This report details the current progress of the capstone project, which is to design and implement a Shiny application for text prediction using the HC Corpora dataset. In particular, we show the data cleansing steps, the exploratory data analysis and the preliminary findings so far, and outline the plan for the remainder of the project.
Load the required R libraries.
library(tm)
library(RWeka)
library(wordcloud)
library(stringi)
library(R.utils)
library(dplyr)
library(ggplot2)
library(knitr)
The data is downloaded from the Capstone Dataset.
The data comes from the HC Corpora corpus, which is further described in the HC Corpora README. The zipped file is downloaded into the current working directory for further processing.
There are three files in the English directory of the dataset, namely the blogs, news and twitter data, which we will use for this project.
#file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#destination <- "./download/Coursera-Swiftkey.zip"
#download.file(file,destination)
#unzip("./download/Coursera-SwiftKey.zip")
blogs <- readLines("./en_US.blogs.txt",encoding='UTF-8',skipNul=TRUE)
news <- readLines("./en_US.news.txt",encoding='UTF-8',skipNul=TRUE)
twitter <- readLines("./en_US.twitter.txt",encoding='UTF-8',skipNul=TRUE)
We now gather basic file information for each of the three files: the file size, the number of lines, the number of words, and the maximum/minimum number of characters per line.
blogsFileSize <- file.info("./en_US.blogs.txt")$size/(1024*1024)
newsFileSize <- file.info("./en_US.news.txt")$size/(1024*1024)
twitterFileSize <- file.info("./en_US.twitter.txt")$size/(1024*1024)
blogsNumWords <- sum(stri_count_words(blogs))
newsNumWords <- sum(stri_count_words(news))
twitterNumWords <- sum(stri_count_words(twitter))
blogsMaxCharsLine <- max(nchar(blogs))
newsMaxCharsLine <- max(nchar(news))
twitterMaxCharsLine <- max(nchar(twitter))
blogsMinCharsLine <- min(nchar(blogs))
newsMinCharsLine <- min(nchar(news))
twitterMinCharsLine <- min(nchar(twitter))
summary <- data.frame(filename = c("blogs","news","twitter"),
filesizeMB = c(blogsFileSize, newsFileSize, twitterFileSize),
numLines = c(length(blogs),length(news),length(twitter)),
numWords = c(blogsNumWords,newsNumWords,twitterNumWords),
maxCharsLine = c(blogsMaxCharsLine,newsMaxCharsLine,twitterMaxCharsLine),
minCharsLine = c(blogsMinCharsLine,newsMinCharsLine,twitterMinCharsLine))
print(kable(summary))
## filename filesizeMB numLines numWords maxCharsLine minCharsLine
## --------- ----------- --------- --------- ------------- -------------
## blogs 200.4242 899288 37546246 40833 1
## news 196.2775 1010242 34762395 11384 1
## twitter 159.3641 2360148 30093410 140 2
As the files are large, we sample 3% of the lines from each file and combine them into a single test data set. This is to enable faster processing.
set.seed(80) #enable reproducibility
Tblogs <- blogs[sample(1:length(blogs),0.03*length(blogs))]
Tnews <- news[sample(1:length(news),0.03*length(news))]
Ttwitter <- twitter[sample(1:length(twitter),0.03*length(twitter))]
Testfile <- c(Tblogs,Tnews,Ttwitter)
writeLines(Testfile,"./data/TestFile.txt")
This results in a test file of 128089 lines and 3059625 words, with the longest line being 3985 characters and the shortest line being 2 characters.
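These figures can be reproduced directly from the combined sample with the same stringi and base functions used earlier; the following is a minimal sketch assuming the Testfile vector created above is still in memory.
# Summary statistics for the sampled test file
testNumLines <- length(Testfile)                  # number of lines
testNumWords <- sum(stri_count_words(Testfile))   # total word count
testMaxCharsLine <- max(nchar(Testfile))          # longest line in characters
testMinCharsLine <- min(nchar(Testfile))          # shortest line in characters
c(lines = testNumLines, words = testNumWords,
  maxChars = testMaxCharsLine, minChars = testMinCharsLine)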
We now clean the test file by converting all text to lower case, removing punctuation, removing numbers, stripping extra whitespace, removing profanity and removing common English stop words.
This is accomplished using the tm library.
Cleanfile <- Corpus(DirSource("./data"))
Cleanfile <- tm_map(Cleanfile,content_transformer(tolower))
Cleanfile <- tm_map(Cleanfile,removePunctuation)
Cleanfile <- tm_map(Cleanfile,removeNumbers)
Cleanfile <- tm_map(Cleanfile,stripWhitespace)
#Read in list of profanity words that we want to remove
profanity <- readLines("./download/profanity.txt",encoding='UTF-8',skipNul=TRUE)
Cleanfile <- tm_map(Cleanfile,removeWords, profanity)
Cleanfile <- tm_map(Cleanfile,removeWords, stopwords("english"))
We now use the RWeka library to tokenise the test file into 1-, 2-, 3- and 4-word sequences called N-grams. By counting the frequency of these word combinations, we obtain the basis for our prediction algorithm.
First we tokenise single words (Unigrams) and visualise the words with the highest frequency in the test file.
Unigram <- function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
UniDoc <- DocumentTermMatrix(Cleanfile,control=list(tokenize = Unigram))
UniDoc.matrix <- as.matrix(UniDoc)
frequency <- colSums(UniDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
UniGramFrequency <- data.frame(word=names(frequency),freq=frequency)
colspectrum <- brewer.pal(6, "Dark2")
wordcloud(names(frequency), frequency, max.words=50, rot.per=0.1, colors=colspectrum)
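For a quick numeric view alongside the word cloud, the most frequent single words can also be inspected directly from the UniGramFrequency data frame built above (a minimal sketch):
head(UniGramFrequency, 10)   # top 10 unigrams by frequency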
Here we tokenise word pairs (Bigrams) and create a chart of the highest frequency Bigrams.
Bigram <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
BiDoc <- DocumentTermMatrix(Cleanfile,control=list(tokenize = Bigram))
BiDoc.matrix <- as.matrix(BiDoc)
frequency <- colSums(BiDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
BiGramFrequency <- data.frame(word=names(frequency),freq=frequency)
BiGramFrequency %>%
#filter(freq > 750) %>%
ggplot(aes(word,freq)) +
geom_bar(stat="identity",colour="red",fill="blue") +
ggtitle("Bigrams with the highest frequencies") +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Here we tokenise word triplets (Trigrams) and create a chart of the highest frequency Trigrams.
Trigram <- function(x) NGramTokenizer(x,Weka_control(min=3,max=3))
TriDoc <- DocumentTermMatrix(Cleanfile,control=list(tokenize = Trigram))
TriDoc.matrix <- as.matrix(TriDoc)
frequency <- colSums(TriDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
TriGramFrequency <- data.frame(word=names(frequency),freq=frequency)
TriGramFrequency %>%
#filter(freq) %>%
ggplot(aes(word,freq)) +
geom_bar(stat="identity",colour="red",fill="blue") +
ggtitle("Trigrams with the highest frequencies") +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Here we tokenise word quadruplets (Quadrigrams) and create a chart of the highest frequency Quadrigrams.
Quadrigram <- function(x) NGramTokenizer(x,Weka_control(min=4,max=4))
QuadriDoc <- DocumentTermMatrix(Cleanfile,control=list(tokenize = Quadrigram))
QuadriDoc.matrix <- as.matrix(QuadriDoc)
frequency <- colSums(QuadriDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
QuadriGramFrequency <- data.frame(word=names(frequency),freq=frequency)
QuadriGramFrequency %>%
#filter(freq) %>%
ggplot(aes(word,freq)) +
geom_bar(stat="identity",colour="red",fill="blue") +
ggtitle("Quadrigrams with the highest frequencies") +
xlab("Quadrigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
It is clear that some more cleaning needs to be performed to account for repeated runs of the same word, for example "omg omg omg omg" above.
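One possible approach, shown below as a sketch and not yet part of the cleaning pipeline, is to collapse consecutive repeats of a word into a single occurrence before tokenising; the function name collapseRepeats is illustrative only.
# Collapse consecutive repeats of the same word into a single occurrence
collapseRepeats <- function(x) {
  gsub("\\b(\\w+)(\\s+\\1\\b)+", "\\1", x, perl = TRUE)
}
collapseRepeats("omg omg omg omg that was close")   # "omg that was close"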
The goal of the project is to build a Shiny app that accepts text entry and predicts the next word the user is likely to enter. The app will offer a short list of predicted words for the user to choose from. This can be accomplished by building a prediction model that uses the frequencies of the N-grams computed above (unigrams, bigrams, trigrams and even quadrigrams) to rank candidate next words.
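As a preview of how such a model might work, the following is a minimal sketch of a frequency-based back-off predictor. It assumes the UniGramFrequency, BiGramFrequency and TriGramFrequency data frames above are rebuilt without the head(frequency, 8) truncation so that they contain all observed N-grams, and the function name predictNextWord is purely illustrative. Note also that, because stop words were removed during cleaning, a production version would probably keep them so that natural phrases can be predicted.
# Simple frequency back-off: try trigrams, then bigrams, then top unigrams
predictNextWord <- function(phrase, n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  words <- words[words != ""]

  # Return up to n completions of N-grams that start with the given prefix
  lookup <- function(freqTable, prefix) {
    ngrams <- as.character(freqTable$word)
    key <- paste0(prefix, " ")
    hits <- ngrams[substr(ngrams, 1, nchar(key)) == key]
    head(substring(hits, nchar(key) + 1), n)
  }

  if (length(words) >= 2) {   # last two words -> trigram lookup
    pred <- lookup(TriGramFrequency, paste(tail(words, 2), collapse = " "))
    if (length(pred) > 0) return(pred)
  }
  if (length(words) >= 1) {   # back off to the last word -> bigram lookup
    pred <- lookup(BiGramFrequency, tail(words, 1))
    if (length(pred) > 0) return(pred)
  }
  head(as.character(UniGramFrequency$word), n)   # fall back to top unigrams
}

predictNextWord("happy mothers")   # e.g. might suggest "day" if that trigram occurs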