Introduction

This Coursera Capstone project is an application of data science in the area of natural language processing. The project aims to produce a Shiny application for word prediction, similar to the SwiftKey mobile keyboard.

In this milestone, the following tasks are completed:

  1. Downloading the dataset and loading it for processing

  2. Exploratory data analysis, including basic summaries and other findings

  3. Planning for the Shiny application

Data Loading & Basic Statistics

Downloading Dataset

# download the dataset
url <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
saveAsFilename <- 'Coursera-SwiftKey.zip'
download.file(url, saveAsFilename, mode='wb')

# extract downloaded zip 
unzip(saveAsFilename)

# list the downloaded dataset
datasetPath <- '.'
list.files(datasetPath, all.files=TRUE, full.names=TRUE, recursive=TRUE, include.dirs=TRUE)
##  [1] "Coursera-SwiftKey.zip"         "final"                        
##  [3] "final/de_DE"                   "final/de_DE/de_DE.blogs.txt"  
##  [5] "final/de_DE/de_DE.news.txt"    "final/de_DE/de_DE.twitter.txt"
##  [7] "final/en_US"                   "final/en_US/en_US.blogs.txt"  
##  [9] "final/en_US/en_US.news.txt"    "final/en_US/en_US.twitter.txt"
## [11] "final/fi_FI"                   "final/fi_FI/fi_FI.blogs.txt"  
## [13] "final/fi_FI/fi_FI.news.txt"    "final/fi_FI/fi_FI.twitter.txt"
## [15] "final/ru_RU"                   "final/ru_RU/ru_RU.blogs.txt"  
## [17] "final/ru_RU/ru_RU.news.txt"    "final/ru_RU/ru_RU.twitter.txt"

Basic Statistics

Our focus is on the English (en_US) dataset:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt
# get the list of files
datasetPath <- paste0(datasetPath,'/final/en_US/')
files <- list.files(datasetPath)
files
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
# get basic statistics
basic_stats <- data.frame('filename'=character(0), 'filesize'=integer(0), 'linecount'=integer(0), 'charcount'=integer(0))

for(f in files) {
  # shell commands for the basic statistics: file size, line count, character count
  cmd_filesize  <- paste('ls -alb ', datasetPath, f, " | awk '{ print $5 }'", sep='')
  cmd_linecount <- paste('wc -l '  , datasetPath, f, " | awk '{ print $1 }'", sep='')
  cmd_charcount <- paste('wc -c '  , datasetPath, f, " | awk '{ print $1 }'", sep='')
  
  # run the commands and capture their output
  filesize  <- system(command=cmd_filesize , intern=TRUE)
  linecount <- system(command=cmd_linecount, intern=TRUE)
  charcount <- system(command=cmd_charcount, intern=TRUE)
  
  # store the statistics
  basic_stats <- rbind(basic_stats, data.frame('filename'=as.character(f),
                                               'filesize'=as.integer(filesize),
                                               'linecount'=as.integer(linecount),
                                               'charcount'=as.integer(charcount)))
}

basic_stats
##            filename  filesize linecount charcount
## 1   en_US.blogs.txt 210160014    899288 210160014
## 2    en_US.news.txt 205811889   1010242 205811889
## 3 en_US.twitter.txt 167105338   2360148 167105338

Sampling ‘EN’ Dataset

Since the files in the dataset are large, lines are sampled from each file for exploratory purposes. Adjacent lines may contain very similar words, as they may be continuations of the same text, and so may not reflect a fair distribution of words. Hence, lines are sampled randomly.

en_txt <- ''
for(i in 1:nrow(basic_stats)) {
  # randomly sample 15000 line numbers from each file
  set.seed(basic_stats[i,'filesize'])
  sample_n_rows <- sample(basic_stats[i,'linecount'], 15000)
  
  # open a connection to the file
  con <- file(description=paste0(datasetPath, basic_stats[i,'filename']), open='r')
  
  # run through the file line by line
  for(j in 1:basic_stats[i,'linecount']) {
    tmp <- readLines(con, n=1)
    
    # if the current line is one of the sampled lines, append it to the sampled text
    if(j %in% sample_n_rows) en_txt <- paste(en_txt, tmp)
  }
  
  # close the connection
  close(con)
}
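
Since the line-by-line sampling above is slow, the sampled text can optionally be cached to disk and reloaded in later sessions; the file name below is arbitrary, not part of the provided dataset.

# cache the sampled text so the sampling loop need not be re-run
writeLines(en_txt, 'en_US.sample.txt')

# reload in a later session
# en_txt <- paste(readLines('en_US.sample.txt'), collapse=' ')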

Exploratory Analysis

In this exploratory analysis, the objective is to come up with a preliminary model for predicting the next word. The approach used here is an n-gram model.

First, we convert the raw EN text into a ‘corpus’, i.e. structured text suitable for analysis. We do this using the tm library.

Second, we clean the corpus by making adjustments to the text: removing extra whitespace, removing punctuation, removing numbers, converting to lower case, and removing stop words and bad words (words that we would not like to predict); a sketch of the bad-word step is shown after the cleaning code.

Third, we perform stemming, which truncates inflected words, e.g. ‘calculate’, ‘calculates’ and ‘calculating’ all become ‘calcul’. This step is important to reduce the number of near-duplicate words.
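
A quick illustration of the stemmer’s behaviour, using SnowballC (the stemmer that tm’s stemDocument typically relies on); this snippet is only a demonstration and is not part of the processing pipeline:

library(SnowballC)

# all three variants collapse to the same stem
wordStem(c('calculate', 'calculates', 'calculating'), language='english')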

After the first three steps above, we have a clean corpus from which to build an initial model.

Corpus Definition and Data Cleaning

library(tm)

# define corpus from the sampled dataset
en_corpus <- VCorpus(VectorSource(en_txt))

# cleaning the data:
# 1. remove extra whitespace
en_corpus <- tm_map(en_corpus, stripWhitespace)
# 2. remove punctuation
en_corpus <- tm_map(en_corpus, removePunctuation)
# 3. remove numbers
en_corpus <- tm_map(en_corpus, removeNumbers)
# 4. convert to lower case
en_corpus <- tm_map(en_corpus, content_transformer(tolower))
# 5. remove English stop words
en_corpus <- tm_map(en_corpus, removeWords, stopwords('english'))

# stem documents
en_corpus <- tm_map(en_corpus, stemDocument)
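
The bad-word removal mentioned earlier is not applied in the chunk above. A minimal sketch is shown below, assuming a plain-text profanity list badwords.txt with one word per line (a hypothetical file, not part of the provided dataset):

# remove bad words, read from a hypothetical one-word-per-line list
badwords  <- readLines('badwords.txt')
en_corpus <- tm_map(en_corpus, removeWords, badwords)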

n-Gram

The fourth step is to build the n-gram model using the RWeka library. The n-gram model is generated by tokenizing the corpus, i.e. extracting groups of n consecutive words from the text. For this exploratory analysis, we generate 1-, 2-, 3- and 4-grams and test all four for prediction accuracy.

Fifth, we generate a Document-Term Matrix (DTM), i.e. a matrix that lists each n-gram and its frequency in each document (in this case, we only have one document, generated by sampling all three source files).

Sixth, we remove the terms that occur very infrequently (i.e. sparse terms). After this step, the DTM is clean and ready to be used as the source for prediction.

library(RWeka)

# function to generate n-gram token
nGramToken <- function(c, n) NGramTokenizer(c, Weka_control(min=n, max=n))
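
# quick sanity check of the tokenizer on a toy sentence (illustrative only);
# with n = 2 this should return 'this is', 'is a', 'a simple', 'simple example'
nGramToken('this is a simple example', 2)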

# function to generate Document Term Matrix (DTM)
dtm <- function(c, n) DocumentTermMatrix(c, 
                        control=list(tokenize=function(x) nGramToken(x,n)))

# generate DTMs for 1-, 2-, 3- and 4-gram tokens
unigram.dtm   <- removeSparseTerms(dtm(en_corpus, 1), 0.6)
bigram.dtm    <- removeSparseTerms(dtm(en_corpus, 2), 0.6)
trigram.dtm   <- removeSparseTerms(dtm(en_corpus, 3), 0.6)
quadigram.dtm <- removeSparseTerms(dtm(en_corpus, 4), 0.6)


# function to transform DTM to term-count dataframe
getTermCount <- function(dtm) {
  # count term matrix
  dtm.count.matrix <- colSums(as.matrix(dtm))
  
  # create frequency dataframe of term-count
  term.count <- data.frame(term=names(dtm.count.matrix), count=dtm.count.matrix)
  
  # return ordered term.count dataframe
  return( term.count[with(term.count, order(-count)), ] )
}

# convert DTM to term-count dataframe 
unigram.df   <- getTermCount(unigram.dtm);   rownames(unigram.df)   <- NULL
bigram.df    <- getTermCount(bigram.dtm);    rownames(bigram.df)    <- NULL
trigram.df   <- getTermCount(trigram.dtm);   rownames(trigram.df)   <- NULL
quadigram.df <- getTermCount(quadigram.dtm); rownames(quadigram.df) <- NULL

Seventh, for easier processing, each DTM is converted (in the chunk above) to a simpler data frame containing each n-gram term and its count.
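
For a quick sanity check, the top rows of these tables can be inspected, for example:

# peek at the most frequent unigrams and 4-grams
head(unigram.df, 10)
head(quadigram.df, 10)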

Visualization

Here we plot the data frames generated in the seventh step above.

library(ggplot2)

# compute the length of each word/term
unigram.df$length <- nchar(as.character(unigram.df$term))

# plot the distribution of word/term lengths
ggplot(unigram.df, aes(x=length)) +
  geom_histogram(binwidth=1) +
  ggtitle('Distribution of Word Length')

# function to plot distribution of n-gram frequency
barplotNGram <- function(ngram.df, title) {
  ggplot(ngram.df, aes(x=reorder(term, -count), y=count)) +
    geom_bar(stat='identity') +
    theme(axis.title.x = element_blank(),
          axis.text.x  = element_text(angle=45, hjust=1)) +
    ggtitle(title)  
}

# plot n-gram frequency distribution
barplotNGram(unigram.df[1:30, ], 'Most Frequent 1-gram Distribution')

barplotNGram(bigram.df [1:30, ], 'Most Frequent 2-gram Distribution')

barplotNGram(trigram.df[1:30, ], 'Most Frequent 3-gram Distribution')

barplotNGram(quadigram.df[1:30, ], 'Most Frequent 4-gram Distribution')

Word Cloud Visualization

library(wordcloud)
## Loading required package: RColorBrewer

par(mfrow = c(2,2))
palette <- brewer.pal(5, 'Dark2') # wordcloud colour theme

wordcloud(unigram.df$term, unigram.df$count, min.freq=1, max.words=100, colors=palette)
text(x=0.5, y=0, '1-gram Word Cloud')

wordcloud(bigram.df$term, bigram.df$count, min.freq=1, max.words=100, colors=palette)
text(x=0.5, y=0, '2-gram Word Cloud')

wordcloud(trigram.df$term, trigram.df$count, min.freq=1, max.words=100, colors=palette)
text(x=0.5, y=0, '3-gram Word Cloud')

wordcloud(quadigram.df$term, quadigram.df$count, min.freq=1, max.words=100, colors=palette)
text(x=0.5, y=0, '4-gram Word Cloud')

Further Plan for Shiny App

From the basic n-gram model above, I have the following plan for the final project delivery (a rough prediction sketch follows the list):

  1. Only the English language will be supported for this project.

  2. To use the whole dataset to improve the n-gram model.

  3. To improve the algorithm to deal with spelling errors, both in the dataset and in the prediction input.

  4. To generate X-grams, i.e. modelling based not just on n consecutive words, but also on word tuples within one sentence.

  5. To aim for acceptable performance, based on prediction timing on a system comparable to a mobile phone.

  6. To present testing performance in the final report, using a test dataset of size equivalent to the one provided.
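
As a rough illustration of how these n-gram tables could eventually drive the Shiny app, the sketch below looks up the last few words of a phrase in the 4-, 3- and 2-gram tables and backs off to a shorter n-gram when no match is found. The function name, the regex matching and the simple backoff order are assumptions for illustration, not the final algorithm, and the input phrase is only roughly cleaned compared with the corpus (no stop-word removal or stemming):

# hypothetical next-word lookup via simple backoff over the n-gram tables
predictNextWord <- function(phrase, n=3) {
  # roughly mirror the corpus cleaning: drop punctuation/digits, lower case
  words <- unlist(strsplit(tolower(gsub('[[:punct:][:digit:]]', '', phrase)), '\\s+'))
  words <- words[words != '']
  
  # n-gram tables paired with the number of context words each needs
  tables  <- list(quadigram.df, trigram.df, bigram.df)
  context <- c(3, 2, 1)
  
  for(i in seq_along(tables)) {
    k <- context[i]
    if(length(words) < k) next
    
    # keep only n-grams that start with the last k words of the phrase
    prefix <- paste(tail(words, k), collapse=' ')
    hits   <- tables[[i]][grepl(paste0('^', prefix, ' '), tables[[i]]$term), ]
    
    # return the last word of the most frequent matches
    if(nrow(hits) > 0) return(sub('.* ', '', head(as.character(hits$term), n)))
  }
  
  # no match at any order: fall back to the most frequent unigrams
  as.character(head(unigram.df$term, n))
}

# example usage (results depend on the sampled data)
# predictNextWord('thanks for the')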