Data Science Capstone Interim Report

Introduction

Predictive keyboards such as SwiftKey are now common on modern mobile devices. The goal of this project is to develop a next-word prediction algorithm that strikes a good balance between three aspects: memory efficiency, speed and accuracy.

To train the algorithm, a corpus of text taken from blogs, news feeds and Twitter was used. While the corpus is available in 4 languages, this report examines only the English (US) data.

Data Pre-processing outside of R

Reading the news text into R is hampered by the presence of unrecognized symbols such as the right arrow (character code 26, which you may type using Alt + 26). Instances of the symbol were manually removed outside of R so that the full dataset could be used.
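
The manual removal could probably be avoided. The following is a minimal sketch, assuming the problem is that a text-mode read stops at character code 26 (hex 1A), as can happen on Windows. The temporary file and the names `tmp`, `linesRead` and `cleaned` are hypothetical and serve only to illustrate the idea; this is not the approach used for this report.

```r
# Hypothetical sample file containing the substitute character (code 26, hex 1A).
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nbefore \x1a after\nlast line\n"), tmp)

# Open the connection in binary mode ("rb") so that no byte is treated
# as an end-of-file marker, then read line by line as usual.
con <- file(tmp, open = "rb")
linesRead <- readLines(con)
close(con)

# Strip the symbol inside R instead of editing the file by hand.
cleaned <- gsub("\x1a", "", linesRead, fixed = TRUE)
length(cleaned)
```

Reading through a binary connection keeps the original file untouched, which makes the pre-processing reproducible from within R.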

Reading the Data

Reading the Twitter file produces warning messages. These do not prevent subsequent lines from being read into R, so they are ignored. The warning messages are not displayed for tidiness.

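# `directory` is assumed to hold the path to the data folder, ending in a path separator
# Each file is read into a character vector with one element per line
blogs <- readLines(paste0(directory, "en_US.blogs.txt"))
news <- readLines(paste0(directory, "en_US.news.txt"))
twitter <- readLines(paste0(directory, "en_US.twitter.txt"))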
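
The report does not state what the warnings were. If they concern embedded nul characters, which is an assumption here, the `skipNul` argument of `readLines` offers a way to read such a file without warnings. The sketch below uses a hypothetical temporary file rather than the Twitter data.

```r
# Hypothetical sample file with a nul byte embedded in the first line.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("good"), as.raw(0), charToRaw("morning\nsecond line\n")), tmp)

# skipNul = TRUE tells readLines to skip nul bytes instead of warning about them.
tweets <- readLines(tmp, skipNul = TRUE)
unlink(tmp)
length(tweets)
```

Wrapping the call in `suppressWarnings()` would also hide the messages, but it hides every warning, not only the expected ones.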

Data Overview

The in-memory size of each dataset and the number of lines it contains are shown below.

library(data.table)
datasets <- c("blogs", "news", "twitter")
# Look up each object by name and report its in-memory size and its line count
objSize <- sapply(datasets, function(x) format(object.size(get(x)), units = "Mb"))
lines <- sapply(datasets, function(x) length(get(x)))
overview <- data.table("Dataset" = datasets, "Object Size" = objSize, "Lines" = lines)
overview

##    Dataset Object Size   Lines
## 1:   blogs    248.5 Mb  899288
## 2:    news    249.6 Mb 1010242
## 3: twitter    301.4 Mb 2360148

Data Cleaning

The following steps were taken to clean each dataset:
1. Removal of characters other than whitespace, letters, digits and punctuation
2. Removal of the profanities in an external list
3. Removal of numbers, punctuation and website addresses
4. Conversion of words to lowercase
5. Splitting of the text into words on spaces and counting of word frequencies
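
The five steps listed above can be sketched in base R as follows. This is an illustration only, not the code used for this report: the function name `cleanLines`, the regular expressions and the sample input are assumptions, and profanities are filtered after splitting into words rather than as the second step, to keep the sketch short.

```r
cleanLines <- function(lines, profanities = character(0)) {
  # Step 1: keep only whitespace, letters, digits and punctuation.
  lines <- gsub("[^[:space:][:alnum:][:punct:]]", "", lines)
  # Step 3: drop website addresses first, then numbers and punctuation.
  # Replacing with a space is a simplification: "don't" becomes "don" and "t".
  lines <- gsub("(https?://|www\\.)[^[:space:]]+", " ", lines)
  lines <- gsub("[[:digit:][:punct:]]+", " ", lines)
  # Step 4: convert to lowercase.
  lines <- tolower(lines)
  # Step 5: split on whitespace and discard empty strings.
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]
  # Step 2 (applied late in this sketch): remove listed profanities.
  words <- words[!words %in% profanities]
  # Count how often each remaining word occurs.
  counts <- table(words)
  data.frame(Word = names(counts), Count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Hypothetical input; "time" stands in for a word on the profanity list.
sampleLines <- c("Visit http://example.com NOW!!", "Now 123 is the time, now.")
unigrams <- cleanLines(sampleLines, profanities = "time")
unigrams
```

The result has the same `Word` and `Count` columns as the unigram tables loaded below, so per-source tables built this way could be combined in the same manner.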

The data cleaning code is omitted here for brevity. Instead, the cleaned datasets are loaded and combined for exploratory data analysis.

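# readRDS() expects files written by saveRDS(), whatever their extension
blogs <- readRDS(paste0(directory, "Blogs Unigrams.RData"))
news <- readRDS(paste0(directory, "News Unigrams.RData"))
twitter <- readRDS(paste0(directory, "Twitter Unigrams.RData"))
# Stack the three unigram tables, then sum the counts of words shared between sources
data <- rbindlist(list(blogs, news, twitter))
data <- data[, list("Count" = sum(Count)), by = Word]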

Exploratory Data Analysis

A brief summary of the cleaned dataset is shown below.

summary(data)

##      Word               Count        
##  Length:577831      Min.   :      1  
##  Class :character   1st Qu.:      1  
##  Mode  :character   Median :      1  
##                     Mean   :    174  
##                     3rd Qu.:      4  
##                     Max.   :4759890

With the data cleaned, word clouds are used to visualize the most frequent terms in the entire dataset. Histograms are impractical here because the cleaned dataset contains 577,831 unique words.

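library(wordcloud)
library(RColorBrewer)  # provides brewer.pal()
palette <- brewer.pal(9, "Set1")
# Plot up to 1000 words that occur at least twice, most frequent first
wordcloud(data[["Word"]], data[["Count"]], scale = c(10, .5), min.freq = 2, max.words = 1000,
          random.order = FALSE, random.color = FALSE, colors = palette)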

The word cloud is dominated by stop words. The words in a stop word list are removed and the word cloud is generated again.

stopWords <- readLines(paste0(directory, "Stop Word List.txt"))
# Key by Word so that the anti-join below drops every row whose Word is a stop word
setkey(data, "Word")
dataWithoutStopWords <- data[!.(stopWords)]
# Re-key the full table by Count, which sorts it by ascending count
setkey(data, "Count")
wordcloud(dataWithoutStopWords[["Word"]],
          dataWithoutStopWords[["Count"]], scale = c(4, .5), min.freq = 2, max.words = 500,
          random.order = FALSE, random.color = FALSE, colors = palette)

N-grams and Future Development

The word clouds shown are word clouds of uni-grams, with the relative frequency of each word serving as an estimate of its probability. However, a next-word prediction model built on uni-grams alone is highly ineffective: it ignores the preceding words and always predicts the most frequent word.
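
This limitation is easy to demonstrate. The sketch below uses a small hypothetical frequency table (the words and counts are made up) and a hypothetical helper `predictUnigram`. Because the helper never looks at its `context` argument, every input yields the same prediction.

```r
# Hypothetical unigram counts, in the same Word/Count layout as the cleaned data.
toyUnigrams <- data.frame(Word = c("the", "to", "and", "cat"),
                          Count = c(50, 30, 20, 5),
                          stringsAsFactors = FALSE)

# Relative frequency as the probability estimate of each word.
toyUnigrams$Probability <- toyUnigrams$Count / sum(toyUnigrams$Count)

# A unigram model has no use for the context: it returns the most probable word.
predictUnigram <- function(context) {
  toyUnigrams$Word[which.max(toyUnigrams$Probability)]
}

firstGuess <- predictUnigram("i would like")
secondGuess <- predictUnigram("the cat sat on")
identical(firstGuess, secondGuess)
```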

The intention is to incorporate higher-order n-grams (bi-grams, tri-grams) in future development to achieve better next-word predictions. However, many valid higher-order n-grams will never be observed in the training data, and a model based on raw counts would assign them zero probability. Dealing with this requires more advanced statistical methods, such as modified Kneser-Ney smoothing.
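
To give a sense of how smoothing handles unobserved n-grams, the sketch below implements basic interpolated Kneser-Ney smoothing for bi-grams with a single fixed discount. It is not the modified variant, which uses separate discounts for counts of one, two, and three or more. The toy corpus, the discount of 0.75 and the function name `knBigram` are assumptions made for illustration.

```r
# Hypothetical toy corpus; each element is one already cleaned sentence.
corpus <- c("the cat sat", "the cat ran", "the dog sat", "a dog ran")
tokens <- strsplit(corpus, " ", fixed = TRUE)

# Collect every pair of adjacent words (bi-gram) within each sentence.
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  cbind(first = head(w, -1), second = tail(w, -1))
}))
distinctBigrams <- unique(bigrams)

knBigram <- function(prev, word, discount = 0.75) {
  inContext <- bigrams[, "first"] == prev
  contextTotal <- sum(inContext)  # how often prev is followed by any word
  if (contextTotal == 0) stop("context was never observed")
  pairCount <- sum(inContext & bigrams[, "second"] == word)
  # Number of distinct words seen after prev; scales the redistributed mass.
  followers <- length(unique(bigrams[inContext, "second"]))
  # Continuation probability: share of distinct bi-gram types that end in word.
  continuation <- sum(distinctBigrams[, "second"] == word) / nrow(distinctBigrams)
  max(pairCount - discount, 0) / contextTotal +
    (discount * followers / contextTotal) * continuation
}

# "the sat" never occurs in the corpus, yet it receives a non-zero probability.
unseenProbability <- knBigram("the", "sat")
# The probabilities over all possible next words still sum to one.
totalProbability <- sum(sapply(c("cat", "dog", "sat", "ran"),
                               function(w) knBigram("the", w)))
```

The discount taken from each observed bi-gram is what funds the probability of the unobserved ones, and the continuation probability favours words that follow many different contexts.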

Conclusion

The corpus was read into R and cleaned to generate word clouds of uni-grams. However, relying on uni-grams alone for next-word prediction is highly ineffective. For future development, higher-order n-grams and more advanced statistical methods will be incorporated to improve the prediction algorithm. It remains to be seen what the optimal order of n-gram should be and whether the training data should be further augmented with additional corpora.