Executive Summary

This report gives a first insight into the progress of the project to build a word prediction algorithm from the three text files provided by SwiftKey.

Data acquisition and cleaning (First look at the data)

After downloading the files, I read the English files into RStudio to perform an initial analysis of the raw data. The questions I want to answer in this first look at the data are:

  1. How big are the files?
  2. How many lines of text does each file contain?
  3. How many words are in each text file?

By answering these questions we gain a better perspective on what kind of resources are available to work out a successful prediction model.

library(tm)
Loading required package: NLP
library(RWeka)
library(ggplot2)

Attaching package: 'ggplot2'

The following object is masked from 'package:NLP':

    annotate
library(qdap)
Loading required package: qdapDictionaries
Loading required package: qdapRegex
Loading required package: qdapTools
Loading required package: RColorBrewer

Attaching package: 'qdap'

The following objects are masked from 'package:tm':

    as.DocumentTermMatrix, as.TermDocumentMatrix

The following object is masked from 'package:base':

    Filter
library(stringi)
# create a destination file for the download
swiftkey_zip <- "Coursera-SwiftKey.zip"

# download the compressed files from online source
source <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(source, swiftkey_zip)

# extract the files from the zip file
unzip(swiftkey_zip)

We will work with the English text files first.

con1 <- file("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.blogs.txt", "r")
en_US_blogs <- readLines(con1, encoding="UTF-8")
close(con1)
con2 <- file("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.news.txt", "r")
en_US_news <- readLines(con2, encoding="UTF-8")
close(con2)
con3 <- file("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.twitter.txt", "r")
en_US_twitter <- readLines(con3, encoding="UTF-8")
close(con3)

Now we can work out the answers to the three questions above.

num_words_blogs <- stri_count_words(en_US_blogs)
num_words_news <- stri_count_words(en_US_news)
num_words_twitter <- stri_count_words(en_US_twitter)
file_size_blogs <- file.info("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.blogs.txt")$size/1024^2
file_size_news <- file.info("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.news.txt")$size/1024^2
file_size_twitter <- file.info("~/Documents/Coursera/Data Science/Data Science Capstone/final/en_US/en_US.twitter.txt")$size/1024^2
summary_table <- data.frame(filename = c("blogs","news","twitter"),
                            file_size_MB = c(file_size_blogs, file_size_news, file_size_twitter),
                            num_lines = c(length(en_US_blogs),length(en_US_news),length(en_US_twitter)),
                            num_words = c(sum(num_words_blogs),sum(num_words_news),sum(num_words_twitter)))
summary_table
  filename file_size_MB num_lines num_words
1    blogs     200.4242    899288  37541795
2     news     196.2775   1010242  34762303
3  twitter     159.3641   2360148  30092866

As part of cleaning the data, I will create a smaller training sample from each file, which makes applying the different tools for cleaning and analysing the data much easier and faster. From statistical inference we know that a relatively small random sample is enough to make reliable statements about the full data set.

set.seed(1234)
blogs_train <- sample(en_US_blogs, length(en_US_blogs)*0.01)
news_train <- sample(en_US_news, length(en_US_news)*0.01)
twitter_train <- sample(en_US_twitter, length(en_US_twitter)*0.01)

I combine these samples into a single training set for further processing.

train <- c(blogs_train, news_train, twitter_train) # concatenate the three samples into one character vector

The next step is the so-called ‘Tokenization’, which means splitting the text into its basic units (tokens) such as words, punctuation and numbers. For this purpose I first subdivide the text into single sentences and then create a function which removes numbers, extra whitespace and punctuation, and transforms the whole text to lower case characters.

train <- sent_detect(train, language = "en", model = NULL) # subdivide into sentences

# Helper functions for tokenization and cleaning
repl_patt <- content_transformer(function(x, pattern) gsub(pattern, " ", x)) # replace a pattern with a space
preprocessCorpus <- function(corpus){
    corpus <- tm_map(corpus, repl_patt, "/|@|\\|")          # replace slashes, @ and pipe characters
    corpus <- tm_map(corpus, content_transformer(tolower))  # convert to lower case
    corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
    corpus <- tm_map(corpus, stripWhitespace)               # collapse extra whitespace
    corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
    return(corpus)
}

# resulting text corpus
train.corpus <- VCorpus(VectorSource(train))
train.corpus <- preprocessCorpus(train.corpus)

I also want to remove profane words. I use a modified list of profane (“bad”) words for this filtering.

bad_words <- readLines("bad_words.txt")
Warning in readLines("bad_words.txt"): incomplete final line found on
'bad_words.txt'
train.corpus <- tm_map(train.corpus, removeWords, bad_words)

Exploratory Analysis

To be able to make assertions about the quantitative properties of the data, I have to convert the now clean text data into a data frame that allows counting words and word combinations (n-grams). For this project we will look at unigrams (single words), bigrams (combinations of two words) and trigrams (combinations of three words) and their frequencies.

# create a data frame from the clean data
training <- data.frame(text=unlist(sapply(train.corpus, `[`, "content")), stringsAsFactors=F)

# create n-grams
unigrams <- NGramTokenizer(training$text, Weka_control(min = 1, max = 1))
bigrams <- NGramTokenizer(training$text, Weka_control(min = 2, max = 2))
trigrams <- NGramTokenizer(training$text, Weka_control(min = 3, max = 3))

# create data frames from n-grams
uni <- data.frame(table(unigrams))
bi <- data.frame(table(bigrams))
tri <- data.frame(table(trigrams))

# sorting the resulting data frames
uni_sorted <- uni[order(uni$Freq,decreasing = TRUE),]
bi_sorted <- bi[order(bi$Freq,decreasing = TRUE),]
tri_sorted <- tri[order(tri$Freq,decreasing = TRUE),]

Now we can take a look at the most frequent words and word combinations. The following plots show the top 20 occurrences for each n-gram.

# First plot for Unigrams
uni_top20 <- uni_sorted[1:20,]
colnames(uni_top20) <- c("Word","Frequency")

ggplot(uni_top20, aes(x=reorder(Word, Frequency), y=Frequency)) + geom_bar(stat="identity", fill="darkblue") + coord_flip() + theme(axis.title.y = element_blank()) + geom_text(aes(label=Frequency), hjust=-0.2) + ggtitle("Top 20 Words")

# Second plot for Bigrams
bi_top20 <- bi_sorted[1:20,]
colnames(bi_top20) <- c("Word","Frequency")

ggplot(bi_top20, aes(x=reorder(Word, Frequency), y=Frequency)) + geom_bar(stat="identity", fill="blue") + coord_flip() + theme(axis.title.y = element_blank()) + geom_text(aes(label=Frequency), hjust=-0.2) + ggtitle("Top 20 Bigrams")

# Third plot for Trigrams
tri_top20 <- tri_sorted[1:20,]
colnames(tri_top20) <- c("Word","Frequency")

ggplot(tri_top20, aes(x=reorder(Word, Frequency), y=Frequency)) + geom_bar(stat="identity", fill="lightblue") + coord_flip() + theme(axis.title.y = element_blank()) + geom_text(aes(label=Frequency), hjust=-0.2) + ggtitle("Top 20 Trigrams")

Outlook

For the prediction model and the subsequent application we can now take the n-grams we created and start testing different prediction algorithms based on the probability of the next word. The most promising predictions should come from the trigrams or from a combination of n-grams, and I also plan to implement ‘smoothing’ methods, such as Kneser-Ney smoothing, which can make the predictions more precise.
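
To illustrate the direction I have in mind, below is a minimal sketch of a backoff-style next-word lookup built on the sorted n-gram tables created above (uni_sorted, bi_sorted, tri_sorted). The function name predict_next_word and the simple frequency-based ranking are placeholder choices of mine, not the final algorithm; the actual model will use proper probabilities and smoothing instead of raw counts.

# Hypothetical sketch: look up the most frequent continuations of the last one or
# two words of a phrase, backing off to shorter n-grams when no match is found.
predict_next_word <- function(phrase, n = 3) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    len <- length(words)

    # try trigrams: match the last two words against the start of each trigram
    if (len >= 2) {
        prefix <- paste(words[len - 1], words[len])
        hits <- tri_sorted[grepl(paste0("^", prefix, " "), tri_sorted$trigrams), ]
        if (nrow(hits) > 0) {
            return(head(sapply(strsplit(as.character(hits$trigrams), " "), `[`, 3), n))
        }
    }

    # back off to bigrams: match only the last word
    if (len >= 1) {
        hits <- bi_sorted[grepl(paste0("^", words[len], " "), bi_sorted$bigrams), ]
        if (nrow(hits) > 0) {
            return(head(sapply(strsplit(as.character(hits$bigrams), " "), `[`, 2), n))
        }
    }

    # final fallback: the overall most frequent single words
    head(as.character(uni_sorted$unigrams), n)
}

# example call (assumes the n-gram tables from above are still in the workspace)
predict_next_word("thanks for the")

This sketch only ranks candidates by raw frequency; replacing those counts with smoothed conditional probabilities is exactly where methods like Kneser-Ney smoothing will come in.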