Objective

Many people spend a considerable amount of time typing on their mobile devices. The overarching objective of this exercise is to make typing on mobile devices easier by building a smart keyboard that can predict the next word a person will type. The immediate objective of this project is therefore to take a body of text from both formal and informal sources, clean the data, and build a predictive text algorithm that predicts the next word based on the previous one, two or three words.

Download Data

The data comes as a zip file, so we download it and unzip all the files.

library(downloader)

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download(url, dest = "dataset.zip", mode = "wb")

# create the target folder only if it does not already exist
if (!dir.exists("dataset")) dir.create("dataset")
unzip("dataset.zip", exdir = "dataset")

The files come in four folders, one for each of four languages: German, English, Finnish and Russian.
For each language, there are three files of text lines taken from blogs, news and Twitter.

Basic Summary of Data

We will focus only on the three English files. We use readLines to load the text lines into R and do a preliminary summary of each file.

# load the three English files into R
blogs   <- readLines("./dataset/final/en_US/en_US.blogs.txt", skipNul = TRUE, encoding = "UTF-8")
news    <- readLines("./dataset/final/en_US/en_US.news.txt", n = 77258, skipNul = TRUE, encoding = "UTF-8")
twitter <- readLines("./dataset/final/en_US/en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")

# number of lines and length of the longest line in each file
length(blogs); length(news); length(twitter)
max(nchar(blogs)); max(nchar(news)); max(nchar(twitter))

Observations:

  1. There are 899,288 lines in the blogs file, 77,259 lines in the news file and 2,360,148 lines in the Twitter file.
  2. The longest lines in the blogs, news and Twitter files have 40,833, 5,760 and 140 characters respectively.
  3. It is no surprise that Twitter lines max out at 140 characters; this is compensated for by a much larger number of lines.

Sampling of Data

As the text files are fairly large, we take a random sample of 3% of the text lines from each file.
Each sample is then split 60:20:20 into a train set, a development test (devtest) set and a test set.
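Below is a minimal sketch of how such a sample and split could be produced; the proportions match the report, while the function and variable names (sample_lines, split_60_20_20, blogs_s, blogs_split, etc.) are illustrative assumptions rather than the original code.

set.seed(12345)  # illustrative seed for reproducibility

# take a 3% random sample of lines from each source
sample_lines <- function(x, prop = 0.03) x[sample(length(x), round(prop * length(x)))]
blogs_s   <- sample_lines(blogs)
news_s    <- sample_lines(news)
twitter_s <- sample_lines(twitter)

# split each sample 60:20:20 into train / devtest / test
split_60_20_20 <- function(x) {
  grp <- sample(cut(seq_along(x), breaks = c(0, 0.6, 0.8, 1) * length(x),
                    labels = c("train", "devtest", "test")))
  split(x, grp)
}
blogs_split   <- split_60_20_20(blogs_s)
news_split    <- split_60_20_20(news_s)
twitter_split <- split_60_20_20(twitter_s)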

Summary of Train, DevTest and Test Sets

The sample sizes for each set, broken down by data source, are shown below.

##         train devtest  test sample
## blogs   16186    5396  5396  26978
## news     1390     464   463   2317
## twitter 42482   14161 14161  70804

Creating Corpus and Cleaning Data

We load the text mining package (tm) in R and create a corpus from the sampled data.

Data cleaning includes:
1. converting to lower case
2. removing punctuation
3. removing numbers
4. removing symbols (e.g. apostrophes)
5. stripping extra whitespace.

As the purpose is to predict the next word rather than to classify text,
1. we do not remove stopwords
2. we do not stem words.
A sketch of these cleaning steps is shown below.
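A minimal sketch of the corpus creation and cleaning steps with the tm package; it assumes the train portions from the sampling sketch above, and the object names are illustrative.

library(tm)

# one document per source, built from the train portions (illustrative names)
train_docs <- c(blogs   = paste(blogs_split$train,   collapse = " "),
                news    = paste(news_split$train,    collapse = " "),
                twitter = paste(twitter_split$train, collapse = " "))

corpus_data <- VCorpus(VectorSource(train_docs))
corpus_data <- tm_map(corpus_data, content_transformer(tolower))  # lower case
corpus_data <- tm_map(corpus_data, removePunctuation)             # punctuation and apostrophes
corpus_data <- tm_map(corpus_data, removeNumbers)                 # numbers
corpus_data <- tm_map(corpus_data, stripWhitespace)               # extra whitespace
# stopwords are kept and no stemming is applied, as noted above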

Make 1/2/3/4-gram Document-Term Matrices

Next, we create the N-gram tokens as document-term matrices using the RWeka package in R. The purpose is:
- to set the stage for the features to be used in our predictive text algorithm;
- to facilitate exploration of word frequencies later in this report.

library(RWeka)

# one tokenizer and document-term matrix per n-gram size (1 to 4)
tokenizer1g <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
dtm1g <- DocumentTermMatrix(corpus_data, control = list(tokenize = tokenizer1g))

tokenizer2g <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2g <- DocumentTermMatrix(corpus_data, control = list(tokenize = tokenizer2g))

tokenizer3g <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3g <- DocumentTermMatrix(corpus_data, control = list(tokenize = tokenizer3g))

tokenizer4g <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
dtm4g <- DocumentTermMatrix(corpus_data, control = list(tokenize = tokenizer4g))

Summary of Data Table

##         docLength totalWordTokens uniqueWords unique_div_total
## blogs       16186          511702       40324       0.07880368
## news         1390           44137        9658       0.21881868
## twitter     42482          406986       36801       0.09042326
## all 3       60058          962825       60392       0.06272376

Observation

  1. The unique vocabulary amounts to only about 6% of the total word tokens.
  2. The news sample uses a considerably richer vocabulary relative to its size (about 22% of its tokens are unique).

Tables for Word Frequencies

We explore the token frequencies using tables sorted in decreasing order of frequency for the 1/2/3/4-gram tokens. The table below shows, for each n, how many n-grams occur more than 5 times and what proportion of all n-grams that represents; a sketch of how these counts can be derived follows the table.

##   n-Gram num_nGram_freq>5 total_num_nGram  proportion
## 1 1-gram            11519           60392 0.190737184
## 2 2-gram            23454          477155 0.049153839
## 3 3-gram             9381          865249 0.010841966
## 4 4-gram             1336          967640 0.001380679
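For reference, here is a minimal sketch of how such frequency counts could be derived from the document-term matrices built earlier; the freq1 to freq4 names are illustrative.

# sorted term frequencies for each n-gram size
freq1 <- sort(colSums(as.matrix(dtm1g)), decreasing = TRUE)
freq2 <- sort(colSums(as.matrix(dtm2g)), decreasing = TRUE)
freq3 <- sort(colSums(as.matrix(dtm3g)), decreasing = TRUE)
freq4 <- sort(colSums(as.matrix(dtm4g)), decreasing = TRUE)

# number and proportion of n-grams that occur more than 5 times
counts_gt_5 <- c(sum(freq1 > 5), sum(freq2 > 5), sum(freq3 > 5), sum(freq4 > 5))
totals      <- c(length(freq1), length(freq2), length(freq3), length(freq4))
data.frame(nGram = paste0(1:4, "-gram"),
           num_nGram_freq_gt_5 = counts_gt_5,
           total_num_nGram = totals,
           proportion = counts_gt_5 / totals)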

Observation

  1. About 20% of the one-gram tokens have frequencies > 5.
  2. The proportion drops sharply to about 5% for the two-gram tokens.
  3. It is a mere 0.1% for the four-gram tokens.
  4. For n-grams with frequencies <= 5, we will therefore use the Good-Turing method to smooth the probabilities and to cater for never-seen-before n-grams (see the sketch below).
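As a rough illustration of the planned smoothing (not the final implementation), Good-Turing replaces a raw count c by an adjusted count c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of n-grams seen exactly c times. A sketch, assuming the freq1 vector from the frequency tables above:

# illustrative Good-Turing adjusted counts for low-frequency n-grams
good_turing_counts <- function(freqs, max_c = 5) {
  Nc <- table(freqs)                                  # N(c): frequency of frequencies
  sapply(setNames(1:max_c, 1:max_c), function(c) {
    n_c  <- Nc[as.character(c)]
    n_c1 <- Nc[as.character(c + 1)]
    if (is.na(n_c) || is.na(n_c1)) return(NA_real_)
    (c + 1) * as.numeric(n_c1) / as.numeric(n_c)      # c* = (c+1) * N(c+1) / N(c)
  })
}
good_turing_counts(freq1)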

Plot Word Frequencies

We do some exploratory data analysis by plotting the frequencies of the top 1/2/3/4-gram tokens; a sketch of one such plot is shown below.
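A minimal sketch of one such plot (the top 20 one-gram tokens), assuming the freq1 frequency vector sketched earlier:

library(ggplot2)

top20 <- data.frame(token = names(freq1)[1:20], freq = as.numeric(freq1[1:20]))
ggplot(top20, aes(x = reorder(token, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "1-gram token", y = "Frequency", title = "Top 20 one-gram tokens")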


Observations

  1. Top 20 one-gram tokens - frequencies are mostly in the 4,000-10,000 range, with 5 exceptions in the 10,000-50,000 range.
  2. Top 20 two-gram tokens - frequencies are mostly in the 1,000-1,600 range, with 6 exceptions in the 2,000-4,500 range.
  3. Top 20 three-gram tokens - frequencies are mostly in the 150-250 range, with 5 exceptions in the 350-400 range.
  4. Top 20 four-gram tokens - frequencies are mostly in the 40-100 range.

Planning Ahead

  1. Using the train set, a look-up table will be built to consolidate all the 1/2/3/4-gram tokens and their frequencies.
  2. Based on the look-up table, a prediction model will be written to predict the next word (a rough sketch of the intended look-up follows this list). The model will be tested using the devtest set.
  3. Further enhancements will then be added and tested on the final test set.
  4. A data product and presentation slides will be the final deliverables of this project.
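As a rough illustration of the intended look-up (not the final model), a simple back-off search over a consolidated n-gram table might look like the sketch below; ngram_table and its columns are hypothetical names.

# ngram_table: hypothetical data.frame with columns `prefix` (first n-1 words),
# `nextword` (last word) and `freq` (count), built from the 1/2/3/4-gram counts
predict_next <- function(phrase, ngram_table, max_prefix = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (n in seq(min(max_prefix, length(words)), 1)) {   # back off from the longest prefix
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- ngram_table[ngram_table$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$nextword[which.max(hits$freq)])
  }
  NA_character_                                         # no match: fall back to top unigrams
}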

End of Milestone Report