Many people spend a considerable amount of time typing on their mobile devices. The overarching objective of this exercise is to make typing on mobile devices easier by providing a smart keyboard that can predict the next word a person will type. The immediate objective of this project is therefore to take a body of text from both formal and informal sources, clean the data and build a predictive text algorithm that predicts the next word based on the previous 1, 2 or 3 words.
The data comes as a zip file, so we download it and unzip the files.
library(downloader)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download(url, destfile = "dataset.zip", mode = "wb")
if (!dir.exists("dataset")) dir.create("dataset")
unzip("dataset.zip", exdir = "dataset")
The files come in four folders, one for each of the four languages: German, English, Finnish and Russian.
For each language, there are three files of text lines drawn from blogs, news and Twitter.
We will focus only on the three English files. We use readLines to load the text lines into R and produce a preliminary summary of the three files.
blogs <- readLines("./dataset/final/en_US/en_US.blogs.txt", skipNul = TRUE, encoding = "UTF-8")
# only the first 77,258 lines of the news file are read in
news <- readLines("./dataset/final/en_US/en_US.news.txt", n = 77258, skipNul = TRUE, encoding = "UTF-8")
twitter <- readLines("./dataset/final/en_US/en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
length(blogs); length(news); length(twitter)
max(nchar(blogs)); max(nchar(news)); max(nchar(twitter))
As the text files are fairly large, we randomly sample 3% of the text lines from each file.
Each sample is then split 60:20:20 into a training set, a development test set and a test set.
The sample sizes for each set and each type of data are shown below; a sketch of the sampling code follows the table.
## train devtest test sample
## blogs 16186 5396 5396 26978
## news 1390 464 463 2317
## twitter 42482 14161 14161 70804
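The sampling code itself is not shown above. A minimal sketch of one way to do it is given here, assuming the blogs, news and twitter character vectors from readLines; the function name, the seed and the exact rounding are illustrative, so the resulting set sizes may differ slightly from the table.
set.seed(1234)                                      # illustrative seed for reproducibility
sample_split <- function(x, frac = 0.03) {
  s <- sample(x, size = floor(frac * length(x)))    # 3% random sample of lines
  n <- length(s)
  n_train <- floor(0.6 * n)                         # 60% training
  n_dev   <- floor(0.2 * n)                         # 20% development test
  idx <- sample(n)                                  # shuffle the sampled lines
  list(train   = s[idx[1:n_train]],
       devtest = s[idx[(n_train + 1):(n_train + n_dev)]],
       test    = s[idx[(n_train + n_dev + 1):n]])   # remaining ~20% for the test set
}
blogs_split   <- sample_split(blogs)
news_split    <- sample_split(news)
twitter_split <- sample_split(twitter)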
We load the text mining package tm in R and create a corpus using the sample data; a sketch of the corpus creation and cleaning code is shown after the two lists below.
Data cleaning includes:
1. changing to lower case
2. removing punctuation
3. removing numbers
4. removing symbols (e.g. apostrophes)
5. removing extra white space.
As the purpose is not text classification but prediction of the next word,
1. we do not remove stopwords
2. we do not apply stemming.
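The corpus creation and cleaning code is sketched below. It assumes the corpus is built from the training portions produced by the sampling sketch above (train_sample is an illustrative name); corpus_data is the object used in the tokenization code that follows.
library(tm)
# Combine the three training samples into one character vector
train_sample <- c(blogs_split$train, news_split$train, twitter_split$train)
# Build the corpus and apply the cleaning steps listed above
corpus_data <- VCorpus(VectorSource(train_sample))
corpus_data <- tm_map(corpus_data, content_transformer(tolower))   # 1. lower case
corpus_data <- tm_map(corpus_data, removePunctuation)              # 2. punctuation (incl. apostrophes)
corpus_data <- tm_map(corpus_data, removeNumbers)                  # 3. numbers
removeSymbols <- content_transformer(function(x) gsub("[^[:alnum:][:space:]]", " ", x))
corpus_data <- tm_map(corpus_data, removeSymbols)                  # 4. remaining symbols
corpus_data <- tm_map(corpus_data, stripWhitespace)                # 5. extra white space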
Next, we create the n-gram tokens as document-term matrices, using the NGramTokenizer from the RWeka package. The purpose is:
- to set the stage for the features to be used in our predictive text algorithm;
- to facilitate exploration of word frequencies later in this report.
library(RWeka)
# Build 1-, 2-, 3- and 4-gram document-term matrices from the cleaned corpus
tokenizer1g <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
dtm1g <- DocumentTermMatrix(corpus_data, control = list(tokenize = tokenizer1g))
tokenizer2g <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2g <- DocumentTermMatrix(corpus_data, control = list(tokenize = tokenizer2g))
tokenizer3g <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3g <- DocumentTermMatrix(corpus_data, control = list(tokenize = tokenizer3g))
tokenizer4g <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
dtm4g <- DocumentTermMatrix(corpus_data, control = list(tokenize = tokenizer4g))
The table below summarizes, for each source, the number of lines (documents) in the corpus, the total number of word tokens, the number of unique words and the ratio of unique to total words:
## docLength totalWordTokens uniqueWords unique_div_total
## blogs 16186 511702 40324 0.07880368
## news 1390 44137 9658 0.21881868
## twitter 42482 406986 36801 0.09042326
## all 3 60058 962825 60392 0.06272376
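These statistics were computed from the document-term matrices; a minimal sketch for the combined (all 3) row, based on the unigram matrix dtm1g, is shown below (the per-source rows would additionally require tracking which lines came from which file).
library(slam)
token_totals <- col_sums(dtm1g)       # frequency of each unique word
total_tokens <- sum(token_totals)     # total word tokens
unique_words <- length(token_totals)  # number of distinct words
unique_words / total_tokens           # unique_div_total (lexical diversity)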
We explore the frequencies using tables sorted in decreasing order for the 1/2/3/4-gram tokens. In the table below, the second column counts the distinct n-grams that occur more than five times, and the proportion column expresses this as a fraction of all distinct n-grams; a sketch of how such a table can be built follows.
## n-Gram num_nGram_freq>5 total_num_nGram proportion
## 1 1-gram 11519 60392 0.190737184
## 2 2-gram 23454 477155 0.049153839
## 3 3-gram 9381 865249 0.010841966
## 4 4-gram 1336 967640 0.001380679
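A minimal sketch of how the sorted frequency vectors and the summary table above can be built from the document-term matrices (the helper and object names are illustrative):
library(slam)
ngram_freq <- function(dtm) sort(col_sums(dtm), decreasing = TRUE)  # sorted n-gram counts
freq1g <- ngram_freq(dtm1g)
freq2g <- ngram_freq(dtm2g)
freq3g <- ngram_freq(dtm3g)
freq4g <- ngram_freq(dtm4g)
freqs <- list(freq1g, freq2g, freq3g, freq4g)
summary_tbl <- data.frame(
  nGram           = c("1-gram", "2-gram", "3-gram", "4-gram"),
  num_freq_gt5    = sapply(freqs, function(f) sum(f > 5)),   # distinct n-grams occurring > 5 times
  total_num_nGram = sapply(freqs, length))                   # all distinct n-grams
summary_tbl$proportion <- summary_tbl$num_freq_gt5 / summary_tbl$total_num_nGram
summary_tbl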
We make some exploratory plots of the frequencies of the top 1/2/3/4-gram tokens; a sketch of the plotting code is shown below.
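A minimal sketch for the top-20 unigram bar chart, assuming the freq1g vector from the sketch above (the 2-, 3- and 4-gram plots follow the same pattern):
library(ggplot2)
top20 <- data.frame(term = names(freq1g)[1:20], freq = freq1g[1:20])
ggplot(top20, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "1-gram token", y = "Frequency", title = "Top 20 1-gram tokens")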
#### Observations
1. Top 20 1-gram tokens - frequencies are mostly in the 4000-10000 range, with 5 exceptions in the 10000-50000 range.
2. Top 20 2-gram tokens - frequencies are mostly in the 1000-1600 range, with 6 exceptions in the 2000-4500 range.
3. Top 20 3-gram tokens - frequencies are mostly in the 150-250 range, with 5 exceptions in the 350-400 range.
4. Top 20 4-gram tokens - frequencies are mostly in the 40-100 range.