Introduction

This is the milestone report for the SwiftKey next-word prediction task, which is based on NLP. The task is to create a next-word prediction algorithm that helps mobile users type messages faster by recommending the most likely next word based on the text the user has already typed.

About the report

This milestone report covers the basic preparation of the raw data, preliminary statistics and visualizations, and plans for the algorithm and the application, to give readers an overview of the project.

Packages used in the report

We use the following packages in this text mining report:

library(knitr)
options(java.parameters = "-Xmx4g" )
library(RWeka)
library(tm)
## Loading required package: NLP
library(stringr)

Exploratory Analysis

Data capture

The first step in the analysis is to gather the data. We download the data set and unzip it. The following code was used to capture the data.

# url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download.file(url = url,method = "curl",destfile = "Coursera-SwiftKey.zip")
# unzip(zipfile = "Coursera-SwiftKey.zip")

About the data

The SwiftKey task data consists of tweets, blog posts, and news in 4 languages: English, German, Russian, and Finnish. There are 12 large files in total. For this milestone report we look only at the English-language files.

English-language text files: size in MB, number of lines, and number of words

kable(read.csv("file_size.csv",sep = ";"))
File                      Size    Lines     WordCount
en_US/en_US.blogs.txt     200MB   899288    37334690
en_US/en_US.news.txt      196MB   1010242   34372720
en_US/en_US.twitter.txt   159MB   2360148   30374206
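
The report reads these figures from a prepared CSV file and does not show how they were computed. As a rough sketch only, statistics of this kind could be gathered in base R as follows. The helper name `file_stats` and the whitespace-based word count are assumptions for illustration, so the numbers it gives need not match the table exactly.

```r
# Hypothetical helper (not from the report): summarise one text file.
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  trimmed <- trimws(lines)
  # Count whitespace-separated tokens on each line (assumed word definition).
  words <- lengths(strsplit(trimmed, "[[:space:]]+"))
  words[!nzchar(trimmed)] <- 0L          # empty lines contain no words
  data.frame(File = basename(path),
             SizeMB = round(file.size(path) / 1024^2, 1),
             Lines = length(lines),
             WordCount = sum(words))
}

# Demo on a small temporary file so the sketch runs without the data set.
tmp <- tempfile(fileext = ".txt")
writeLines(c("hello world", "", "one two three"), tmp)
stats <- file_stats(tmp)
print(stats)
```

For the real data, the same helper would be applied to each of the three files under `final/en_US/` and the rows combined with `rbind()`.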

Next, we load the data. Because the files are large, we read only the first 3000 lines of each file and merge them.

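n <- 3000
# Read the first n lines of each file. Passing the path directly lets
# readLines() open and close the connection itself.
eng_tw <- list(readLines("final/en_US/en_US.twitter.txt", n, encoding = "UTF-8", warn = FALSE))
eng_news <- list(readLines("final/en_US/en_US.news.txt", n, encoding = "UTF-8", warn = FALSE))
eng_blogs <- list(readLines("final/en_US/en_US.blogs.txt", n, encoding = "UTF-8", warn = FALSE))

# Each source is wrapped in a list, so paste() returns a single string
# and the corpus built from it contains one document.
eng_txt <- paste(eng_tw, eng_news, eng_blogs)
# Note: save() writes R's binary format regardless of the .txt extension.
save("eng_txt", file = "eng_txt.txt")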

Creating corpus

Before the corpus can be used, the text documents must be prepared. For this we apply several standard procedures: converting the text to lower case, removing numbers, removing punctuation, and stripping extra whitespace.

The result is a corpus of texts that will serve as the basis of the application.

corpus <- VCorpus(VectorSource(eng_txt))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers) 
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
writeCorpus(corpus,filenames = "corpus.txt")
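
To make the effect of these cleaning steps concrete, the following sketch applies comparable transformations to a single sample string using base R only. The function name `clean_text`, the sample sentence, and the regular expressions are assumptions chosen to approximate the `tm` transformations; they are not the `tm` implementation itself.

```r
# Base-R approximation of the cleaning steps (assumed regular expressions).
clean_text <- function(x) {
  x <- tolower(x)                      # convert to lower case
  x <- gsub("[[:digit:]]+", "", x)     # remove numbers
  x <- gsub("[[:punct:]]+", "", x)     # remove punctuation, including apostrophes
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  x
}

sample_line <- "I don't know:  3 of the 10 Tweets were re-posted!"
cleaned <- clean_text(sample_line)
print(cleaned)
```

Removing punctuation also removes apostrophes and hyphens, so "don't" becomes "dont" and "re-posted" becomes "reposted". This is consistent with entries such as "i dont know" in the trigram table later in the report.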

Creating the Document-Term Matrix

dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 1, terms: 26378)>>
## Non-/sparse entries: 26378/0
## Sparsity           : 0%
## Maximal term length: 95
## Weighting          : term frequency (tf)

N-grams - Unigrams, Bigrams and Trigrams

We create n-grams from the corpus tokens as input for the next-word prediction algorithm. N-grams are sequences of n consecutive words that occur in the text.
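
As a small illustration of the idea, independent of RWeka, the sketch below builds bigrams and trigrams from one short sentence in base R and tabulates their frequencies. The helper `make_ngrams` and the example sentence are hypothetical and serve only to show what an n-gram is.

```r
# Hypothetical helper: build word n-grams from one already-cleaned string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "one of the best days of the year"
bigrams  <- make_ngrams(sentence, 2)
trigrams <- make_ngrams(sentence, 3)

# Frequency table sorted from most to least frequent.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(bigram_freq)
```

In this sentence the bigram "of the" occurs twice and every other bigram once, which is the same kind of count the frequency tables in this section report for the full corpus.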

First, we create unigram tokens and show the top of the frequency table.

token1 <- NGramTokenizer(corpus, Weka_control(min=1, max=1))
token1 <- data.frame(table(token1))
token1 <- token1[order(token1$Freq,decreasing=TRUE),]
colnames(token1) <- c("Word", "Freq")
token1[1:15,]
##       Word  Freq
## 23715  the 13094
## 24078   to  7208
## 875    and  6853
## 18       a  6316
## 16377   of  5591
## 11687   in  4379
## 11456    i  3958
## 23699 that  2886
## 9042   for  2784
## 12257   is  2724
## 12298   it  2383
## 16489   on  2059
## 26222 with  1993
## 26601  you  1986
## 25746  was  1727

Histogram of Unigrams

par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(token1[1:40,]$Freq,
        col = "blue",
        main = "Most frequent Unigram words",
        ylab = "Word frequencies",
        names.arg = token1$Word[1:40])

Second, we create bigram tokens and show the top of the frequency table.

token2 <- NGramTokenizer(corpus, Weka_control(min=2, max=2))
token2 <- data.frame(table(token2))
token2 <- token2[order(token2$Freq,decreasing=TRUE),]
colnames(token2) <- c("Word", "Freq")
token2[1:15,]
##            Word Freq
## 91486    of the 1195
## 65402    in the 1133
## 137945   to the  582
## 49170   for the  503
## 93198    on the  491
## 136356    to be  453
## 10605   and the  386
## 14822    at the  344
## 64329      in a  334
## 151015 with the  309
## 69217    it was  286
## 50932  from the  267
## 89610      of a  247
## 9200      and i  237
## 67577      is a  233

Histogram of most frequent Bigrams

par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(token2[1:40,]$Freq,
        col = "blue",
        main = "Most frequent 2-gram words",
        ylab = "Word frequencies",
        names.arg = token2$Word[1:40])

Third, we create 3-gram tokens and show the top of the frequency table.

token3 <- NGramTokenizer(corpus, Weka_control(min=3, max=3))
token3 <- data.frame(table(token3))
token3 <- token3[order(token3$Freq,decreasing=TRUE),]
colnames(token3) <- c("Word", "Freq")
token3[1:15,]
##               Word Freq
## 142331  one of the   98
## 2663      a lot of   96
## 76921  going to be   44
## 105297    it was a   41
## 93099    i want to   40
## 189573  the end of   40
## 148027 part of the   38
## 91851     i have a   37
## 23525   as well as   34
## 91445  i dont know   34
## 205606     to be a   34
## 27999   be able to   32
## 145951  out of the   31
## 935    a couple of   30
## 91930    i have to   30

Histogram of 3-gram word frequencies

par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(token3[1:40,]$Freq,
        col = "blue",
        main = "Most frequent 3-gram words",
        ylab = "Word frequencies",
        names.arg = token3$Word[1:40])

Plan for Word Prediction

For the next steps we plan the following improvements to the current setup:
- Review the corpus and the document-term matrix, removing sparse terms and profanity
- Select documents at random and use a larger number of them, so that the corpus represents the texts better
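
The two planned improvements could be sketched roughly as follows. The helper names `sample_lines` and `remove_blocked`, the seed value, and the two-word blocklist are all placeholders for illustration; a real profanity list and sample size would still need to be chosen.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
sample_lines <- function(lines, size, seed = 42) {
  set.seed(seed)
  lines[sample.int(length(lines), min(size, length(lines)))]
}

# Hypothetical helper: delete every blocklisted word from each line.
# Assumes the blocklist holds plain words without regex metacharacters.
remove_blocked <- function(lines, blocklist) {
  pattern <- paste0("\\b(", paste(blocklist, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", lines, ignore.case = TRUE, perl = TRUE)
  trimws(gsub("[[:space:]]+", " ", cleaned))
}

lines <- c("this is fine", "what a darn mess", "heck no", "all good here")
blocklist <- c("darn", "heck")   # placeholder words, not a real profanity list

sampled  <- sample_lines(lines, size = 3)
filtered <- remove_blocked(lines, blocklist)
print(filtered)
```

On the real data, `lines` would come from `readLines()` on each source file, so that the sample is drawn from the whole file rather than from its first rows.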

For the Shiny application, we plan to reduce the document-term matrix to a size small enough to give acceptable performance in the app. We will also look for an appropriate model for the next-word prediction algorithm.
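
The prediction model has not been chosen yet. One simple candidate is a frequency lookup that backs off from trigrams to bigrams, sketched below under clearly stated assumptions. The toy tables mimic the `Word`/`Freq` layout of the frequency tables above but contain made-up counts, and the helpers `best_continuation` and `predict_next` are hypothetical. This is a plain lookup without smoothing or discounting, not a finished model.

```r
# Toy frequency tables in the Word/Freq layout; the counts are made up.
bigram_tbl <- data.frame(
  Word = c("of the", "of a", "to be", "to the"),
  Freq = c(10, 4, 9, 7),
  stringsAsFactors = FALSE)
trigram_tbl <- data.frame(
  Word = c("one of the", "one of a", "going to be"),
  Freq = c(5, 1, 3),
  stringsAsFactors = FALSE)

# Hypothetical helper: most frequent continuation of `prefix` in one table.
best_continuation <- function(prefix, tbl) {
  key   <- paste0(prefix, " ")
  words <- as.character(tbl$Word)        # Word may be stored as a factor
  hit   <- startsWith(words, key)
  if (!any(hit)) return(NA_character_)
  top <- words[hit][which.max(tbl$Freq[hit])]
  substring(top, nchar(key) + 1)         # keep only the predicted word
}

# Try the last two words against the trigrams, then back off to the
# last word against the bigrams.
predict_next <- function(text, bigram_tbl, trigram_tbl) {
  words <- strsplit(tolower(trimws(text)), "[[:space:]]+")[[1]]
  n <- length(words)
  if (n >= 2) {
    guess <- best_continuation(paste(words[(n - 1):n], collapse = " "), trigram_tbl)
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) return(best_continuation(words[n], bigram_tbl))
  NA_character_
}

print(predict_next("He is one of", bigram_tbl, trigram_tbl))
print(predict_next("I want to", bigram_tbl, trigram_tbl))
```

The first call is answered from the trigram table, while the second finds no trigram starting with "want to" and falls back to the bigram table. Keeping only the most frequent continuations per prefix would be one way to keep the tables small enough for the Shiny application.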