Downloading and Loading Data

# Download and unzip the dataset (run once; commented out here)
#download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip","Coursera-SwiftKey.zip")
#unzip("Coursera-SwiftKey.zip")

# Paths to the English-language files
blog <- "final/en_US/en_US.blogs.txt"
news <- "final/en_US/en_US.news.txt"
twitter <- "final/en_US/en_US.twitter.txt"

Summary Statistics

Here we compute the following summary statistics for each file: the file size, the number of lines, and the length of the longest line.

File size of each document:

library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.22.0 (2018-04-21) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## R.utils v2.7.0 successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
file.info(blog)$size
## [1] 210160014
file.info(news)$size
## [1] 205811889
file.info(twitter)$size
## [1] 167105338
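
For readability, the byte counts above can also be expressed in megabytes (a small sketch using the same file paths):

# Sizes in megabytes (1 MB = 1024^2 bytes): roughly 200, 196 and 159 MB
round(file.info(c(blog, news, twitter))$size / 1024^2, 1)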

Number of lines in each document:

lblog <- countLines(blog)
lblog
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE
lnews <- countLines(news)
lnews
## [1] 1010242
## attr(,"lastLineHasNewline")
## [1] TRUE
ltwitter <- countLines(twitter)
ltwitter
## [1] 2360148
## attr(,"lastLineHasNewline")
## [1] TRUE

Length of the longest line in each document:

library(stringi)

max(stri_length(readLines(blog)))
## [1] 40835
max(stri_length(readLines(news)))
## [1] 5760
max(stri_length(readLines(twitter)))
## [1] 213

Sampling

From the summary statistics we can see that the files are very large, so we will work with a sample of the data. We take a sample from each file, combine the samples, clean them, and finally create a corpus.

library(openNLP)
library(tm) 
## Loading required package: NLP
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## 
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:R.utils':
## 
##     validate
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:base':
## 
##     Filter
library(RWeka)

# Read roughly the first 0.1% of lines from each file as the sample
DataBlog <- readLines(blog, n = lblog/1000, encoding = "latin1")
writeLines(DataBlog, con = "samp_blog.txt", "\n")

DataNews <- readLines(news, n = lnews/1000, encoding = "latin1")
writeLines(DataNews, con = "samp_news.txt", "\n")

DataTwitter <- readLines(twitter, n = ltwitter/1000, encoding = "latin1")
writeLines(DataTwitter, con = "samp_twitter.txt", "\n")

# Combine the three samples into one character vector, then split it into sentences
combi <- c(DataBlog, DataNews, DataTwitter)
combi <- sent_detect(combi, language = "en", model = NULL)
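
Note that readLines with a line count takes the first ~0.1% of lines of each file rather than a random sample. If a random sample is preferred, one alternative is sketched below (not used here; it reads the whole file into memory first, and the names all_blog and DataBlogRand are illustrative only):

# Alternative sketch: draw a random 0.1% of the blog lines instead of the first lines
set.seed(1234)                                  # for reproducibility
all_blog <- readLines(blog, encoding = "latin1")
DataBlogRand <- sample(all_blog, length(all_blog) %/% 1000)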

Cleaning

Now that we have combined the samples, we need to clean the data. This includes removing apostrophes, numbers, URLs, punctuation, non-ASCII characters and extra whitespace, converting the text to lower case, and removing English stop words and profanity:

removeURL <- function(x) gsub("http[^[:space:]]*","",x)            # strip URLs (the whole link, not just "http")
removeSign <- function(x) gsub("[[:punct:]]","",x)                 # strip punctuation
removeNum <- function(x) gsub("[[:digit:]]","",x)                  # strip digits
removeapo <- function(x) gsub("'","",x)                            # strip apostrophes
removeNonASCII <- function(x) iconv(x, "latin1", "ASCII", sub="")  # drop non-ASCII characters
toLowerCase <- function(x) sapply(x,tolower)                       # convert to lower case
removeSpace <- function(x) gsub("\\s+"," ",x)                      # collapse repeated whitespace
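
As a quick check, the helpers can be chained on a made-up string (illustration only; s is not used elsewhere):

# Illustration: apply a few of the helpers to a toy string
s <- "Check out http://example.com, it's GREAT!!  100%"
removeSpace(removeSign(removeNum(removeURL(removeapo(s)))))
# roughly: "Check out its GREAT"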

samp_data <- VCorpus(VectorSource(combi))
samp_data<-tm_map(samp_data,content_transformer(removeapo))
samp_data<-tm_map(samp_data,content_transformer(removeNum))
samp_data<-tm_map(samp_data,content_transformer(removeURL))
samp_data<-tm_map(samp_data,content_transformer(removeSign))
samp_data<-tm_map(samp_data,content_transformer(removeNonASCII))
samp_data<-tm_map(samp_data,content_transformer(toLowerCase))
samp_data<-tm_map(samp_data,content_transformer(removeSpace))
samp_data<-tm_map(samp_data,removeWords,stopwords("english"))

# Remove profanity using a word list (one word per line in bad_words.txt);
# removeWords expects a plain character vector, not a VectorSource
profane <- readLines("bad_words.txt")
samp_data <- tm_map(samp_data, removeWords, profane)

Term Document Matrix

Now that we have cleaned the data, we convert the sample into a Term Document Matrix (TDM), a matrix that describes the frequency of the terms that occur in a collection of documents. From the TDM we then build N-grams.

The N-gram representation of a text lists all N-tuples of words that appear in it. The simplest case is the unigram, which is based on individual words, followed by the bigram (all pairs of consecutive words), the trigram, and so on; a short example follows.
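
As a quick illustration, using the same RWeka tokenizer that is applied below, the bigrams of a toy sentence are (output shown as a comment):

# Bigrams of a toy sentence via RWeka's NGramTokenizer
NGramTokenizer("the quick brown fox", Weka_control(min=2,max=2))
# e.g. "the quick"  "quick brown"  "brown fox"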

library(wordcloud)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:qdapRegex':
## 
##     %+%
## The following object is masked from 'package:NLP':
## 
##     annotate
tdm<- TermDocumentMatrix(samp_data)
tdm <- removeSparseTerms(tdm, 0.999)
matr <- as.data.frame(as.matrix(tdm))
matr_s <- sort(rowSums(matr),decreasing=TRUE)
matr_d <- data.frame(word = names(matr_s),freq=matr_s)
wordcloud(matr_d$word, matr_d$freq, random.order=FALSE,  colors=brewer.pal(8, "Dark2"),min.freq=15)

unig <- function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
big <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
trig <- function(x) NGramTokenizer(x,Weka_control(min=3,max=3))

uni_tdm <- TermDocumentMatrix(samp_data, control=list(tokenize=unig))
bi_tdm <- TermDocumentMatrix(samp_data, control=list(tokenize=big))
tri_tdm <- TermDocumentMatrix(samp_data, control=list(tokenize=trig))

uni <- removeSparseTerms(uni_tdm, 0.999)
freq <- sort(rowSums(as.matrix(uni)),decreasing=TRUE)
uni <- data.frame(word = names(freq),freq=freq)

bi <- removeSparseTerms(bi_tdm, 0.999)
freq <- sort(rowSums(as.matrix(bi)),decreasing=TRUE)
bi <- data.frame(word = names(freq),freq=freq)

tri <- removeSparseTerms(tri_tdm, 0.999)
freq <- sort(rowSums(as.matrix(tri)),decreasing=TRUE)
tri <- data.frame(word = names(freq),freq=freq)

Unigram

ggplot(data=uni[1:20,],aes(x=reorder(word, -freq),y=freq)) + geom_bar(stat="identity")

Bigram

ggplot(data=bi[1:20,],aes(x=reorder(word, -freq),y=freq)) + geom_bar(stat="identity")
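
Trigram

The trigram frequencies in tri were computed above in the same way; a matching plot uses the same approach as the unigram and bigram charts:

ggplot(data=tri[1:20,],aes(x=reorder(word, -freq),y=freq)) + geom_bar(stat="identity")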

Conclusion

In the next steps, we will build a Shiny app that offers the user multiple suggestions for the next word, based on a prediction algorithm built on the n-gram frequencies explored here.
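
As a rough illustration of the idea, the bigram table bi built above can already be used to look up likely next words. The sketch below is only illustrative (predict_next is a hypothetical helper, not the app's final algorithm):

# Hypothetical sketch: suggest the most frequent next words for a given word,
# using the bigram frequency table 'bi' computed above
predict_next <- function(last_word, n = 3) {
  pattern <- paste0("^", last_word, " ")
  words <- as.character(bi$word)
  hits <- words[grepl(pattern, words)]   # bigrams that start with last_word
  head(sub(pattern, "", hits), n)        # drop the leading word, keep the top n
}
predict_next("right")   # suggestions depend on the sampled corpus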