1. Introduction

The main objective of this project is to create an algorithm that predicts the next word based on the previous words typed by a user. We use Natural Language Processing techniques to build the prediction and utilise data sets from a corpus collection called HC Corpora, which contains material published from 2005 up to the date the corpora were compiled. The data sets we use for our prediction are:

- twitter data: pulled from Twitter by a web crawler
- news data: pulled from online news platforms by a web crawler
- blogs data: pulled from blogs by a web crawler

In this document we show some exploratory analysis and graphical representations of the words in the texts. We also describe the Katz back-off algorithm that we will use for prediction; the algorithm is based on the frequency of n-grams.

2. Data acquisition and cleaning

We pull our data sets from the Coursera SwiftKey download link (see the code below), and our main data of interest is the English-language data:

- twitter data: en_US.twitter.txt
- news data: en_US.news.txt
- blogs data: en_US.blogs.txt

2.1 Downloading the data

# create the data folder if it doesn't exist
if (!file.exists("data")) {
  dir.create("data")
}

# download the data if it is not already present
if (!file.exists("data/Coursera-SwiftKey.zip")) {
  fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileUrl, destfile = "data/Coursera-SwiftKey.zip")
} else {
  message("Data Already downloaded")
}
## Data Already downloaded

# unzip the archive if it has not been extracted yet
if (!file.exists("data/SwiftKey")) {
  unzip(zipfile = "data/Coursera-SwiftKey.zip", exdir = "data/SwiftKey")
} else {
  message("Data Already unzipped")
}
## Data Already unzipped

2.2 Loading the data

# import the blogs and twitter datasets in text mode
blogs <- readLines("data/SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("data/SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")
# import the news dataset in binary mode
con <- file("data/SwiftKey/final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
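
The full English files are fairly large; before sampling, their line counts and approximate in-memory sizes can be checked, for example (output not shown here):

# quick size check of the full data sets
sapply(list(blogs = blogs, news = news, twitter = twitter), length)
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) format(object.size(x), units = "MB"))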

After loading the data we create a 1% sample to work with, since the original data sets are too large to process efficiently.

# create 1% sample data sets
# (a seed is set so the sampling is reproducible)
set.seed(1234)
twitter.smpl <- sample(twitter, size = round(length(twitter)/100))
blogs.smpl <- sample(blogs, size = round(length(blogs)/100))
news.smpl <- sample(news, size = round(length(news)/100))

2.3 Data cleanup

This process involves removing non-English characters, URLs, non-alphanumeric symbols, and email addresses. We use the tm package to build VectorSource corpora.
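
The cleaning and exploration code below assumes the following packages are attached:

library(tm)       # Corpus, tm_map, DocumentTermMatrix, MC_tokenizer
library(RWeka)    # NGramTokenizer, Weka_control
library(ggplot2)  # frequency plots
library(stringi)  # stri_count_words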

# regular expressions for the patterns we want to strip out
hashtags <- "#[a-zA-Z0-9]+"                                    # hashtags
special <- c("®","™", "¥", "£", "¢", "€", "#", "â€" , "ð" , "Ÿ˜","Š","í", "½","ð","$")  # stray symbols and mojibake
urls <- "(f|ht)tp(s?)://(.*)[.][a-z]+"                         # web addresses
email <- "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+"     # email addresses
email2 <- "^[[:alnum:].-]+@[[:alnum:].-]+$"                    # email-only lines
date <- "[0-9]{2}/[0-9]{2}/[0-9]{4}"                           # dates (dd/mm/yyyy)
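
As a quick illustration (a made-up example line, not taken from the corpus), applying the URL pattern removes the link from a string:

# example: stripping a URL from a sample string
gsub(urls, "", "check out http://example.com today")
# -> "check out  today" (the doubled space is collapsed later by stripWhitespace)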

2.3.1 Blog data cleanup

Here we remove the non-English characters and symbols from the blog data.

blogs.smpl <- gsub('[[:cntrl:]]', "", blogs.smpl)                    # control characters
blogs.smpl <- gsub(urls, "", blogs.smpl)                             # URLs
blogs.smpl <- gsub(email, "", blogs.smpl)                            # email addresses
blogs.smpl <- gsub(paste0(special, collapse = '|'), "", blogs.smpl)  # special symbols

Then we use the tm package to create a VectorSource corpus. We remove the common English stop words and strip the extra white space. Finally we convert the whole text to lower case.

blogs.smpl_c <- Corpus(VectorSource(blogs.smpl))
# Remove english common stopwords
blogs.smpl_c <- tm_map(blogs.smpl_c, removeWords, stopwords("english"))
# Eliminate extra white spaces
blogs.smpl_c <- tm_map(blogs.smpl_c, stripWhitespace)
#inspect(blogs.smpl_c)

#convert all values to lower case
blogs.smpl_c <- tm_map(blogs.smpl_c, content_transformer(tolower))

#inspect(blogs.smpl_c)

2.3.2 News data cleanup

Here we remove the non-English characters and symbols from the news data.

news.smpl <- gsub('[[:cntrl:]]', "", news.smpl)                    # control characters
news.smpl <- gsub(urls, "", news.smpl)                             # URLs
news.smpl <- gsub(email2, "", news.smpl)                           # email addresses
news.smpl <- gsub(paste0(special, collapse = '|'), "", news.smpl)  # special symbols

Then we use the tm package to create a VectorSource corpus. We remove the common English stop words and strip the extra white space. Finally we convert the whole text to lower case.

news.smpl_c <- Corpus(VectorSource(news.smpl))
# Remove english common stopwords
news.smpl_c <- tm_map(news.smpl_c, removeWords, stopwords("english"))
# Eliminate extra white spaces
news.smpl_c<- tm_map(news.smpl_c, stripWhitespace)
#inspect(news.smpl_c)

#convert all values to lower case
news.smpl_c <- tm_map(news.smpl_c, content_transformer(tolower))

2.3.3 Twitter data cleanup

Here we remove the non-English characters and symbols from the twitter data.

twitter.smpl <- gsub('[[:cntrl:]]', "", twitter.smpl)                    # control characters
twitter.smpl <- gsub(urls, "", twitter.smpl)                             # URLs
twitter.smpl <- gsub(email2, "", twitter.smpl)                           # email addresses
twitter.smpl <- gsub(paste0(special, collapse = '|'), "", twitter.smpl)  # special symbols

Then we use the tm package to create a VectorSource corpus. We remove the common English stop words and strip the extra white space. Finally we convert the whole text to lower case.

twitter.smpl_c <- Corpus(VectorSource(twitter.smpl))
#inspect(twitter.smpl_c )
# Remove english common stopwords
twitter.smpl_c <- tm_map(twitter.smpl_c , removeWords, stopwords("english"))
# Eliminate extra white spaces
twitter.smpl_c  <- tm_map(twitter.smpl_c , stripWhitespace)
#convert all values to lower case
twitter.smpl_c  <- tm_map(twitter.smpl_c , content_transformer(tolower))

3. Data Tokenization

Tokenization is the step where n-grams are extracted. An n-gram is a sequence of n consecutive words, so a 2-gram (bigram) is two words that appear together; this gives the bag-of-words model some information about word ordering. In this process we create:

- unigrams (1-grams)
- bigrams (2-grams)
- trigrams (3-grams)
- quadgrams (4-grams)

A short usage example follows the tokenizer definitions below.

# word-level tokens for each sample corpus
blogsTok <- MC_tokenizer(blogs.smpl_c)
twitTok <- MC_tokenizer(twitter.smpl_c)
newsTok <- MC_tokenizer(news.smpl_c)

# n-gram tokenizers built on RWeka's NGramTokenizer
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
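
As an illustration of what these tokenizers produce (a toy example, not part of the analysis itself), the bigram tokenizer splits a sentence into overlapping word pairs:

# toy example: bigrams of a short sentence
BigramTokenizer("thanks for the follow")
# returns "thanks for", "for the", "the follow"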

4. Data Exploration

4.1 Words Exploration

# word and character counts per line for each sample
newsDF <- data.frame(charCount=nchar(news.smpl), wordCount=sapply(strsplit(news.smpl, " "), length))
blogDF <- data.frame(charCount=nchar(blogs.smpl), wordCount=sapply(strsplit(blogs.smpl, " "), length))
twitrDF <- data.frame(charCount=nchar(twitter.smpl), wordCount=sapply(strsplit(twitter.smpl, " "), length))
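
These data frames are not printed in full; a quick way to compare the three sources would be, for example (output omitted):

# hypothetical comparison of words-per-line across the three samples
sapply(list(blogs = blogDF$wordCount,
            news = newsDF$wordCount,
            twitter = twitrDF$wordCount), summary)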
Blog

The most frequent words in the blog data are shown below.

####blog corpus
blogTok2  <- tm_map(blogs.smpl_c, PlainTextDocument)
dtm_blog <- DocumentTermMatrix(blogTok2) 
dtms_blog <- removeSparseTerms(dtm_blog, 0.98)

freq_blog <- sort (colSums (as.matrix(dtms_blog)), decreasing =TRUE)
wf_blog <- data.frame (word=names(freq_blog), freq=freq_blog)

wf_sub <- wf_blog[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),] 

ggplot(wf_sub,aes(x=reorder(word,freq),y=freq)) + geom_bar(stat="identity") +
  xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")

Twitter

The most frequent words in the twitter data are shown below.

twitTok2  <- tm_map(twitter.smpl_c, PlainTextDocument)
dtm_twit <- DocumentTermMatrix(twitTok2) 
dtms_twit <- removeSparseTerms(dtm_twit , 0.98)

freq_twit <- sort (colSums (as.matrix(dtms_twit)), decreasing =TRUE)
wf_twit <- data.frame (word=names(freq_twit), freq=freq_twit)

wf_sub <- wf_twit[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),] 

ggplot(wf_sub,aes(x=reorder(word,freq) ,y=freq)) + geom_bar(stat="identity") +
  xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")

News

The most frequent words in the news data are shown below.

newsTok2  <- tm_map(news.smpl_c, PlainTextDocument)
dtm_news <- DocumentTermMatrix(newsTok2) 
dtms_news <- removeSparseTerms(dtm_news , 0.98)

freq_news <- sort (colSums (as.matrix(dtms_news)), decreasing =TRUE)
wf_news <- data.frame (word=names(freq_news), freq=freq_news)

wf_sub <- wf_news[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),] 

ggplot(wf_sub,aes(x=reorder(word,freq),y=freq)) + geom_bar(stat="identity") +
  xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")

4.2 Tokens Exploration

First we combine the three sample data sets into a single text vector.

text_sample <- c(blogs.smpl, news.smpl, twitter.smpl)
length(text_sample)  # number of lines
## [1] 42696
sum(stri_count_words(text_sample))  # number of words
## [1] 1101036

Next we clean up the merged data set and plot the top 15 n-grams of each order.

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
preprocessCorpus <- function(corpus){
  # Helper function to preprocess corpus
  corpus <- tm_map(corpus, toSpace, "/|@|\\|®|™| ¥| £| ¢| €| #| †| ð | Ÿ˜|Š|í| ½|ð|$ | Â")
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # corpus <- tm_map(corpus, removeWords, profanities)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(tolower))
  return(corpus)
}

freq_frame <- function(tdm){
  # Helper function to tabulate frequency
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_frame <- data.frame(word=names(freq), freq=freq)
  return(freq_frame)
}


text_sample <- VCorpus(VectorSource(text_sample))
text_sample <- preprocessCorpus(text_sample)

# unigram term-document matrix; drop terms absent from more than 99% of documents
tdm1a <- TermDocumentMatrix(text_sample)
tdm1 <- removeSparseTerms(tdm1a, 0.99)
freq1_frame <- freq_frame(tdm1)

# bigram term-document matrix (sparsity threshold 0.999)
tdm2a <- TermDocumentMatrix(text_sample, control=list(tokenize=BigramTokenizer))
tdm2 <- removeSparseTerms(tdm2a, 0.999)
freq2_frame <- freq_frame(tdm2)

# trigram term-document matrix (sparsity threshold 0.9999)
tdm3a <- TermDocumentMatrix(text_sample, control=list(tokenize=TrigramTokenizer))
tdm3 <- removeSparseTerms(tdm3a, 0.9999)
freq3_frame <- freq_frame(tdm3)

# quadgram term-document matrix (sparsity threshold 0.9999)
tdm4a <- TermDocumentMatrix(text_sample, control=list(tokenize=QuadgramTokenizer))
tdm4 <- removeSparseTerms(tdm4a, 0.9999)
freq4_frame <- freq_frame(tdm4)
Unigrams Exploration
freq1_top15 <- head(freq1_frame,n=15)
ggplot(freq1_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common unigrams in text sample")

Bigrams Exploration
freq2_top15 <- head(freq2_frame,n=15)
ggplot(freq2_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common bigrams in text sample")

Trigrams Exploration
freq3_top15 <- head(freq3_frame,n=15)
ggplot(freq3_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common trigrams in text sample")

Quadgrams Exploration
freq4_top15 <- head(freq4_frame,n=15)
ggplot(freq4_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common quadgrams in text sample")

5. Data Prediction

We will apply the Katz back-off model for trigrams and bigrams to predict the next word. The application will be shared on my GitHub repository, and a pitch presentation will be posted on RPubs.
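
As a rough sketch of the back-off idea (simplified: plain frequency look-ups only, without the discounting and back-off weights of the full Katz model; predict_next is a hypothetical helper name), the frequency tables built above can already drive a basic prediction:

# simplified back-off lookup: try trigrams, then bigrams, then the top unigram
predict_next <- function(w1, w2) {
  tri_prefix <- paste(w1, w2, "")   # e.g. "happy mothers "
  tri_hits <- freq3_frame[startsWith(as.character(freq3_frame$word), tri_prefix), ]
  if (nrow(tri_hits) > 0) {
    # tables are already sorted by frequency, so the first match is the best guess
    return(sub(tri_prefix, "", as.character(tri_hits$word[1]), fixed = TRUE))
  }
  bi_prefix <- paste(w2, "")
  bi_hits <- freq2_frame[startsWith(as.character(freq2_frame$word), bi_prefix), ]
  if (nrow(bi_hits) > 0) {
    return(sub(bi_prefix, "", as.character(bi_hits$word[1]), fixed = TRUE))
  }
  # last resort: the most frequent unigram overall
  as.character(freq1_frame$word[1])
}

# example call with the last two words typed by a user
predict_next("happy", "mothers")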