1. Introduction

The main objective of this project is to create an algorithm that predicts the next word based on the previous words typed by a user. We use Natural Language Processing techniques to build the prediction. We use data sets from a corpus collection called HC Corpora, which contains material published from 2005 up to the date the corpora were compiled. The data sets we use for our prediction are:

- twitter data: tweets pulled from Twitter by a web crawler
- news data: articles pulled from online news platforms by a web crawler
- blogs data: posts pulled from blogs by a web crawler

In this document we present some exploratory analysis and graphical representations of the words in the texts. We also describe the Katz back-off algorithm that we will use for prediction; the algorithm is based on the frequency of n-grams.

library(tm)
## Loading required package: NLP
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(tm)
library(RWeka)
library(stringi)
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

2. Data acquisition and cleaning

We pull our data sets from the Coursera SwiftKey archive (the download URL is shown in the code below), and our main data of interest is the English data:

- twitter data: en_US.twitter.txt
- news data: en_US.news.txt
- blogs data: en_US.blogs.txt

2.1 Downloading the data

# create the data folder if it does not already exist
if (!file.exists("data")) {
  dir.create("data")
}

# download the data if it is not already present
if (!file.exists("data/Coursera-SwiftKey.zip")){
  fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileUrl, destfile = "data/Coursera-SwiftKey.zip")
}else {message("Data Already downloaded")}
Data Already downloaded
if (!file.exists("data/SwiftKey")){
unzip(zipfile = "data/Coursera-SwiftKey.zip",exdir = "data/SwiftKey")
}else {message("Data Already unzipped")}
Data Already unzipped

2.2 Loading the data

# import the blogs and twitter datasets in text mode
blogs <- readLines("data/Swiftkey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("data/Swiftkey/final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("data/Swiftkey/final/en_US/en_US.twitter.txt",
## encoding = "UTF-8"): line 167155 appears to contain an embedded nul
## Warning in readLines("data/Swiftkey/final/en_US/en_US.twitter.txt",
## encoding = "UTF-8"): line 268547 appears to contain an embedded nul
## Warning in readLines("data/Swiftkey/final/en_US/en_US.twitter.txt",
## encoding = "UTF-8"): line 1274086 appears to contain an embedded nul
## Warning in readLines("data/Swiftkey/final/en_US/en_US.twitter.txt",
## encoding = "UTF-8"): line 1759032 appears to contain an embedded nul
# import the news dataset in binary mode (text mode can stop early because of embedded special characters)
con <- file("data/Swiftkey/final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)

After loading the data we draw a sample to work with, since the original data sets are too large to process comfortably. We keep roughly 1% of the lines from each source.

# create the sample data sets: 1% of the lines from each source
twitter.smpl <- sample(twitter, size = round(length(twitter)/100))
blogs.smpl <- sample(blogs, size = round(length(blogs)/100))
news.smpl <- sample(news, size = round(length(news)/100))
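Because sample() draws lines at random, the exact sample changes from run to run. Setting a seed beforehand (an optional addition, not part of the original run) makes the 1% samples reproducible:

# optional: fix the random seed before the sample() calls so the samples are reproducible
set.seed(1234)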

2.3 Raw data summary

The first and last 5 tweets of our sample twitter data

head(twitter.smpl, n=5)
## [1] "Oh... that's a good one. RT : really? Not CAR DASHIAN?"                                                         
## [2] "I'm using an interesting combination of methods :) It's unfortunately not easy to configure"                    
## [3] "And the signs are up!"                                                                                          
## [4] "....I feel like Larry Stylinson is getting more distance and it's more Zarry now. Oh okay sorry for my opinion."
## [5] "Ah its going to be a boy!!:D"
tail(twitter.smpl, n=5)
## [1] "why not finally say that policy of always opposing the president purely political reasons is completely unamerican/childish"               
## [2] "If you're wondering why I'm tweeting about candy so much it's cause there's a huge candy bowl in front of me at the salon & I have no will"
## [3] "Brendan Fraser."                                                                                                                           
## [4] "Bout to head to carmel to show the softball team some love!!!!!"                                                                           
## [5] "music is so amazing to me"

The first and last 5 lines of our sample blog data

head(blogs.smpl, n=5)
## [1] "Some alterations show the influence of painting over the composition of photograph, some cropping and brushing serves simply to create pleasing compositions, but most of the revisions seem to come from extra-aesthetic reasons either political or personal. In one publicity photo, Mao walks next to a gentleman and waves to the public. Mao is taller than the man and dominates the scene with his inherent charisma and personality. This man, however, is eliminated from the photograph anyway. Who was that man? Why was he removed? Can we safely assume that he fell out of favor with the Chairman or is this just our bias in viewing the working of a so called “communist” mentality? Another photograph shows a group of aviators posing in front of combat jet. The altered photograph shows that three of the men have simply vanished. This is the shadowy world of images, the working of those behind closed doors that can never be known."
## [2] "If you were at Cinefest 32 and saw last night's presentation of The Street of Forgotten Men, this blog would love to hear from you. Please post your thoughts or observations about the film and its screening in the comments field below. What did you think?"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [3] "Previously on my wishlist:"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [4] "I was elected by a majority, and have been a Board Member and Secretary-Treasurer for the District."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [5] "On World AIDS Day, one of the commitments we can make is to review the information HIV/AIDS and make sure we each do our part to eliminate the disease"
tail(blogs.smpl, n=5)
## [1] "About the card:"                                                                                                                                                                                                                                                                                                                                                         
## [2] "Obviously 2 very different styles of beer, but at the end of the day its still malts, water and hops."                                                                                                                                                                                                                                                                   
## [3] "BSS: \"I didn't.\""                                                                                                                                                                                                                                                                                                                                                      
## [4] "Is There Anti-Semitism on US College Campuses? Breaking: Rocket fired from #Gaza lands in Eshkol Regional Council, home to ~11,000 people Occupy AIPAC denies Holocaust-denying cartoonist works for them 3-Mar-12: Incoming Gazan missiles crashes into southern Israel, making zero impact (again) on non-Israeli news editors ‘Israeli Arabs’ rally to free murderers"
## [5] "Even while starring on \"Buffy\", Gellar maintained a busy filming and voice-acting schedule, remaining in front of the camera and microphones pretty much non-stop through her teen years and 20s, and into the present day."

The first and last 5 lines of our sample news data

head(news.smpl , n=5)
## [1] "One more thing: You're probably wondering which one is \"Lothar\". The answer: None of them. Lothar was the THEREMIN who fronted the band. Duh."     
## [2] "Original Gravity: 1.053"                                                                                                                             
## [3] "You have time until 22 april 23:59."                                                                                                                 
## [4] "What should you be blogging about?"                                                                                                                  
## [5] "You mentioned before you cannot drink alcohol until this year according to your doctor's advice. Why was that? Are you allowed to drink alcohol now?"
tail(news.smpl , n=5)
## [1] "Now this is only a very short synopsis of a marvellous romantic comedy play that is extremely fun and merry to read. It does not and cannot encompass the play’s beauty,comedy and complexity wholly. There are so many layers to ‘A Midsummer Night’s Dream‘ that one is perplexed whether Shakespeare wrote it only for mere entertainment or for other instructional purposes as well. It may not be one of his best plays but still remains a popular one that is sure to make anyone laugh. And if the Shakespearean language daunts you, don’t fret, it isn’t that difficult. Just try reading it aloud, going with the flow of the verse, comprehension will come to you eventually (Plus the notes and the annotations help a lot). And if you can, go watch its performance which will further deepen your understanding of the play and perhaps kindle a love for Shakespeare forever freeing you from the tedious rom coms of today!"
## [2] "First youÂ’ll going to start by fining your eyebrows, filing them in and IÂ’m going to be using Palatino color went off but it is medium brown, start here lightly fill them in and then have a concealer brush with a concealer on it to clean the edges."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [3] "And 100 desert island discs, although it pains me to think of what I left off. In fact, I should probably scrap this list and start over. But that would take a lot of time."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [4] "(Drop it, I tell you- put that kitten down!)"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [5] "There he was placed in a holding cell where he was allegedly severely assaulted."

2.4 Data cleanup

This process involves removing non-English characters, URLs, non-alphanumeric symbols, and email addresses. We define a few regular-expression patterns for this and use the tm package to generate VectorSource corpora.

# regular expression patterns used in the cleanup
hashtags <- "#[0-9a-zA-Z]+"                                      # hashtags such as #word
special <- c("®","™", "¥", "£", "¢", "€", "#", "â€" , "ð" , "Ÿ˜","Š","í", "½","ð","$")  # symbols and mis-encoded characters
urls <- "(f|ht)tp(s?)://(.*)[.][a-z]+"                           # http/https/ftp URLs
email <- "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+"       # email addresses
email2 <- "^[[:alnum:].-]+@[[:alnum:].-]+$"                      # lines that consist only of an email address
date <- "[0-9]{2}/[0-9]{2}/[0-9]{4}"                             # numeric dates such as 01/02/2015
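As a quick illustration (on a made-up string, not taken from the corpus), gsub() with one of these patterns strips any matching substrings; the same call pattern is applied to the real samples below.

# illustrative only: strip a URL from a made-up example string
gsub(urls, "", "visit http://example.com for more")
## roughly: "visit  for more" (the leftover extra white space is stripped later)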

2.4.1 Blog data cleanup

Here we remove the non-English characters and symbols from the blog data.

blogs.smpl <- gsub('[[:cntrl:]]', "", blogs.smpl)                    # control characters
blogs.smpl <- gsub(urls, "", blogs.smpl)                             # URLs
blogs.smpl <- gsub(email, "", blogs.smpl)                            # email addresses
blogs.smpl <- gsub(paste0(special, collapse = '|'), "", blogs.smpl)  # special symbols

Then we use the tm package to create a VectorSource corpus. We remove the common English stop words, strip the extra white space, and finally convert the whole text to lower case.

blogs.smpl_c <- Corpus(VectorSource(blogs.smpl))
# Remove english common stopwords
blogs.smpl_c <- tm_map(blogs.smpl_c, removeWords, stopwords("english"))
# Eliminate extra white spaces
blogs.smpl_c <- tm_map(blogs.smpl_c, stripWhitespace)
#inspect(blogs.smpl_c)

#convert all values to lower case
blogs.smpl_c <- tm_map(blogs.smpl_c, content_transformer(tolower))

#inspect(blogs.smpl_c)

2.4.2 News data cleanup

Here we remove the non-English characters and symbols from the news data.

news.smpl <- gsub('[[:cntrl:]]', "", news.smpl)                    # control characters
news.smpl <- gsub(urls, "", news.smpl)                             # URLs
news.smpl <- gsub(email2, "", news.smpl)                           # email addresses
news.smpl <- gsub(paste0(special, collapse = '|'), "", news.smpl)  # special symbols

Then we use the tm package to create a VectorSource corpus. We remove the common English stop words, strip the extra white space, and finally convert the whole text to lower case.

news.smpl_c <- Corpus(VectorSource(news.smpl))
# Remove english common stopwords
news.smpl_c <- tm_map(news.smpl_c, removeWords, stopwords("english"))
# Eliminate extra white spaces
news.smpl_c<- tm_map(news.smpl_c, stripWhitespace)
#inspect(news.smpl_c)

#convert all values to lower case
news.smpl_c <- tm_map(news.smpl_c, content_transformer(tolower))

head(news.smpl_c)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 6

2.4.3 Twitter data cleanup

Here we remove the non-English characters and symbols from the twitter data.

twitter.smpl <- gsub('[[:cntrl:]]', "", twitter.smpl)                    # control characters
twitter.smpl <- gsub(urls, "", twitter.smpl)                             # URLs
twitter.smpl <- gsub(email2, "", twitter.smpl)                           # email addresses
twitter.smpl <- gsub(paste0(special, collapse = '|'), "", twitter.smpl)  # special symbols

Then we use the tm package to create a VectorSource corpus. We remove the common English stop words, strip the extra white space, and finally convert the whole text to lower case.

twitter.smpl_c <- Corpus(VectorSource(twitter.smpl))
# inspect(twitter.smpl_c)
# Remove english common stopwords
twitter.smpl_c <- tm_map(twitter.smpl_c, removeWords, stopwords("english"))
# Eliminate extra white spaces
twitter.smpl_c <- tm_map(twitter.smpl_c, stripWhitespace)
# convert all values to lower case
twitter.smpl_c <- tm_map(twitter.smpl_c, content_transformer(tolower))

head(twitter.smpl_c)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 6

3. Data Tokenization

Tokenization, where n-grams are extracted, is also useful. An n-gram is a sequence of n consecutive words, so a 2-gram (bigram) is two words that appear together. This lets the bag-of-words model retain some information about word ordering. In this process we create:

- unigrams, for 1-grams
- bigrams, for 2-grams
- trigrams, for 3-grams
- quadgrams, for 4-grams

# word-level tokens for each corpus
blogsTok <- MC_tokenizer(blogs.smpl_c)
twitTok <- MC_tokenizer(twitter.smpl_c)
newsTok <- MC_tokenizer(news.smpl_c)

# RWeka n-gram tokenizers used for the term-document matrices below
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
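As a quick check (an illustrative call, not part of the pipeline), the RWeka tokenizers split a plain string into overlapping word sequences:

# illustrative only: bigrams of a short phrase
BigramTokenizer("to be or not to be")
## expected output along the lines of: "to be" "be or" "or not" "not to" "to be"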

4. Data Exploration

4.1 Word Exploration

# word statistics: character and word counts per line
newsDF <- data.frame(charCount=nchar(news.smpl), wordCount=sapply(strsplit(news.smpl, " "), length))
blogDF <- data.frame(charCount=nchar(blogs.smpl), wordCount=sapply(strsplit(blogs.smpl, " "), length))
twitrDF <- data.frame(charCount=nchar(twitter.smpl), wordCount=sapply(strsplit(twitter.smpl, " "), length))
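A quick numeric summary of the per-line word counts (an optional check, not shown in the original report) can be obtained with summary():

# optional sanity check: distribution of words per line in each sample
summary(newsDF$wordCount)
summary(blogDF$wordCount)
summary(twitrDF$wordCount)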

The first ten records of each data category are as follows:

# First 10 records of News, Blog and Twitter
newsDF[1:10,]
##    charCount wordCount
## 1        141        24
## 2         23         3
## 3         35         7
## 4         34         6
## 5        148        25
## 6         10         2
## 7         30         5
## 8        286        57
## 9         21         3
## 10       164        29
blogDF[1:10, ]
##    charCount wordCount
## 1        931       158
## 2        255        46
## 3         26         4
## 4         99        17
## 5        150        29
## 6        214        39
## 7        358        54
## 8         83        13
## 9         24         6
## 10        45         7
twitrDF[1:10,]
##    charCount wordCount
## 1         54        11
## 2         91        14
## 3         21         5
## 4        111        20
## 5         28         7
## 6         95        13
## 7         21         4
## 8        139        21
## 9         29         6
## 10        52        10
Blog

The most frequent words in the blog data are shown below.

####blog corpus
blogTok2  <- tm_map(blogs.smpl_c, PlainTextDocument)
dtm_blog <- DocumentTermMatrix(blogTok2) 
dtms_blog <- removeSparseTerms(dtm_blog, 0.98)

freq_blog <- sort (colSums (as.matrix(dtms_blog)), decreasing =TRUE)
wf_blog <- data.frame (word=names(freq_blog), freq=freq_blog)

wf_sub <- wf_blog[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),] 

ggplot(wf_sub,aes(x=reorder(word,freq),y=freq)) + geom_bar(stat="identity") +
  xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")

Twitter

The most frequent words in the twitter data are shown below.

twitTok2  <- tm_map(twitter.smpl_c, PlainTextDocument)
dtm_twit <- DocumentTermMatrix(twitTok2) 
dtms_twit <- removeSparseTerms(dtm_twit , 0.98)

freq_twit <- sort (colSums (as.matrix(dtms_twit)), decreasing =TRUE)
wf_twit <- data.frame (word=names(freq_twit), freq=freq_twit)

wf_sub <- wf_twit[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),] 

ggplot(wf_sub,aes(x=reorder(word,freq) ,y=freq)) + geom_bar(stat="identity") +
  xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")

News

The most frequent words in the news data are shown below.

newsTok2  <- tm_map(news.smpl_c, PlainTextDocument)
dtm_news <- DocumentTermMatrix(newsTok2) 
dtms_news <- removeSparseTerms(dtm_news , 0.98)

freq_news <- sort (colSums (as.matrix(dtms_news)), decreasing =TRUE)
wf_news <- data.frame (word=names(freq_news), freq=freq_news)

wf_sub <- wf_news[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),] 

ggplot(wf_sub,aes(x=reorder(word,freq),y=freq)) + geom_bar(stat="identity") +
  xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")

4.2 Token Exploration

First we combine the three sample data sets into a single character vector.

text_sample  <- c(blogs.smpl,news.smpl,twitter.smpl)
length(text_sample) #no of lines
## [1] 42696
sum(stri_count_words(text_sample))
## [1] 1098195

We then clean up the merged data set and plot the top 15 n-grams of each order.

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
preprocessCorpus <- function(corpus){
  # Helper function to preprocess corpus
  corpus <- tm_map(corpus, toSpace, "/|@|\\|®|™| ¥| £| ¢| €| #| †| ð | Ÿ˜|Š|í| ½|ð|$ | Â")
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # corpus <- tm_map(corpus, removeWords, profanities)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(tolower))
  return(corpus)
}

freq_frame <- function(tdm){
  # Helper function to tabulate frequency
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_frame <- data.frame(word=names(freq), freq=freq)
  return(freq_frame)
}


text_sample <- VCorpus(VectorSource(text_sample))
text_sample <- preprocessCorpus(text_sample)

tdm1a <- TermDocumentMatrix(text_sample)
tdm1 <- removeSparseTerms(tdm1a, 0.99)
freq1_frame <- freq_frame(tdm1)

tdm2a <- TermDocumentMatrix(text_sample, control=list(tokenize=BigramTokenizer))
tdm2 <- removeSparseTerms(tdm2a, 0.999)
freq2_frame <- freq_frame(tdm2)

tdm3a <- TermDocumentMatrix(text_sample, control=list(tokenize=TrigramTokenizer))
tdm3 <- removeSparseTerms(tdm3a, 0.9999)
freq3_frame <- freq_frame(tdm3)

tdm4a <- TermDocumentMatrix(text_sample, control=list(tokenize=QuadgramTokenizer))
tdm4 <- removeSparseTerms(tdm4a, 0.9999)
freq4_frame <- freq_frame(tdm4)
Unigram Exploration
freq1_top15 <- head(freq1_frame,n=15)
ggplot(freq1_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common unigrams in text sample")

Bigram Exploration
freq2_top15 <- head(freq2_frame,n=15)
ggplot(freq2_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common bigrams in text sample")

Trigram Exploration
freq3_top15 <- head(freq3_frame,n=15)
ggplot(freq3_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common trigrams in text sample")

Quadgram Exploration
freq4_top15 <- head(freq4_frame,n=15)
ggplot(freq4_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common quadgrams in text sample")

5. Data Prediction

We will apply the Katz back-off model over the trigram and bigram frequencies to predict the next word. The prediction application will be shared on my GitHub repository and a pitch presentation will be posted on my RPubs page.
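A minimal sketch of the back-off idea is shown below. It is illustrative only: it performs simple frequency look-ups in the freq3_frame and freq2_frame tables built above and omits the Good-Turing discounting and back-off weights that the full Katz model applies.

# minimal back-off sketch (illustrative only, no discounting):
# try the trigram table first, then fall back to the bigram table
predict_next <- function(phrase, tri = freq3_frame, bi = freq2_frame) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # trigram look-up: n-grams that start with the last two typed words
  hits <- tri[grepl(paste0("^", paste(words, collapse = " "), " "), tri$word), ]
  if (nrow(hits) == 0 && length(words) == 2) {
    # back off to bigrams that start with the last typed word
    hits <- bi[grepl(paste0("^", words[2], " "), bi$word), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  # return the final word of the most frequent matching n-gram
  best <- as.character(hits$word[which.max(hits$freq)])
  tail(strsplit(best, " ")[[1]], 1)
}

predict_next("happy new")   # might suggest "year", depending on the sampled data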