The main objective of this project is to create an algorithm that predicts the next word based on the previous words typed by a user. We use Natural Language Processing techniques to build the prediction. We use data sets from a corpus called HC Corpora, which contains material published from 2005 up to the date of the corpus. The data sets we use for our prediction are:

- twitter data: pulled from Twitter by a web crawler
- news data: pulled from online news platforms by a web crawler
- blogs data: pulled from blogs by a web crawler
In this document we present some exploratory analysis and graphic representations of the words in the texts. We also describe the Katz back-off algorithm that we will use for prediction; the algorithm is based on the frequency of n-grams.
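For reference, the back-off estimate has the following general form (a sketch of the standard Katz formulation for bigrams; the discount $d$ and back-off weight $\alpha$ are estimated from the counts):

$$
P_{bo}(w_i \mid w_{i-1}) =
\begin{cases}
d_{w_{i-1}w_i}\,\dfrac{C(w_{i-1}w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}w_i) > k \\
\alpha_{w_{i-1}}\,P(w_i) & \text{otherwise}
\end{cases}
$$

where $C(\cdot)$ denotes an n-gram count and $k$ is a small count threshold (often 0). Higher-order models back off recursively in the same way.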
We pull our data sets from here, and our main data of interest is the English-based data:

- twitter data: en_US.twitter.txt
- news data: en_US.news.txt
- blogs data: en_US.blogs.txt
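The code in this report relies on the tm, RWeka, ggplot2 and stringi packages; we assume they are installed and loaded as below.

library(tm)        # corpora, cleaning and term-document matrices
library(RWeka)     # NGramTokenizer / Weka_control for n-gram tokenization
library(ggplot2)   # frequency plots
library(stringi)   # stri_count_words for word counts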
# create the data folder if it does not exist
if (!file.exists("data")) {
dir.create("data")
}
# download the data if it has not been downloaded already
if (!file.exists("data/Coursera-SwiftKey.zip")){
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "data/Coursera-SwiftKey.zip")
}else {message("Data Already downloaded")}
Data Already downloaded
if (!file.exists("data/SwiftKey")){
unzip(zipfile = "data/Coursera-SwiftKey.zip",exdir = "data/SwiftKey")
}else {message("Data Already unzipped")}
Data Already unzipped
# import the blogs and twitter datasets in text mode
blogs <- readLines("data/SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("data/SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")
# import the news dataset in binary mode
con <- file("data/SwiftKey/final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
After loading the data we draw a small sample to work with, since the original data sets are too large to process in full.
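For reproducibility we can set a seed before sampling (the seed value below is arbitrary).

# arbitrary seed so the 1% samples are reproducible
set.seed(1234)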
# create sample data sets
# 1% sampling from each source
twitter.smpl <- sample(twitter, size = round(length(twitter)/100))
blogs.smpl <- sample(blogs, size = round(length(blogs)/100))
news.smpl <- sample(news, size = round(length(news)/100))
This process involves the removal of non-English characters, URLs, hashtags, dates, email addresses and other symbols. We use the tm package to generate VectorSource corpora from the cleaned text. The patterns below define what we remove.
hashtags <- "#[0-9a-zA-Z]+"
special <- c("®","â„¢", "Â¥", "£", "¢", "€", "#", "â€" , "ð" , "Ÿ˜","Å ","Ã", "½","ð","$")
urls <- "(f|ht)tp(s?)://(.*)[.][a-z]+"
email <- "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+"
email2 <- "^[[:alnum:].-]+@[[:alnum:].-]+$"
date <- "[0-9]{2}/[0-9]{2}/[0-9]{4}"
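As a quick sanity check, the URL pattern can be tested on a toy string (a hypothetical example, not part of the corpus):

# example: the URL pattern strips "https://example.com" from a test sentence
gsub(urls, "", "read more at https://example.com today")
# expected result: "read more at  today" (the extra space is removed later by stripWhitespace)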
Here we remove the non-English characters and symbols from the blog data.
blogs.smpl <- gsub('[[:cntrl:]]',"",blogs.smpl)
blogs.smpl <- gsub(paste0(urls),"",blogs.smpl)
blogs.smpl <- gsub("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+","",blogs.smpl)
blogs.smpl <- gsub(paste0(special, collapse = '|'),"",blogs.smpl)
Then we use the tm package to create a VectorSource corpus. We remove the common English stop words and strip the extra white space. Finally, we convert the whole text to lower case.
blogs.smpl_c <- Corpus(VectorSource(blogs.smpl))
# Remove english common stopwords
blogs.smpl_c <- tm_map(blogs.smpl_c, removeWords, stopwords("english"))
# Eliminate extra white spaces
blogs.smpl_c <- tm_map(blogs.smpl_c, stripWhitespace)
#inspect(blogs.smpl_c)
#convert all values to lower case
blogs.smpl_c <- tm_map(blogs.smpl_c, content_transformer(tolower))
#inspect(blogs.smpl_c)
Here we remove the non-English characters and symbols from the news data.
news.smpl <- gsub('[[:cntrl:]]',"",news.smpl)
news.smpl <- gsub(paste0(urls),"",news.smpl)
news.smpl <- gsub(paste0(email2),"",news.smpl)
news.smpl <- gsub(paste0(special, collapse = '|'),"",news.smpl)
Then we use the tm package to create a VectorSource corpus. We remove the common English stop words and strip the extra white space. Finally, we convert the whole text to lower case.
news.smpl_c <- Corpus(VectorSource(news.smpl))
# Remove english common stopwords
news.smpl_c <- tm_map(news.smpl_c, removeWords, stopwords("english"))
# Eliminate extra white spaces
news.smpl_c<- tm_map(news.smpl_c, stripWhitespace)
#inspect(news.smpl_c)
#convert all values to lower case
news.smpl_c <- tm_map(news.smpl_c, content_transformer(tolower))
Here we remove the non-English characters and symbols from the twitter data.
twitter.smpl <- gsub('[[:cntrl:]]',"",twitter.smpl)
twitter.smpl <- gsub(paste0(urls),"",twitter.smpl)
twitter.smpl <- gsub(paste0(email2),"",twitter.smpl)
twitter.smpl <- gsub(paste0(special, collapse = '|'),"",twitter.smpl)
Then we use the tm package to create a VectorSource corpus. We remove the common English stop words and strip the extra white space. Finally, we convert the whole text to lower case.
twitter.smpl_c <- Corpus(VectorSource(twitter.smpl))
#inspect(twitter.smpl_c )
# Remove english common stopwords
twitter.smpl_c <- tm_map(twitter.smpl_c , removeWords, stopwords("english"))
# Eliminate extra white spaces
twitter.smpl_c <- tm_map(twitter.smpl_c , stripWhitespace)
#convert all values to lower case
twitter.smpl_c <- tm_map(twitter.smpl_c , content_transformer(tolower))
Tokenization, where n-grams are extracted, is also useful. An n-gram is a sequence of n words, so a 2-gram is two consecutive words. This gives the bag-of-words model some information about word ordering. In this process we create (see the short example after the tokenizer definitions below):

- unigrams: 1-grams
- bigrams: 2-grams
- trigrams: 3-grams
- quadgrams: 4-grams
# word-level tokenization of the cleaned samples (kept for reference)
blogsTok <- MC_tokenizer(blogs.smpl)
twitTok <- MC_tokenizer(twitter.smpl)
newsTok <- MC_tokenizer(news.smpl)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
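As a quick illustration on a hypothetical input string, the bigram tokenizer splits a sentence into overlapping word pairs:

# example: bigrams of a short test sentence
BigramTokenizer("the quick brown fox")
# expected: "the quick" "quick brown" "brown fox"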
# per-line character and word counts for each sample
newsDF <- data.frame(charCount=nchar(news.smpl), wordCount=sapply(strsplit(news.smpl, " "), length))
blogDF <- data.frame(charCount=nchar(blogs.smpl), wordCount=sapply(strsplit(blogs.smpl, " "), length))
twitrDF <- data.frame(charCount=nchar(twitter.smpl), wordCount=sapply(strsplit(twitter.smpl, " "), length))
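A simple way to compare the three samples is to tabulate the average line length from these data frames (no fixed output is shown since it depends on the random sample):

# average characters and words per line in each sample
data.frame(source = c("blogs", "news", "twitter"),
           meanChars = round(c(mean(blogDF$charCount), mean(newsDF$charCount),
                               mean(twitrDF$charCount))),
           meanWords = round(c(mean(blogDF$wordCount), mean(newsDF$wordCount),
                               mean(twitrDF$wordCount))))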
The most frequent words in the blog data are shown below.
#### Blog corpus
blogTok2 <- tm_map(blogs.smpl_c, PlainTextDocument)
dtm_blog <- DocumentTermMatrix(blogTok2)
dtms_blog <- removeSparseTerms(dtm_blog, 0.98)
freq_blog <- sort (colSums (as.matrix(dtms_blog)), decreasing =TRUE)
wf_blog <- data.frame (word=names(freq_blog), freq=freq_blog)
wf_sub <- wf_blog[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),]
ggplot(wf_sub,aes(x=reorder(word,freq),y=freq)) + geom_bar(stat="identity") +
xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")
The most frequent words in the twitter data are shown below.
twitTok2 <- tm_map(twitter.smpl_c, PlainTextDocument)
dtm_twit <- DocumentTermMatrix(twitTok2)
dtms_twit <- removeSparseTerms(dtm_twit , 0.98)
freq_twit <- sort (colSums (as.matrix(dtms_twit)), decreasing =TRUE)
wf_twit <- data.frame (word=names(freq_twit), freq=freq_twit)
wf_sub <- wf_twit[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),]
ggplot(wf_sub,aes(x=reorder(word,freq) ,y=freq)) + geom_bar(stat="identity") +
xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")
The most frequent words in the news data are shown below.
newsTok2 <- tm_map(news.smpl_c, PlainTextDocument)
dtm_news <- DocumentTermMatrix(newsTok2)
dtms_news <- removeSparseTerms(dtm_news , 0.98)
freq_news <- sort (colSums (as.matrix(dtms_news)), decreasing =TRUE)
wf_news <- data.frame (word=names(freq_news), freq=freq_news)
wf_sub <- wf_news[1:20,]
wf_sub <- wf_sub[order(wf_sub$freq),]
ggplot(wf_sub,aes(x=reorder(word,freq),y=freq)) + geom_bar(stat="identity") +
xlab("Word") + ylab("Frequency") + theme_classic(base_size = 16, base_family = "")
First we combine the three samples into a single character vector.
text_sample <- c(blogs.smpl,news.smpl,twitter.smpl)
length(text_sample) #no of lines
## [1] 42696
sum(stri_count_words(text_sample))
## [1] 1101036
We then clean up the merged data set and plot the top 15 n-grams of each order.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
preprocessCorpus <- function(corpus){
# Helper function to preprocess corpus
corpus <- tm_map(corpus, toSpace, "/|@|\\|®|â„¢| Â¥| £| ¢| €| #| †| ð | Ÿ˜|Å |Ã| ½|ð|$ | Â")
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# corpus <- tm_map(corpus, removeWords, profanities)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
return(corpus)
}
freq_frame <- function(tdm){
# Helper function to tabulate frequency
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
freq_frame <- data.frame(word=names(freq), freq=freq)
return(freq_frame)
}
text_sample <- VCorpus(VectorSource(text_sample))
text_sample <- preprocessCorpus(text_sample)
tdm1a <- TermDocumentMatrix(text_sample)
tdm1 <- removeSparseTerms(tdm1a, 0.99)
freq1_frame <- freq_frame(tdm1)
tdm2a <- TermDocumentMatrix(text_sample, control=list(tokenize=BigramTokenizer))
tdm2 <- removeSparseTerms(tdm2a, 0.999)
freq2_frame <- freq_frame(tdm2)
tdm3a <- TermDocumentMatrix(text_sample, control=list(tokenize=TrigramTokenizer))
tdm3 <- removeSparseTerms(tdm3a, 0.9999)
freq3_frame <- freq_frame(tdm3)
tdm4a <- TermDocumentMatrix(text_sample, control=list(tokenize=QuadgramTokenizer))
tdm4 <- removeSparseTerms(tdm4a, 0.9999)
freq4_frame <- freq_frame(tdm4)
freq1_top15 <- head(freq1_frame,n=15)
ggplot(freq1_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Most common unigrams in text sample")
freq2_top15 <- head(freq2_frame,n=15)
ggplot(freq2_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Most common bigrams in text sample")
freq3_top15 <- head(freq3_frame,n=15)
ggplot(freq3_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Most common trigrams in text sample")
freq4_top15 <- head(freq4_frame,n=15)
ggplot(freq4_top15, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Most common quadgrams in text sample")
We will apply the Katz back-off model on the trigram and bigram tables to predict the next word. The application will be shared in my GitHub repository and a pitch presentation will be posted on RPubs.
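As a minimal sketch of the prediction step (not the final implementation), the frequency tables built above (freq3_frame, freq2_frame, freq1_frame) can already be queried with a simple back-off lookup that falls from trigrams to bigrams to unigrams when a history is unseen. The predictNext function below is hypothetical, and the Katz discounting and back-off weights are omitted for brevity.

# hypothetical back-off lookup over the n-gram frequency tables built above;
# Katz discounting (d) and back-off weights (alpha) are omitted
predictNext <- function(history, n = 3) {
  words <- tail(unlist(strsplit(tolower(history), "\\s+")), 2)
  # try trigrams whose first two words match the end of the history
  if (length(words) == 2) {
    hits <- freq3_frame[grepl(paste0("^", paste(words, collapse = " "), " "),
                              freq3_frame$word), ]
    if (nrow(hits) > 0) return(head(sub(".* ", "", hits$word), n))
  }
  # back off to bigrams starting with the last word of the history
  hits <- freq2_frame[grepl(paste0("^", tail(words, 1), " "), freq2_frame$word), ]
  if (nrow(hits) > 0) return(head(sub(".* ", "", hits$word), n))
  # final back-off: the most frequent unigrams overall
  head(as.character(freq1_frame$word), n)
}

For example, predictNext("happy new") first looks for trigrams beginning with "happy new" and, if none are found in the sample, falls back to bigrams beginning with "new" and finally to the most frequent unigrams.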