The purpose of this report is to explain the analysis of the training text documents provided and to discover structure in the data through basic exploratory analysis. The dataset is acquired, cleaned, and then analysed for the most frequent words used in the documents. The analysis is performed on a sample of the data. The report also outlines the next steps for building a predictive text model and presenting it in a Shiny app.
The training dataset is downloaded from the following URL, unzipped, and copied into the working folder: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
There are 4 folders in the dataset representing 4 different languages - German (de_DE), English (en_US), Finnish (fi_FI), and Russian (ru_RU). We will be using the files under the English folder (en_US), which contains three files: 1) en_US.blogs.txt 2) en_US.news.txt 3) en_US.twitter.txt
library(stringi)
library(ggplot2)
library(dplyr)
library(tm)
library(SnowballC)
library(NLP)
We load the three files from the working directory into character vectors.
setwd("C:/Sugi_R/Data Science Capstone Project/Coursera-SwiftKey/final/en_US")
con1<-file("en_US.blogs.txt","r")
blogs<-readLines(con1)
close(con1)
con2<-file("en_US.twitter.txt","r")
twitter<-readLines(con2)
close(con2)
con3<-file("en_US.news.txt","r")
news<-readLines(con3)
close(con3)
Let's explore the three datasets for line counts and word counts.
data.frame(filename=c("blogs","news","twitter"), rbind(stri_stats_general(blogs),stri_stats_general(news),stri_stats_general(twitter)),Wordcount=sapply(list(blogs,news,twitter),stri_stats_latex)[4,])
## filename Lines LinesNEmpty Chars CharsNWhite Wordcount
## 1 blogs 899288 899288 208361438 171926076 37865888
## 2 news 77259 77259 15683765 13117038 2665742
## 3 twitter 2360148 2360148 162384825 134370864 30578891
Summary statistics of the words per line for each of the three datasets:
data.frame(filename=c("blogs","news","twitter"), rbind(summary(stri_count_words(blogs)),summary(stri_count_words(news)), summary(stri_count_words(twitter))))
## filename Min. X1st.Qu. Median Mean X3rd.Qu. Max.
## 1 blogs 0 9 29 42.43 61 6726
## 2 news 1 19 32 34.87 46 1123
## 3 twitter 1 7 12 12.80 18 60
For our exploratory analysis, since the full dataset is quite large, let's take a 0.5% sample from each dataset and build a corpus from it. We will clean the corpus by removing punctuation, profanity (a bad-words list downloaded into the working directory), extra white space, numbers, and stopwords, as well as converting the text to lowercase.
# Sample the data
set.seed(1000)
allsample <- c(sample(blogs, length(blogs) * 0.005),
               sample(news, length(news) * 0.005),
               sample(twitter, length(twitter) * 0.005))
# Create Corpus Data and Clean the data
sample_vcorp <- VCorpus(VectorSource(allsample), list(reader = readPlain))
# Function to replace unwanted characters with space
spaceadd<-content_transformer(function(x, pattern) gsub(pattern, " ", x))
sample_vcorp <- tm_map(sample_vcorp, spaceadd, "(f|ht)tp(s?)://(.*)[.][a-z]+")
sample_vcorp <- tm_map(sample_vcorp, spaceadd, "@[^\\s]+")
# Remove bad words from the corpus. The bad-words list was downloaded from
# https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
# and saved to the local working directory.
# The file is assumed to contain a single column with the header "bad_words"
bad_words <- read.csv2(file = 'bad_words.txt', header = TRUE, strip.white = TRUE, stringsAsFactors = FALSE)
sample_vcorp <- tm_map(sample_vcorp, removeWords, bad_words$bad_words)
sample_vcorp<-tm_map(sample_vcorp,content_transformer(tolower))
sample_vcorp<-tm_map(sample_vcorp,removePunctuation)
sample_vcorp<-tm_map(sample_vcorp,removeWords,stopwords("en"))
sample_vcorp<-tm_map(sample_vcorp,stripWhitespace)
sample_vcorp<-tm_map(sample_vcorp,removeNumbers)
sample_vcorp <- tm_map(sample_vcorp, PlainTextDocument)
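As a quick sanity check (optional), we can inspect one of the cleaned documents to verify the transformations took effect:
# Inspect the first cleaned document
writeLines(as.character(sample_vcorp[[1]]))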
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to simply by the value of n, e.g., “four-gram”, “five-gram”, and so on.
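As a small illustration (a toy example, not part of the analysis), the bigrams of a four-word phrase produced by NLP::ngrams() look like this:
# Toy example: bigrams of a short phrase
sapply(ngrams(strsplit("thanks for the follow", " ")[[1]], 2), paste, collapse = " ")
## [1] "thanks for" "for the"    "the follow"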
We will create unigrams, bigrams, and trigrams from the sample dataset in order to explore the data better. The tokenization functions below create the different n-grams using the NLP package.
UnigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
BigramTokenizer  <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
Let's calculate the frequencies of the unigrams, bigrams, and trigrams and plot them to understand the most frequently used words. Removing sparse terms is a necessary step to keep the term-document matrix (TDM) at a manageable size.
options(mc.cores=1)
# Unigram frequencies
sample_vcorp.unigrams <- removeSparseTerms(
    TermDocumentMatrix(sample_vcorp, control = list(tokenize = UnigramTokenizer)), 0.9999)
uni_matrix <- sort(rowSums(as.matrix(sample_vcorp.unigrams)), decreasing = TRUE)
uni_ngram <- data.frame(uni_ngram = names(uni_matrix), freq = uni_matrix)
# Bigram frequencies
sample_vcorp.bigrams <- removeSparseTerms(
    TermDocumentMatrix(sample_vcorp, control = list(tokenize = BigramTokenizer)), 0.9999)
bi_matrix <- sort(rowSums(as.matrix(sample_vcorp.bigrams)), decreasing = TRUE)
bi_ngram <- data.frame(bi_ngram = names(bi_matrix), freq = bi_matrix)
# Trigram frequencies
sample_vcorp.trigrams <- removeSparseTerms(
    TermDocumentMatrix(sample_vcorp, control = list(tokenize = TrigramTokenizer)), 0.9999)
tri_matrix <- sort(rowSums(as.matrix(sample_vcorp.trigrams)), decreasing = TRUE)
tri_ngram <- data.frame(tri_ngram = names(tri_matrix), freq = tri_matrix)
Let's plot the frequencies of the unigrams.
ggplot(as.data.frame(head(uni_ngram, 20)),
       aes(y = freq, x = reorder(uni_ngram, -freq), fill = freq, label = freq)) +
    geom_bar(stat = "identity") +
    geom_label(aes(fill = freq), size = 3, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1, hjust = 1)) +
    ggtitle("Top 20 Unigrams") + xlab("Unigrams") + ylab("Frequencies") +
    guides(fill = FALSE) +
    theme(plot.title = element_text(hjust = 0.5)) +
    coord_flip()
Plot the frequencies of the bigrams.
ggplot(as.data.frame(head(bi_ngram, 20)),
       aes(y = freq, x = reorder(bi_ngram, -freq), fill = freq, label = freq)) +
    geom_bar(stat = "identity") +
    geom_label(aes(fill = freq), size = 3, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1, hjust = 1)) +
    ggtitle("Top 20 Bigrams") + xlab("Bigrams") + ylab("Frequencies") +
    guides(fill = FALSE) +
    theme(plot.title = element_text(hjust = 0.5)) +
    coord_flip()
Plot the frequencies of the trigrams.
ggplot(as.data.frame(head(tri_ngram, 20)),
       aes(y = freq, x = reorder(tri_ngram, -freq), fill = freq, label = freq)) +
    geom_bar(stat = "identity") +
    geom_label(aes(fill = freq), size = 3, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1, hjust = 1)) +
    ggtitle("Top 20 Trigrams") + xlab("Trigrams") + ylab("Frequencies") +
    guides(fill = FALSE) +
    theme(plot.title = element_text(hjust = 0.5)) +
    coord_flip()
Based on the analysis above, we have a better understanding of the frequencies of the words used and a good basis for building an n-gram model for our predictive algorithm. We can extend the model to four-grams and five-grams as necessary; a sketch of that extension is shown below.
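The following sketch shows how a four-gram table could be built following the same pattern as the trigram code above (hypothetical names, not run as part of this report):
# Hypothetical extension: four-gram tokenizer and frequency table, same pattern as above
QuadgramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
sample_vcorp.quadgrams <- removeSparseTerms(
    TermDocumentMatrix(sample_vcorp, control = list(tokenize = QuadgramTokenizer)), 0.9999)
quad_matrix <- sort(rowSums(as.matrix(sample_vcorp.quadgrams)), decreasing = TRUE)
quad_ngram <- data.frame(quad_ngram = names(quad_matrix), freq = quad_matrix)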
The next steps for this project are to create a predictive algorithm and deploy it as a Shiny app.
The predictive algorithm will be based on an n-gram model, i.e., a Markov probabilistic model that assigns a probability to each word depending on the previous (N-1) words, with a frequency lookup that predicts the next word from its frequency in the trigram and bigram tables, by summarizing token frequencies and finding associations between tokens. I might also handle unseen words by applying smoothing techniques that assign them non-zero probabilities.
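As an illustration of the frequency lookup combined with one possible smoothing technique (add-one/Laplace smoothing), a hypothetical helper could estimate the probability of a word given the previous two words from the tables built above:
# Hypothetical helper: add-one (Laplace) smoothed trigram probability P(w3 | w1, w2),
# using the tri_ngram, bi_ngram and uni_ngram frequency tables built above
trigram_prob <- function(w1, w2, w3) {
    tri_count <- tri_ngram$freq[as.character(tri_ngram$tri_ngram) == paste(w1, w2, w3)]
    bi_count  <- bi_ngram$freq[as.character(bi_ngram$bi_ngram) == paste(w1, w2)]
    if (length(tri_count) == 0) tri_count <- 0
    if (length(bi_count) == 0)  bi_count  <- 0
    V <- nrow(uni_ngram)                 # vocabulary size taken from the unigram table
    (tri_count + 1) / (bi_count + V)     # unseen trigrams still get non-zero probability
}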
I'll use a simple backoff strategy: if there is no suggestion for the last four words (using 5-grams and their associated probabilities), then use the last three, if not, the last two, if not, the last one, and, if still not, use the highest 1-gram probabilities (“the”, “end”, etc.).
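A rough sketch of that backoff, limited to the tables already built in this report (so it falls back trigram → bigram → top unigram rather than starting from 5-grams), might look like the following; predict_next() and its internals are hypothetical names:
# Hypothetical sketch of the backoff lookup: try the trigram table first,
# then the bigram table, then fall back to the most frequent unigram
predict_next <- function(phrase) {
    toks <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
    lookup <- function(tbl, col, prefix) {
        terms <- as.character(tbl[[col]])
        hits  <- tbl[startsWith(terms, paste0(prefix, " ")), ]
        if (nrow(hits) == 0) return(NA_character_)
        best <- as.character(hits[[col]][which.max(hits$freq)])
        substring(best, nchar(prefix) + 2)   # keep only the predicted word
    }
    out <- NA_character_
    if (length(toks) == 2) out <- lookup(tri_ngram, "tri_ngram", paste(toks, collapse = " "))
    if (is.na(out) && length(toks) >= 1) out <- lookup(bi_ngram, "bi_ngram", tail(toks, 1))
    if (is.na(out)) out <- as.character(uni_ngram$uni_ngram[1])   # most frequent unigram
    out
}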
We will develop a data product, i.e., a Shiny app with a simple interface: a text input box that predicts the next word based on the user's input.
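A minimal sketch of such an app, assuming the hypothetical predict_next() helper above and the n-gram tables are loaded in the app's environment:
# Minimal Shiny sketch (assumes predict_next() and the n-gram tables are available)
library(shiny)
ui <- fluidPage(
    titlePanel("Next-Word Prediction"),
    textInput("phrase", "Enter a phrase:"),
    textOutput("prediction")
)
server <- function(input, output) {
    output$prediction <- renderText({
        if (nchar(trimws(input$phrase)) == 0) return("")
        predict_next(input$phrase)
    })
}
shinyApp(ui = ui, server = server)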