Have you ever wondered how your smartphone corrects your spelling and even suggests the next word as you type a text message? Spelling correction and type-ahead suggestions are based on prediction techniques that combine statistics with grammar rules (language syntax).
This report presents the steps in building a word prediction app as part of the Coursera/Johns Hopkins Data Science Specialization capstone.
A prediction algorithm is built from patterns observed in a corpus of previously collected documents. The steps to build the algorithm are summarized below:
* Data acquisition
* Cleaning and Transformation
* Slicing and Sampling
* Modeling (n-gram model)
* Predictive algorithm
The data comes from HC Corpora, a free collection of corpora available for learning and research purposes. See the readme file at About Corpus for details on the corpora available.
The RMD file looks for the data files in the folder where it is located. If the data files are not present, the code downloads them, saves the archive in zip format, and expands the files into the sub-folder ./final. If you are running the code from the R console, please set the working directory to the directory containing the .RMD file using setwd().
# To ensure reproducible results, download the files from the source.
# If the data file does not already exist, download it and save the zip file.
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(
    "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    destfile = "Coursera-SwiftKey.zip", method = "curl")
}
# expand the zip file; old files will be overwritten
unzip("Coursera-SwiftKey.zip", overwrite = TRUE)
The corpora contain data files in four languages. We are interested in the English tweets, blogs and news articles stored in the en_US sub-folder.
Let's summarize how much data is present in the corpus, by source.
# read the three English source files
news   <- readLines(paste(DataLocation, "/", "en_US.news.txt",    sep=""), encoding='UTF-8', skipNul=TRUE)
blogs  <- readLines(paste(DataLocation, "/", "en_US.blogs.txt",   sep=""), encoding='UTF-8', skipNul=TRUE)
tweets <- readLines(paste(DataLocation, "/", "en_US.twitter.txt", sep=""), encoding='UTF-8', skipNul=TRUE)
# how many characters in each document?
tweets_char <- nchar(tweets) # number of characters per tweet
blogs_char  <- nchar(blogs)  # number of characters per blog post
news_char   <- nchar(news)   # number of characters per news article
# how many documents in each file?
news_num   <- length(news)
blogs_num  <- length(blogs)
tweets_num <- length(tweets)
# how many words in each file?
news_Words    <- sum(stri_count_words(news))
blogs_Words   <- sum(stri_count_words(blogs))
twitter_Words <- sum(stri_count_words(tweets))
# file sizes in MB
blogs_file_size  <- round(file.size(paste(DataLocation, "/", "en_US.blogs.txt",   sep=""))/1024^2, digits = 1)
news_file_size   <- round(file.size(paste(DataLocation, "/", "en_US.news.txt",    sep=""))/1024^2, digits = 1)
tweets_file_size <- round(file.size(paste(DataLocation, "/", "en_US.twitter.txt", sep=""))/1024^2, digits = 1)
# assemble the summary table (all vectors ordered blogs, news, twitter)
filenames  <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
filesize   <- c(blogs_file_size, news_file_size, tweets_file_size) # file sizes in MB
wordCounts <- c(blogs_Words, news_Words, twitter_Words)
numDocs    <- c(blogs_num, news_num, tweets_num)
dataSummary <- data.frame(filenames, filesize, numDocs, wordCounts)
colnames(dataSummary) <- c("Source", "Size (MB)", "Documents", "Words")
# print the table
kable(dataSummary, format = "markdown")
| Source | Size (MB) | Documents | Words |
|---|---|---|---|
| en_US.blogs.txt | 200.4 | 899288 | 37546246 |
| en_US.news.txt | 196.3 | 1010242 | 34762395 |
| en_US.twitter.txt | 159.4 | 2360148 | 30093410 |
For the training exercise, I am using a 3% sample of each source in the corpora.
# read the profanity word list; it is used later to filter offensive words
profanity <- readLines(paste(getwd(), "/", "Bad-Words-master/profanity.txt", sep=""), encoding='UTF-8', skipNul=TRUE)
set.seed(80) # fixed seed so the sampling is reproducible
SampleNews   <- sample(news,   round(0.03 * news_num))   # 3% of the news as a training sample
SampleBlogs  <- sample(blogs,  round(0.03 * blogs_num))  # 3% of the blogs
SampleTweets <- sample(tweets, round(0.03 * tweets_num)) # 3% of the tweets
# combine the sample vectors into a single character vector
sampleData <- c(SampleBlogs, SampleNews, SampleTweets)
# write the combined sample to disk for the cleaning step
if (!dir.exists("./TrainingData")) dir.create("./TrainingData")
writeLines(sampleData, "./TrainingData/TrainingData.txt")
For the purposes of Natural Language Processing (NLP), we don't need punctuation, extra white space and similar noise. The cleaning steps are (see the code sketch after this list):
* strip extra white space
* convert all words to lower case
* remove stop words, which carry little information for prediction, e.g. the, at, is, which and on
* remove all punctuation
* remove all numbers
* remove profanity and swear words, using the list downloaded from https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
* remove suffixes from the words (stemming), e.g. -es, -ed, -s
* convert the documents to plain text
* remove special characters/foreign words
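The cleaning itself is done with the tm package. The following is a minimal sketch of these transformations, not the exact chunk used for the report; it assumes the sample file written to ./TrainingData above and names the resulting corpus SwiftKey so the tokenization code below can use it.
# Sketch of the cleaning pipeline (assumption: corpus built from ./TrainingData and named SwiftKey)
library(tm)
SwiftKey <- VCorpus(DirSource("./TrainingData"),
                    readerControl = list(language = "en"))
dropNonASCII <- content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = ""))
SwiftKey <- tm_map(SwiftKey, dropNonASCII)                       # special/foreign characters
SwiftKey <- tm_map(SwiftKey, content_transformer(tolower))       # lower case
SwiftKey <- tm_map(SwiftKey, removeNumbers)                      # numbers
SwiftKey <- tm_map(SwiftKey, removePunctuation)                  # punctuation
SwiftKey <- tm_map(SwiftKey, removeWords, stopwords("english"))  # stop words
SwiftKey <- tm_map(SwiftKey, removeWords, profanity)             # profanity list read earlier
SwiftKey <- tm_map(SwiftKey, stemDocument)                       # stemming via SnowballC
SwiftKey <- tm_map(SwiftKey, stripWhitespace)                    # extra white space
Because DirSource reads the files as plain-text documents by default, the "convert to plain text" step is already covered by the way the corpus is constructed.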
After the data is cleaned up, we convert the words into individual tokens and record them against the documents they appear in, producing a document-term matrix (DTM). Let's then map the frequency of the words. Instead of a histogram, which bins values on a numeric scale, I am using a bar plot so the words themselves can serve as labels.
I am using the RWeka and tm packages to tokenize the content of the corpus into 1-, 2- and 3-word sequences (n-grams).
We build a word cloud to show the most frequently occurring single words.
Unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# pass the unigram tokenizer explicitly so the DTM uses the tokenizer defined above
UniDoc <- DocumentTermMatrix(SwiftKey, control = list(tokenize = Unigram))
UniDoc.matrix <- as.matrix(UniDoc)
frequency <- colSums(UniDoc.matrix)
frequency <- sort(frequency,decreasing=TRUE)
UniGramFrequency <- data.frame(word=names(frequency),freq=frequency)
#build a wordcloud
colspectrum <- brewer.pal(6, "Dark2")
wordcloud(names(frequency), frequency, max.words=50, rot.per=0.1, colors=colspectrum)
Next, we build bigram tokens (two-word sequences).
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
biDTM <- DocumentTermMatrix(SwiftKey, control = list(tokenize = BigramTokenizer))
biDTM2 <- as.matrix(biDTM)
frequency <- colSums(biDTM2)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
#show the first few most used word pairs
BiGramFrequency <- data.frame(word=names(frequency),freq=frequency)
BiGramFrequency %>%
ggplot(., aes(x=reorder(word, -freq),freq)) +
geom_bar(stat="identity",colour="blue",fill="lightgreen") +
ggtitle("Bigrams with the highest frequencies") +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Lastly, we tokenize three-word sequences (trigrams).
TrigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
TriDTM <- DocumentTermMatrix(SwiftKey, control = list(tokenize = TrigramTokenizer))
TriDTM2 <- as.matrix(TriDTM)
frequency <- colSums(TriDTM2)
frequency <- sort(frequency,decreasing=TRUE)
frequency <- head(frequency,8)
TriGramFrequency <- data.frame(word=names(frequency),freq=frequency)
#show the first few most used word groups
head(TriGramFrequency, 10)
## word freq
## new york city new york city 147
## two years ago two years ago 130
## cant wait see cant wait see 129
## president barack obama president barack obama 112
## happy mothers day happy mothers day 107
## new york times new york times 95
## caprera hotel venice caprera hotel venice 84
## hotel venice italy hotel venice italy 84
TriGramFrequency %>%
ggplot(., aes(x=reorder(word, -freq),freq)) +
geom_bar(stat="identity",colour="blue",fill="lightgreen") +
ggtitle("Trigrams with the highest frequencies") +
xlab("Trigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
I plan to use the n-gram data as the foundation for further analysis. We will also build quadgram tokens and use them in the prediction model. After this basic analysis, the goal is to build several prediction algorithms based on popular models, for example backing off from higher-order to lower-order n-grams when a longer phrase has not been seen. Finally, the prediction algorithm will be packaged as an app and hosted on shinyapps.io.
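As an illustration of the back-off idea only (not the final model), the sketch below assumes full, untruncated TriGramFrequency and BiGramFrequency tables; the helper name predictNextWord is hypothetical and introduced here for the example.
# Hypothetical back-off sketch: look up the last two words in the trigram table,
# then fall back to the bigram table keyed on the last word.
predictNextWord <- function(phrase,
                            tri = TriGramFrequency,
                            bi  = BiGramFrequency) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(tokens)
  lastWordOf <- function(ngram) tail(strsplit(as.character(ngram), " ")[[1]], 1)
  if (n >= 2) {
    prefix <- paste(tokens[n - 1], tokens[n])
    hits <- tri[grepl(paste0("^", prefix, " "), tri$word), ]
    if (nrow(hits) > 0) return(lastWordOf(hits$word[which.max(hits$freq)]))
  }
  # back off to bigrams that start with the last word
  hits <- bi[grepl(paste0("^", tokens[n], " "), bi$word), ]
  if (nrow(hits) > 0) return(lastWordOf(hits$word[which.max(hits$freq)]))
  NA_character_  # no prediction available
}
Given the trigram counts shown above, predictNextWord("happy mothers") would return "day".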
Reproducibility instructions: this report uses the following R packages. Please install them with install.packages() and load them before knitting; a short setup snippet follows the list.
1. require(tm); require(RWeka); require(ggplot2); require(wordcloud); require(RColorBrewer); require(xtable); require(knitr); require(SnowballC); require(stringi); require(R.utils)
2. For reproducibility, set the seed to 80 using the set.seed() command.
3. The sampling size is 3% of each source.
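For convenience, the setup can be scripted as follows (a sketch based only on the package list and seed above):
# install any missing packages, load them all, and fix the seed used for sampling
pkgs <- c("tm", "RWeka", "ggplot2", "wordcloud", "RColorBrewer",
          "xtable", "knitr", "SnowballC", "stringi", "R.utils")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, require, character.only = TRUE))
set.seed(80)  # reproducible 3% sampling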