library(tm)
library(SnowballC)
library(RWeka)
library(slam)
library(wordcloud)
library(stringi)
library(RColorBrewer)
library(ggplot2)
For this project we use the SwiftKey English data set. We first check whether the project archive Coursera-Swiftkey.zip exists locally, and download and unzip it if necessary.
#Check for zip file and download if necessary
if (!file.exists("Coursera-Swiftkey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "Coursera-Swiftkey.zip", mode = "wb")
}
#Check for data file and unzip if necessary
if (!file.exists("final/en_US/en_US.blogs.txt")) {
  # Extract in place: the archive already contains the final/en_US/ tree.
  # (list = TRUE would only list the contents without extracting anything.)
  unzip("Coursera-Swiftkey.zip")
}
#Import blogs and twitter datasets in text mode
blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)
# import the news dataset in binary mode
con <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, skipNul = TRUE)
close(con)
rm(con)
##      File   Lines     Chars TotalWords
## 1   blogs  899288 208361438   37865888
## 2    news 1010242 203791405   34678691
## 3 twitter 2360148 162385035   30578933
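The code that produced this table is not shown. A minimal sketch of how such a summary could be computed with stringi, assuming characters are counted with stri_length and words with stri_count_words (the helper summarise_text and the toy input are hypothetical, and the exact counting method behind the figures above may differ):

```r
library(stringi)

# Hypothetical helper: one summary row per text source
summarise_text <- function(name, lines) {
  data.frame(
    File = name,
    Lines = length(lines),                      # number of lines read
    Chars = sum(stri_length(lines)),            # total characters over all lines
    TotalWords = sum(stri_count_words(lines)),  # total words over all lines
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for the real blogs vector
toy_blogs <- c("Hello world", "A second line here")
toy_stats <- summarise_text("blogs", toy_blogs)
print(toy_stats)
```

With the real data, the three rows could be combined with rbind before blogs, news and twitter are removed from the environment.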
As the table above shows, each imported file contains a very large number of lines, words and characters. For the subsequent preprocessing and other operations we therefore sample about 1% of the lines from each file.
# Select a random 1% of lines, uniformly and without replacement.
# (rbinom(n, length(x), .5) would draw indices clustered around the middle of each file.)
set.seed(123)
blogs_sample <- sample(blogs, size = round(length(blogs) * 0.01))
twitter_sample <- sample(twitter, size = round(length(twitter) * 0.01))
news_sample <- sample(news, size = round(length(news) * 0.01))
#Clean up the global environment
rm(blogs, news, twitter)
blogs_source <- VectorSource(blogs_sample)
blogs_corpus <- VCorpus(blogs_source)
news_source <- VectorSource(news_sample)
news_corpus <- VCorpus(news_source)
twitter_source <- VectorSource(twitter_sample)
twitter_corpus <- VCorpus(twitter_source)
Raw text can cause significant problems in text mining, so the data must first be pre-processed with common transformation and filtering functions. In the following we use a function ‘clean_corpus’ that takes a corpus, applies a series of such transformations one by one, and returns the cleaned corpus.
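The definition of clean_corpus is not included here. A minimal sketch of what such a function could look like, assuming typical tm transformations (lower-casing, removal of punctuation, numbers and English stopwords, whitespace collapsing); the choice and order of steps are assumptions, not necessarily those used for the results in this report:

```r
library(tm)

# Hypothetical sketch of clean_corpus; the exact steps are an assumption
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))        # lower-case all text
  corpus <- tm_map(corpus, removePunctuation)                   # drop punctuation marks
  corpus <- tm_map(corpus, removeNumbers)                       # drop digits
  corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop common stopwords
  corpus <- tm_map(corpus, stripWhitespace)                     # collapse repeated spaces
  corpus
}

# Toy usage on a one-document corpus
toy_corpus <- VCorpus(VectorSource("The  Cat, sat 3 times!"))
toy_clean <- clean_corpus(toy_corpus)
print(content(toy_clean[[1]]))
```

Wrapping tolower in content_transformer keeps each document a proper tm text document rather than a bare character vector.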
blogs_clean <- clean_corpus(blogs_corpus)
twitter_clean <- clean_corpus(twitter_corpus)
news_clean <- clean_corpus(news_corpus)
full_clean <- c(blogs_clean, twitter_clean, news_clean, recursive = FALSE)
unitokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
bitokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
tritokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
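To illustrate what these tokenizers produce, here is a small base R sketch that splits a string into n-grams. It is only an illustration: make_ngrams is a hypothetical helper that splits on whitespace, whereas the RWeka tokenizer uses its own delimiter rules.

```r
# Hypothetical helper: all n-grams of consecutive whitespace-separated words
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))   # text too short for an n-gram
  starts <- seq_len(length(words) - n + 1)      # start position of each n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_bigrams <- make_ngrams("the quick brown fox", 2)
toy_trigrams <- make_ngrams("the quick brown fox", 3)
print(toy_bigrams)
print(toy_trigrams)
```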
blogs_tdm <- TermDocumentMatrix(blogs_clean, control = list(tokenize = unitokenizer))
news_tdm <- TermDocumentMatrix(news_clean, control = list(tokenize = unitokenizer))
twitter_tdm <- TermDocumentMatrix(twitter_clean, control = list(tokenize = unitokenizer))
uni_tdm <- TermDocumentMatrix(full_clean, control = list(tokenize = unitokenizer))
bi_tdm <- TermDocumentMatrix(full_clean, control = list(tokenize = bitokenizer))
tri_tdm <- TermDocumentMatrix(full_clean, control = list(tokenize = tritokenizer))
An interesting observation is that the most frequent words differ for each source: blogs, news and twitter. Whether this needs to be taken into account when creating the n-gram model later remains to be decided.
A second interesting observation is that the relative frequencies of the top unigrams are quite low (less than 1% for the top unigram ‘will’). The relative frequencies fall drastically as we move to bigrams and trigrams. This sparsity has implications for future modeling: most bigrams and trigrams are rare, so the model will need a way to handle n-grams that were seen only a few times or not at all.
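The counts and relative frequencies discussed above can be read off a term-document matrix by summing its rows. A minimal sketch using slam::row_sums, which works on the sparse matrix without converting it to a dense one (top_terms and the toy corpus are hypothetical names used for illustration):

```r
library(tm)
library(slam)

# Hypothetical helper: top n terms with absolute and relative frequency
top_terms <- function(tdm, n = 10) {
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # total count per term
  top <- head(freq, n)
  data.frame(
    term = names(top),
    count = as.integer(top),
    rel_freq = as.numeric(top) / sum(freq),             # share of all term occurrences
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy corpus standing in for the cleaned samples
toy_corpus <- VCorpus(VectorSource(c("good day good night", "good morning")))
toy_tdm <- TermDocumentMatrix(toy_corpus)
toy_top <- top_terms(toy_tdm, 3)
print(toy_top)
```

The same helper could be applied to uni_tdm, bi_tdm and tri_tdm to compare how quickly relative frequencies drop with n-gram order.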
The final prediction algorithm will be created as an n-gram model that predicts the next item in a sequence in the form of an (n-1)-order Markov model. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability: with larger n, a model can store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently.
The 2-gram and 3-gram frequency tables calculated above (and higher-order n-grams) will be used to train the model under an independence assumption, so that each word depends only on the last n − 1 words. This Markov model is used as an approximation of the true underlying language.
In a simple n-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.) can be described as following a categorical distribution (often imprecisely called a “multinomial distribution”).
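As a sketch of this idea for the bigram case, the conditional probability of a word given the previous word can be estimated as count(previous word, word) / count(previous word). The counts below are made-up toy values, not results from the data set:

```r
# Toy bigram counts (hypothetical values for illustration only)
bigram_counts <- c("i am" = 3, "i have" = 1, "you are" = 2)

# Split each bigram into its history (first word) and the word that follows
parts <- strsplit(names(bigram_counts), " ", fixed = TRUE)
bigrams <- data.frame(
  history = vapply(parts, `[`, character(1), 1),
  word = vapply(parts, `[`, character(1), 2),
  count = as.numeric(bigram_counts),
  stringsAsFactors = FALSE
)

# Maximum-likelihood estimate: count(history, word) / count(history)
bigrams$prob <- bigrams$count / ave(bigrams$count, bigrams$history, FUN = sum)
print(bigrams)
```

For each history the probabilities sum to one, which is what makes this a categorical distribution over the next word.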
In practice, the probability distributions will be smoothed by assigning non-zero probabilities to unseen words or infrequent n-grams.
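The smoothing method is not yet fixed. As one simple possibility, add-k (Laplace) smoothing adds a small constant to every count before normalising, so that unseen words receive a non-zero probability. The sketch below assumes a known vocabulary; smooth_add_k, the vocabulary and the counts are hypothetical:

```r
# Hypothetical helper: add-k smoothed probabilities over a fixed vocabulary
smooth_add_k <- function(counts, vocab, k = 1) {
  full <- setNames(numeric(length(vocab)), vocab)  # start every word at count 0
  full[names(counts)] <- counts                    # fill in the observed counts
  (full + k) / (sum(full) + k * length(vocab))     # add k to each count, renormalise
}

vocab <- c("am", "have", "are", "will")
counts_after_i <- c(am = 3, have = 1)              # toy counts of words seen after "i"
smoothed <- smooth_add_k(counts_after_i, vocab)
print(smoothed)
```

More refined schemes such as Good-Turing or Kneser-Ney discounting follow the same principle of moving some probability mass from seen to unseen events.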
Finally, a simple Shiny app will be created that
If there is a match, then the app will look for the maximum probability word that follows the n-gram. If there is no match, then it will check the (n-1)-grams, and then the (n-2)-grams and so on. At each step, the app will look for a match, and if there is a match, the app will identify the word with the highest probability of occurring next, using the smoothed frequency matrix. This is a backoff model approach.
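The backoff lookup described above can be sketched as follows. The table layout (one list per history length, mapping a history string to a named vector of smoothed probabilities), the function predict_next and the fallback word are all assumptions made for illustration; the actual app may store its n-grams differently.

```r
# Hypothetical backoff predictor: try the longest history first, then shorter ones
predict_next <- function(input, tables, unigram_fallback) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  max_k <- min(length(tables), length(words))
  for (k in rev(seq_len(max_k))) {                 # k = number of history words used
    history <- paste(tail(words, k), collapse = " ")
    candidates <- tables[[k]][[history]]           # NULL if this history was never seen
    if (!is.null(candidates)) {
      return(names(candidates)[which.max(candidates)])
    }
  }
  unigram_fallback                                 # no match at any order
}

# Toy tables: element k holds histories of k words (values are made up)
tables <- list(
  list("new" = c(york = 0.6, year = 0.4),
       "happy" = c(birthday = 0.7, new = 0.3)),
  list("happy new" = c(year = 0.9, day = 0.1))
)

pred_trigram <- predict_next("Happy new", tables, "the")    # matched on two words
pred_backoff <- predict_next("brand new", tables, "the")    # backs off to one word
pred_default <- predict_next("hello world", tables, "the")  # falls back to the default
```

Storing the tables by history length keeps each lookup a simple name match, which should matter for responsiveness in a Shiny app.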