The goal of this milestone report is to present an exploratory analysis of the training data and to describe the goals of the text prediction app. I also report summary statistics from the dataset.
For this Capstone Project, I’ll build a Shiny app that suggests words based on text entered by the user. The app will rely on a probabilistic model, based on Markov chains and trigrams, that derives word suggestions from a training set of word data drawn from blogs, news sites, and tweets.
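As a sketch of the underlying idea (the final implementation may differ in its details), the model estimates the probability of a word w3 following the two preceding words w1 and w2 from trigram counts in the training data, P(w3 | w1, w2) ≈ count(w1 w2 w3) / count(w1 w2), and suggests the w3 that maximizes this estimate.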
library(tm)
library(wordcloud)
setwd("/home/antonio/Downloads/Coursera-Swiftkey/en_US")
con1 = file("en_US.news.txt", "r")
con2 = file("en_US.blogs.txt", "r")
con3 = file("en_US.twitter.txt", "r")
For convenience, we use a subset of the training data during this exploratory analysis.
newsData <- readLines(con1, n = 2500, encoding = "UTF-8", warn = FALSE)
blogsData <- readLines(con2, n = 2500, encoding = "UTF-8", warn = FALSE)
tweetsData <- readLines(con3, n = 2500, encoding = "UTF-8", warn = FALSE)
close(con1)
close(con2)
close(con3)
Since all three datasets (blogs, news, and tweets) will be used to train the text prediction model, I merge them before exploring the data.
# merging data
newsData <- paste(newsData, collapse = " ")
blogsData <- paste(blogsData, collapse = " ")
tweetsData <- paste(tweetsData, collapse = " ")
textData <- paste(tweetsData, blogsData, newsData)
# a glimpse of the content before cleaning
str(textData)
## chr "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long. When you meet som"| __truncated__
Punctuation, extra whitespace, and numbers will be omitted from the text used to build the prediction model. I use the tm package to clean the data.
# setting up source and corpus
review_source_text <- VectorSource(textData)
textCorpus <- Corpus(review_source_text)
# convert to lowercase
textCorpus <- tm_map(textCorpus, content_transformer(tolower))
# remove all punctuation
textCorpus <- tm_map(textCorpus, removePunctuation)
# remove numbers
textCorpus <- tm_map(textCorpus, removeNumbers)
# strip extra whitespace
textCorpus <- tm_map(textCorpus, stripWhitespace)
# a second version of the corpus has stopwords removed; the main corpus keeps
# them, since stopwords are often useful while completing typed text
textCorpusNoStopwords <- tm_map(textCorpus, removeWords, stopwords("english"))
#Document term matrix
dtm <- DocumentTermMatrix(textCorpus)
dtm2 <- as.matrix(dtm)
# find the most frequent words (stopwords included)
frequency <- sort(colSums(dtm2), decreasing=TRUE)
frequency[1:10]
## the and that for with you was this have but
## 10767 5675 2373 2324 1650 1619 1413 1130 1121 1110
# document term matrix without stopwords
dtmNoStopwords <- DocumentTermMatrix(textCorpusNoStopwords)
dtm2NoStopwords <- as.matrix(dtmNoStopwords)
# find the most frequent words (stopwords excluded)
frequencyNoStopwords <- sort(colSums(dtm2NoStopwords), decreasing=TRUE)
frequencyNoStopwords[1:10]
## said will one like just can time new get now
## 756 638 631 579 578 491 444 415 409 365
words <- names(frequency)
wordsNoStopwords <- names(frequencyNoStopwords)
Trigrams are groups of three adjacent words and are derived from the training set. The more frequent a trigram is, the higher its probability of being offered as a suggestion to complete a phrase.
library("RWeka")
TrigramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 3, max = 3))
trigramDtm <- DocumentTermMatrix(Corpus(VectorSource(textData)),
control = list(tokenize = TrigramTokenizer))
trigramDtm2 <- as.matrix(trigramDtm)
#find the most frequent trigrams
trigramFrequency <- sort(colSums(trigramDtm2), decreasing=TRUE)
trigramFrequency[1:12]
## a lot of one of the i don t i can t it s not i didn t
## 83 80 79 45 43 42
## it s a going to be i m not the u s don t know i want to
## 42 40 38 37 36 35
trigrams <- names(trigramFrequency)
Word clouds provide a visually intuitive way to explore the training dataset: the most common words appear larger.
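The word clouds can be generated with the wordcloud package loaded above; a minimal sketch of the calls, using the frequency vectors computed earlier, looks like this (the plot parameters are illustrative):
# word cloud including stopwords
wordcloud(words, frequency, max.words = 100, random.order = FALSE)
# word cloud excluding stopwords
wordcloud(wordsNoStopwords, frequencyNoStopwords, max.words = 100, random.order = FALSE)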
Trigrams will form the basis of the Markov prediction model.
Although stopwords are usually filtered out during text analysis, they will be kept in the training set because they appear so frequently in ordinary typing, and a text predictor should handle common phrases with minimal response time.
The exploration without stopwords is included only out of curiosity; stopwords will remain part of the data used by the Shiny app.
The next steps are to implement a Markov chain for text prediction using bigrams or trigrams, and to add a profanity filter that excludes curse words from the training set.
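As a rough sketch of how the trigram counts could drive the prediction (the names and details here are illustrative, not the final implementation), each trigram can be split into a two-word prefix and the word that follows it, giving a lookup table for the Markov model:
# split each trigram into a two-word prefix and the word that follows it
trigramParts <- strsplit(trigrams, " ")
prefixes <- sapply(trigramParts, function(x) paste(x[1], x[2]))
nextWords <- sapply(trigramParts, function(x) x[3])
# for each prefix keep the next word with the highest trigram count
# (trigramFrequency is already sorted in decreasing order)
lookup <- tapply(nextWords, prefixes, function(x) x[1])
# hypothetical prediction function: suggest the most likely next word for
# the last two words typed, or NA when the prefix was not seen in training
predictNextWord <- function(phrase) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  prefix <- paste(tail(tokens, 2), collapse = " ")
  unname(lookup[prefix])
}
predictNextWord("I want")  # with this sample, expected to suggest "to"
The profanity filter mentioned above can reuse the removeWords transformation from the cleaning step, applied with a list of curse words instead of stopwords.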