The document at hand is the milestone report for Week 2 of the Coursera Data Science Capstone project. The report presents the results of an exploratory data analysis of natural language data from three sources, i.e. news, blogs, and tweets. The results set the baseline for the subsequent development of a prediction algorithm embedded in a Shiny app.
The data to be used is downloaded from the provided URL and stored locally.
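A hedged sketch of this step is shown below; the URL is left as a placeholder for the link given in the course instructions, and the extracted en_US.*.txt files are assumed to be placed in the working directory afterwards.
# Sketch only: download and unpack the data set (URL is a placeholder)
data_url <- "<provided URL>"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(data_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}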
We first load the necessary packages tm and RWeka (text mining) as well as ggplot2 for the plots.
library(tm)
## Loading required package: NLP
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Then we read in the data from the provided flat files.
blogs <-
readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <-
readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <-
readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
Next we generate an overview table for metrics that describe the loaded files.
summary <- data.frame(
'File' = c("Blogs", "News", "Twitter"),
"File Size" = sapply(list(blogs, news, twitter), function(x) {
format(object.size(x), "MB")
}),
'NoEntries' = sapply(list(blogs, news, twitter), function(x) {
length(x)
}),
'TotalNoCharacters' = sapply(list(blogs, news, twitter), function(x) {
sum(nchar(x))
}),
'MaxCharacters' = sapply(list(blogs, news, twitter), function(x) {
max(nchar(x))
})
)
summary
## File File.Size NoEntries TotalNoCharacters MaxCharacters
## 1 Blogs 248.5 Mb 899288 206824505 40833
## 2 News 19.2 Mb 77259 15639408 5760
## 3 Twitter 301.4 Mb 2360148 162096031 140
To prevent encoding issues at a later stage, the data is converted from Latin-1 to ASCII; characters without an ASCII equivalent are dropped.
blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")
For the sampling we use 1% of the lines in each file; reproducibility is ensured by setting a seed.
set.seed(519) # assure reproducibility of sampling
sample_data <- c(
sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01)
)
After creating the corpus with the tm package, we apply several of its transformation functions to clean and streamline the data.
corpus <- VCorpus(VectorSource(sample_data)) # create corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # transform to lower case (content_transformer keeps documents as PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove extra whitespace
corpus <- tm_map(corpus, removeWords, stopwords("english")) # remove English stopwords
Next we apply the RWeka package to generate the n-grams, i.e. unigrams, bigrams, and trigrams.
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
In the next step, term-document matrices, an essential component of text mining, are created.
uni_matrix <-
TermDocumentMatrix(corpus, control = list(tokenize = uni_tokenizer))
bi_matrix <-
TermDocumentMatrix(corpus, control = list(tokenize = bi_tokenizer))
tri_matrix <-
TermDocumentMatrix(corpus, control = list(tokenize = tri_tokenizer))
We then extract all unigrams and bigrams occurring at least 50 times as well as all trigrams occurring at least 10 times. The counts are written into data frames, ordered by frequency, and finally visualized in bar plots.
uni_corpus <- findFreqTerms(uni_matrix, lowfreq = 50)
bi_corpus <- findFreqTerms(bi_matrix, lowfreq = 50)
tri_corpus <- findFreqTerms(tri_matrix, lowfreq = 10)
# Count N-gram-frequencies and write results into a data frame
uni_corpus_freq <- rowSums(as.matrix(uni_matrix[uni_corpus, ]))
uni_corpus_freq <-
data.frame(word = names(uni_corpus_freq), frequency = uni_corpus_freq)
bi_corpus_freq <- rowSums(as.matrix(bi_matrix[bi_corpus, ]))
bi_corpus_freq <-
data.frame(word = names(bi_corpus_freq), frequency = bi_corpus_freq)
tri_corpus_freq <- rowSums(as.matrix(tri_matrix[tri_corpus, ]))
tri_corpus_freq <-
data.frame(word = names(tri_corpus_freq), frequency = tri_corpus_freq)
# Re-order n-gram data frames by frequency
uni_corpus_freq <-
uni_corpus_freq[order(-uni_corpus_freq$frequency), ]
bi_corpus_freq <- bi_corpus_freq[order(-bi_corpus_freq$frequency), ]
tri_corpus_freq <-
tri_corpus_freq[order(-tri_corpus_freq$frequency), ]
# Keep the top 20 of each n-gram type for plotting
unigrams <- head(uni_corpus_freq, 20)
bigrams <- head(bi_corpus_freq, 20)
trigrams <- head(tri_corpus_freq, 20)
# Plots
unigramsplot <- ggplot(data = unigrams, aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", color = "blue", fill = "white") + theme_minimal()
unigramsplot <- unigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
  labs(title = "Unigrams - Most Frequently Used Words")
unigramsplot
bigramsplot <- ggplot(data = bigrams, aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", color = "blue", fill = "white") + theme_minimal()
bigramsplot <- bigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
  labs(title = "Bigrams - Most Frequently Used Words")
bigramsplot
trigramsplot <- ggplot(data = trigrams, aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", color = "blue", fill = "white") + theme_minimal()
trigramsplot <- trigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
  labs(title = "Trigrams - Most Frequently Used Words")
trigramsplot
Question 1) How can you efficiently store an n-gram model (think Markov Chains)? A Markov chain is a stochastic process that fulfils the so-called Markov property (sometimes also referred to as memorylessness). The Markov property is fulfilled if the future state of the process (conditional on both past and present states) depends only on the present state, not on the sequence of events that preceded it. Speech and written language seem like suitable application cases for Markov chains. In addition to the reduction in size and gain in efficiency that such a model brings, I assume that keeping only the most common n-grams (and feeding them into the model) will further increase storage efficiency.
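As a rough illustration (not the final design), the bigram counts generated above could be stored as a lookup table keyed by the first word of each bigram, which is essentially a Markov-chain transition table. The sketch reuses the bi_corpus_freq data frame created earlier; bi_model and predict_next are hypothetical names introduced here.
# Sketch only: store bigram counts as a prefix-keyed table (Markov-style)
parts <- strsplit(as.character(bi_corpus_freq$word), " ", fixed = TRUE)
bi_model <- data.frame(
  prefix = sapply(parts, function(p) p[1]),     # first word of the bigram
  prediction = sapply(parts, function(p) p[2]), # second word of the bigram
  frequency = bi_corpus_freq$frequency,
  stringsAsFactors = FALSE
)
bi_model <- split(bi_model[, c("prediction", "frequency")], bi_model$prefix)
# Hypothetical lookup: most frequent observed continuation of a given word
predict_next <- function(w) {
  cand <- bi_model[[w]]
  if (is.null(cand)) return(NA_character_)   # unseen prefix
  cand$prediction[which.max(cand$frequency)]
}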
Question 2) How can you use the knowledge about word frequencies to make your model smaller and more efficient? It can be assumed that focussing on bigrams and trigrams alone will incorporate many of the common unigrams without losing much information. This will at the same time allow us to reduce the quantity of data being processed and potentially shorten the computation time of the model.
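As a rough illustration of this idea, rare n-grams could simply be dropped before the model is built; the cut-off of 2 below is arbitrary, and bi_corpus_freq / tri_corpus_freq are the (already lowfreq-filtered) data frames created above.
# Sketch only: prune n-grams below a minimum frequency to shrink the model
min_freq <- 2   # arbitrary illustrative threshold
bi_pruned <- bi_corpus_freq[bi_corpus_freq$frequency >= min_freq, ]
tri_pruned <- tri_corpus_freq[tri_corpus_freq$frequency >= min_freq, ]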
Question 3) How many parameters do you need (i.e. how big is n in your n-gram model)? I assume that three types of n-grams, i.e. 2-grams, 3-grams, and 4-grams, are a suitable choice.
Question 4) Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)? This question seems to point to a so-called Hidden Markov Model (HMM). An HMM is one where the rules for producing the chain are not known, i.e. “hidden”. The rules involve two probabilities:
- the probability that there will be a certain observation
- the probability that there will be a certain state transition, given the state of the model at a certain time.
Typical application cases of HMMs are reinforcement learning and temporal pattern recognition (speech recognition, part-of-speech tagging, etc.).
Sources: http://www.mi.fu-berlin.de/wiki/pub/ABI/HmmEm/hmms-em.pdf and https://en.wikipedia.org/wiki/Hidden_Markov_model
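Independent of the HMM discussion, the hint in the question itself (a non-zero probability for unseen n-grams) can already be served by simple add-one (Laplace) smoothing. The sketch below applies it to the unigram counts and treats the filtered uni_corpus_freq table as the vocabulary, which is a simplifying assumption; laplace_prob is a hypothetical helper.
# Sketch only: add-one (Laplace) smoothing on the unigram counts;
# the "+ 1" gives unseen words a non-zero probability
V <- nrow(uni_corpus_freq)           # approximate vocabulary size
N <- sum(uni_corpus_freq$frequency)  # total word count in the table
laplace_prob <- function(w) {
  count <- uni_corpus_freq$frequency[uni_corpus_freq$word == w]
  if (length(count) == 0) count <- 0 # unseen word
  (count + 1) / (N + V)
}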
Question 5) How do you evaluate whether your model is any good? A straightforward approach might be to split the cumulative dataset into a training and a test dataset. Model development is carried out on the training dataset, whereas accuracy is evaluated on the test dataset.
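A minimal sketch of such a split on the sampled data, assuming an arbitrary 80/20 ratio (train_data and test_data are hypothetical names):
# Sketch only: hold out part of the sample for evaluation
set.seed(519)
train_idx <- sample(seq_along(sample_data), size = floor(0.8 * length(sample_data)))
train_data <- sample_data[train_idx]
test_data <- sample_data[-train_idx]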
Question 6) How can you use backoff models to estimate the probability of unobserved n-grams? A quick web search suggests that the back-off model by Katz is an alternative to the Hidden Markov Model approach mentioned above.
A key challenge in n-gram modelling is the general sparsity of data, e.g. the low frequency of trigrams. A method to tackle this is smoothing, i.e. modifying the probability distribution of a language model so that all reasonable word sequences can appear with a certain probability. This often implies broadening the distribution of n-grams and re-distributing weight from high-probability areas to zero-probability regions. Apparently, there are different approaches to backoff modelling, such as the models by Katz and Kneser-Ney. They definitely deserve further research in the next steps of the Capstone project.
Sources: https://www.isip.piconepress.com/courses/msstate/ece_8463/lectures/2004_fall/lecture_33/lecture_33.pdf as well as https://en.wikipedia.org/wiki/Katz%27s_back-off_model and https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing
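As a first, deliberately simplified illustration of the backoff idea (closer to "stupid backoff" than to full Katz back-off, i.e. without discounting), a prediction could fall back from the trigram table to the bigram table and finally to the most frequent unigram. backoff_predict is a hypothetical helper built on the frequency tables created above and assumes plain lowercase word inputs.
# Sketch only: naive backoff from trigrams to bigrams to the top unigram
backoff_predict <- function(w1, w2) {
  tri_hit <- grep(paste0("^", w1, " ", w2, " "), tri_corpus_freq$word)
  if (length(tri_hit) > 0) {
    best <- tri_hit[which.max(tri_corpus_freq$frequency[tri_hit])]
    return(sub(paste0("^", w1, " ", w2, " "), "", tri_corpus_freq$word[best]))
  }
  bi_hit <- grep(paste0("^", w2, " "), bi_corpus_freq$word)
  if (length(bi_hit) > 0) {
    best <- bi_hit[which.max(bi_corpus_freq$frequency[bi_hit])]
    return(sub(paste0("^", w2, " "), "", bi_corpus_freq$word[best]))
  }
  as.character(uni_corpus_freq$word[1])   # fall back to the most frequent unigram
}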