This is my milestone report for the Data Science Specialization Capstone course offered by Johns Hopkins University on Coursera.
The motivation for this report is to:

- Demonstrate that I’ve downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings so far.
- Set out plans for creating a prediction algorithm and Shiny app.
The dataset[1] is a zip file containing blog posts, news articles, and Twitter tweets in four languages (English, German, Finnish, and Russian). The dataset was downloaded using R, and the English text files, i.e. en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, are used for the present study.
# download and unzip the dataset unless dataset.zip already exists, then list the English files
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("dataset.zip")) {
  download.file(url = url, destfile = "dataset.zip", mode = "wb")
  unzip("dataset.zip")
}
list.files("./final/en_US", recursive = TRUE)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
The table below lists key statistics of these files: the number of lines, word count, file size, and average words per line in each file. The wc command is used to obtain the file details; for reference on the wc command, please refer to this page.
| File.Name | Lines | Word.Count | Size.MB | Words.per.Line |
|---|---|---|---|---|
| blogs | 899288 | 37272578 | 200.42 | 41 |
| news | 1010242 | 34309642 | 196.28 | 34 |
| twitter | 2360148 | 30341028 | 159.36 | 13 |
| Total | 4269678 | 101923248 | 556.07 | 24 |
To make the current exploratory analysis easier, 10% of each text file is sampled using rbinom. The table below lists key statistics of the samples.
| samples | Lines | Word.Count | Unique.words | Size.MB | Words.per.Line |
|---|---|---|---|---|---|
| s.blogs | 89339 | 3790752 | 109361 | 8.84 | 42 |
| s.news | 100963 | 3462770 | 114900 | 8.85 | 34 |
| s.twitter | 236513 | 3060412 | 107987 | 7.84 | 13 |
| Total | 426815 | 10313934 | 332248 | 25.54 | 24 |
This reduces the original files to a manageable total of about 26 MB. The samples retain the characteristics of the originals: blogs still have more words per line than news and twitter.
Let’s look at the three samples more closely to gain insight into their features, specifically the type-token information of each sample and the most commonly occurring word strings (i.e. unigrams, bigrams and trigrams), and to think about a strategy for building an effective and balanced N-gram prediction model.
Initially I tried to build a corpus from the samples using the tm package, but I encountered an error when converting my document-term matrix to a matrix (the conversion required a vector of size 17.4 Gb!). Therefore, I decided to explore the quanteda package, as it promises to provide a fast and convenient method not only for cleaning the data, but also for obtaining a selected matrix of documents by features[2].
Comparing the texts from each sample, the following plot shows that twitter generally contains much shorter lines than blogs and news. However, judging from the slopes of the curves, the vocabulary appears to be similarly ‘lexically varied’ across the three samples.
In fact, the lexical diversity of a corpus can be quantified by several indices provided by textstat_lexdiv(). From the table below, we can see that the ‘ordinary Type-Token Ratio’ (TTR), ‘log Type-Token Ratio’ (C), ‘Carroll’s Corrected TTR’ (CTTR), and ‘Dugast’s Uber Index’ (U) of the three samples are all very close, which concurs with what we’ve seen in the plot, i.e. the three samples are of similar lexical complexity.
| | TTR | C | CTTR | U |
|---|---|---|---|---|
| blogs | 0.0570036 | 0.8017523 | 55.35058 | 31.65468 |
| news | 0.0576011 | 0.8029056 | 56.82281 | 31.90964 |
| twitter | 0.0605030 | 0.8042740 | 55.37630 | 31.80018 |
Let’s have a look at the most frequent words appearing in the three samples. Below is a comparison plot produced with textplot_wordcloud(). Here I’ve removed digits, punctuation, symbols, the Twitter characters @ and #, and hyphens, as well as English stopwords (i.e. highly frequent words such as personal pronouns, linking verbs, and contractions of negated auxiliary verbs) and profanity. The profanity word list is obtained from this page[3]. NB: I have not yet decided whether to keep or remove English stopwords. In addition, words are not stemmed at this stage.
You can see that the twitter features include quite a few abbreviations and shorthand slang (i.e. twitter lingo), such as ‘lol’ (laugh out loud), ‘rt’ (retweet), and ‘ur’ (your). Since such strings add noise to the data, I’ve created a list of twitter stopwords to be removed in the following analysis.
# create a twitter lingo dataset for removal
twitterlingo <- c("lol","oh","yeah","wow","rt","ur","ya","im","haha","hahaha","hahahaha","tn",letters)
An N-gram is a sequence of N words, e.g. a 2-gram (or bigram) is a two-word sequence like ‘may i’ or ‘help you’, and a 3-gram (or trigram) is a three-word sequence like ‘may i help’ or ‘i help you’. N-gram models (Markov models[4]) assume that we can estimate the probability of a word given just the last few words, and N-gram probabilities can be calculated by counting in a corpus and normalising the counts by dividing by a total count (the maximum likelihood estimate, or MLE). For instance, given the word “help”, the probability of the next word being “you” is p(you|help) = count(“help you”)/count(“help”). In this section, we construct the frequency tables of N-grams (N = 1, 2, 3).
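To make the MLE concrete, here is a minimal sketch of how such a conditional probability could be looked up from the frequency tables built below. The helper mle_next_word() is hypothetical, and since stopwords are removed before the tables are constructed, a pair such as ‘help you’ may not actually be present in them.

# Minimal sketch: MLE estimate of p(next_word | given) from the N-gram frequency
# tables built later in this report. Returns NA if either count is missing
# (e.g. because stopwords were removed before the tables were constructed).
mle_next_word <- function(given, next_word, unigrams, bigrams) {
  bigram_count  <- bigrams$frequency[as.character(bigrams$feature) == paste(given, next_word)]
  unigram_count <- unigrams$frequency[as.character(unigrams$feature) == given]
  if (length(bigram_count) == 0 || length(unigram_count) == 0) return(NA)
  bigram_count / unigram_count
}
# e.g. mle_next_word("help", "you", features.Unigrams, features.Bigrams)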
All three samples are combined into one mixed corpus. Below we plot the top 30 most frequently occurring N-grams from the combined corpus.
Here we consider how many words are required to cover 50%, 60%, 70%, 80%, 90% and 99% of the corpus. The following plot shows the unigram coverage.
As you can see, covering 90% of the corpus requires only 15541 of the 179567 unigrams, a dramatic reduction. Covering 99% requires 109373 unigrams, which is still a 39% reduction.
At this stage I plan to remove the N-grams that occur only once; very rare words or phrases are often typos and can be dismissed. After pruning these infrequent N-grams, the N-gram tables shrink to a much smaller size.
| Ngrams | Size.MB.Original | Size.MB.Pruned |
|---|---|---|
| Unigrams | 2.59 | 1.26 |
| Bigrams | 54.32 | 20.74 |
| Trigrams | 105.73 | 39.53 |
| Total | 162.64 | 61.53 |
So far, we’ve looked at the datasets provided by the Coursera course, taken 10% samples of each, and explored the differences between the samples. We also constructed frequency tables for the corpus built from the samples and pruned the data down to a manageable size.
For future work, I hope to complete the following tasks, as suggested in Jurafsky and Martin’s book[4] to build an N-gram language model:
1. Divide the data into 80% training, 10% development, and 10% test sets.
2. Rebuild the N-gram frequency tables from the training data and model the probability of a word based on the Markov chain assumption.
3. Use a backoff strategy for prediction: if an n-gram is not available, back off to the (n-1)-gram, recursively down to the unigram if necessary (a minimal sketch follows this list).
4. Employ Good-Turing smoothing to estimate the probabilities of words that have not been seen before.
5. Evaluate the model on the test set using perplexity; a low perplexity indicates the probability distribution is good at predicting the sample.
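As a rough illustration of the backoff step, below is a minimal sketch built on the pruned frequency tables from this report (featuresmall.Unigrams, featuresmall.Bigrams, featuresmall.Trigrams). The helper predict_backoff() is hypothetical and ignores smoothing; it simply returns up to n candidate next words, most frequent first.

# Minimal sketch of a backoff predictor using the pruned frequency tables above.
# predict_backoff() is a hypothetical helper: no smoothing yet, and candidates
# are returned most frequent first (the tables are already sorted by frequency).
library(stringr)  # word() extracts the k-th word of an n-gram string

predict_backoff <- function(prev_words, n = 3) {
  prev_words <- tolower(prev_words)
  # try trigrams first: match the last two words of the input
  if (length(prev_words) >= 2) {
    key   <- paste(tail(prev_words, 2), collapse = " ")
    feats <- as.character(featuresmall.Trigrams$feature)
    hits  <- feats[startsWith(feats, paste0(key, " "))]
    if (length(hits) > 0) return(word(head(hits, n), 3))
  }
  # back off to bigrams: match the last word only
  if (length(prev_words) >= 1) {
    key   <- tail(prev_words, 1)
    feats <- as.character(featuresmall.Bigrams$feature)
    hits  <- feats[startsWith(feats, paste0(key, " "))]
    if (length(hits) > 0) return(word(head(hits, n), 2))
  }
  # final fallback: the most frequent unigrams overall
  head(as.character(featuresmall.Unigrams$feature), n)
}
# e.g. predict_backoff(c("happy", "new"))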
Plan for Shiny App:
The Shiny App will have a text input box where users can enter words. The language algorithm will predict the most likely next words, which will be displayed as a word cloud, with a larger font size indicating a higher probability.
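As a rough sketch of this plan (not a final implementation), the app could be wired up as below, assuming a hypothetical predict_next_words() function built on top of the N-gram model that returns a data frame with columns word and prob.

# Minimal sketch of the planned Shiny app; predict_next_words() is a
# hypothetical function to be built from the N-gram prediction model.
library(shiny)
library(wordcloud)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("usertext", "Enter some words:", value = ""),
  plotOutput("cloud")
)

server <- function(input, output) {
  output$cloud <- renderPlot({
    req(nzchar(input$usertext))
    preds <- predict_next_words(input$usertext)  # returns columns word, prob
    # larger font size indicates higher predicted probability
    wordcloud(words = preds$word, freq = preds$prob, min.freq = 0, scale = c(4, 0.5))
  })
}

# shinyApp(ui = ui, server = server)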
[1] Coursera SwiftKey dataset, retrieved February 5, 2018.
[2] [Package quanteda](https://cran.r-project.org/web/packages/quanteda/quanteda.pdf). November 13, 2017.
[3] Parker, J. 2016. Bad Words, Swear Words, Offensive Words, Profanities Banned by Google
[4] Jurafsky, D. and Martin, J. H. Speech and Language Processing (3rd ed. draft). Draft chapters in progress, August 28, 2017.
# create assets folder
if (!file.exists("./assets")) {
dir.create("./assets")
}
assetfolder <- file.path(".","assets")
f.details <- file.path(assetfolder, "files_details")
# read lines from the three text files
if(!file.exists(f.details)) {
filenames <- Sys.glob(paste0(getwd(),"/final/en_US/","*.txt"))
con1 <- file(filenames[1], open = "rb")
blogs_all <- readLines(con1, skipNul=TRUE,encoding = "UTF-8")
close(con1)
con2 <- file(filenames[2], open = "rb")
news_all <- readLines(con2,skipNul=TRUE,encoding = "UTF-8")
close(con2)
con3 <- file(filenames[3], open = "rb")
twitter_all <- readLines(con3,skipNul=TRUE,encoding = "UTF-8")
close(con3)
dataset_summary <- system("wc ./final/en_US/*.txt", intern = T) # The wc command displays lines, words and bytes of the files. For reference [here](https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.cmds6/wc.htm)
properties <- na.omit(as.numeric(unlist(strsplit(dataset_summary," "))))
# the following lines took too long to execute, so they are not run here
# unique.words.blogs.all <- ntype(tolower(toString(blogs_all)),removePunct=TRUE)
# unique.words.news.all <- ntype(tolower(toString(news_all)),removePunct=TRUE)
# unique.words.twitter.all <- ntype(tolower(toString(twitter_all)),removePunct=TRUE)
files_details <- data.frame(File.Name = c("blogs","news","twitter","Total"),
Lines = c(properties[1],properties[4],properties[7],properties[10]),
Word.Count = c(properties[2],properties[5],properties[8],properties[11]),
Size.MB = round(c(properties[3]/1024/1024,properties[6]/1024/1024,properties[9]/1024/1024,properties[12]/1024/1024),digits=2),
Words.per.Line = round(c(properties[2]/properties[1],properties[5]/properties[4],properties[8]/properties[7],properties[11]/properties[10]),digits=0))
saveRDS(files_details, file=f.details)
} else {
files_details <- readRDS(f.details)
}
kable(files_details)
blogsample <- file.path(assetfolder,"blogs.txt")
newsample <- file.path(assetfolder,"news.txt")
twittersample <- file.path(assetfolder,"twitter.txt")
s.details <- file.path(assetfolder,"sample_details")
if (!file.exists(blogsample)) {
# randomly assign index 0 and 1 to each line; lines assigned 1 are taken
set.seed(214748)
blogs_index <- as.logical(rbinom(length(blogs_all),1,0.1))
news_index <- as.logical(rbinom(length(news_all),1,0.1))
twitter_index <- as.logical(rbinom(length(twitter_all),1,0.1))
blogs <- blogs_all[blogs_index]
news <- news_all[news_index]
twitter <- twitter_all[twitter_index]
# save the samples (as RDS objects, despite the .txt extension) so that sampling need not be rerun in future R sessions
saveRDS(blogs, file=blogsample)
saveRDS(news, file=newsample)
saveRDS(twitter, file=twittersample)
} else {
# read in samples
blogs <- readRDS(blogsample)
news <- readRDS(newsample)
twitter <- readRDS(twittersample)
}
if(!file.exists(s.details)) {
# use wc command to count bytes of the samples
sample_info <- system("wc -c ./assets/blogs.txt ./assets/news.txt ./assets/twitter.txt", intern = T)
s.info <- na.omit(as.numeric(unlist(strsplit(sample_info," "))))
wc.blogs <- sum(str_count(blogs,"[[:alpha:]]+"))
wc.news <- sum(str_count(news, "[[:alpha:]]+"))
wc.twitter <- sum(str_count(twitter,"[[:alpha:]]+"))
unique.words.blogs <- ntype(tolower(toString(blogs)),removePunct=TRUE)
unique.words.news <- ntype(tolower(toString(news)),removePunct=TRUE)
# the twitter sample contains emoji, so an extra conversion step is required
twitter <- sapply(twitter,function(row) iconv(row, "latin1", "ASCII", sub=""))
unique.words.twitter <- ntype(tolower(toString(twitter)),removePunct=TRUE)
sample_details <- data.frame(samples = c("s.blogs","s.news","s.twitter","Total"),
Lines = c(length(blogs),length(news),length(twitter),length(blogs)+length(news)+length(twitter)),
Word.Count = c(wc.blogs, wc.news, wc.twitter, wc.blogs+wc.news+wc.twitter),
Unique.words = c(unique.words.blogs, unique.words.news, unique.words.twitter,unique.words.blogs+unique.words.news+unique.words.twitter),
Size.MB = round(c(s.info[1]/1024/1024,s.info[2]/1024/1024,s.info[3]/1024/1024,s.info[4]/1024/1024),digits=2),
Words.per.Line = round(c(wc.blogs/length(blogs),wc.news/length(news),wc.twitter/length(twitter),(wc.blogs+wc.news+wc.twitter)/(length(blogs)+length(news)+length(twitter))),digits=0))
saveRDS(sample_details, file=s.details)
} else {
sample_details <- readRDS(s.details)
}
kable(sample_details)
tokeninfo <- file.path(assetfolder,"tokenInfo")
if (!file.exists(tokeninfo)) {
mycorpus.blogs <- corpus(c(blog=blogs),metacorpus = list(source="Capstone Project",notes="mycorpus"))
mycorpus.news <- corpus(c(news=news),metacorpus = list(source="Capstone Project",notes="mycorpus"))
mycorpus.twitter <- corpus(c(twitter=twitter),metacorpus = list(source="Capstone Project",notes="mycorpus"))
docvars(mycorpus.blogs,"Sources") <- "blogs"
docvars(mycorpus.news,"Sources") <- "news"
docvars(mycorpus.twitter,"Sources") <- "twitter"
tokenInfo.blogs <- summary(mycorpus.blogs)
tokenInfo.news <- summary(mycorpus.news)
tokenInfo.twitter <- summary(mycorpus.twitter)
tokenInfo <- rbind(tokenInfo.blogs,tokenInfo.news,tokenInfo.twitter)
saveRDS(tokenInfo, file = tokeninfo)
} else {
tokenInfo <- readRDS(tokeninfo)
}
# Plot tokens and types of the three different genres for the first 2000 lines
p1 <- ggplot(data=tokenInfo, aes(x = Tokens , y = Types, color = Sources)) + geom_line() + geom_point() + facet_grid(.~Sources) + labs(x = "Tokens - total occurrences of words", y = "Types - unique words")
p1
badwordstxt <- file.path(assetfolder,"badwords.txt")
con4 <- file(badwordstxt, open="rb")
badwords <- readLines(con4,skipNul=TRUE,encoding = "ISO-8859-1")
close(con4)
# trim whitespace at the beginning and end of each word (see http://www.endmemo.com/program/R/grepl.php)
badwords <- gsub("^\\s+|\\s+$", "", badwords)
myCorpus.dfm <- file.path(assetfolder,"mycorpus.dfm")
if (!file.exists(myCorpus.dfm)) {
# build a corpus from the three samples
corpus.blogs <- corpus(c(blog=toString(blogs)),metacorpus = list(source="Capstone Project",notes="mycorpus"))
corpus.news <- corpus(c(news=toString(news)),metacorpus = list(source="Capstone Project",notes="mycorpus"))
corpus.twitter <- corpus(c(twitter=toString(twitter)),metacorpus = list(source="Capstone Project",notes="mycorpus"))
docvars(corpus.blogs,"Sources") <- "blogs"
docvars(corpus.news,"Sources") <- "news"
docvars(corpus.twitter,"Sources") <- "twitter"
mycorpus <- corpus.blogs + corpus.news + corpus.twitter
# please note that the dfm() function applies tolower() by default
mycorpus.dfm <- mycorpus %>% dfm(groups = "Sources", remove = c(stopwords("english"),badwords), remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE)
saveRDS(mycorpus.dfm,file=myCorpus.dfm)
} else {
mycorpus.dfm <- readRDS(myCorpus.dfm)
}
# calculate the lexical diversity or complexity of the three samples
lex.div <- textstat_lexdiv(mycorpus.dfm, c("TTR", "C", "CTTR", "U"))
kable(lex.div)
# Plot a comparison cloud to compare the top features of the three samples
mycorpus.dfm %>% dfm_trim(min_count = 1000, verbose = FALSE) %>%
textplot_wordcloud(comparison = TRUE)
unigrams <- file.path(assetfolder,"myDfm.Unigrams")
bigrams <- file.path(assetfolder,"myDfm.Bigrams")
trigrams <- file.path(assetfolder,"myDfm.Trigrams")
if (!file.exists(unigrams)){
# removing stopwords before constructing ngrams
tokensAll <- tokens(mycorpus, what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE)
tokensNoStopwords <- tokens_tolower(tokens_remove(tokensAll, c(stopwords("english"),badwords,twitterlingo)))
tokens.Unigrams <- tokens_ngrams(tokensNoStopwords, 1)
tokens.Bigrams <- tokens_ngrams(tokensNoStopwords, 2, concatenator = " ")
tokens.Trigrams <- tokens_ngrams(tokensNoStopwords, 3, concatenator = " ")
myDfm.Unigrams<- dfm(tokens.Unigrams, verbose = FALSE)
myDfm.Bigrams <- dfm(tokens.Bigrams, verbose = FALSE)
myDfm.Trigrams <- dfm(tokens.Trigrams, verbose = FALSE)
saveRDS(myDfm.Unigrams,file = unigrams)
saveRDS(myDfm.Bigrams,file = bigrams)
saveRDS(myDfm.Trigrams,file = trigrams)
} else {
myDfm.Unigrams <- readRDS(unigrams)
myDfm.Bigrams <- readRDS(bigrams)
myDfm.Trigrams <- readRDS(trigrams)
}
features.unigrams <- file.path(assetfolder,"features.Unigrams")
features.bigrams <- file.path(assetfolder,"features.Bigrams")
features.trigrams <- file.path(assetfolder,"features.Trigrams")
if (!file.exists(features.unigrams)){
features.Unigrams <- textstat_frequency(myDfm.Unigrams)
features.Bigrams <- textstat_frequency(myDfm.Bigrams)
features.Trigrams <- textstat_frequency(myDfm.Trigrams)
# Sort by reverse frequency order
features.Unigrams$feature <- with(features.Unigrams, reorder(feature, -frequency))
features.Bigrams$feature <- with(features.Bigrams, reorder(feature, -frequency))
features.Trigrams$feature <- with(features.Trigrams, reorder(feature, -frequency))
saveRDS(features.Unigrams,file = features.unigrams)
saveRDS(features.Bigrams,file = features.bigrams)
saveRDS(features.Trigrams,file = features.trigrams)
} else {
features.Unigrams <- readRDS(features.unigrams)
features.Bigrams <- readRDS(features.bigrams)
features.Trigrams <- readRDS(features.trigrams)
}
# Top 30 features
features.Unigrams1 <- cbind(Ngrams = "Unigram", features.Unigrams[1:30,])
features.Bigrams1 <- cbind(Ngrams = "Bigram", features.Bigrams[1:30,])
features.Trigrams1 <- cbind(Ngrams = "Trigram", features.Trigrams[1:30,])
features.top <- rbind(features.Unigrams1,features.Bigrams1,features.Trigrams1)
# Plotting top features
f1 <- ggplot(features.Unigrams1, aes(x = reorder(feature, frequency), y = frequency, fill=frequency)) + geom_bar(stat='identity') + coord_flip() + ggtitle("Top 30 Most Frequently Occurring Unigrams") + labs(x = "Features",y = "Frequency") + theme(text = element_text(size=20),legend.title=element_blank(), legend.position = "none")
f2 <- ggplot(features.Bigrams1, aes(x = reorder(feature, frequency), y = frequency, fill=frequency)) + geom_bar(stat='identity') + coord_flip() + ggtitle("Top 30 Most Frequently Occurring Bigrams") + labs(x = "Features",y = "Frequency") + theme(text = element_text(size=20),legend.title=element_blank(), legend.position = "none")
f3 <- ggplot(features.Trigrams1, aes(x = reorder(feature, frequency), y = frequency, fill=frequency)) + geom_bar(stat='identity') + coord_flip() + ggtitle("Top 30 Most Frequently Occurring Trigrams") + labs(x = "Features",y = "Frequency") + theme(text = element_text(size=20),legend.title=element_blank(), legend.position = "none")
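# multiplot() arranges several ggplot objects on one page; it is assumed to be defined elsewhere in this Rmd (e.g. the Cookbook for R helper)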
multiplot(f1,f2,f3,cols = 2)
# write a function to calculate how many n-grams are needed to cover certain percentage of the corpus
coverage <- function(features,pc,c){
total <- sum(features$frequency)
i = 0
covered = c # starting from a certain c to reduce looping times
while( covered < total*pc ){
i = i + 1
covered = covered + features[i,]$frequency
}
i
}
unigrams0.5 <- coverage(features.Unigrams,0.5,500)
unigrams0.6 <- coverage(features.Unigrams,0.6,1200)
unigrams0.7 <- coverage(features.Unigrams,0.7,1500)
unigrams0.8 <- coverage(features.Unigrams,0.8,2000)
unigrams0.9 <- coverage(features.Unigrams,0.9,6000)
unigrams0.99 <- coverage(features.Unigrams,0.99,15500)
unigramcoverage <- data.frame(Coverage.percentage = c("0.5", "0.6", "0.7", "0.8", "0.9", "0.99"),words.count = c(unigrams0.5,unigrams0.6,unigrams0.7,unigrams0.8,unigrams0.9,unigrams0.99))
coverage.plot <- ggplot(unigramcoverage,aes(x = words.count, y =Coverage.percentage, group=1)) + geom_line(color="blue") + geom_text(aes(label = words.count, vjust=-1.5)) + labs(x = "Number of unigrams required", y="Coverage") + ggtitle("Number of unigrams required to cover a certain percentage of the corpus")
coverage.plot
# not run due to being unable to produce results
# ngram.coverage <- as.data.frame(coverage = c("0.5", "0.6", "0.7", "0.8", "0.9", "0.95"), Unigrams =c(coverage(features.Unigrams,0.5),coverage(features.Unigrams,0.6),coverage(features.Unigrams,0.7),coverage(features.Unigrams,0.8),coverage(features.Unigrams,0.9),coverage(features.Unigrams,0.95)), Bigrams = c(coverage(features.Bigrams,0.5),coverage(features.Bigrams,0.6),coverage(features.Bigrams,0.7),coverage(features.Bigrams,0.8),coverage(features.Bigrams,0.9),coverage(features.Bigrams,0.95)), Trigrams = coverage(features.Trigrams,0.5),coverage(features.Trigrams,0.6),coverage(features.Trigrams,0.7),coverage(features.Trigrams,0.8),coverage(features.Trigrams,0.9),coverage(features.Trigrams,0.95))
# Calculate the size of Ngram frequency tables
featuresize <- system("wc -c ./assets/features.*", intern = T)
sizes <- na.omit(as.numeric(unlist(strsplit(featuresize," "))))
# calculate the size of pruned Ngram frequency tables
featuresmall.unigrams <- file.path(assetfolder,"featuresmall.Unigrams")
featuresmall.bigrams <- file.path(assetfolder,"featuresmall.Bigrams")
featuresmall.trigrams <- file.path(assetfolder,"featuresmall.Trigrams")
if(!file.exists(featuresmall.unigrams)){
featuresmall.Unigrams <- subset(features.Unigrams,features.Unigrams$frequency>1)
featuresmall.Bigrams <- subset(features.Bigrams,features.Bigrams$frequency>1)
featuresmall.Trigrams <- subset(features.Trigrams,features.Trigrams$frequency>1)
saveRDS(featuresmall.Unigrams,file = featuresmall.unigrams)
saveRDS(featuresmall.Bigrams,file = featuresmall.bigrams)
saveRDS(featuresmall.Trigrams,file = featuresmall.trigrams)
} else {
featuresmall.Unigrams <- readRDS(featuresmall.unigrams)
featuresmall.Bigrams <- readRDS(featuresmall.bigrams)
featuresmall.Trigrams <- readRDS(featuresmall.trigrams)
}
# Display the size of Ngram frequency tables
featuresmallsize <- system("wc -c ./assets/featuresmall.*", intern = T)
smallsizes <- na.omit(as.numeric(unlist(strsplit(featuresmallsize," "))))
feat.size.pruned <- data.frame(Ngrams = c("Unigrams","Bigrams","Trigrams","Total"),Size.MB.Original = round(c(sizes[3]/1024/1024,sizes[1]/1024/1024,sizes[2]/1024/1024,sizes[4]/1024/1024),2),Size.MB.Pruned = round(c(smallsizes[3]/1024/1024,smallsizes[1]/1024/1024,smallsizes[2]/1024/1024,smallsizes[4]/1024/1024),2))
kable(feat.size.pruned)