This report is a milestone report for the Data Science Capstone class. The dataset is provided by SwiftKey, and this report focuses on the subset written in English. It covers downloading and summarizing the data, sampling and cleaning the text, exploring unigram, bigram, and trigram frequencies, and estimating how many unique words are needed to cover a given share of all word instances.
We first download the data and unzip the files.
url<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "SwiftKey.zip")
unzip("SwiftKey.zip")
We then identify and read the files that will be used in the subsequent analysis.
news<-readLines(con = file("./final/en_US/en_US.news.txt"))
twitter<-readLines(con = file("./final/en_US/en_US.twitter.txt"))
## Warning in readLines(con = file("./final/en_US/en_US.twitter.txt")): line 167155
## appears to contain an embedded nul
## Warning in readLines(con = file("./final/en_US/en_US.twitter.txt")): line 268547
## appears to contain an embedded nul
## Warning in readLines(con = file("./final/en_US/en_US.twitter.txt")): line
## 1274086 appears to contain an embedded nul
## Warning in readLines(con = file("./final/en_US/en_US.twitter.txt")): line
## 1759032 appears to contain an embedded nul
blogs<-readLines(con = file("./final/en_US/en_US.blogs.txt"))
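The embedded-nul warnings come from stray null bytes in the Twitter file and are harmless for this analysis. If desired, they can be avoided with readLines' skipNul argument, closing the connection explicitly; a minimal sketch:
# Re-read the Twitter file, skipping embedded nul characters and closing the
# connection when done.
con <- file("./final/en_US/en_US.twitter.txt", open = "r")
twitter <- readLines(con, skipNul = TRUE)
close(con)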
The required libraries are loaded.
suppressPackageStartupMessages(library(stringi))
suppressPackageStartupMessages(library(dplyr))
#library(RWeka)
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(tm))
We have three data files from three sources (blogs, news, and Twitter), read into the objects blogs, news, and twitter. A quick overview checks the size, line count, word count, and character count of each file.
filesummary<-data.frame(Source=c("Blogs","News","Twitter"))
filesummary$sizeMB<-c(file.size("./final/en_US/en_US.blogs.txt")/1024^2, file.size("./final/en_US/en_US.news.txt")/1024^2,file.size("./final/en_US/en_US.twitter.txt")/1024^2 )
filesummary$lines<-c(length(blogs),length(news),length(twitter))
filesummary$Words_count<-c(sum(stri_count_words(blogs)),sum(stri_count_words(news)),sum(stri_count_words(twitter)))
filesummary$Character_count<-c(sum(nchar(blogs)),sum(nchar(news)),sum(nchar(twitter)))
print(filesummary)
## Source sizeMB lines Words_count Character_count
## 1 Blogs 200.4242 899288 37546239 206824505
## 2 News 196.2775 1010242 34762395 203223159
## 3 Twitter 159.3641 2360148 30093372 162096031
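If the report is rendered with knitr, the same summary can also be shown as a formatted table; a minimal sketch, assuming the knitr package is available:
# Optional: render the summary as a formatted table when knitting.
knitr::kable(filesummary, digits = 1,
             col.names = c("Source", "Size (MB)", "Lines", "Words", "Characters"))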
The three data files, blogs, news, and twitter, contain large amounts of text. To allow quick processing, a sample of 5,000 lines (1% of the lines, capped at 5,000) is drawn from each file.
set.seed(1234)
sample_set<-function (x){
sample(x, ifelse(length(x)*0.01<5000, length(x)*0.01, 5000))
}
blogs_sample<-sample_set(blogs)
news_sample<-sample_set(news)
twitter_sample<-sample_set(twitter)
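An equivalent way to express the sampling cap uses min(); a sketch (floor() is added here to make the sample size an explicit integer):
# Same sampling rule as sample_set above: 1% of the lines, capped at 5,000.
sample_set_min <- function(x) {
  sample(x, min(floor(length(x) * 0.01), 5000))
}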
Numbers, English stopwords, extra whitespace, and punctuation are removed, and all text is converted to lower case.
filtered <- function(x) {
  x <- VCorpus(VectorSource(x))
  x <- tm_map(x, content_transformer(tolower))
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, removePunctuation, preserve_intra_word_contractions = TRUE)
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, removeWords, stopwords("english"))
  x
}
news_sample<-filtered(news_sample)
blogs_sample<-filtered(blogs_sample)
twitter_sample<-filtered(twitter_sample)
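A quick spot check can confirm the cleaning worked, for example by printing the first cleaned document from the news sample; a minimal sketch:
# Inspect the first cleaned news document: lowercased, with punctuation,
# numbers, and English stopwords removed.
as.character(news_sample[[1]])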
We first define tokenizer functions (UnigramTokenizer, BigramTokenizer, TrigramTokenizer) that split text into n-gram tokens, and corresponding functions (unigram, bigram, trigram) that build a document-term matrix and return each n-gram with its frequency, sorted in descending order.
UnigramTokenizer <-
function(x) {
unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
}
BigramTokenizer <-
function(x){
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
TrigramTokenizer <-
function(x) {
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
unigram<-function(x){
dtm <- DocumentTermMatrix(x,control = list(tokenize=UnigramTokenizer))
frequencylist<-colSums(as.matrix(dtm))
ngram<-data.frame(ngram=names(frequencylist),freq=frequencylist)
ngram<-ngram[order(-ngram$freq),]
}
bigram<-function(x){
dtm <- DocumentTermMatrix(x,control = list(tokenize=BigramTokenizer))
frequencylist<-colSums(as.matrix(dtm))
ngram<-data.frame(ngram=names(frequencylist),freq=frequencylist)
ngram<-ngram[order(-ngram$freq),]
}
trigram<-function(x){
dtm <- DocumentTermMatrix(x,control = list(tokenize=TrigramTokenizer))
frequencylist<-colSums(as.matrix(dtm))
ngram<-data.frame(ngram=names(frequencylist),freq=frequencylist)
ngram<-ngram[order(-ngram$freq),]
}
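The three frequency functions differ only in the tokenizer they call, so a single generalized helper could replace them; a minimal sketch (the name ngram_freq is ours, not part of the code above):
# Generalized n-gram frequency table: tokenize into n-grams, build a
# DocumentTermMatrix, and return terms sorted by descending frequency.
ngram_freq <- function(x, n) {
  tokenizer <- function(doc) {
    unlist(lapply(ngrams(words(doc), n), paste, collapse = " "), use.names = FALSE)
  }
  dtm <- DocumentTermMatrix(x, control = list(tokenize = tokenizer))
  frequencylist <- colSums(as.matrix(dtm))
  out <- data.frame(ngram = names(frequencylist), freq = frequencylist)
  out[order(-out$freq), ]
}
# For example, ngram_freq(blogs_sample, 2) would match bigram(blogs_sample).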
Using the functions described above, we generated unigrams, bigrams, and trigrams from the blogs, news, and Twitter samples.
blogs_1<-unigram(blogs_sample)
blogs_2<-bigram(blogs_sample)
blogs_3<-trigram(blogs_sample)
news_1<-unigram(news_sample)
news_2<-bigram(news_sample)
news_3<-trigram(news_sample)
twitter_1<-unigram(twitter_sample)
twitter_2<-bigram(twitter_sample)
twitter_3<-trigram(twitter_sample)
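Before plotting, the tables can be spot-checked, for example by looking at the most frequent entries in the blogs tables:
# Peek at the most frequent unigrams and bigrams in the blogs sample.
head(blogs_1, 10)
head(blogs_2, 10)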
We plotted the top 20 n-grams from each of the sources.
g_blogs1<-ggplot(data=blogs_1[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#FF66B2") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Unigram") + ylab("Frequency") +
ggtitle( "Top 20 1-grams\nin Blogs")+
theme(plot.title = element_text(size = 8, face = "bold"))
g_news1<-ggplot(data=news_1[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#66FFB2") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Unigram") + ylab("Frequency") +
ggtitle("Top 20 1-grams\nin News")+
theme(plot.title = element_text(size = 8, face = "bold"))
g_twitter1<-ggplot(data=twitter_1[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#66B2FF") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Unigram") + ylab("Frequency") +
ggtitle ("Top 20 1-grams\nin Twitter")+
theme(plot.title = element_text(size = 8, face = "bold"))
grid.arrange(g_blogs1, g_news1, g_twitter1, ncol=3)
g_blogs2<-ggplot(data=blogs_2[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#FF66B2") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Bigram") + ylab("Frequency") +
ggtitle( "Top 20 2-grams\nin Blogs")+
theme(plot.title = element_text(size = 8, face = "bold"))
g_news2<-ggplot(data=news_2[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#66FFB2") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Bigram") + ylab("Frequency") +
ggtitle("Top 20 2-grams\nin News")+
theme(plot.title = element_text(size = 8, face = "bold"))
g_twitter2<-ggplot(data=twitter_2[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#66B2FF") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Bigram") + ylab("Frequency") +
ggtitle ("Top 20 2-grams\nin Twitter")+
theme(plot.title = element_text(size = 8, face = "bold"))
grid.arrange(g_blogs2, g_news2, g_twitter2, ncol=3)
g_blogs3<-ggplot(data=blogs_3[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#FF66B2") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Trigram") + ylab("Frequency") +
ggtitle( "Top 20 3-grams\nin Blogs")+
theme(plot.title = element_text(size = 8, face = "bold"))
g_news3<-ggplot(data=news_3[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#66FFB2") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Trigram") + ylab("Frequency") +
ggtitle("Top 20 3-grams\nin News")+
theme(plot.title = element_text(size = 8, face = "bold"))
g_twitter3<-ggplot(data=twitter_3[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
geom_bar(stat = "identity", fill="#66B2FF") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Trigram") + ylab("Frequency") +
ggtitle ("Top 20 3-grams\nin Twitter")+
theme(plot.title = element_text(size = 8, face = "bold"))
grid.arrange(g_blogs3, g_news3, g_twitter3, ncol=3)
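The nine plot definitions above share everything except the data, axis label, title, and fill colour; a small helper could shorten them. A minimal sketch (plot_top20 is our name, not part of the code above):
# Helper: horizontal bar chart of the 20 most frequent n-grams in a table.
plot_top20 <- function(df, label, title, fill) {
  ggplot(df[1:20, ], aes(x = reorder(ngram, freq), y = freq)) +
    geom_bar(stat = "identity", fill = fill) +
    coord_flip() +
    xlab(label) + ylab("Frequency") +
    ggtitle(title) +
    theme(plot.title = element_text(size = 8, face = "bold"))
}
# Example: plot_top20(blogs_2, "Bigram", "Top 20 2-grams\nin Blogs", "#FF66B2")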
How many unique words does one need in a frequency-sorted dictionary to cover 50% of all word instances in the language? We first merge the three samples and write a function that finds the number of unique words needed to reach a given coverage rate (cover_rate).
merge_sample<-c(blogs_sample,news_sample,twitter_sample)
merge_1<-unigram(merge_sample)
wordcoverage <- function(x, cover_rate) {
  nwords <- 0                            # running total of word instances covered
  coverage <- cover_rate * sum(x$freq)   # target number of word instances
  for (i in 1:nrow(x)) {
    if (nwords >= coverage) {return(i)}  # enough words accumulated
    nwords <- nwords + x$freq[i]
  }
}
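A vectorized version of the same idea uses cumsum(); note that it returns the first rank at which the cumulative frequency reaches the target, which can differ by one from the loop above because the loop tests the threshold before adding the current word's count. A sketch:
# Vectorized coverage count: smallest number of top-ranked words whose
# combined frequency reaches cover_rate of all word instances.
wordcoverage_vec <- function(x, cover_rate) {
  which(cumsum(x$freq) >= cover_rate * sum(x$freq))[1]
}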
The numbers of unique words needed to cover 50% and 90% of all word instances are shown here, using the merged sample from blogs, news, and Twitter.
wordcoverage(merge_1,0.5)
## [1] 1101
wordcoverage(merge_1,0.9)
## [1] 14678
We could evaluate how many of the words come from foreign languages by obtaining data sets written in other languages, building and tokenizing corpora from them, and then examining the tokens that overlap between those corpora and our English corpus.
We use the unigram frequency table from the merged sample as an example. A plot shows the relationship between the number of unique words and the coverage rate.
merge_1 <- merge_1 %>%
mutate(Cumulative.Frequency = cumsum(freq))
total <- max(merge_1$Cumulative.Frequency)
merge_1 <- merge_1 %>%
mutate(Coverage = Cumulative.Frequency/total) %>%
mutate(`Number of Words` = 1:nrow(merge_1))
p <- ggplot(merge_1, aes(x = `Number of Words`, y = Coverage)) + ggtitle ("Corpus Coverage based on Number of Unique Unigrams")+ geom_line() + geom_hline(yintercept = c(0.5, 0.9))
p
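The 50% and 90% thresholds can also be marked on the curve itself, using the counts computed above; a minimal sketch:
# Add dashed vertical lines at 1,101 and 14,678 unique words, the counts
# found for 50% and 90% coverage.
p + geom_vline(xintercept = c(1101, 14678), linetype = "dashed")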
We found that coverage increases sharply, from 0 to about 0.75, as words are added, and that the rate of increase slows considerably once coverage reaches 0.9.