Executive Summary

This is the milestone report for the Data Science Capstone course. The dataset is provided by SwiftKey, and this report focuses on the subset written in English. The tasks to accomplish are:

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

  1. What are the distributions of word frequencies and the frequencies of 2-grams and 3-grams in the dataset?
  2. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  3. How do you evaluate how many of the words come from foreign languages?
  4. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Data download and processing

We first download the data and unzip the files.

url<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "SwiftKey.zip")
unzip("SwiftKey.zip")

We then identify and read the files that will be used in the subsequent analysis.

news<-readLines(con = file("./final/en_US/en_US.news.txt"))
twitter<-readLines(con = file("./final/en_US/en_US.twitter.txt"))
## Warning in readLines(con = file("./final/en_US/en_US.twitter.txt")): line 167155
## appears to contain an embedded nul
## Warning in readLines(con = file("./final/en_US/en_US.twitter.txt")): line 268547
## appears to contain an embedded nul
## Warning in readLines(con = file("./final/en_US/en_US.twitter.txt")): line
## 1274086 appears to contain an embedded nul
## Warning in readLines(con = file("./final/en_US/en_US.twitter.txt")): line
## 1759032 appears to contain an embedded nul
blogs<-readLines(con = file("./final/en_US/en_US.blogs.txt"))
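
The warnings indicate that a few lines of the Twitter file contain embedded nul characters. If desired, passing skipNul = TRUE makes readLines() drop the nuls silently (a minimal alternative, reading the path directly rather than through an open connection):

twitter<-readLines("./final/en_US/en_US.twitter.txt", skipNul = TRUE)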

The required libraries are then loaded.

suppressPackageStartupMessages(library(stringi))
suppressPackageStartupMessages(library(dplyr))
#library(RWeka) 
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(tm))

Data overview

We have three data files from three sources: blogs, news and Twitter; they are read into objects named blogs, news and twitter. A quick overview is generated to check the size, number of lines, word count and character count of each file.

filesummary<-data.frame(Source=c("Blogs","News","Twitter"))
filesummary$sizeMB<-c(file.size("./final/en_US/en_US.blogs.txt")/1024^2,
                      file.size("./final/en_US/en_US.news.txt")/1024^2,
                      file.size("./final/en_US/en_US.twitter.txt")/1024^2)
filesummary$lines<-c(length(blogs),length(news),length(twitter))
filesummary$Words_count<-c(sum(stri_count_words(blogs)),sum(stri_count_words(news)),sum(stri_count_words(twitter)))
filesummary$Character_count<-c(sum(nchar(blogs)),sum(nchar(news)),sum(nchar(twitter)))
print(filesummary)
##    Source   sizeMB   lines Words_count Character_count
## 1   Blogs 200.4242  899288    37546239       206824505
## 2    News 196.2775 1010242    34762395       203223159
## 3 Twitter 159.3641 2360148    30093372       162096031

Data sampling

The three data files, blogs, news and twitter, contain large amounts of text. To allow quick processing, a sample of up to 5,000 lines (the smaller of 1% of the lines and 5,000) is drawn from each file.

set.seed(1234)
sample_set<-function(x){
        # sample the smaller of 1% of the lines and 5,000 lines
        sample(x, ifelse(length(x)*0.01<5000, length(x)*0.01, 5000))
}
blogs_sample<-sample_set(blogs)
news_sample<-sample_set(news)
twitter_sample<-sample_set(twitter)
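
As a quick sanity check, the number of lines in each sample can be confirmed with:

sapply(list(blogs=blogs_sample, news=news_sample, twitter=twitter_sample), length)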

Data filtering

Numbers, English stopwords, extra whitespace and punctuation are removed, and all text is converted to lower case.

filtered<-function(x){
        x<-VCorpus(VectorSource(x))                      # build a corpus from the character vector
        x<-tm_map(x, content_transformer(tolower))       # convert to lower case
        x<-tm_map(x, stripWhitespace)                    # collapse extra whitespace
        x<-tm_map(x, removePunctuation, preserve_intra_word_contractions = TRUE)
        x<-tm_map(x, removeNumbers)                      # drop numbers
        x<-tm_map(x, removeWords, stopwords("english"))  # drop English stopwords
        x                                                # return the filtered corpus
}
news_sample<-filtered(news_sample)
blogs_sample<-filtered(blogs_sample)
twitter_sample<-filtered(twitter_sample)
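
To verify the effect of the filtering, individual documents in the resulting corpora can be inspected; for example, the first filtered news document can be viewed with:

as.character(news_sample[[1]])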

Tokenization

We first define tokenizer functions (UnigramTokenizer, BigramTokenizer and TrigramTokenizer) that split a document into n-grams, and corresponding functions (unigram, bigram and trigram) that extract the n-grams and their frequencies from a corpus, sorted from most to least frequent.

UnigramTokenizer <-
  function(x) {
    unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
  }
BigramTokenizer <-
  function(x){
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  }
TrigramTokenizer <-
  function(x) {
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
  }
unigram<-function(x){
        dtm <- DocumentTermMatrix(x, control = list(tokenize=UnigramTokenizer))
        frequencylist<-colSums(as.matrix(dtm))
        ngram<-data.frame(ngram=names(frequencylist), freq=frequencylist)
        ngram<-ngram[order(-ngram$freq),]
}
bigram<-function(x){
        dtm <- DocumentTermMatrix(x, control = list(tokenize=BigramTokenizer))
        frequencylist<-colSums(as.matrix(dtm))
        ngram<-data.frame(ngram=names(frequencylist), freq=frequencylist)
        ngram<-ngram[order(-ngram$freq),]
}
trigram<-function(x){
        dtm <- DocumentTermMatrix(x, control = list(tokenize=TrigramTokenizer))
        frequencylist<-colSums(as.matrix(dtm))
        ngram<-data.frame(ngram=names(frequencylist), freq=frequencylist)
        ngram<-ngram[order(-ngram$freq),]
}
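
The three extraction functions above differ only in the tokenizer they call. As a sketch of an alternative, they could be collapsed into a single function that takes the n-gram order as an argument (ngram_freq is a hypothetical name, not used elsewhere in this report):

ngram_freq<-function(x, n){
        # build an n-gram tokenizer for the requested order
        tokenizer<-function(doc) unlist(lapply(ngrams(words(doc), n), paste, collapse = " "), use.names = FALSE)
        dtm<-DocumentTermMatrix(x, control = list(tokenize=tokenizer))
        frequencylist<-colSums(as.matrix(dtm))
        ngram<-data.frame(ngram=names(frequencylist), freq=frequencylist)
        ngram[order(-ngram$freq),]
}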

Using the functions described above, we generate unigrams, bigrams and trigrams from the samples of blogs, news and Twitter.

blogs_1<-unigram(blogs_sample)
blogs_2<-bigram(blogs_sample)
blogs_3<-trigram(blogs_sample)
news_1<-unigram(news_sample)
news_2<-bigram(news_sample)
news_3<-trigram(news_sample)
twitter_1<-unigram(twitter_sample)
twitter_2<-bigram(twitter_sample)
twitter_3<-trigram(twitter_sample)

Exploratory Analysis

N-gram Frequency Analysis

We plot the top 20 unigrams, bigrams and trigrams from each of the three sources.

g_blogs1<-ggplot(data=blogs_1[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#FF66B2") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    ggtitle( "Top 20 1-grams\nin Blogs")+
        theme(plot.title = element_text(size = 8, face = "bold"))
g_news1<-ggplot(data=news_1[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#66FFB2") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    ggtitle("Top 20 1-grams\nin News")+
        theme(plot.title = element_text(size = 8, face = "bold"))

g_twitter1<-ggplot(data=twitter_1[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#66B2FF") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    ggtitle ("Top 20 1-grams\nin Twitter")+
        theme(plot.title = element_text(size = 8, face = "bold"))

grid.arrange(g_blogs1, g_news1, g_twitter1, ncol=3)

g_blogs2<-ggplot(data=blogs_2[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#FF66B2") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    ggtitle( "Top 20 2-grams\nin Blogs")+
        theme(plot.title = element_text(size = 8, face = "bold"))
g_news2<-ggplot(data=news_2[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#66FFB2") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    ggtitle("Top 20 2-grams\nin News")+
        theme(plot.title = element_text(size = 8, face = "bold"))

g_twitter2<-ggplot(data=twitter_2[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#66B2FF") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    ggtitle ("Top 20 2-grams\nin Twitter")+
        theme(plot.title = element_text(size = 8, face = "bold"))

grid.arrange(g_blogs2, g_news2, g_twitter2, ncol=3)

g_blogs3<-ggplot(data=blogs_3[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#FF66B2") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    ggtitle( "Top 20 3-grams\nin Blogs")+
        theme(plot.title = element_text(size = 8, face = "bold"))
g_news3<-ggplot(data=news_3[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#66FFB2") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    ggtitle("Top 20 3-grams\nin News")+
        theme(plot.title = element_text(size = 8, face = "bold"))

g_twitter3<-ggplot(data=twitter_3[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
    geom_bar(stat = "identity", fill="#66B2FF") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    ggtitle ("Top 20 3-grams\nin Twitter")+
        theme(plot.title = element_text(size = 8, face = "bold"))

grid.arrange(g_blogs3, g_news3, g_twitter3, ncol=3)
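
The nine plotting chunks above are intentionally explicit. As a design note, they could also be produced by a small helper that takes the data, axis label, title and fill colour (plot_top20 is a hypothetical helper, shown only as a sketch; the unigram row would then be generated as in the grid.arrange call below):

plot_top20<-function(data, label, title, fill){
        ggplot(data=data[1:20,], aes(x=reorder(ngram, freq), y=freq)) +
                geom_bar(stat = "identity", fill=fill) + coord_flip() +
                xlab(label) + ylab("Frequency") + ggtitle(title) +
                theme(plot.title = element_text(size = 8, face = "bold"))
}
grid.arrange(plot_top20(blogs_1, "Unigram", "Top 20 1-grams\nin Blogs", "#FF66B2"),
             plot_top20(news_1, "Unigram", "Top 20 1-grams\nin News", "#66FFB2"),
             plot_top20(twitter_1, "Unigram", "Top 20 1-grams\nin Twitter", "#66B2FF"), ncol=3)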

Unique Word Coverage

How many unique words does one need in a frequency-sorted dictionary to cover 50% of all word instances in the language? We first merge the three samples together, then write a function that finds the number of unique words/terms needed to reach a given coverage rate (cover_rate).

merge_sample<-c(blogs_sample,news_sample,twitter_sample)
merge_1<-unigram(merge_sample)
wordcoverage<-function(x, cover_rate){
        nwords<-0                              # running total of word instances covered
        coverage<-cover_rate*sum(x$freq)       # target number of word instances
        for (i in 1:nrow(x)){
                if (nwords >= coverage) {return(i)}
                nwords<-nwords+x$freq[i]
        }
}
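
Because the frequency table is already sorted, the same quantity can also be computed in one vectorized step (a sketch; note it can differ by one from the loop above, which tests the threshold before adding each term's count):

wordcoverage_v<-function(x, cover_rate){
        # smallest number of terms whose cumulative frequency meets the target
        min(which(cumsum(x$freq) >= cover_rate*sum(x$freq)))
}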

The numbers of unique words it takes to cover 50% and 90% of all word instances are shown below, using the merged samples from blogs, news and Twitter.

wordcoverage(merge_1,0.5)
## [1] 1101
wordcoverage(merge_1,0.9)
## [1] 14678

Foreign Language Evaluation

We can evaluate how many of the words come from foreign languages by obtaining data sets written in other languages, building and tokenizing corpora from them, and then examining the tokens that overlap between the foreign-language corpora and our English corpus.
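
As a minimal sketch of that idea, suppose de_1 were a unigram frequency table built from a German sample in the same way merge_1 was built above (de_1 is hypothetical and not part of this report); the overlap with the English vocabulary could then be estimated, and a complementary heuristic is to count tokens containing non-ASCII characters:

# Hypothetical: de_1 would be a unigram table built from a German corpus sample
# mean(merge_1$ngram %in% de_1$ngram)
sum(!stri_enc_isascii(as.character(merge_1$ngram)))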

How to Increase the Coverage of the Corpus

We use the unigram name-and-frequency table from the merged sample as the example. A plot is generated to show the relationship between the number of unique words and the coverage rate.

merge_1 <- merge_1 %>% 
         mutate(Cumulative.Frequency = cumsum(freq))

total <- max(merge_1$Cumulative.Frequency)
merge_1 <- merge_1 %>%
         mutate(Coverage = Cumulative.Frequency/total) %>%
         mutate(`Number of Words` = 1:nrow(merge_1))
p <- ggplot(merge_1, aes(x = `Number of Words`, y = Coverage)) +
         geom_line() +
         geom_hline(yintercept = c(0.5, 0.9)) +
         ggtitle("Corpus Coverage based on Number of Unique Unigrams")
p

We find that the coverage increases sharply with the number of words up to a coverage of about 0.75, and the rate of increase slows considerably by the time the coverage reaches 0.9. This suggests coverage could be increased without greatly enlarging the dictionary by, for example, stemming words to a common root, correcting misspellings, or mapping rare and out-of-vocabulary words to frequent synonyms, so that fewer unique entries cover the same number of word instances.
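
As one illustration of the stemming idea (a sketch only; stemDocument requires the SnowballC package, and merge_stemmed and merge_1s are names introduced here, not used elsewhere in this report), the merged sample could be stemmed before rebuilding the unigram table and recomputing the coverage:

merge_stemmed<-tm_map(merge_sample, stemDocument)  # reduce words to their stems
merge_1s<-unigram(merge_stemmed)                   # rebuild the unigram frequency table
wordcoverage(merge_1s, 0.9)                        # unique stems needed for 90% coverage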