Executive Summary

Loading Data

## [1] "./final/en_US/en_US.blogs.txt"   "./final/en_US/en_US.news.txt"   
## [3] "./final/en_US/en_US.twitter.txt"

Exploratory Analysis

Word count of each file

## $blogs
## [1] 37334131
## 
## $news
## [1] 34372530
## 
## $twitter
## [1] 30373583

Line count of each file

## $blogs
## [1] 899288
## 
## $news
## [1] 1010242
## 
## $twitter
## [1] 2360148

Tokenization

  • Given the large size of the data, we tokenize a 10% random sample of the lines from all three files.
  • An overview of the resulting tokens is shown below.
## Tokens consisting of 426,967 documents.
## text1 :
##  [1] "It"         "wasn't"     "a"          "rebellion"  "The"       
##  [6] "Metis"      "were"       "not"        "insurgents" "They"      
## [11] "never"      "were"      
## [ ... and 122 more ]
## 
## text2 :
## [1] "04"      "Toot"    "Toot"    "Tootsie" "Styne"   "Green"   "Cahn"   
## [8] "04"      "22"     
## 
## text3 :
##  [1] "Somehow" "I"       "knew"    "Millar"  "would"   "through" "Cowboy" 
##  [8] "Up"      "in"      "there"  
## 
## text4 :
## [1] "I'm"      "watch"    "Caged"    "Are"      "you"      "watching" "it"      
## [8] "to"      
## 
## text5 :
##  [1] "Also"   "it"     "would"  "appear" "that"   "Tetley" "will"   "no"    
##  [9] "longer" "be"     "sold"   "at"    
## [ ... and 76 more ]
## 
## text6 :
##  [1] "I"       "must"    "be"      "tired"   "I"       "just"    "carried"
##  [8] "my"      "cup"     "of"      "#coffee" "with"   
## [ ... and 4 more ]
## 
## [ reached max_ndoc ... 426,961 more documents ]
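
The sampling step described above can be sketched in base R as follows. This is a minimal illustration, not the report's own code: the toy `lines` vector and the helper name `sample_lines` are assumptions made for the example, and the real input is the combined set of blog, news and Twitter lines.

```r
# Hypothetical toy input standing in for the combined corpus lines.
lines <- sprintf("line %d", 1:50)

# Hypothetical helper: draw a reproducible fraction p of the documents.
sample_lines <- function(docs, p = 0.1, seed = 0) {
  set.seed(seed)                    # fix the seed so the sample is reproducible
  size <- floor(length(docs) * p)   # whole number of documents to keep
  sample(docs, size = size)         # draw without replacement
}

sampled <- sample_lines(lines, p = 0.1)
length(sampled)
```

Rounding the sample size down explicitly makes the intended number of documents clear when the corpus size is not a multiple of 10.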

Document-feature matrix

  • Next, we construct a document-feature matrix from the tokens.
  • Below, we show the first 6 and the last 6 of the 1,000 most frequent words.
  • The top 1,000 words cover approximately 70% of all word instances in the sample.
  • We also plot the word frequencies; the plot shows that a small number of top words account for a large proportion of the data.
  • Finally, we plot a word cloud of the data using the textplot_wordcloud() function.
##     topwords proportion    cum_sum
## the   476527 0.04686830 0.04686830
## to    275724 0.02711854 0.07398683
## and   241513 0.02375375 0.09774058
## a     239381 0.02354406 0.12128464
## of    200902 0.01975950 0.14104414
## i     164829 0.01621158 0.15725572
##             topwords   proportion   cum_sum
## chicken         1087 0.0001069107 0.6990524
## development     1086 0.0001068124 0.6991592
## deep            1086 0.0001068124 0.6992660
## photos          1085 0.0001067140 0.6993727
## plus            1083 0.0001065173 0.6994792
## restaurant      1083 0.0001065173 0.6995857
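
The proportion and cum_sum columns above are a running share of all word instances. The sketch below reproduces the same calculation in base R on a toy vector; the `words` data and the `coverage` name are assumptions for illustration only, and the report itself derives the counts from the document-feature matrix.

```r
# Hypothetical toy tokens; the report uses the sampled corpus instead.
words <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")

# Frequency of each word, most frequent first.
counts <- sort(table(words), decreasing = TRUE)

coverage <- data.frame(
  count      = as.vector(counts),
  proportion = as.vector(counts) / sum(counts),  # share of all word instances
  row.names  = names(counts)
)
coverage$cum_sum <- cumsum(coverage$proportion)  # cumulative share covered so far

head(coverage, 2)
```

In this toy example the two most frequent words cover 5 of the 8 instances, so the second cum_sum value is 0.625.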

n-grams (n = 2)

  • Finally, we generate n-grams using the tokens_ngrams() function.
  • Given the large size of the data, we generate bigrams only.
  • The frequency distribution of bigrams looks similar to the word frequency plot, but the top 1,000 bigrams cover only about 18% of all bigram instances.
##         topngrams  proportion     cum_sum
## of_the      43165 0.004431541 0.004431541
## in_the      40756 0.004184220 0.008615761
## to_the      21468 0.002204015 0.010819776
## for_the     20205 0.002074349 0.012894125
## on_the      19602 0.002012442 0.014906567
## to_be       16101 0.001653011 0.016559578
##            topngrams   proportion   cum_sum
## just_to          617 6.334439e-05 0.1771516
## would_love       617 6.334439e-05 0.1772149
## people_to        616 6.324172e-05 0.1772782
## a_bad            616 6.324172e-05 0.1773414
## we_got           616 6.324172e-05 0.1774046
## was_on           616 6.324172e-05 0.1774679
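
To make the bigram construction concrete, here is a small base R sketch that pairs each token with its successor and joins them with "_", the same form as the features shown above. The helper `make_bigrams` and the toy `docs` list are hypothetical; the report itself relies on tokens_ngrams() rather than this code.

```r
# Hypothetical helper: join each token with the token that follows it.
make_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))  # too short to form a pair
  paste(head(tokens, -1), tail(tokens, -1), sep = "_")
}

# Hypothetical toy documents, already tokenized.
docs <- list(
  c("one", "of", "the", "best"),
  c("out", "of", "the", "box"),
  c("hello")
)

# Build bigrams per document so that pairs never span two documents.
bigrams <- unlist(lapply(docs, make_bigrams))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
bigram_counts[1]
```

Here "of_the" appears twice and is therefore the most frequent bigram in the toy data.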

Next Steps

Appendix: R Code

# setup
setwd("~/Desktop/Coursera"); set.seed(0)
library(tidyverse); library(quanteda); library(quanteda.textplots)
library(ngram); library(textclean); library(sentimentr)

# loading_data
processFile <- function(path){
        txts <- scan(path, what = character(), 
                     sep = "\n", blank.lines.skip = TRUE,
                     skipNul = TRUE, quiet = TRUE)
        return(txts)
}

file_paths <- list.files(path = "./final/en_US", full.names = TRUE)
file_list <- lapply(file_paths, processFile)
# list.files() returns paths in alphabetical order, so these names line up
file_names <- c("blogs", "news", "twitter")
names(file_list) <- file_names

# file paths
file_paths

# wordcount
wordcountFile <- function(file){
        n <- wordcount(file)
        n_sum <- sum(n, na.rm = TRUE)
        return(n_sum)
}
wordcount_list <- lapply(file_list, wordcountFile)
wordcount_list

# linecount
linecount_list <- lapply(file_list, length)
linecount_list

# 10% sampling and tokenizing
tokenizeFile <- function(files, p = 0.1){
        docs <- unlist(files)
        size <- floor(length(docs) * p)  # whole number of documents to sample
        docs <- sample(docs, size = size)
        docs <- replace_non_ascii(docs)
        corp <- corpus(docs)
        toks <- tokens(corp, remove_punct = TRUE)
        
        # removing bad words
        pwords <- lexicon::profanity_alvarez
        toks <- tokens_remove(toks, pattern = pwords)
        return(toks)
}

toks <- tokenizeFile(file_list, p = 0.1)
toks

# constructing a document-feature matrix
dfmat <- dfm(toks)
topwords <- topfeatures(dfmat, 1000)
df1 <- data.frame(topwords) %>%
        mutate(proportion = topwords/sum(dfmat)) %>%
        mutate(cum_sum = cumsum(proportion))
head(df1); tail(df1) # the top 1,000 words cover about 70% of all word instances
plot(topwords, main = "Top 1,000 Word Count")
textplot_wordcloud(dfmat, min_count = 1000)

# generating n-grams (n = 2)
toks_ngrams <- tokens_ngrams(toks, n = 2)
dfmat_ngrams <- dfm(toks_ngrams)
topngrams <- topfeatures(dfmat_ngrams, 1000)
df2 <- data.frame(topngrams) %>%
        mutate(proportion = topngrams/sum(dfmat_ngrams)) %>%
        mutate(cum_sum = cumsum(proportion))
head(df2); tail(df2)
plot(topngrams, main = "Top 1,000 Bigram Count")