Executive Summary

The aim of this report is to conduct exploratory analysis of the dataset and to state the goals for the final Shiny app. In this report, we are going to:

  1. Load and create summaries of the three files (word counts, line counts, and basic data tables).
  2. Conduct exploratory analysis, using basic plots and word clouds to illustrate features of the data.
  3. Give plans for creating a prediction algorithm and Shiny app.

Load and Summarise the Data

We use the quanteda package to process and analyse the text data. The first step is to load the 3 text files, which correspond to the blog, news, and Twitter content in the en_US locale. We then create 3 quanteda corpora, one for each text dataset. A corpus serves as a repository of the original text, together with some metadata, for future use. The basic statistics of these files can be extracted from the corresponding corpora.

Load the data and create corpora

We combine the data loading and corpus creation into a single step, as shown below:

require(quanteda)
require(readtext)
require(stringi)
require(ggplot2)
require(magrittr)   # provides the %>% pipe used in the plotting code below

## Build the paths to the en_US data files
path = "~/temp/JHU_DS/SwiftData/final/"

locale = c("de_DE", "en_US", "fi_FI", "ru_RU")
filetype = c(".blogs.txt", ".news.txt", ".twitter.txt", ".twitter_small.txt")
blog = paste(path, locale[2], "/", locale[2], filetype[1], sep="")
news = paste(path, locale[2], "/", locale[2], filetype[2], sep="")
twitter = paste(path, locale[2], "/", locale[2], filetype[3], sep="")

## Read each file and build one corpus per source, with one document per line
corp_en_US_b <- corpus(stri_split_lines1(readtext(blog)))
corp_en_US_n <- corpus(stri_split_lines1(readtext(news)))
corp_en_US_t <- corpus(stri_split_lines1(readtext(twitter)))
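As an optional sanity check (not part of the original pipeline), quanteda's summary() can preview a handful of documents from a corpus, for example:

## Preview the first few documents of the blog corpus (n limits the rows shown)
summary(corp_en_US_b, n = 3)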

File size & Corpus size

The file sizes and the sizes of corresponding corpora are shown below:

sizes_corp <- c(format(object.size(corp_en_US_b), units = "MB"),
                format(object.size(corp_en_US_n), units = "MB"),
                format(object.size(corp_en_US_t), units = "MB"))
sizes_file <- c(utils:::format.object_size(file.size(blog), "auto"),
                utils:::format.object_size(file.size(news), "auto"),
                utils:::format.object_size(file.size(twitter), "auto"))
sizes_all <- matrix(c(sizes_file, sizes_corp), nrow = 2, byrow = TRUE,
                    dimnames = list(c("File", "Corpus"), c("Blog", "News", "Twitter")))
sizes_all
##        Blog       News       Twitter   
## File   "200.4 Mb" "196.3 Mb" "159.4 Mb"
## Corpus "302 Mb"   "310.6 Mb" "445.3 Mb"

The total size of these 3 corpora is quite large and poses a challenge for further analysis if we want to fit them into a Shiny app with a 1 GB memory limit. The processing time may also be too long for such large objects. We should therefore pre-process the data in later steps to trim down memory consumption.

Basic statistics

We take a look at some basic statistics of the three corpora:

  • Number of articles
  • Length of the longest article (in characters)
  • Length of the shortest article (in characters)

## Article counts and article lengths, measured in characters with stri_length()
mtx_bstat <- matrix(nrow = 3, ncol = 3)
colnames(mtx_bstat) <- c("Number of articles", "Longest article (characters)", "Shortest article (characters)")
rownames(mtx_bstat) <- c("Blog", "News", "Twitter")
mtx_bstat[1,] <- c(ndoc(corp_en_US_b), max(stri_length(texts(corp_en_US_b))), min(stri_length(texts(corp_en_US_b))))
mtx_bstat[2,] <- c(ndoc(corp_en_US_n), max(stri_length(texts(corp_en_US_n))), min(stri_length(texts(corp_en_US_n))))
mtx_bstat[3,] <- c(ndoc(corp_en_US_t), max(stri_length(texts(corp_en_US_t))), min(stri_length(texts(corp_en_US_t))))
mtx_bstat
##         Number of articles Longest article (characters) Shortest article (characters)
## Blog                899288                         40833                             1
## News               1010242                         11384                             1
## Twitter            2360148                           140                             2

Data cleaning and sampling

In this step, we create document-feature matrices from the corpora for ease of further analysis. Because of the memory and processing-speed constraints stated above, and for ease of later analysis, we downsample each corpus to 1/10 of its original size and combine the samples into one corpus before conducting any analysis.

set.seed(2087)
## Sample 1/10 of each corpus and combine the samples into a single corpus
corp_sum <- corpus_sample(corp_en_US_b, ndoc(corp_en_US_b) %/% 10) +
            corpus_sample(corp_en_US_n, ndoc(corp_en_US_n) %/% 10) +
            corpus_sample(corp_en_US_t, ndoc(corp_en_US_t) %/% 10)
rm(list = c('corp_en_US_b', 'corp_en_US_n', 'corp_en_US_t'))   # free the full corpora
## Profanity list used to filter offensive words when building the dfms
profanity = c("arse", "ass", "asshole", "bastard", "bitch", "boong", "cock", "cocksucker", "coon", "coonnass", "crap", "cunt", "damn", "darn", "dick", "douche", "fag", "faggot", "fuck", "gook", "motherfucker", "piss", "pussy", "shit", "slut", "tits")

The size of the sampled corpus is 106 Mb.
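This figure was presumably obtained with object.size(), in the same way the full corpus sizes were measured earlier; for example:

format(object.size(corp_sum), units = "MB")   # size of the sampled, combined corpus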

Exploratory Analysis

We are going to find the 20 most frequent unigrams, bigrams, and trigrams in these text data. For a more consistent analysis, we convert all words to lower case; remove numbers, punctuation, symbols, Twitter characters (@ and #), URLs beginning with http(s), and English stopwords; and stem the words. We also remove profanity in this step.

Top 20 unigrams

## Document-feature matrix of stemmed unigrams, with cleaning and profanity filtering applied
dfm_uni <- dfm(corp_sum, stem=T, remove_numbers=T, remove_punct=T,
            remove_symbols=T, remove_twitter=T, remove_url=T, remove=c(stopwords("en"), profanity))

The 20 most frequent words are:

topfeatures(dfm_uni, 20)
##   one  said  just  like   get    go  time   can   day  year  make  love 
## 31206 30622 30274 30241 30207 26470 25669 25287 21986 21158 20592 19987 
##   new  good  know  work   now peopl   say  want 
## 19421 18637 18586 17944 17895 16286 16105 16082

The graph below shows clear separations in frequency between different groups of words:

textstat_frequency(dfm_uni, n = 20) %>% 
  ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
  geom_point(stat = "identity") + coord_flip() + 
  labs(x = "", y = "Term Frequency")

textplot_wordcloud(dfm_uni, min_count = topfeatures(dfm_uni, 100)[100], color = RColorBrewer::brewer.pal(8,"Dark2"))

Top 20 bigrams

dfm_bi <- dfm(corp_sum, stem=T, remove_numbers=T, remove_punct=T,
            remove_symbols=T, remove_twitter=T, remove_url=T, remove=c(stopwords("en"), profanity), ngrams=2)

The 20 most common bigrams are:

topfeatures(dfm_bi, 20)
##   of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the 
##    42904    41091    20882    20085    19420    16398    14194    12560 
##     in_a with_the    go_to     is_a  want_to   it_was    for_a   i_have 
##    12088    10623    10440    10006     9657     9609     9327     8718 
## from_the   have_a    i_was    it_is 
##     8611     8486     8434     8253

We can see from the table above and the graphs below that the occurrences of "of_the" and "in_the" are at least double those of the other bigrams.

textstat_frequency(dfm_bi, n = 20) %>% 
  ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
  geom_point(stat = "identity") + coord_flip() + 
  labs(x = "", y = "Bigram Frequency")

textplot_wordcloud(dfm_bi, min_count = topfeatures(dfm_bi, 70)[70], color = RColorBrewer::brewer.pal(8,"Dark2"))

Top 20 trigrams

dfm_tri <- dfm(corp_sum, stem=T, remove_numbers=T, remove_punct=T,
            remove_symbols=T, remove_twitter=T, remove_url=T, remove=c(stopwords("en"), profanity), ngrams=3)

Finally, we take a look at the trigram statistics:

topfeatures(dfm_tri, 20)
##      one_of_the        a_lot_of   thank_for_the       i_want_to 
##            3455            2968            2444            2155 
##         to_be_a        go_to_be look_forward_to      out_of_the 
##            1921            1733            1602            1541 
##       be_abl_to      the_end_of        it_was_a      as_well_as 
##            1525            1503            1466            1446 
##     some_of_the     part_of_the        i_have_a     the_rest_of 
##            1427            1369            1213            1152 
##       i_have_to    i_don't_know       i_need_to      a_coupl_of 
##            1131            1118            1049            1036

The results partly show where the common bigram "of_the" occurs in longer phrases:

textstat_frequency(dfm_tri, n = 20) %>% 
  ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
  geom_point(stat = "identity") + coord_flip() + 
  labs(x = "", y = "Trigram Frequency")

textplot_wordcloud(dfm_tri, min_count = topfeatures(dfm_tri, 50)[50], color = RColorBrewer::brewer.pal(8,"Dark2"))

Future Plan

Prediction algorithm

In the next phase, we will train n-gram models to estimate the probability of the next word given the previous n-1 words in the text. There are many methods and algorithms for this, and they will be studied in the coming projects.
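As a rough illustration (a minimal sketch, not the final algorithm), a plain maximum-likelihood estimate of the next-word probability can be read directly off the n-gram counts already stored in dfm_bi and dfm_tri, e.g. P(w3 | w1, w2) = count(w1_w2_w3) / count(w1_w2). The helper name and example words below are illustrative assumptions:

## Minimal sketch: maximum-likelihood estimate of P(w3 | w1, w2) from the
## "_"-joined n-gram counts in dfm_bi and dfm_tri; no smoothing or back-off.
next_word_prob <- function(w1, w2, w3, dfm_bi, dfm_tri) {
  tri <- paste(w1, w2, w3, sep = "_")
  bi  <- paste(w1, w2, sep = "_")
  n_tri <- sum(dfm_tri[, colnames(dfm_tri) == tri])
  n_bi  <- sum(dfm_bi[, colnames(dfm_bi) == bi])
  if (n_bi == 0) return(NA_real_)   # a real model would back off to lower-order n-grams here
  n_tri / n_bi
}

## e.g. an estimate of P("the" | "one", "of"), assuming "one_of" appears in dfm_bi
next_word_prob("one", "of", "the", dfm_bi, dfm_tri)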

Shiny app

The app should provide a text input field for the user to type text and, at the same time, offer 3 to 5 suggestions for the next word to choose from, based on what the user has entered.
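A minimal sketch of such an interface is shown below, assuming a hypothetical predict_words() function that wraps the n-gram model and returns the 3 to 5 most likely next words:

library(shiny)

## Minimal UI/server sketch of the planned app; predict_words() is a
## hypothetical helper, not an existing function.
ui <- fluidPage(
  textInput("user_text", "Type your text:"),
  uiOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderUI({
    words <- predict_words(input$user_text)          # hypothetical prediction call
    lapply(words, function(w) actionButton(paste0("sugg_", w), label = w))
  })
}

# shinyApp(ui, server)   # run once predict_words() is implemented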