The goal of this report is to conduct an exploratory analysis of the dataset and to set out the plan for the final Shiny app.
We use the quanteda package to process and analyse the text data. The first step is to load the three text files, which correspond to the blog, news, and Twitter content in the en_US locale. We then create three quanteda corpora, one for each text dataset. A corpus serves as a repository of the original text, together with some metadata, for later use. The basic statistics of these files can be extracted from the corresponding corpora.
We combine the data loading and corpora creation into a single step as shown below:
require(quanteda)
require(readtext)
require(stringi)
require(ggplot2)
## Build the paths to the three en_US data files
path = "~/temp/JHU_DS/SwiftData/final/"
locale = c("de_DE", "en_US", "fi_FI", "ru_RU")
filetype = c(".blogs.txt", ".news.txt", ".twitter.txt", ".twitter_small.txt")
blog = paste(path, locale[2], "/", locale[2], filetype[1], sep="")
news = paste(path, locale[2], "/", locale[2], filetype[2], sep="")
twitter = paste(path, locale[2], "/", locale[2], filetype[3], sep="")
## Read each file and split it into lines, so that every line becomes one document
corp_en_US_b <- corpus(stri_split_lines1(readtext(blog)))
corp_en_US_n <- corpus(stri_split_lines1(readtext(news)))
corp_en_US_t <- corpus(stri_split_lines1(readtext(twitter)))
The file sizes and the sizes of the corresponding corpora are shown below:
sizes_corp <- c(format(object.size(corp_en_US_b), units="MB"),
                format(object.size(corp_en_US_n), units="MB"),
                format(object.size(corp_en_US_t), units="MB"))
sizes_file <- c(utils:::format.object_size(file.size(blog), "auto"),
                utils:::format.object_size(file.size(news), "auto"),
                utils:::format.object_size(file.size(twitter), "auto"))
sizes_all <- matrix(c(sizes_file, sizes_corp), nrow = 2, byrow = TRUE,
                    dimnames = list(c("File", "Corpus"), c("Blog", "News", "Twitter")))
sizes_all
## Blog News Twitter
## File "200.4 Mb" "196.3 Mb" "159.4 Mb"
## Corpus "302 Mb" "310.6 Mb" "445.3 Mb"
The total size of these three corpora is quite large and poses a challenge for further analysis if we want to fit the data into a Shiny app with a 1 GB memory limit. The processing time may also be too long for such large objects. We will therefore pre-process the data in later steps to trim down memory consumption.
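One way to keep memory down from the start (a sketch, not used in this report) would be to sample a fraction of the lines while reading, instead of loading the full files first; the read_sample() helper below is purely illustrative:
## Illustrative helper: read a file and keep only a random fraction of its lines
read_sample <- function(file, fraction = 0.1, seed = 2087) {
  set.seed(seed)
  lines <- readLines(file, encoding = "UTF-8", skipNul = TRUE)
  corpus(sample(lines, round(length(lines) * fraction)))
}
# corp_blog_small <- read_sample(blog)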
We take a look at some basic statistics of the three corpora:
mtx_bstat <- matrix(nrow = 3, ncol = 3)
colnames(mtx_bstat) <- c("Number of articles", "Longest article (characters)", "Shortest article (characters)")
rownames(mtx_bstat) <- c("Blog", "News", "Twitter")
mtx_bstat[1,] <- c(ndoc(corp_en_US_b), max(stri_length(texts(corp_en_US_b))), min(stri_length(texts(corp_en_US_b))))
mtx_bstat[2,] <- c(ndoc(corp_en_US_n), max(stri_length(texts(corp_en_US_n))), min(stri_length(texts(corp_en_US_n))))
mtx_bstat[3,] <- c(ndoc(corp_en_US_t), max(stri_length(texts(corp_en_US_t))), min(stri_length(texts(corp_en_US_t))))
mtx_bstat
##         Number of articles Longest article (characters) Shortest article (characters)
## Blog                899288                        40833                             1
## News               1010242                        11384                             1
## Twitter            2360148                          140                             2
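The article lengths above are measured in characters via stri_length(); if word counts are preferred, quanteda's ntoken() could be used instead, as in this sketch (not part of the table above):
## Sketch: word (token) counts per blog document, rather than character counts
ntok_b <- ntoken(tokens(corp_en_US_b))
c(longest = max(ntok_b), shortest = min(ntok_b))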
In this step, we create document-feature matrices from the corpora for ease of further analysis. Because of the memory and processing-speed constraints described above, we downsample each corpus to 1/10 of its original size and combine the samples into one corpus before conducting any analysis.
set.seed(2087)
## Sample 1/10 of each corpus and combine the samples into a single corpus
corp_sum <- corpus_sample(corp_en_US_b, ndoc(corp_en_US_b) %/% 10) +
            corpus_sample(corp_en_US_n, ndoc(corp_en_US_n) %/% 10) +
            corpus_sample(corp_en_US_t, ndoc(corp_en_US_t) %/% 10)
## Free the memory held by the full corpora
rm(list=c('corp_en_US_b', 'corp_en_US_n', 'corp_en_US_t'))
## List of profane words to be removed during pre-processing
profanity = c("arse", "ass", "asshole", "bastard", "bitch", "boong", "cock",
              "cocksucker", "coon", "coonnass", "crap", "cunt", "damn", "darn",
              "dick", "douche", "fag", "faggot", "fuck", "gook", "motherfucker",
              "piss", "pussy", "shit", "slut", "tits")
The size of the sampled corpus is 106 Mb.
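This figure can be checked with the same object.size() call used for the full corpora:
format(object.size(corp_sum), units="MB")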
We are going to find the 20 most popular unigrams, bigrams, and trigrams in these text data. For a more consistent analysis, we convert all words to lower case; remove numbers, punctuation, symbols, Twitter characters (@ and #), URLs beginning with http(s), and English stopwords; and stem the words. We also remove profanity in this step.
dfm_uni <- dfm(corp_sum, stem=T, remove_numbers=T, remove_punct=T,
remove_symbols=T, remove_twitter=T, remove_url=T, remove=c(stopwords("en"), profanity))
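For reference, in more recent quanteda releases (version 3 and later) the cleaning options have moved from dfm() to tokens(); the sketch below shows a roughly equivalent pipeline under that assumption (the report itself was run with the older interface, and the old remove_twitter option has no direct tokens() counterpart):
## Roughly equivalent pipeline for quanteda >= 3 (sketch only)
toks <- tokens(corp_sum, remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_url = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = c(stopwords("en"), profanity))
toks <- tokens_wordstem(toks)   # same Snowball stemmer as stem=TRUE above
dfm_uni_v3 <- dfm(toks)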
The 20 most popular words are:
topfeatures(dfm_uni, 20)
## one said just like get go time can day year make love
## 31206 30622 30274 30241 30207 26470 25669 25287 21986 21158 20592 19987
## new good know work now peopl say want
## 19421 18637 18586 17944 17895 16286 16105 16082
The graph below shows clear separations in frequency between different groups of words:
textstat_frequency(dfm_uni, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_point(stat = "identity") + coord_flip() +
labs(x = "", y = "Term Frequency")
textplot_wordcloud(dfm_uni, min_count = topfeatures(dfm_uni, 100)[100], color = RColorBrewer::brewer.pal(8,"Dark2"))
dfm_bi <- dfm(corp_sum, stem=T, remove_numbers=T, remove_punct=T,
remove_symbols=T, remove_twitter=T, remove_url=T, remove=c(stopwords("en"), profanity), ngrams=2)
The 20 most common bigrams are:
topfeatures(dfm_bi, 20)
## of_the in_the to_the for_the on_the to_be at_the and_the
## 42904 41091 20882 20085 19420 16398 14194 12560
## in_a with_the go_to is_a want_to it_was for_a i_have
## 12088 10623 10440 10006 9657 9609 9327 8718
## from_the have_a i_was it_is
## 8611 8486 8434 8253
We can see from the table above and the graphs below that the occurrences of "of the" and "in the" are at least double those of the other bigrams.
textstat_frequency(dfm_bi, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_point(stat = "identity") + coord_flip() +
labs(x = "", y = "Bigram Frequency")
textplot_wordcloud(dfm_bi, min_count = topfeatures(dfm_bi, 70)[70], color = RColorBrewer::brewer.pal(8,"Dark2"))
dfm_tri <- dfm(corp_sum, stem=T, remove_numbers=T, remove_punct=T,
remove_symbols=T, remove_twitter=T, remove_url=T, remove=c(stopwords("en"), profanity), ngrams=3)
Finally, we take a look at the statistics of the trigrams:
topfeatures(dfm_tri, 20)
## one_of_the a_lot_of thank_for_the i_want_to
## 3455 2968 2444 2155
## to_be_a go_to_be look_forward_to out_of_the
## 1921 1733 1602 1541
## be_abl_to the_end_of it_was_a as_well_as
## 1525 1503 1466 1446
## some_of_the part_of_the i_have_a the_rest_of
## 1427 1369 1213 1152
## i_have_to i_don't_know i_need_to a_coupl_of
## 1131 1118 1049 1036
The chart below partly shows where the common bigram "of the" occurs in longer phrases:
textstat_frequency(dfm_tri, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_point(stat = "identity") + coord_flip() +
labs(x = "", y = "Trigram Frequency")
textplot_wordcloud(dfm_tri, min_count = topfeatures(dfm_tri, 50)[50], color = RColorBrewer::brewer.pal(8,"Dark2"))
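As a more direct check of this observation, the trigrams containing of_the can be pulled out with dfm_select() (a quick sketch, not included in the report output above):
## Trigrams that contain the dominant bigram "of_the" (sketch)
topfeatures(dfm_select(dfm_tri, pattern = "*of_the*"), 10)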
The next phase will train n-gram models to calculate the probability of the next word given the previous n-1 words in the text. There are many methods and algorithms available, and they will be studied in the coming projects.
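As a rough preview of what such a model could look like, the sketch below builds a simple count-based next-word lookup from the bigram counts already computed; the predict_next() helper and the pure maximum-likelihood ranking are illustrative only, not the final method:
## Illustrative next-word lookup from bigram counts (not the final model)
bi_freq <- textstat_frequency(dfm_bi)
parts   <- strsplit(bi_freq$feature, "_", fixed = TRUE)
keep    <- lengths(parts) == 2                   # keep only clean two-part bigrams
parts   <- do.call(rbind, parts[keep])
bi_tab  <- data.frame(first = parts[, 1], second = parts[, 2],
                      count = bi_freq$frequency[keep], stringsAsFactors = FALSE)

predict_next <- function(word, n = 5) {
  cand <- bi_tab[bi_tab$first == word, ]         # bigrams starting with `word`
  head(cand$second[order(-cand$count)], n)       # most frequent continuations
}
# predict_next("of")   # should rank "the" first, given the counts above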
The app should provide a text input field for the user to type text, while offering 3 to 5 suggestions for the next word, based on what the user has entered so far, for the user to choose from.
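A minimal Shiny skeleton for this interaction could look like the sketch below; it assumes the illustrative predict_next() helper from the previous sketch and only displays the suggestions, without wiring the buttons back into the input:
## Minimal Shiny skeleton for the planned interaction (sketch only)
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type your text:"),
  uiOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderUI({
    txt <- input$phrase
    if (is.null(txt) || !nzchar(trimws(txt))) return(NULL)
    words <- strsplit(trimws(tolower(txt)), "\\s+")[[1]]
    preds <- predict_next(tail(words, 1), n = 5)  # illustrative helper from above
    lapply(preds, function(w) actionButton(paste0("btn_", w), label = w))
  })
}

# shinyApp(ui, server)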