This is the first part of the Capstone project for Coursera's Data Science and Data Science: Statistics and Machine Learning Specializations.
The final goal of the project is to develop an application that predicts the next word given some text input. The exploratory data analysis provided in this report serves as preparation for building a predictive model and a data product based on a predictive algorithm.
The training data set, containing a sample corpus of texts in several languages, can be downloaded here. This project focuses on the English texts in the data set, which come from blogs, news and Twitter.
library(R.utils); library(readr); library(data.table)
library(knitr); library(ngram); library(dplyr)
library(quanteda); library(stringi); library(tidytext)
library(ggplot2); library(plotly); library(tidyr)
library(wordcloud2); library(tidyverse); library(wordcloud)
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA")) dir.create("./data/1225_DS-CS-w2_WordsEDA")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "./data/1225_DS-CS-w2_WordsEDA/Coursera-SwiftKey.zip"
if(!file.exists(dest)){download.file(
url = url, destfile = dest, method = "curl")}
if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/final")) {unzip(dest, exdir = "./data/1225_DS-CS-w2_WordsEDA")}Within each file (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt), every line is an extract from a single post/ article/ tweet. Here is some key summaries of the three files: size, line count, word count, average number of words as well as characters per line, and length of the longest line in characters:
dir <- "./data/1225_DS-CS-w2_WordsEDA/final/en_US"
# read one source file and return its text lines plus the file size in MB
svar <- function(name) {
  path <- paste0(dir,"/en_US.",name,".txt")
  svar <- read_lines(path, skip_empty_rows = TRUE)
  size.MB <- file.info(path)$size/2^20
  list(svar, size.MB)
}
# compute line/word/character summaries for a character vector of texts
summ <- function(svar) {
  nlines <- length(svar)
  nwords <- wordcount(svar)
  nchars <- sum(nchar(svar))
  wordpl <- nwords/nlines              # average words per line
  charpl <- nchars/nlines              # average characters per line
  maxline <- max(sapply(svar, nchar))  # longest line in characters
  dt <- data.table(nlines, nwords, wordpl, charpl, maxline, nchars)
  dt
}
blogs<-svar("blogs")
twitter<-svar("twitter")
news<-svar("news")
siblogs<-blogs[[2]]; blogs<-blogs[[1]]
sitwitter <- twitter[[2]]; twitter <- twitter[[1]]
sinews <- news[[2]]; news <- news[[1]]
sblogs <- cbind(name="en_US.blogs.txt", size= siblogs, summ(blogs))
stwitter <- cbind(name="en_US.twitter.txt", size= sitwitter, summ(twitter))
snews <- cbind(name="en_US.news.txt", size= sinews, summ(news))
sdata <- data.table(name="aggregated data", size= siblogs+sitwitter+sinews,
nlines=sblogs$nlines+stwitter$nlines+snews$nlines,
nwords=sblogs$nwords+stwitter$nwords+snews$nwords,
wordpl = (sblogs$nwords+stwitter$nwords+snews$nwords)/(
sblogs$nlines+stwitter$nlines + snews$nlines),
charpl = (sblogs$nchars+stwitter$nchars+snews$nchars)/(
sblogs$nlines+stwitter$nlines + snews$nlines),
maxline = max(sblogs$maxline, stwitter$maxline, snews$maxline))
sfull<- rbind(sblogs[,1:7],stwitter[,1:7],snews[,1:7], sdata)
kable(sfull, digits = 4, caption = "Table 1: Summary of the original data",
col.names = c("source","file size (MB)","line count","word count",
"avg words/line","avg chars/line", "longest line (chars)"))| source | file size (MB) | line count | word count | avg words/line | avg chars/line | longest line (chars) |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.4242 | 899288 | 37334131 | 41.5152 | 229.9870 | 40833 |
| en_US.twitter.txt | 159.3641 | 2360148 | 30373543 | 12.8693 | 68.6805 | 140 |
| en_US.news.txt | 196.2775 | 1010242 | 34372530 | 34.0241 | 201.1628 | 11384 |
| aggregated data | 556.0658 | 4269678 | 102080204 | 23.9082 | 134.0016 | 40833 |
The original files are quite large: \(0.899\) million blog records, \(1.01\) million news items, and \(2.36\) million tweets. The line counts differ, but each file contains roughly \(30\)-\(37\) million words in total. The blogs and news data look similar in average number of words and characters per line (blogs are slightly longer); Twitter lines, as expected, are much shorter.
The longest line in the en_US.blogs.txt file contains 40833 characters, while the longest line in en_US.twitter.txt contains the expected 140 characters, matching Twitter's character limit at the time (content archived from heliohost.org on September 30, 2016).
The next task is to perform exploratory analysis of the data to get familiar with it and to understand the underlying features and relationships. However, processing the original files at their full size pushed up against R's memory limits and ran slowly. To speed up the analysis, \(15\%\) of the lines from each file are sampled for the purpose of this report.
15% sample from each source
blogs <- tibble(text = blogs)
news <- tibble(text = news)
twitter <- tibble(text = twitter)
set.seed(123)
rate <- 0.15
blogs.sample <- blogs %>% slice_sample(., n=nrow(blogs)*rate)
twitter.sample <- twitter %>% slice_sample(., n=nrow(twitter)*rate)
news.sample <- news %>% slice_sample(., n=nrow(news)*rate)
data.sample <- bind_rows(mutate(blogs.sample, source="blogs"),
mutate(twitter.sample, source= "twitter"),
mutate(news.sample, source="news"))
data.sample$source <- as.factor(data.sample$source)
Sampled data summary (analogous to that for the original data):
sblogs <- cbind(name="blogs", summ(blogs.sample$text))
stwitter <- cbind(name="twitter", summ(twitter.sample$text))
snews <- cbind(name="news", summ(news.sample$text))
sdata <- cbind(name="aggregated sample", summ(data.sample$text))
sfull<- rbind(sblogs[,1:6],stwitter[,1:6],snews[,1:6], sdata[,1:6])
kable(sfull, digits = 4, caption = "Table 2: Summary of the sampled data",
col.names = c("source", "line count","word count",
"avg words/line","avg chars/line", "longest line (chars)"))| source | line count | word count | avg words/line | avg chars/line | longest line (chars) |
|---|---|---|---|---|---|
| blogs | 134893 | 5592164 | 41.4563 | 229.6845 | 12409 |
| twitter | 354022 | 4559255 | 12.8785 | 68.7258 | 140 |
| news | 151536 | 5142591 | 33.9364 | 200.6525 | 2042 |
| aggregated sample | 640451 | 15294010 | 23.8801 | 133.8422 | 12409 |
rm(blogs, twitter, news, siblogs, sitwitter, sinews, sblogs, stwitter, snews,
   sfull, dir, rate)
The sample used for further analysis consists of 640451 lines and 15294010 words. The other characteristics are similar to those of the original data (except for the longest line, which is not surprising).
Cleaning is kept separate from tokenizing so that the unnest_tokens function can afterwards be used for both words (unigrams) and n-grams.
First, the text needs to be split into sentences, since the end of one sentence should probably not serve as a predictor for the start of the next (as in finish sentence_start sentence); a sketch of this step is shown after the cleaning code below. Then, pre-process each sentence:
Remove URLs (anything containing ://), hashtags (#...) and mentions (@...); keep apostrophes, since e.g. im/ i'm and dont/ don't are distinct words \(=>\) don't remove apostrophes.
sent.tidy <- chartr("’‘`´", "''''", data.sample$text) # unify apostrophes
sent.tidy <- gsub("[[:blank:]]#[^[:blank:]]*", " ", sent.tidy, perl = T) #hashs
sent.tidy <- gsub("[[:blank:]]@[^[:blank:]]*", " ", sent.tidy, perl = T) #at signs (@)
sent.tidy <- gsub("(https?)?://[^[:blank:]]*", " ", sent.tidy, perl = T) #urls
sent.tidy <- gsub("[^[:alnum:]']", " ", sent.tidy, perl = T) # non-alpha/numerical
sent.tidy <- gsub("[[:blank:]]+'([[:alnum:][:blank:]]+)'[[:blank:]]+", " \\1 ",
sent.tidy, perl = T) # surr apostrophes
sent.tidy <- stri_trim_both(sent.tidy) # surr blanks
sent.tidy <- gsub("[[:blank:][:digit:]+[:blank:]]", " ", sent.tidy, perl = T) # digits (note: this character class strips every digit, '+' or blank, not only standalone numbers)
sent.tidy<- gsub("[[:blank:]]{2,}", " ", sent.tidy, perl = T) # condense blanks
sent.tidy <- trimws(sent.tidy) # leading/ trailing blanks
sent.tidy <- tolower(sent.tidy)
data.sample$text <- sent.tidy
if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/sample"))
{write_csv(data.sample, "./data/1225_DS-CS-w2_WordsEDA/sample.csv")}NOTE: this cleanup is not the most reliable one, since maybe not all URLs have been stripped, or something is falsely matched by the URL pattern, the texts can still contain typos, repeated letters etc. Errors left should be marginal though, and not hurt a future predictive model too much.
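The splitting into sentences mentioned above is not actually performed in the cleaning chunk. A minimal sketch of how it could be done, assuming quanteda's corpus_reshape() and applied to the raw sample before the cleaning above (sentence detection relies on the punctuation that the cleaning removes); the sent.corpus object is hypothetical and not used elsewhere in this report:
# sketch only: split the raw sampled texts into one-sentence documents
sent.corpus <- corpus(data.sample)                           # the 'text' column becomes the corpus text
sent.corpus <- corpus_reshape(sent.corpus, to = "sentences") # one document per sentence
# the cleaning steps above could then be applied to texts(sent.corpus)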
Several additional preprocessing options could be considered, each with its pros and cons:
Profanity removal
Stop-words removal
Stemming
In this report, the analysis is performed without any of these options (no stemming, no profanity or stop-words removal), while functions implementing them are provided (Appendix: rmoptions).
The effect of stop-words removal is also demonstrated and briefly discussed in Appendix: rmstops.
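For a quick illustration of what these options do, here is a toy example (the sentence and the profanity list are made up for demonstration; tokens(), tokens_remove(), tokens_wordstem() and stopwords() are quanteda functions):
toks.demo <- tokens("the cats are running faster than the dogs")
tokens_remove(toks.demo, stopwords("english"))       # stop-words removal keeps: "cats" "running" "faster" "dogs"
tokens_remove(toks.demo, c("badword1", "badword2"))  # profanity removal works the same way, given a word list
tokens_wordstem(toks.demo)                           # stemming: "cats" -> "cat", "running" -> "run", "dogs" -> "dog"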
The next step is to tokenize the data, that is, to separate it into smaller units such as words or phrases: n-grams (contiguous sequences of n items). An n-gram of size 1 is referred to as a unigram (a single word), size 2 is a bigram, and size 3 is a trigram.
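As a toy illustration (the sentence is made up; tokens_ngrams() joins the items with "_" by default):
toy <- tokens("thanks for the follow")
tokens_ngrams(toy, 1)  # unigrams: "thanks" "for" "the" "follow"
tokens_ngrams(toy, 2)  # bigrams:  "thanks_for" "for_the" "the_follow"
tokens_ngrams(toy, 3)  # trigrams: "thanks_for_the" "for_the_follow"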
# data.sample <- read_csv("./data/1225_DS-CS-w2_WordsEDA/sample.csv")
# words.tidy<- data.sample%>% unnest_tokens(word,text) - just example: all words (w\o numbers(?) & apostrophes), count: nrow(words.tidy)
data.corpus<- corpus(data.sample)
toks1<- tokens(data.corpus) # to look at a whole sentence: data.corpus[["text640451"]]/ toks1[["text640451"]], count: sum(ntoken(data.corpus))/ sum(ntoken(toks1))
# toks1<- toksrp(toks1) # remove profanities
# toks1 <- toksrs(toks1) # remove stop-words
toks2 <- tokens_ngrams(toks1, 2)
toks3 <- tokens_ngrams(toks1, 3)
astr<- dfm(toks1, groups = docvars(toks1, "source")) # document feature matrix (for plots)
astr2<- dfm(toks2, groups = docvars(toks1, "source"))
astr3<- dfm(toks3, groups = docvars(toks1, "source"))
# astr<- stemm(astr); astr2<- stemm(astr2); astr3<- stemm(astr3) # stemming
Just take a look at what the cleaned data looks like:
data.sample
# A tibble: 640,451 x 2
text source
<chr> <fct>
1 the bruschetta however missed the mark instead of manageable two bite… blogs
2 walden pond mt rainier big sur everglades and so forth blogs
3 despite laws banning cell phones while driving and increased awarenes… blogs
4 ghosts and goblins blogs
5 now i can write in specific post information for each day of the week… blogs
6 but trying to pin photos to muslin walls would be a bit too tricky blogs
7 she and rosso had been fruiting around because they are bored and pen… blogs
8 lastly has anyone seen the new harry potter movie if you're planning … blogs
9 while i generally enjoyed this movie there were a few things that did… blogs
10 accessories martha stewart floral border punch marvy notched corner p… blogs
# … with 640,441 more rows
Each entry is on a separate line, and there are no uppercase letters, extra characters, numbers or punctuation left, just apostrophes.
Coverage of Corpus by unigrams
There are 201550 unique words in the cleaned corpus. Count their frequency, and then see how many of them are required to cover \(50\%-80\%-90\%\) of the whole sample.
# frequency table with cumulative coverage, plus the number of n-grams needed for 50/80/90% coverage
tstat<- function(dfm) {
tstat<-textstat_frequency(dfm)
tstat <- tibble(tstat) %>%
transmute(ngram=feature, frequency, nwords=1:nrow(tstat),
coverage = cumsum(frequency)/sum(frequency))
cover50 <- min(which(tstat$coverage>=0.5))
cover80 <- min(which(tstat$coverage>=0.8))
cover90 <- min(which(tstat$coverage>=0.9))
list(tstat, c(cover50, cover80, cover90))
}
tstat1 <- tstat(astr); cover1 <- tstat1[[2]]; tstat1<-tstat1[[1]]
ggplot(tstat1, aes(x=nwords, y=coverage)) +
geom_line(colour="cornflowerblue", size=1.3) +
geom_vline(aes(xintercept=cover1[1],colour="50%"),
linetype="longdash", size=1.1) +
geom_vline(aes(xintercept=cover1[2],colour="80%"),
linetype="longdash", size=1.1) +
geom_vline(aes(xintercept=cover1[3],colour="90%"),
linetype="longdash", size=1.1) +
scale_color_manual(name=NULL, values=c(`50%`="brown", `80%`="purple",
`90%`="violet"))+
scale_x_continuous(limits =c(NA, 10000))+
scale_y_continuous(labels = scales::percent) +
theme(axis.line = element_line(size = 3, colour = "grey80")) +
labs(x = "words count", y = "cumulative %" ) +
ggtitle("Corpus coverage by words (unigrams)")Only 142 words (0.07\(\%\)) are required to fill 50% of the sample corpus. It takes 2304 (1.14\(\%\)) words for 80% coverage, and 7094 (3.52\(\%\)) - for 90% coverage.
Frequency
The higher the frequency of words or word combinations in the corpus, the higher the probability that a user will enter them in a future application. So, visualize (interactively: move your mouse over the cloud) the most common words in the data sample (word size/color represents frequency):
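As a rough illustration of this idea, the unigram frequency table tstat1 computed above can be turned into naive (maximum-likelihood) probability estimates; the prob column is introduced only for this sketch and not used elsewhere:
tstat1 %>%
  mutate(prob = frequency / sum(frequency)) %>%  # relative frequency as a naive probability estimate
  select(ngram, frequency, prob) %>%
  slice_head(n = 5)                              # the five words a user is most likely to type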
top1.100<- textstat_frequency(astr, n=100)
wordcloud2(top1.100, size = 1.5, backgroundColor = "lightsteelblue")
The most frequent words are ‘the’, ‘to’, ‘and’, ‘a’, ‘of’, which is not surprising since stop-words haven’t been removed.
Look now at the frequency of the 25 most common words in the context of groups (blogs, news, twitter):
gfreq<- function(dfm) {
top.25<- textstat_frequency(dfm, n=25, groups = docnames(dfm))
colnames(top.25)[5]<- "source"
ggplot(top.25, aes(reorder(feature, frequency), frequency,
fill=source)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
labs(x = "most frequent n-grams", y = "n-gram frequency")+
facet_wrap(~ source) +
theme(axis.line = element_line(size = 3, colour = "grey80"),
axis.text.x=element_text(angle=20, vjust=1, hjust=0))
}
gfreq(astr)
The top-word frequencies look broadly similar across sources, particularly for news and blogs.
Coverage
tstat2 <- tstat(astr2); cover2 <- tstat2[[2]]; tstat2<-tstat2[[1]]
tstat3 <- tstat(astr3); cover3 <- tstat3[[2]]; tstat3<-tstat3[[1]]
ngrams <- c("unigrams", "bigrams", "trigrams")
total <- c(sum(astr), sum(astr2), sum(astr3))
uni <- c(ncol(astr), ncol(astr2), ncol(astr3))
cover50 <- round(c(100*cover1[1]/ncol(astr), 100*cover2[1]/ncol(astr2),
100*cover3[1]/ncol(astr3)),2)
cover80 <- round(c(100*cover1[2]/ncol(astr), 100*cover2[2]/ncol(astr2),
100*cover3[2]/ncol(astr3)),2)
cover90 <- round(c(100*cover1[3]/ncol(astr), 100*cover2[3]/ncol(astr2),
100*cover3[3]/ncol(astr3)),2)
tab<- cbind(ngrams, total, uni, cover50, cover80, cover90)
kable(tab, caption = "Table 3: n-grams comparison",
col.names = c("n-grams", "total","unique", "cover50, %", "cover80, %",
"cover90, %"))| n-grams | total | unique | cover50, % | cover80, % | cover90, % |
|---|---|---|---|---|---|
| unigrams | 15104877 | 201550 | 0.07 | 1.14 | 3.52 |
| bigrams | 14464677 | 3491872 | 1.1 | 22.07 | 58.58 |
| trigrams | 13828228 | 8880303 | 22.14 | 68.86 | 84.43 |
The total number of n-grams decreases from unigrams to trigrams, while the number of unique n-grams increases. The share of unique n-grams required for coverage grows significantly: covering 90% of all unigram occurrences requires only 3.52\(\%\) of the unique unigrams, whereas for trigrams it requires 84.43\(\%\) of them, i.e. almost all.
Explore the frequency of bigrams/ trigrams in total, and in the context of groups (blogs, news, twitter).
Bigrams frequency
top2.100<- textstat_frequency(astr2, n=150)
wordcloud(top2.100$feature, top2.100$frequency, scale=c(4,.6),
colors=brewer.pal(8, "Dark2"))
gfreq(astr2)
Trigrams frequency
top3.100<- textstat_frequency(astr3, n=70)
wordcloud(top3.100$feature, top3.100$frequency, scale=c(3,.4),
colors=brewer.pal(8, "Dark2"))
gfreq(astr3)
As n increases, so do both the diversity of n-grams and the differences between the sources (blogs, news, twitter).
For building an accurate and fast predictive model, it seems sensible to rely on the n-gram frequency tables explored above, among other things.
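As an illustration of where this analysis leads, a hypothetical predict_next() helper (a sketch only, not part of the report's code) could look up the most frequent trigram continuation of the last two words in the tstat3 table built above and back off to the bigram table tstat2. It relies on textstat_frequency() returning n-grams sorted by decreasing frequency and on tokens_ngrams() joining words with "_":
# sketch of a simple frequency-based back-off predictor (illustration only)
predict_next <- function(prev, bi = tstat2, tri = tstat3) {
  w <- tolower(strsplit(trimws(prev), "[[:blank:]]+")[[1]])
  n <- length(w)
  if (n >= 2) {                                              # try trigrams: prefix "w1_w2_"
    hit <- tri$ngram[startsWith(tri$ngram, paste0(w[n - 1], "_", w[n], "_"))]
    if (length(hit) > 0) return(sub(".*_", "", hit[1]))
  }
  hit <- bi$ngram[startsWith(bi$ngram, paste0(w[n], "_"))]   # back off to bigrams: prefix "w_"
  if (length(hit) > 0) return(sub(".*_", "", hit[1]))
  "the"                                                      # fall back to the most frequent unigram
}
# predict_next("one of")   # returns the most frequent word observed after "one of"
A real application would likely also prune rare n-grams (as the coverage figures above suggest) and use a more compact storage format, but the frequency tables already contain the core information.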
Appendix: rmoptions
Ways of filtering profanities/ stop-words, and stemming
# load profanity file
loader <- function() {
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA")) dir.create("./data/1225_DS-CS-w2_WordsEDA")
if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA/ignore"))
dir.create("./data/1225_DS-CS-w2_WordsEDA/ignore")
url <- "https://www.freewebheaders.com/download/files/full-list-of-bad-words_csv-file_2021_01_18.zip"
dest <- "./data/1225_DS-CS-w2_WordsEDA/full-list-of-bad-words_csv-file_2021_01_18.zip"
if(!file.exists(dest)){download.file(
url = url, destfile = dest, method = "curl")}
if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/ignore/full-list-of-bad-words_csv-file_2021_01_18.csv")) {
unzip(dest, exdir = "./data/1225_DS-CS-w2_WordsEDA/ignore")}
####### another option #######: a publicly kept profanity list from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
# url <- "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/5faf2ba42d7b1c0977169ec3611df25a3c08eb13/en"
# dest <- "./data/1225_DS-CS-w2_WordsEDA/ignore"
# if(!file.exists(dest)){download.file(url = url, destfile = dest, method = "curl")}
#######
}
# remove profanities
toksrp <- function(toks){
library(readr); library(quanteda)
loader()
ignore <- read_csv(
"./data/1225_DS-CS-w2_WordsEDA/ignore/full-list-of-bad-words_csv-file_2021_01_18.csv",
col_names = FALSE)
####### with another option #######:
# ignore <- read_tsv("./data/1225_DS-CS-w2_WordsEDA/ignore", col_names = FALSE)
#######
ignore <- ignore$X1
toksrp <- tokens_remove(toks, ignore)
toksrp
}
# remove stop-words
toksrs <- function(toks) {
library(quanteda)
toksrs <- tokens_remove(toks, stopwords("english"))
toksrs
}
# stemming
stemm <- function(dfm) {
library(quanteda)
stemm <- dfm(dfm, stem=TRUE)
stemm
}
Appendix: rmstops
Stop-words removal in brief
The most common words except for stop-words:
toks01 <- toksrs(toks1)
toks02 <- tokens_ngrams(toks01, 2)
astr0<- dfm(toks01, groups = docvars(toks01, "source"))
astr02<- dfm(toks02, groups = docvars(toks01, "source"))
top01.100<- textstat_frequency(astr0, n=120)
wordcloud(top01.100$feature, top01.100$frequency, scale=c(3,.6),
colors=brewer.pal(8, "Dark2"))
Frequency of the 25 most common bigrams (excluding stop-words) in the context of groups:
gfreq(astr02)
For comparison with the variant where stop-words are kept:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] wordcloud_2.6 RColorBrewer_1.1-2 forcats_0.5.0 stringr_1.4.0
[5] purrr_0.3.4 tibble_3.0.4 tidyverse_1.3.0 wordcloud2_0.2.1
[9] tidyr_1.1.2 plotly_4.9.2.1 ggplot2_3.3.2 tidytext_0.2.6
[13] stringi_1.5.3 quanteda_2.1.2 dplyr_1.0.2 ngram_3.0.4
[17] knitr_1.30 data.table_1.13.4 readr_1.4.0 R.utils_2.10.1
[21] R.oo_1.24.0 R.methodsS3_1.8.1
loaded via a namespace (and not attached):
[1] httr_1.4.2 jsonlite_1.7.2 viridisLite_0.3.0
[4] modelr_0.1.8 RcppParallel_5.0.2 assertthat_0.2.1
[7] highr_0.8 cellranger_1.1.0 yaml_2.2.1
[10] pillar_1.4.7 backports_1.2.1 lattice_0.20-41
[13] glue_1.4.2 digest_0.6.27 rvest_0.3.6
[16] colorspace_2.0-0 htmltools_0.5.1.1 Matrix_1.2-18
[19] pkgconfig_2.0.3 ISOcodes_2020.12.04 broom_0.7.2
[22] haven_2.3.1 scales_1.1.1 farver_2.0.3
[25] generics_0.1.0 usethis_2.0.0 ellipsis_0.3.1
[28] withr_2.3.0 lazyeval_0.2.2 cli_2.2.0
[31] magrittr_2.0.1 crayon_1.3.4 readxl_1.3.1
[34] evaluate_0.14 stopwords_2.1 tokenizers_0.2.1
[37] janeaustenr_0.1.5 fs_1.5.0 fansi_0.4.1
[40] SnowballC_0.7.0 xml2_1.3.2 tools_4.0.3
[43] hms_0.5.3 lifecycle_0.2.0 munsell_0.5.0
[46] reprex_0.3.0 compiler_4.0.3 rlang_0.4.9
[49] grid_4.0.3 rstudioapi_0.13 htmlwidgets_1.5.2
[52] labeling_0.4.2 rmarkdown_2.5 gtable_0.3.0
[55] DBI_1.1.0 R6_2.5.0 lubridate_1.7.9.2
[58] utf8_1.1.4 fastmatch_1.1-0 Rcpp_1.0.5
[61] vctrs_0.3.5 dbplyr_2.0.0 tidyselect_1.1.0
[64] xfun_0.19