Intro

This is the first part of the Capstone project for Coursera’s Data Science and Data Science: Statistics and Machine Learning Specializations.

The final goal of the project is to develop an application that predicts the next word given some text input. The exploratory data analysis provided in this report serves as preparation for building a predictive model and a data product based on a predictive algorithm.

The training data set, containing a sample corpus of text in several languages, can be downloaded here. This project focuses on the English texts in the data set, which come from blogs, news and twitter.


  • Code chunks can be displayed by clicking the Code button

Data

Download and unzip

library(R.utils); library(readr); library(data.table)
library(knitr); library(ngram); library(dplyr)
library(quanteda); library(stringi); library(tidytext)
library(ggplot2); library(plotly); library(tidyr)
library(wordcloud2); library(tidyverse); library(wordcloud)
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA")) dir.create("./data/1225_DS-CS-w2_WordsEDA")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "./data/1225_DS-CS-w2_WordsEDA/Coursera-SwiftKey.zip"
if(!file.exists(dest)){download.file(
  url = url, destfile = dest, method = "curl")}
if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/final")) {unzip(dest, exdir = "./data/1225_DS-CS-w2_WordsEDA")}

Original files summary

Within each file (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt), every line is an extract from a single post/ article/ tweet. Here are some key summaries of the three files: size, line count, word count, average numbers of words and characters per line, and the length of the longest line in characters:

dir <- "./data/1225_DS-CS-w2_WordsEDA/final/en_US"
svar <- function(name) {
  path<- paste0(dir,"/en_US.",name,".txt")
  svar<- read_lines(path, skip_empty_rows = TRUE)
  size.MB <- file.info(path)$size/2^20
  list(svar, size.MB)
}
summ <- function(svar) {
  nlines <- length(svar)
  nwords <- wordcount(svar)
  nchars <- sum(nchar(svar))
  wordpl <- nwords/nlines
  charpl <- nchars/nlines
  maxline <- max(sapply(svar,nchar))
  dt <- data.table(nlines, nwords, wordpl, charpl, maxline, nchars)
  dt
}

blogs<-svar("blogs")
twitter<-svar("twitter")
news<-svar("news")
siblogs<-blogs[[2]]; blogs<-blogs[[1]]
sitwitter <- twitter[[2]]; twitter <- twitter[[1]]
sinews <- news[[2]]; news <- news[[1]]
sblogs <- cbind(name="en_US.blogs.txt", size= siblogs, summ(blogs))
stwitter <- cbind(name="en_US.twitter.txt", size= sitwitter, summ(twitter))
snews <- cbind(name="en_US.news.txt", size= sinews, summ(news))
sdata <- data.table(name="aggregated data", size= siblogs+sitwitter+sinews,
              nlines=sblogs$nlines+stwitter$nlines+snews$nlines,
              nwords=sblogs$nwords+stwitter$nwords+snews$nwords,
              wordpl = (sblogs$nwords+stwitter$nwords+snews$nwords)/(
                sblogs$nlines+stwitter$nlines + snews$nlines),
              charpl = (sblogs$nchars+stwitter$nchars+snews$nchars)/(
                sblogs$nlines+stwitter$nlines + snews$nlines),
maxline = max(sblogs$maxline, stwitter$maxline, snews$maxline))
sfull<- rbind(sblogs[,1:7],stwitter[,1:7],snews[,1:7], sdata)
kable(sfull, digits = 4, caption = "Table 1: Summary of the original data",
      col.names = c("source","file size (MB)","line count","word count",
                    "avg words/line","avg chars/line", "longest line (chars)"))
Table 1: Summary of the original data

source            | file size (MB) | line count | word count | avg words/line | avg chars/line | longest line (chars)
en_US.blogs.txt   |       200.4242 |     899288 |   37334131 |        41.5152 |       229.9870 |                40833
en_US.twitter.txt |       159.3641 |    2360148 |   30373543 |        12.8693 |        68.6805 |                  140
en_US.news.txt    |       196.2775 |    1010242 |   34372530 |        34.0241 |       201.1628 |                11384
aggregated data   |       556.0658 |    4269678 |  102080204 |        23.9082 |       134.0016 |                40833

The original files are quite large: \(0.899\) million blog records, \(1.01\) million news items, \(2.36\) million tweets. The line counts differ, but each file contains roughly \(30\)-\(37\) million words in total. The blogs and news data look similar in average numbers of words and characters per line (though blog lines are slightly longer); twitter, as expected, is much shorter.

  • In total, there are
    • 4269678 lines
    • 102080204 words
  • Longest line
    • overall: in the en_US.blogs.txt file, 40833 characters
    • in en_US.twitter.txt: the expected maximum of 140 characters (content archived from heliohost.org on September 30, 2016)
  • Average line length:
    • 68.6805 characters for en_US.twitter.txt
    • around \(200\) characters for the other two sources

Data Processing

The next task is to perform exploratory analysis of the data to get familiar with it and understand the underlying features and relationships. However, processing the original files at their full size pushed up against R’s memory limits and ran slowly. To speed up the analysis, \(15\%\) of the lines from each file are sampled for the purposes of this report.

Sampling

15% sample from each source

blogs <- tibble(text = blogs)
news <- tibble(text = news)
twitter <- tibble(text = twitter)
set.seed(123)
rate <- 0.15
blogs.sample <- blogs %>% slice_sample(., n=nrow(blogs)*rate)
twitter.sample <- twitter %>% slice_sample(., n=nrow(twitter)*rate)
news.sample <- news %>% slice_sample(., n=nrow(news)*rate)
data.sample <- bind_rows(mutate(blogs.sample, source="blogs"),
                         mutate(twitter.sample, source= "twitter"),
                         mutate(news.sample, source="news"))
data.sample$source <- as.factor(data.sample$source)

Sampled data summary (similar to that for original data):

sblogs <- cbind(name="blogs", summ(blogs.sample$text))
stwitter <- cbind(name="twitter", summ(twitter.sample$text))
snews <- cbind(name="news", summ(news.sample$text))
sdata <- cbind(name="aggregated sample", summ(data.sample$text))
sfull<- rbind(sblogs[,1:6],stwitter[,1:6],snews[,1:6], sdata[,1:6])
kable(sfull, digits = 4, caption = "Table 2: Summary of the sampled data",
      col.names = c("source", "line count","word count",
                    "avg words/line","avg chars/line", "longest line (chars)"))
Table 2: Summary of the sampled data

source            | line count | word count | avg words/line | avg chars/line | longest line (chars)
blogs             |     134893 |    5592164 |        41.4563 |       229.6845 |                12409
twitter           |     354022 |    4559255 |        12.8785 |        68.7258 |                  140
news              |     151536 |    5142591 |        33.9364 |       200.6525 |                 2042
aggregated sample |     640451 |   15294010 |        23.8801 |       133.8422 |                12409
rm(blogs, twitter, news, siblogs, sitwitter, sinews, sblogs, stwitter, snews,
   sfull, dir, rate)

The sample used for further analysis consists of 640451 lines and 15294010 words. Its other characteristics are similar to those of the original data (except for the longest line, which is not surprising).

Cleanup

Cleaning is separated from tokenizing so that the unnest_tokens function can then be used both for words (unigrams) and for n-grams.

First, the text needs to be split into sentences, since the end of one sentence should probably not be a predictor for the start of the next (as in finish_sentence start_sentence); a sketch of this step is shown after the list below. Then, pre-process each sentence:

  • Basic
    • convert to lower case
    • remove numbers
    • remove URLs (mainly everything containing ://)
    • remove twitter hashtags and handles (starting with # or @)
    • condense multiple white spaces
    • remove leading/ trailing white spaces
  • Punctuation/ emoticons (words are predicted, not punctuation)
    • Handling apostrophes
      • treat im/ i'm and dont/ don't as distinct words \(=>\) don't remove apostrophes
      • unify their variants (back-tick, Unicode single quotation mark, etc.)
      • make an exception for them when removing punctuation marks
    • remove all non-alphanumerics except apostrophes
    • strip surrounding apostrophes/ quotation marks (around citations)
  • Foreign Languages
    • decided not to handle foreign words in a special way
      • those frequent enough to affect the output \(=>\) are in common usage and should be included in the predictive model
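
The sentence splitting mentioned above is not part of the cleanup chunk shown next. A minimal sketch of that step, assuming the tokenizers package (available as a tidytext dependency) and not necessarily matching the exact code run for this report:

# sketch only: split each sampled line into sentences (one sentence per row),
# so that n-grams never cross a sentence boundary
library(tokenizers)
data.sample <- data.sample %>%
  mutate(text = tokenize_sentences(text)) %>% # list-column of sentences
  unnest(cols = text)                         # one sentence per row, source kept

The cleanup applied to each line of the sample: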
sent.tidy <- chartr("’‘`´", "''''", data.sample$text) # unif apostrophes
sent.tidy <- gsub("[[:blank:]]#[^[:blank:]]*", " ", sent.tidy, perl = T) #hashs
sent.tidy <- gsub("[[:blank:]]@[^[:blank:]]*", " ", sent.tidy, perl = T) #at signs (@)
sent.tidy <- gsub("(https?)?://[^[:blank:]]*", " ", sent.tidy, perl = T) #urls
sent.tidy <- gsub("[^[:alnum:]']", " ", sent.tidy, perl = T) # non-alpha/numerical
sent.tidy <- gsub("[[:blank:]]+'([[:alnum:][:blank:]]+)'[[:blank:]]+", " \\1 ",
                  sent.tidy, perl = T) # surr apostrophes
sent.tidy <- stri_trim_both(sent.tidy) # surr blanks
sent.tidy <- gsub("[[:digit:]]+", " ", sent.tidy, perl = T) # remove numbers
sent.tidy<- gsub("[[:blank:]]{2,}", " ", sent.tidy, perl = T) # condense blanks
sent.tidy <- trimws(sent.tidy) # leading/ trailing blanks
sent.tidy <- tolower(sent.tidy)
data.sample$text <- sent.tidy
if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/sample"))
  {write_csv(data.sample, "./data/1225_DS-CS-w2_WordsEDA/sample.csv")}

NOTE: this cleanup is not bulletproof: not all URLs may have been stripped, the URL pattern may occasionally match something else, and the texts can still contain typos, repeated letters etc. The remaining errors should be marginal, though, and should not hurt a future predictive model too much.

Profanity/ stop-words/ stems handling

There are pros and cons for each option.

Profanity removal

  • pros:
    • model should never suggest profanity
  • cons:
    • profanity is part of natural, living language
    • model should be able to predict based on all words used

Stop-words removal

  • pros:
    • they are not good predictors themselves
    • removing them makes it possible to focus on more important words
    • it is a usual preprocessing step in general NLP (natural language processing)
  • cons:
    • they follow other terms and so could themselves be predicted
    • they are an important part of predictive text
      • stripping them out alters syntactic structure \(=>\) affects predictions
    • the model depends on word order \(=>\) removing any part of a sentence may corrupt its predictive value
    • typically, they are removed not for prediction but for text analysis, i.e. when the aim is to extract the meaning or sentiment of a text

Stemming

  • pros:
    • it is a usual preprocessing step in general NLP (natural language processing)
  • cons:
    • the model should predict the next complete word based on the full previous word(s), not stemmed ones
    • typically, stemming is used not for prediction but for text analysis, i.e. when the aim is to extract the meaning or sentiment of a text

This report uses the variant without any of these options (no stemming, no profanity or stop-words removal), while functions to apply them are provided (Appendix: rmoptions).

Also, the influence of stop-words removal is demonstrated and briefly discussed in Appendix: rmstops.

Tokenization/ obtaining n-grams

The next step is to tokenize the data, that is, to separate it into smaller units such as words or phrases: n-grams (contiguous sequences of n items). An n-gram of size 1 is referred to as a unigram (a single word), size 2 is a bigram, and size 3 is a trigram.

# data.sample <- read_csv("./data/1225_DS-CS-w2_WordsEDA/sample.csv")
# words.tidy<- data.sample%>% unnest_tokens(word,text) - just example: all words (w\o numbers(?) & apostrophes), count: nrow(words.tidy)
data.corpus<- corpus(data.sample)
toks1<- tokens(data.corpus) # to look at a whole sentence: data.corpus[["text640451"]]/ toks1[["text640451"]], count: sum(ntoken(data.corpus))/ sum(ntoken(toks1))
# toks1<- toksrp(toks1) # remove profanities
# toks1 <- toksrs(toks1) # remove stop-words
toks2 <- tokens_ngrams(toks1, 2)
toks3 <- tokens_ngrams(toks1, 3)
astr<- dfm(toks1, groups = docvars(toks1, "source")) # document feature matrix (for plots)
astr2<- dfm(toks2, groups = docvars(toks1, "source"))
astr3<- dfm(toks3, groups = docvars(toks1, "source"))
# astr<- stemm(astr); astr2<- stemm(astr2); astr3<- stemm(astr3) # stemming

Exploratory data analysis

Just take a look at what the cleaned data looks like:

data.sample
# A tibble: 640,451 x 2
   text                                                                   source
   <chr>                                                                  <fct> 
 1 the bruschetta however missed the mark instead of manageable two bite… blogs 
 2 walden pond mt rainier big sur everglades and so forth                 blogs 
 3 despite laws banning cell phones while driving and increased awarenes… blogs 
 4 ghosts and goblins                                                     blogs 
 5 now i can write in specific post information for each day of the week… blogs 
 6 but trying to pin photos to muslin walls would be a bit too tricky     blogs 
 7 she and rosso had been fruiting around because they are bored and pen… blogs 
 8 lastly has anyone seen the new harry potter movie if you're planning … blogs 
 9 while i generally enjoyed this movie there were a few things that did… blogs 
10 accessories martha stewart floral border punch marvy notched corner p… blogs 
# … with 640,441 more rows

Each sentence is on a separate line, with no uppercase letters, extra characters, numbers or punctuation, just apostrophes.

Unigrams

Coverage of Corpus by unigrams

There are 201550 unique words in the cleaned corpus. Count their frequencies, and then see how many of them are required to cover \(50\%\), \(80\%\) and \(90\%\) of the whole sample.

tstat<- function(dfm) {
  tstat<-textstat_frequency(dfm)
  tstat <- tibble(tstat) %>%
  transmute(ngram=feature, frequency, nwords=1:nrow(tstat),
            coverage = cumsum(frequency)/sum(frequency))
cover50 <- min(which(tstat$coverage>=0.5))
cover80 <- min(which(tstat$coverage>=0.8))
cover90 <- min(which(tstat$coverage>=0.9))
list(tstat, c(cover50, cover80, cover90))
}
tstat1 <- tstat(astr); cover1 <- tstat1[[2]]; tstat1<-tstat1[[1]]
ggplot(tstat1, aes(x=nwords, y=coverage)) +
  geom_line(colour="cornflowerblue", size=1.3) +
        geom_vline(aes(xintercept=cover1[1],colour="50%"),
                   linetype="longdash", size=1.1) + 
        geom_vline(aes(xintercept=cover1[2],colour="80%"),
                   linetype="longdash", size=1.1) + 
        geom_vline(aes(xintercept=cover1[3],colour="90%"),
                   linetype="longdash", size=1.1) +
  scale_color_manual(name=NULL, values=c(`50%`="brown", `80%`="purple",
                                         `90%`="violet"))+
    scale_x_continuous(limits =c(NA, 10000))+
  scale_y_continuous(labels = scales::percent) +
  theme(axis.line = element_line(size = 3, colour = "grey80")) +
    labs(x = "words count", y = "cumulative %" ) + 
    ggtitle("Corpus coverage by words (unigrams)")

Only 142 words (0.07\(\%\)) are required to cover 50% of the sample corpus. It takes 2304 words (1.14\(\%\)) for 80% coverage, and 7094 (3.52\(\%\)) for 90% coverage.

Frequency

The higher the frequency of words or word combinations in the corpus, the higher the probability that a user will enter them in the future application. So, visualize (interactively: move your mouse) the most common words in the data sample (word size and color represent frequency):

top1.100<- textstat_frequency(astr, n=100)
wordcloud2(top1.100, size = 1.5, backgroundColor = "lightsteelblue")

The most frequent words are ‘the’, ‘to’, ‘and’, ‘a’, ‘of’, which is not surprising since stop-words haven’t been removed.

Look now at the frequency of the 25 most common words in the context of groups (blogs, news, twitter):

gfreq<- function(dfm) {
 top.25<- textstat_frequency(dfm, n=25, groups = docnames(dfm))
colnames(top.25)[5]<- "source"
ggplot(top.25, aes(reorder(feature, frequency), frequency,
                       fill=source)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(x = "most frequent n-grams", y = "n-gram frequency")+
  facet_wrap(~ source) +
  theme(axis.line = element_line(size = 3, colour = "grey80"),
        axis.text.x=element_text(angle=20, vjust=1, hjust=0)) 
}
gfreq(astr)

  • word frequency declines very quickly, especially in news and blogs
  • most words are present in all three groups
  • some words are common in certain groups and rarely found in the others (you, my, was, said, etc.)

Bigrams/ trigrams

Coverage

tstat2 <- tstat(astr2); cover2 <- tstat2[[2]]; tstat2<-tstat2[[1]]
tstat3 <- tstat(astr3); cover3 <- tstat3[[2]]; tstat3<-tstat3[[1]]

ngrams <- c("unigrams", "bigrams", "trigrams")
total <- c(sum(astr), sum(astr2), sum(astr3))
uni <- c(ncol(astr), ncol(astr2), ncol(astr3))
cover50 <- round(c(100*cover1[1]/ncol(astr), 100*cover2[1]/ncol(astr2),
                                         100*cover3[1]/ncol(astr3)),2)
cover80 <- round(c(100*cover1[2]/ncol(astr), 100*cover2[2]/ncol(astr2),
                                         100*cover3[2]/ncol(astr3)),2)
cover90 <- round(c(100*cover1[3]/ncol(astr), 100*cover2[3]/ncol(astr2),
                                         100*cover3[3]/ncol(astr3)),2)
tab<- cbind(ngrams, total, uni, cover50, cover80, cover90)
kable(tab, caption = "Table 3: n-grams comparison",
      col.names = c("n-grams", "total","unique", "cover50, %", "cover80, %",
                    "cover90, %"))
Table 3: n-grams comparison

n-grams  |    total |  unique | cover50, % | cover80, % | cover90, %
unigrams | 15104877 |  201550 |       0.07 |       1.14 |       3.52
bigrams  | 14464677 | 3491872 |       1.10 |      22.07 |      58.58
trigrams | 13828228 | 8880303 |      22.14 |      68.86 |      84.43

The total number of n-grams decreases from unigrams to trigrams, while the number of unique n-grams increases. The share of unique n-grams needed for coverage rises sharply: covering 90% of all unigram occurrences requires only 3.52\(\%\) of the unique unigrams, whereas for trigrams almost all of them (84.43\(\%\)) are required.

Explore the frequency of bigrams/ trigrams in total, and in the context of groups (blogs, news, twitter).

Bigrams frequency

top2.100<- textstat_frequency(astr2, n=150)
wordcloud(top2.100$feature, top2.100$frequency, scale=c(4,.6),
          colors=brewer.pal(8, "Dark2"))

gfreq(astr2)

Trigrams frequency

top3.100<- textstat_frequency(astr3, n=70)
wordcloud(top3.100$feature, top3.100$frequency, scale=c(3,.4),
          colors=brewer.pal(8, "Dark2"))

gfreq(astr3)

As n increases, so do both the diversity of n-grams and the differences between the sources (blogs, news, twitter).

Initial conclusions

  • processing is time consuming because of the huge data set size
    • this can be mitigated by sampling
    • the downside is a decrease in prediction accuracy
  • few words cover large parts of the corpus: 2304 words (1.14\(\%\)) are required for 80% coverage
  • many words are very rare: 100164 words (49.7\(\%\)) occur only once in the data sample,
    • 165975 words (82.35\(\%\)) occur fewer than 10 times in the data sample
    • \(=>\) it may be a good idea to cut off rare features before building the model (an initial guesstimate for the cut-off is about \(80\%\)); see the sketch after this list
      • this should also considerably reduce memory requirements
  • it seems there is no need to take special care of typos/ foreign words
  • twitter lines differ significantly from those of the other two sources in average length and frequent patterns
    • they also contain many more special cases like hashtags, abbreviations, slang, emoticons
    • it might be helpful to compare models with/ without the twitter source, or with the twitter source only
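
As a quick check of the rare-word counts above, and as a sketch of how such a cut-off could be applied, the unigram frequency table tstat1 and the dfm astr built earlier can be reused (the thresholds below are illustrative assumptions, not final model choices):

# how many unigrams occur only once / fewer than 10 times in the sample
sum(tstat1$frequency == 1)  # singletons
sum(tstat1$frequency < 10)  # words seen fewer than 10 times
# one possible cut-off: keep only the top words that jointly cover ~80% of
# all word occurrences, dropping the long tail of rare features
keep <- tstat1$ngram[tstat1$coverage <= 0.8]
astr.small <- dfm_select(astr, pattern = keep, selection = "keep")
# alternatively, drop features below an absolute frequency threshold
astr.small2 <- dfm_trim(astr, min_termfreq = 10)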

Further steps

To build an accurate and fast predictive model, it seems to make sense to (among other things):

  • explore ways to improve model accuracy
    • an n-gram based model (with the n-grams built here) is a natural first step, but it has a drawback:
      • it cannot represent long-distance relationships (e.g. a verb at the end of a sentence corresponding to a noun at the beginning)
    • \(=>\) examine whether there are other models that capture such relationships better
      • learn techniques such as hidden Markov models (HMM)
    • whether and how to filter profanities/ stop-words/ punctuation/ numbers etc
      • compare model accuracy with and without them
  • investigate the issue of handling sentences:
    • whether it is important for the model to know if it is at the start or the end of a sentence
  • decide how to handle unknown words/ unobserved n-grams
    • implement a synonym dictionary
      • it could also reduce the size of the model by replacing low-frequency words with higher-frequency synonyms
    • back-off strategy: if an n-gram has not been observed, look the word up in the shorter (n-1)-gram (a sketch is given after this list)
  • deal with memory usage
    • measure resources/ memory usage/ code runtime
    • prune the model
      • use 2-/3-/4-grams, but not 5-/6-/7-grams,
        • OR find a way to handle 5-/6-/7-grams efficiently
      • remove sparse terms
  • balance accuracy and speed of the model
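
As an illustration of the back-off idea above (a sketch only, not the final model), a lookup over the n-gram frequency tables built earlier might look as follows; tstat1, tstat2 and tstat3 are the unigram/ bigram/ trigram frequency tables from the coverage sections, and predict.word is a hypothetical helper name:

# sketch only: given the last two words, return the top k candidates among
# matching trigrams; back off to bigrams, then to the most frequent unigrams
predict.word <- function(w1, w2, k = 3) {
  tri <- tstat3 %>% filter(startsWith(ngram, paste(w1, w2, "", sep = "_")))
  if (nrow(tri) > 0) return(head(sub(".*_", "", tri$ngram), k))
  bi <- tstat2 %>% filter(startsWith(ngram, paste0(w2, "_")))
  if (nrow(bi) > 0) return(head(sub(".*_", "", bi$ngram), k))
  head(tstat1$ngram, k)
}
predict.word("one", "of") # candidate next words after "one of"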

Appendix

rmoptions

Functions for filtering profanities/ stop-words, and for stemming

# load profanity file
loader <- function() {
  if(!dir.exists("./data")) dir.create("./data")
  if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA")) dir.create("./data/1225_DS-CS-w2_WordsEDA")
  if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA/ignore"))
        dir.create("./data/1225_DS-CS-w2_WordsEDA/ignore")
        url <- "https://www.freewebheaders.com/download/files/full-list-of-bad-words_csv-file_2021_01_18.zip"
        dest <- "./data/1225_DS-CS-w2_WordsEDA/full-list-of-bad-words_csv-file_2021_01_18.zip"
        if(!file.exists(dest)){download.file(
                url = url, destfile = dest, method = "curl")}
        if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/ignore/full-list-of-bad-words_csv-file_2021_01_18.csv")) {
                unzip(dest, exdir = "./data/1225_DS-CS-w2_WordsEDA/ignore")}
####### another option #######: a publicly kept profanity list from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
        # url <- "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/5faf2ba42d7b1c0977169ec3611df25a3c08eb13/en"
        # dest <- "./data/1225_DS-CS-w2_WordsEDA/ignore"
        # if(!file.exists(dest)){download.file(url = url, destfile = dest, method = "curl")}
#######
}

# remove profanities
toksrp <- function(toks){
        library(readr); library(quanteda)
        loader()
        ignore <- read_csv(
                "./data/1225_DS-CS-w2_WordsEDA/ignore/full-list-of-bad-words_csv-file_2021_01_18.csv",
                col_names = FALSE)
####### with another option #######:
        # ignore <- read_tsv("./data/1225_DS-CS-w2_WordsEDA/ignore", col_names = FALSE)
#######
        ignore <- ignore$X1
        toksrp <- tokens_remove(toks, ignore)
        toksrp
}

# remove stop-words
toksrs <- function(toks) {
  library(quanteda)
  toksrs <- tokens_remove(toks, stopwords("english"))
  toksrs
}

# stemming
stemm <- function(dfm) {
  library(quanteda)
  stemm <- dfm(dfm, stem=TRUE)
  stemm
}

rmstops

Stop-words filtering in brief

The most common words except for stop-words:

toks01 <- toksrs(toks1)
toks02 <- tokens_ngrams(toks01, 2)
astr0<- dfm(toks01, groups = docvars(toks01, "source"))
astr02<- dfm(toks02, groups = docvars(toks01, "source"))

top01.100<- textstat_frequency(astr0, n=120)
wordcloud(top01.100$feature, top01.100$frequency, scale=c(3,.6),
          colors=brewer.pal(8, "Dark2"))

Frequency of the 25 most common bigrams (excluding stop-words) in the context of groups:

gfreq(astr02)

Comparing with the variant where stop-words are kept:

  • bigram frequencies have decreased significantly (by roughly \(10^2\) times)
  • the number of bigrams has increased noticeably
  • the model could assign unusually high probabilities to phrases that are in fact not so common
  • the highest-frequency phrases of the keep-stop-words variant (in the, of the, for the, etc.) do in fact look like the most common ones in daily use
    • but due to their high frequency these phrases may obscure other, more important ones
  • it is necessary to balance the inclusion/ exclusion of stop-words in the predictive model

Session info

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] wordcloud_2.6      RColorBrewer_1.1-2 forcats_0.5.0      stringr_1.4.0     
 [5] purrr_0.3.4        tibble_3.0.4       tidyverse_1.3.0    wordcloud2_0.2.1  
 [9] tidyr_1.1.2        plotly_4.9.2.1     ggplot2_3.3.2      tidytext_0.2.6    
[13] stringi_1.5.3      quanteda_2.1.2     dplyr_1.0.2        ngram_3.0.4       
[17] knitr_1.30         data.table_1.13.4  readr_1.4.0        R.utils_2.10.1    
[21] R.oo_1.24.0        R.methodsS3_1.8.1 

loaded via a namespace (and not attached):
 [1] httr_1.4.2          jsonlite_1.7.2      viridisLite_0.3.0  
 [4] modelr_0.1.8        RcppParallel_5.0.2  assertthat_0.2.1   
 [7] highr_0.8           cellranger_1.1.0    yaml_2.2.1         
[10] pillar_1.4.7        backports_1.2.1     lattice_0.20-41    
[13] glue_1.4.2          digest_0.6.27       rvest_0.3.6        
[16] colorspace_2.0-0    htmltools_0.5.1.1   Matrix_1.2-18      
[19] pkgconfig_2.0.3     ISOcodes_2020.12.04 broom_0.7.2        
[22] haven_2.3.1         scales_1.1.1        farver_2.0.3       
[25] generics_0.1.0      usethis_2.0.0       ellipsis_0.3.1     
[28] withr_2.3.0         lazyeval_0.2.2      cli_2.2.0          
[31] magrittr_2.0.1      crayon_1.3.4        readxl_1.3.1       
[34] evaluate_0.14       stopwords_2.1       tokenizers_0.2.1   
[37] janeaustenr_0.1.5   fs_1.5.0            fansi_0.4.1        
[40] SnowballC_0.7.0     xml2_1.3.2          tools_4.0.3        
[43] hms_0.5.3           lifecycle_0.2.0     munsell_0.5.0      
[46] reprex_0.3.0        compiler_4.0.3      rlang_0.4.9        
[49] grid_4.0.3          rstudioapi_0.13     htmlwidgets_1.5.2  
[52] labeling_0.4.2      rmarkdown_2.5       gtable_0.3.0       
[55] DBI_1.1.0           R6_2.5.0            lubridate_1.7.9.2  
[58] utf8_1.1.4          fastmatch_1.1-0     Rcpp_1.0.5         
[61] vctrs_0.3.5         dbplyr_2.0.0        tidyselect_1.1.0   
[64] xfun_0.19