An exploratory comparison of R text-mining packages

My purpose is to explore different code and methods using the tm, quanteda, and tidytext packages. The text is taken from Hunter S. Thompson’s ESPN column ‘Hey Rube’, which is available at this link:
https://totallygonzo.org/gonzowriting/hst-hey-rube-espn-page-2/

library(tm)
library(quanteda)
library(readtext)
library(tidytext)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(tidyr)
library(igraph)
library(ggraph)

The tm package

In this first phase, I will apply my standard tm code to a corpus. I will start by loading the corpus.

name <- file.path("C:/Users/Cory/Desktop/hstCorpus")
length(dir(name))
## [1] 6
dir(name)
## [1] "habit.txt"     "musberger.txt" "nba.txt"       "roy.txt"      
## [5] "sainthood.txt" "uberalles.txt"
docs <- Corpus(DirSource(name)) 
docs
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 6

There are six documents. I will now process the text. I do not stem as part of my standard operating procedure, and I’m not compelled to start now on this exercise.
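
For anyone who does want to stem, it is a single extra call. Here is a sketch only, not applied anywhere below; stemmedDocs is a throwaway name of mine, and tm's stemDocument() requires the SnowballC package.

library(SnowballC)
stemmedDocs <- tm_map(docs, stemDocument) # not used in the rest of this post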

docs <- tm_map(docs, content_transformer(tolower)) # content_transformer() keeps the corpus class intact
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)

This code creates a document-term matrix, removes sparse terms, and lets us inspect a small portion of the result.

dtm <- DocumentTermMatrix(docs)
dim(dtm)
## [1]    6 1451
rownames(dtm) 
## [1] "habit.txt"     "musberger.txt" "nba.txt"       "roy.txt"      
## [5] "sainthood.txt" "uberalles.txt"
dtm <- removeSparseTerms(dtm, 0.6)
dim(dtm)
## [1]   6 114
inspect(dtm[1:6, 1:5])
## <<DocumentTermMatrix (documents: 6, terms: 5)>>
## Non-/sparse entries: 20/10
## Sparsity           : 33%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##                Terms
## Docs            almost along always another back
##   habit.txt          1     2      1       2    2
##   musberger.txt      0     0      4       2    2
##   nba.txt            0     0      1       0    0
##   roy.txt            0     1      0       1    4
##   sainthood.txt      1     0      1       1    2
##   uberalles.txt      1     1      0       1    1

Only 114 terms remain after removing the sparse ones (the 33% sparsity above refers to the inspected 6 x 5 slice). Here are the frequency results.

freq <- colSums(as.matrix(dtm))
ord <- order(-freq) # order terms by descending frequency
freq[head(ord)]
##   will   like denver   time    new  night 
##     38     23     21     19     17     16
freq[tail(ord)]
##   started      also      away    doomed including     spend 
##         3         3         3         3         3         3
head(table(freq))
## freq
##  3  4  5  6  7  8 
## 29 19 18 13 10  3
tail(table(freq))
## freq
## 16 17 19 21 23 38 
##  2  1  1  1  1  1
findFreqTerms(dtm, 10)
##  [1] "back"     "can"      "denver"   "even"     "football" "game"    
##  [7] "good"     "like"     "long"     "many"     "new"      "night"   
## [13] "now"      "one"      "people"   "season"   "still"    "time"    
## [19] "will"

What is all this commentary about Denver?

findAssocs(dtm, "denver", corlimit = 0.5)
## $denver
##      hell      many   vicious       end       far      back      town 
##      0.91      0.85      0.74      0.70      0.69      0.68      0.61 
##   victory      away including     spend      said 
##      0.61      0.60      0.60      0.60      0.55

Denver and Hell, eh? OK, it is now time for the wordcloud.

library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
wordcloud(names(freq), freq, 
          max.words = 50, scale = c(3, .5), colors=brewer.pal(6, "Dark2"))

Here is a barplot of the ten most frequent terms using base R.

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)
wf <- wf[1:10, ]
barplot(wf$freq, names = wf$word, main = "Word Frequency",
        xlab = "Words", ylab = "Counts", ylim = c(0, 100))
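
Since ggplot2 is already loaded, here is a sketch of the same plot in that style (purely optional):

ggplot(wf, aes(x = reorder(word, -freq), y = freq)) +
  geom_col() +
  labs(title = "Word Frequency", x = "Words", y = "Counts")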

Just to say we included it, let’s do some topic modeling.

library(topicmodels)
set.seed(123)
lda3 <- LDA(dtm, k = 3, method = "Gibbs")
topics(lda3)
##     habit.txt musberger.txt       nba.txt       roy.txt sainthood.txt 
##             3             2             3             1             3 
## uberalles.txt 
##             1
terms(lda3, 10)
##       Topic 1  Topic 2    Topic 3  
##  [1,] "denver" "time"     "will"   
##  [2,] "night"  "football" "like"   
##  [3,] "one"    "long"     "still"  
##  [4,] "even"   "people"   "years"  
##  [5,] "game"   "back"     "three"  
##  [6,] "many"   "now"      "always" 
##  [7,] "season" "can"      "another"
##  [8,] "last"   "good"     "much"   
##  [9,] "new"    "every"    "might"  
## [10,] "dont"   "new"      "beat"

Wow! Nothing drains the blood out of Hunter’s writing like text-mining. Is there any hope? Perhaps.

Putting the quanteda package through its paces

To help with the effort, I relied extensively on the package vignette: http://quanteda.io/articles/quickstart.html

Using readtext creates a dataframe. Then I use corpus() to create the…that’s right, corpus.

myCorpus <- readtext("C:/Users/Cory/Desktop/hstCorpus") #docvars() not used
myCorpus
## readtext object consisting of 6 documents and 0 docvars.
## # data.frame [6 x 2]
##          doc_id                text
##           <chr>               <chr>
## 1     habit.txt "\"Some peopl\"..."
## 2 musberger.txt "\"The start \"..."
## 3       nba.txt "\"A spiral t\"..."
## 4       roy.txt "\"I had dinn\"..."
## 5 sainthood.txt "\"OK. That h\"..."
## 6 uberalles.txt "\"Gambling o\"..."
myCorpus <- corpus(myCorpus)
summary(myCorpus)
## Corpus consisting of 6 documents.
## 
##           Text Types Tokens Sentences
##      habit.txt   402    921        46
##  musberger.txt   379    810        36
##        nba.txt   349    705        32
##        roy.txt   424    887        43
##  sainthood.txt   624   1428        70
##  uberalles.txt   538   1241        61
## 
## Source:  C:/Users/Cory/Desktop/nam/* on x86-64 by Cory
## Created: Sun Jul 23 17:37:46 2017
## Notes:

Should you so desire, you can put the summary into a dataframe, as sketched below. You can also explore the individual texts.
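
A minimal sketch of the dataframe conversion (corpusSummary is just an illustrative name; the same data.frame(summary(...)) idiom shows up again in the tidytext section below):

corpusSummary <- data.frame(summary(myCorpus))
head(corpusSummary)

And here is the first of the individual texts.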

texts(myCorpus)[1]
## habit.txt
## "Some people called me a fool for betting the 49ers to beat Oakland last week, but they were wrong.  It is true that I bleed 49er scarlet-and-gold on some days, but I am no longer ashamed of it like I was in the good old days when I was trapped in the nasty habit of betting on San Francisco every week like some kind of helpless junkie. That addiction is still with me, to some extent, but it is no longer quite as painful as it was back then, when I lived three blocks up the hill from Kezar Stadium. Freelance writers almost never make enough money to live on, much less ride exotic motorcycles and buy season tickets to 49er games. But I am here to tell you that it can be done -- and done without ever resorting to shadowy gigs like pimping or selling drugs. There were times when I was sorely tempted, due to overweening poverty, but I have always believed that anybody with a personal lifestyle as flagrant as mine should have a spotless criminal record, if only for reasons of karma. I still believe that, and it has served me well and honorably over the years --  knock knock -- and I still try to live by it. But it was not just karma that quasi-justified my spending habits in those wild and elegant years of the middle '60s. No, there were good reasons. ... My brand new silver-and-red 650 BSA Lightning (the \"fastest motorcycle ever tested by Hot Rod Magazine\") -- was absolutely necessary to my work. Nobody will argue with that.  It was the best investment I ever made. The 49er tickets, however, were a touch more difficult to explain.  I was a professional sportswriter, even then, and I have been hopelessly addicted to NFL football ever since I watched the legendary Giants-Colts championship game in 1958 -- but that was not enough, at the time, to justify spending our rent money on my football habit. Perhaps there was no justification, but I did it anyway, because I had to.  It was necessary to my mental health. ...  My comfortable apartment on Parnassus Hill looked out on the Bay and the Park and the Golden Gate Bridge -- and, thusly, straight down on the wretched hulk of Kezar.  Indeed, who could ask for anything more? ... Ho ho.  But we could only see half of the playing field. John Brodie would fade back and throw long to Dave Parks or Gene Washington -- and the damn ball would disappear in midflight behind the roof of a building. We could hear the roar of the crowd and the howls of despair that usually followed, but we never saw the end of the west-bound play.  Never. And that was too painful to live with, too hard on my nerves.  So I borrowed enough money from my lawyer to pay for a season ticket (and thank you again, John Clancy, for the loan).  It was another good investment. But it took about 20 years to \"mature,\" as they say. It was not until Bill Walsh and Joe Montana came along that the worm turned, and after that came Steve Young and Jerry Rice, along with five Super Bowls,  many victory celebrations,and the delicious habit of winning, which I highly recommend.  And that -- to make a long story short -- is why I bet heavily on San Francisco to beat Oakland last week. The three points helped, but in truth I honestly believed, in the pit of my gambler's heart, that the 49ers would Win, and that is why I bet on them. It was a vicious game, and by the time it was over, I was ready to sic the Hells Angels on that flaky punk of a kicker.  That swine.  If the Raiders had won in OT, Al Davis would be ordering a new Mercedes 500SL to send Jose Cortez for Christmas. ...  
## The game was that important for Oakland, especially with the hated Denver Broncos coming up next.  The spread should be about six for that one. And the Broncos are riding high. ...  But what the hell?  I'll take Oakland and six anyway.  It will be life or death for the Raiders -- and, if it's not snowing in Denver on Monday night, I suspect they will win.  Probably not, but those six points are what this business is all about. ...  And so long for now, folks.  I have to get to bed so I can go into town tomorrow and vote.  That is another habit I recommend.  It ain't much, but it's the only weapon we have against the Greed-heads. Mahalo. \n3343808367"

I almost forgot: you can add document-level metadata using the metadoc() function. For starters, it seems reasonable to add the published month and year for each of the texts. I will add more detail as I build this dataset over time.

metadoc(myCorpus, 'published') <- c('Nov_02', 'Jan_01', 'Apr_01', 'May_01', 'Dec_00',
                                    'Oct_02')
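
To confirm the field was attached, the matching getter should show it (a quick check; output not reproduced here):

metadoc(myCorpus)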

Here is an interesting function, kwic(), which stands for keywords-in-context. Let’s try it with ‘Brent’, for legendary sports commentator Brent Musberger.

kwic(myCorpus, "Brent")
##                                                              
##   [musberger.txt, 91]         all manner of people: | Brent |
##  [musberger.txt, 124] everywhere. My grudge against | Brent |
##  [musberger.txt, 147]           1980s in fact, when | Brent |
##  [musberger.txt, 265]           From way downtown!" | Brent |
##  [musberger.txt, 631]      Baseball Writer, or like | Brent |
##  [musberger.txt, 639]           The last time I saw | Brent |
##                               
##  Musburger, Lyle Lovett,      
##  Musburger has been smoking on
##  was covering the NBA Finals  
##  would scream." Another       
##  Musburger. The last time     
##  socially was in the dinner

I find this an interesting function, and I can assure you I shall have some good times using it. Try it for yourself; one more quick example follows.
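
A sketch with a different keyword and a narrower window (the term and the window size are arbitrary choices of mine):

kwic(myCorpus, "Oakland", window = 3)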

Here is a lexical dispersion plot using an object of the kwic class.

den <- kwic(myCorpus, 'Denver')
textplot_xray(den, scale = 'relative') #absolute scale also available
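
As the comment notes, the absolute scale is also available, and several kwic objects can be passed at once to compare terms on a single plot. A sketch, with the second term being my own choice:

textplot_xray(kwic(myCorpus, 'Denver'), kwic(myCorpus, 'Oakland'),
              scale = 'absolute')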

Creating and examining n-grams is an important approach to understanding text. It should prove quite interesting with HST’s column.

tk <- tokens(myCorpus, what = 'word', remove_numbers = T,
             remove_punct = T,
             remove_symbols = T) # first tokenize the corpus
ng <- tokens_ngrams(tk, n = 2:3)
ng$sainthood.txt[1:10]
##  [1] "OK_That"        "That_horrible"  "horrible_farce" "farce_is"      
##  [5] "is_Over"        "Over_now"       "now_Everybody"  "Everybody_can" 
##  [9] "can_Relax"      "Relax_and"
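
A related idea is the skip-gram, which allows gaps between the paired words; a minimal sketch using the same token object (the parameter choices are mine):

sg <- tokens_skipgrams(tk, n = 2, skip = 0:1)
sg$sainthood.txt[1:10]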

The package allows the creation of a document-feature matrix (dfm). While we are at it, let’s lowercase the text and remove stopwords and punctuation.

myDFM <- dfm(myCorpus, tolower = T, remove = stopwords("english"), remove_punct = T)
# can be grouped by variable groups = 'var'
myDFM[, 1:10]
## Document-feature matrix of: 6 documents, 10 features (48.3% sparse).
## 6 x 10 sparse Matrix of class "dfmSparse"
##                features
## docs            people called fool betting 49ers beat oakland last week
##   habit.txt          1      1    1       2     2    2       4    2    3
##   musberger.txt      3      0    1       0     0    1       0    1    0
##   nba.txt            3      2    0       0     0    0       0    2    0
##   roy.txt            4      0    0       0     0    0       0    0    0
##   sainthood.txt      2      1    0       2     5    3       0    1    1
##   uberalles.txt      1      0    0       1     1    0       0    2    2
##                features
## docs            wrong
##   habit.txt         1
##   musberger.txt     3
##   nba.txt           0
##   roy.txt           0
##   sainthood.txt     0
##   uberalles.txt     0

What are the 25 most frequently occurring words?

topfeatures(myDFM, n = 25)
##     will     like   denver     time      new      one    night     even 
##       38       23       21       19       17       16       16       15 
##     game     long   people football    still      now downtown     back 
##       15       15       14       13       12       12       12       11 
##     good   season      can     many    every     much baseball      roy 
##       10       10       10       10        9        9        9        9 
##    49ers 
##        8
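
Depending on your quanteda version, textstat_frequency() returns the same information as a tidy dataframe, which is handy for plotting; a sketch, assuming a release that includes that function:

textstat_frequency(myDFM, n = 25)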

The obligatory wordcloud…

textplot_wordcloud(myDFM, max.words = 50, scale = c(3, .5), colors=brewer.pal(6, "Dark2"))

What are the similarities between texts?

textstat_simil(myDFM)
##                habit.txt musberger.txt    nba.txt    roy.txt sainthood.txt
## musberger.txt 0.07701445                                                  
## nba.txt       0.07063758    0.08314443                                    
## roy.txt       0.08266500    0.08735786 0.05518121                         
## sainthood.txt 0.23369075    0.13678286 0.23197742 0.08180234              
## uberalles.txt 0.15912107    0.05885202 0.06429563 0.09556388    0.12143917
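
The default here is the correlation between documents; other measures, such as cosine similarity, can be requested through the method argument (a sketch; output omitted):

textstat_simil(myDFM, method = 'cosine')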

Not much similarity in the correlations above. Clustering is also available! Remember to compute the distances on a normalized dfm.

distance <- textstat_dist(dfm_weight(myDFM, 'relFreq'))
hstClus <- hclust(distance, method = 'ward.D2')
hstClus$labels <- docnames(myCorpus)
plot(hstClus, hang = -1)

Finally, let’s build a topic model.

hstTopics <- LDA(myDFM, k = 3)
topics(hstTopics)
##     habit.txt musberger.txt       nba.txt       roy.txt sainthood.txt 
##             2             3             1             1             1 
## uberalles.txt 
##             2
terms(hstTopics, 10)
##       Topic 1  Topic 2    Topic 3    
##  [1,] "will"   "football" "downtown" 
##  [2,] "denver" "game"     "musburger"
##  [3,] "like"   "season"   "time"     
##  [4,] "people" "new"      "brent"    
##  [5,] "time"   "denver"   "like"     
##  [6,] "now"    "night"    "every"    
##  [7,] "roy"    "baseball" "always"   
##  [8,] "long"   "teams"    "one"      
##  [9,] "even"   "money"    "people"   
## [10,] "one"    "like"     "wrong"

Tidy text mining with the tidytext package

I’ve been wanting to give this a try for a couple of months now. The source of inspiration is the online book at this site:
http://tidytextmining.com/tidytext.html

Like all things ‘tidy’, it operates on dataframes. Therefore, we will first turn the corpus into a tibble.

docTidy <- tidy(myCorpus)
str(docTidy)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  1 variable:
##  $ text: chr  "Some people called me a fool for betting the 49ers to beat Oakland last week, but they were wrong.  It is true "| __truncated__ "The start of a new year is always a good time to watch football and settle old scores, so let's get to it. I ha"| __truncated__ "A spiral that goes straight Down at unholy speed is called a \"Vortex,\" I think, and a spiral that whirls stra"| __truncated__ "I had dinner with Patrick Roy's lawyer the other night, and we spoke of many things -- including Hockey, slande"| __truncated__ ...
df <- data.frame(summary(myCorpus))
## Corpus consisting of 6 documents.
## 
##           Text Types Tokens Sentences
##      habit.txt   402    921        46
##  musberger.txt   379    810        36
##        nba.txt   349    705        32
##        roy.txt   424    887        43
##  sainthood.txt   624   1428        70
##  uberalles.txt   538   1241        61
## 
## Source:  C:/Users/Cory/Desktop/nam/* on x86-64 by Cory
## Created: Sun Jul 23 17:37:46 2017
## Notes:
docTidy$id <- df$Text
glimpse(docTidy)
## Observations: 6
## Variables: 2
## $ text <chr> "Some people called me a fool for betting the 49ers to be...
## $ id   <fctr> habit.txt, musberger.txt, nba.txt, roy.txt, sainthood.tx...

Tokenize and count the most frequent words. unnest_tokens() will also tokenize sentences and n-grams; a quick sentence example follows the word counts.

docTidy %>%
  unnest_tokens(word, text, token = 'words') %>%
  anti_join(stop_words, by = 'word') -> tokenTidy
str(tokenTidy)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2008 obs. of  2 variables:
##  $ id  : Factor w/ 6 levels "habit.txt","musberger.txt",..: 4 6 2 5 5 6 6 6 5 5 ...
##  $ word: chr  "embarrass" "fine" "repeating" "thin" ...
tokenTidy %>%
  count(word, sort = T)
## # A tibble: 1,284 x 2
##        word     n
##       <chr> <int>
##  1     time    24
##  2   denver    21
##  3    night    16
##  4     game    15
##  5   people    14
##  6 football    13
##  7 downtown    12
##  8      roy    10
##  9   season    10
## 10 baseball     9
## # ... with 1,274 more rows
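
As mentioned above, switching the token argument tokenizes sentences instead of words; a minimal sketch (output omitted):

docTidy %>%
  unnest_tokens(sentence, text, token = 'sentences') %>%
  head()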

Term Frequency-Inverse Document Frequency

tokenTidy %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) -> tfIDF
## Warning: package 'bindrcpp' was built under R version 3.4.1
tfIDF
## # A tibble: 1,602 x 6
##               id      word     n         tf       idf     tf_idf
##           <fctr>    <fctr> <int>      <dbl>     <dbl>      <dbl>
##  1 musberger.txt  downtown    12 0.04545455 1.7917595 0.08144361
##  2       roy.txt       roy    10 0.03225806 1.7917595 0.05779869
##  3 musberger.txt musburger     8 0.03030303 1.7917595 0.05429574
##  4 musberger.txt     brent     6 0.02272727 1.7917595 0.04072181
##  5       nba.txt    spiral     5 0.02173913 1.7917595 0.03895129
##  6 uberalles.txt     teams     8 0.01843318 1.7917595 0.03302782
##  7       roy.txt    denver    13 0.04193548 0.6931472 0.02906746
##  8       roy.txt   patrick     5 0.01612903 1.7917595 0.02889935
##  9     habit.txt   oakland     4 0.01454545 1.7917595 0.02606196
## 10       nba.txt      dumb     3 0.01304348 1.7917595 0.02337078
## # ... with 1,592 more rows
tfIDF %>% 
  top_n(10) %>%
  ggplot(aes(word, tf_idf, fill = id)) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  coord_flip()
## Selecting by tf_idf

tfIDF %>% 
  group_by(id) %>% 
  top_n(6) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = id)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~ id, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf

Sentiment analysis is a snap. There are three different sentiment lexicons available: afinn, bing, and nrc.

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##          word score
##         <chr> <int>
##  1    abandon    -2
##  2  abandoned    -2
##  3   abandons    -2
##  4   abducted    -2
##  5  abduction    -2
##  6 abductions    -2
##  7      abhor    -3
##  8   abhorred    -3
##  9  abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows
# the 'nrc' and 'bing' lexicons are also available
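
As a quick sketch of what the nrc lexicon offers, it tags words with categories (anger, joy, and so on) rather than numeric scores, and those categories can be counted directly (output omitted):

get_sentiments("nrc") %>%
  count(sentiment, sort = TRUE)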

Just take the tokenized dataframe and inner join with the sentiment data.

tokenTidy %>%
  inner_join(get_sentiments("afinn")) -> senTidy
## Joining, by = "word"
glimpse(senTidy)
## Observations: 240
## Variables: 3
## $ id    <fctr> roy.txt, uberalles.txt, sainthood.txt, roy.txt, uberall...
## $ word  <chr> "embarrass", "fine", "cool", "matter", "matter", "jesus"...
## $ score <int> -2, 2, 1, 1, 1, 1, 1, 1, 1, 1, -1, -2, -1, -4, -2, -4, -...
tail(senTidy) 
## # A tibble: 6 x 3
##              id   word score
##          <fctr>  <chr> <int>
## 1       nba.txt   dumb    -3
## 2       nba.txt   dumb    -3
## 3       nba.txt   dumb    -3
## 4 sainthood.txt  pushy    -1
## 5 musberger.txt  noisy    -1
## 6       roy.txt stupid    -2
senTidy %>% 
  group_by(id) %>%
  summarize(sentiment_score = sum(score))
## # A tibble: 6 x 2
##              id sentiment_score
##          <fctr>           <int>
## 1     habit.txt             -12
## 2 musberger.txt             -48
## 3       nba.txt             -33
## 4       roy.txt             -62
## 5 sainthood.txt              -4
## 6 uberalles.txt             -28

Let’s try it with ‘nrc’.

tokenTidy %>%
  inner_join(get_sentiments("nrc")) -> senTidy
## Joining, by = "word"
senTidy
## # A tibble: 1,378 x 3
##               id      word    sentiment
##           <fctr>     <chr>        <chr>
##  1       roy.txt embarrass     negative
##  2       roy.txt embarrass      sadness
##  3     habit.txt    coming anticipation
##  4 sainthood.txt    coming anticipation
##  5 uberalles.txt    coming anticipation
##  6 sainthood.txt      cool     positive
##  7 uberalles.txt  unbeaten anticipation
##  8 uberalles.txt  unbeaten          joy
##  9 uberalles.txt  unbeaten     negative
## 10 uberalles.txt  unbeaten     positive
## # ... with 1,368 more rows
# plot sentiment by id where it is anger
senTidy %>%
  group_by(id) %>%
  summarize(total = n()) -> totalRows

senTidy %>%
  filter(sentiment == 'anger') %>%
  group_by(id) %>%
  summarize(anger_count = n()) -> anger

anger$totalRows = totalRows$total
anger$percent_angry <- (anger$anger_count / anger$totalRows) *100
anger <- anger[order(-anger$percent_angry), ]
anger
## # A tibble: 6 x 4
##              id anger_count totalRows percent_angry
##          <fctr>       <int>     <int>         <dbl>
## 1 musberger.txt          22       171     12.865497
## 2     habit.txt          20       180     11.111111
## 3       roy.txt          26       252     10.317460
## 4 uberalles.txt          29       292      9.931507
## 5 sainthood.txt          25       325      7.692308
## 6       nba.txt          12       158      7.594937
anger$id <- factor(anger$id, levels = c('musberger.txt',
                                        'habit.txt',
                                        'roy.txt',
                                        'uberalles.txt',
                                        'sainthood.txt',
                                        'nba.txt'))

# data is now ready for ggplot2
ggplot(anger, aes(x = id, y = percent_angry)) +
  geom_bar(stat = 'identity') +
  xlab('Article') +
  ylab('Percent of Angry Sentiment') +
  ggtitle('Hunter S. Thompson is Angry at Brent Musberger') +
  theme_economist_white()

Common words and their contributions to sentiment

tokenTidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() -> bingCounts
## Joining, by = "word"
bingCounts
## # A tibble: 257 x 3
##        word sentiment     n
##       <chr>     <chr> <int>
##  1     hell  negative     5
##  2  victory  positive     5
##  3      win  positive     5
##  4   brutal  negative     4
##  5 horrible  negative     4
##  6     lost  negative     4
##  7    wrong  negative     4
##  8      bad  negative     3
##  9   bloody  negative     3
## 10 criminal  negative     3
## # ... with 247 more rows
bingCounts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

Can’t forget the wordcloud

tokenTidy %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 25))
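
The tidy text book also shows a sentiment-split version built with comparison.cloud() from the wordcloud package; here is a sketch along those lines (it needs reshape2 for acast(), and the colors are my choice):

library(reshape2)
tokenTidy %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("firebrick", "darkgreen"), max.words = 50)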

Put together some n-grams.

docTidy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) -> gramTidy
gramTidy
## # A tibble: 5,085 x 2
##           id        bigram
##       <fctr>         <chr>
##  1 habit.txt   some people
##  2 habit.txt people called
##  3 habit.txt     called me
##  4 habit.txt          me a
##  5 habit.txt        a fool
##  6 habit.txt      fool for
##  7 habit.txt   for betting
##  8 habit.txt   betting the
##  9 habit.txt     the 49ers
## 10 habit.txt      49ers to
## # ... with 5,075 more rows
gramTidy %>% count(bigram, sort = TRUE)
## # A tibble: 4,243 x 2
##     bigram     n
##      <chr> <int>
##  1  in the    34
##  2  it was    21
##  3  of the    20
##  4 and the    18
##  5  on the    14
##  6  to the    13
##  7 for the    11
##  8 will be    11
##  9  i have    10
## 10   it is    10
## # ... with 4,233 more rows

That is a dumpster fire of results: mostly stopword pairs. Here is the recommended solution: separate the bigrams into their two words and filter out the stopwords.

gramTidy %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) -> gramFilter

gramFilter %>%
  filter(word2 == "football") %>%
  count(id, word1, sort = TRUE)
## # A tibble: 7 x 3
##              id    word1     n
##          <fctr>    <chr> <int>
## 1     habit.txt      nfl     1
## 2 musberger.txt    watch     1
## 3 uberalles.txt   brutal     1
## 4 uberalles.txt national     1
## 5 uberalles.txt      pro     1
## 6 uberalles.txt    style     1
## 7 uberalles.txt  topless     1
# new bigram counts:
gramFilter %>% 
  count(word1, word2, sort = TRUE) -> gramCount

gramCount 
## # A tibble: 641 x 3
##      word1     word2     n
##      <chr>     <chr> <int>
##  1     san francisco     5
##  2     joe   montana     4
##  3 patrick       roy     4
##  4      st     louis     4
##  5   brent musburger     3
##  6      ho        ho     3
##  7  monday     night     3
##  8 stanley       cup     3
##  9  sunday     night     3
## 10    beat   oakland     2
## # ... with 631 more rows
gramFilter %>%
  unite(bigram, word1, word2, sep = " ") -> gramUnited

gramUnited
## # A tibble: 685 x 2
##           id            bigram
##  *    <fctr>             <chr>
##  1 habit.txt     people called
##  2 habit.txt      beat oakland
##  3 habit.txt        bleed 49er
##  4 habit.txt      49er scarlet
##  5 habit.txt       nasty habit
##  6 habit.txt     san francisco
##  7 habit.txt   helpless junkie
##  8 habit.txt     kezar stadium
##  9 habit.txt stadium freelance
## 10 habit.txt freelance writers
## # ... with 675 more rows
gramUnited %>%
  count(id, bigram) %>%
  bind_tf_idf(bigram, id, n) %>%
  arrange(desc(tf_idf)) -> gramDFIDF

gramDFIDF
## # A tibble: 659 x 6
##               id          bigram     n         tf      idf     tf_idf
##           <fctr>           <chr> <int>      <dbl>    <dbl>      <dbl>
##  1       nba.txt     dumb people     2 0.03636364 1.791759 0.06515489
##  2       roy.txt     patrick roy     4 0.03478261 1.791759 0.06232207
##  3 musberger.txt brent musburger     3 0.03157895 1.791759 0.05658188
##  4       roy.txt     stanley cup     3 0.02608696 1.791759 0.04674155
##  5     habit.txt    beat oakland     2 0.02173913 1.791759 0.03895129
##  6 musberger.txt real definition     2 0.02105263 1.791759 0.03772125
##  7 uberalles.txt    sunday night     3 0.01886792 1.791759 0.03380678
##  8       nba.txt        12 hours     1 0.01818182 1.791759 0.03257744
##  9       nba.txt        12 miles     1 0.01818182 1.791759 0.03257744
## 10       nba.txt  auto insurance     1 0.01818182 1.791759 0.03257744
## # ... with 649 more rows

Graphing a network? Interesting!

gramCount %>%
  filter(n > 2) %>%
  graph_from_data_frame() -> gramGraph

gramGraph
## IGRAPH DN-- 16 9 -- 
## + attr: name (v/c), n (e/n)
## + edges (vertex names):
## [1] san    ->francisco joe    ->montana   patrick->roy      
## [4] st     ->louis     brent  ->musburger ho     ->ho       
## [7] monday ->night     stanley->cup       sunday ->night

Here is the network graph, as demonstrated in the tidy text mining book linked above.

ggraph(gramGraph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = -0.5, hjust = 0.5)

You may be asking about the inclusion of ‘ho’ by itself. HST often used the exclamation ‘ho-ho’.
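
The tidy text book dresses this plot up with directional arrows and edge transparency; here is a sketch along those lines (the cosmetic settings are my own choices):

a <- grid::arrow(type = "closed", length = grid::unit(0.15, "inches"))
ggraph(gramGraph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(0.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()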

We will finish this exploration by doing topic modeling the tidy way.

tokenTidy %>%
  count(id, word, sort = TRUE) %>%
  ungroup() -> word_counts

word_counts
## # A tibble: 1,602 x 3
##               id      word     n
##           <fctr>     <chr> <int>
##  1       roy.txt    denver    13
##  2 musberger.txt  downtown    12
##  3       roy.txt       roy    10
##  4 musberger.txt musburger     8
##  5 uberalles.txt  baseball     8
##  6 uberalles.txt  football     8
##  7 uberalles.txt     teams     8
##  8 musberger.txt      time     7
##  9 sainthood.txt      time     7
## 10 uberalles.txt     night     7
## # ... with 1,592 more rows
word_counts %>%
  cast_dtm(id, word, n) -> text_dtm

text_dtm
## <<DocumentTermMatrix (documents: 6, terms: 1284)>>
## Non-/sparse entries: 1602/6102
## Sparsity           : 79%
## Maximal term length: 14
## Weighting          : term frequency (tf)
text_lda <- LDA(text_dtm, k = 3, control = list(seed = 1234))
text_lda
## A LDA_VEM topic model with 3 topics.
text_topics <- tidy(text_lda, matrix = "beta")
text_topics
## # A tibble: 3,852 x 3
##    topic      term         beta
##    <int>     <chr>        <dbl>
##  1     1    denver 5.743094e-51
##  2     2    denver 6.458558e-03
##  3     3    denver 2.564103e-02
##  4     1  downtown 2.429150e-02
##  5     2  downtown 2.457619e-57
##  6     3  downtown 3.624722e-52
##  7     1       roy 2.928935e-54
##  8     2       roy 5.943582e-57
##  9     3       roy 1.709402e-02
## 10     1 musburger 1.619433e-02
## # ... with 3,842 more rows
text_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) -> top_terms

top_terms
## # A tibble: 16 x 3
##    topic      term        beta
##    <int>     <chr>       <dbl>
##  1     1  downtown 0.024291498
##  2     1      time 0.016194332
##  3     1 musburger 0.016194332
##  4     1    people 0.012145749
##  5     1     brent 0.012145749
##  6     2     night 0.011840689
##  7     2      time 0.011840689
##  8     2  football 0.010764263
##  9     2  baseball 0.008611410
## 10     2     teams 0.008611410
## 11     2      game 0.008611410
## 12     3    denver 0.025641026
## 13     3       roy 0.017094017
## 14     3   patrick 0.008547009
## 15     3     habit 0.008547009
## 16     3    people 0.008547009

Plot the result

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
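
Before wrapping up, note that tidy() also exposes the per-document topic probabilities through the gamma matrix; a quick sketch (output omitted):

text_gamma <- tidy(text_lda, matrix = "gamma")
text_gamma %>%
  arrange(document, desc(gamma))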

Conclusions

What do I make of all this? My first inclination is to bring quanteda and tidytext forward as my new go-to methodology. I have leaned heavily on tm and qdap in the past, and I certainly will not discard their valuable contributions. We shall see as more work gets done.

Mahalo!