My purpose is to explore different code and methods using the tm, quanteda, and tidytext packages. The text is taken from Hunter S. Thompson’s ESPN column ‘Hey Rube’, which is available at this link:
https://totallygonzo.org/gonzowriting/hst-hey-rube-espn-page-2/
library(tm)
library(quanteda)
library(readtext)
library(tidytext)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(tidyr)
library(igraph)
library(ggraph)
This first phase uses the tm package and applies my standard corpus-processing code. I will start by loading the corpus.
name <- file.path("C:/Users/Cory/Desktop/hstCorpus")
length(dir(name))
## [1] 6
dir(name)
## [1] "habit.txt" "musberger.txt" "nba.txt" "roy.txt"
## [5] "sainthood.txt" "uberalles.txt"
docs <- Corpus(DirSource(name))
docs
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 6
There are six documents. I will now process the text. Stemming is not part of my standard operating procedure, and I'm not compelled to start now.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
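For the record, had I wanted to stem, tm makes it easy to bolt on; one route is the stemming option in the control list when the document-term matrix is created in the next step. A minimal sketch, not run here, assuming the SnowballC package is installed:
# not run: stem terms while building the document-term matrix (uses SnowballC)
dtmStemmed <- DocumentTermMatrix(docs, control = list(stemming = TRUE))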
This code creates a document-term matrix, removes sparse terms, and allows us to inspect a small portion of the matrix.
dtm <- DocumentTermMatrix(docs)
dim(dtm)
## [1] 6 1451
rownames(dtm)
## [1] "habit.txt" "musberger.txt" "nba.txt" "roy.txt"
## [5] "sainthood.txt" "uberalles.txt"
dtm = removeSparseTerms(dtm, 0.6)
dim(dtm)
## [1] 6 114
inspect(dtm[1:6, 1:5])
## <<DocumentTermMatrix (documents: 6, terms: 5)>>
## Non-/sparse entries: 20/10
## Sparsity : 33%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs almost along always another back
## habit.txt 1 2 1 2 2
## musberger.txt 0 0 4 2 2
## nba.txt 0 0 1 0 0
## roy.txt 0 1 0 1 4
## sainthood.txt 1 0 1 1 2
## uberalles.txt 1 1 0 1 1
Only 114 terms remain after removing the sparse ones. Here are the frequency results.
freq = colSums(as.matrix(dtm))
ord = order(-freq) #order the frequency
freq[head(ord)]
## will like denver time new night
## 38 23 21 19 17 16
freq[tail(ord)]
## started also away doomed including spend
## 3 3 3 3 3 3
head(table(freq))
## freq
## 3 4 5 6 7 8
## 29 19 18 13 10 3
tail(table(freq))
## freq
## 16 17 19 21 23 38
## 2 1 1 1 1 1
findFreqTerms(dtm, 10)
## [1] "back" "can" "denver" "even" "football" "game"
## [7] "good" "like" "long" "many" "new" "night"
## [13] "now" "one" "people" "season" "still" "time"
## [19] "will"
What is all this commentary about Denver?
findAssocs(dtm, "denver", corlimit = 0.5)
## $denver
## hell many vicious end far back town
## 0.91 0.85 0.74 0.70 0.69 0.68 0.61
## victory away including spend said
## 0.61 0.60 0.60 0.60 0.55
Denver and Hell eh? OK, it is now time for the wordcloud.
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
wordcloud(names(freq), freq,
max.words = 50, scale = c(3, .5), colors=brewer.pal(6, "Dark2"))
Here is a barplot of the top 10 terms using base R.
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)
wf <- wf[1:10, ]
barplot(wf$freq, names = wf$word, main = "Word Frequency",
xlab = "Words", ylab = "Counts", ylim = c(0, 100))
Just to say we included it, let's do some topic modeling.
library(topicmodels)
set.seed(123)
lda3 <- LDA(dtm, k = 3, method = "Gibbs")
topics(lda3)
## habit.txt musberger.txt nba.txt roy.txt sainthood.txt
## 3 2 3 1 3
## uberalles.txt
## 1
terms(lda3, 10)
## Topic 1 Topic 2 Topic 3
## [1,] "denver" "time" "will"
## [2,] "night" "football" "like"
## [3,] "one" "long" "still"
## [4,] "even" "people" "years"
## [5,] "game" "back" "three"
## [6,] "many" "now" "always"
## [7,] "season" "can" "another"
## [8,] "last" "good" "much"
## [9,] "new" "every" "might"
## [10,] "dont" "new" "beat"
Wow! Nothing drains the blood out of Hunter’s writing like text-mining. Is there any hope? Perhaps.
To help with the effort, I relied extensively on the package vignette: http://quanteda.io/articles/quickstart.html
Using readtext() creates a dataframe. Then I use corpus() to create the…that's right, the corpus.
myCorpus <- readtext("C:/Users/Cory/Desktop/hstCorpus") #docvars() not used
myCorpus
## readtext object consisting of 6 documents and 0 docvars.
## # data.frame [6 x 2]
## doc_id text
## <chr> <chr>
## 1 habit.txt "\"Some peopl\"..."
## 2 musberger.txt "\"The start \"..."
## 3 nba.txt "\"A spiral t\"..."
## 4 roy.txt "\"I had dinn\"..."
## 5 sainthood.txt "\"OK. That h\"..."
## 6 uberalles.txt "\"Gambling o\"..."
myCorpus <- corpus(myCorpus)
summary(myCorpus)
## Corpus consisting of 6 documents.
##
## Text Types Tokens Sentences
## habit.txt 402 921 46
## musberger.txt 379 810 36
## nba.txt 349 705 32
## roy.txt 424 887 43
## sainthood.txt 624 1428 70
## uberalles.txt 538 1241 61
##
## Source: C:/Users/Cory/Desktop/nam/* on x86-64 by Cory
## Created: Sun Jul 23 17:37:46 2017
## Notes:
Should you so desire, you can put the summary into a dataframe. Also, one can explore the individual texts.
texts(myCorpus)[1]
## habit.txt
## "Some people called me a fool for betting the 49ers to beat Oakland last week, but they were wrong. It is true that I bleed 49er scarlet-and-gold on some days, but I am no longer ashamed of it like I was in the good old days when I was trapped in the nasty habit of betting on San Francisco every week like some kind of helpless junkie. That addiction is still with me, to some extent, but it is no longer quite as painful as it was back then, when I lived three blocks up the hill from Kezar Stadium. Freelance writers almost never make enough money to live on, much less ride exotic motorcycles and buy season tickets to 49er games. But I am here to tell you that it can be done -- and done without ever resorting to shadowy gigs like pimping or selling drugs. There were times when I was sorely tempted, due to overweening poverty, but I have always believed that anybody with a personal lifestyle as flagrant as mine should have a spotless criminal record, if only for reasons of karma. I still believe that, and it has served me well and honorably over the years -- knock knock -- and I still try to live by it. But it was not just karma that quasi-justified my spending habits in those wild and elegant years of the middle '60s. No, there were good reasons. ... My brand new silver-and-red 650 BSA Lightning (the \"fastest motorcycle ever tested by Hot Rod Magazine\") -- was absolutely necessary to my work. Nobody will argue with that. It was the best investment I ever made. The 49er tickets, however, were a touch more difficult to explain. I was a professional sportswriter, even then, and I have been hopelessly addicted to NFL football ever since I watched the legendary Giants-Colts championship game in 1958 -- but that was not enough, at the time, to justify spending our rent money on my football habit. Perhaps there was no justification, but I did it anyway, because I had to. It was necessary to my mental health. ... My comfortable apartment on Parnassus Hill looked out on the Bay and the Park and the Golden Gate Bridge -- and, thusly, straight down on the wretched hulk of Kezar. Indeed, who could ask for anything more? ... Ho ho. But we could only see half of the playing field. John Brodie would fade back and throw long to Dave Parks or Gene Washington -- and the damn ball would disappear in midflight behind the roof of a building. We could hear the roar of the crowd and the howls of despair that usually followed, but we never saw the end of the west-bound play. Never. And that was too painful to live with, too hard on my nerves. So I borrowed enough money from my lawyer to pay for a season ticket (and thank you again, John Clancy, for the loan). It was another good investment. But it took about 20 years to \"mature,\" as they say. It was not until Bill Walsh and Joe Montana came along that the worm turned, and after that came Steve Young and Jerry Rice, along with five Super Bowls, many victory celebrations,and the delicious habit of winning, which I highly recommend. And that -- to make a long story short -- is why I bet heavily on San Francisco to beat Oakland last week. The three points helped, but in truth I honestly believed, in the pit of my gambler's heart, that the 49ers would Win, and that is why I bet on them. It was a vicious game, and by the time it was over, I was ready to sic the Hells Angels on that flaky punk of a kicker. That swine. If the Raiders had won in OT, Al Davis would be ordering a new Mercedes 500SL to send Jose Cortez for Christmas. ... 
The game was that important for Oakland, especially with the hated Denver Broncos coming up next. The spread should be about six for that one. And the Broncos are riding high. ... But what the hell? I'll take Oakland and six anyway. It will be life or death for the Raiders -- and, if it's not snowing in Denver on Monday night, I suspect they will win. Probably not, but those six points are what this business is all about. ... And so long for now, folks. I have to get to bed so I can go into town tomorrow and vote. That is another habit I recommend. It ain't much, but it's the only weapon we have against the Greed-heads. Mahalo. \n3343808367"
I almost forgot: you can add metadata using the metadoc() function. For starters, it seems reasonable to add the published month and year for each of the texts. I will add more detail as I build this dataset over time.
metadoc(myCorpus, 'published') <- c('Nov_02', 'Jan_01', 'Apr_01', 'May_01', 'Dec_00',
'Oct_02')
Here is an interesting function, kwic(), which stands for keywords-in-context. Let's try it with 'Brent', for legendary sports commentator Brent Musburger.
kwic(myCorpus, "Brent")
##
## [musberger.txt, 91] all manner of people: | Brent |
## [musberger.txt, 124] everywhere. My grudge against | Brent |
## [musberger.txt, 147] 1980s in fact, when | Brent |
## [musberger.txt, 265] From way downtown!" | Brent |
## [musberger.txt, 631] Baseball Writer, or like | Brent |
## [musberger.txt, 639] The last time I saw | Brent |
##
## Musburger, Lyle Lovett,
## Musburger has been smoking on
## was covering the NBA Finals
## would scream." Another
## Musburger. The last time
## socially was in the dinner
I find this an interesting function, and I shall have some good times using it, I can assure you. Try it for yourself.
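By default kwic() shows five words of context on each side; the window argument changes that. A quick sketch with a narrower window:
kwic(myCorpus, "Brent", window = 3) # three words of context per side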
Here is a lexical dispersion plot using an object of the kwic class.
den <- kwic(myCorpus, 'Denver')
textplot_xray(den, scale = 'relative') #absolute scale also available
Creating and examining n-grams is an important approach to understanding text. It should prove quite interesting with HST’s column.
tk <- tokens(myCorpus, what = 'word', remove_numbers = T,
remove_punct = T,
remove_symbols = T) # first tokenize the corpus
ng <- tokens_ngrams(tk, n = 2:3)
ng$sainthood.txt[1:10]
## [1] "OK_That" "That_horrible" "horrible_farce" "farce_is"
## [5] "is_Over" "Over_now" "now_Everybody" "Everybody_can"
## [9] "can_Relax" "Relax_and"
The package allows the creation of a document-feature matrix (dfm). While creating it, let's convert to lowercase and remove stopwords and punctuation.
myDFM <- dfm(myCorpus, tolower = T, remove = stopwords("english"), remove_punct = T)
# can be grouped by variable groups = 'var'
myDFM[, 1:10]
## Document-feature matrix of: 6 documents, 10 features (48.3% sparse).
## 6 x 10 sparse Matrix of class "dfmSparse"
## features
## docs people called fool betting 49ers beat oakland last week
## habit.txt 1 1 1 2 2 2 4 2 3
## musberger.txt 3 0 1 0 0 1 0 1 0
## nba.txt 3 2 0 0 0 0 0 2 0
## roy.txt 4 0 0 0 0 0 0 0 0
## sainthood.txt 2 1 0 2 5 3 0 1 1
## uberalles.txt 1 0 0 1 1 0 0 2 2
## features
## docs wrong
## habit.txt 1
## musberger.txt 3
## nba.txt 0
## roy.txt 0
## sainthood.txt 0
## uberalles.txt 0
What are the 25 most frequently occurring words?
topfeatures(myDFM, n = 25)
## will like denver time new one night even
## 38 23 21 19 17 16 16 15
## game long people football still now downtown back
## 15 15 14 13 12 12 12 11
## good season can many every much baseball roy
## 10 10 10 10 9 9 9 9
## 49ers
## 8
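The same approach works for the n-gram tokens created earlier; a quick sketch that builds a dfm from them and counts the most frequent bigrams and trigrams (stopwords were not removed from those tokens, so expect plenty of 'of_the'):
ngDFM <- dfm(ng) # dfm built from the bigram/trigram tokens
topfeatures(ngDFM, 10)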
The obligatory wordcloud…
textplot_wordcloud(myDFM, max.words = 50, scale = c(3, .5), colors=brewer.pal(6, "Dark2"))
What are the similarities between texts?
textstat_simil(myDFM)
## habit.txt musberger.txt nba.txt roy.txt sainthood.txt
## musberger.txt 0.07701445
## nba.txt 0.07063758 0.08314443
## roy.txt 0.08266500 0.08735786 0.05518121
## sainthood.txt 0.23369075 0.13678286 0.23197742 0.08180234
## uberalles.txt 0.15912107 0.05885202 0.06429563 0.09556388 0.12143917
Not much similarity. Clustering is also available! Remember to compute distances on a normalized dfm.
distance <- textstat_dist(dfm_weight(myDFM, 'relFreq'))
hstClus <- hclust(distance, method = 'ward.D2')
hstClus$labels <- docnames(myCorpus)
plot(hstClus, hang = -1)
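If hard cluster assignments are wanted rather than just the dendrogram, cutree() from base R will cut the tree at a chosen number of clusters; a quick sketch (the choice of three clusters is arbitrary here):
cutree(hstClus, k = 3) # cluster membership for each document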
Finally, let's build a topic model.
hstTopics <- LDA(myDFM, k = 3)
topics(hstTopics)
## habit.txt musberger.txt nba.txt roy.txt sainthood.txt
## 2 3 1 1 1
## uberalles.txt
## 2
terms(hstTopics, 10)
## Topic 1 Topic 2 Topic 3
## [1,] "will" "football" "downtown"
## [2,] "denver" "game" "musburger"
## [3,] "like" "season" "time"
## [4,] "people" "new" "brent"
## [5,] "time" "denver" "like"
## [6,] "now" "night" "every"
## [7,] "roy" "baseball" "always"
## [8,] "long" "teams" "one"
## [9,] "even" "money" "people"
## [10,] "one" "like" "wrong"
I’ve been wanting to give this a try for a couple of months now. The source of inspiration is the online book at this site:
http://tidytextmining.com/tidytext.html
Like all things ‘tidy’, it operates on dataframes. Therefore, we will first turn the corpus into a tibble.
docTidy <- tidy(myCorpus)
str(docTidy)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 1 variable:
## $ text: chr "Some people called me a fool for betting the 49ers to beat Oakland last week, but they were wrong. It is true "| __truncated__ "The start of a new year is always a good time to watch football and settle old scores, so let's get to it. I ha"| __truncated__ "A spiral that goes straight Down at unholy speed is called a \"Vortex,\" I think, and a spiral that whirls stra"| __truncated__ "I had dinner with Patrick Roy's lawyer the other night, and we spoke of many things -- including Hockey, slande"| __truncated__ ...
df <- data.frame(summary(myCorpus))
## Corpus consisting of 6 documents.
##
## Text Types Tokens Sentences
## habit.txt 402 921 46
## musberger.txt 379 810 36
## nba.txt 349 705 32
## roy.txt 424 887 43
## sainthood.txt 624 1428 70
## uberalles.txt 538 1241 61
##
## Source: C:/Users/Cory/Desktop/nam/* on x86-64 by Cory
## Created: Sun Jul 23 17:37:46 2017
## Notes:
docTidy$id <- df$Text
glimpse(docTidy)
## Observations: 6
## Variables: 2
## $ text <chr> "Some people called me a fool for betting the 49ers to be...
## $ id <fctr> habit.txt, musberger.txt, nba.txt, roy.txt, sainthood.tx...
Tokenize and count the most frequent words. unnest_tokens() can also tokenize sentences and n-grams; a sentence-level sketch appears after the word counts below.
docTidy %>%
unnest_tokens(word, text, token = 'words') %>%
anti_join(stop_words, by = 'word') -> tokenTidy
str(tokenTidy)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2008 obs. of 2 variables:
## $ id : Factor w/ 6 levels "habit.txt","musberger.txt",..: 4 6 2 5 5 6 6 6 5 5 ...
## $ word: chr "embarrass" "fine" "repeating" "thin" ...
tokenTidy %>%
count(word, sort = T)
## # A tibble: 1,284 x 2
## word n
## <chr> <int>
## 1 time 24
## 2 denver 21
## 3 night 16
## 4 game 15
## 5 people 14
## 6 football 13
## 7 downtown 12
## 8 roy 10
## 9 season 10
## 10 baseball 9
## # ... with 1,274 more rows
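As mentioned above, unnest_tokens() handles other units as well. A minimal sketch tokenizing into sentences and counting them per document:
docTidy %>%
  unnest_tokens(sentence, text, token = 'sentences') %>%
  count(id) # number of sentences per document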
Term Frequency-Inverse Document Frequency
tokenTidy %>%
count(id, word) %>%
bind_tf_idf(word, id, n) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) -> tfIDF
tfIDF
## # A tibble: 1,602 x 6
## id word n tf idf tf_idf
## <fctr> <fctr> <int> <dbl> <dbl> <dbl>
## 1 musberger.txt downtown 12 0.04545455 1.7917595 0.08144361
## 2 roy.txt roy 10 0.03225806 1.7917595 0.05779869
## 3 musberger.txt musburger 8 0.03030303 1.7917595 0.05429574
## 4 musberger.txt brent 6 0.02272727 1.7917595 0.04072181
## 5 nba.txt spiral 5 0.02173913 1.7917595 0.03895129
## 6 uberalles.txt teams 8 0.01843318 1.7917595 0.03302782
## 7 roy.txt denver 13 0.04193548 0.6931472 0.02906746
## 8 roy.txt patrick 5 0.01612903 1.7917595 0.02889935
## 9 habit.txt oakland 4 0.01454545 1.7917595 0.02606196
## 10 nba.txt dumb 3 0.01304348 1.7917595 0.02337078
## # ... with 1,592 more rows
tfIDF %>%
top_n(10) %>%
ggplot(aes(word, tf_idf, fill = id)) +
geom_col() +
labs(x = NULL, y = "tf-idf") +
coord_flip()
## Selecting by tf_idf
tfIDF %>%
group_by(id) %>%
top_n(6) %>%
ungroup %>%
ggplot(aes(word, tf_idf, fill = id)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~ id, ncol = 2, scales = "free") +
coord_flip()
## Selecting by tf_idf
Sentiment analysis is a snap. There are three different sentiment lexicons available.
get_sentiments("afinn")
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
# also available is 'nrc' and 'bing'
Just take the tokenized dataframe and inner join with the sentiment data.
tokenTidy %>%
inner_join(get_sentiments("afinn")) -> senTidy
## Joining, by = "word"
glimpse(senTidy)
## Observations: 240
## Variables: 3
## $ id <fctr> roy.txt, uberalles.txt, sainthood.txt, roy.txt, uberall...
## $ word <chr> "embarrass", "fine", "cool", "matter", "matter", "jesus"...
## $ score <int> -2, 2, 1, 1, 1, 1, 1, 1, 1, 1, -1, -2, -1, -4, -2, -4, -...
tail(senTidy)
## # A tibble: 6 x 3
## id word score
## <fctr> <chr> <int>
## 1 nba.txt dumb -3
## 2 nba.txt dumb -3
## 3 nba.txt dumb -3
## 4 sainthood.txt pushy -1
## 5 musberger.txt noisy -1
## 6 roy.txt stupid -2
senTidy %>%
group_by(id) %>%
summarize(sentiment_score = sum(score))
## # A tibble: 6 x 2
## id sentiment_score
## <fctr> <int>
## 1 habit.txt -12
## 2 musberger.txt -48
## 3 nba.txt -33
## 4 roy.txt -62
## 5 sainthood.txt -4
## 6 uberalles.txt -28
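A quick visual of those AFINN totals, sketched with geom_col():
senTidy %>%
  group_by(id) %>%
  summarize(sentiment_score = sum(score)) %>%
  ggplot(aes(x = id, y = sentiment_score)) +
  geom_col() +
  labs(x = 'Article', y = 'AFINN sentiment score')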
Let's try it with 'nrc'.
tokenTidy %>%
inner_join(get_sentiments("nrc")) -> senTidy
## Joining, by = "word"
senTidy
## # A tibble: 1,378 x 3
## id word sentiment
## <fctr> <chr> <chr>
## 1 roy.txt embarrass negative
## 2 roy.txt embarrass sadness
## 3 habit.txt coming anticipation
## 4 sainthood.txt coming anticipation
## 5 uberalles.txt coming anticipation
## 6 sainthood.txt cool positive
## 7 uberalles.txt unbeaten anticipation
## 8 uberalles.txt unbeaten joy
## 9 uberalles.txt unbeaten negative
## 10 uberalles.txt unbeaten positive
## # ... with 1,368 more rows
# plot the percentage of 'anger' words by document
senTidy %>%
group_by(id) %>%
summarize(total = n()) -> totalRows
senTidy %>%
filter(sentiment == 'anger') %>%
group_by(id) %>%
summarize(anger_count = n()) -> anger
anger$totalRows = totalRows$total
anger$percent_angry <- (anger$anger_count / anger$totalRows) *100
anger <- anger[order(-anger$percent_angry), ]
anger
## # A tibble: 6 x 4
## id anger_count totalRows percent_angry
## <fctr> <int> <int> <dbl>
## 1 musberger.txt 22 171 12.865497
## 2 habit.txt 20 180 11.111111
## 3 roy.txt 26 252 10.317460
## 4 uberalles.txt 29 292 9.931507
## 5 sainthood.txt 25 325 7.692308
## 6 nba.txt 12 158 7.594937
anger$id <- factor(anger$id, levels = c('musberger.txt',
'habit.txt',
'roy.txt',
'uberalles.txt',
'sainthood.txt',
'nba.txt'))
# data is now ready for ggplot2
ggplot(anger, aes(x = id, y = percent_angry)) +
geom_bar(stat = 'identity') +
xlab('Article') +
ylab('Percent of Angry Sentiment') +
ggtitle('Hunter S. Thompson is Angry at Brent Musberger') +
theme_economist_white()
Common word counts by sentiment.
tokenTidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() -> bingCounts
## Joining, by = "word"
bingCounts
## # A tibble: 257 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 hell negative 5
## 2 victory positive 5
## 3 win positive 5
## 4 brutal negative 4
## 5 horrible negative 4
## 6 lost negative 4
## 7 wrong negative 4
## 8 bad negative 3
## 9 bloody negative 3
## 10 criminal negative 3
## # ... with 247 more rows
bingCounts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
Can't forget the wordcloud.
tokenTidy %>%
count(word) %>%
with(wordcloud(word, n, max.words = 25))
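The tidytext book also demonstrates a comparison cloud that contrasts positive and negative words; a sketch along those lines, assuming the reshape2 package is available for acast():
library(reshape2)
tokenTidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% # words as rows, sentiments as columns
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 50)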
Put together some n-grams.
docTidy %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) -> gramTidy
gramTidy
## # A tibble: 5,085 x 2
## id bigram
## <fctr> <chr>
## 1 habit.txt some people
## 2 habit.txt people called
## 3 habit.txt called me
## 4 habit.txt me a
## 5 habit.txt a fool
## 6 habit.txt fool for
## 7 habit.txt for betting
## 8 habit.txt betting the
## 9 habit.txt the 49ers
## 10 habit.txt 49ers to
## # ... with 5,075 more rows
gramTidy %>% count(bigram, sort = TRUE)
## # A tibble: 4,243 x 2
## bigram n
## <chr> <int>
## 1 in the 34
## 2 it was 21
## 3 of the 20
## 4 and the 18
## 5 on the 14
## 6 to the 13
## 7 for the 11
## 8 will be 11
## 9 i have 10
## 10 it is 10
## # ... with 4,233 more rows
That is a dumpster fire of results. A recommended solution is to split the bigrams and filter out stopwords.
gramTidy %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) -> gramFilter
gramFilter %>%
filter(word2 == "football") %>%
count(id, word1, sort = TRUE)
## # A tibble: 7 x 3
## id word1 n
## <fctr> <chr> <int>
## 1 habit.txt nfl 1
## 2 musberger.txt watch 1
## 3 uberalles.txt brutal 1
## 4 uberalles.txt national 1
## 5 uberalles.txt pro 1
## 6 uberalles.txt style 1
## 7 uberalles.txt topless 1
# new bigram counts:
gramFilter %>%
count(word1, word2, sort = TRUE) -> gramCount
gramCount
## # A tibble: 641 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 san francisco 5
## 2 joe montana 4
## 3 patrick roy 4
## 4 st louis 4
## 5 brent musburger 3
## 6 ho ho 3
## 7 monday night 3
## 8 stanley cup 3
## 9 sunday night 3
## 10 beat oakland 2
## # ... with 631 more rows
gramFilter %>%
unite(bigram, word1, word2, sep = " ") -> gramUnited
gramUnited
## # A tibble: 685 x 2
## id bigram
## * <fctr> <chr>
## 1 habit.txt people called
## 2 habit.txt beat oakland
## 3 habit.txt bleed 49er
## 4 habit.txt 49er scarlet
## 5 habit.txt nasty habit
## 6 habit.txt san francisco
## 7 habit.txt helpless junkie
## 8 habit.txt kezar stadium
## 9 habit.txt stadium freelance
## 10 habit.txt freelance writers
## # ... with 675 more rows
gramUnited %>%
count(id, bigram) %>%
bind_tf_idf(bigram, id, n) %>%
arrange(desc(tf_idf)) -> gramDFIDF
gramDFIDF
## # A tibble: 659 x 6
## id bigram n tf idf tf_idf
## <fctr> <chr> <int> <dbl> <dbl> <dbl>
## 1 nba.txt dumb people 2 0.03636364 1.791759 0.06515489
## 2 roy.txt patrick roy 4 0.03478261 1.791759 0.06232207
## 3 musberger.txt brent musburger 3 0.03157895 1.791759 0.05658188
## 4 roy.txt stanley cup 3 0.02608696 1.791759 0.04674155
## 5 habit.txt beat oakland 2 0.02173913 1.791759 0.03895129
## 6 musberger.txt real definition 2 0.02105263 1.791759 0.03772125
## 7 uberalles.txt sunday night 3 0.01886792 1.791759 0.03380678
## 8 nba.txt 12 hours 1 0.01818182 1.791759 0.03257744
## 9 nba.txt 12 miles 1 0.01818182 1.791759 0.03257744
## 10 nba.txt auto insurance 1 0.01818182 1.791759 0.03257744
## # ... with 649 more rows
Graphing a network? Interesting!
gramCount %>%
filter(n > 2) %>%
graph_from_data_frame() -> gramGraph
gramGraph
## IGRAPH DN-- 16 9 --
## + attr: name (v/c), n (e/n)
## + edges (vertex names):
## [1] san ->francisco joe ->montana patrick->roy
## [4] st ->louis brent ->musburger ho ->ho
## [7] monday ->night stanley->cup sunday ->night
Here is the network graph, as demonstrated in the link above.
ggraph(gramGraph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = -0.5, hjust = 0.5)
You may be asking about the inclusion of ‘ho’ by itself. HST often used the exclamation ‘ho-ho’.
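For a slightly more polished graph, the tidytext book maps edge transparency to the bigram count and drops the grid; a sketch:
ggraph(gramGraph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) + # fainter edges for rarer bigrams
  geom_node_point(color = "lightblue", size = 4) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()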
We will finish this exploration by doing topic modeling the tidy way.
tokenTidy %>%
count(id, word, sort = TRUE) %>%
ungroup() -> word_counts
word_counts
## # A tibble: 1,602 x 3
## id word n
## <fctr> <chr> <int>
## 1 roy.txt denver 13
## 2 musberger.txt downtown 12
## 3 roy.txt roy 10
## 4 musberger.txt musburger 8
## 5 uberalles.txt baseball 8
## 6 uberalles.txt football 8
## 7 uberalles.txt teams 8
## 8 musberger.txt time 7
## 9 sainthood.txt time 7
## 10 uberalles.txt night 7
## # ... with 1,592 more rows
text_dtm <- word_counts %>%
  cast_dtm(id, word, n)
text_dtm
## <<DocumentTermMatrix (documents: 6, terms: 1284)>>
## Non-/sparse entries: 1602/6102
## Sparsity : 79%
## Maximal term length: 14
## Weighting : term frequency (tf)
text_lda <- LDA(text_dtm, k = 3, control = list(seed = 1234))
text_lda
## A LDA_VEM topic model with 3 topics.
text_topics <- tidy(text_lda, matrix = "beta")
text_topics
## # A tibble: 3,852 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 denver 5.743094e-51
## 2 2 denver 6.458558e-03
## 3 3 denver 2.564103e-02
## 4 1 downtown 2.429150e-02
## 5 2 downtown 2.457619e-57
## 6 3 downtown 3.624722e-52
## 7 1 roy 2.928935e-54
## 8 2 roy 5.943582e-57
## 9 3 roy 1.709402e-02
## 10 1 musburger 1.619433e-02
## # ... with 3,842 more rows
text_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta) -> top_terms
top_terms
## # A tibble: 16 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 downtown 0.024291498
## 2 1 time 0.016194332
## 3 1 musburger 0.016194332
## 4 1 people 0.012145749
## 5 1 brent 0.012145749
## 6 2 night 0.011840689
## 7 2 time 0.011840689
## 8 2 football 0.010764263
## 9 2 baseball 0.008611410
## 10 2 teams 0.008611410
## 11 2 game 0.008611410
## 12 3 denver 0.025641026
## 13 3 roy 0.017094017
## 14 3 patrick 0.008547009
## 15 3 habit 0.008547009
## 16 3 people 0.008547009
Plot the result.
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
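One more useful view before wrapping up: tidy() with matrix = "gamma" returns the per-document topic probabilities; a minimal sketch:
tidy(text_lda, matrix = "gamma") %>%
  arrange(document, desc(gamma)) # topic mixture for each article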
What do I make of all this? My first inclination is to bring quanteda and tidytext forward as my new go-to methodology. I have leaned heavily on tm and qdap in the past, and I certainly will not discard their valuable contributions. We shall see as more work comes along.
Mahalo!