Malcolm Gladwell first popped onto my radar with his book Blink in the early 2000s as I was studying military strategy and problem solving at an Army school in Kansas. The book fit nicely into the curriculum as it explores how we think about thinking. Gut instincts versus deep planning, how our brains work, and why some decisions are inexplicable are just a few of the topics that resonated with me from Blink. I also listen to a couple of the podcasts from his Pushkin production company as they dig into why we see the world the way we do. With so many positive interactions with his past works, I thought I’d be thrilled about his newest book, The Bomber Mafia, which was published in the spring of 2021. I have yet to buy it. While it holds 4.5 out of 5 stars across its almost 6,000 reviews, many historians I follow and respect have been harsh critics of the book. While most of Gladwell’s work has focused on why people and businesses are successful, The Bomber Mafia is a historical account of the struggle between the concepts of precision bombing and scorched-earth tactics for U.S. aviators in World War II. Historians in droves have written that Gladwell missed the mark on much of his historical analysis.
This project will analyze Amazon book reviews to determine if topic modeling techniques can identify specific themes of why people may be unsatisfied with Gladwell’s latest work of nonfiction.
This week’s independent analysis will focus on identifying “topics” by examining how words cohere into different latent, or hidden, themes based on patterns of co-occurrence of words within online book reviews. Three questions will drive the analysis:
With respect to the actual R workflow of applying topic models to documents and text of interest, Silge & Robinson add a new bottom row to their flowchart detailing new data structures (i.e., a corpus object and document-term matrix) and the LDA model:
This project will explore multiple topic modeling techniques for identifying themes in a corpus of book reviews that received relatively low ratings. The project will be organized according to the basic data science analytic process:
The review text will be tidied and tokenized with the tidytext package and then stemmed through the stm package. This package makes use of the tm text mining package to preprocess text prior to word stemming. Finally, the tidied text is transformed into a document-term matrix (DTM) to describe the frequency of terms across the body of book reviews. The topicmodels and stm packages will be used for modeling, including the findThoughts function for viewing documents assigned to a given topic and the toLDAvis function for exploring topic and word distributions.

To assist in understanding the context behind the focus questions and data sources for this project, this section will focus on the following topics:
Can distinct themes be identified through automated topic modeling?
In this case, I’m looking for themes common to multiple book reviews. Technically, each review can be thought of as “themed,” but I would like to see if topics emerge outside of the base documents.
Will different topic modeling techniques (stemming, LDA vs STM) result in similar or dissimilar topics?
We have a fairly small data set at 152 documents, so the value of stemming may be less than if we had a much larger set of terms to assess. Still, reducing redundancy may enable the emergence of topics masked by similar words that are repeated.
Do the themes correlate to negative reactions to the book?
This is the true purpose of the project. Can we discern why these reviews were critical of the book?
As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up a “Project” within RStudio. This will be our “home” for any files and code used or created in the analysis.
The following packages were installed and/or loaded for this project:
library(tidyverse)
library(tidytext)
library(rvest)
library(purrr)
library(stringr)
library(XML)
library(RCurl)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
The tidytext package was used to “tidy” and tokenize the book review data, and the cast_dtm() function created the document-term matrix (DTM) needed for topic modeling. Word stemming was handled with the stm package’s textProcessor() function; I’ll compare this method with non-stemming techniques to gauge the potential impact on consistency of theme identification.

My web scraping method was inspired by an article by Riki Saito posted to the R news and tutorials site R-BLOGGERS. The process began with establishing a connection to the site hosting public reviews of The Bomber Mafia:
# ASIN (Amazon product ID) for The Bomber Mafia
asin <- "0316296619"
url <- paste0("https://www.amazon.com/dp/", asin)
# Read the product page so we can confirm we hit the right book
doc <- read_html(url)
prod <- html_nodes(doc, "#productTitle") %>%
html_text() %>%
gsub("\n", "", .) %>%
trimws()
prod
## [1] "The Bomber Mafia: A Dream, a Temptation, and the Longest Night of the Second World War"
Now that I’m confident I can call the correct product review website, below is the code to ensure the html data is formatted appropriately for analysis. For this effort, I’m only interested in the review title, date, star ratings and comments:
scrape_amazon <- function(url, throttle = 0){
  # Set throttle between URL calls
  sec = 0
  if(throttle < 0) warning("throttle was less than 0: set to 0")
  if(throttle > 0) sec = max(0, throttle + runif(1, -1, 1))
  Sys.sleep(sec)  # pause so repeated calls are throttled
  # Obtain HTML of URL
  doc <- read_html(url)
  # Parse relevant elements from HTML
  title <- doc %>%
    html_nodes("#cm_cr-review_list .a-color-base") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .) %>%
    str_replace_all("[\r\n]", "")
  date <- doc %>%
    html_nodes("#cm_cr-review_list .review-date") %>%
    html_text() %>%
    gsub(".*on ", "", .)
  stars <- doc %>%
    html_nodes("#cm_cr-review_list .review-rating") %>%
    html_text() %>%
    str_extract("\\d") %>%
    as.numeric()
  comments <- doc %>%
    html_nodes("#cm_cr-review_list .review-text") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .)
  # Combine attributes into a single data frame
  df <- data.frame(title, date, stars, comments, stringsAsFactors = F)
  return(df)
}
With the format set, I wanted to run a sample test to ensure the data arrived ready for analysis. I pulled the first page of reviews and sent it to a data frame:
url <- "http://www.amazon.com/product-reviews/0316296619/?pageNumber=1"
reviews <- scrape_amazon(url)
str(reviews)
Data frame for 1st ten reviews of The Bomber Mafia
My original intent was to download the first 100 pages, which should yield approximately 1,000 book reviews to begin exploratory analysis. However, it became problematic to get that data directly for a reason I didn’t anticipate. I tried multiple times and could only get 498 reviews, just shy of 50 full pages. It turns out that not every country submits book reviews in the same format or with the same entry fields. My last book review entry was from Japan, and its format did not populate the intended cells correctly in the data frame, causing an error at entry #498. I researched a few solutions addressing this issue, but they exceeded my skill level, so I took an alternate route.
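For reference, one possible workaround that I did not pursue would be to wrap the scraper with purrr::possibly() so that any page that fails to parse is skipped instead of halting the whole loop. This is only a sketch; safe_scrape and reviews_robust are hypothetical names:

# Skip pages that error out instead of stopping the whole scrape
safe_scrape <- possibly(scrape_amazon, otherwise = NULL)

reviews_robust <- map_dfr(1:100, function(page_num) {
  url <- paste0("https://www.amazon.com/product-reviews/0316296619/?pageNumber=", page_num)
  safe_scrape(url, throttle = 3)
})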
Had I been able to successfully scrape the first 1,000 reviews, the results would have been a mix of ratings from 1 through 5 stars, with no telling how many at each rating. Since my project is focused on the bad reviews, I decided to pull exclusively from the reviews Amazon classifies as “critical,” as that category contains all the comments I’d like to analyze. With that in mind, I rebuilt my query to pull from that single category. There are only 16 pages of these “critical” reviews:
pages <- 16
reviews_all <- NULL
for(page_num in 1:pages){
url <- paste0("https://www.amazon.com/product-reviews/0316296619/ref=cm_cr_getr_d_paging_btm_next_16?ie=UTF8&reviewerType=all_reviews&pageNumber=",page_num,"&filterByStar=critical")
reviews <- scrape_amazon(url, throttle = 3)
reviews_all <- rbind(reviews_all, cbind(prod, reviews))
}
This code returned a data frame with 153 perfectly formatted reviews ranging from 1 to 3 stars. There was one review not in English, so I pulled that from the data set to keep things simple, leaving 152 remaining entries. Finally, I added a column with a unique ID number to discriminate between book reviews. Since some readers title their review after the book, I wanted to ensure each review was treated as a separate document and not combined with others with the same title. To keep from having to continually run this query, I converted the data frame into a .csv file to continue the analysis:
# Drop the final row (the review not in English) and add a unique ID column
reviews_all <- reviews_all[-nrow(reviews_all),]
reviews_all <- cbind(ID = 1:nrow(reviews_all), reviews_all)
# Save to .csv so the scrape doesn't have to be rerun
write.csv(reviews_all, "data/reviews_all.csv")
In this section I employed some familiar tidytext functions to tidy and tokenize text while also introducing the stm package for processing text and transforming our data frames into new data structures required for topic modeling.
tidytext functions

- unnest_tokens() splits a column into tokens
- anti_join() returns all rows from x without a match in y; used here to remove stop words from our data
- cast_dtm() takes a tidied data frame and “casts” it into a document-term matrix (DTM)

dplyr functions

- count() lets you quickly count the unique values of one or more variables
- group_by() takes a data frame and one or more variables to group by
- summarise() creates a summary of data using arguments like sum and mean

stm functions

- textProcessor() takes in a vector or column of raw texts and performs text processing like removing punctuation and word stemming
- prepDocuments() performs several corpus manipulations including removing words and renumbering word indices

reviews_tidy <- reviews_all %>%
unnest_tokens(output = word, input = comments) %>%
anti_join(stop_words, by = "word")
Now let’s do a quick word count to see some of the most common words used throughout the customer reviews. This should give a sense of what we’re working with and later we’ll need these word counts for creating our document term matrix for topic modeling:
my_kable(reviews_tidy %>%
count(word, sort = TRUE))
| word | n |
|---|---|
| bombing | 187 |
| book | 182 |
| war | 157 |
| gladwell | 155 |
| precision | 72 |
| lemay | 70 |
| bomber | 65 |
| air | 62 |
| japan | 62 |
| read | 53 |
Terms like “bombing,” “war,” and “air” are about what we would expect from a book on the WWII air campaign. The terms “written” and “subject,” however, are not so intuitive and may be worth a quick look once we get into exploratory analysis. Additionally, many similar words appear that could skew the analysis. “Gladwell” and “gladwell’s” do not carry significantly different meanings, and the same could be said for “bomb” vs. “bombing.” This leads me to infer that stemming could reduce this redundancy and better highlight less common terms.
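To illustrate what stemming would buy us, here is a minimal sketch using the SnowballC package loaded earlier (reviews_stemmed is just an illustrative name and is not used in the rest of the analysis):

# Porter stemming collapses related word forms to a shared root;
# "bombing", "bombs", and "bombed" should all reduce to "bomb"
wordStem(c("bombing", "bombs", "bombed"))

# Applied to the tidied tokens, a stemmed column could be added like this
reviews_stemmed <- reviews_tidy %>%
  mutate(stem = wordStem(word))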
To reduce the influence of unnecessary terms, I’ll create a list of custom stop words to filter from the tidied data frame.
my_stopwords <- c("war", "bomb", "gladwell", "book", "bombing", "lemay", "read", "author", "u.s", "air", "bomber", "japan", "mafia", "books", "british", "wwii", "military", "malcolm", "hansell", "bombs", "world", "gladwell's", "cities", "curtis", "germany", "napalm", "casualties", "bombers", "american", "bombsight", "wars", "army", "ii", "makes", "night", "29", "dropped", "lives", "japanese", "story", "people", "civilian", "gladwell’s")
reviews_tidy_2 <- reviews_tidy %>%
filter(!word %in% my_stopwords)
my_kable(reviews_tidy_2 %>%
count(word, sort = TRUE))
| word | n |
|---|---|
| precision | 72 |
| history | 49 |
| time | 33 |
| strategic | 27 |
| force | 24 |
| short | 24 |
| norden | 22 |
| tailwind | 22 |
| written | 19 |
| atomic | 18 |
The updated term list looks much less like a summary of the plot and highlights more specific descriptors of writing style or value. To gain a little more context, I’ll search for one of the terms identified earlier as a possible indicator of criticism: “written”.
error_quotes <- reviews_all %>%
select(comments) %>%
filter(grepl('written', comments))
sample_n(error_quotes,3) %>%
kable()
| comments |
|---|
| I found the story very interesting and well written but for $15.00 the book was much to short I read it in under 4 hours. |
| From this well-written book, we learn a third possible reason why Japan surrendered to end World War II. As kids, we all were told it was the atomic bomb. Then we learned it was also the addition of the Soviet Union after Germany was defeated with the potential to invade Japan. Now we hear Curtis LeMay’s napalm bombing of scores of Japanese cities did it, at least according to LeMay, and despite this, the theory of precision bombing is now the acceptable or at least morally superior way to wage war. The book is not convincing. Despite LeMay’s bombing, it is estimated that Japan had less than one million civilian casualties. The Soviet Union with maybe 2.5 times more people than Japan, and which was invaded, had five or ten million civilian casualties and it was a winner. It’s easy to argue the Soviet Union could have ultimately finished defeating Germany without any help from the Western Allies. In contrast, neither precision nor blanket bombing of civilians has been a proven success, and it’s still completely unclear which is better or even worthwhile. It’s also not clear why Japan really surrendered and this book is not helpful, but it’s important to know the answer. The US hasn’t been on the winning side of a significant war since, if the standard is, as it should be, that it wins the war, it feeds the defeated people, it takes over the government changing the constitution and it creates a friendly democratic country out of a completely undemocratic society. This has essentially been the result for Japan and Germany. We can’t say that about Korea, Vietnam, Iraq or Afghanistan or the so-called victory in the Cold War, If we’re not going to accomplish that, it greatly reduces the reason to fight at all except if directly attacked. |
| OK, I have never been a big fan of Gladwell, even before this. His “6 degrees of separation from a butterfly flapping its wings in Nairobi”. Imaginative, but verging on Sophist (anything can shown to be related to something else to show causality).This is a bit of a vanity project. “I noticed I have a lot of military history books - so I decided to do one myself!” Although, this was developed as a podcast, then an audiobook - and then a “real” book.It is obvious that the backbone of his book has been developed from secondary sources, but all he annotates and quotes are “original sources”. Which tend to be oral history interviews, from decades after the fact. And official US military histories. Questionable, at best.That he never mentions W G Sebald’s essays on the bombing of Germany in WWII, or Martin Caidin’s books, or any other of the valuable “secondary” sources that are out there is somewhat stunning. Caidin’s books, while written early on, and a bit dry, cover exactly the air campaigns Gladwell writes about here - the ballbearing plant raids in Germany, and the fire bombings of Japanese cities.What is also missing is politics and economics. Churchill is brought in briefly, but mostly for his friendship with Lindemann. Economics? How could he not have read A J P Taylor’s comments on how much of the British economy was invested in the bombings - so stopping them would have been a tough (economic) decision.I did enjoy the chapter on the development of napalm. And the Norden bombsight - but, was it ever successfully used? And if it worked, how often, why, where, and how well?Overall, a slight book. The other issue is in the end it seems he wants to have it both ways. He admires the “humanism” (military and humanism seems like an oxymoron to me) of the Bomber Mafia - but he obviously also admires LeMay for “getting it done”.A quick read, a quick history - that should lead you to other, more insightful and complete, books. For me, that meant going back to W G Sebald, where you can see a real brilliant mind at work. |
This query returned 10 reviews mentioning the author’s writing style. They reference Gladwell’s talent as a storyteller, but they also show that many readers were unhappy with this particular work of history.
For this analysis, each individual book review will be treated as a unique “document.” To create the DTM, we’ll first count() how many times each word occurs in each document (each ID in our case) and then create a matrix that contains one row per review, as the original data frame did, but now with a column for each word in the entire corpus and a value of n for how many times that word occurs in each review.
To create this document term matrix from our post counts, we’ll use the cast_dtm() function like so and assign it to the variable reviews_dtm:
reviews_dtm <- reviews_tidy_2 %>%
count(ID, word) %>%
cast_dtm(ID, word, n)
The result of this function is a DocumentTermMatrix object (built on a simple_triplet_matrix), called reviews_dtm. This DTM contains 152 documents and 3,686 terms.
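A quick sanity check on those dimensions (rows are documents, columns are terms) can be done with dim(), which should match the counts above:

# Documents x terms; expected to match 152 documents and 3,686 terms
dim(reviews_dtm)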
The ldatuning package has functions for both calculating and plotting different metrics that can be used to estimate the most preferable number of topics for LDA modeling. It also conveniently takes the standard document term matrix object created from the tidy text.
k_metrics <- FindTopicsNumber(
reviews_dtm,
topics = seq(10, 75, by = 5),
metrics = "Griffiths2004",
method = "Gibbs",
control = list(),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL
)
FindTopicsNumber_plot(k_metrics)
Using the Griffiths2004 metrics included in the default example produced the results shown in the figure below:
As a general rule of thumb and overly simplistic heuristic, an inflection point in the plot indicates an optimal number of topics to select for a value of K (15 in this case).
reviews_lda <- LDA(reviews_dtm,
k = 15,
control = list(seed = 588)
)
reviews_lda
## A LDA_VEM topic model with 15 topics.
The control = argument was used to pass a random number (588) to seed the assignment of topics to each word in the corpus. Since LDA is a stochastic algorithm that could have different results depending on where the algorithm starts, a seed was specified for reproducibility. The model will now produce the same results every time the same number of topics is specified.
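If you want to convince yourself of that reproducibility, one quick check is to refit with the same seed and compare the top terms; reviews_lda_check is just an illustrative name, and the comparison should come back TRUE:

# Refit with the same seed and confirm the topic-term assignments match
reviews_lda_check <- LDA(reviews_dtm,
                         k = 15,
                         control = list(seed = 588))
identical(terms(reviews_lda, 5), terms(reviews_lda_check, 5))
# expected to return TRUE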
temp <- textProcessor(reviews_all$comments,
metadata = reviews_all,
lowercase=TRUE,
removestopwords=TRUE,
removenumbers=TRUE,
removepunctuation=TRUE,
wordLengths=c(3,Inf),
stem=FALSE,
onlycharacter= FALSE,
striphtml=TRUE,
customstopwords=my_stopwords)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Remove Custom Stopwords...
## Removing numbers...
## Creating Output...
Unlike the unnest_tokens() function, the output is not a nice tidy data frame. Topic modeling using the stm package requires a set of inputs specific to the package. One change I made to the default values was setting stem to FALSE. The relatively small data set, combined with the custom stop words, should mitigate redundant terms and potentially preserve some context.
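To get a feel for what that output actually contains, a quick structural peek helps (just a sketch; the elements shown here come from textProcessor() and are extracted in the next step):

# The output is a list holding the documents, vocab, and metadata that
# stm() expects, plus a record of what was removed during processing
str(temp, max.level = 1)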
As shown above, textProcessor() produced a temp output object that is unique to the stm package. Unlike the LDA() function, the stm() function for fitting a structural topic model does not accept a standard document-term matrix.
Before an STM model can be fitted, elements must be extracted from the temp object created after processing the review text. Specifically, the stm() function expects the following arguments:
- documents = the document term matrix to be modeled in the native STM format
- data = an optional data frame containing metadata for the prevalence and/or content covariates to include in the model
- vocab = a character vector specifying the words in the corpus in the order of the vocab indices in documents

Let’s go ahead and extract these elements:
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
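As an aside, the prepDocuments() function listed earlier could be applied at this point to drop very infrequent terms and renumber the word indices before fitting. I kept the unfiltered elements for the model below, but a minimal sketch of that optional step (out_prepped is just an illustrative name) would look like:

# Optional: remove words appearing in only one document and renumber indices
out_prepped <- prepDocuments(docs, vocab, meta, lower.thresh = 1)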
And now we can use these elements to fit the model. Note that I set K = 10 here rather than the K = 15 used for the LDA model; the reasoning for that choice is discussed below. Let’s also take advantage of the fact that we can include the ID and title covariates in the prevalence = argument to, in theory, help improve model fit:
reviews_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
prevalence =~ ID + title,
K=10,
max.em.its=25,
verbose = FALSE,
gamma.prior = 'L1')
reviews_stm
## A topic model with 10 topics, 152 documents and a 3994 word dictionary.
As noted earlier, the stm package has a number of handy features. One of these is the plot.STM() function for viewing the most probable words assigned to each topic.
By default, it only shows the first 3 terms so let’s change that to 5 to help with interpretation:
plot.STM(reviews_stm, n = 5)
I chose two different K values for the topic models. I used K = 15 for LDA, as that’s what the Gibbs method recommended, but reduced K to 10 for the STM model because I saw multiple overlapping topics in STM when K was set to 15. The screenshots below capture the STM differences for each of those K values using the LDAvis topic browser:
toLDAvis(mod = reviews_stm, docs = docs)
As you can see from the browser screenshot below, an STM model with 15 topics results in a lot of overlap among topics, suggesting that 15 may not be an optimal number of topics:
Changing K to 10 significantly reduced the overlap in topics:
Hidden within the reviews_lda topic model object we created are per-topic-per-word probabilities, called β (“beta”): the probability of a term (word) belonging to a topic.
Let’s take a look at the 5 most likely terms assigned to each topic, i.e. those with the largest β values using the terms() function from the topicmodels package:
terms(reviews_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "words" "tactics" "strategic" "tailwind" "5" "history"
## [2,] "precision" "history" "precision" "ferocious" "precision" "blah"
## [3,] "burning" "podcasts" "short" "takes" "narrative" "surrender"
## [4,] "claiming" "feels" "2" "force" "subject" "precision"
## [5,] "doesn’t" "combat" "force" "precision" "strategic" "bad"
## Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12
## [1,] "precision" "history" "precision" "time" "precision" "precision"
## [2,] "results" "argument" "reader" "history" "effectiveness" "fighter"
## [3,] "corps" "precision" "harris" "errors" "morally" "strategic"
## [4,] "raids" "topic" "history" "short" "means" "targets"
## [5,] "daylight" "found" "raids" "podcast" "fighter" "111"
## Topic 13 Topic 14 Topic 15
## [1,] "idea" "history" "time"
## [2,] "effective" "tailwind" "worth"
## [3,] "stars" "time" "moral"
## [4,] "note" "effective" "treatment"
## [5,] "segregation" "lamaye" "based"
Based on our selected number of topics for our corpus, some themes are fairly intuitive to interpret. For example:
Topic 2 (tactics, history, podcasts, feels, combat) seems to be about relating the author’s take on military history to his podcast (which is about revisionist history);
Topic 6 (history, blah, surrender, precision, bad) indicates dissatisfaction with the author’s analysis of a specific event; and
Topic 4 (tailwind, ferocious, takes, force, precision) relates to a specific claim the author makes regarding the wind required for a plane to take off (debunked by many).
To get a more intuitive description of the relationships above, the tidytext package can convert the LDA model to a tidy data frame containing these beta values for each term:
tidy_lda <- tidy(reviews_lda)
tidy_lda
## # A tibble: 54,645 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 100,000 2.68e-271
## 2 2 100,000 6.62e-270
## 3 3 100,000 2.69e- 3
## 4 4 100,000 2.13e-270
## 5 5 100,000 1.35e-268
## 6 6 100,000 1.24e-270
## 7 7 100,000 3.75e- 3
## 8 8 100,000 6.83e-270
## 9 9 100,000 1.61e-269
## 10 10 100,000 1.81e-270
## # … with 54,635 more rows
Then we can plot this information visually:
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
Now that we have a sense of the most common words associated with each topic, let’s take a look at the topic prevalence in the book review corpus, including the words that contribute to each topic we examined above.
Also hidden within our reviews_lda topic model object are per-document-per-topic probabilities, called γ (“gamma”). These provide the probability that each document is generated from each topic (the gamma matrix). Beta and gamma values can be combined to understand topic prevalence in our corpus and which words contribute to each topic.
First, let’s create two tidy data frames for our beta and gamma values:
td_beta <- tidy(reviews_lda)
td_gamma <- tidy(reviews_lda, matrix = "gamma")
td_beta
## # A tibble: 54,645 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 100,000 2.68e-271
## 2 2 100,000 6.62e-270
## 3 3 100,000 2.69e- 3
## 4 4 100,000 2.13e-270
## 5 5 100,000 1.35e-268
## 6 6 100,000 1.24e-270
## 7 7 100,000 3.75e- 3
## 8 8 100,000 6.83e-270
## 9 9 100,000 1.61e-269
## 10 10 100,000 1.81e-270
## # … with 54,635 more rows
td_gamma
## # A tibble: 2,280 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1 1 0.0000427
## 2 2 1 0.000108
## 3 3 1 0.000423
## 4 4 1 0.00238
## 5 5 1 0.000372
## 6 6 1 0.0000478
## 7 7 1 0.0000614
## 8 8 1 0.0000494
## 9 9 1 0.000228
## 10 10 1 1.00
## # … with 2,270 more rows
Next, we’ll create a filtered data frame of our top_terms, join it to a new data frame of gamma terms, and create a clean table using the kable() function from the knitr package:
top_terms <- td_beta %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ", ")) %>%
unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(terms)`
gamma_terms <- td_gamma %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 6 | 0.135 | history, blah, surrender, precision, bad, content, time, future, evil, humans |
| Topic 15 | 0.128 | time, worth, moral, treatment, based, biased, reading |
| Topic 10 | 0.117 | time, history, errors, short, podcast, basic, precision, hard, wrong |
| Topic 8 | 0.117 | history, argument, precision, topic, found, pacific, podcast |
| Topic 4 | 0.104 | tailwind, ferocious, takes, force, precision, enola, error, gay, hiroshima, nagasaki |
| Topic 14 | 0.098 | history, tailwind, time, effective, lamaye, force, navy, naval, personal, writing |
| Topic 2 | 0.098 | tactics, history, podcasts, feels, combat, subject, thinking, ww, technology, strategies, practices, theme |
| Topic 3 | 0.040 | strategic, precision, short, 2, force, historians, incendiary, message, morally, quotes, tokyo, authors |
| Topic 13 | 0.034 | idea, effective, stars, note, segregation, timeline, je |
| Topic 5 | 0.027 | 5, precision, narrative, subject, strategic, raf, hard, lot |
| Topic 11 | 0.027 | precision, effectiveness, morally, means, fighter, fighters, targets, combat, developed, prior, warfare, accuracy, aircraft, failure, results, flying, proved, provide, daylight, target, defensive, allied, saturation, anti, stealth, uncertain |
| Topic 12 | 0.021 | precision, fighter, strategic, targets, 111, enemy, weapons, soviet, union, fan |
| Topic 9 | 0.021 | precision, reader, harris, history, raids, left, force, arthur, time, fails, it’s, atomic, firebombing, imperial, invasion, raf, technology |
| Topic 1 | 0.018 | words, precision, burning, claiming, doesn’t, satan, word |
| Topic 7 | 0.016 | precision, results, corps, raids, daylight, civilians, atomic, deaths |
And let’s also compare this to the most prevalent topics and terms from our reviews_stm model that we created using the plot() function:
plot(reviews_stm, n = 7)
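Before wrapping up, the findThoughts() function flagged at the start of this section offers one more way to read the results by pulling the reviews most strongly associated with a given topic. Below is a minimal sketch; the topic number is only an example, and the texts come from the metadata retained by textProcessor() so they line up with the modeled documents:

# View the two reviews most strongly associated with an example STM topic
findThoughts(reviews_stm,
             texts = as.character(meta$comments),
             topics = 1,
             n = 2)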
Recognizing that topic modeling is best used as a “tool for reading” and provides only an incomplete answer to our overarching question, “How do we quantify what a corpus is about?”, the results do suggest some potential topics that have emerged as well as some areas worth following up on.
Specifically, looking at some of the common clusters of words for the more prevalent topics suggest that some key topics or “latent themes” (renamed in bold) might include:
Both the reviews_stm and reviews_lda models contain the terms “errors,” “wrong,” and “history.” This could represent the general consensus across all the critical reviews. Some specific examples concern how planes work (tailwind) and which planes dropped specific ordnance during WWII (enola, gay, hiroshima, nagasaki).

This project was designed to answer three key questions about using topic modeling to identify latent themes in a body of text. Specifically:
I believe the simple answer is that yes, multiple themes can be identified. Because the techniques are unsupervised, however, there is an art to determining how inclusive those themes may be. In this case, I had to weed out much of the information about the plot of the book, as it was common to almost every review. Losing that context may be detrimental to identifying topics, but if you have some understanding of the overall corpus, you can mitigate that risk.
In this case, both the LDA and STM models returned relatively similar results, with many of the same key terms appearing in both. However, the number of topics (K) produced different results in each model: LDA was more consistent with K = 15, while the STM model returned fewer overlapping topics with K = 10. Lastly, stemming was less of a factor in this project because I reduced redundant terms during wrangling; adding custom stop words ensured that stemming would be less impactful. When stemming was tried, the results were not consistent with the other models.
The topics returned by both models do correlate to the negative reactions I expected from the reviews. A major limitation of this study was the size of the data set. With only 152 reviews categorized as “critical” of the book, I was left to analyze a fairly small corpus of text. I would argue that a minimum of 500 documents is needed when each document is this short. In this case, many of the reviews differed only in a couple of key words that failed to register high frequency counts, which kept those terms toward the bottom of the key word counts and made them harder to distinguish as drivers of topic development.