Sentiment analysis is an exciting topic that helps us understand text better. The second chapter of the book “Text Mining with R: A Tidy Approach” describes how to perform sentiment analysis with tidy data.
The following code chunks are examples taken from that book.
Let’s explore the different sentiment lexicons.
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
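The three tables above have different shapes: the first (AFINN) attaches a numeric score to each word, the second (Bing) a positive/negative label, and the third (NRC) one or more emotion categories per word. Below is a minimal base R sketch of how each shape is typically used. The `toy_*` data frames and the `tokens` vector are hypothetical; their few entries are copied from the printed rows above rather than loaded from the real lexicons.

```r
# Toy lexicons built from rows shown in the printed output (not the full lexicons)
toy_afinn <- data.frame(word = c("abandon", "abhor"), value = c(-2, -3))
toy_bing  <- data.frame(word = c("abnormal", "abolish"),
                        sentiment = c("negative", "negative"))
toy_nrc   <- data.frame(word = c("abandon", "abandon", "abandon", "abacus"),
                        sentiment = c("fear", "negative", "sadness", "trust"))

# Hypothetical tokens to score; "table" appears in none of the toy lexicons
tokens <- data.frame(word = c("abandon", "abacus", "abnormal", "table"))

# Numeric lexicon: matched scores can be summed
afinn_hits  <- merge(tokens, toy_afinn, by = "word")
afinn_total <- sum(afinn_hits$value)

# Binary lexicon: matched labels can be tallied
bing_hits   <- merge(tokens, toy_bing, by = "word")
bing_counts <- table(bing_hits$sentiment)

# Categorical lexicon: one word can match several emotions
nrc_hits   <- merge(tokens, toy_nrc, by = "word")
nrc_counts <- table(nrc_hits$sentiment)

print(afinn_total)
print(bing_counts)
print(nrc_counts)
```

Because `merge()` keeps only matching rows by default, words missing from a lexicon (here "table") simply drop out of that lexicon's result.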
Martin Luther King was an advocate for the civil rights movement who delivered the speech “I Have a Dream” on August 28, 1963. This speech is widely considered the greatest speech of the 20th century for its power and resonance.
I’m interested in using a sentiment lexicon to understand what made the speech so powerful and memorable.
I found a text version of the speech on Kaggle. I added it to my GitHub repository and started tidying the data.
speech_df <- read.csv("https://raw.githubusercontent.com/Kossi-Akplaka/Data607-data_acquisition_and_management/main/Assignment10/dream.txt", header = FALSE)
tibble(speech_df)
## # A tibble: 43 × 1
## V1
## <chr>
## 1 "I am happy to join with you today in what will go down in history as the gr…
## 2 "Five score years ago, a great American, in whose symbolic shadow we stand t…
## 3 "One hundred years later, the colored American lives on a lonely island of p…
## 4 "In a sense we have come to our Nation’s Capital to cash a check. When the a…
## 5 "This note was a promise that all men, yes, black men as well as white men, …
## 6 "It is obvious today that America has defaulted on this promissory note inso…
## 7 " a check that will give us upon demand the riches of freedom and security o…
## 8 "I would be fatal for the nation to overlook the urgency of the moment and t…
## 9 "There will be neither rest nor tranquility in America until the colored cit…
## 10 "We can never be satisfied as long as our bodies, heavy with the fatigue of …
## # ℹ 33 more rows
The choice of a sentiment lexicon depends on the nature of the text. The Loughran-McDonald Financial Sentiment Word Lists, for instance, are tailored for financial text and may not be suitable for a historical speech.
An alternative is to use the SentiWordNet lexicon. According to the SentiWordNet GitHub page, SentiWordNet is a lexical resource for opinion mining that assigns three sentiment scores to each synset of WordNet:
Positivity
Negativity
Objectivity (neutral)
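In SentiWordNet the three scores of a synset sum to 1, so objectivity follows from the other two. A small worked sketch with made-up synsets and scores (all names and numbers here are hypothetical) is shown below. Collapsing the scores into a single polarity as positivity minus negativity is an assumption made for illustration, not necessarily how any packaged lexicon derives its values.

```r
# Hypothetical positivity and negativity scores for three made-up synsets
synsets <- data.frame(
  synset = c("synset_a", "synset_b", "synset_c"),
  pos    = c(0.625, 0.000, 0.125),
  neg    = c(0.000, 0.750, 0.125)
)

# Objectivity is whatever remains after positivity and negativity
synsets$obj <- 1 - (synsets$pos + synsets$neg)

# One simple (assumed) way to get a single signed score per synset
synsets$polarity <- synsets$pos - synsets$neg

print(synsets)
```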
The speech data has 43 rows and 1 column. Let’s tidy it up by loading it into a corpus and removing quotation marks, punctuation, numbers, and stop words.
# Interprets each element as a document
corpus <- Corpus(VectorSource(speech_df$V1))
# Remove quotation marks like “ or ”
corpus <- tm_map(corpus, content_transformer(function(x) gsub('”', '', x)))
corpus <- tm_map(corpus, content_transformer(function(x) gsub('“', '', x)))
# Pre-process the data
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Remove common English stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
First, let’s count the words in each row and store the counts in a data frame df.
# Create a Document-Term Matrix (DTM)
dtm <- DocumentTermMatrix(corpus)
# Convert DTM to a Data Frame
df <- as.data.frame(as.matrix(dtm))
tibble(df)
## # A tibble: 43 × 397
## demonstration freedom greatest happy history join nation today will ago
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 1 2 1 1 1 1 0
## 2 0 0 0 0 0 0 0 1 0 1
## 3 0 0 0 0 0 0 0 1 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 1 1 0 0
## 7 0 1 0 0 0 0 1 0 1 0
## 8 0 1 0 0 0 0 2 0 3 0
## 9 0 0 0 0 0 0 1 0 2 0
## 10 0 0 0 0 0 0 0 0 0 0
## # ℹ 33 more rows
## # ℹ 387 more variables: america <dbl>, american <dbl>, beacon <dbl>,
## # came <dbl>, captivity <dbl>, chains <dbl>, colored <dbl>, crippled <dbl>,
## # daybreak <dbl>, decree <dbl>, discrimination <dbl>, emancipation <dbl>,
## # end <dbl>, five <dbl>, flames <dbl>, free <dbl>, great <dbl>, hope <dbl>,
## # hundred <dbl>, injustice <dbl>, joyous <dbl>, later <dbl>, life <dbl>,
## # long <dbl>, manacle <dbl>, millions <dbl>, momentous <dbl>, night <dbl>, …
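To make the cleaning and counting steps above concrete, here is a minimal base R sketch that mimics them on two toy documents without the tm package. The `docs` sentences and the short `toy_stopwords` list are hypothetical stand-ins; tm's `stopwords("english")` list is much longer.

```r
# Two hypothetical documents standing in for rows of the speech
docs <- c("I have a Dream today!", "Let freedom ring; let FREEDOM ring.")

# A tiny stand-in stop word list
toy_stopwords <- c("i", "have", "a", "let")

clean  <- tolower(docs)                       # lower-case everything
clean  <- gsub("[[:punct:]]", "", clean)      # strip punctuation
clean  <- gsub("[[:digit:]]", "", clean)      # strip numbers
tokens <- strsplit(trimws(clean), "\\s+")     # split on whitespace
tokens <- lapply(tokens, function(w) w[!w %in% toy_stopwords])

# Document-term matrix: one row per document, one column per term
vocab   <- sort(unique(unlist(tokens)))
dtm_toy <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))

print(dtm_toy)
```

Each cell holds the number of times a term occurs in a document, which is the same layout as the 43 × 397 data frame printed above.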
This data has 43 rows, one per line of the speech. Let’s add all the rows together to count the total number of times each word was spoken.
# Sum the counts across all documents
total_counts <- colSums(df)
# Convert to a data frame
total_counts_df <- data.frame(word = names(total_counts), count = total_counts) %>%
group_by(word) %>%
summarize(count = sum(count))
head(total_counts_df)
## # A tibble: 6 × 2
## word count
## <chr> <dbl>
## 1 able 8
## 2 ago 1
## 3 alabama 3
## 4 alleghenies 1
## 5 almighty 1
## 6 also 1
Now, we can sort the data frame and plot the 10 most used words in the speech.
# Arrange by count in descending order and select the top 10
top_10_words <- total_counts_df %>%
arrange(desc(count)) %>%
head(10)
# Plot the top 10 words
ggplot(top_10_words, aes(x = reorder(word, count), y = count)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Top 10 Most Used Words in the Speech", x = "Word", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
In his speech, Martin Luther King uses the word “will” the most, indicating a forward-looking perspective towards the future. Furthermore, he frequently uses words such as “freedom”, “colored”, and “every”, conveying a vision where individuals, regardless of the color of their skin, will experience universal freedom.
Based on that, one can assume that the speech was very positive and encouraging. Let’s use the SentiWordNet lexicon to find if that’s the case.
The SentiWord lexicon contains about 20,000 rows of polarity values. In the data frame, x holds the words and y holds the sentiment values.
Find more in the R Help document (hash_sentiment_sentiword {lexicon}).
## x y
## 1: 365 days -0.5000
## 2: 366 days 0.2500
## 3: 3tc -0.2500
## 4: a fortiori 0.2500
## 5: a good deal 0.2500
## 6: a great deal 0.3125
Now we perform an inner join between “total_counts_df” and “hash_sentiment_sentiword”.
sentiment_analysis_df <- total_counts_df %>%
inner_join(hash_sentiment_sentiword, by = c("word" = "x"))
head(sentiment_analysis_df)
## # A tibble: 6 × 3
## word count y
## <chr> <dbl> <dbl>
## 1 able 8 0.125
## 2 back 8 0.25
## 3 bad 1 -0.518
## 4 bank 1 0.375
## 5 basic 1 0.25
## 6 battered 1 -0.75
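An inner join keeps only the words that appear in both tables, so any speech word missing from the lexicon is dropped. The base R sketch below illustrates this, and also shows one possible way to summarise the result as a count-weighted mean score. The weighted mean is an illustrative addition, not part of the analysis above. The first four rows reuse the (rounded) values printed above; the word `zzz_unmatched` and the name `toy_lexicon` are hypothetical.

```r
# Word counts: four rows from the printed output plus one hypothetical unmatched word
counts <- data.frame(
  word  = c("able", "back", "bad", "battered", "zzz_unmatched"),
  count = c(8, 8, 1, 1, 5)
)

# Toy lexicon using the same column names as hash_sentiment_sentiword (x, y)
toy_lexicon <- data.frame(
  x = c("able", "back", "bad", "battered"),
  y = c(0.125, 0.25, -0.518, -0.75)
)

# Inner join: rows without a match on either side are dropped
joined <- merge(counts, toy_lexicon, by.x = "word", by.y = "x")

# Count-weighted mean: frequent words contribute more to the overall score
weighted_mean <- sum(joined$count * joined$y) / sum(joined$count)

print(joined)
print(weighted_mean)
```

Weighting by count matters when a few words are repeated often, since an unweighted view treats a word spoken once the same as a word spoken twenty times.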
Finally, we can plot each word against its sentiment score.
ggplot(sentiment_analysis_df, aes(x = word, y = y)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Sentiment Scores for Each word", y = "Sentiment Score") +
theme(axis.text.x = element_blank(), axis.title.x = element_blank())
Based on the distribution, there is no clear pattern. Possible reasons include:
Lexicon-based analysis is limited and may not capture every nuance.
The speech itself is highly polarized, mixing strongly positive and strongly negative words.
Exploring another sentiment lexicon may provide a better understanding.
# Load AFINN lexicon
afinn_lexicon <- get_sentiments("afinn")
# Join total_counts_df with AFINN lexicon
afinn_analysis_df <- total_counts_df %>%
inner_join(afinn_lexicon, by = "word")
# Plot sentiment scores vs terms
ggplot(afinn_analysis_df, aes(x = word, y = value)) +
geom_bar(stat = "identity", fill = "brown") +
labs(title = "Sentiment Scores using AFINN for Each word", y = "Sentiment Score") +
theme(axis.text.x = element_blank(), axis.title.x = element_blank())
Based on the plot, there are slightly more positive words than negative words according to the AFINN lexicon.
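A visual impression like this can be checked by tallying the signs of the scores. The following is a minimal base R sketch on a hypothetical joined table; `toy_afinn_df`, its counts, and its scores are illustrative values, not the real result of the join above.

```r
# Hypothetical joined result: word counts with illustrative AFINN-style scores
toy_afinn_df <- data.frame(
  word  = c("freedom", "hope", "injustice", "crippled", "happy"),
  count = c(20, 4, 3, 1, 1),
  value = c(2, 2, -2, -2, 3)
)

# Label each word by the sign of its score
polarity <- ifelse(toy_afinn_df$value > 0, "positive", "negative")

# Number of distinct positive and negative words
distinct_words <- table(polarity)

# Number of spoken occurrences, weighting each word by its count
occurrences <- tapply(toy_afinn_df$count, polarity, sum)

print(distinct_words)
print(occurrences)
```

Counting distinct words and counting occurrences can give different pictures, so it helps to state which of the two a claim refers to.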
Speakers often use rhetorical devices to create emotional impact. This can involve emphasizing challenges and injustices (negative sentiment) while concurrently expressing optimism and aspirations (positive sentiment).
This speech motivates and inspires the audience by incorporating hope, dreams, and the vision of a better future while reminding listeners of the historical struggles for civil rights.