Conduct sentiment analysis on MLK’s speech to determine how positive/negative his speech was. Split his speech into four quartiles to see how that sentiment changes over time. Create two bar charts to display your results.
# Add your library below.
library(XML)
## Warning: package 'XML' was built under R version 4.2.3
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.2.3
library(tm)
# Set CRAN mirror
options(repos = "https://cran.r-project.org")
Sentiment analysis relies on a “dictionary”. Most dictionaries categorize words as either positive or negative, but some dictionaries use emotion (such as the NRC EmoLex Dictionary). Each dictionary is different. This assignment will introduce you to the Bing dictionary, which researchers created by categorizing words used in online reviews from Amazon, Yelp, and other similar platforms.
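As a minimal sketch of how a dictionary-based approach works, the base R snippet below scores one toy sentence against two tiny hand-made word lists. The lists, the sentence, and all variable names are invented for illustration and are not taken from the Bing lexicon.

```r
# Hypothetical mini-dictionaries (NOT the real Bing lists)
toy_positive <- c("happy", "great", "hope")
toy_negative <- c("sad", "poverty", "injustice")

# Toy sentence, lower-cased and split on runs of characters
# that are not letters or apostrophes
sentence <- "A great hope ended the sad night"
tokens <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]

# Count dictionary hits and scale by the total number of words
n_pos <- sum(tokens %in% toy_positive)
n_neg <- sum(tokens %in% toy_negative)
toy_scores <- c(positive = n_pos / length(tokens),
                negative = n_neg / length(tokens))
toy_scores
```

Because the score depends entirely on which words the dictionary contains, the same text can score differently under a different lexicon.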
The files needed for this lab are stored in a RAR file. You must extract the files from the compressed RAR file by using a third-party application, such as 7-Zip, WinZip, or another program. Use Google to find a RAR file extractor.
Find the RAR file on the UIC website (contains two text files: positive words and negative words). This file is about halfway down the page, listed as “A list of English positive and negative opinion words or sentiment words”. Use the link below:
Save these files in your “data” folder.
# No code necessary; Save the files in your project's data folder.
Create two vectors of words, one for the positive words and one for the negative words.
# Defining the path to the positive word file
positive_words_file <- "/Users/auz/Desktop/week7_Lab/data/positive-words.txt"
# Defining the path to the negative word file
negative_words_file <- "/Users/auz/Desktop/week7_Lab/data/negative-words.txt"
# Reading positive words from file
positive_words <- scan(positive_words_file, what = "character", sep = "\n")
# Reading negative words from file
negative_words <- scan(negative_words_file, what = "character", sep = "\n")
# Printing the first few words from each list to verify
head(positive_words)
## [1] ";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;"
## [2] "; "
## [3] "; Opinion Lexicon: Positive"
## [4] ";"
## [5] "; This file contains a list of POSITIVE opinion words (or sentiment words)."
## [6] ";"
head(negative_words)
## [1] ";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;"
## [2] "; "
## [3] "; Opinion Lexicon: Negative"
## [4] ";"
## [5] "; This file contains a list of NEGATIVE opinion words (or sentiment words)."
## [6] ";"
Note that when reading in the word files, there might be lines at the start and/or the end that will need to be removed (i.e. you should clean your dataset).
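One possible way to do this cleaning in base R is sketched below on a small in-memory vector that imitates the file layout shown above (a header of lines starting with ";", a blank line, then one word per line). The toy entries are assumptions for illustration, not the actual file contents.

```r
# Toy lines imitating the lexicon layout (hypothetical content)
raw_lines <- c(";;;;;;;;", "; Opinion Lexicon: Positive", ";", "",
               "abound", "2-faced", "zest")

# Flag comment lines (starting with ';') and empty lines
is_header <- grepl("^;", raw_lines)
is_blank  <- trimws(raw_lines) == ""

# Keep everything else
clean_words <- raw_lines[!is_header & !is_blank]
clean_words
```

Filtering on the comment marker rather than on "starts with a letter" keeps any entries that begin with a digit or symbol, should the lexicon contain them.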
# Loading necessary libraries
install.packages("tidytext")
##
## The downloaded binary packages are in
## /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpZc7htQ/downloaded_packages
library(tidytext)
# Step 1.3 - Clean the files
# Cleaning the positive word file
positive_words <- read_lines(positive_words_file) %>%
  as_tibble() %>%
  filter(str_detect(value, "^[a-zA-Z]")) %>% # Keep only lines starting with alphabetic characters
  pull(value)
# Cleaning the negative word file
negative_words <- read_lines(negative_words_file) %>%
  as_tibble() %>%
  filter(str_detect(value, "^[a-zA-Z]")) %>% # Keep only lines starting with alphabetic characters
  pull(value)
Text is stored in many different formats, such as TXT, CSV, HTML, and JSON. In this lab, you are going to experience how to “parse HTML” for text analysis.
Find MLK’s speech on the AnalyticTech website. You can either read in the file using the XML package, or you can copy/paste the document into a TXT file.
Use the link below:
# Write your code below.
# Load necessary library
library(xml2)
# Step 2.1 - Find and read in the file
# Defining the URL of MLK's speech
mlk_url <- "http://www.analytictech.com/mb021/mlk.htm"
# Reading and parsing the HTML content of the webpage
mlk_html <- read_html(mlk_url)
# Extracting the text content from the HTML paragraphs
mlk_paragraphs <- xml_text(xml_find_all(mlk_html, "//p"))
# Concatenating the paragraphs into a single string
mlk_text <- paste(mlk_paragraphs, collapse = " ")
# Printing the extracted text
print(mlk_text)
## [1] "I am happy to join with you today in what will go down in\r\nhistory as the greatest demonstration for freedom in the history\r\nof our nation. Five score years ago a great American in whose symbolic shadow\r\nwe stand today signed the Emancipation Proclamation. This\r\nmomentous decree came as a great beckoning light of hope to\r\nmillions of Negro slaves who had been seared in the flames of\r\nwithering injustice. It came as a joyous daybreak to end the long\r\nnight of their captivity. But one hundred years later the Negro is still not free. One\r\nhundred years later the life of the Negro is still sadly crippled\r\nby the manacles of segregation and the chains of discrimination. One hundred years later the Negro lives on a lonely island of\r\npoverty in the midst of a vast ocean of material prosperity. One hundred years later the Negro is still languishing in the\r\ncomers of American society and finds himself in exile in his own\r\nland. We all have come to this hallowed spot to remind America of\r\nthe fierce urgency of now. Now is the time to rise from the dark\r\nand desolate valley of segregation to the sunlit path of racial\r\njustice. Now is the time to change racial injustice to the solid\r\nrock of brotherhood. Now is the time to make justice ring out for\r\nall of God's children. There will be neither rest nor tranquility in America until\r\nthe Negro is granted citizenship rights. We must forever conduct our struggle on the high plane of\r\ndignity and discipline. We must not allow our creative protest to\r\ndegenerate into physical violence. Again and again we must rise\r\nto the majestic heights of meeting physical force with soul\r\nforce. And the marvelous new militarism which has engulfed the Negro\r\ncommunity must not lead us to a distrust of all white people, for\r\nmany of our white brothers have evidenced by their presence here\r\ntoday that they have come to realize that their destiny is part\r\nof our destiny. 
So even though we face the difficulties of today and tomorrow\r\nI still have a dream. It is a dream deeply rooted in the American\r\ndream. I have a dream that one day this nation will rise up and live\r\nout the true meaning of its creed: 'We hold these truths to be\r\nself-evident; that all men are created equal.\" I have a dream that one day on the red hills of Georgia the\r\nsons of former slaves and the sons of former slave owners will be\r\nable to sit together at the table of brotherhood. I have a dream that one day even the state of Mississippi, a\r\nstate sweltering with the heat of injustice, sweltering with the\r\nheat of oppression, will be transformed into an oasis of freedom\r\nand justice. I have a dream that little children will one day live in a\r\nnation where they will not be judged by the color of their skin\r\nbut by the content of their character. I have a dream today. I have a dream that one day down in Alabama, with its vicious\r\nracists, with its Governor having his lips dripping with the\r\nwords of interposition and nullification, one day right there in\r\nAlabama little black boys and black girls will be able to join\r\nhands with little white boys and white girls as sisters and\r\nbrothers. I have a dream today. I have a dream that one day every valley shall be exalted,\r\nevery hill and mountain shall be made low, the rough places\r\nplains, and the crooked places will be made straight, and before\r\nthe Lord will be revealed, and all flesh shall see it together. This is our hope. This is the faith that I go back to the\r\nmount with. With this faith we will be able to hew out of the\r\nmountain of despair a stone of hope. With this faith we will be\r\nable to transform the genuine discords of our nation into a\r\nbeautiful symphony of brotherhood. 
With this faith we will be\r\nable to work together, pray together; to struggle together, to go\r\nto jail together, to stand up for freedom forever, )mowing that\r\nwe will be free one day. And I say to you today my friends, let freedom ring. From the\r\nprodigious hilltops of New Hampshire, let freedom ring. From the\r\nmighty mountains of New York, let freedom ring. From the mighty\r\nAlleghenies of Pennsylvania! Let freedom ring from the snow capped Rockies of Colorado! Let freedom ring from the curvaceous slopes of California! But not only there; let freedom ring from the Stone Mountain\r\nof Georgia! Let freedom ring from Lookout Mountain in Tennessee! Let freedom ring from every hill and molehill in Mississippi.\r\nFrom every mountainside, let freedom ring. And when this happens, when we allow freedom to ring, when we\r\nlet it ring from every village and hamlet, from every state and\r\nevery city, we will be able to speed up that day when all of\r\nGod's children, black men and white men, Jews and Gentiles,\r\nProtestants and Catholics, will be able to join hands and sing in\r\nthe words of the old Negro spiritual, \"Free at last! Free at\r\nlast! Thank God almighty, we're free at last!\" "
If you choose to read the raw HTML using the XML package, you will need to parse the HTML object. For this exercise, we can split the HTML by the paragraph tag and then store the paragraphs inside a vector. The following code might help:
# Read and parse HTML file
doc.html = htmlTreeParse('http://www.analytictech.com/mb021/mlk.htm',
useInternal = TRUE)
# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))
# Replace all \n by spaces
doc.text = gsub('\\n', ' ', doc.text)
# Replace all \r by spaces
doc.text = gsub('\\r', ' ', doc.text)
# Write your code below, if necessary.
# Loading necessary library
library(xml2)
# Step 2.2 - Parse the HTML file and extract paragraphs
# Reading and parsing the HTML file
mlk_html <- read_html('http://www.analytictech.com/mb021/mlk.htm')
# creating a character vector.
mlk_paragraphs <- xml_text(xml_find_all(mlk_html, "//p"))
# Concatenating the paragraphs into a single string
mlk_text <- paste(mlk_paragraphs, collapse = " ")
# Replacing all \n by spaces
mlk_text <- gsub('\\n', ' ', mlk_text)
# Replacing all \r by spaces
mlk_text <- gsub('\\r', ' ', mlk_text)
# Printing the extracted text
print(mlk_text)
## [1] "I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation. Five score years ago a great American in whose symbolic shadow we stand today signed the Emancipation Proclamation. This momentous decree came as a great beckoning light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity. But one hundred years later the Negro is still not free. One hundred years later the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later the Negro is still languishing in the comers of American society and finds himself in exile in his own land. We all have come to this hallowed spot to remind America of the fierce urgency of now. Now is the time to rise from the dark and desolate valley of segregation to the sunlit path of racial justice. Now is the time to change racial injustice to the solid rock of brotherhood. Now is the time to make justice ring out for all of God's children. There will be neither rest nor tranquility in America until the Negro is granted citizenship rights. We must forever conduct our struggle on the high plane of dignity and discipline. We must not allow our creative protest to degenerate into physical violence. Again and again we must rise to the majestic heights of meeting physical force with soul force. And the marvelous new militarism which has engulfed the Negro community must not lead us to a distrust of all white people, for many of our white brothers have evidenced by their presence here today that they have come to realize that their destiny is part of our destiny. So even though we face the difficulties of today and tomorrow I still have a dream. 
It is a dream deeply rooted in the American dream. I have a dream that one day this nation will rise up and live out the true meaning of its creed: 'We hold these truths to be self-evident; that all men are created equal.\" I have a dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slave owners will be able to sit together at the table of brotherhood. I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice. I have a dream that little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today. I have a dream that one day down in Alabama, with its vicious racists, with its Governor having his lips dripping with the words of interposition and nullification, one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today. I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places plains, and the crooked places will be made straight, and before the Lord will be revealed, and all flesh shall see it together. This is our hope. This is the faith that I go back to the mount with. With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the genuine discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, pray together; to struggle together, to go to jail together, to stand up for freedom forever, )mowing that we will be free one day. And I say to you today my friends, let freedom ring. From the prodigious hilltops of New Hampshire, let freedom ring. 
From the mighty mountains of New York, let freedom ring. From the mighty Alleghenies of Pennsylvania! Let freedom ring from the snow capped Rockies of Colorado! Let freedom ring from the curvaceous slopes of California! But not only there; let freedom ring from the Stone Mountain of Georgia! Let freedom ring from Lookout Mountain in Tennessee! Let freedom ring from every hill and molehill in Mississippi. From every mountainside, let freedom ring. And when this happens, when we allow freedom to ring, when we let it ring from every village and hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual, \"Free at last! Free at last! Thank God almighty, we're free at last!\" "
Text must be processed before it can be analyzed. There are many ways to process text. This class has introduced you to two ways:
Either create a term-document matrix or unnest the tokens.
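To make the "unnest the tokens" idea concrete, here is a rough base R sketch that turns a short toy string into one word per row and then counts the words. It only approximates what a tokenizer such as unnest_tokens() does (lower-casing and stripping punctuation); the text and variable names are made up for illustration.

```r
# Toy text (hypothetical, for illustration only)
toy_text <- "Let freedom ring. Let freedom ring from every hill!"

# Lower-case, then split on runs of characters that are not
# letters or apostrophes
toy_tokens <- strsplit(tolower(toy_text), "[^a-z']+")[[1]]
toy_tokens <- toy_tokens[toy_tokens != ""]

# One row per word, similar in spirit to tokenized tidy text
toy_df <- data.frame(word = toy_tokens, stringsAsFactors = FALSE)

# Word frequencies, most frequent first
toy_counts <- sort(table(toy_df$word), decreasing = TRUE)
toy_counts
```

A term-document matrix stores the same counts, but with one row per term and one column per document instead of one row per token.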
# Write your code below.
# Loading necessary library
library(tidytext)
# Step 2.3 - Unnest tokens
# Creating a tibble with a single column named 'text' containing the entire speech
mlk_data <- tibble(text = mlk_text)
# Tokenizing the entire text as one document
mlk_tokens <- mlk_data %>%
  unnest_tokens(word, text, token = "words")
# Displaying the entire tokenized text
mlk_tokens
## # A tibble: 882 × 1
## word
## <chr>
## 1 i
## 2 am
## 3 happy
## 4 to
## 5 join
## 6 with
## 7 you
## 8 today
## 9 in
## 10 what
## # ℹ 872 more rows
Create a list of counts for each word.
# Write your code below.
# Counting the frequency of each word
word_freq <- mlk_tokens %>%
  count(word, sort = TRUE)
# Viewing the word counts
word_freq
## # A tibble: 323 × 2
## word n
## <chr> <int>
## 1 the 54
## 2 of 49
## 3 to 29
## 4 and 27
## 5 a 20
## 6 in 17
## 7 be 16
## 8 will 16
## 9 we 14
## 10 freedom 13
## # ℹ 313 more rows
Determine how many positive words were in the speech. Scale the
number based on the total number of words in the speech.
Hint: One way to do this is to use match()
and then which(). If you choose the tidyverse method, try
group_by() and then count().
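The match() and which() hint can be sketched on toy data as follows. The token vector and the three-word dictionary are hypothetical. The sketch also shows why the number of matches, not the sum of the matched positions, is the quantity to scale.

```r
# Toy inputs (hypothetical): speech tokens and a small positive list
hint_tokens   <- c("i", "am", "happy", "to", "join", "great", "happy")
hint_positive <- c("great", "happy", "joyous")

# match() gives, for each token, its position in the dictionary
# (0 means the token is not in the dictionary)
positions <- match(hint_tokens, hint_positive, nomatch = 0)

# which() picks out the tokens that matched; length() counts them
matched_idx <- which(positions > 0)
n_positive  <- length(matched_idx)

# Summing the positions adds up dictionary indices,
# which is not the number of matches
sum_of_positions <- sum(positions)

# Scale the count by the total number of tokens
positive_share <- n_positive / length(hint_tokens)
c(n_positive = n_positive, sum_of_positions = sum_of_positions)
```

A scaled count is a proportion, so it should always fall between 0 and 1; a value above 1 signals that something other than matches was summed.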
# Write your code below.
# Counting the positive words in the speech (one TRUE per token found in the list)
positive_word_count <- sum(mlk_tokens$word %in% positive_words)
# Total number of words in the speech
total_word_count <- nrow(mlk_tokens)
# Scaling the count of positive words based on the total number of words
positive_word_ratio <- positive_word_count / total_word_count
# View
positive_word_ratio
## [1] 0.05782313
Determine how many negative words were in the speech. Scale the
number based on the total number of words in the speech.
Hint: This is basically the same as Step 3.
# Write your code below.
# Counting the negative words in the speech (one TRUE per token found in the list)
negative_word_count <- sum(mlk_tokens$word %in% negative_words)
# Scaling the count of negative words based on the total number of words
negative_word_ratio <- negative_word_count / nrow(mlk_tokens)
# View
negative_word_ratio
## [1] 0.03174603
Redo the “positive” and “negative” calculations for each 25% of the speech by following the steps below.
Compare the results (e.g., a simple bar chart of the 4
numbers).
For each quarter of the text, you calculate the positive and negative
ratio, as was done in Step 4 and Step 5.
The only extra work is to split the text to four equal parts, then
visualize the positive and negative ratios by plotting.
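One way to split a token vector into four nearly equal parts is sketched below in base R. The ten placeholder words and the toy dictionary are assumptions for illustration; the length is deliberately not divisible by 4 to show how the remainder is handled.

```r
# Hypothetical token vector of length 10
toy_words <- c("w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9", "w10")
n <- length(toy_words)

# Assign each position to a quarter 1..4; ceiling() keeps the
# group sizes within one word of each other
quarter_id <- ceiling(seq_len(n) * 4 / n)

# Split into a list of four character vectors
quarters <- split(toy_words, quarter_id)

# Share of "positive" words per quarter, using a toy dictionary
toy_dictionary <- c("w1", "w2", "w10")
quarter_ratio <- vapply(quarters,
                        function(q) mean(q %in% toy_dictionary),
                        numeric(1))
quarter_ratio
```

The resulting named vector of four ratios can be passed directly to a plotting function, for example base R's barplot(quarter_ratio).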
The final graphs should look like below:
HINT: The code below shows how to start the first 25% of the speech. Finish the analysis and use the same approach for the rest of the speech.
# Step 5: Redo the positive and negative calculations for each 25% of the speech
# define a cutpoint to split the document into 4 parts; round the number to get an integer
cutpoint <- round(length(words.corpus)/4)
# first 25%
# create word corpus for the first quarter using cutpoints
words.corpus1 <- words.corpus[1:cutpoint]
# create term document matrix for the first quarter
tdm1 <- TermDocumentMatrix(words.corpus1)
# convert tdm1 into a matrix called "m1"
m1 <- as.matrix(tdm1)
# create a list of word counts for the first quarter and sort the list
wordCounts1 <- rowSums(m1)
wordCounts1 <- sort(wordCounts1, decreasing=TRUE)
# calculate total words of the first 25%
# Write your code below.
# Defining a function to calculate positive and negative ratios
calculate_ratios <- function(words_df) {
  # Counting total words
  total_words <- nrow(words_df)
  # Finding positive words
  positive_word_count <- sum(words_df$word %in% positive_words)
  # Finding negative words
  negative_word_count <- sum(words_df$word %in% negative_words)
  # Calculating positive and negative ratios
  positive_ratio <- positive_word_count / total_words
  negative_ratio <- negative_word_count / total_words
  # Return ratios
  return(c(positive_ratio, negative_ratio))
}
# Splitting the text into four equal parts
quarter_size <- nrow(mlk_tokens) / 4
# Initialize vectors to store results
positive_ratios <- numeric(4)
negative_ratios <- numeric(4)
# Iterate over each quarter of the speech
for (i in 1:4) {
  # Defining start and end indices for the current quarter
  start_index <- as.integer(((i - 1) * quarter_size) + 1)
  end_index <- as.integer(min(i * quarter_size, nrow(mlk_tokens)))
  # Subset the data for the current quarter
  quarter_words <- mlk_tokens[start_index:end_index, ]
  # Calculate positive and negative ratios for the current quarter
  ratios <- calculate_ratios(quarter_words)
  # Storing results
  positive_ratios[i] <- ratios[1]
  negative_ratios[i] <- ratios[2]
}
# Viewing the positive and negative ratios for each quarter
positive_ratios
## [1] 0.05454545 0.03619910 0.03181818 0.10859729
negative_ratios
## [1] 0.054545455 0.031674208 0.036363636 0.004524887
library(ggplot2)
# Creating a data frame for plotting positive ratios
positive_plot_data <- data.frame(
  Quarter = 1:4,
  Ratio = positive_ratios
)
# Plotting for positive ratios
ggplot(positive_plot_data, aes(x = Quarter, y = Ratio)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Positive Ratios by Quarter",
       x = "Quarter",
       y = "Ratio") +
  theme_minimal()
# Creating a data frame for plotting negative ratios
negative_plot_data <- data.frame(
  Quarter = 1:4,
  Ratio = negative_ratios
)
# Plotting for negative ratios
ggplot(negative_plot_data, aes(x = Quarter, y = Ratio)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Negative Ratios by Quarter",
       x = "Quarter",
       y = "Ratio") +
  theme_minimal()
What do you see from the positive/negative ratio in the graph? State what you learned from the MLK speech using the sentiment analysis results:
[I struggled to get my code to match yours, so I can only discuss my own results. In my data, the first quarter is evenly balanced, with about 5.5% positive and 5.5% negative words. The two middle quarters use fewer sentiment words overall (roughly 3% to 4% of each kind), and the third quarter tilts slightly negative. The final quarter is strongly positive (about 10.9% positive versus 0.5% negative), which wraps the speech up on a hopeful note and, I think, is pretty true to the feel of the speech.]