Instructions

Conduct sentiment analysis on MLK’s speech to determine how positive or negative it was. Split the speech into four quartiles to see how the sentiment changes over time. Create two bar charts to display your results.

# Add your library below.
library(XML)

## Warning: package 'XML' was built under R version 4.2.3

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.2.3

library(tm)
# Set CRAN mirror
options(repos = "https://cran.r-project.org")

Step 1 - Read in the Bing Dictionary

Sentiment analysis relies on a “dictionary”. Most dictionaries categorize words as either positive or negative, but some dictionaries use emotion (such as the NRC EmoLex Dictionary). Each dictionary is different. This assignment will introduce you to the Bing dictionary, which researchers created by categorizing words used in online reviews from Amazon, Yelp, and other similar platforms.
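To make the dictionary idea concrete, here is a minimal sketch in base R. The word lists `toy_positive` and `toy_negative` are made-up stand-ins for illustration, not the actual Bing lists; each word is labeled by checking which list it appears in.

```r
# Hypothetical miniature lexicon; the real Bing lists contain thousands of words
toy_positive <- c("happy", "great", "freedom")
toy_negative <- c("injustice", "despair", "crippled")

# A few example words to classify
words <- c("happy", "the", "injustice", "freedom", "of")

# Label each word by dictionary membership; anything else is "neutral"
label <- ifelse(words %in% toy_positive, "positive",
         ifelse(words %in% toy_negative, "negative", "neutral"))

sentiment_table <- data.frame(word = words, sentiment = label)
print(sentiment_table)
```

Words that appear in neither list (such as "the" and "of") carry no sentiment under this approach, which is why the results depend heavily on the dictionary chosen.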

Step 1.1 - Find the files

The files needed for this lab are stored in a RAR file. You must extract the files from the compressed RAR file by using a third-party application, such as 7-Zip, WinZip, or another program. Use Google to find a RAR file extractor.

Find the RAR file on the UIC website (it contains two text files: positive words and negative words). This file is about halfway down the page, listed as “A list of English positive and negative opinion words or sentiment words”. Use the link below:

http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Save these files in your “data” folder.

# No code necessary; Save the files in your project's data folder.

Step 1.2 - Create vectors

Create two vectors of words, one for the positive words and one for the negative words.

# Defining the path to the positive word file
positive_words_file <- "/Users/auz/Desktop/week7_Lab/data/positive-words.txt"

# Defining the path to the negative word file
negative_words_file <- "/Users/auz/Desktop/week7_Lab/data/negative-words.txt"

# Reading positive words from file
positive_words <- scan(positive_words_file, what = "character", sep = "\n")

# Reading negative words from file
negative_words <- scan(negative_words_file, what = "character", sep = "\n")

# Printing the first few words from each list to verify
head(positive_words)

## [1] ";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;"
## [2] "; "                                                                          
## [3] "; Opinion Lexicon: Positive"                                                 
## [4] ";"                                                                           
## [5] "; This file contains a list of POSITIVE opinion words (or sentiment words)." 
## [6] ";"

head(negative_words)

## [1] ";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;"
## [2] "; "                                                                            
## [3] "; Opinion Lexicon: Negative"                                                   
## [4] ";"                                                                             
## [5] "; This file contains a list of NEGATIVE opinion words (or sentiment words)."   
## [6] ";"

Step 1.3 - Clean the files

Note that when reading in the word files, there might be lines at the start and/or the end that will need to be removed (i.e. you should clean your dataset).
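As a rough illustration of this cleaning step, the sketch below drops comment and blank lines from a small made-up vector that imitates the layout of the lexicon files. The contents of `raw_lines` are an assumption for demonstration; filtering on the leading `;` is one possible approach, alongside keeping only lines that start with a letter.

```r
# Made-up lines imitating the layout of the lexicon files
raw_lines <- c(";;;;;;;;", "; Opinion Lexicon: Positive", ";", "",
               "a+", "abound", "abounds")

# Keep only non-empty lines that do not start with the comment marker ";"
clean_words <- raw_lines[nzchar(raw_lines) & !startsWith(raw_lines, ";")]
print(clean_words)
```

Checking `head()` and `tail()` of the cleaned vector is a quick way to confirm that no header or footer lines remain.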

# Installing tidytext only if it is not already available, then loading it
if (!requireNamespace("tidytext", quietly = TRUE)) {
  install.packages("tidytext")
}

library(tidytext)

# Cleaning the positive word file
positive_words <- read_lines(positive_words_file) %>%
  as_tibble() %>%
  filter(str_detect(value, "^[a-zA-Z]")) %>%
  pull(value)  # Keep only lines starting with alphabetic characters

# Cleaning the negative word file
negative_words <- read_lines(negative_words_file) %>%
  as_tibble() %>%
  filter(str_detect(value, "^[a-zA-Z]")) %>%
  pull(value)  # Keep only lines starting with alphabetic characters

Step 2: Process the MLK speech

Text is stored in many different formats, such as TXT, CSV, HTML, and JSON. In this lab, you are going to experience how to “parse HTML” for text analysis.

Step 2.1 - Find and read in the file.

Find MLK’s speech on the AnalyticTech website. You can either read in the file using the XML package, or you can copy/paste the document into a TXT file.

Use the link below:

http://www.analytictech.com/mb021/mlk.htm

# Write your code below.
# Load necessary library
library(xml2)

# Step 2.1 - Find and read in the file

# Defining the URL of MLK's speech
mlk_url <- "http://www.analytictech.com/mb021/mlk.htm"

# Reading and parsing the HTML content of the webpage
mlk_html <- read_html(mlk_url)

# Extracting the text content from the HTML paragraphs
mlk_paragraphs <- xml_text(xml_find_all(mlk_html, "//p"))

# Concatenating the paragraphs into a single string
mlk_text <- paste(mlk_paragraphs, collapse = " ")

# Printing the extracted text
print(mlk_text)

## [1] "I am happy to join with you today in what will go down in\r\nhistory as the greatest demonstration for freedom in the history\r\nof our nation.  Five score years ago a great American in whose symbolic shadow\r\nwe stand today signed the Emancipation Proclamation. This\r\nmomentous decree came as a great beckoning light of hope to\r\nmillions of Negro slaves who had been seared in the flames of\r\nwithering injustice. It came as a joyous daybreak to end the long\r\nnight of their captivity.  But one hundred years later the Negro is still not free. One\r\nhundred years later the life of the Negro is still sadly crippled\r\nby the manacles of segregation and the chains of discrimination.  One hundred years later the Negro lives on a lonely island of\r\npoverty in the midst of a vast ocean of material prosperity.  One hundred years later the Negro is still languishing in the\r\ncomers of American society and finds himself in exile in his own\r\nland.  We all have come to this hallowed spot to remind America of\r\nthe fierce urgency of now. Now is the time to rise from the dark\r\nand desolate valley of segregation to the sunlit path of racial\r\njustice. Now is the time to change racial injustice to the solid\r\nrock of brotherhood. Now is the time to make justice ring out for\r\nall of God's children.  There will be neither rest nor tranquility in America until\r\nthe Negro is granted citizenship rights.  We must forever conduct our struggle on the high plane of\r\ndignity and discipline. We must not allow our creative protest to\r\ndegenerate into physical violence. Again and again we must rise\r\nto the majestic heights of meeting physical force with soul\r\nforce.  And the marvelous new militarism which has engulfed the Negro\r\ncommunity must not lead us to a distrust of all white people, for\r\nmany of our white brothers have evidenced by their presence here\r\ntoday that they have come to realize that their destiny is part\r\nof our destiny.  
So even though we face the difficulties of today and tomorrow\r\nI still have a dream. It is a dream deeply rooted in the American\r\ndream.  I have a dream that one day this nation will rise up and live\r\nout the true meaning of its creed: 'We hold these truths to be\r\nself-evident; that all men are created equal.\"  I have a dream that one day on the red hills of Georgia the\r\nsons of former slaves and the sons of former slave owners will be\r\nable to sit together at the table of brotherhood.  I have a dream that one day even the state of Mississippi, a\r\nstate sweltering with the heat of injustice, sweltering with the\r\nheat of oppression, will be transformed into an oasis of freedom\r\nand justice.  I have a dream that little children will one day live in a\r\nnation where they will not be judged by the color of their skin\r\nbut by the content of their character.  I have a dream today.  I have a dream that one day down in Alabama, with its vicious\r\nracists, with its Governor having his lips dripping with the\r\nwords of interposition and nullification, one day right there in\r\nAlabama little black boys and black girls will be able to join\r\nhands with little white boys and white girls as sisters and\r\nbrothers.  I have a dream today.  I have a dream that one day every valley shall be exalted,\r\nevery hill and mountain shall be made low, the rough places\r\nplains, and the crooked places will be made straight, and before\r\nthe Lord will be revealed, and all flesh shall see it together.  This is our hope. This is the faith that I go back to the\r\nmount with. With this faith we will be able to hew out of the\r\nmountain of despair a stone of hope. With this faith we will be\r\nable to transform the genuine discords of our nation into a\r\nbeautiful symphony of brotherhood. 
With this faith we will be\r\nable to work together, pray together; to struggle together, to go\r\nto jail together, to stand up for freedom forever, )mowing that\r\nwe will be free one day.  And I say to you today my friends, let freedom ring. From the\r\nprodigious hilltops of New Hampshire, let freedom ring. From the\r\nmighty mountains of New York, let freedom ring. From the mighty\r\nAlleghenies of Pennsylvania!  Let freedom ring from the snow capped Rockies of Colorado!  Let freedom ring from the curvaceous slopes of California!  But not only there; let freedom ring from the Stone Mountain\r\nof Georgia!  Let freedom ring from Lookout Mountain in Tennessee!  Let freedom ring from every hill and molehill in Mississippi.\r\nFrom every mountainside, let freedom ring.  And when this happens, when we allow freedom to ring, when we\r\nlet it ring from every village and hamlet, from every state and\r\nevery city, we will be able to speed up that day when all of\r\nGod's children, black men and white men, Jews and Gentiles,\r\nProtestants and Catholics, will be able to join hands and sing in\r\nthe words of the old Negro spiritual, \"Free at last! Free at\r\nlast! Thank God almighty, we're free at last!\" "

Step 2.2 - Parse the files

If you choose to read the raw HTML using the XML package, you will need to parse the HTML object. For this exercise, we can split the HTML by the paragraph tag and then store the paragraphs inside a vector. The following code might help:

# Read and parse HTML file

doc.html = htmlTreeParse('http://www.analytictech.com/mb021/mlk.htm', 
                         useInternal = TRUE)

# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.

doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))

# Replace all \n by spaces
doc.text = gsub('\\n', ' ', doc.text)

# Replace all \r by spaces
doc.text = gsub('\\r', ' ', doc.text)

# Write your code below, if necessary.
# Loading necessary library
library(xml2)

# Step 2.2 - Parse the HTML file and extract paragraphs

# Reading and parsing the HTML file
mlk_html <- read_html('http://www.analytictech.com/mb021/mlk.htm')

# Extracting the text of every <p> element into a character vector
mlk_paragraphs <- xml_text(xml_find_all(mlk_html, "//p"))

# Concatenating the paragraphs into a single string
mlk_text <- paste(mlk_paragraphs, collapse = " ")

# Replacing all \n by spaces
mlk_text <- gsub('\\n', ' ', mlk_text)

# Replacing all \r by spaces
mlk_text <- gsub('\\r', ' ', mlk_text)

# Printing the extracted text
print(mlk_text)

## [1] "I am happy to join with you today in what will go down in  history as the greatest demonstration for freedom in the history  of our nation.  Five score years ago a great American in whose symbolic shadow  we stand today signed the Emancipation Proclamation. This  momentous decree came as a great beckoning light of hope to  millions of Negro slaves who had been seared in the flames of  withering injustice. It came as a joyous daybreak to end the long  night of their captivity.  But one hundred years later the Negro is still not free. One  hundred years later the life of the Negro is still sadly crippled  by the manacles of segregation and the chains of discrimination.  One hundred years later the Negro lives on a lonely island of  poverty in the midst of a vast ocean of material prosperity.  One hundred years later the Negro is still languishing in the  comers of American society and finds himself in exile in his own  land.  We all have come to this hallowed spot to remind America of  the fierce urgency of now. Now is the time to rise from the dark  and desolate valley of segregation to the sunlit path of racial  justice. Now is the time to change racial injustice to the solid  rock of brotherhood. Now is the time to make justice ring out for  all of God's children.  There will be neither rest nor tranquility in America until  the Negro is granted citizenship rights.  We must forever conduct our struggle on the high plane of  dignity and discipline. We must not allow our creative protest to  degenerate into physical violence. Again and again we must rise  to the majestic heights of meeting physical force with soul  force.  And the marvelous new militarism which has engulfed the Negro  community must not lead us to a distrust of all white people, for  many of our white brothers have evidenced by their presence here  today that they have come to realize that their destiny is part  of our destiny.  
So even though we face the difficulties of today and tomorrow  I still have a dream. It is a dream deeply rooted in the American  dream.  I have a dream that one day this nation will rise up and live  out the true meaning of its creed: 'We hold these truths to be  self-evident; that all men are created equal.\"  I have a dream that one day on the red hills of Georgia the  sons of former slaves and the sons of former slave owners will be  able to sit together at the table of brotherhood.  I have a dream that one day even the state of Mississippi, a  state sweltering with the heat of injustice, sweltering with the  heat of oppression, will be transformed into an oasis of freedom  and justice.  I have a dream that little children will one day live in a  nation where they will not be judged by the color of their skin  but by the content of their character.  I have a dream today.  I have a dream that one day down in Alabama, with its vicious  racists, with its Governor having his lips dripping with the  words of interposition and nullification, one day right there in  Alabama little black boys and black girls will be able to join  hands with little white boys and white girls as sisters and  brothers.  I have a dream today.  I have a dream that one day every valley shall be exalted,  every hill and mountain shall be made low, the rough places  plains, and the crooked places will be made straight, and before  the Lord will be revealed, and all flesh shall see it together.  This is our hope. This is the faith that I go back to the  mount with. With this faith we will be able to hew out of the  mountain of despair a stone of hope. With this faith we will be  able to transform the genuine discords of our nation into a  beautiful symphony of brotherhood. With this faith we will be  able to work together, pray together; to struggle together, to go  to jail together, to stand up for freedom forever, )mowing that  we will be free one day.  
And I say to you today my friends, let freedom ring. From the  prodigious hilltops of New Hampshire, let freedom ring. From the  mighty mountains of New York, let freedom ring. From the mighty  Alleghenies of Pennsylvania!  Let freedom ring from the snow capped Rockies of Colorado!  Let freedom ring from the curvaceous slopes of California!  But not only there; let freedom ring from the Stone Mountain  of Georgia!  Let freedom ring from Lookout Mountain in Tennessee!  Let freedom ring from every hill and molehill in Mississippi.  From every mountainside, let freedom ring.  And when this happens, when we allow freedom to ring, when we  let it ring from every village and hamlet, from every state and  every city, we will be able to speed up that day when all of  God's children, black men and white men, Jews and Gentiles,  Protestants and Catholics, will be able to join hands and sing in  the words of the old Negro spiritual, \"Free at last! Free at  last! Thank God almighty, we're free at last!\" "

Step 2.3 - Transform the text

Text must be processed before it can be analyzed. There are many ways to process text. This class has introduced you to two ways:

Using the TM package to manipulate term-document matrices
Using the tidytext package to unnest tokens

Either create a term-document matrix or unnest the tokens.
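To show roughly what tokenizing and counting involve, here is a simplified base R sketch on a short sample sentence. It is only an approximation of what `unnest_tokens()` does (the tidytext function handles punctuation and edge cases more carefully), and `sample_text` is an illustrative stand-in for the speech.

```r
# Short sample sentence standing in for the speech text
sample_text <- "Now is the time to make justice ring out. Now is the time!"

# Lowercase, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(sample_text), "[^a-z']+")[[1]]
tokens <- tokens[nzchar(tokens)]  # drop any empty strings left by the split

# Count each distinct word, most frequent first
word_counts <- sort(table(tokens), decreasing = TRUE)
print(word_counts)
```

Each token becomes one element of the vector, which mirrors the one-word-per-row layout that tidytext produces.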

# Write your code below.
# Loading necessary library
library(tidytext)

# Step 2.3 - Unnest tokens
# Creating a tibble with a single column named 'text' containing the entire speech
mlk_data <- tibble(text = mlk_text)

# Tokenizing the entire text as one document
mlk_tokens <- mlk_data %>%
  unnest_tokens(word, text, token = "words")

# Displaying the entire tokenized text
mlk_tokens

## # A tibble: 882 × 1
##    word 
##    <chr>
##  1 i    
##  2 am   
##  3 happy
##  4 to   
##  5 join 
##  6 with 
##  7 you  
##  8 today
##  9 in   
## 10 what 
## # ℹ 872 more rows

Step 2.4 - Create a list of word frequencies

Create a list of counts for each word.

# Write your code below.
# Counting the frequency of each word
word_freq <- mlk_tokens %>%
  count(word, sort = TRUE)

# Viewing the word counts
word_freq

## # A tibble: 323 × 2
##    word        n
##    <chr>   <int>
##  1 the        54
##  2 of         49
##  3 to         29
##  4 and        27
##  5 a          20
##  6 in         17
##  7 be         16
##  8 will       16
##  9 we         14
## 10 freedom    13
## # ℹ 313 more rows

Step 3: Positive words

Determine how many positive words were in the speech. Scale the number based on the total number of words in the speech. Hint: One way to do this is to use match() and then which(). If you choose the tidyverse method, try group_by() and then count().
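The `match()` and `which()` hint can be sketched on toy data as follows. The vectors `tokens` and `toy_positive` are hypothetical; the point is that `match()` returns dictionary positions, so the matches must be counted rather than summed.

```r
# Hypothetical tokens and a tiny stand-in for the positive word list
tokens <- c("i", "am", "happy", "to", "join", "great", "happy")
toy_positive <- c("great", "happy", "joyous")

# match() returns each token's position in the dictionary, or NA if absent
positions <- match(tokens, toy_positive)

# which() gives the indices of tokens that were found; its length is the count
matched_idx <- which(!is.na(positions))
positive_count <- length(matched_idx)

# Scale by the total number of tokens
positive_ratio <- positive_count / length(tokens)

# Pitfall: summing the positions adds up dictionary indices, not matches
index_sum <- sum(positions, na.rm = TRUE)

print(c(count = positive_count, ratio = positive_ratio, index_sum = index_sum))
```

A scaled value should always fall between 0 and 1; a ratio above 1 is a sign that positions were summed instead of counted.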

# Write your code below.
# Counting the positive words in the speech
# match() returns each token's position in the dictionary (0 if absent),
# so count the non-zero positions rather than summing them
positive_matches <- match(mlk_tokens$word, positive_words, nomatch = 0)
positive_word_count <- length(which(positive_matches > 0))

# Total number of words in the speech
total_word_count <- nrow(mlk_tokens)

# Scaling the count of positive words based on the total number of words
positive_word_ratio <- positive_word_count / total_word_count

# View
positive_word_ratio

## [1] 0.05782313

Step 4: Negative words

Determine how many negative words were in the speech. Scale the number based on the total number of words in the speech.
Hint: This is basically the same as Step 3.

# Write your code below.
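# Counting the negative words in the speech
# match() returns each token's position in the dictionary (0 if absent),
# so count the non-zero positions rather than summing them
negative_matches <- match(mlk_tokens$word, negative_words, nomatch = 0)
negative_word_count <- length(which(negative_matches > 0))

# Scaling the count of negative words based on the total number of words
negative_word_ratio <- negative_word_count / total_word_count

# View
negative_word_ratio

## [1] 0.03174603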

Step 5: Get Quartile values

Redo the “positive” and “negative” calculations for each 25% of the speech by following the steps below.

5.1 Compare the results in a graph

Compare the results (e.g., a simple bar chart of the 4 numbers).
For each quarter of the text, you calculate the positive and negative ratio, as was done in Step 3 and Step 4.
The only extra work is to split the text to four equal parts, then visualize the positive and negative ratios by plotting.
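One way to split a token vector into four nearly equal parts is sketched below. The `tokens` vector and the one-word `toy_positive` list are hypothetical; with 10 tokens the quarters cannot be exactly equal, so the sketch also shows how uneven sizes are handled.

```r
# Hypothetical token vector; 10 tokens, so the quarters cannot be perfectly equal
tokens <- c("let", "freedom", "ring", "from", "every",
            "hill", "and", "molehill", "in", "mississippi")

# Assign each position to quarter 1..4 based on its relative position
quarter_id <- ceiling(seq_along(tokens) * 4 / length(tokens))

# Split into a list of four token vectors
quarters <- split(tokens, quarter_id)

# Share of dictionary words in each quarter (toy one-word dictionary)
toy_positive <- c("freedom")
ratios <- sapply(quarters, function(q) mean(q %in% toy_positive))
print(ratios)
```

Because every token is assigned to exactly one quarter, the four quarter counts add up to the whole-speech count, which makes a useful sanity check.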

The final graphs should look like the two figures below:
[Figure: Step 5.1 - Negative] [Figure: Step 5.1 - Positive]

HINT: The code below shows how to start the first 25% of the speech. Finish the analysis and use the same approach for the rest of the speech.

# Step 5: Redo the positive and negative calculations for each 25% of the speech
  # define a cutpoint to split the document into 4 parts; round the number to get an integer
  cutpoint <- round(length(words.corpus)/4)
 
# first 25%
  # create word corpus for the first quarter using cutpoints
  words.corpus1 <- words.corpus[1:cutpoint]
  # create term document matrix for the first quarter
  tdm1 <- TermDocumentMatrix(words.corpus1)
  # convert tdm1 into a matrix called "m1"
  m1 <- as.matrix(tdm1)
  # create a list of word counts for the first quarter and sort the list
  wordCounts1 <- rowSums(m1)
  wordCounts1 <- sort(wordCounts1, decreasing=TRUE)
  # calculate total words of the first 25%

# Write your code below.
# Defining a function to calculate positive and negative ratios
calculate_ratios <- function(words_df) {
  # Counting total words
  total_words <- nrow(words_df)
  # Finding positive words
  positive_word_count <- sum(words_df$word %in% positive_words)
  # Finding negative words
  negative_word_count <- sum(words_df$word %in% negative_words)
  # Calculating positive and negative ratios
  positive_ratio <- positive_word_count / total_words
  negative_ratio <- negative_word_count / total_words
  # Return ratios
  return(c(positive_ratio, negative_ratio))
}

# Splitting the text into four equal parts
quarter_size <- nrow(mlk_tokens) / 4

# Initialize vectors to store results
positive_ratios <- numeric(4)
negative_ratios <- numeric(4)

# Iterate over each quarter of the speech
for (i in 1:4) {
  # Defining start and end indices for the current quarter
  start_index <- as.integer(((i - 1) * quarter_size) + 1)
  end_index <- as.integer(min(i * quarter_size, nrow(mlk_tokens)))
  # Subset the data for the current quarter
  quarter_words <- mlk_tokens[start_index:end_index, ]
  # Calculate positive and negative ratios for the current quarter
  ratios <- calculate_ratios(quarter_words)
  # Storing results
  positive_ratios[i] <- ratios[1]
  negative_ratios[i] <- ratios[2]
}

# Viewing the positive and negative ratios for each quarter
positive_ratios

## [1] 0.05454545 0.03619910 0.03181818 0.10859729

negative_ratios

## [1] 0.054545455 0.031674208 0.036363636 0.004524887

library(ggplot2)

# Creating a data frame for plotting positive ratios
positive_plot_data <- data.frame(
  Quarter = 1:4,
  Ratio = positive_ratios
)

# Plotting for positive ratios
ggplot(positive_plot_data, aes(x = Quarter, y = Ratio)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Positive Ratios by Quarter",
       x = "Quarter",
       y = "Ratio") +
  theme_minimal()

# Creating a data frame for plotting negative ratios
negative_plot_data <- data.frame(
  Quarter = 1:4,
  Ratio = negative_ratios
)

# Plotting for negative ratios
ggplot(negative_plot_data, aes(x = Quarter, y = Ratio)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Negative Ratios by Quarter",
       x = "Quarter",
       y = "Ratio") +
  theme_minimal()

5.2 Analysis

What do you see from the positive/negative ratio in the graph? State what you learned from the MLK speech using the sentiment analysis results:

[I struggled to get my code to match yours, so I can only discuss the results I got. Based on my data, the speech opens with positive and negative words evenly balanced (about 5.5% each in the first quarter). Both ratios dip in the middle: the second quarter is slightly more positive than negative, while the third quarter tips slightly negative. In the final quarter the positive ratio jumps to about 10.9% and the negative ratio falls below 1%, so the speech wraps up with heavily positive word choices, which I think is true to the feel of the speech.]

Week 7: Lab - Text Mining (Sentiment Analysis)

[Austyn Bushman]

[02/25/2024]