Instructions

Conduct a sentiment analysis of MLK’s speech to determine how positive or negative it is. Split the speech into four quarters to see how the sentiment changes over time. Create two bar charts to display your results.


# Add your library below.
library(tidyverse)   # For data manipulation and plotting
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
library(tidytext)    # For easier text mining and sentiment analysis
## Warning: package 'tidytext' was built under R version 4.4.3
library(stringr)     # For working with text and string operations

Step 1 - Read in the Bing Dictionary

Sentiment analysis relies on a “dictionary”. Most dictionaries categorize words as either positive or negative, but some dictionaries use emotion (such as the NRC EmoLex Dictionary). Each dictionary is different. This assignment will introduce you to the Bing dictionary, which researchers created by categorizing words used in online reviews from Amazon, Yelp, and other similar platforms.
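
As a minimal sketch of how a dictionary-based lookup works, the snippet below labels the tokens of a short sentence against a tiny lexicon. The words in `toy_lexicon` and `tokens` are made up for illustration and are not taken from the actual Bing files.

```r
# Toy lexicon (hypothetical entries, NOT the real Bing lists)
toy_lexicon <- data.frame(
  word      = c("happy", "great", "hope", "sadly", "crippled"),
  sentiment = c("positive", "positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# A short, already tokenized sentence (hypothetical)
tokens <- c("i", "am", "happy", "but", "sadly", "not", "free")

# Look up each token; tokens that are not in the lexicon get NA
token_sentiment <- toy_lexicon$sentiment[match(tokens, toy_lexicon$word)]

# Tally positive and negative hits (NA values are ignored by table())
sentiment_tally <- table(factor(token_sentiment, levels = c("positive", "negative")))
print(sentiment_tally)
```

Words that the dictionary does not list simply contribute nothing to either tally, which is why the choice of dictionary matters.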

Step 1.1 - Find the files

The files needed for this lab are stored in a RAR file. You must extract the files from the compressed RAR file using a third-party application, such as 7-Zip, WinZip, or another program. Use Google to find a RAR file extractor.

Find the RAR file on the UIC website (it contains two text files: positive words and negative words). The file is about halfway down the page, listed as “A list of English positive and negative opinion words or sentiment words”. Use the link below:

Save these files in your “data” folder.

# No code necessary; Save the files in your project's data folder.
#install.packages("textdata")

library(textdata)
## Warning: package 'textdata' was built under R version 4.4.3
#lexicon_bing()
lexicon_bing(dir="data/")
## # A tibble: 6,789 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 abnormal    negative 
##  4 abolish     negative 
##  5 abominable  negative 
##  6 abominably  negative 
##  7 abominate   negative 
##  8 abomination negative 
##  9 abort       negative 
## 10 aborted     negative 
## # ℹ 6,779 more rows

Step 1.2 - Create vectors

Create two vectors of words, one for the positive words and one for the negative words.

# Write your code below.
# Replace the path with the correct one for your system/project folder
positive_words <- scan("C:/Users/Alejandro/OneDrive/Documents/week7_Lab/data/bing/positive-words (2).txt", what = "character", comment.char = ";")
negative_words <- scan("C:/Users/Alejandro/OneDrive/Documents/week7_Lab/data/bing/negative-words (2).txt", what = "character", comment.char = ";")

Step 1.3 - Clean the files

Note that when reading in the word files, there might be lines at the start and/or the end that will need to be removed (i.e. you should clean your dataset).

# Write your code below.

# Remove NA values first
positive_words <- positive_words[!is.na(positive_words)]
negative_words <- negative_words[!is.na(negative_words)]

# Convert to ASCII, replacing non-ASCII characters with ""
positive_words <- iconv(positive_words, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")
negative_words <- iconv(negative_words, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")

# Then filter to alphabetic lowercase words only
positive_words <- positive_words[grepl("^[a-z]+$", positive_words)]
negative_words <- negative_words[grepl("^[a-z]+$", negative_words)]
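
To see what the cleaning step does, here is a small sketch on inline sample lines. The contents of `raw_lines` are hypothetical and only mimic a word file with a semicolon-commented header; they are not the real file contents.

```r
# Hypothetical file contents: header comments, a blank line, then words
raw_lines <- c(
  "; Example header line",
  ";",
  "",
  "a+",
  "abound",
  "well-being",
  "zest"
)

# Drop comment lines (starting with ";") and empty lines
words <- raw_lines[!grepl("^;", raw_lines) & nzchar(raw_lines)]

# Keep only purely lowercase alphabetic entries, as in the cleaning step above
clean_words <- words[grepl("^[a-z]+$", words)]

# Entries removed by the strict pattern (they contain symbols or hyphens)
dropped_words <- setdiff(words, clean_words)

print(clean_words)
print(dropped_words)
```

Note the trade-off: the strict `^[a-z]+$` pattern also discards legitimate dictionary entries that contain digits or hyphens, such as the `2-faced` entry visible in the `lexicon_bing()` output above.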

Step 2: Process the MLK speech

Text is stored in many different formats, such as TXT, CSV, HTML, and JSON. In this lab, you are going to experience how to “parse HTML” for text analysis.

Step 2.1 - Find and read in the file.

Find MLK’s speech on the AnalyticTech website. You can either read in the file using the XML package, or you can copy/paste the document into a TXT file.

Use the link below:

# Write your code below.
library(XML)
## Warning: package 'XML' was built under R version 4.4.3
# Read and parse the HTML document from the URL
url <- "http://www.analytictech.com/mb021/mlk.htm"
html_doc <- htmlParse(url)

# Extract the main speech text; looking for the text under <pre> tag (common for speeches)
speech_text <- xpathSApply(html_doc, "//pre", xmlValue)

# Collapse text if multiple lines to a single character string
speech_text <- paste(speech_text, collapse = " ")

# View the first 500 characters to confirm
cat(substr(speech_text, 1, 500))

Step 2.2 - Parse the files

If you choose to read the raw HTML using the XML package, you will need to parse the HTML object. For this exercise, we can split the HTML by the paragraph tag and then store the paragraphs inside a vector. The following code might help:

# Read and parse HTML file

doc.html = htmlTreeParse('http://www.analytictech.com/mb021/mlk.htm', 
                         useInternal = TRUE)

# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.

doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))

# Replace all \n by spaces
doc.text = gsub('\\n', ' ', doc.text)

# Replace all \r by spaces
doc.text = gsub('\\r', ' ', doc.text)
# Write your code below, if necessary.
library(XML)

# Read and parse the HTML file from the URL
doc.html <- htmlTreeParse('http://www.analytictech.com/mb021/mlk.htm', useInternal = TRUE)

# Extract all paragraphs <p> as a character vector
doc.text <- unlist(xpathApply(doc.html, '//p', xmlValue))

# Clean newlines and carriage returns by replacing them with spaces
doc.text <- gsub('\\n', ' ', doc.text)
doc.text <- gsub('\\r', ' ', doc.text)

# View first few paragraphs to verify
head(doc.text, 3)
## [1] "I am happy to join with you today in what will go down in  history as the greatest demonstration for freedom in the history  of our nation. "                                                                                                                                                                                                        
## [2] "Five score years ago a great American in whose symbolic shadow  we stand today signed the Emancipation Proclamation. This  momentous decree came as a great beckoning light of hope to  millions of Negro slaves who had been seared in the flames of  withering injustice. It came as a joyous daybreak to end the long  night of their captivity. "
## [3] "But one hundred years later the Negro is still not free. One  hundred years later the life of the Negro is still sadly crippled  by the manacles of segregation and the chains of discrimination. "
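
The same parsing steps can be tried offline on a small HTML string. This is only a sketch: `html_text` is a made-up snippet, not the actual speech page, and it assumes the XML package is installed.

```r
library(XML)

# Hypothetical HTML snippet (not the real page)
html_text <- "<html><body><p>First\nparagraph.</p><p>Second paragraph.</p></body></html>"

# asText = TRUE tells the parser the input is HTML content, not a file or URL
toy_doc <- htmlParse(html_text, asText = TRUE)

# Pull the text of every <p> node into a character vector
toy_paragraphs <- xpathSApply(toy_doc, "//p", xmlValue)

# Replace newlines with spaces, as in the lab code
toy_paragraphs <- gsub("\\n", " ", toy_paragraphs)

print(toy_paragraphs)
```

Testing the XPath expression on a small string like this makes it easier to check that the chosen tag (`//p` here) actually matches the content you expect before running it on the full page.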

Step 2.3 - Transform the text

Text must be processed before it can be analyzed. There are many ways to process text. This class has introduced you to two ways:

  • Using the TM package to manipulate term-document matrices
  • Using the tidytext package to unnest tokens

Either create a term-document matrix or unnest the tokens.

# Write your code below.
library(tidytext)
library(dplyr)
library(tibble)

# Put the paragraph text into a tibble with paragraph ID
speech_df <- tibble(paragraph = seq_along(doc.text), text = doc.text)

# Unnest tokens (split text into individual words)
speech_words <- speech_df %>%
  unnest_tokens(word, text)

# View the first few words as a check
head(speech_words)
## # A tibble: 6 × 2
##   paragraph word 
##       <int> <chr>
## 1         1 i    
## 2         1 am   
## 3         1 happy
## 4         1 to   
## 5         1 join 
## 6         1 with
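
To make the effect of `unnest_tokens()` easier to see, the sketch below runs it on two made-up sentences (`toy_df` is hypothetical, not the speech). With its default settings, the function lowercases the text and strips punctuation while splitting it into one word per row.

```r
library(tibble)
library(tidytext)

# Two hypothetical "paragraphs"
toy_df <- tibble(
  paragraph = 1:2,
  text      = c("I am happy.", "Free at last!")
)

# One row per word; the paragraph ID is carried along for each token
toy_words <- unnest_tokens(toy_df, word, text)

print(toy_words)
```

Because the paragraph ID stays attached to every token, the tokens remain in reading order, which matters later when the speech is split into quarters.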

Step 2.4 - Create a list of word frequencies

Create a list of counts for each word.

# Write your code below.
library(dplyr)

# Count the frequency of each word
word_counts <- speech_words %>%
  count(word, sort = TRUE)

# View the top words
head(word_counts, 10)
## # A tibble: 10 × 2
##    word        n
##    <chr>   <int>
##  1 the        54
##  2 of         49
##  3 to         29
##  4 and        27
##  5 a          20
##  6 in         17
##  7 be         16
##  8 will       16
##  9 we         14
## 10 freedom    13
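
For comparison, a base-R alternative to `count()` is `table()`. The sketch below uses a short hypothetical token vector rather than the speech itself.

```r
# Hypothetical tokens (not the real speech_words)
toy_tokens <- c("free", "at", "last", "free", "at", "last",
                "thank", "god", "we", "are", "free", "at", "last")

# table() counts each distinct word; sort() puts the most frequent first
toy_freq <- sort(table(toy_tokens), decreasing = TRUE)

print(toy_freq)
```

The counts always sum to the number of tokens, which is a quick sanity check after any counting step.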

Step 3: Positive words

Determine how many positive words were in the speech. Scale the number based on the total number of words in the speech. Hint: One way to do this is to use match() and then which(). If you choose the tidyverse method, try group_by() and then count().

# Write your code below.
library(dplyr)

# Filter speech words that appear in positive_words vector
positive_word_counts <- speech_words %>%
  filter(word %in% positive_words) %>%
  count() %>%
  pull(n)

# Calculate total word count in speech
total_word_count <- nrow(speech_words)

# Scale positive count by total words
positive_ratio <- positive_word_counts / total_word_count

positive_word_counts
## [1] 51
positive_ratio
## [1] 0.05782313
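
The hint in Step 3 mentions `match()` and `which()`. A minimal base-R sketch of that route is shown below; `toy_positive` and `toy_tokens` are hypothetical stand-ins for the dictionary vector and the tokenized speech.

```r
# Hypothetical mini-dictionary and tokenized text (not the real data)
toy_positive <- c("happy", "great", "hope", "joyous")
toy_tokens   <- c("i", "am", "happy", "to", "join", "a", "great", "great", "cause")

# match() returns the dictionary position of each token, or NA when absent
positions <- match(toy_tokens, toy_positive)

# which() gives the indices of the tokens that were found in the dictionary
hit_index <- which(!is.na(positions))

# Count the hits and scale by the total number of tokens
positive_hits  <- length(hit_index)
positive_share <- positive_hits / length(toy_tokens)

print(positive_hits)
print(positive_share)
```

Repeated words count once per occurrence here, which matches the behavior of filtering with `%in%` in the tidyverse version.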

Step 4: Negative words

Determine how many negative words were in the speech. Scale the number based on the total number of words in the speech.
Hint: This is basically the same as Step 3.

# Write your code below.
library(dplyr)

# Filter speech words that appear in negative_words vector
negative_word_counts <- speech_words %>%
  filter(word %in% negative_words) %>%
  count() %>%
  pull(n)

# Calculate total word count in speech
total_word_count <- nrow(speech_words)

# Scale negative count by total words
negative_ratio <- negative_word_counts / total_word_count

negative_word_counts
## [1] 28
negative_ratio
## [1] 0.03174603

Step 5: Get Quartile values

Redo the “positive” and “negative” calculations for each 25% of the speech by following the steps below.

5.1 Compare the results in a graph

Compare the results (e.g., a simple bar chart of the 4 numbers).
For each quarter of the text, calculate the positive and negative ratios, as was done in Step 3 and Step 4.
The only extra work is to split the text into four equal parts and then visualize the positive and negative ratios by plotting them.

The final graphs should look like the ones below:
[Figure: Step 5.1 - Negative] [Figure: Step 5.1 - Positive]

HINT: The code below shows how to start the first 25% of the speech. Finish the analysis and use the same approach for the rest of the speech.

# Step 5: Redo the positive and negative calculations for each 25% of the speech
  # define a cutpoint to split the document into 4 parts; round the number to get an integer
  cutpoint <- round(length(words.corpus)/4)
 
# first 25%
  # create word corpus for the first quarter using cutpoints
  words.corpus1 <- words.corpus[1:cutpoint]
  # create term document matrix for the first quarter
  tdm1 <- TermDocumentMatrix(words.corpus1)
  # convert tdm1 into a matrix called "m1"
  m1 <- as.matrix(tdm1)
  # create a list of word counts for the first quarter and sort the list
  wordCounts1 <- rowSums(m1)
  wordCounts1 <- sort(wordCounts1, decreasing=TRUE)
  # calculate total words of the first 25%
# Write your code below.
library(dplyr)
library(ggplot2)

# Total number of words
total_words <- nrow(speech_words)

# Calculate cutpoints to split into four quartiles
cutpoints <- round(seq(0, total_words, length.out = 5))

# Prepare vectors to hold positive and negative ratios for each quartile
positive_ratios <- numeric(4)
negative_ratios <- numeric(4)

# Loop through each quartile
for (i in 1:4) {
  
  # Subset words for the quartile
  words_quartile <- speech_words[(cutpoints[i] + 1):cutpoints[i + 1], ]
  
  # Count positive words in quartile
  pos_count <- sum(words_quartile$word %in% positive_words)
  
  # Count negative words in quartile
  neg_count <- sum(words_quartile$word %in% negative_words)
  
  # Calculate total words in quartile
  quartile_total <- nrow(words_quartile)
  
  # Calculate ratios
  positive_ratios[i] <- pos_count / quartile_total
  negative_ratios[i] <- neg_count / quartile_total
}

# Create a data frame for plotting
sentiment_df <- data.frame(
  Quartile = factor(1:4),
  Positive = positive_ratios,
  Negative = negative_ratios
)

# Reshape for ggplot (long format)
sentiment_long <- tidyr::pivot_longer(sentiment_df, cols = c("Positive", "Negative"),
                                      names_to = "Sentiment", values_to = "Ratio")

# Plot positive sentiment by quartile
ggplot(filter(sentiment_long, Sentiment == "Positive"), aes(x = Quartile, y = Ratio)) +
  geom_col(fill = "skyblue") +
  labs(title = "Positive Sentiment Ratio by Quartile", x = "Quartile", y = "Positive Ratio") +
  theme_minimal()

# Plot negative sentiment by quartile
ggplot(filter(sentiment_long, Sentiment == "Negative"), aes(x = Quartile, y = Ratio)) +
  geom_col(fill = "salmon") +
  labs(title = "Negative Sentiment Ratio by Quartile", x = "Quartile", y = "Negative Ratio") +
  theme_minimal()
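
As an alternative to the loop, the quarter of each token can be computed directly from its position and the ratios obtained with `tapply()`. This is a sketch on hypothetical data (`toy_tokens` and `toy_positive` are made up), not a replacement for the analysis above.

```r
# Hypothetical tokens in reading order (12 tokens, so 3 per quarter)
toy_tokens <- c("sad", "dark", "pain", "fear", "hope", "dark",
                "rise", "hope", "joy", "free", "hope", "joy")
toy_positive <- c("hope", "rise", "joy", "free")

# Map each token position to a quarter label from 1 to 4
n_tokens <- length(toy_tokens)
quarter  <- ceiling(seq_along(toy_tokens) * 4 / n_tokens)

# Share of positive tokens within each quarter
# (the mean of a logical vector is the proportion of TRUE values)
positive_share <- tapply(toy_tokens %in% toy_positive, quarter, mean)

print(positive_share)
```

The same call with a negative word vector gives the negative ratios, and the resulting named vector can be passed to a data frame for plotting.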

5.2 Analysis

What do you see from the positive/negative ratio in the graph? State what you learned from the MLK speech using the sentiment analysis results:

[ The positive ratio rises in the later quartiles, while the negative ratio is highest at the beginning. This suggests that the speech expresses its negativity first and grows more optimistic or hopeful over time; the early negative spike may reflect the challenges addressed at the start of the speech. I learned how to extract RAR files and read in the Bing dictionary. This was my first introduction to parsing HTML, transforming the text, and building a word-frequency list to compare against vectors of positive and negative words. I also learned that a simple bar chart of the quartiles helps you visualize the results from a statistical perspective and interpret the emotion behind the numbers, as in this sentiment analysis of the speech. ]