Process (pseudocode)

  1. Read text files from regional newspapers into R
  2. Tokenized words of each text file to perform sentiment analysis using NRC, Bing, and AFINN lexicons
  3. Processed regional text file data into singular cells to create data frame of text grouped by regions
  4. Applied TF-IDF to data frame to determine the most impactful words in a given region

Sentiment Analysis

# Read in Text for Articles from Southern Region (Julia)
south <- read_lines("southern_region")
south <- tibble(south)
south$south <- as.character(south$south)

south <- south %>%
  unnest_tokens(word, south)%>%
  anti_join(stop_words)%>% 
  count(word, sort=TRUE)

# Read in Text for Articles from Western Region (Jess)
west <- read_lines("west_coast")
west <- tibble(west)
west$west <- as.character(west$west)

west <- west %>%
  unnest_tokens(word, west)%>%
  anti_join(stop_words)%>% 
  count(word, sort=TRUE)

setwd('C:/Users/Student/OneDrive - University of Virginia/Documents/R/Text-Mining-Lab/MidWestArticles')
# Read in Text for Articles from Midwestern Region (Kara)
mypath = 'C:/Users/Student/OneDrive - University of Virginia/Documents/R/Text-Mining-Lab/MidWestArticles'

#creating list of articles
txt_files_ls = list.files(path=mypath, pattern="*.txt") 
#reading articles 
txt_files_df <- lapply(txt_files_ls, read_lines)
# combining them into one df
combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))

#defining a data preparation function
data_prep <- function(x,y,z){
  i <- as_tibble(t(x))
  #remove the white sapce, y:z is the range
  ii <- unite(i,"text",y:z,remove = TRUE,sep = "")
}

#applying data prep function to create one cell of words
midwest <- data_prep(combined_df,'V1','V193')
midwest <- tibble(midwest)
midwest2 <- midwest

midwest <- midwest %>%
  unnest_tokens(word,text)%>%
  anti_join(stop_words)%>% 
  count(word, sort=TRUE)

Bing Sentiment

Southern Region:

## 
## negative positive 
##      179      105

Western Region:

## 
## negative positive 
##      161       81

Midwestern Region:

## 
## negative positive 
##      158       79

From the Bing lexicon we can see that all regions skew towards negative sentiment. Both the Midwest and Western regions have roughly 33% positive sentiment while the Southern region has a slightly higher percentage at 36%.

NRC analysis

Southern Region:

## 
##        anger anticipation      disgust         fear          joy     negative 
##           74           94           43           89           62          170 
##     positive      sadness     surprise        trust 
##          231           63           48          137

Western Region:

## 
##        anger anticipation      disgust         fear          joy     negative 
##           59           69           37           82           44          137 
##     positive      sadness     surprise        trust 
##          187           58           33          116

Midwestern Region:

## 
##        anger anticipation      disgust         fear          joy     negative 
##           68           84           29           88           53          141 
##     positive      sadness     surprise        trust 
##          187           70           41          108

Similar to the Bing lexicon, the NRC lexicon has consistent results across each region. Interestingly, NRC measured positive sentiment at roughly 42% for all regions, which is 10% greater than that measured by Bing.

Using the Bing and NRC lexicon currently indicates that sentiment across the US for climate change is relatively homogenous and skews negatively.

AFFIN analysis

# Plot of Southern Region's Sentiment Range
ggplot(data = south_sentiment_affin, 
       aes(x=value)
        )+
  geom_histogram(bins=20)+
  ggtitle("Southern Region Sentiment Range")+
  theme_minimal()

# Plot of Western Region's Sentiment Range
ggplot(data = west_sentiment_affin, 
       aes(x=value)
        )+
  geom_histogram(bins=20)+
  ggtitle("Western Region Sentiment Range")+
  theme_minimal()

# Plot of Midwestern Region's Sentiment Range
ggplot(data = midwest_sentiment_affin, 
       aes(x=value)
        )+
  geom_histogram(bins=20)+
  ggtitle("Midwestern Region Sentiment Range")+
  theme_minimal()

The AFFIN analysis emphasizes the findings from the previous lexicons: the plots have very similar distributions indicating that climate change is associated with negative sentiment. The AFFIN analysis does differ from the previous findings in that it indicates the Western is more positive compared to the other two (a larger 1 column). This could indicate that less effort needs to be put into climate change efforts in the Western region.

Word Clouds for Different Regions

# Removing words that will not aid in sentiment understanding
custom_sw <- tibble(word =c("climate","change","global","warming"))

south <- south%>%
      anti_join(custom_sw, by="word")

west <- west%>%
      anti_join(custom_sw, by="word")

midwest <- midwest%>%
      anti_join(custom_sw, by="word")

# Word Cloud for Southern Region
set.seed(42)
ggplot(south[1:50,], aes(label = word, size = n)
       ) +
  geom_text_wordcloud() +
  theme_minimal()

# Word Cloud for Western Region
set.seed(42)
ggplot(west[1:50,], aes(label = word, size = n)
       ) +
  geom_text_wordcloud() +
  theme_minimal()

# Word Cloud for Midwestern Region
set.seed(42)
ggplot(midwest[1:50,], aes(label = word, size = n)
       ) +
  geom_text_wordcloud() +
  theme_minimal()

Unlike the previous lexicon analysis, using word clouds we can start to see distinct differences between the regions. The southern word cloud is much more centered on politics with the words “republicans” and “conservatives” standing out. On the other hand, the Western word cloud is much more focused on natural phenomenon with words like “heat”, “weather”, and “temperatures”. Similar to the Western word cloud, the Midwestern word cloud is focused on nature with words like “forest”, “tress”, and “birds” being prominent. However, the Midwestern word cloud differentiates itself from the other two word clouds with the prominence of “scientists” and “researchers” which could indicate a more scientific perspective of climate change.

Term Frequency - Inverse Document Frequency for Regions

# Term Frequency - Inverse Document Frequency (tf-idf)
south <- read_lines("southern_region")
south <- tibble(south)

west <- read_lines("west_coast")
west <- tibble(west)



data_prep <- function(x,y,z){
  i <- as_tibble(t(x))
  ii <- unite(i,"text",y:z,remove = TRUE,sep = "")
}

south_bag <- data_prep(south[1:371,1],'V1','V371')

west_bag <- data_prep(west,'V1','V141')

region <- c("South","West","Midwest")


tf_idf_text <- tibble(region,text=t(tibble(south_bag,west_bag,midwest2,.name_repair = "universal")))

View(tf_idf_text)

word_count <- tf_idf_text %>%
  unnest_tokens(word, text) %>%
  count(region, word, sort = TRUE)


total_words <- word_count %>% 
  group_by(region) %>% 
  summarize(total = sum(n))

region_words <- left_join(word_count, total_words)

View(region_words)

region_words <- region_words %>%
  bind_tf_idf(word, region, n)

top10 <- region_words%>%
  arrange(desc(tf_idf))%>%
  group_by(region)%>%
  slice(1:10)

fig <- ggplot(top10, aes(tf_idf, word, fill = tf_idf)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~region, scales = "free_y") +
  labs(x = "Regions and Most Impactful Words",
       y = NULL)
fig

Looking more closely at the top 10 most impactful words per region we can get an initial idea of the type of campaign we will want to run to target people in these regions.

For the Midwestern region many of the words (“birds”, “forest”, “hill”, ect.) have to do with things in the natural world. With this information our campaign should emphasize the impacts of climate change on the environment around us.

On the other hand, the Southern region is much more occupied with political themes like “conservatives”, “backer”, and “money”. The Southern campaign should focus on highlighting political iniatives.

The Western region is focused on the immediate natural environment shown through the words “Joshua” (likely from Joshua Tree) and “Mojave”. Therefore, the Western campaign should focus specifically on the natural world in California and emphasize climate change impacts on beloved National Parks.

Recommendation

While sentiment analysis using the Bing, NRC, and AFINN lexicons indicated a relatively homogeneous distribution of sentiment across U.S. regions, word specific analysis using word clouds and TF-IDF isolated regional themes which could be used to inform targeted campaigns.

For the Southern region, both the word cloud and TF-IDF indicated a more political perspective on climate change. Our organization could therefore try to change Southern perspectives and run campaigns to make climate change a non-bipartisan issue.

For the Western region, the topic of climate change centered around the natural world. From the word cloud we saw an emphasis on “heat”, “weather”, and “temperatures” likely a reflection of the wildfires that have plagued the west coast. From the TF-IDF we also saw a focus on California specific natural areas like the Mojave, Joshua Tree, and the Valley. Because of the West’s focus on issues close to home, our organization could focus a campaign centered around the effects of climate change on beloved national parks.

The Midwestern region had a similar focus on the natural world like the West with prevalent words being “trees”, “birds”, and “forest”. Unlike the other regions the Midwest word cloud had a focus on “scientists” and “research”. This indicates that the campaign for the Midwestern region could benefit from the testimonies of scientists and researchers. The campaign should emphasize the science behind climate change.