There’s many discussions today about how the media influences the
population and the amount of power media outlets have over peoples’
anxiety, depression, and general mental health. The purpose of this
research is to investigate a very small portion of what a real study may
look like. The sentiment analyisis of news article titles may help
propose theories for the recent shift in mental health crises. Do
articles typically have a negative sentiment? Positive sentiment? Are
there certain publishers with particularly negative scores or positive
scores? How do they perform vs their peers? How has the average news
article sentiment changed over time and what could that look like in
juxtaposition to the mental health crises? This analysis scrapes the
surface of what the the relationship could look like.
Load Libraries
# Load libraries
library(dplyr) # Data manipulation
library(anytime) # Date/time conversion
library(tidytext) # Text mining
library(ggplot2) # Data visualization
library(readr)
Read CSV from API Call data frame
# URL of the CSV file
file_url <- 'https://github.com/wberritt913/CUNY_DATA607/raw/main/combined_df_final_project.csv'
# Read CSV into a data frame
df <- read_csv(file_url)
Change date from char to date and deduplicate dataset
# Data preprocessing
colnames(df)
## [1] "source" "author" "title" "description" "publishedAt"
df$publishedAt <- as.Date(anytime(df$publishedAt))
unique(df$publishedAt)
## [1] "2024-05-09" "2024-04-26" "2024-05-04" "2024-05-02" "1970-01-01"
## [6] "2024-04-14" "2024-04-18" "2024-04-20" "2024-04-29" "2024-04-11"
## [11] "2024-05-05" "2024-04-17" "2024-05-07" "2024-04-24" "2024-04-22"
## [16] "2024-04-23" "2024-04-10" "2024-05-03" "2024-05-01" "2024-04-25"
## [21] "2024-04-16" "2024-04-21" "2024-04-12" "2024-05-08" "2024-05-06"
## [26] "2024-04-28"
df <- df[!duplicated(df$description), ]
dim(df)
## [1] 262 5
Unnest the titles of each article, remove stop words, get sentiment,
then map scores to values
# Sentiment analysis on titles
title_unnested <- df |> unnest_tokens(word, title)
title_clean <- title_unnested |> anti_join(stop_words)
sentiment_title <- title_clean |> inner_join(get_sentiments('bing'))
head(sentiment_title, 20)
## # A tibble: 20 × 6
## source author description publishedAt word sentiment
## <chr> <chr> <chr> <date> <chr> <chr>
## 1 Yahoo Entertainment Karen Friar "Stocks ha… 2024-05-09 win positive
## 2 CNBC Dan Mangan "Trump Med… 2024-05-09 trump positive
## 3 New York Post Lydia Moynihan, … "Many on W… 2024-05-09 death negative
## 4 New York Post Lydia Moynihan, … "Many on W… 2024-05-09 anger negative
## 5 New York Post Lydia Moynihan, … "Many on W… 2024-05-09 prom… positive
## 6 Yahoo Entertainment Ryan Vlastelica "(Bloomber… 2024-05-09 batt… negative
## 7 HuffPost Ron Dicker "The \"Mor… 2024-05-09 hard negative
## 8 HuffPost Ron Dicker "The \"Mor… 2024-05-09 stif… negative
## 9 HuffPost Ron Dicker "The \"Mor… 2024-05-09 trump positive
## 10 New York Post Josh Christenson "Sen. John… 2024-05-09 squi… negative
## 11 New York Post Josh Christenson "Sen. John… 2024-05-09 hara… negative
## 12 New York Post Josh Christenson "Sen. John… 2024-05-09 cree… negative
## 13 MarketWatch MarketWatch <NA> 2024-05-09 swipe negative
## 14 WCVB Boston Jamy Pombo Sesse… "The presi… 2024-05-09 dela… negative
## 15 Foxweather.com Scott Sistek, An… "Tropical … 2024-05-09 stub… negative
## 16 TheStreet Martin Baccardax "The Fed i… 2024-05-09 gains positive
## 17 Yahoo Entertainment Reuters "The Easte… 2024-05-09 gloo… negative
## 18 Yahoo Entertainment Reuters "The Easte… 2024-05-09 fans positive
## 19 Yahoo Entertainment Reuters "The Easte… 2024-05-09 fears negative
## 20 Yahoo Entertainment Reuters "Bets that… 2024-05-09 fall negative
title_sentiment_w_score <- sentiment_title %>%
mutate(sentiment_score = if_else(sentiment == 'positive', 1, -1))
head(title_sentiment_w_score, 20)
## # A tibble: 20 × 7
## source author description publishedAt word sentiment sentiment_score
## <chr> <chr> <chr> <date> <chr> <chr> <dbl>
## 1 Yahoo Enterta… Karen… "Stocks ha… 2024-05-09 win positive 1
## 2 CNBC Dan M… "Trump Med… 2024-05-09 trump positive 1
## 3 New York Post Lydia… "Many on W… 2024-05-09 death negative -1
## 4 New York Post Lydia… "Many on W… 2024-05-09 anger negative -1
## 5 New York Post Lydia… "Many on W… 2024-05-09 prom… positive 1
## 6 Yahoo Enterta… Ryan … "(Bloomber… 2024-05-09 batt… negative -1
## 7 HuffPost Ron D… "The \"Mor… 2024-05-09 hard negative -1
## 8 HuffPost Ron D… "The \"Mor… 2024-05-09 stif… negative -1
## 9 HuffPost Ron D… "The \"Mor… 2024-05-09 trump positive 1
## 10 New York Post Josh … "Sen. John… 2024-05-09 squi… negative -1
## 11 New York Post Josh … "Sen. John… 2024-05-09 hara… negative -1
## 12 New York Post Josh … "Sen. John… 2024-05-09 cree… negative -1
## 13 MarketWatch Marke… <NA> 2024-05-09 swipe negative -1
## 14 WCVB Boston Jamy … "The presi… 2024-05-09 dela… negative -1
## 15 Foxweather.com Scott… "Tropical … 2024-05-09 stub… negative -1
## 16 TheStreet Marti… "The Fed i… 2024-05-09 gains positive 1
## 17 Yahoo Enterta… Reute… "The Easte… 2024-05-09 gloo… negative -1
## 18 Yahoo Enterta… Reute… "The Easte… 2024-05-09 fans positive 1
## 19 Yahoo Enterta… Reute… "The Easte… 2024-05-09 fears negative -1
## 20 Yahoo Enterta… Reute… "Bets that… 2024-05-09 fall negative -1
title_sentiment_w_score <- na.omit(title_sentiment_w_score)
Find the average sentiment of all titles
# Average sentiment score by title
mean_score_by_title <- title_sentiment_w_score |> group_by(description) |>
summarize(avg_sentiment_score = mean(sentiment_score), count = n())
head(mean_score_by_title, 20)
## # A tibble: 20 × 3
## description avg_sentiment_score count
## <chr> <dbl> <int>
## 1 "\"Aamrok 3.0, ported to Qt5/KDE Frameworks 5, has… -1 1
## 2 "\"There's no guarantee that changing the trajecto… -1 1
## 3 "'Batman: Caped Crusader,' from J.J. Abrams, Matt … 1 1
## 4 "'Let the Evil Go West' has set Sebastian Stan and… -1 1
## 5 "'The Bear' will return with Season 3 on June 27 a… 1 1
## 6 "(Bloomberg) -- Intel Corp bulls just cannot catch… -1 1
## 7 "(Bloomberg) -- Moderna Inc.’s pioneering Covid sh… 0 2
## 8 "A NASA contractor is urging the agency to suspend… -1 1
## 9 "A Ukrainian ATACMS long-range missile strike repo… -1 1
## 10 "A review of the excessively frozen smartphones of… -1 1
## 11 "A who woman couldn't understand why she wasn't lo… -1 1
## 12 "According to the Texas Department of Family and P… 1 1
## 13 "After losing a pregnancy at 36 weeks, I thought I… -1 1
## 14 "Amazon announces three interactive ad formats int… -1 1
## 15 "Amazon has cut paid perks for Alexa developers. W… 1 1
## 16 "Amidst tensions about free speech on university c… -1 1
## 17 "Another week is coming to a close, which means it… -1 1
## 18 "Apple has announced a refreshed Magic Keyboard ac… 1 1
## 19 "Arati Prabhakar has the ear of the US president a… 1 1
## 20 "At his Hollywood Bowl show, Matt Rife quipped abo… -0.5 4
Join original data frame back in to include other variables for
analysis
# Merge sentiment scores with main dataframe
new_df <- left_join(mean_score_by_title, df, by = 'description')
new_df$weekday <- weekdays(as.Date(new_df$publishedAt))
dim(new_df)
## [1] 137 8
mean(new_df$avg_sentiment_score)
## [1] -0.1435523
Find average sentiment grouped by the publisher of the article
# Average sentiment score by source
mean_score_by_source <- new_df |> group_by(source) |>
summarize(avg_sentiment_score = mean(avg_sentiment_score), count = n())
head(mean_score_by_source, 20)
## # A tibble: 20 × 3
## source avg_sentiment_score count
## <chr> <dbl> <int>
## 1 9to5google.com 1 1
## 2 ABC News 0 1
## 3 Aeon.co -1 1
## 4 Android Central -1 3
## 5 Arrowhead Pride 1 1
## 6 Ars Technica 0 2
## 7 Associated Press -0.5 2
## 8 BBC News 0 4
## 9 Boredpanda.com -1 1
## 10 Business Insider -0.417 12
## 11 CBS News -1 1
## 12 CNBC 1 1
## 13 CNET -0.333 3
## 14 CNN -1 1
## 15 Deadline -1 1
## 16 Design-milk.com 0.5 2
## 17 Eonline.com 1 1
## 18 Fort Worth Star-Telegram 1 1
## 19 Fox News 1 1
## 20 Foxweather.com -1 1
Only take those sources with 3 or more articles and calculate
weighted average
# Filter sources with sufficient data
mean_score_by_source <- subset(mean_score_by_source, count >= 3)
mean_score_by_source$weighted_amount <- mean_score_by_source$count/sum(mean_score_by_source$count)
sum(mean_score_by_source$weighted_amount*mean_score_by_source$avg_sentiment_score)
## [1] -0.09195402
Create bar chart to show average sentiment score by source
# Visualization: Average sentiment score by source
ggplot(mean_score_by_source, aes(x = source, y = avg_sentiment_score, fill = source)) +
geom_bar(stat = "identity") +
labs(title = "Average Sentiment Score by Source",
x = "Source",
y = "Average Sentiment Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Look at sentiment score over time
# Average sentiment score over time
mean_score_by_date <- new_df |> group_by(publishedAt) |>
summarize(avg_sentiment_score = mean(avg_sentiment_score), count = n())
head(mean_score_by_date, 20)
## # A tibble: 20 × 3
## publishedAt avg_sentiment_score count
## <date> <dbl> <int>
## 1 2024-04-10 -0.5 2
## 2 2024-04-11 0.333 3
## 3 2024-04-12 0 1
## 4 2024-04-16 -0.25 2
## 5 2024-04-17 1 2
## 6 2024-04-18 1 1
## 7 2024-04-20 1 2
## 8 2024-04-22 1 2
## 9 2024-04-23 0.2 5
## 10 2024-04-26 0.333 3
## 11 2024-04-28 -1 1
## 12 2024-04-29 1 1
## 13 2024-05-01 -0.333 3
## 14 2024-05-02 -1 1
## 15 2024-05-03 -0.294 42
## 16 2024-05-04 0 2
## 17 2024-05-05 1 1
## 18 2024-05-06 -1 1
## 19 2024-05-07 0.333 3
## 20 2024-05-08 -1 2
Plot the line chart showing the change in average sentiment over
time
# Visualization: Average sentiment score over time
ggplot(mean_score_by_date, aes(x = publishedAt, y = avg_sentiment_score)) +
geom_line(stat = "identity") +
labs(title = "Average Sentiment Score Over Time",
x = "Date",
y = "Average Sentiment Score") +
theme_minimal()

Take mean sentiment score by weekday to see if there’s any specific
day that puts out more positive news than others
# Average sentiment score by weekday
mean_score_by_weekday <- new_df |> group_by(weekday) |>
summarize(avg_sentiment_score = mean(avg_sentiment_score), count = n())
head(mean_score_by_weekday, 20)
## # A tibble: 7 × 3
## weekday avg_sentiment_score count
## <chr> <dbl> <int>
## 1 Friday -0.246 46
## 2 Monday 0.5 4
## 3 Saturday 0.5 4
## 4 Sunday 0 2
## 5 Thursday -0.191 62
## 6 Tuesday 0.15 10
## 7 Wednesday -0.222 9
Plot means by weekday
# Visualization: Average sentiment score by weekday
ggplot(mean_score_by_weekday, aes(x = weekday, y = avg_sentiment_score, fill = weekday)) +
geom_bar(stat = "identity") +
labs(title = "Average Sentiment Score by Weekday",
x = "Weekday",
y = "Average Sentiment Score") +
theme_minimal()
