Motivation:

There’s many discussions today about how the media influences the population and the amount of power media outlets have over peoples’ anxiety, depression, and general mental health. The purpose of this research is to investigate a very small portion of what a real study may look like. The sentiment analyisis of news article titles may help propose theories for the recent shift in mental health crises. Do articles typically have a negative sentiment? Positive sentiment? Are there certain publishers with particularly negative scores or positive scores? How do they perform vs their peers? How has the average news article sentiment changed over time and what could that look like in juxtaposition to the mental health crises? This analysis scrapes the surface of what the the relationship could look like.

Load Libraries

# Load libraries
library(dplyr)  # Data manipulation
library(anytime)  # Date/time conversion
library(tidytext)  # Text mining
library(ggplot2)  # Data visualization
library(readr)

Read CSV from API Call data frame

# URL of the CSV file
file_url <- 'https://github.com/wberritt913/CUNY_DATA607/raw/main/combined_df_final_project.csv'

# Read CSV into a data frame
df <- read_csv(file_url)

Change date from char to date and deduplicate dataset

# Data preprocessing
colnames(df)

## [1] "source"      "author"      "title"       "description" "publishedAt"

df$publishedAt <- as.Date(anytime(df$publishedAt))
unique(df$publishedAt)

##  [1] "2024-05-09" "2024-04-26" "2024-05-04" "2024-05-02" "1970-01-01"
##  [6] "2024-04-14" "2024-04-18" "2024-04-20" "2024-04-29" "2024-04-11"
## [11] "2024-05-05" "2024-04-17" "2024-05-07" "2024-04-24" "2024-04-22"
## [16] "2024-04-23" "2024-04-10" "2024-05-03" "2024-05-01" "2024-04-25"
## [21] "2024-04-16" "2024-04-21" "2024-04-12" "2024-05-08" "2024-05-06"
## [26] "2024-04-28"

df <- df[!duplicated(df$description), ]
dim(df)

## [1] 262   5

Unnest the titles of each article, remove stop words, get sentiment, then map scores to values

# Sentiment analysis on titles
title_unnested <- df |> unnest_tokens(word, title)
title_clean <- title_unnested |> anti_join(stop_words)
sentiment_title <- title_clean |> inner_join(get_sentiments('bing'))
head(sentiment_title, 20)

## # A tibble: 20 × 6
##    source              author            description publishedAt word  sentiment
##    <chr>               <chr>             <chr>       <date>      <chr> <chr>    
##  1 Yahoo Entertainment Karen Friar       "Stocks ha… 2024-05-09  win   positive 
##  2 CNBC                Dan Mangan        "Trump Med… 2024-05-09  trump positive 
##  3 New York Post       Lydia Moynihan, … "Many on W… 2024-05-09  death negative 
##  4 New York Post       Lydia Moynihan, … "Many on W… 2024-05-09  anger negative 
##  5 New York Post       Lydia Moynihan, … "Many on W… 2024-05-09  prom… positive 
##  6 Yahoo Entertainment Ryan Vlastelica   "(Bloomber… 2024-05-09  batt… negative 
##  7 HuffPost            Ron Dicker        "The \"Mor… 2024-05-09  hard  negative 
##  8 HuffPost            Ron Dicker        "The \"Mor… 2024-05-09  stif… negative 
##  9 HuffPost            Ron Dicker        "The \"Mor… 2024-05-09  trump positive 
## 10 New York Post       Josh Christenson  "Sen. John… 2024-05-09  squi… negative 
## 11 New York Post       Josh Christenson  "Sen. John… 2024-05-09  hara… negative 
## 12 New York Post       Josh Christenson  "Sen. John… 2024-05-09  cree… negative 
## 13 MarketWatch         MarketWatch        <NA>       2024-05-09  swipe negative 
## 14 WCVB Boston         Jamy Pombo Sesse… "The presi… 2024-05-09  dela… negative 
## 15 Foxweather.com      Scott Sistek, An… "Tropical … 2024-05-09  stub… negative 
## 16 TheStreet           Martin Baccardax  "The Fed i… 2024-05-09  gains positive 
## 17 Yahoo Entertainment Reuters           "The Easte… 2024-05-09  gloo… negative 
## 18 Yahoo Entertainment Reuters           "The Easte… 2024-05-09  fans  positive 
## 19 Yahoo Entertainment Reuters           "The Easte… 2024-05-09  fears negative 
## 20 Yahoo Entertainment Reuters           "Bets that… 2024-05-09  fall  negative

title_sentiment_w_score <- sentiment_title %>%
  mutate(sentiment_score = if_else(sentiment == 'positive', 1, -1))
head(title_sentiment_w_score, 20)

## # A tibble: 20 × 7
##    source         author description publishedAt word  sentiment sentiment_score
##    <chr>          <chr>  <chr>       <date>      <chr> <chr>               <dbl>
##  1 Yahoo Enterta… Karen… "Stocks ha… 2024-05-09  win   positive                1
##  2 CNBC           Dan M… "Trump Med… 2024-05-09  trump positive                1
##  3 New York Post  Lydia… "Many on W… 2024-05-09  death negative               -1
##  4 New York Post  Lydia… "Many on W… 2024-05-09  anger negative               -1
##  5 New York Post  Lydia… "Many on W… 2024-05-09  prom… positive                1
##  6 Yahoo Enterta… Ryan … "(Bloomber… 2024-05-09  batt… negative               -1
##  7 HuffPost       Ron D… "The \"Mor… 2024-05-09  hard  negative               -1
##  8 HuffPost       Ron D… "The \"Mor… 2024-05-09  stif… negative               -1
##  9 HuffPost       Ron D… "The \"Mor… 2024-05-09  trump positive                1
## 10 New York Post  Josh … "Sen. John… 2024-05-09  squi… negative               -1
## 11 New York Post  Josh … "Sen. John… 2024-05-09  hara… negative               -1
## 12 New York Post  Josh … "Sen. John… 2024-05-09  cree… negative               -1
## 13 MarketWatch    Marke…  <NA>       2024-05-09  swipe negative               -1
## 14 WCVB Boston    Jamy … "The presi… 2024-05-09  dela… negative               -1
## 15 Foxweather.com Scott… "Tropical … 2024-05-09  stub… negative               -1
## 16 TheStreet      Marti… "The Fed i… 2024-05-09  gains positive                1
## 17 Yahoo Enterta… Reute… "The Easte… 2024-05-09  gloo… negative               -1
## 18 Yahoo Enterta… Reute… "The Easte… 2024-05-09  fans  positive                1
## 19 Yahoo Enterta… Reute… "The Easte… 2024-05-09  fears negative               -1
## 20 Yahoo Enterta… Reute… "Bets that… 2024-05-09  fall  negative               -1

title_sentiment_w_score <- na.omit(title_sentiment_w_score)

Find the average sentiment of all titles

# Average sentiment score by title
mean_score_by_title <- title_sentiment_w_score |> group_by(description) |> 
  summarize(avg_sentiment_score = mean(sentiment_score), count = n())
head(mean_score_by_title, 20)

## # A tibble: 20 × 3
##    description                                         avg_sentiment_score count
##    <chr>                                                             <dbl> <int>
##  1 "\"Aamrok 3.0, ported to Qt5/KDE Frameworks 5, has…                -1       1
##  2 "\"There's no guarantee that changing the trajecto…                -1       1
##  3 "'Batman: Caped Crusader,' from J.J. Abrams, Matt …                 1       1
##  4 "'Let the Evil Go West' has set Sebastian Stan and…                -1       1
##  5 "'The Bear' will return with Season 3 on June 27 a…                 1       1
##  6 "(Bloomberg) -- Intel Corp bulls just cannot catch…                -1       1
##  7 "(Bloomberg) -- Moderna Inc.’s pioneering Covid sh…                 0       2
##  8 "A NASA contractor is urging the agency to suspend…                -1       1
##  9 "A Ukrainian ATACMS long-range missile strike repo…                -1       1
## 10 "A review of the excessively frozen smartphones of…                -1       1
## 11 "A who woman couldn't understand why she wasn't lo…                -1       1
## 12 "According to the Texas Department of Family and P…                 1       1
## 13 "After losing a pregnancy at 36 weeks, I thought I…                -1       1
## 14 "Amazon announces three interactive ad formats int…                -1       1
## 15 "Amazon has cut paid perks for Alexa developers. W…                 1       1
## 16 "Amidst tensions about free speech on university c…                -1       1
## 17 "Another week is coming to a close, which means it…                -1       1
## 18 "Apple has announced a refreshed Magic Keyboard ac…                 1       1
## 19 "Arati Prabhakar has the ear of the US president a…                 1       1
## 20 "At his Hollywood Bowl show, Matt Rife quipped abo…                -0.5     4

Join original data frame back in to include other variables for analysis

# Merge sentiment scores with main dataframe
new_df <- left_join(mean_score_by_title, df, by = 'description')
new_df$weekday <- weekdays(as.Date(new_df$publishedAt))
dim(new_df)

## [1] 137   8

mean(new_df$avg_sentiment_score)

## [1] -0.1435523

Find average sentiment grouped by the publisher of the article

# Average sentiment score by source
mean_score_by_source <- new_df |> group_by(source) |> 
  summarize(avg_sentiment_score = mean(avg_sentiment_score), count = n())
head(mean_score_by_source, 20)

## # A tibble: 20 × 3
##    source                   avg_sentiment_score count
##    <chr>                                  <dbl> <int>
##  1 9to5google.com                         1         1
##  2 ABC News                               0         1
##  3 Aeon.co                               -1         1
##  4 Android Central                       -1         3
##  5 Arrowhead Pride                        1         1
##  6 Ars Technica                           0         2
##  7 Associated Press                      -0.5       2
##  8 BBC News                               0         4
##  9 Boredpanda.com                        -1         1
## 10 Business Insider                      -0.417    12
## 11 CBS News                              -1         1
## 12 CNBC                                   1         1
## 13 CNET                                  -0.333     3
## 14 CNN                                   -1         1
## 15 Deadline                              -1         1
## 16 Design-milk.com                        0.5       2
## 17 Eonline.com                            1         1
## 18 Fort Worth Star-Telegram               1         1
## 19 Fox News                               1         1
## 20 Foxweather.com                        -1         1

Only take those sources with 3 or more articles and calculate weighted average

# Filter sources with sufficient data
mean_score_by_source <- subset(mean_score_by_source, count >= 3)
mean_score_by_source$weighted_amount <- mean_score_by_source$count/sum(mean_score_by_source$count)
sum(mean_score_by_source$weighted_amount*mean_score_by_source$avg_sentiment_score)

## [1] -0.09195402

Create bar chart to show average sentiment score by source

# Visualization: Average sentiment score by source
ggplot(mean_score_by_source, aes(x = source, y = avg_sentiment_score, fill = source)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Sentiment Score by Source",
       x = "Source",
       y = "Average Sentiment Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Look at sentiment score over time

# Average sentiment score over time
mean_score_by_date <- new_df |> group_by(publishedAt) |> 
  summarize(avg_sentiment_score = mean(avg_sentiment_score), count = n())
head(mean_score_by_date, 20)

## # A tibble: 20 × 3
##    publishedAt avg_sentiment_score count
##    <date>                    <dbl> <int>
##  1 2024-04-10               -0.5       2
##  2 2024-04-11                0.333     3
##  3 2024-04-12                0         1
##  4 2024-04-16               -0.25      2
##  5 2024-04-17                1         2
##  6 2024-04-18                1         1
##  7 2024-04-20                1         2
##  8 2024-04-22                1         2
##  9 2024-04-23                0.2       5
## 10 2024-04-26                0.333     3
## 11 2024-04-28               -1         1
## 12 2024-04-29                1         1
## 13 2024-05-01               -0.333     3
## 14 2024-05-02               -1         1
## 15 2024-05-03               -0.294    42
## 16 2024-05-04                0         2
## 17 2024-05-05                1         1
## 18 2024-05-06               -1         1
## 19 2024-05-07                0.333     3
## 20 2024-05-08               -1         2

Plot the line chart showing the change in average sentiment over time

# Visualization: Average sentiment score over time
ggplot(mean_score_by_date, aes(x = publishedAt, y = avg_sentiment_score)) +
  geom_line(stat = "identity") +
  labs(title = "Average Sentiment Score Over Time",
       x = "Date",
       y = "Average Sentiment Score") +
  theme_minimal()

Take mean sentiment score by weekday to see if there’s any specific day that puts out more positive news than others

# Average sentiment score by weekday
mean_score_by_weekday <- new_df |> group_by(weekday) |> 
  summarize(avg_sentiment_score = mean(avg_sentiment_score), count = n())
head(mean_score_by_weekday, 20)

## # A tibble: 7 × 3
##   weekday   avg_sentiment_score count
##   <chr>                   <dbl> <int>
## 1 Friday                 -0.246    46
## 2 Monday                  0.5       4
## 3 Saturday                0.5       4
## 4 Sunday                  0         2
## 5 Thursday               -0.191    62
## 6 Tuesday                 0.15     10
## 7 Wednesday              -0.222     9

Plot means by weekday

# Visualization: Average sentiment score by weekday
ggplot(mean_score_by_weekday, aes(x = weekday, y = avg_sentiment_score, fill = weekday)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Sentiment Score by Weekday",
       x = "Weekday",
       y = "Average Sentiment Score") +
  theme_minimal()

final_607_analysis

Will Berritt

2024-05-10

Motivation:

Load Libraries

Read CSV from API Call data frame

Change date from char to date and deduplicate dataset

Unnest the titles of each article, remove stop words, get sentiment, then map scores to values

Find the average sentiment of all titles

Join original data frame back in to include other variables for analysis

Find average sentiment grouped by the publisher of the article

Only take those sources with 3 or more articles and calculate weighted average

Create bar chart to show average sentiment score by source

Look at sentiment score over time

Plot the line chart showing the change in average sentiment over time

Take mean sentiment score by weekday to see if there’s any specific day that puts out more positive news than others

Plot means by weekday

Conclusion:

Here is a link to my API code that will allow you to filter the mean sentiment score for articles in a given date range: https://github.com/wberritt913/CUNY_DATA607/blob/main/final_proj_api.R