## Final R Markdown

This project uses three distinct data sources: CSV, API, and web-scraped data.

#Read the cleaned CSV files from GitHub
yahoo_articles <- read_csv("https://raw.githubusercontent.com/lher96/MSDS-Assignments/refs/heads/main/yahoo_articles_cleaned.csv",
                           show_col_types = FALSE)
stock_data <- read_csv("https://raw.githubusercontent.com/lher96/MSDS-Assignments/refs/heads/main/stock_prices_combined_cleaned",
                       show_col_types = FALSE)
nyt_articles <- read_csv("https://raw.githubusercontent.com/lher96/MSDS-Assignments/refs/heads/main/nyt_articles_cleaned",
                         show_col_types = FALSE)

glimpse(yahoo_articles)
## Rows: 698
## Columns: 11
## $ ticker            <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAP…
## $ company           <chr> "Apple Inc.", "Apple Inc.", "Apple Inc.", "Apple Inc…
## $ title             <chr> "Did Apple’s Executive Overhaul Around AI, Regulatio…
## $ url               <chr> "https://finance.yahoo.com/news/did-apple-executive-…
## $ source            <chr> "Simply Wall St.", "CorpGov.com", "Motley Fool", "Mo…
## $ published_at      <chr> "2025-12-10T18:14:25", "2025-12-10T17:58:39", "2025-…
## $ snippet           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "In recent days,…
## $ scraped_at        <dttm> 2025-12-10 23:52:38, 2025-12-10 23:52:38, 2025-12-1…
## $ platform          <chr> "Yahoo Finance", "Yahoo Finance", "Yahoo Finance", "…
## $ text_for_analysis <chr> "Did Apple’s Executive Overhaul Around AI, Regulatio…
## $ published_date    <dttm> 2025-12-10 18:14:25, 2025-12-10 17:58:39, 2025-12-1…
glimpse(nyt_articles)
## Rows: 892
## Columns: 24
## $ ticker            <chr> "MSFT", "NFLX", "NFLX", "NVDA", "NVDA", "AMZN", "MET…
## $ company           <chr> "Microsoft", "Netflix", "Netflix", "Nvidia", "Nvidia…
## $ search_query      <chr> "Microsoft", "Netflix", "Netflix", "Nvidia", "Nvidia…
## $ article_id        <chr> "nyt://article/423988df-2bb2-53db-9b39-ac7e2c4bd154"…
## $ web_url           <chr> "https://www.nytimes.com/2025/12/09/us/politics/step…
## $ headline          <chr> "Stephen Miller’s Stock Sale Raises Questions, Ethic…
## $ abstract          <chr> "Mr. Miller, one of President Trump’s top advisers, …
## $ lead_paragraph    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ snippet           <chr> "Mr. Miller, one of President Trump’s top advisers, …
## $ source            <chr> "The New York Times", "The New York Times", "The New…
## $ pub_date          <dttm> 2025-12-09 23:30:28, 2025-12-09 22:22:25, 2025-12-0…
## $ document_type     <chr> "article", "article", "article", "article", "article…
## $ news_desk         <chr> "Washington", "Washington", "Business", "OpEd", "For…
## $ section_name      <chr> "U.S.", "U.S.", "Business", "Opinion", "World", "Bus…
## $ subsection_name   <chr> "Politics", "Politics", "Media", NA, NA, NA, "Family…
## $ word_count        <dbl> 1052, 1303, 755, 912, 1671, 654, 629, 829, 736, 1458…
## $ print_page        <dbl> NA, NA, NA, 18, NA, 3, NA, NA, NA, 1, 1, NA, NA, NA,…
## $ print_section     <chr> NA, NA, NA, "A", NA, "B", NA, NA, NA, "A", "B", NA, …
## $ byline            <chr> "By Ana Swanson", "By Charlie Savage", "By Benjamin …
## $ keywords          <chr> "United States Politics and Government, Conflicts of…
## $ scraped_at        <dttm> 2025-12-10 21:55:37, 2025-12-10 22:16:10, 2025-12-1…
## $ platform          <chr> "New York Times", "New York Times", "New York Times"…
## $ text_for_analysis <chr> "Stephen Miller’s Stock Sale Raises Questions, Ethic…
## $ pub_date_only     <date> 2025-12-09, 2025-12-09, 2025-12-09, 2025-12-09, 202…
glimpse(stock_data)
## Rows: 7,250
## Columns: 13
## $ date              <date> 2024-12-10, 2024-12-11, 2024-12-12, 2024-12-13, 202…
## $ ticker            <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAP…
## $ company           <chr> "Apple Inc.", "Apple Inc.", "Apple Inc.", "Apple Inc…
## $ weight_pct        <dbl> 6.62, 6.62, 6.62, 6.62, 6.62, 6.62, 6.62, 6.62, 6.62…
## $ open              <dbl> 245.7784, 246.8436, 245.7784, 246.7042, 246.8734, 24…
## $ high              <dbl> 247.0925, 249.6708, 247.6201, 248.1676, 250.2482, 25…
## $ low               <dbl> 244.2354, 245.1512, 244.5738, 245.1313, 246.5350, 24…
## $ close             <dbl> 246.6544, 245.3802, 246.8436, 247.0128, 249.9097, 25…
## $ volume            <dbl> 36914800, 45205800, 32777500, 33155300, 51694800, 51…
## $ dividends         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ stock_splits      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ daily_return      <dbl> NA, -0.0051661189, 0.0059637825, 0.0006855352, 0.011…
## $ cumulative_return <dbl> NA, -0.005166119, 0.000766854, 0.001452915, 0.013197…

Sentiment Analysis vs Stock Price

In this document we explore how well news sentiment predicts daily price changes in the top 30 S&P 500 stocks. We do this with regression models, and we also factor each stock's index weight into the return and correlation calculations.

We use the classic AFINN lexicon alongside the finance-specific Loughran-McDonald lexicon. Because Loughran-McDonald was built for financial text, comparing its scores against the general-purpose AFINN scores is an important part of the analysis.
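
To make the comparison concrete: AFINN assigns each word an integer score from -5 to +5, while Loughran-McDonald tags each word with a category (positive, negative, uncertainty, litigious, and so on). A minimal peek at the two lexicons, assuming tidytext and dplyr are loaded (illustrative only; a word such as "loss" should appear in both):

# Shape of the two lexicons (illustrative sketch)
get_sentiments("afinn") %>% slice_head(n = 3)      # columns: word, value
get_sentiments("loughran") %>% slice_head(n = 3)   # columns: word, sentiment

# The same word can be scored numerically by AFINN and categorically by LM
get_sentiments("afinn") %>% filter(word == "loss")
get_sentiments("loughran") %>% filter(word == "loss")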

#Sentiment Analysis Function

score_articles <- function(df, text_col, article_id_col = row_id) {
  #Create tokens from text_col
  tokens <- df %>%
    select({{article_id_col}}, {{text_col}}) %>%
    unnest_tokens(word, {{text_col}})

  # total tokens per article 
  token_counts <- tokens %>%
    count({{article_id_col}}, name = "n_tokens")

  # AFINN per article
  afinn <- tokens %>%
    inner_join(get_sentiments("afinn"), by = "word") %>%
    group_by({{article_id_col}}) %>%
    summarize(
      afinn_sum  = sum(value, na.rm = TRUE),
      afinn_mean = mean(value, na.rm = TRUE),  # average over matched AFINN words
      .groups = "drop"
    )

  # LM category counts per article
  lm_counts <- tokens %>%
    inner_join(get_sentiments("loughran"), by = "word") %>%
    count({{article_id_col}}, sentiment, name = "n_words") %>%
    pivot_wider(names_from = sentiment, values_from = n_words, values_fill = 0)

  # Derived LM numeric features (normalized by length)
  lm_features <- lm_counts %>%
    mutate(
      lm_net = positive - negative
    ) %>%
    left_join(token_counts, by = rlang::as_name(rlang::enquo(article_id_col))) %>%
    mutate(
      lm_net_rate        = lm_net / n_tokens,
      lm_pos_rate        = positive / n_tokens,
      lm_neg_rate        = negative / n_tokens,
      lm_uncertainty_rate= uncertainty / n_tokens
    )

  # Combine. Note: token_counts is joined here and again inside lm_features,
  # so the final table carries both n_tokens.x and n_tokens.y (identical where
  # both are non-NA); downstream code uses n_tokens.x.
  df %>%
    left_join(token_counts, by = rlang::as_name(rlang::enquo(article_id_col))) %>%
    left_join(afinn,        by = rlang::as_name(rlang::enquo(article_id_col))) %>%
    left_join(lm_features,  by = rlang::as_name(rlang::enquo(article_id_col)))
}

#Run score function on articles

yahoo_articles <- yahoo_articles %>% mutate(row_id = row_number())
yahoo_articles <- score_articles(yahoo_articles, text_for_analysis, row_id)
## Warning in inner_join(., get_sentiments("loughran"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1487 of `x` matches multiple rows in `y`.
## ℹ Row 3726 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
nyt_articles <- nyt_articles %>% mutate(row_id = row_number())
nyt_articles <- score_articles(nyt_articles, text_for_analysis, row_id)
## Warning in inner_join(., get_sentiments("loughran"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 63 of `x` matches multiple rows in `y`.
## ℹ Row 2410 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
#Fixing Yahoo Articles Date format with lubridate
yahoo_articles <- yahoo_articles %>%
  mutate(
    # Try for the plain datetime first: "2025-12-10 13:13:49"
    dt_plain = suppressWarnings(ymd_hms(published_at, tz = "UTC")),

    # Strip an RFC-2822-style weekday prefix such as "Wed, " before re-parsing
    published_at_clean = str_remove(published_at, "^[A-Za-z]{3},\\s*"),

    # Parse the remaining strings, e.g. "10 Dec 2025 18:14:25 +0000"
    dt_rfc = suppressWarnings(parse_date_time2(
      published_at_clean,
      orders = "d b Y H:M:S z",
      tz = "UTC"
    )),
    # Keep the first non-NA datetime for each article, then drop helpers
    published_date = coalesce(published_date, dt_plain, dt_rfc)
  ) %>%
  select(-dt_plain, -dt_rfc, -published_at_clean)

#Create Daily Article Sentiment Tables for Visualization

yahoo_daily <- yahoo_articles %>%
  mutate(day = as.Date(published_date)) %>%
  group_by(ticker, day) %>%
  summarize(
    n_articles = n(),
    afinn_day = mean(afinn_sum, na.rm = TRUE),
    lm_net_day = mean(lm_net, na.rm = TRUE),
    lm_uncertainty_day = mean(uncertainty / n_tokens.x, na.rm = TRUE),  # n_tokens.x: see join note in score_articles()
    .groups = "drop"
  )

nyt_daily <- nyt_articles %>%
  group_by(ticker, day = pub_date_only) %>%
  summarize(
    n_articles = n(),
    afinn_day = mean(afinn_sum, na.rm = TRUE),
    lm_net_day = mean(lm_net, na.rm = TRUE),
    lm_uncertainty_day = mean(uncertainty / n_tokens.x, na.rm = TRUE),  # n_tokens.x: see join note in score_articles()
    .groups = "drop"
  )

#Plots for our Yahoo and NYT Daily Tables

#Daily Sentiment table Yahoo
yahoo_daily_sent <- yahoo_articles %>%
  mutate(day = as.Date(published_date)) %>%
  group_by(day) %>%
  summarize(
    afinn_daily = mean(afinn_sum, na.rm = TRUE),
    lm_daily    = mean(lm_net, na.rm = TRUE),
    n_articles  = n(),
    .groups = "drop"
  )

#AFINN Plots Yahoo
ggplot(yahoo_daily_sent, aes(x = day, y = afinn_daily)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Yahoo Daily Sentiment Over Time (AFINN)",
    x = "Date",
    y = "Daily AFINN sentiment (mean across all articles)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(yahoo_daily_sent, aes(x = day, y = afinn_daily)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Yahoo Daily Sentiment Over Time (AFINN)",
    x = "Date",
    y = "Daily AFINN sentiment (mean across all articles)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Loughran McDonald Plots Yahoo
ggplot(yahoo_daily_sent, aes(x = day, y = lm_daily)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Yahoo Daily Sentiment Over Time (LM)",
    x = "Date",
    y = "Daily LM sentiment (mean across all articles)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

ggplot(yahoo_daily_sent, aes(x = day, y = lm_daily)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Yahoo Daily Sentiment Over Time (LM)",
    x = "Date",
    y = "Daily LM sentiment (mean across all articles)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

#Our NYT articles cover many more months of data. In these graphs there is no consistent pattern, but the average sentiment scores for both AFINN and LM show a noticeable dip in May and June. Sentiment does not turn negative so much as neutral, which suggests that more positive coverage appears at the beginning and end of the year.

#Daily sentiment Table NYT
nyt_daily_sent <- nyt_articles %>%
  mutate(day = as.Date(pub_date_only)) %>%
  group_by(day) %>%
  summarize(
    afinn_daily = mean(afinn_sum, na.rm = TRUE),
    lm_daily    = mean(lm_net, na.rm = TRUE),
    n_articles  = n(),
    .groups = "drop"
  )

#AFINN Plots
ggplot(nyt_daily_sent, aes(x = day, y = afinn_daily)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  labs(
    title = "NYT Daily Sentiment Over Time (AFINN)",
    x = "Date",
    y = "Daily AFINN sentiment (mean across all articles)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_smooth()`).

ggplot(nyt_daily_sent, aes(x = day, y = afinn_daily)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "NYT Daily Sentiment Over Time (AFINN)",
    x = "Date",
    y = "Daily AFINN sentiment (mean across all articles)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).

#Loughran McDonald Plots NYT
ggplot(nyt_daily_sent, aes(x = day, y = lm_daily)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  labs(
    title = "NYT Daily Sentiment Over Time (AFINN)",
    x = "Date",
    y = "Daily AFINN sentiment (mean across all articles)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 20 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

ggplot(nyt_daily_sent, aes(x = day, y = lm_daily)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "NYT Daily Sentiment Over Time (LM)",
    x = "Date",
    y = "Daily LM sentiment (mean across all articles)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 20 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_point()`).

#Breaking down the sentiment scores by ticker

#Daily Stock Table 
stock_data_daily <- stock_data %>%
  group_by(date) %>%
  summarize(
    # simple average across all stocks that day
    avg_daily_return = mean(daily_return, na.rm = TRUE),

    # weighted average using weight_pct
    weighted_avg_daily_return = weighted.mean(
      daily_return,
      w = weight_pct,
      na.rm = TRUE
    ),

    n_stocks = n(),
    .groups = "drop"
  )
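
As a quick sanity check, weighted.mean() computes sum(x * w) / sum(w); a toy example with made-up numbers:

# Toy example (numbers invented): two stocks' returns and their index weights
weighted.mean(c(0.010, -0.020), w = c(6.62, 1.50))
# = (0.010 * 6.62 + (-0.020) * 1.50) / (6.62 + 1.50) ≈ 0.00446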
#Daily Stock table/Sentiment by ticker
stock_data_by_ticker <- stock_data %>%
  group_by(date, ticker) %>%
  summarize(
    avg_daily_return = mean(daily_return, na.rm = TRUE),
    weight_pct = first(weight_pct),   # keep the ticker's weight
    n_obs = n(),
    .groups = "drop"
  )

yahoo_sent_by_ticker <- yahoo_articles %>%
  mutate(day = as.Date(published_date)) %>%
  group_by(day, ticker) %>%
  summarize(
    afinn_daily = mean(afinn_sum, na.rm = TRUE),
    lm_daily    = mean(lm_net, na.rm = TRUE),
    n_articles  = n(),
    .groups = "drop"
  )

nyt_sent_by_ticker <- nyt_articles %>%
  mutate(day = as.Date(pub_date_only)) %>%
  group_by(day, ticker) %>%
  summarize(
    afinn_daily = mean(afinn_sum, na.rm = TRUE),
    lm_daily    = mean(lm_net, na.rm = TRUE),
    n_articles  = n(),
    .groups = "drop"
  )


#Graphs For Daily Stock Outcomes
ggplot(stock_data_daily, aes(x = date, y = avg_daily_return)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Daily Return (All Stocks)",
    x = "Date",
    y = "Avg Daily Return"
  )
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

#Graphs For Daily Stock Outcomes, Weighted
ggplot(stock_data_daily, aes(x = date, y = weighted_avg_daily_return)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Weighted Average Daily Return (All Stocks)",
    x = "Date",
    y = "Weighted Avg Daily Return"
  ) 
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

#Graphs For Daily Stock Outcomes by Ticker
ggplot(stock_data_by_ticker, aes(x = date, y = avg_daily_return, group = ticker)) +
  geom_line(alpha = 0.4) +
  geom_point(alpha = 0.4, size = 0.8) +
  labs(
    title = "Daily Returns by Ticker",
    x = "Date",
    y = "Daily Return"
  )
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(stock_data_by_ticker, aes(x = date, y = avg_daily_return, group = ticker)) +
  geom_point(alpha = 0.4, size = 0.8) +
  labs(
    title = "Daily Returns by Ticker",
    x = "Date",
    y = "Daily Return"
  )
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).

Above we have the average market returns, as well as all tickers plotted in one graph to see whether any day shows a standout trend. The most notable one is in early April, where we see a large amount of variation in the returns.

#Comparison of Graphs: Daily returns vary across each ticker, with big days of losses and gains spread throughout. The largest swings came in April, likely due to the sharp dip after the tariff announcement. By comparison, the sparse Yahoo data makes the overlay of these series weak. The AFINN scores give us much more data, but also clearly more variance around the neutral sentiment score of 0. The per-ticker sentiment scores are filled with outliers, which may be a strong indication that these scores alone are poor predictors of stock performance.

#Graphs For Daily Stock Outcomes/Sentiment by Ticker
ggplot(stock_data_by_ticker, aes(x = date, y = avg_daily_return)) +
  geom_line() +
  geom_point(size = 0.6) +
  facet_wrap(~ ticker, scales = "free_y") +
  labs(
    title = "Daily Returns by Ticker",
    x = "Date",
    y = "Daily Return"
  )
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(yahoo_sent_by_ticker, aes(x = day, y = afinn_daily)) +
  geom_line() +
  geom_point(size = 0.6) +
  facet_wrap(~ ticker, scales = "free_y") +
  labs(title = "Yahoo Daily Sentiment by Ticker (AFINN)", x = "Date", y = "AFINN (daily mean)")
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(nyt_sent_by_ticker, aes(x = day, y = afinn_daily)) +
  geom_line() +
  geom_point(size = 0.6) +
  facet_wrap(~ ticker, scales = "free_y") +
  labs(title = "NYT Daily Sentiment by Ticker (AFINN)", x = "Date", y = "AFINN (daily mean)")
## Warning: Removed 93 rows containing missing values or values outside the scale range
## (`geom_point()`).

#Building clean data tables for regression/correlation
model_yahoo <- stock_data_by_ticker %>%
  left_join(
    yahoo_sent_by_ticker,
    by = c("date" = "day", "ticker" = "ticker")
  )

model_nyt <- stock_data_by_ticker %>%
  left_join(
    nyt_sent_by_ticker,
    by = c("date" = "day", "ticker" = "ticker")
  )
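
Before correlating, it is worth checking how sparse these joined tables are, since most stock-day rows have no matching article (the regressions below report thousands of observations deleted due to missingness). A quick check, not evaluated in the original run:

# How many stock-day rows actually received a sentiment score?
model_yahoo %>% summarize(matched = sum(!is.na(afinn_daily)), total = n())
model_nyt %>% summarize(matched = sum(!is.na(afinn_daily)), total = n())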

cor(model_yahoo$avg_daily_return,
    model_yahoo$afinn_daily,
    use = "complete.obs")
## [1] -0.2851574
cor(model_yahoo$avg_daily_return,
    model_yahoo$lm_daily,
    use = "complete.obs")
## [1] -0.2266228
# Per-ticker correlations for Yahoo (named separately so the NYT table below does not overwrite it)
cor_by_ticker_yahoo <- model_yahoo %>%
  group_by(ticker) %>%
  summarize(
    n_pairs_afinn = sum(complete.cases(avg_daily_return, afinn_daily)),
    n_pairs_lm    = sum(complete.cases(avg_daily_return, lm_daily)),

    cor_afinn = if (n_pairs_afinn >= 3) {
      cor(avg_daily_return, afinn_daily, use = "complete.obs")
    } else {
      NA_real_
    },

    cor_lm = if (n_pairs_lm >= 3) {
      cor(avg_daily_return, lm_daily, use = "complete.obs")
    } else {
      NA_real_
    },

    .groups = "drop"
  ) %>%
  arrange(desc(abs(cor_lm)))
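
The per-ticker correlation table is built but never printed above; to inspect the strongest relationships, something like:

cor_by_ticker_yahoo %>% slice_head(n = 10)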

m1 <- lm(avg_daily_return ~ afinn_daily + lm_daily, data = model_yahoo)
summary(m1)
## 
## Call:
## lm(formula = avg_daily_return ~ afinn_daily + lm_daily, data = model_yahoo)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.042110 -0.005437  0.001463  0.008823  0.022171 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0001643  0.0036851  -0.045    0.965
## afinn_daily -0.0014539  0.0012960  -1.122    0.271
## lm_daily    -0.0004991  0.0031484  -0.159    0.875
## 
## Residual standard error: 0.01376 on 29 degrees of freedom
##   (7218 observations deleted due to missingness)
## Multiple R-squared:  0.09081,    Adjusted R-squared:  0.02811 
## F-statistic: 1.448 on 2 and 29 DF,  p-value: 0.2515
m2 <- lm(avg_daily_return ~ lm_daily, data = model_yahoo)
summary(m2)
## 
## Call:
## lm(formula = avg_daily_return ~ lm_daily, data = model_yahoo)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.043530 -0.004589  0.000834  0.008591  0.023096 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003106   0.002600  -1.194    0.242
## lm_daily    -0.002927   0.002297  -1.274    0.212
## 
## Residual standard error: 0.01382 on 30 degrees of freedom
##   (7218 observations deleted due to missingness)
## Multiple R-squared:  0.05136,    Adjusted R-squared:  0.01974 
## F-statistic: 1.624 on 1 and 30 DF,  p-value: 0.2123
# Per-ticker correlations for NYT
cor_by_ticker_nyt <- model_nyt %>%
  group_by(ticker) %>%
  summarize(
    n_pairs_afinn = sum(complete.cases(avg_daily_return, afinn_daily)),
    n_pairs_lm    = sum(complete.cases(avg_daily_return, lm_daily)),

    cor_afinn = if (n_pairs_afinn >= 3) {
      cor(avg_daily_return, afinn_daily, use = "complete.obs")
    } else {
      NA_real_
    },

    cor_lm = if (n_pairs_lm >= 3) {
      cor(avg_daily_return, lm_daily, use = "complete.obs")
    } else {
      NA_real_
    },

    .groups = "drop"
  ) %>%
  arrange(desc(abs(cor_lm)))

m1 <- lm(avg_daily_return ~ afinn_daily + lm_daily, data = model_nyt)
summary(m1)
## 
## Call:
## lm(formula = avg_daily_return ~ afinn_daily + lm_daily, data = model_nyt)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.167378 -0.008503  0.000915  0.012484  0.094813 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.0024895  0.0016750  -1.486   0.1381  
## afinn_daily  0.0006389  0.0005092   1.255   0.2105  
## lm_daily    -0.0023699  0.0012875  -1.841   0.0665 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02454 on 342 degrees of freedom
##   (6905 observations deleted due to missingness)
## Multiple R-squared:  0.009844,   Adjusted R-squared:  0.004054 
## F-statistic:   1.7 on 2 and 342 DF,  p-value: 0.1842
m2 <- lm(avg_daily_return ~ lm_daily, data = model_nyt)
summary(m2)
## 
## Call:
## lm(formula = avg_daily_return ~ lm_daily, data = model_nyt)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168930 -0.009332  0.000768  0.011809  0.094571 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0020238  0.0014909  -1.357    0.175
## lm_daily    -0.0015070  0.0009407  -1.602    0.110
## 
## Residual standard error: 0.02383 on 373 degrees of freedom
##   (6875 observations deleted due to missingness)
## Multiple R-squared:  0.006834,   Adjusted R-squared:  0.004171 
## F-statistic: 2.567 on 1 and 373 DF,  p-value: 0.11

#Conclusion of Regression Testing

Daily sentiment scores derived from the AFINN and Loughran-McDonald dictionaries show weak negative correlations with average daily stock returns (r ≈ −0.23 to −0.29). However, when modeled jointly in a linear regression framework, neither sentiment measure significantly explains daily returns, and the model accounts for less than 3% of total variance. The lack of significance is consistent with the high noise of daily stock returns, substantial missingness reducing the effective sample size, and overlapping information between the two sentiment measures. These results suggest that contemporaneous news sentiment does not provide meaningful explanatory power for same-day stock returns at the aggregate level. This still leaves open the question of whether a different model, or additional data, could push the explained variance above the roughly 3% seen here.
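
One illustrative next step that question suggests (a sketch only, reusing the tables built above): lag sentiment by one trading day within each ticker, to test whether yesterday's news sentiment predicts today's return rather than the same day's.

# Sketch: does previous-day LM sentiment predict next-day returns?
model_nyt_lagged <- model_nyt %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(lm_daily_lag1 = lag(lm_daily)) %>%
  ungroup()

m_lag <- lm(avg_daily_return ~ lm_daily_lag1, data = model_nyt_lagged)
summary(m_lag)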

m_weighted <- lm(
  avg_daily_return ~ lm_daily,
  data = model_nyt,
  weights = weight_pct
)

summary(m_weighted)
## 
## Call:
## lm(formula = avg_daily_return ~ lm_daily, data = model_nyt, weights = weight_pct)
## 
## Weighted Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274967 -0.011256 -0.000616  0.011065  0.152183 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.001295   0.001551  -0.835   0.4041  
## lm_daily    -0.002040   0.001051  -1.940   0.0531 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03254 on 373 degrees of freedom
##   (6875 observations deleted due to missingness)
## Multiple R-squared:  0.009994,   Adjusted R-squared:  0.00734 
## F-statistic: 3.765 on 1 and 373 DF,  p-value: 0.05308
model_yahoo2 <- model_yahoo %>%
  mutate(
    # Treat stock-days with no matched articles as neutral (0) sentiment
    afinn_daily = if_else(is.na(afinn_daily), 0, afinn_daily),
    lm_daily    = if_else(is.na(lm_daily),    0, lm_daily)
  )


ticker_models <- model_yahoo2 %>%
  group_by(ticker) %>%
  group_modify(~ {
    d <- .x %>% filter(!is.na(avg_daily_return), !is.na(lm_daily))

    # if there are no usable rows (or too few), return an empty result
    if (nrow(d) < 10) return(tibble())  # you can change 10 to 5/20/etc.

    fit <- lm(avg_daily_return ~ lm_daily, data = d)
    coefs <- summary(fit)$coefficients

    tibble(
      term      = rownames(coefs),
      estimate  = coefs[, "Estimate"],
      std.error = coefs[, "Std. Error"],
      statistic = coefs[, "t value"],
      p.value   = coefs[, "Pr(>|t|)"]
    )
  }) %>%
  ungroup()
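
The manual coefficient extraction above can also be written with broom::tidy(), which returns the same term, estimate, std.error, statistic, and p.value columns; an equivalent sketch (assumes the broom package is available, since it is not loaded elsewhere in this document):

library(broom)

ticker_models_tidy <- model_yahoo2 %>%
  group_by(ticker) %>%
  group_modify(~ {
    d <- .x %>% filter(!is.na(avg_daily_return), !is.na(lm_daily))
    if (nrow(d) < 10) return(tibble())
    tidy(lm(avg_daily_return ~ lm_daily, data = d))
  }) %>%
  ungroup()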

ticker_models %>%
  filter(term == "lm_daily") %>%
  arrange(p.value)
## # A tibble: 14 × 6
##    ticker term      estimate std.error statistic p.value
##    <chr>  <chr>        <dbl>     <dbl>     <dbl>   <dbl>
##  1 PG     lm_daily -0.0129     0.00446   -2.90   0.00409
##  2 XOM    lm_daily  0.134      0.103      1.31   0.193  
##  3 V      lm_daily -0.00680    0.00689   -0.988  0.324  
##  4 HD     lm_daily -0.0635     0.0738    -0.860  0.390  
##  5 MA     lm_daily -0.00989    0.0127    -0.778  0.437  
##  6 WMT    lm_daily -0.00984    0.0128    -0.772  0.441  
##  7 ABBV   lm_daily -0.00409    0.00634   -0.645  0.520  
##  8 CSCO   lm_daily  0.00543    0.00858    0.633  0.527  
##  9 GE     lm_daily -0.00389    0.00949   -0.410  0.682  
## 10 JNJ    lm_daily -0.00323    0.0136    -0.238  0.812  
## 11 KO     lm_daily  0.00118    0.00664    0.177  0.859  
## 12 COST   lm_daily  0.000918   0.00910    0.101  0.920  
## 13 CVX    lm_daily  0.00156    0.0155     0.101  0.920  
## 14 BAC    lm_daily  0.000584   0.00668    0.0875 0.930
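
One caveat on this table: with 14 simultaneous tests, a single small p-value is not surprising by chance alone. A quick Benjamini-Hochberg adjustment (sketch; p.adjust() is base R):

ticker_models %>%
  filter(term == "lm_daily") %>%
  mutate(p_adj_bh = p.adjust(p.value, method = "BH")) %>%
  arrange(p_adj_bh)

Under BH, PG's p = 0.0041 rises to roughly 0.057 (0.0041 × 14), so even the strongest per-ticker result is only marginal after adjustment.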

#Conclusion on Weighted Results

A weighted linear regression of average daily returns on NYT-based Loughran-McDonald sentiment scores shows a small negative association between sentiment and returns. The estimated coefficient (β = −0.0020, p = 0.053) suggests that days with more positive sentiment are associated with slightly lower weighted average returns, while more negative sentiment corresponds to marginally higher returns. Although the model explains less than 1% of the return variation, this magnitude is typical for daily financial data. The result provides weak but suggestive evidence of a contrarian sentiment effect once stock importance is accounted for via portfolio weights.