When an assassin took the life of UnitedHealthcare CEO Brian Thompson on the streets of New York City early on December 4th, it ignited an intense national conversation surrounding the ethics of privatized healthcare in the United States. I’m curious what we can learn about how public perception of healthcare companies has changed in the wake of this event.

While there isn’t much polling data to indicate how people’s perception of/faith in these companies might have changed, we can instead turn to a proxy: stock prices. Of course, a company’s stock price is far from an exact representation of its public perception, but it can provide meaningful insights into investors’ feelings toward that company.

How did the assassination of the UnitedHealthcare CEO affect the stock price of its parent company, UnitedHealth Group? Can we say anything meaningful about this event and how it affected the company’s general stock trends?

Let’s begin by loading historical data downloaded from nasdaq.com. To provide ample context prior to the event in question, we’ll work with daily stock price data from the last three months. Included in this data set are six different variables:

library(tidyverse)
library(lubridate)

unh <- read_csv("/Users/aaroncohen/Downloads/HistoricalData_1734395855985.csv")

head(unh)
## # A tibble: 6 × 6
##   Date       `Close/Last`   Volume Open     High      Low      
##   <chr>      <chr>           <dbl> <chr>    <chr>     <chr>    
## 1 12/13/2024 $520.48       8202905 $515.64  $527.53   $510.72  
## 2 12/12/2024 $515.76       9403795 $531.53  $534.00   $514.19  
## 3 12/11/2024 $533.53      10362990 $555.655 $558.10   $532.67  
## 4 12/10/2024 $565.19       5360590 $562.00  $567.7499 $557.0301
## 5 12/09/2024 $560.62       7684650 $552.00  $562.98   $544.6412
## 6 12/06/2024 $549.62      13003640 $582.105 $582.105  $544.1401

Before any analysis, I’d like to get a better understanding of which variables (high, low, open, and trading volume) contribute the most to the closing price. Considering my incredibly basic understanding of the stock market, having this information might better guide me as I explore the data and attempt to answer my research question.

After we clean up our data, let’s use a random forest to assess variable importance:

# convert date to workable value
unh$Date <- mdy(unh$Date)

# limit to relevant date range (three months)
unh <- head(unh, n = nrow(unh) / 2)

# remove dollar sign and convert stock price to numerical value
unh$Open <- substr(unh$Open, 2, nchar(unh$Open))
unh$Open <- as.numeric(unh$Open)

unh$High <- substr(unh$High, 2, nchar(unh$High))
unh$High <- as.numeric(unh$High)

unh$Low <- substr(unh$Low, 2, nchar(unh$Low))
unh$Low <- as.numeric(unh$Low)

unh$`Close/Last` <- substr(unh$`Close/Last`, 2, nchar(unh$`Close/Last`))
unh$`Close/Last` <- as.numeric(unh$`Close/Last`)

# rename
unh <- unh |>
  rename(Close = `Close/Last`)

# random forest!
library(randomForest)

rf1 <- randomForest(Close ~ Open + High + Low + Volume,
                    data = unh)

rf1
## 
## Call:
##  randomForest(formula = Close ~ Open + High + Low + Volume, data = unh) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 43.34216
##                     % Var explained: 90.91

As expected, a random forest predicting closing price from open, high, and low prices as well as trading volume is able to explain a high percent of the variance in the data. This makes sense since these variables are highly related to one another: a higher than average opening price often (but not always!) indicates a higher than average closing price. Now let’s look at variable importance within our random forest:

rf1$importance
##        IncNodePurity
## Open        7139.455
## High        8613.906
## Low         8011.598
## Volume      5356.755
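
The table above reports IncNodePurity, which can come out well above zero even for a predictor that carries no real signal. As a hedged cross-check, randomForest can also report permutation importance (%IncMSE) when the model is fit with importance = TRUE. The sketch below uses simulated data, not the UNH series: the data frame `sim` and its values are made up for illustration, and Close is constructed to depend on Open only.

```r
library(randomForest)

set.seed(42)  # random forests are stochastic; fix the seed for reproducibility

# Simulated stand-in for the price data (NOT the real UNH series)
n <- 200
sim <- data.frame(
  Open   = rnorm(n, mean = 550, sd = 20),
  Volume = rnorm(n, mean = 5e6, sd = 1e6)
)
sim$Close <- sim$Open + rnorm(n, mean = 0, sd = 5)  # Volume plays no role here

rf_sim <- randomForest(Close ~ Open + Volume, data = sim, importance = TRUE)

# type = 1: permutation importance (%IncMSE); type = 2: IncNodePurity
imp_perm   <- importance(rf_sim, type = 1)
imp_purity <- importance(rf_sim, type = 2)

imp_perm
imp_purity
```

If the same comparison were run on the real data, a Volume score that stays high under permutation importance would be stronger evidence of a genuine contribution than IncNodePurity alone.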

One takeaway is that volume, while ranked lowest in importance, still appears to contribute meaningfully to predicting closing price. Let’s plot closing price and trading volume over time, with the red dashed line denoting the day of the assassination:

unh |> 
  ggplot() +
  geom_point(aes(x = Date, y = Close)) +
  labs(
    title = "UnitedHealth Group, Inc. (UNH) Stock Prices, 2024",
    x = "Date",
    y = "Stock Price (USD)"
  ) +
  geom_vline(xintercept = mdy("12/4/2024"), color = "red", linetype = "dashed", linewidth = 1)

unh |> 
  ggplot() +
  geom_point(aes(x = Date, y = Volume)) +
  labs(
    title = "UnitedHealth Group, Inc. (UNH) Trading Volume, 2024",
    x = "Date",
    y = "Volume" 
  ) +
  geom_vline(xintercept = mdy("12/4/2024"), color = "red", linetype = "dashed", linewidth = 1)

Except for a few outliers, trading volume shows a pretty clear trend: volume dramatically increases after December 4. This means investors were buying and selling significantly more in the days after the assassination than in the days before. Volume is ambiguous, though: it measures how much trading took place and does not tell us whether perception has changed positively or negatively.

The closing price trend is less clear. Let’s explore it further.

At first glance, we can see what might be a non-linear relationship between time and price, but there’s also a fair bit of variation in the data. To help us make sense of these points and better understand any underlying trends, let’s use smoothing splines.

Smoothing splines, a supervised learning algorithm, can be a helpful tool to estimate the functional relationship between variables: in our case, time (predictor) and closing stock price (response). What makes them particularly useful is their ability to consider not just goodness of fit but smoothness, as well.

A regression line estimating the relationship between time and stock price that simply connects each data point would technically be well fitted to our data, since it would reproduce the observed closing price on any given day. Such a line would not help us understand more general trends, though, such as whether the price has been generally increasing or decreasing, or whether its rate of change is speeding up or slowing down. This is why smoothing can be useful: it allows us to tune out the noise and messiness of individual data points and, instead, focus on broader trends in the data. Smoothing splines try to balance both goals, so our regression line stays close to actual price points without overfitting.
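
To make the fit-versus-smoothness trade-off concrete, here is a minimal sketch on simulated data (not the UNH series). It uses base R’s smooth.spline() rather than npreg’s ss(), and the names `fit_wiggly`, `fit_smooth`, and `rss` are chosen just for this illustration.

```r
set.seed(1)

# Simulated noisy series (NOT the UNH data): smooth trend plus noise
x <- 1:60
y <- 550 + 20 * sin(x / 10) + rnorm(60, sd = 5)

# Flexible fit: many effective degrees of freedom, hugs the points
fit_wiggly <- smooth.spline(x, y, df = 40)

# Smooth fit: few effective degrees of freedom, follows the broad trend
fit_smooth <- smooth.spline(x, y, df = 5)

# Residual sum of squares on the data used for fitting
rss <- function(fit) sum((y - predict(fit, x)$y)^2)

rss(fit_wiggly)  # smaller: closer to the individual points
rss(fit_smooth)  # larger: gives up some fit in exchange for smoothness
```

The flexible fit wins on residual error but mostly chases noise, which is exactly the behavior the smoothness penalty is meant to rein in.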

Let’s start by fitting a smoothing spline using the default values provided by the ss() function in the ‘npreg’ package.

library(npreg)

mod.ss <- with(unh, ss(Date, Close))
mod.ss
## 
## Call:
## ss(x = Date, y = Close)
## 
## Smoothing Parameter  spar = -0.2574128   lambda = 8.232749e-10
## Equivalent Degrees of Freedom (Df) 46.9539
## Penalized Criterion (RSS) 350.1739
## Generalized Cross-Validation (GCV) 85.68119
summary(mod.ss)
## 
## Call:
## ss(x = Date, y = Close)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8.13118 -0.81913  0.04112  0.69172  7.18037 
## 
## Approx. Signif. of Parametric Effects:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   583.00     0.8486 687.054 0.000e+00 ***
## x             -59.37     6.3963  -9.283 7.475e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Approx. Signif. of Nonparametric Effects:
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(x)      44.95 29285.3  651.45   29.85 1.403e-09 ***
## Residuals 16.05   350.2   21.82                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 4.672 on 16.05 degrees of freedom
## Multiple R-squared:  0.9885,    Adjusted R-squared:  0.9549
## F-statistic: 29.2 on 45.95 and 16.05 DF,  p-value: 1.628e-09

Smoothing parameter (spar), lambda, and Equivalent Degrees of Freedom (df) are all tuning parameters related to the flexibility and smoothness of our regression line. Spar controls the balance of fit and smoothness. A low spar value would fit the data closely, staying true to as many actual closing prices as possible, while a high spar value would result in a smoother line with less bend. Spar is essentially lambda expressed on a logarithmic scale, so the two move together: the larger the lambda, the heavier the penalty on wiggliness and the smoother the line. Unless we assign our own value, df is calculated as a function of these other parameters. It dictates how complex or simple our model will be: generally speaking, the lower the df, the smoother the model.
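
As a quick sanity check on how spar and lambda relate, the sketch below plugs the spar value printed above into the mapping lambda = 256^(3 * (spar - 1)). Treat that formula as an assumption taken from my reading of the npreg documentation for ss(); the two input numbers are copied from the output above.

```r
# Values printed by the default fit above
spar   <- -0.2574128
lambda <- 8.232749e-10

# Assumed mapping between spar and lambda (see lead-in)
lambda_from_spar <- 256^(3 * (spar - 1))

lambda_from_spar
all.equal(lambda_from_spar, lambda, tolerance = 1e-4)
```

If the mapping holds, the recomputed lambda should agree with the printed one up to rounding in the displayed digits.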

With the default settings, which choose the smoothing parameter by generalized cross-validation (GCV), our df is about 47. Let’s see what that looks like in a graph:

plot(mod.ss, xlab = "Date", ylab = "Close")
with(unh, points(Date, Close))

Now, let’s set the df a bit lower to see how it affects our smoothing spline:

mod.ss1 <- with(unh, ss(Date, Close, df = 5))
mod.ss1
## 
## Call:
## ss(x = Date, y = Close, df = 5)
## 
## Smoothing Parameter  spar =   lambda = 6.514807e-05
## Equivalent Degrees of Freedom (Df) 5
## Penalized Criterion (RSS) 13219.69
## Generalized Cross-Validation (GCV) 247.5744
plot(mod.ss1, xlab = "Date", ylab = "Close")
with(unh, points(Date, Close))

As we can see, this line is significantly smoother. This model is less ‘free’ to bend, and, as a result, misses lots of data points that our last model passed through. In context, this model would less accurately estimate the closing stock price of a given day (as can be seen from the distance between each point and our regression line), but instead finds more of an average of all of our points.

Another factor influencing our df is the number of “knots” in our spline. Splines work by splitting the data into sections (i.e., piecewise) and fitting a curve within each section. “Knots” are the points at which these sections are divided, which in our case are represented by specific days. If we set our number of knots close to the number of days in our data set, we wouldn’t be allowing for much smoothing, since we’d be fitting a separate curve between almost every pair of consecutive days. But if we set our number of knots too low, we might oversimplify and come up with a regression line that ignores true price points. The goal is to find a value that strikes a good balance.

# overfitted, knots = 63
mod.ss2 <- with(unh, ss(Date, Close, nknots = 63))

plot(mod.ss2, xlab = "Date", ylab = "Close")
with(unh, points(Date, Close))

# underfitted, knots = 3
mod.ss3 <- with(unh, ss(Date, Close, nknots = 3))

plot(mod.ss3, xlab = "Date", ylab = "Close")
with(unh, points(Date, Close))

The estimated relationship of time and price between each knot is represented by a polynomial, most often cubic, meaning the estimated line does not necessarily need to be a straight line. These individual curves are then pieced together smoothly to give us our final smoothing spline!
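
The piecewise-cubic idea can be sketched with a plain regression spline, which is a simpler cousin of the smoothing spline used in this post (no smoothness penalty, just cubic pieces joined at knots). The example uses bs() from the splines package that ships with R, simulated data, and knot positions picked purely for illustration.

```r
library(splines)  # ships with base R

set.seed(2)

# Simulated data (NOT the UNH series)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

# Few knots: a handful of cubic pieces; many knots: lots of short pieces.
# The many-knot set contains the few-knot set (4/16, 8/16, 12/16).
knots_few  <- c(0.25, 0.5, 0.75)
knots_many <- (1:15) / 16

fit_few  <- lm(y ~ bs(x, knots = knots_few,  degree = 3))
fit_many <- lm(y ~ bs(x, knots = knots_many, degree = 3))

rss_few  <- sum(resid(fit_few)^2)
rss_many <- sum(resid(fit_many)^2)

c(few = rss_few, many = rss_many)
```

Because the larger knot set contains the smaller one, the many-knot fit can never have a higher residual sum of squares, which is why knot count alone is not a safe guide to a good fit.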

When it comes to assessing our smoothing splines, we can’t rely on usual metrics of evaluation like Mean Squared Error or R-squared values. This is because these metrics, when computed on the same data used to fit the model, tell us how closely our model reproduces the observed stock prices, which we can improve simply by raising the number of knots and degrees of freedom. Since we have a more subjective goal of finding a regression line that captures the essence of the data, let’s take a more subjective approach to assessing which df is optimal.

Let’s select three df values, not too high or low: 10, 20, and 30.

We’ll use Bootstrap Aggregation (bagging) to assess the robustness of each value and determine which one results in the most useful/explanatory smoothing spline.

mod.ss10 <- with(unh, ss(Date, Close, df = 10))
mod.ss20 <- with(unh, ss(Date, Close, df = 20))
mod.ss30 <- with(unh, ss(Date, Close, df = 30))
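
One way a bootstrap comparison of df values could be set up is sketched below. This is an illustration under stated assumptions, not necessarily the procedure used in the rest of this analysis: it uses simulated data instead of the UNH series, base R’s smooth.spline() instead of ss(), and a hypothetical helper `boot_sd()` that measures how much the fitted curve moves from one resample to the next.

```r
set.seed(3)

# Simulated series (NOT the UNH data)
x <- 1:120
y <- 580 - 0.25 * x + 15 * sin(x / 15) + rnorm(120, sd = 6)

# For one df value: refit on B bootstrap resamples of the (x, y) pairs,
# record each fitted curve on the original grid, and summarise how much
# the curve varies across resamples.
boot_sd <- function(df, B = 200) {
  fits <- replicate(B, {
    idx <- sample(seq_along(x), replace = TRUE)
    fit <- smooth.spline(x[idx], y[idx], df = df)
    predict(fit, x)$y
  })
  mean(apply(fits, 1, sd))  # average pointwise SD across the grid
}

df_values <- c(10, 20, 30)
sds <- sapply(df_values, boot_sd)
names(sds) <- paste0("df_", df_values)
sds
```

A df value whose curve changes little across resamples is more robust; averaging the resampled curves for a given df would give the bagged estimate.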

Sources: