You can assume that I (as your manager) have the same degree of technical knowledge that I do as your professor in this class: you will not need to “talk down” to me, and since I am the one making the requests, you will not even need to explain the requests for me, just complete them.

Hi team. This sprint we’ll be working with stock market data on large, publicly traded companies. You can find the data in Canvas under the name “stockreturns.csv”:

Okay, now the job. Shareholders are filing suit against the companyLinks to an external site., and allege the following “fraud on the market” scenario:

First I’d like you to create a daily dataset over the entire trading period with the following columns:

This dataset should have one day per row, so ~3700 rows total.

# CUSIP AMD: 00790310 (ADVANCED MICRO DEVICES INC)
# CUSIP NVIDIA: 67066G10 (NVIDIA CORP)
# CUSIP IBM: 45920010 (INTERNATIONAL BUSINESS MACHS COR) 
# CUSIP Analog Devices: 03265410 (ANALOG DEVICES INC)
# CUSIP Intel: 45814010 (INTEL CORP)
# CUSIP Qualcomm: 74752510 (QUALCOMM INC)
company_names <- c("ADVANCED MICRO DEVICES INC", "NVIDIA CORP", "INTERNATIONAL BUSINESS MACHS COR", "ANALOG DEVICES INC", "INTEL CORP", "QUALCOMM INC")
cusips <- c("00790310", "67066G10", "45920010", "03265410", "45814010", "74752510")

file_path <- "/Users/sarakochanny/MS-ADS/Statistical Models for Data Science/Second Assignment/stockreturns.csv"
stock_data <- read_csv(file_path, show_col_types = FALSE)
## New names:
## • `` -> `...1`
stock_data_tech <- stock_data %>%
  filter(cusip %in% cusips)
print(unique(stock_data_tech$hcomnam))
## [1] "NVIDIA CORP"                      "QUALCOMM INC"                    
## [3] "ANALOG DEVICES INC"               "INTERNATIONAL BUSINESS MACHS COR"
## [5] "ADVANCED MICRO DEVICES INC"       "INTEL CORP"
# Step 1: Filter just AMD rows
amd <- stock_data_tech %>%
  filter(hcomnam == "ADVANCED MICRO DEVICES INC")

# Step 2: Define peer companies
peer_companies <- c(
  "NVIDIA CORP",
  "INTERNATIONAL BUSINESS MACHS COR",
  "ANALOG DEVICES INC",
  "INTEL CORP",
  "QUALCOMM INC"
)

# Step 3: Filter rows for other tech companies (excluding AMD)
other_tech <- stock_data_tech %>%
  filter(hcomnam %in% peer_companies)

# Step 4: Group by date and compute peer metrics
other_tech_summary <- other_tech %>%
  group_by(date) %>%
  summarise(
    peer_ewavg_return = mean(ret, na.rm = TRUE),   # Equal-weighted average return
    peer_tot_vol = sum(vol, na.rm = TRUE),         # Total volume
    .groups = "drop"
  )

# Step 5: Join the summary stats to AMD dataframe by date
amd_joined <- amd %>%
  left_join(other_tech_summary, by = "date")

# rename columns and drop unneeded columsn
amd_joined <- amd_joined %>%
  select(-c(...1, permco, cusip, hcomnam)) %>%
  rename(
    amd_vol = vol,
    amd_ret = ret
  )

Next, I’d like you to consider a regression model which econometricians sometimes use to explain daily stock market movements. The basic idea is that AMD’s daily return has a fixed component, a component based on how its peers are doing, a component based on broad market conditions for all big companies, and a random error term. Let’s see if we can use this model to answer some shareholder questions as they prepare for their lawsuit against the company.

r_amd_t = beta0 + beta1 * r_peer_t + beta2 * r_sp500_t + error_t

Q1: Let’s test the model out, using only the data from 2022. Run the regression, and find point estimates for each beta (including the intercept), and determine whether each estimate is significantly different than zero using a 10% threshold. Create a small table containing the parameter estimates and also R^2.

# Filter data just for year 2022
amd_joined_2022 <- amd_joined %>%
  filter(format(date, "%Y") == "2022")

# Train linear model to predict AMD's daily return based on its peer's weighted average returns and the S&P500's performance
stocks.lm <- lm(amd_ret ~ 1 + peer_ewavg_return + sprtrn, data=amd_joined_2022)

summary(stocks.lm)
## 
## Call:
## lm(formula = amd_ret ~ 1 + peer_ewavg_return + sprtrn, data = amd_joined_2022)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.068315 -0.010089 -0.000626  0.012102  0.057044 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.0008766  0.0011687  -0.750   0.4539    
## peer_ewavg_return  1.2378875  0.1152885  10.737   <2e-16 ***
## sprtrn             0.3877553  0.1737241   2.232   0.0265 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01849 on 248 degrees of freedom
## Multiple R-squared:  0.771,  Adjusted R-squared:  0.7692 
## F-statistic: 417.6 on 2 and 248 DF,  p-value: < 2.2e-16

Describe each figure reported in the table in plain language. What does each estimate and R^2 represent or mean? What do you learn or what are your conclusions from those specific values? The point estimates are the fitting coefficients. Intercept Estimate: -0.0008766 –> correspond to beta0 peer_ewavg_return Estimate: 1.2378875 –> correspond to beta1 sprtrn Estimate: 0.3877553 –> correspond to beta2 R-squared: 0.771 Adjusted R-squared: 0.7692

The t value (or t-statistic) tells you how many standard errors the estimated coefficient is away from zero. t = Estimate / Standard Error A large t value (positive or negative) means the coefficient is far from 0, meaning it is likely to be statistically significant. Pr(>|t|) is the p-value, the probability of observing a t-statistic as extreme as the one computed if the true coefficient was 0 (under the null hypothesis). The peer_ewavg_return companies has a large t value – 10.737 SDs away from zero. This is very unlikely to happen if its true value was 0, and thus its p-value is very low: <2e-16 *** The sprtrn has a much smaller t value of 2.232 SDs away from zero. This is still under the 10% threshold of probability at 0.0265 (2.65%) probability to be seen.

R-squared is the coefficient of determination It measures the proportion of variance captured by the model. R-squared = SSM / SST.

SST is the total sum of squares: the sum of the true observed values of y minus the mean of the observed values (y-bar) squared. SST is variance(y). SST <- sum((y-mean(y))^2) SSM is the model sum of squares: the sum of the predicted values (y-hat) minus the mean of the observed values (y-bar) squared. SSM is variance(y-hat). SSM <- sum((yhat-mean(y))^2) SSE is the sum of squared errors: the sum of the true observed values of y minus the average of the fitted values (y-hat) squared. SSE is variance(e) aka variance(residuals) aka residual variance. SSE <- sum((y-yhat)^2) So by taking the SSM / SST, we get the variance of the predicted values(difference of predicted values from mean of true values squared) relative variance of the true values (difference from observed mean squared).

So R-squared is what proportion of the true variance is explained by the variance of the predicted values. And since our fitting of the model came from the peer_ewavg_return (performance of similar peer companies) and performance of the S&P/stock market overall, we are saying that 0.771 of the total variance of the observed data can be explained by our model, aka by the performance of how other peer companies are doing and how the stock market itself is doing.

Conduct a test of the normality of the residuals, a test for autocorrelation among the residuals, and a test of heteroskedasticity among the residuals. Report the results of each test and what they imply for your model.

res <- stocks.lm$residuals
e <- res
yhat <- stocks.lm$fitted.values
betas <- stocks.lm$coefficients
y <- c(amd_joined_2022$amd_ret)
sigma2 <- summary(stocks.lm)$sigma^2
# test for normality
# plot(y~yhat)
# abline(reg=lm(y~yhat),lty=2)
plot(e~yhat,ylab='Residuals', main='Residuals vs. Fitted values (Normality)')
abline(h=0,lty=2)

plot(e,ylab='Residuals', main='Testing for Normality')
abline(h=0,lty=2)

hist(e)

# Test the standardized residuals (res/sigma) and comparing to a standard normal distribution N(0,1)
ks.test(e/sqrt(sigma2),pnorm) 
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  e/sqrt(sigma2)
## D = 0.0603, p-value = 0.321
## alternative hypothesis: two-sided
# Issue with K-S test:
# 1. Assumes known parameters: The K–S test compares your data to a fully specified distribution (e.g., Normal(μ, σ) with known mean and SD).
# But your residuals have estimated mean and variance (from your model), violating this assumption.
# 2. Overly sensitive with large samples: Like many tests, K–S can detect very small, irrelevant deviations from normality in large samples, even if those deviations don’t affect your model's validity.
# Recommended to use Shapiro-Wilk Test instead
shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.98623, p-value = 0.01625
print(mean(e))
## [1] -1.092841e-18

Normality

Normality: Assumes residuals are normally distributed, especially important for valid inference (e.g., confidence intervals, p-values).

Histogram appears to show roughly normal distribution, but it looks slightly skewed to the right. The mean appears to be slightly below 0, which fits with what we see in the Residuals vs. Fitted Values plot (gap at around -0.02). It is negative, but its still close to 0: -1.092841e-18.

Kolmogorox-Smirnow test for normality: The issue with the K-S test is that it assumes known parameters, but our residuals have estimated mean and variance, which violates the assumption of the model. So we can use Shapiro-Wilk instead.

Shapiro-Wilk test specifically tests for Normality, which relies on the W statistic which is calculated through Monte Carlo simulation. A W statistic closer to 1 means more normal-like. Our W value is 0.98623 with a p-value of 0.01625, which is lower than the usual 0.05 cutoff. The Null Hypothesis is that these data come from a normal distribution. The p-value is the probability of seeing a W this extreme if data were truly normal. It’s a very low p-value, which means that there is only a 1.625% chance of seeing this value if the data were truly normal, thus it seems like the residuals are NOT normal. Though 0.986 looks is close to 1, but with our sample size this deviation is still statistically significant.

# test for autocorrelation
plot(e, ylab = "Residuals (e)",
     main = "Residuals (e) vs Residuals (e) (Autocorrelation)")
abline(h = 0, col = "red")

# Autocorrelation Function Plot
acf(res, lag.max = 20, main = "ACF of Residuals (Lag 1)")

# Lag 1 residual plot
# Create lagged residuals
res_lag1 <- res[-length(res)]      # Residuals at time t-1
res_now  <- res[-1]                # Residuals at time t

# Plot lag-1 relationship
plot(res_lag1, res_now,
     xlab = "Residual at time t−1",
     ylab = "Residual at time t",
     main = "Lag-1 Residual Plot",
     pch = 19, col = "steelblue")

# Add trend line
abline(lm(res_now ~ res_lag1), col = "red", lwd = 2)

# Run Durbin-Watson test
dwtest(stocks.lm) 
## 
##  Durbin-Watson test
## 
## data:  stocks.lm
## DW = 1.9573, p-value = 0.3675
## alternative hypothesis: true autocorrelation is greater than 0
# It takes your fitted linear model object (like stocks.lm), then extracts the residuals, and tests whether those residuals show first-order autocorrelation (i.e., whether residual at time t is correlated with residual at time t−1)
# Null hypothesis (H₀): No autocorrelation
# Alternative: Positive or negative autocorrelation
# DW statistic ranges from 0 to 4:
# ~2 = no autocorrelation
# <2 = positive autocorrelation
# 2 = negative autocorrelation
# If p-value < 0.05, you reject H₀ → there is autocorrelation

# Run Ljung-Box test over many lags and plot the p-value result
pvals <- sapply(1:20, function(lag) Box.test(res, lag = lag, type = "Ljung-Box")$p.value)
plot(1:20, pvals, type = "b", pch = 19,
     xlab = "Lag", ylab = "Ljung-Box p-value",
     main = "Ljung-Box Test at Multiple Lags")
abline(h = 0.05, col = "red", lty = 2)

Autocorrelation

In the plot, there does seem to be some interesting behavior of the residuals (seem to be peaks and valleys with more/less error). This could be reflective of periods of higher or lower growth/returns than expected by the model. The periodicity it seems like we can almost see may be reflections of seasonal trading patterns which consistently diverge from the model’s predictions. Given how this is one single year’s data and the periodicity seems to show roughly 4 periods, seems to be that these could correspond to quarterly earning reports. There’s also some clustering of some residuals, like around the 150 index mark. But this could just be from true correlations in trading across many days. Perhaps reflective of good/bad news, or global events like the pandemic.

The Durbin-Watson test tests lag 1 aka first order autocorrelation. A DW value of 2 indicates no autocorrelation. Our value of 1.9573 is very close to 2 with a p-value of 0.3675, indicating that there is very little autocorrelation. Negative AC implies that the model is underpredicting the next day and we swing from overestimating to under-estimating, which is a sign of “overcorrection” (model responds too strongly to recent errors). So it sounds like we overpredicted the stock market slightly with our model. Cute, just like most investors do.

Plotting my current residuals vs. lag 1 residuals, I see only the SLIGHTEST positive slope. This kinda makes sense because the previous day’s stock market performance might roll over into the next day.

The ACF plot does seem to show that there is some auto correlation in the lag residuals. Don’t totally know why.

The Ljung-Box test shows that while there does appear to be more autocorrelation after 10 days and then it climbs again, this isn’t statisticaly significant. Perhaps indicative of some weird trading behavior.

# test for heteroskedasticity 
# The error variance displays homoskedasticity. The variance of Y around its mean is exactly the same size for all predictor choices. This does not mean that the errors are all the same size, simply that they are all drawn from the same distribution. Rather than saying that each observation has a unique error variance, sigma_i^2, we say they all share the same error variance sigma^2.
plot(stocks.lm$fitted.values, resid(stocks.lm),
     xlab = "Fitted Values (yhat)",
     ylab = "Residuals (e)",
     main = "Testing Conditional Heteroskedasticity: Residuals (e) vs Fitted (yhat)")
abline(h = 0, col = "red")

# If the spread of residuals increases/decreases with fitted values (funnel shape), that’s a sign of heteroskedasticity.

# Bruesch-Pagan Test
bptest(stocks.lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  stocks.lm
## BP = 1.9464, df = 2, p-value = 0.3779
# Plotting residuals in order to look for regime change
plot(e, ylab = "Residuals (e)",
     main = "Residuals (e) vs Residuals (e) (Heteroskedasticity Regimen Change)")
abline(h = 0, col = "red")

#### Heteroskedasticity We can plot the residuals in order to look for a regime change/structural break as well as unconditional heteroskedasticity. I don’t really see a regimen change, though perhaps the errors do tend to not be as extreme past the 150 day mark. I think I do perceive that the errors grow and shrink with a sort of periodicity around every 50 or 75 trading days or so. This may correspond to quarterly earning reports, and I think we can say there is some unconditional heteroskedasticity in the model.

We can plot the residuals as a function of the fitted values/predictions to look for conditional heteroskedasticity. If it were present, we would see a “trumpet” shape. We don’t really appear to see one, at least using the human eye. To test more subtle HS, we can use a Breusch-Pagan test.

The Breusch-Pagan test: two-stage approach to identify heteroskedasticity associated with the predictors. Our BP value is 1.9464 with p-value 0.3779.

Q2) Re-run your regression, focusing on the year immediately prior to the first announcement (the dates from October 12th, 2011 through October 11th, 2012.)

start_date <- as.Date("2011-10-12")
end_date <- as.Date("2012-10-11")
amd_joined_2011 <- amd_joined %>%
  filter(date >= start_date & date <= end_date)

# Train linear model to predict AMD's daily return based on its peer's weighted average returns and the S&P500's performance
stocks_2011.lm <- lm(amd_ret ~ 1 + peer_ewavg_return + sprtrn, data=amd_joined_2011)

summary(stocks_2011.lm)
## 
## Call:
## lm(formula = amd_ret ~ 1 + peer_ewavg_return + sprtrn, data = amd_joined_2011)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.100432 -0.010862 -0.001364  0.010123  0.092715 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.002168   0.001370  -1.583  0.11479    
## peer_ewavg_return  1.133231   0.200485   5.652  4.3e-08 ***
## sprtrn             0.854096   0.259213   3.295  0.00113 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02166 on 250 degrees of freedom
## Multiple R-squared:  0.5369, Adjusted R-squared:  0.5332 
## F-statistic: 144.9 on 2 and 250 DF,  p-value: < 2.2e-16

Report any interesting differences you see in the parameter estimates or R^2 from your earlier 2022 regression. 2022 Regression: Multiple R-squared: 0.771, Adjusted R-squared: 0.7692 2011 Regression: Multiple R-squared: 0.5369, Adjusted R-squared: 0.5332 So the R-squared value is the correlation between data (x) and predictions (y). How much of our the total variance (SST) is captured by our model’s predictions’ variance (SSM). Our model is predicting AMD’s daily return based on the performance of its peer’s (peer_ewavg_return) the rest of the stock market (sprtrn). In 2022, our model was able to capture 77.1% of the variance, meaning that the information from the peers/S&P was enough to predict how AMD was doing. AMD’s performance was correlated with these factors by 77% roughly. In 2011, the R-squared value dropped to 0.5369, meaning we only captured ~53% of the variance with our model. In 2011, the performance of AMD’s stock was less correlated with the performance of its peers and the stock market by almost 25%.

2022 estimates: Estimate Std. Error t value Pr(>|t|)
peer_ewavg_return 1.2378875 0.1152885 10.737 <2e-16 sprtrn 0.3877553 0.1737241 2.232 0.0265
2011 estimates: peer_ewavg_return 1.133231 0.200485 5.652 4.3e-08
* sprtrn 0.854096 0.259213 3.295 0.00113 **

The standard error for our estimates is higher in 2011 than in 2022, indicating that our parameters are not as good at predicting AMD’s return (more errors) Compared to 2022, in 2011 the beta for sprtrn (S&P’s performance) is given more weight in the model (2022: ~0.39, 2011: ~0.85), whereas the beta for peer_ewavg_return (peers’ performance) dropped (2022: ~1.23, 2011: 1.13). This tells us that the performance of the S&P was more predictive of AMD’s return in 2011 than 2022, echoed by the fact that it was give ** significance compared to * in 2022.

Report whether the model passes tests of the IID assumptions.

res <- stocks_2011.lm$residuals
e <- res
yhat <- stocks_2011.lm$fitted.values
betas <- stocks_2011.lm$coefficients
y <- c(amd_joined_2011$amd_ret)
sigma2 <- summary(stocks_2011.lm)$sigma^2
# test for normality
# plot(y~yhat)
# abline(reg=lm(y~yhat),lty=2)
plot(e~yhat,ylab='Residuals', main='Residuals vs. Fitted values (Normality)')
abline(h=0,lty=2)

plot(e,ylab='Residuals', main='Testing for Normality')
abline(h=0,lty=2)

hist(e)

# Test the standardized residuals (res/sigma) and comparing to a standard normal distribution N(0,1)
ks.test(e/sqrt(sigma2),pnorm) 
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  e/sqrt(sigma2)
## D = 0.086826, p-value = 0.04409
## alternative hypothesis: two-sided
shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.92847, p-value = 1.045e-09
print(amd_joined_2011$date[175])
## [1] "2012-06-21"
print(amd_joined_2011$date[190])
## [1] "2012-07-13"
print(mean(e))
## [1] 1.508455e-18

Normality

Plotting our Residuals vs. Fitting values, they all do seem to cluster around the mean, which is slightly lower than 0.00. EXCEPT it’s not, it is actually super close to 0: 1.508455e-18, but there are these extreme positive outliers, suggesting that there are certain days that the model underpredicted how well the market did.

Plotting the residuals in order, we see a trumpet shape flaring our around the 175th day of trading or so, which is 6/21/2012. 190th is 2012-07-13.

Our histogram of residuals looks more normal than the previous one though, and the mean seems to be centered around 0.00 instead of below as before, but we do see a longer tail, corresponding to the outliers seen in the previous plot.

The K-S test isn’t the best test for normality because our residuals have estimated mean and variance, which violates the assumptions of the test, but it’s delta value is 0.086826 with a p-value of 0.04409. This is below the 0.05 significance threshold, so we reject H0 that our data come from a Normal Distribution. Likewise, the Shapiro-Wilk Test W-statistic is 0.92847 (more extreme than our previous test of ~0.98) with a more extreme p-value of 1.045e-09, telling us it is VERY unlikely these data came from a normal distribution.

# test for autocorrelation
plot(e, ylab = "Residuals (e)",
     main = "Residuals (e) vs Residuals (e) (Autocorrelation)")
abline(h = 0, col = "red")

# Autocorrelation Function Plot
acf(res, lag.max = 20, main = "ACF of Residuals (Lag 1)")

# Lag 1 residual plot
# Create lagged residuals
res_lag1 <- res[-length(res)]      # Residuals at time t-1
res_now  <- res[-1]                # Residuals at time t

# Plot lag-1 relationship
plot(res_lag1, res_now,
     xlab = "Residual at time t−1",
     ylab = "Residual at time t",
     main = "Lag-1 Residual Plot",
     pch = 19, col = "steelblue")

# Add trend line
abline(lm(res_now ~ res_lag1), col = "red", lwd = 2)

# Run Durbin-Watson test
dwtest(stocks_2011.lm) 
## 
##  Durbin-Watson test
## 
## data:  stocks_2011.lm
## DW = 1.9029, p-value = 0.2198
## alternative hypothesis: true autocorrelation is greater than 0
# Run Ljung-Box test over many lags and plot the p-value result
pvals <- sapply(1:20, function(lag) Box.test(res, lag = lag, type = "Ljung-Box")$p.value)
plot(1:20, pvals, type = "b", pch = 19,
     xlab = "Lag", ylab = "Ljung-Box p-value",
     main = "Ljung-Box Test at Multiple Lags")
abline(h = 0.05, col = "red", lty = 2)

#### Autocorrelation The lag 1 autocorrelation is higher in this year than in 2022. The bar in the ACF plot is higher and the lag-1 residual plot has a steeper slope it appears. The DW value is 1.9029 which is lower than the previous test, but not by much, and the p-value is slightly lower but it’s not significant. So I would say there is only slightly more autocorrelation than before (at least at lag 1).

But the Ljung-Box test seems to indicate quite a bit MORE autocorrelation around the 11-14 days marks, with a p-value between 0.2-0.3 percent. Still not significant, but increased compared to before. This would be more autocorrelation 2-3 trading weeks later. There might be some sort of cyclical pattern in AMD’s returns, maybe regular monthly movements or options expiration cycles that cause correlation to show up 10-15 days apart.

# test for heteroskedasticity 
# The error variance displays homoskedasticity. The variance of Y around its mean is exactly the same size for all predictor choices. This does not mean that the errors are all the same size, simply that they are all drawn from the same distribution. Rather than saying that each observation has a unique error variance, sigma_i^2, we say they all share the same error variance sigma^2.
plot(stocks_2011.lm$fitted.values, resid(stocks_2011.lm),
     xlab = "Fitted Values (yhat)",
     ylab = "Residuals (e)",
     main = "Testing Conditional Heteroskedasticity: Residuals (e) vs Fitted (yhat)")
abline(h = 0, col = "red")

# If the spread of residuals increases/decreases with fitted values (funnel shape), that’s a sign of heteroskedasticity.

# Bruesch-Pagan Test
bptest(stocks_2011.lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  stocks_2011.lm
## BP = 0.26636, df = 2, p-value = 0.8753
# Plotting residuals in order to look for regime change
plot(e, ylab = "Residuals (e)",
     main = "Residuals (e) vs Residuals (e) (Heteroskedasticity Regimen Change)")
abline(h = 0, col = "red")

print(amd_joined_2011$date[240])
## [1] "2012-09-24"

Heteroskedasticity

I don’t see anything weird about the residuals vs. fitted values. If there were conditional heteroskedasticity, I would expect us to see a trumpet shape but we don’t. The Bruesch-Pagan test backs this up: BP=0.26636 with p-value of 0.8753.

We do see a trumpet shape in the ordered Residuals plot, indicating a potential structural break/regime change. Something changed at the beginning of summer 2012 it would appear, but more drastically at the end of September 2012 going into October, which fits with what we are investigating.

Use your model and the market activity to predict how AMD shares would be expected to move on 2012-10-12 and 2012-10-19. Construct 95% confidence intervals for these predictions (not the mean response). Create a table containing the following information for both days: the actual AMD return, the “worst case” return from your 95% CI, and the difference between them, which we can call the “excess return” or the “abnormal return” of AMD stock on these two days. Provide your understanding of what these two excess returns represent from a legal/factual standpoint.

# Step 1: Filter or create prediction data
oct_data <- amd_joined[amd_joined$date %in% as.Date(c("2012-10-12", "2012-10-19")), ]

# Step 2: Make predictions with 95% confidence intervals
predictions <- predict(
  stocks_2011.lm,
  newdata = oct_data,
  interval = "confidence",  # Use "prediction" for prediction intervals
  level = 0.95
)

results <- data.frame(
  date = oct_data$date,
  actual_amd_return = oct_data$amd_ret,
  abnormal_return = predictions[, "lwr"],
  excess_return = oct_data$amd_ret - predictions[, "lwr"]
)

print(results)
##         date actual_amd_return abnormal_return excess_return
## 1 2012-10-12         -0.143750    -0.009782725    -0.1339673
## 2 2012-10-19         -0.167939    -0.050577258    -0.1173617

AMD excess returns in October

So the excess return is how much higher than the lowest/worst bound the return actually was. How much it “exceeded” the expected lower bound.

The predicted AMD return lower bound is the worst case scenario. It’s “worst case” was expected to basically be a loss of -0.0097, a loss around 1%. It’s actual return was a loss of 14%, which is ~13% worse than it was expected to do (on 2012-10-12). It’s a similar story for 2012-10-19. It was expected to be a loss of ~5%, but it actually lost ~17%, an excess loss of 11%. On both of these days, AMD performed worse than even the worst-case prediction as expected by the model. AMD’s loss was worse than what the model predicted, even at the 95% confidence level. This might signal a company-specific issue.

In litigation/SEC context, such an abnormal return might be used to argue that investors were affected by undisclosed or material information. When a stock drops sharply and unexpectdely, regulators/shareholders might investigate whether 1) There was material non-public information that should’ve been disclosed 2) The company misled investors about performance, products, or risks 3) Insiders traded based on non-public forecasts.

# Step 1: Define abnormal returns (from your excess return table)
abret1 <- results$excess_return[results$date == as.Date("2012-10-12")]
abret2 <- results$excess_return[results$date == as.Date("2012-10-19")]

# Step 2: Compute cumulative abnormal return (CAR)
CAR <- (1 + abret1) * (1 + abret2) - 1

# Step 3: Get AMD stock price on 2012-10-11
price_2012_10_11 <- amd_joined$prc[amd_joined$date == as.Date("2012-10-11")]

# Step 4: Set shares outstanding on 2012-10-11 (example: 750 million)
shares_outstanding <- amd_joined$shrout[amd_joined$date == as.Date("2012-10-11")]  

# Step 5: Compute market capitalization 
market_cap <- price_2012_10_11 * shares_outstanding

# Step 6: Estimate total shareholder damages
estimated_damages <- CAR * market_cap

# Step 7: Print results
cat("Cumulative abnormal return (CAR):", round(CAR * 100, 3), "%\n")
## Cumulative abnormal return (CAR): -23.561 %
cat("Market cap on 2012-10-11: $", formatC(market_cap, format = "f", big.mark = ",", digits = 0), "\n")
## Market cap on 2012-10-11: $ 2,264,176
cat("Estimated shareholder damages: $", formatC(estimated_damages, format = "f", big.mark = ",", digits = 0), "\n")
## Estimated shareholder damages: $ -533,454

Market capitalization (or market cap) is the total value of a publicly traded company’s equity, based on the current stock price. The actual formula is Market Cap = Share Price x Shares Outstanding Market cap tells us: How much the equity market thinks the company is worth today and is a snapshot of investor valuation If a negative event (like fraud, bad news, or lack of disclosure) causes the stock to fall, the drop in market cap estimates how much shareholder wealth was lost. When we multiply the cumulative abnormal return by the market cap, we’re estimating how much value AMD shareholders lost beyond what the market would have expected.

Cumulative abnormal return (CAR): -23.561 % Market cap on 2012-10-11: $ 2,264,176 Estimated shareholder damages: $ -533,454

Okay, now the job. Shareholders are filing suit against the companyLinks to an external site., and allege the following “fraud on the market” scenario:

In 2011, AMD faced supply issues for its new “Llano” chips, creating production shortages AMD falsely told its shareholders that these supply issues had been solved in late 2011 and early 2012 In reality, the supply issues had continued, and by the time all the back-orders had finally been filled, demand for the chips had evaporated, leaving the company with $100mm in unsold stock The full cost of this writedown was learned by investors during two company announcements on October 12th, 2012Links to an external site. and October 18th, 2012Links to an external site.. (The second announcement occurred at 5pm, and so the market did not react to the news until October 19th, 2012.)

Q3) The method above is called a “stock event study” and has often been used to estimate damages in financial litigation. Price returns are well-studied and supported by a huge body of econometric literature. In contrast, the trading volume of stocks is less well-understood. Let’s see how well we can model this variable in our data Predict AMD daily trading volume during all trade days in 2022 using whatever you can. Try to build a model with R^2 above 0.4. Feel free to transform your predictors, your response, explore lags, differences, rates or proportions, etc: use the whole bag of tricks. Describe the precise predictors and response variables used in your final model, the thought process which led you to your final model, and the R^2. (It’s okay if you cannot reach an R^2 of 0.4!)

# just get 2022 data
amd_2022 <- amd_joined[format(amd_joined$date, "%Y") == "2022", ]

# Plot the Raw Volume (Target Variable)
plot(amd_2022$date, amd_2022$amd_vol, type = "l",
     main = "AMD Daily Trading Volume (2022)", ylab = "Volume", xlab = "Date")

# Log-transform the volume, because it's often often right-skewed and heteroskedastic.
# use log_val as the response variable
amd_2022$log_vol <- log(amd_2022$amd_vol)

# Plot log(volume) vs. return
plot(amd_2022$amd_ret, amd_2022$log_vol,
     main = "Log Volume vs AMD Return", xlab = "AMD Daily Return", ylab = "Log Volume")

# adding squared term
amd_2022$ret_sq <- amd_2022$amd_ret^2

# try lags of return and volume
# Trading volume may depend on momentum or yesterday’s activity
amd_2022$lag_ret <- dplyr::lag(amd_2022$amd_ret, 1)
amd_2022$lag_vol <- dplyr::lag(amd_2022$log_vol, 1)

# do S&P return as well, but squared
amd_2022$sp_ret_sq <- amd_2022$sprtrn^2

model <- lm(log_vol ~ amd_ret + ret_sq + lag_ret + lag_vol + sprtrn + sp_ret_sq, data = amd_2022)
summary(model)
## 
## Call:
## lm(formula = log_vol ~ amd_ret + ret_sq + lag_ret + lag_vol + 
##     sprtrn + sp_ret_sq, data = amd_2022)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11786 -0.09726 -0.00642  0.10589  0.59591 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.77201    0.70200   6.798 8.17e-11 ***
## amd_ret        1.01470    0.52965   1.916   0.0566 .  
## ret_sq        43.71428    6.31894   6.918 4.03e-11 ***
## lag_ret        0.03460    0.30086   0.115   0.9085    
## lag_vol        0.73685    0.03843  19.176  < 2e-16 ***
## sprtrn        -1.42766    1.33375  -1.070   0.2855    
## sp_ret_sq   -105.92872   43.47583  -2.436   0.0155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.182 on 243 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6581, Adjusted R-squared:  0.6496 
## F-statistic: 77.95 on 6 and 243 DF,  p-value: < 2.2e-16

lag_vol is the most predictive

So… is using lagged volume (lag_vol) data leakage? ✅ No — not if you’re using it correctly You’re predicting today’s volume using yesterday’s volume, which is totally valid. That’s not leakage, it’s called using autoregressive structure, and it’s a standard time-series modeling technique.

Using ret_sq, the return squared, we turned directionality into magnitude. Whether the return is +5% or −5%, ret_sq = 0.0025 either way. So this captures the idea that high volatility → high trading volume, no matter the direction. Volatility drives attention: Traders, bots, and retail investors react more when a stock moves a lot. Big up or down days often generate volume spikes. Momentum traders jump in when a stock moves a lot.

This is consistent with the “mixture of distributions hypothesis” in finance: Volume is driven by the arrival of new information, and volatility is a proxy for that.

ret_sq is acting like a volatility proxy. ret_sq coefficient of 43.71: This means that for every 1-unit increase in ret_sq, log-volume goes up by ~44 units — but because returns are small (~±0.05), ret_sq is usually tiny. Still, the impact is very large when AMD has a volatile day.

# exploring each coefficient's impact on R2

# Compute relative importance
rel <- calc.relimp(model, type = "lmg", rela = TRUE)  # 'lmg' = R² decomposition like Shapley values

# View result
print(rel)
## Response variable: log_vol 
## Total response variance: 0.09454724 
## Analysis based on 250 observations 
## 
## 6 Regressors: 
## amd_ret ret_sq lag_ret lag_vol sprtrn sp_ret_sq 
## Proportion of variance explained by model: 65.81%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                   lmg
## amd_ret   0.008054427
## ret_sq    0.136098573
## lag_ret   0.001757375
## lag_vol   0.826980685
## sprtrn    0.002437139
## sp_ret_sq 0.024671800
## 
## Average coefficients for different model sizes: 
## 
##                    1X         2Xs         3Xs        4Xs         5Xs
## amd_ret     0.5840257   0.6621267  0.71872299  0.7825081   0.8769460
## ret_sq     43.5643334  42.7358580 42.29895346 42.3537935  42.8895634
## lag_ret     0.3907060   0.2903383  0.20638943  0.1356856   0.0782438
## lag_vol     0.7690422   0.7604814  0.75324215  0.7471878   0.7419634
## sprtrn      0.7306303   0.3293380 -0.03039273 -0.4120178  -0.8684948
## sp_ret_sq 158.2579986 100.2107211 45.25732560 -7.5925643 -58.4174556
##                     6Xs
## amd_ret      1.01469971
## ret_sq      43.71427838
## lag_ret      0.03460258
## lag_vol      0.73685406
## sprtrn      -1.42766330
## sp_ret_sq -105.92871969
avPlots(model)

# just trying log_vol

model <- lm(log_vol ~ lag_vol, data = amd_2022)
summary(model)
## 
## Call:
## lm(formula = log_vol ~ lag_vol, data = amd_2022)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15796 -0.11919 -0.01103  0.10432  0.66498 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.22181    0.76523   5.517 8.66e-08 ***
## lag_vol      0.76904    0.04184  18.381  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2005 on 248 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5767, Adjusted R-squared:  0.575 
## F-statistic: 337.9 on 1 and 248 DF,  p-value: < 2.2e-16

Okay so by itself, lag-1 volume has an R-squared of 0.5767, which is over 0.4.

# just trying ret_sq

model <- lm(log_vol ~ ret_sq, data = amd_2022)
summary(model)
## 
## Call:
## lm(formula = log_vol ~ ret_sq, data = amd_2022)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15817 -0.14883  0.00901  0.17930  0.67396 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.21995    0.02137 852.459  < 2e-16 ***
## ret_sq      43.43993    7.48280   5.805 1.95e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2895 on 249 degrees of freedom
## Multiple R-squared:  0.1192, Adjusted R-squared:  0.1157 
## F-statistic:  33.7 on 1 and 249 DF,  p-value: 1.949e-08