knitr::opts_chunk$set(echo = TRUE)

Create daily return dataset

setwd("~/Desktop/Statistical Model/assignment_2")
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
stockreturns <- read_csv("stockreturns.csv")
## New names:
## • `` -> `...1`
## Rows: 1561542 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): cusip, hcomnam
## dbl  (7): ...1, permco, prc, vol, ret, shrout, sprtrn
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
target_5 <- c("ANALOG DEVICES INC", "INTERNATIONAL BUSINESS MACHS COR", "INTEL CORP", "NVIDIA CORP", "QUALCOMM INC")

target_5_by_date <- stockreturns %>%
  filter(hcomnam %in% target_5) %>%
  group_by(date) %>%
  summarise(
    avg_ret_5 = mean(ret, na.rm=TRUE),
    sum_vol_5 = sum(vol, na.rm=TRUE),
  )

amd_daily_partial <- stockreturns %>%
  filter(hcomnam == "ADVANCED MICRO DEVICES INC") %>%
  select(date, ret, vol, sprtrn)

amd_daily <- amd_daily_partial %>%
  left_join(target_5_by_date, by = "date")
  1. Let’s test the model out, using only the data from 2022. A. Run the regression, and find point estimates for each beta (including the intercept), and determine whether each estimate is significantly different than zero using a 10% threshold. Create a small table containing the parameter estimates and also R^2.
fit1 <- lm(ret ~ avg_ret_5 + sprtrn, 
           data = amd_daily,
           subset = format(date, "%Y") == "2022")
s1 <- summary(fit1)
coefs1 <- coef(s1)
r2_1 <- s1$r.squared

panelA1 <- data.frame(
  parameter = rownames(coefs1),
  estimate = coefs1[, "Estimate"],
  #std_error = coefs1[, "Std. Error"],
  #t_value = coefs1[, "t value"],
  p_value = coefs1[, "Pr(>|t|)"],
  sig_10pct = (coefs1[, "Pr(>|t|)"] < 0.10)
)

panelB1 <- data.frame(
  metric   = c("R-squared", "Adj R-squared"),
  value    = c(round(s1$r.squared, 4), round(s1$adj.r.squared, 4)),
  stringsAsFactors = FALSE
)

results1 <- list(
  "Panel A: Coefficients" = panelA1,
  "Panel B: Model Fit"    = panelB1
)

s1
## 
## Call:
## lm(formula = ret ~ avg_ret_5 + sprtrn, data = amd_daily, subset = format(date, 
##     "%Y") == "2022")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.068315 -0.010089 -0.000626  0.012102  0.057044 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0008766  0.0011687  -0.750   0.4539    
## avg_ret_5    1.2378875  0.1152885  10.737   <2e-16 ***
## sprtrn       0.3877553  0.1737241   2.232   0.0265 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01849 on 248 degrees of freedom
## Multiple R-squared:  0.771,  Adjusted R-squared:  0.7692 
## F-statistic: 417.6 on 2 and 248 DF,  p-value: < 2.2e-16
results1
## $`Panel A: Coefficients`
##               parameter      estimate      p_value sig_10pct
## (Intercept) (Intercept) -0.0008765884 4.539453e-01     FALSE
## avg_ret_5     avg_ret_5  1.2378875358 2.457374e-22      TRUE
## sprtrn           sprtrn  0.3877552848 2.650752e-02      TRUE
## 
## $`Panel B: Model Fit`
##          metric  value
## 1     R-squared 0.7710
## 2 Adj R-squared 0.7692

B. Describe each figure reported in the table in plain language. What does each estimate and R^2 represent or mean? What do you learn or what are your conclusions from those specific values?

Keeping the daily return of S&P500 constant (ceteris, paribus), a 1 unit increase in the average return on the 5 competitors causes 1.2379 units increase in the return on AMD stock. This measures the level of comovement between the AMD stock and its competitors’. Since the estimate is larger than 1, we can conclude that AMD stock is more volatile than the average of its five competitors.

Keeping the average return on the 5 competitors constant (ceteris, paribus), a 1 unit increase in the daily return of S&P500 causes 0.3878 unit increase in the return on AMD stock. This measures the level of comovement between the AMD stock and the broader market (as measured by S&P500), usually this measure is called the beta However, since we are controlling for the competitors separately who are also in the S&P500, the estimate we get here is not truly beta. We can therefore not conclude definitively whether AMD is more volatile than the market or not.

The R^2 means that the average return on the 5 competitors and the return on S&P500 together explain 77.10% of the variation in AMD stock return. We can conclude that these two factors together have substantial explanatory power for AMD return.

C. Conduct a test of the normality of the residuals, a test for autocorrelation among the residuals, and a test of heteroskedasticity among the residuals. Report the results of each test and what they imply for your model.

Normality test (shapiro wilk test) H0: residuals are normally distributed p-value = 0.01625 < 0.05, reject H0 that residuals are normally distributed at the 5% significance level;The residuals are likely not normally distributed. However, we cannot reject H0 at the 1% significance level.

Autocorrelation test (durbin watson) H0: no first-order autocorrelation If DW=2, no autocorrelation; DW<2, positive autocorrelation; DW>2, negative autocorrelation p-value = 0.3675 > 0.10, cannot reject H0 that there is no first-order autocorrelation in residuals. The residuals are likely not autocorrelated.

Heteroskedasticity test H0: homoskedasticity p-value = 0.3779 > 0.10 fail to reject H0 that residuals are homoskedastic. Residuals are likely homoskedastic.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# normality test (shapiro wilk test)
# H0: residuals are normally distributed
# p < 0.05, reject H0 that residuals are normally distributed at the 5% significance level
res <- residuals(fit1)
shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.98623, p-value = 0.01625
# autocorrelation test (durbin watson)
# H0: no first-order autocorrelation
# DW=2, no autocorrelation; DW<2, positive autocorrelation; DW>2, negative autocorrelation
# p > 0.10, cannot reject H0 that there is no first-order autocorrelation in residuals
dwtest(fit1)
## 
##  Durbin-Watson test
## 
## data:  fit1
## DW = 1.9573, p-value = 0.3675
## alternative hypothesis: true autocorrelation is greater than 0
# heteroskedasticity test
# H0: homoskedasticity
# P > 0.10 fail to reject H0 that residuals are homoskedastic
bptest(fit1)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit1
## BP = 1.9464, df = 2, p-value = 0.3779
  1. Re-run your regression, focusing on the year immediately prior to the first announcement (the dates from October 12th, 2011 through October 11th, 2012.) A. Report any interesting differences you see in the parameter estimates or R^2 from your earlier 2022 regression.

The main difference is the conditional market sensitivity and overall explanatory power: in Part 1 the S&P coefficient is smaller (≈0.39) but the model explains substantially more variation (Adj-R² ≈ 0.77), meaning the S&P and competitor regressors together capture a larger share of AMD’s daily-return variation in that sample. In Part II the conditional S&P comovement is much larger (≈0.85) while Adj-R² is lower (≈0.53), which means AMD shows stronger comovement with the S&P after partialing out competitors, but a larger share of total variation remains unexplained by the model. We conclude that AMD stock return’s variation is more driven by idiosyncratic factors rather than systematic factors in the second sample compared to the first sample.

start_date <- as.Date("2011-10-12")
end_date   <- as.Date("2012-10-11")

fit2 <- lm(ret ~ avg_ret_5 + sprtrn, 
           data = amd_daily,
           subset = (date >= start_date & date <= end_date))
s2 <- summary(fit2)
coefs2 <- coef(s2)
r2_2 <- s2$r.squared

panelA2 <- data.frame(
  parameter = rownames(coefs2),
  estimate = coefs2[, "Estimate"],
  #std_error = coefs2[, "Std. Error"],
  #t_value = coefs2[, "t value"],
  p_value = coefs2[, "Pr(>|t|)"],
  sig_10pct = (coefs2[, "Pr(>|t|)"] < 0.10)
)

panelB2 <- data.frame(
  metric   = c("R-squared", "Adj R-squared"),
  value    = c(round(s2$r.squared, 4), round(s2$adj.r.squared, 4)),
  stringsAsFactors = FALSE
)

results2 <- list(
  "Panel A: Coefficients" = panelA2,
  "Panel B: Model Fit"    = panelB2
)

s2
## 
## Call:
## lm(formula = ret ~ avg_ret_5 + sprtrn, data = amd_daily, subset = (date >= 
##     start_date & date <= end_date))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.100432 -0.010862 -0.001364  0.010123  0.092715 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.002168   0.001370  -1.583  0.11479    
## avg_ret_5    1.133231   0.200485   5.652  4.3e-08 ***
## sprtrn       0.854096   0.259213   3.295  0.00113 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02166 on 250 degrees of freedom
## Multiple R-squared:  0.5369, Adjusted R-squared:  0.5332 
## F-statistic: 144.9 on 2 and 250 DF,  p-value: < 2.2e-16
results2
## $`Panel A: Coefficients`
##               parameter     estimate      p_value sig_10pct
## (Intercept) (Intercept) -0.002167859 1.147851e-01     FALSE
## avg_ret_5     avg_ret_5  1.133230892 4.303211e-08      TRUE
## sprtrn           sprtrn  0.854095906 1.126948e-03      TRUE
## 
## $`Panel B: Model Fit`
##          metric  value
## 1     R-squared 0.5369
## 2 Adj R-squared 0.5332

B. Report whether the model passes tests of the IID assumptions.

Residuals likely are not normally distributed, are not autocorrelated, and are homoskedastic.

library(lmtest)
# normality test (shapiro wilk test)
# H0: residuals are normally distributed
# p < 0.01, reject H0 that residuals are normally distributed at the 1% significance level
res <- residuals(fit2)
shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.92847, p-value = 1.045e-09
# autocorrelation test (durbin watson)
# H0: no first-order autocorrelation
# DW=2, no autocorrelation; DW<2, positive autocorrelation; DW>2, negative autocorrelation
# p > 0.10, cannot reject H0 that there is no first-order autocorrelation in residuals
dwtest(fit2)
## 
##  Durbin-Watson test
## 
## data:  fit2
## DW = 1.9029, p-value = 0.2198
## alternative hypothesis: true autocorrelation is greater than 0
# heteroskedasticity test
# H0: homoskedasticity
# P > 0.10 fail to reject H0 that residuals are homoskedastic
bptest(fit2)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 0.26636, df = 2, p-value = 0.8753

C. Use your model and the market activity to predict how AMD shares would be expected to move on 2012-10-12 and 2012-10-19. Construct 95% confidence intervals for these predictions (not the mean response). Create a table containing the following information for both days: the actual AMD return, the “worst case” return from your 95% CI, and the difference between them, which we can call the “excess return” or the “abnormal return” of AMD stock on these two days. Provide your understanding of what these two excess returns represent from a legal/factual standpoint.

These two negative excess returns suggest that the actual returns of AMD on those days were less than the worst case scenario in a 95% confidence interval as predicted by an OLS regression model. Since the regression model feature two explanatory variables, S&P500 return and 5 competitor average return, this fact means that the movement of AMD stock on those two days are very unlikely caused by market movement, and are very likely due to idiosyncratic events, which correspond to the announcement of AMD’s full writedown of the unsold inventory. From a legal standpoint, this supports the lawsuit of stockholders against the company and justifies compensation for damages caused by misconduct of the company.

data_10_12 <- amd_daily %>%
  filter(date == as.Date("2012-10-12"))

data_10_19 <- amd_daily %>%
  filter(date == as.Date("2012-10-19"))

# 95% confidence interval for predictions
pred_10_12_ci <- predict(fit2, newdata = data_10_12, interval = "prediction", level = 0.95)
cat("The predicted return for AMD on 2012-10-12 is",round(pred_10_12_ci[1, "fit"], 6),
  "with a 95% prediction interval from",
  round(pred_10_12_ci[1, "lwr"], 6), "to",
  round(pred_10_12_ci[1, "upr"], 6), ".\n")
## The predicted return for AMD on 2012-10-12 is -0.006828 with a 95% prediction interval from -0.049593 to 0.035938 .
pred_10_19_ci <- predict(fit2, newdata = data_10_19, interval = "prediction", level = 0.95)
cat("The predicted return for AMD on 2012-10-19 is",round(pred_10_19_ci[1, "fit"], 6),
  "with a 95% prediction interval from",
  round(pred_10_19_ci[1, "lwr"], 6), "to",
  round(pred_10_19_ci[1, "upr"], 6), ".\n")
## The predicted return for AMD on 2012-10-19 is -0.044803 with a 95% prediction interval from -0.087855 to -0.001752 .
# excess return
actual_10_12 <- amd_daily %>%
  filter(date == as.Date("2012-10-12")) %>%
  pull(ret)

actual_10_19 <- amd_daily %>%
  filter(date == as.Date("2012-10-19")) %>%
  pull(ret)

worst_case_10_12 <- pred_10_12_ci[1, "lwr"]

worst_case_10_19 <- pred_10_19_ci[1, "lwr"]

excess_ret_10_12 <- actual_10_12 - worst_case_10_12
excess_ret_10_19 <- actual_10_19 - worst_case_10_19

excess_result <- data.frame(
  date = as.Date(c("2012-10-12", "2012-10-19")),
  actual_return = round(c(actual_10_12, actual_10_19), 6),
  worst_case_pred = round(c(worst_case_10_12, worst_case_10_19), 6),
  excess_return = round(c(excess_ret_10_12, excess_ret_10_19), 6)
)

print(excess_result)
##         date actual_return worst_case_pred excess_return
## 1 2012-10-12     -0.143750       -0.049593     -0.094157
## 2 2012-10-19     -0.167939       -0.087855     -0.080084

D. Cumulate these two abnormal returns to find the total abnormal decline of the stock on the two news days. The aggregate abnormal return will be equal to (1 + abret1)*(1 + abret2) – 1. Multiply the AMD stock price at 2012-10-11 by the shares outstanding on the same day to find the total market capitalization of the company. Finally, apply the aggregate abnormal return to this market capitalization in order to estimate the damages which the shareholders might seek in their lawsuit.

The shareholders suffered a damage of 377439397

agg_abnormal_ret <- ((1 + excess_ret_10_12)*(1 + excess_ret_10_19)-1)

agg_abnormal_ret
## [1] -0.1667006
amd_2012_10_11 <- stockreturns %>%
  filter(hcomnam == "ADVANCED MICRO DEVICES INC",
         date == as.Date("2012-10-11")) %>%
  select(prc, shrout)

mktcap <- abs(amd_2012_10_11$prc * amd_2012_10_11$shrout * 1000)

damage <- abs(agg_abnormal_ret * mktcap)

cat("The shareholders suffered a damage of", damage)
## The shareholders suffered a damage of 377439397
  1. The method above is called a “stock event study” and has often been used to estimate damages in financial litigation. Price returns are well-studied and supported by a huge body of econometric literature. In contrast, the trading volume of stocks is less well-understood. Let’s see how well we can model this variable in our data

A. Predict AMD daily trading volume during all trade days in 2022 using whatever you can. Try to build a model with R^2 above 0.4. Feel free to transform your predictors, your response, explore lags, differences, rates or proportions, etc: use the whole bag of tricks.

amd_daily <- amd_daily %>%
  arrange(date) %>%                
  mutate(vol_lag1 = lag(vol, 1))

data_2022 <- amd_daily %>%
  filter(format(date, "%Y") == "2022")

vol_fit <- lm(log(vol) ~ log(vol_lag1) + log(ret) + log(sprtrn) + log(avg_ret_5) + log(sum_vol_5), data = data_2022)
## Warning in log(ret): NaNs produced
## Warning in log(sprtrn): NaNs produced
## Warning in log(avg_ret_5): NaNs produced
summary(vol_fit)
## 
## Call:
## lm(formula = log(vol) ~ log(vol_lag1) + log(ret) + log(sprtrn) + 
##     log(avg_ret_5) + log(sum_vol_5), data = data_2022)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29831 -0.10608 -0.03307  0.07718  0.55171 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.14582    2.05871  -1.042 0.300615    
## log(vol_lag1)   0.72669    0.06859  10.595  < 2e-16 ***
## log(ret)        0.06234    0.02692   2.316 0.023304 *  
## log(sprtrn)    -0.06674    0.03154  -2.116 0.037658 *  
## log(avg_ret_5)  0.04793    0.03754   1.277 0.205568    
## log(sum_vol_5)  0.39322    0.10803   3.640 0.000499 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1588 on 75 degrees of freedom
##   (170 observations deleted due to missingness)
## Multiple R-squared:  0.7426, Adjusted R-squared:  0.7254 
## F-statistic: 43.27 on 5 and 75 DF,  p-value: < 2.2e-16

B. Describe the precise predictors and response variables used in your final model, the thought process which led you to your final model, and the R^2. (It’s okay if you cannot reach an R^2 of 0.4!)

Response variable: log of AMD volatility

Predictors: log of first order lag of volatility, log of return on AMD, log of S&P500 return, log of average return on the 5 competitors, and log of sum of volatility of the 5 competitors

R^2 = 0.7426

Thought process: Start with the most basic linear regression on all variables, realize R^2 very low. After trying to square each of the variable, R^2 experienced very little improvement, likely need transformation. Transformed the response variable to log scale and significantly improved R^2; transformed all predictors to log scale as well, experienced significant R^2 improvement. Added the first order lag of the response variable log transformed, boosted the R^2, suggesting the volatility is likely an AR process.

Rationale: 1. Log–log form ⇒ elasticities + multiplicative shocks. Trading volume is roughly log-normal and reacts multiplicatively to “information flow.” In logs, effects are linear and coefficients read as elasticities (e.g., a 1% move in peer volume lifts AMD volume by ≈0.39%). 2. Strong persistence in volume. The big coefficient on log(vol_lag1) (~0.73, p≪0.001) captures the well-known high autocorrelation of volume (slow-moving participation/liquidity). 3. Stationarity & variance stabilization through log transformation. Logging volume tames heteroskedasticity and outliers, improving linear fit.