This exercise involves the
Autodata set. Make sure that the missing values have been removed from the data.
## [1] 392 9
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Quantitative predictors: mpg,
cylinders, displacement,
horsepower, weight, acceleration,
year.
Qualitative predictor: name (car name,
character/factor); origin is coded as integers (1 =
American, 2 = European, 3 = Japanese) and is best treated as a
qualitative (categorical) variable despite its numeric
encoding.
quant_vars <- c("mpg", "cylinders", "displacement",
"horsepower", "weight", "acceleration", "year")
sapply(Auto[, quant_vars], range)## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
## mpg cylinders displacement horsepower weight acceleration
## Mean 23.445918 5.471939 194.412 104.46939 2977.5842 15.541327
## SD 7.805007 1.705783 104.644 38.49116 849.4026 2.758864
## year
## Mean 75.979592
## SD 3.683737
## [1] 316 9
rbind(
Range_Min = sapply(Auto_sub[, quant_vars], min),
Range_Max = sapply(Auto_sub[, quant_vars], max),
Mean = sapply(Auto_sub[, quant_vars], mean),
SD = sapply(Auto_sub[, quant_vars], sd)
)## mpg cylinders displacement horsepower weight acceleration
## Range_Min 11.000000 3.000000 68.00000 46.00000 1649.0000 8.500000
## Range_Max 46.600000 8.000000 455.00000 230.00000 4997.0000 24.800000
## Mean 24.404430 5.373418 187.24051 100.72152 2935.9715 15.726899
## SD 7.867283 1.654179 99.67837 35.70885 811.3002 2.693721
## year
## Range_Min 70.000000
## Range_Max 82.000000
## Mean 77.145570
## SD 3.106217
After removing observations 10–85 (76 rows), the subset has 316
observations. The ranges, means, and standard deviations shift
noticeably for variables such as horsepower and
weight because those early rows represent a particular era
of vehicles.
# Scatterplot matrix (quantitative variables only)
pairs(Auto[, quant_vars],
main = "Scatterplot Matrix — Auto Dataset",
pch = 19, cex = 0.4, col = "steelblue")Findings:
displacement, horsepower, and
weight are strongly positively correlated with each other —
heavier, larger-engine cars tend to have higher horsepower.mpg — bigger/heavier cars get worse fuel
economy.year shows a moderate positive relationship with
mpg, suggesting that newer model-year cars are more
fuel-efficient.acceleration is weakly positively associated with
mpg and negatively with
horsepower/weight.mpgAuto_long <- Auto %>%
select(all_of(quant_vars)) %>%
pivot_longer(-mpg, names_to = "predictor", values_to = "value")
ggplot(Auto_long, aes(x = value, y = mpg)) +
geom_point(alpha = 0.3, color = "steelblue", size = 0.9) +
geom_smooth(method = "loess", se = TRUE, color = "tomato", linewidth = 0.8) +
facet_wrap(~predictor, scales = "free_x") +
labs(title = "Relationship of Each Predictor with mpg",
x = "Predictor Value", y = "mpg") +
theme_bw()Conclusion: Based on the plots,
displacement, horsepower, weight,
and year all appear strongly associated with
mpg and would be useful predictors in a model.
cylinders also shows a clear negative relationship.
acceleration has a weaker and noisier association.
Multiple linear regression on the
Autodata set.
pairs(Auto[, -9], # exclude 'name'
pch = 19, cex = 0.4, col = "darkorchid",
main = "Scatterplot Matrix — All Auto Variables")name)## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
mpg ~ . (excluding name)##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The overall F-statistic is highly significant (p < 2.2e-16), so
yes, there is a statistically significant relationship
between the predictors and mpg. The adjusted R² ≈ 0.82,
meaning the predictors explain about 82 % of the variance in fuel
economy.
From the summary output, predictors with p-value < 0.05 are:
displacement, weight, year, and
origin. horsepower and
acceleration are not significant at the 5 % level after
accounting for the other variables.
yearThe coefficient for year is approximately
+0.75, meaning that, holding all other predictors
constant, each additional model year is associated with about 0.75 more
miles per gallon. This reflects improvements in fuel efficiency over
time due to technological advances and regulatory pressure.
Comments:
# Test a few meaningful interactions
lm_inter <- lm(mpg ~ . - name + displacement:weight + horsepower:weight,
data = Auto)
summary(lm_inter)##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9445 -1.6144 -0.1607 1.5992 12.0129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.142e+00 4.723e+00 0.242 0.809055
## cylinders 3.110e-02 2.921e-01 0.106 0.915276
## displacement -1.670e-02 1.960e-02 -0.852 0.394640
## horsepower -1.799e-01 4.796e-02 -3.752 0.000202 ***
## weight -1.122e-02 7.281e-04 -15.406 < 2e-16 ***
## acceleration -4.948e-02 9.447e-02 -0.524 0.600789
## year 7.748e-01 4.512e-02 17.174 < 2e-16 ***
## origin 7.315e-01 2.647e-01 2.763 0.005998 **
## displacement:weight 6.763e-06 5.494e-06 1.231 0.219098
## horsepower:weight 4.082e-05 1.286e-05 3.173 0.001628 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.929 on 382 degrees of freedom
## Multiple R-squared: 0.8624, Adjusted R-squared: 0.8591
## F-statistic: 266 on 9 and 382 DF, p-value: < 2.2e-16
The interaction displacement:weight is statistically
significant (p < 0.05), suggesting that the joint effect of engine
size and vehicle weight on fuel economy is not purely additive. The
horsepower:weight interaction is also worth examining.
# log, sqrt, and squared transformations
lm_log <- lm(mpg ~ log(displacement) + log(horsepower) +
log(weight) + acceleration + year + origin,
data = Auto)
lm_sqrt <- lm(mpg ~ sqrt(displacement) + sqrt(horsepower) +
sqrt(weight) + acceleration + year + origin,
data = Auto)
lm_sq <- lm(mpg ~ displacement + I(displacement^2) +
horsepower + I(horsepower^2) +
weight + I(weight^2) +
acceleration + year + origin,
data = Auto)
cat("Adj. R² — log model :", summary(lm_log)$adj.r.squared, "\n")## Adj. R² — log model : 0.8445302
## Adj. R² — sqrt model: 0.8312818
## Adj. R² — sq model : 0.861586
## Adj. R² — linear : 0.8182238
Findings: The log-transformed model achieves a
higher adjusted R² than the untransformed model, confirming the
non-linear (concave) relationship between the heavy predictors and
mpg. Squaring or square-rooting also improves fit modestly.
Log transformations of displacement,
horsepower, and weight appear most
beneficial.
Using the
Bostondata set to predict per capita crime rate (crim).
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
predictors <- setdiff(names(Boston), "crim")
# Fit univariate models and collect p-values & coefficients
uni_results <- lapply(predictors, function(var) {
fit <- lm(reformulate(var, "crim"), data = Boston)
s <- summary(fit)
data.frame(
predictor = var,
coef = coef(fit)[2],
p_value = s$coefficients[2, 4],
r_squared = s$r.squared
)
})
uni_df <- do.call(rbind, uni_results)
uni_df <- uni_df[order(uni_df$p_value), ]
print(uni_df)## predictor coef p_value r_squared
## rad rad 0.61791093 2.693844e-56 0.391256687
## tax tax 0.02974225 2.357127e-47 0.339614243
## lstat lstat 0.54880478 2.654277e-27 0.207590933
## nox nox 31.24853120 3.751739e-23 0.177217182
## indus indus 0.50977633 1.450349e-21 0.165310070
## medv medv -0.36315992 1.173987e-19 0.150780469
## dis dis -1.55090168 8.519949e-19 0.144149375
## age age 0.10778623 2.854869e-16 0.124421452
## ptratio ptratio 1.15198279 2.942922e-11 0.084068439
## rm rm -2.68405122 6.346703e-07 0.048069117
## zn zn -0.07393498 5.506472e-06 0.040187908
## chas chas -1.89277655 2.094345e-01 0.003123869
# Quick visual: crim vs. a few significant predictors
sig_preds <- head(uni_df$predictor, 6)
Boston_long <- Boston %>%
select(crim, all_of(sig_preds)) %>%
pivot_longer(-crim, names_to = "predictor", values_to = "value")
ggplot(Boston_long, aes(x = value, y = crim)) +
geom_point(alpha = 0.3, color = "steelblue", size = 0.8) +
geom_smooth(method = "lm", se = TRUE, color = "tomato") +
facet_wrap(~predictor, scales = "free_x") +
labs(title = "Simple Linear Regression: crim ~ each predictor",
x = "Predictor", y = "crim") +
theme_bw()Almost all predictors show a statistically significant univariate
association with crim. The strongest relationships are with
rad (accessibility to radial highways), tax
(property-tax rate), lstat, medv, and
dis.
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
At α = 0.05 in the multiple regression, we can reject H₀: βⱼ = 0 for
zn, dis, rad,
black, and medv. Many predictors that
were significant in univariate models lose significance once we control
for the others, indicating multicollinearity.
multi_coefs <- coef(lm_boston)[-1] # drop intercept
coef_df <- data.frame(
predictor = names(multi_coefs),
multi_coef = as.numeric(multi_coefs),
uni_coef = uni_df$coef[match(names(multi_coefs), uni_df$predictor)]
)
ggplot(coef_df, aes(x = uni_coef, y = multi_coef, label = predictor)) +
geom_point(color = "steelblue", size = 2) +
geom_text(vjust = -0.6, size = 3) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey50") +
labs(title = "Univariate vs. Multiple Regression Coefficients",
x = "Simple Regression Coefficient",
y = "Multiple Regression Coefficient") +
theme_bw()The coefficients differ substantially for several predictors (e.g.,
nox, rm), demonstrating that multicollinearity
causes the simple-regression estimates to be misleading.
poly_results <- lapply(predictors, function(var) {
# Need at least 4 unique values to fit a degree-3 polynomial
if (length(unique(Boston[[var]])) < 4) {
return(data.frame(predictor = var, p_quad = NA, p_cubic = NA))
}
fit3 <- lm(crim ~ poly(Boston[[var]], 3), data = Boston)
s <- summary(fit3)
p2 <- s$coefficients[3, 4]
p3 <- s$coefficients[4, 4]
data.frame(predictor = var, p_quad = p2, p_cubic = p3)
})
poly_df <- do.call(rbind, poly_results)
poly_df$nonlinear_sig <- (!is.na(poly_df$p_quad) &
(poly_df$p_quad < 0.05 | poly_df$p_cubic < 0.05))
print(poly_df)## predictor p_quad p_cubic nonlinear_sig
## 1 zn 4.420507e-03 2.295386e-01 TRUE
## 2 indus 1.086057e-03 1.196405e-12 TRUE
## 3 chas NA NA FALSE
## 4 nox 7.736755e-05 6.961110e-16 TRUE
## 5 rm 1.508545e-03 5.085751e-01 TRUE
## 6 age 2.291156e-06 6.679915e-03 TRUE
## 7 dis 7.869767e-14 1.088832e-08 TRUE
## 8 rad 9.120558e-03 4.823138e-01 TRUE
## 9 tax 3.665348e-06 2.438507e-01 TRUE
## 10 ptratio 2.405468e-03 6.300514e-03 TRUE
## 11 lstat 3.780418e-02 1.298906e-01 TRUE
## 12 medv 2.928577e-35 1.046510e-12 TRUE
For many predictors (e.g., indus, nox,
age, dis, ptratio,
medv), the quadratic or cubic terms are statistically
significant, providing evidence of non-linear
associations with per capita crime rate.
Using the
Weeklydata set (1,089 weekly returns, 1990–2010).
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
# Volume over time
ggplot(Weekly, aes(x = seq_len(nrow(Weekly)), y = Volume)) +
geom_line(color = "steelblue") +
labs(title = "Weekly Trading Volume Over Time",
x = "Week Index", y = "Volume (billions of shares)") +
theme_bw()pairs(Weekly[, 1:8], col = ifelse(Weekly$Direction == "Up", "green4", "tomato"),
pch = 19, cex = 0.4,
main = "Weekly Data — Green = Up, Red = Down")Patterns: Trading Volume has increased
substantially over the 21-year period. The lag returns
(Lag1–Lag5) show little obvious serial
correlation with each other or with Today. The market went
Up slightly more often than Down (~55 % vs ~45
%).
glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly,
family = binomial)
summary(glm_full)##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Significant predictors: Only
Lag2 has a p-value below 0.05 (p ≈ 0.030).
The other lag variables and Volume are not statistically
significant in this model.
prob_full <- predict(glm_full, type = "response")
pred_full <- ifelse(prob_full > 0.5, "Up", "Down")
conf_full <- table(Predicted = pred_full, Actual = Weekly$Direction)
conf_full## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
# Overall fraction correct
accuracy_full <- mean(pred_full == Weekly$Direction)
cat("Overall accuracy (full model):", round(accuracy_full, 4), "\n")## Overall accuracy (full model): 0.5611
Interpretation: The model predicts Up
for the vast majority of weeks, achieving an overall accuracy of 56.1 %.
However, it does a poor job predicting Down weeks — it
misclassifies most of them as Up. The model’s apparent
accuracy is partly driven by the market’s slight upward bias.
train <- Weekly$Year <= 2008
test <- !train
glm_lag2 <- glm(Direction ~ Lag2,
data = Weekly,
subset = train,
family = binomial)
prob_test <- predict(glm_lag2, newdata = Weekly[test, ], type = "response")
pred_test <- ifelse(prob_test > 0.5, "Up", "Down")
conf_test <- table(Predicted = pred_test, Actual = Weekly$Direction[test])
conf_test## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
accuracy_test <- mean(pred_test == Weekly$Direction[test])
cat("Overall accuracy (Lag2 model, 2009-2010):", round(accuracy_test, 4), "\n")## Overall accuracy (Lag2 model, 2009-2010): 0.625
Result: Using only Lag2 and evaluating
on 2009–2010 held-out data, the model achieves an accuracy of 62.5 %.
The confusion matrix shows the model is still biased toward predicting
Up, but it correctly identifies a reasonable proportion of
up-weeks, which is practically relevant for a trading strategy.
End of Midterm Homework