Auto DatasetThis exercise explores the Auto dataset from the
ISLR2 package. The dataset contains information about 392
vehicles and is used to practice exploratory data analysis including
descriptive statistics and graphical investigation.
## Dimensions after removing missing values: 392 9
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Quantitative predictors (continuous/discrete
numeric): mpg, cylinders,
displacement, horsepower, weight,
acceleration, year
Qualitative predictors (categorical):
origin (1 = American, 2 = European, 3 = Japanese),
name (character identifier)
Although
cylindersandoriginare stored as integers, they represent categorical groupings and should be treated accordingly in modeling contexts.
quant_vars <- Auto[, c("mpg","cylinders","displacement","horsepower",
"weight","acceleration","year")]
range_df <- as.data.frame(t(sapply(quant_vars, range)))
colnames(range_df) <- c("Min", "Max")
range_df$Range <- range_df$Max - range_df$Min
knitr::kable(range_df, caption = "Range of Quantitative Predictors")| Min | Max | Range | |
|---|---|---|---|
| mpg | 9 | 46.6 | 37.6 |
| cylinders | 3 | 8.0 | 5.0 |
| displacement | 68 | 455.0 | 387.0 |
| horsepower | 46 | 230.0 | 184.0 |
| weight | 1613 | 5140.0 | 3527.0 |
| acceleration | 8 | 24.8 | 16.8 |
| year | 70 | 82.0 | 12.0 |
weightanddisplacementshow the widest absolute ranges, suggesting high variability across vehicle types.
summary_df <- data.frame(
Mean = sapply(quant_vars, mean),
SD = sapply(quant_vars, sd)
)
knitr::kable(round(summary_df, 3), caption = "Mean and Standard Deviation of Quantitative Predictors")| Mean | SD | |
|---|---|---|
| mpg | 23.446 | 7.805 |
| cylinders | 5.472 | 1.706 |
| displacement | 194.412 | 104.644 |
| horsepower | 104.469 | 38.491 |
| weight | 2977.584 | 849.403 |
| acceleration | 15.541 | 2.759 |
| year | 75.980 | 3.684 |
The high standard deviation of
displacement(104.6) andweight(849.4) indicates substantial spread around the mean, consistent with their wide ranges. In contrast,accelerationhas a relatively low SD, suggesting more homogeneity across vehicles.
Auto_sub <- Auto[-(10:85), ]
quant_sub <- Auto_sub[, c("mpg","cylinders","displacement","horsepower",
"weight","acceleration","year")]
range_sub <- as.data.frame(t(sapply(quant_sub, range)))
colnames(range_sub) <- c("Min", "Max")
stats_sub <- data.frame(
Mean = sapply(quant_sub, mean),
SD = sapply(quant_sub, sd)
)
knitr::kable(round(cbind(range_sub, stats_sub), 3),
caption = "Summary Statistics After Removing Observations 10–85")| Min | Max | Mean | SD | |
|---|---|---|---|---|
| mpg | 11.0 | 46.6 | 24.404 | 7.867 |
| cylinders | 3.0 | 8.0 | 5.373 | 1.654 |
| displacement | 68.0 | 455.0 | 187.241 | 99.678 |
| horsepower | 46.0 | 230.0 | 100.722 | 35.709 |
| weight | 1649.0 | 4997.0 | 2935.972 | 811.300 |
| acceleration | 8.5 | 24.8 | 15.727 | 2.694 |
| year | 70.0 | 82.0 | 77.146 | 3.106 |
Removing 76 observations (roughly 19% of the data) from the middle of the dataset noticeably shifts means and ranges for several predictors, particularly
horsepowerandcylinders. This reflects the fact that those observations were not a random sample — removing a contiguous block can induce selection bias in descriptive statistics.
pairs(quant_vars,
main = "Scatterplot Matrix — Auto Dataset",
col = "steelblue",
pch = 16,
cex = 0.5,
upper.panel = panel.smooth)Key observations:
displacement, horsepower, and
weight are strongly positively correlated with one another
— vehicles with larger engines tend to be heavier.mpg, suggesting a clear fuel efficiency penalty for
heavier, more powerful vehicles.year shows a mild positive association
with mpg, indicating gradual improvements in fuel economy
over time.acceleration shows a weaker and more diffuse
relationship with the other predictors.mpg)par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))
pred_vars <- c("cylinders","displacement","horsepower","weight","acceleration","year")
for (var in pred_vars) {
plot(Auto[[var]], Auto$mpg,
xlab = var, ylab = "mpg",
main = paste("mpg vs.", var),
col = adjustcolor("steelblue", alpha.f = 0.5),
pch = 16, cex = 0.8)
abline(lm(Auto$mpg ~ Auto[[var]]), col = "red", lwd = 2)
}Conclusion: weight,
displacement, horsepower, and
cylinders all exhibit strong negative linear relationships
with mpg and are likely the most useful predictors in a
regression model. year shows a positive trend reflecting
technological improvements. acceleration has a weaker and
less consistent relationship.
AutoThis section fits multiple linear regression models using
mpg as the response variable and investigates predictor
significance, diagnostics, interaction effects, and variable
transformations.
pairs(Auto[, -9],
main = "Scatterplot Matrix — All Auto Variables",
col = "darkslateblue",
pch = 16,
cex = 0.4,
upper.panel = panel.smooth)cor_mat <- cor(Auto[, -9])
knitr::kable(round(cor_mat, 3), caption = "Correlation Matrix (excluding 'name')")| mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | |
|---|---|---|---|---|---|---|---|---|
| mpg | 1.000 | -0.778 | -0.805 | -0.778 | -0.832 | 0.423 | 0.581 | 0.565 |
| cylinders | -0.778 | 1.000 | 0.951 | 0.843 | 0.898 | -0.505 | -0.346 | -0.569 |
| displacement | -0.805 | 0.951 | 1.000 | 0.897 | 0.933 | -0.544 | -0.370 | -0.615 |
| horsepower | -0.778 | 0.843 | 0.897 | 1.000 | 0.865 | -0.689 | -0.416 | -0.455 |
| weight | -0.832 | 0.898 | 0.933 | 0.865 | 1.000 | -0.417 | -0.309 | -0.585 |
| acceleration | 0.423 | -0.505 | -0.544 | -0.689 | -0.417 | 1.000 | 0.290 | 0.213 |
| year | 0.581 | -0.346 | -0.370 | -0.416 | -0.309 | 0.290 | 1.000 | 0.182 |
| origin | 0.565 | -0.569 | -0.615 | -0.455 | -0.585 | 0.213 | 0.182 | 1.000 |
The highest correlations with
mpgare seen forweight(−0.832),displacement(−0.805), andcylinders(−0.778). These strong negative correlations confirm that heavier, larger-engine vehicles are significantly less fuel efficient.yearhas a moderate positive correlation (0.581), confirming the fuel economy improvement trend over time.
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between predictors and
response? Yes. The model F-statistic is highly significant (p
< 2.2e-16), providing strong evidence of a collective linear
relationship between the predictors and mpg.
ii. Which predictors are statistically significant?
At the 5% significance level: displacement,
weight, year, and origin all have
p-values below 0.05. cylinders, horsepower,
and acceleration are not individually significant when
controlling for others — likely due to multicollinearity with
displacement and weight.
iii. Interpretation of the year
coefficient: The coefficient for year is
approximately +0.751, meaning that holding all other
predictors constant, fuel efficiency increases by about 0.75 mpg per
model year. This reflects systematic improvements in automotive
engineering during the 1970s and 1980s.
Interpretation:
lm_interact <- lm(mpg ~ displacement * weight + acceleration * year +
horsepower * weight, data = Auto)
summary(lm_interact)##
## Call:
## lm(formula = mpg ~ displacement * weight + acceleration * year +
## horsepower * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6205 -1.5878 0.0325 1.3787 11.8148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.144e+02 1.817e+01 6.296 8.38e-10 ***
## displacement -3.801e-02 1.607e-02 -2.365 0.018549 *
## weight -1.073e-02 6.881e-04 -15.589 < 2e-16 ***
## acceleration -7.104e+00 1.118e+00 -6.353 6.00e-10 ***
## year -6.792e-01 2.340e-01 -2.902 0.003917 **
## horsepower -1.620e-01 4.459e-02 -3.633 0.000318 ***
## displacement:weight 1.029e-05 4.968e-06 2.071 0.039049 *
## acceleration:year 9.203e-02 1.450e-02 6.345 6.28e-10 ***
## weight:horsepower 3.414e-05 1.207e-05 2.829 0.004910 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.812 on 383 degrees of freedom
## Multiple R-squared: 0.8729, Adjusted R-squared: 0.8702
## F-statistic: 328.7 on 8 and 383 DF, p-value: < 2.2e-16
The interaction between
acceleration:yearanddisplacement:weightare statistically significant, suggesting that the effect of engine size on fuel economy is moderated by vehicle weight, and that the effect of acceleration has changed over time. Including interactions improves model fit but reduces interpretability.
# Fit models with different transformations of horsepower (within full model)
trans_models <- list(
"Linear" = lm(mpg ~ . - name, data = Auto),
"Log" = lm(mpg ~ . - name - horsepower + log(horsepower), data = Auto),
"Square Root" = lm(mpg ~ . - name - horsepower + sqrt(horsepower), data = Auto),
"Quadratic" = lm(mpg ~ . - name - horsepower + I(horsepower^2), data = Auto)
)
# Extract fit statistics
trans_stats <- do.call(rbind, lapply(names(trans_models), function(name) {
s <- summary(trans_models[[name]])
data.frame(
Transformation = name,
Adj_R_Squared = round(s$adj.r.squared, 4),
RSE = round(s$sigma, 4)
)
}))
knitr::kable(trans_stats, row.names = FALSE,
caption = "Model Fit Comparison Across Horsepower Transformations (Full Model)")| Transformation | Adj_R_Squared | RSE |
|---|---|---|
| Linear | 0.8182 | 3.3277 |
| Log | 0.8340 | 3.1801 |
| Square Root | 0.8237 | 3.2772 |
| Quadratic | 0.8193 | 3.3179 |
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
smooth_line <- function(model, xvar, transform_fn, col = "red") {
x_seq <- seq(min(xvar), max(xvar), length.out = 300)
newdata <- data.frame(
cylinders = mean(Auto$cylinders),
displacement = mean(Auto$displacement),
horsepower = x_seq,
weight = mean(Auto$weight),
acceleration = mean(Auto$acceleration),
year = mean(Auto$year),
origin = mean(Auto$origin),
name = Auto$name[1] # dummy value to satisfy the model
)
y_hat <- predict(model, newdata = newdata)
lines(transform_fn(x_seq), y_hat, col = col, lwd = 2)
}
# --- Plot 1: Linear ---
plot(Auto$horsepower, Auto$mpg,
main = paste0("Linear (horsepower) | Adj R² = ",
round(summary(trans_models$Linear)$adj.r.squared, 3)),
xlab = "horsepower", ylab = "mpg",
col = adjustcolor("steelblue", 0.5), pch = 16)
# --- Plot 2: Log ---
plot(log(Auto$horsepower), Auto$mpg,
main = paste0("log(horsepower) | Adj R² = ",
round(summary(trans_models$Log)$adj.r.squared, 3)),
xlab = "log(horsepower)", ylab = "mpg",
col = adjustcolor("steelblue", 0.5), pch = 16)
# --- Plot 3: Square Root ---
plot(sqrt(Auto$horsepower), Auto$mpg,
main = paste0("sqrt(horsepower) | Adj R² = ",
round(summary(trans_models$`Square Root`)$adj.r.squared, 3)),
xlab = "sqrt(horsepower)", ylab = "mpg",
col = adjustcolor("darkgreen", 0.5), pch = 16)
# --- Plot 4: Quadratic ---
plot(Auto$horsepower^2, Auto$mpg,
main = paste0("horsepower² | Adj R² = ",
round(summary(trans_models$Quadratic)$adj.r.squared, 3)),
xlab = "horsepower²", ylab = "mpg",
col = adjustcolor("tomato", 0.5), pch = 16)BostonThis section uses the Boston dataset to predict per
capita crime rate (crim) using simple and multiple linear
regression, and explores non-linear associations via polynomial
regression.
results <- lapply(predictors, function(var) {
fit <- lm(Boston$crim ~ Boston[[var]])
s <- summary(fit)
data.frame(
Predictor = var,
Coefficient = round(coef(fit)[2], 4),
R_Squared = round(s$r.squared, 4),
P_value = signif(s$coefficients[2, 4], 4)
)
})
results_df <- do.call(rbind, results)
results_df$Significant <- ifelse(results_df$P_value < 0.05, "Yes", "No")
knitr::kable(results_df, row.names = FALSE,
caption = "Simple Linear Regression Results: crim vs. Each Predictor")| Predictor | Coefficient | R_Squared | P_value | Significant |
|---|---|---|---|---|
| zn | -0.0739 | 0.0402 | 0.0000055 | Yes |
| indus | 0.5098 | 0.1653 | 0.0000000 | Yes |
| chas | -1.8928 | 0.0031 | 0.2094000 | No |
| nox | 31.2485 | 0.1772 | 0.0000000 | Yes |
| rm | -2.6841 | 0.0481 | 0.0000006 | Yes |
| age | 0.1078 | 0.1244 | 0.0000000 | Yes |
| dis | -1.5509 | 0.1441 | 0.0000000 | Yes |
| rad | 0.6179 | 0.3913 | 0.0000000 | Yes |
| tax | 0.0297 | 0.3396 | 0.0000000 | Yes |
| ptratio | 1.1520 | 0.0841 | 0.0000000 | Yes |
| lstat | 0.5488 | 0.2076 | 0.0000000 | Yes |
| medv | -0.3632 | 0.1508 | 0.0000000 | Yes |
Almost all predictors show a statistically significant association with
crim. Notably,rad(accessibility to radial highways) andtax(property tax rate) have the highest R² values, suggesting that areas with better highway access and higher tax burdens tend to have higher crime rates — possibly reflecting urban density effects.
sig_preds <- results_df$Predictor[results_df$Significant == "Yes"]
par(mfrow = c(3, 5), mar = c(3, 3, 2, 1))
for (var in sig_preds) {
plot(Boston[[var]], Boston$crim,
xlab = var, ylab = "crim",
pch = 16, col = adjustcolor("steelblue", 0.4), cex = 0.6,
main = paste("crim ~", var))
abline(lm(Boston$crim ~ Boston[[var]]), col = "red", lwd = 2)
}##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
In the multiple regression context, most individually-significant predictors lose their significance — a classic sign of multicollinearity. Only
zn,dis,rad,black, andmedvretain significance (p < 0.05). This suggests that many simple associations in (a) were confounded by correlated predictors.
uni_coefs <- setNames(results_df$Coefficient, results_df$Predictor)
multi_coefs <- coef(lm_boston)[-1]
common <- intersect(names(uni_coefs), names(multi_coefs))
plot(uni_coefs[common], multi_coefs[common],
xlab = "Simple Regression Coefficients",
ylab = "Multiple Regression Coefficients",
main = "Univariate vs. Multiple Regression Coefficients",
pch = 19, col = "steelblue", cex = 1.2)
abline(0, 1, col = "red", lty = 2, lwd = 1.5)
text(uni_coefs[common], multi_coefs[common],
labels = common, cex = 0.65, pos = 3)The coefficients differ substantially between the two model types, particularly for
noxanddis. The divergence from the diagonal confirms that simple regression coefficients are biased when predictors are correlated — the multiple regression model controls for confounders and provides more reliable estimates.
poly_results <- lapply(predictors, function(var) {
if (length(unique(Boston[[var]])) < 4) return(NULL)
fit <- lm(Boston$crim ~ poly(Boston[[var]], 3))
s <- summary(fit)
pvals <- s$coefficients[-1, 4]
data.frame(
Predictor = var,
P_quadratic = signif(pvals[2], 4),
P_cubic = signif(pvals[3], 4)
)
})
poly_df <- do.call(rbind, poly_results)
poly_df$NonLinear <- ifelse(poly_df$P_quadratic < 0.05 | poly_df$P_cubic < 0.05, "Yes", "No")
knitr::kable(poly_df, row.names = FALSE,
caption = "Polynomial Regression: Evidence of Non-linearity")| Predictor | P_quadratic | P_cubic | NonLinear |
|---|---|---|---|
| zn | 0.0044210 | 0.229500 | Yes |
| indus | 0.0010860 | 0.000000 | Yes |
| nox | 0.0000774 | 0.000000 | Yes |
| rm | 0.0015090 | 0.508600 | Yes |
| age | 0.0000023 | 0.006680 | Yes |
| dis | 0.0000000 | 0.000000 | Yes |
| rad | 0.0091210 | 0.482300 | Yes |
| tax | 0.0000037 | 0.243900 | Yes |
| ptratio | 0.0024050 | 0.006301 | Yes |
| lstat | 0.0378000 | 0.129900 | Yes |
| medv | 0.0000000 | 0.000000 | Yes |
Several predictors — including
indus,nox,age,dis, andmedv— show statistically significant quadratic or cubic terms, indicating non-linear relationships with crime rate. This suggests that linear models may be insufficient and that more flexible methods (e.g., splines or GAMs) could yield better predictive performance.
Weekly DatasetThis section uses the Weekly stock market dataset to
practice logistic regression for binary classification. The response
variable is Direction (Up/Down), and we evaluate model
performance using confusion matrices.
## Dataset dimensions: 1089 9
## Class distribution:
##
## Down Up
## 484 605
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly[, -9],
col = ifelse(Weekly$Direction == "Up", "steelblue", "tomato"),
pch = 16, cex = 0.4,
main = "Weekly Data — Blue = Up, Red = Down")plot(Weekly$Year, Weekly$Volume,
type = "l", col = "steelblue", lwd = 1.5,
xlab = "Year", ylab = "Volume",
main = "Trading Volume Over Time (1990–2010)")Key patterns: Trading volume increased substantially from 1990 to the early 2000s before declining slightly. The lag variables show no obvious visual separation between Up and Down weeks, suggesting that predicting market direction from recent returns is inherently difficult — consistent with the efficient market hypothesis.
glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly,
family = binomial)
summary(glm_full)##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Only Lag2 is statistically significant (p ≈ 0.030). All other predictors, including
Volumeand the remaining lag terms, do not contribute meaningfully when controlling for each other. The positive coefficient onLag2suggests that a positive return two weeks prior slightly increases the probability of an Up week.
prob_full <- predict(glm_full, type = "response")
pred_full <- ifelse(prob_full > 0.5, "Up", "Down")
conf_full <- table(Predicted = pred_full, Actual = Weekly$Direction)
knitr::kable(conf_full, caption = "Confusion Matrix — Full Data Logistic Regression")| Down | Up | |
|---|---|---|
| Down | 54 | 48 |
| Up | 430 | 557 |
accuracy_full <- mean(pred_full == Weekly$Direction)
cat("Overall Accuracy:", round(accuracy_full, 4), "\n")## Overall Accuracy: 0.5611
The model achieves an overall accuracy of ~56.1%, only marginally better than a naive classifier that always predicts “Up” (which would achieve ~55.6%). The confusion matrix reveals severe asymmetry: the model correctly identifies most Up weeks (true positive rate ~92%) but performs poorly on Down weeks (true positive rate ~11%). This reflects the class imbalance in the dataset and the model’s tendency to predict the majority class.
train <- Weekly$Year <= 2008
test_set <- Weekly[!train, ]
glm_lag2 <- glm(Direction ~ Lag2,
data = Weekly,
family = binomial,
subset = train)
prob_test <- predict(glm_lag2, newdata = test_set, type = "response")
pred_test <- ifelse(prob_test > 0.5, "Up", "Down")
conf_test <- table(Predicted = pred_test, Actual = test_set$Direction)
knitr::kable(conf_test, caption = "Confusion Matrix — Hold-Out Test Set (2009–2010)")| Down | Up | |
|---|---|---|
| Down | 9 | 5 |
| Up | 34 | 56 |
accuracy_test <- mean(pred_test == test_set$Direction)
cat("Test Set Accuracy:", round(accuracy_test, 4), "\n")## Test Set Accuracy: 0.625
The held-out test accuracy is approximately 62.5%, an improvement over the full-data model. Using only
Lag2as a predictor reduces overfitting and generalizes better to unseen data. The model still tends to over-predict “Up” weeks, but achieves a better balance on the 2009–2010 test period. This exercise illustrates the importance of proper train/test splitting for honest model evaluation.