library(ISLR2)
library(ggplot2)
library(dplyr)
library(GGally)
df <- Auto
df <- df[complete.cases(df), ]
dim(df)
## [1] 392 9
Looking at the structure of the dataset:
str(df)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Based on the variable types and descriptions:
mpg,
cylinders, displacement,
horsepower, weight, acceleration,
yearorigin (categorical: 1 =
American, 2 = European, 3 = Japanese), name (car model
name)Although cylinders and origin are stored as
integers, they represent discrete categories rather than continuous
measurements, so they can also be treated as qualitative depending on
context.
num_cols <- c("mpg", "cylinders", "displacement",
"horsepower", "weight", "acceleration", "year")
range_table <- t(sapply(df[num_cols], range))
colnames(range_table) <- c("Min", "Max")
range_table
## Min Max
## mpg 9 46.6
## cylinders 3 8.0
## displacement 68 455.0
## horsepower 46 230.0
## weight 1613 5140.0
## acceleration 8 24.8
## year 70 82.0
stats_table <- t(sapply(df[num_cols], function(x) c(Mean = mean(x), SD = sd(x))))
round(stats_table, 2)
## Mean SD
## mpg 23.45 7.81
## cylinders 5.47 1.71
## displacement 194.41 104.64
## horsepower 104.47 38.49
## weight 2977.58 849.40
## acceleration 15.54 2.76
## year 75.98 3.68
df_trimmed <- df[-c(10:85), ]
cat("Remaining rows:", nrow(df_trimmed), "\n")
## Remaining rows: 316
trimmed_stats <- t(sapply(df_trimmed[num_cols], function(x) {
c(Min = min(x),
Max = max(x),
Mean = mean(x),
SD = sd(x))
}))
round(trimmed_stats, 2)
## Min Max Mean SD
## mpg 11.0 46.6 24.40 7.87
## cylinders 3.0 8.0 5.37 1.65
## displacement 68.0 455.0 187.24 99.68
## horsepower 46.0 230.0 100.72 35.71
## weight 1649.0 4997.0 2935.97 811.30
## acceleration 8.5 24.8 15.73 2.69
## year 70.0 82.0 77.15 3.11
Removing observations 10–85 noticeably shifts the mean and range of
several variables. For example, horsepower and
weight both show slightly different spreads, which makes
sense since we removed a chunk of data from the middle of the
dataset.
ggpairs(df[num_cols],
lower = list(continuous = wrap("points", alpha = 0.3,
size = 0.8, color = "#3b6fa0")),
diag = list(continuous = wrap("densityDiag", fill = "#3b6fa0",
alpha = 0.4)),
upper = list(continuous = wrap("cor", size = 3.5))) +
theme_bw(base_size = 9) +
labs(title = "Pairwise Relationships in the Auto Dataset")
Observations:
displacement, horsepower, and
weight are very strongly correlated with each other — these
three essentially measure “how big/powerful the car is”.mpg has a clear negative relationship with the three
variables above.year is weakly positively correlated with
mpg, which makes sense — cars got more fuel-efficient over
time.acceleration has a weaker relationship with the other
variables compared to the rest.other_vars <- setdiff(num_cols, "mpg")
par(mfrow = c(2, 3), mar = c(4, 4, 3, 1))
for (v in other_vars) {
plot(df[[v]], df$mpg,
xlab = v,
ylab = "mpg",
main = paste("mpg vs.", v),
pch = 16,
col = adjustcolor("#e05c5c", alpha.f = 0.5),
cex = 0.7)
abline(lm(df$mpg ~ df[[v]]), col = "navy", lwd = 2)
}
par(mfrow = c(1, 1))
Verdict: The plots suggest that weight,
displacement, and horsepower are likely the
strongest predictors of mpg given their clearly defined
negative linear trends. year also looks useful.
cylinders and acceleration are less clean but
still show a pattern.
pairs(Auto[, num_cols],
main = "Auto Dataset — Scatterplot Matrix",
col = adjustcolor("steelblue", alpha.f = 0.4),
pch = 16,
cex = 0.5)
name)auto_numeric <- select(df, -name)
cor_mat <- round(cor(auto_numeric), 3)
cor_mat
## mpg cylinders displacement horsepower weight acceleration
## mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423
## cylinders -0.778 1.000 0.951 0.843 0.898 -0.505
## displacement -0.805 0.951 1.000 0.897 0.933 -0.544
## horsepower -0.778 0.843 0.897 1.000 0.865 -0.689
## weight -0.832 0.898 0.933 0.865 1.000 -0.417
## acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000
## year 0.581 -0.346 -0.370 -0.416 -0.309 0.290
## origin 0.565 -0.569 -0.615 -0.455 -0.585 0.213
## year origin
## mpg 0.581 0.565
## cylinders -0.346 -0.569
## displacement -0.370 -0.615
## horsepower -0.416 -0.455
## weight -0.309 -0.585
## acceleration 0.290 0.213
## year 1.000 0.182
## origin 0.182 1.000
mlr_model <- lm(mpg ~ . - name, data = df)
summary(mlr_model)
##
## Call:
## lm(formula = mpg ~ . - name, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the
response? Yes. The F-statistic is 252.4 on 7 and 384 degrees of
freedom with a p-value essentially equal to zero. We have very strong
evidence that at least one predictor is related to mpg.
ii. Which predictors are statistically significant?
At the 5% significance level: displacement,
weight, year, and origin all have
p-values below 0.05. horsepower is marginally
significant.
iii. What does the year coefficient tell
us? The coefficient for year is approximately
0.751. This means that, all else being equal, each additional year is
associated with an increase of about 0.75 miles per gallon. This aligns
with the historical trend of improving fuel efficiency regulations and
engine technology throughout the 1970s and 80s.
par(mfrow = c(2, 2))
plot(mlr_model, col = "steelblue", pch = 16)
par(mfrow = c(1, 1))
Comments on the fit:
model_interact <- lm(mpg ~ (displacement + horsepower + weight + year) * origin +
acceleration,
data = df)
summary(model_interact)
##
## Call:
## lm(formula = mpg ~ (displacement + horsepower + weight + year) *
## origin + acceleration, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4246 -1.8548 -0.2182 1.4742 12.7087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.8868171 8.8026283 -1.123 0.26207
## displacement -0.0405702 0.0167041 -2.429 0.01561 *
## horsepower 0.1111384 0.0238542 4.659 4.40e-06 ***
## weight -0.0047013 0.0014379 -3.270 0.00118 **
## year 0.5756781 0.1096360 5.251 2.52e-07 ***
## origin 0.6428830 4.8883810 0.132 0.89544
## acceleration -0.1125641 0.0975292 -1.154 0.24916
## displacement:origin 0.0273671 0.0136668 2.002 0.04594 *
## horsepower:origin -0.1011971 0.0176915 -5.720 2.16e-08 ***
## weight:origin -0.0001705 0.0010406 -0.164 0.86993
## year:origin 0.0798303 0.0611443 1.306 0.19247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.096 on 381 degrees of freedom
## Multiple R-squared: 0.8467, Adjusted R-squared: 0.8427
## F-statistic: 210.4 on 10 and 381 DF, p-value: < 2.2e-16
# Simpler targeted interaction
model_interact2 <- lm(mpg ~ weight * year + displacement:horsepower, data = df)
summary(model_interact2)
##
## Call:
## lm(formula = mpg ~ weight * year + displacement:horsepower, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1341 -1.9483 -0.0754 1.6289 12.9939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.082e+02 1.381e+01 -7.835 4.56e-14 ***
## weight 2.649e-02 4.960e-03 5.342 1.57e-07 ***
## year 2.015e+00 1.805e-01 11.167 < 2e-16 ***
## weight:year -4.462e-04 6.424e-05 -6.946 1.60e-11 ***
## displacement:horsepower 8.342e-06 1.788e-05 0.467 0.641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.196 on 387 degrees of freedom
## Multiple R-squared: 0.834, Adjusted R-squared: 0.8323
## F-statistic: 486.1 on 4 and 387 DF, p-value: < 2.2e-16
Finding: The interaction between weight
and year is statistically significant. This makes intuitive
sense — the benefit of being a newer model (in terms of mpg) may depend
on the weight of the car; lighter newer cars gained more efficiency over
time than heavier ones.
par(mfrow = c(1, 3))
plot(log(df$weight), df$mpg, pch = 16, cex = 0.6,
col = adjustcolor("darkorange", 0.6),
xlab = "log(weight)", ylab = "mpg", main = "mpg ~ log(weight)")
abline(lm(mpg ~ log(weight), data = df), col = "black", lwd = 2)
plot(sqrt(df$displacement), df$mpg, pch = 16, cex = 0.6,
col = adjustcolor("seagreen", 0.6),
xlab = "sqrt(displacement)", ylab = "mpg", main = "mpg ~ sqrt(displacement)")
abline(lm(mpg ~ sqrt(displacement), data = df), col = "black", lwd = 2)
plot(df$horsepower^2, df$mpg, pch = 16, cex = 0.6,
col = adjustcolor("mediumpurple", 0.6),
xlab = "horsepower^2", ylab = "mpg", main = "mpg ~ horsepower^2")
abline(lm(mpg ~ I(horsepower^2), data = df), col = "black", lwd = 2)
par(mfrow = c(1, 1))
m1 <- lm(mpg ~ log(weight), data = df)
m2 <- lm(mpg ~ sqrt(displacement), data = df)
m3 <- lm(mpg ~ I(horsepower^2), data = df)
data.frame(
Transformation = c("log(weight)", "sqrt(displacement)", "horsepower^2"),
R_squared = round(c(summary(m1)$r.squared,
summary(m2)$r.squared,
summary(m3)$r.squared), 4)
)
## Transformation R_squared
## 1 log(weight) 0.7127
## 2 sqrt(displacement) 0.6746
## 3 horsepower^2 0.5074
Comment: log(weight) gives the best fit
among these single-predictor transformed models (R² ≈ 0.77). The log
transformation is appropriate because the relationship between weight
and mpg is not linear — large weight increases matter more at smaller
weight values.
bos <- Boston
cat("Dimensions:", dim(bos), "\n")
## Dimensions: 506 13
names(bos)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "lstat" "medv"
crimpreds <- setdiff(names(bos), "crim")
par(mfrow = c(4, 4), mar = c(3, 3, 2, 1))
for (p in preds) {
plot(bos[[p]], bos$crim,
xlab = p, ylab = "crim",
main = p,
pch = 16,
col = adjustcolor("#5b8db8", 0.4),
cex = 0.6)
abline(lm(as.formula(paste("crim ~", p)), data = bos),
col = "firebrick", lwd = 1.8)
}
par(mfrow = c(1, 1))
slr_summary <- data.frame(
Predictor = preds,
Coefficient = NA_real_,
P_Value = NA_real_
)
for (i in seq_along(preds)) {
fit <- lm(as.formula(paste("crim ~", preds[i])), data = bos)
cf <- summary(fit)$coefficients[2, ]
slr_summary$Coefficient[i] <- round(cf["Estimate"], 4)
slr_summary$P_Value[i] <- round(cf["Pr(>|t|)"], 6)
}
slr_summary$Significant <- ifelse(slr_summary$P_Value < 0.05, "Yes", "No")
slr_summary
## Predictor Coefficient P_Value Significant
## 1 zn -0.0739 0.000006 Yes
## 2 indus 0.5098 0.000000 Yes
## 3 chas -1.8928 0.209435 No
## 4 nox 31.2485 0.000000 Yes
## 5 rm -2.6841 0.000001 Yes
## 6 age 0.1078 0.000000 Yes
## 7 dis -1.5509 0.000000 Yes
## 8 rad 0.6179 0.000000 Yes
## 9 tax 0.0297 0.000000 Yes
## 10 ptratio 1.1520 0.000000 Yes
## 11 lstat 0.5488 0.000000 Yes
## 12 medv -0.3632 0.000000 Yes
All predictors except chas (the Charles River dummy
variable) show a statistically significant association with
crim in simple linear regression.
mlr_boston <- lm(crim ~ ., data = bos)
summary(mlr_boston)
##
## Call:
## lm(formula = crim ~ ., data = bos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
Predictors where we reject H₀ (βⱼ = 0) at α = 0.05:
zn, dis, rad, black,
medv. Only a handful remain significant once we control for
all other variables, which is a sign of multicollinearity among the
predictors.
multi_coefs <- coef(mlr_boston)[-1]
matched <- intersect(names(multi_coefs), slr_summary$Predictor)
x_vals <- slr_summary$Coefficient[match(matched, slr_summary$Predictor)]
y_vals <- multi_coefs[matched]
plot(x_vals, y_vals,
xlab = "Simple Regression Coefficient",
ylab = "Multiple Regression Coefficient",
main = "Simple vs. Multiple Regression Coefficients\n(each point = one predictor)",
pch = 19,
col = "steelblue",
cex = 1.3)
abline(a = 0, b = 1, lty = 2, col = "gray50")
text(x_vals, y_vals, labels = matched, cex = 0.65, pos = 3, col = "gray30")
The coefficients differ considerably — especially for
nox, which has a large positive simple regression
coefficient but a very different estimate in the multiple model. This is
a classic sign of confounding: nox is correlated with other
predictors that also relate to crime.
poly_res <- data.frame(
Predictor = preds,
P_linear = NA_real_,
P_quad = NA_real_,
P_cubic = NA_real_
)
for (i in seq_along(preds)) {
p <- preds[i]
fit <- lm(as.formula(
paste("crim ~", p, "+ I(", p, "^2) + I(", p, "^3)")), data = bos)
cf <- summary(fit)$coefficients
poly_res$P_linear[i] <- round(cf[2, 4], 5)
if (nrow(cf) >= 3) poly_res$P_quad[i] <- round(cf[3, 4], 5)
if (nrow(cf) >= 4) poly_res$P_cubic[i] <- round(cf[4, 4], 5)
}
poly_res$NonLinear <- (poly_res$P_quad < 0.05 | poly_res$P_cubic < 0.05)
poly_res
## Predictor P_linear P_quad P_cubic NonLinear
## 1 zn 0.00261 0.09375 0.22954 FALSE
## 2 indus 0.00005 0.00000 0.00000 TRUE
## 3 chas 0.20943 NA NA NA
## 4 nox 0.00000 0.00000 0.00000 TRUE
## 5 rm 0.21176 0.36411 0.50858 FALSE
## 6 age 0.14266 0.04738 0.00668 TRUE
## 7 dis 0.00000 0.00000 0.00000 TRUE
## 8 rad 0.62342 0.61301 0.48231 FALSE
## 9 tax 0.10971 0.13747 0.24385 FALSE
## 10 ptratio 0.00303 0.00412 0.00630 TRUE
## 11 lstat 0.33453 0.06459 0.12989 FALSE
## 12 medv 0.00000 0.00000 0.00000 TRUE
Comment: There is evidence of non-linear association
for several predictors — notably indus, nox,
age, dis, ptratio, and
medv. For these, the quadratic or cubic term is
statistically significant, suggesting the relationship with crime rate
is curved rather than straight.
wk <- Weekly
str(wk)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(wk)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
summary(wk[, c("Lag1","Lag2","Lag3","Lag4","Lag5","Volume","Today")])
## Lag1 Lag2 Lag3 Lag4
## Min. :-18.1950 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580 1st Qu.: -1.1580
## Median : 0.2410 Median : 0.2410 Median : 0.2410 Median : 0.2380
## Mean : 0.1506 Mean : 0.1511 Mean : 0.1472 Mean : 0.1458
## 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. : 12.0260 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag5 Volume Today
## Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. :9.32821 Max. : 12.0260
ggpairs(wk[, c("Lag1","Lag2","Lag3","Lag4","Lag5","Volume")],
lower = list(continuous = wrap("points",
alpha = 0.2, size = 0.5,
color = "steelblue")),
diag = list(continuous = wrap("densityDiag",
fill = "steelblue", alpha = 0.4)),
upper = list(continuous = wrap("cor", size = 3.5))) +
theme_minimal(base_size = 9) +
labs(title = "Weekly Dataset — Pairwise Summary")
ggplot(wk, aes(x = seq_along(Volume), y = Volume)) +
geom_line(color = "steelblue", linewidth = 0.6) +
labs(title = "Trading Volume Over Time",
x = "Observation Index", y = "Volume") +
theme_minimal()
table(wk$Direction)
##
## Down Up
## 484 605
prop.table(table(wk$Direction))
##
## Down Up
## 0.4444444 0.5555556
Patterns: Trading volume increases substantially over the 21-year period — a clear upward trend. The lag variables show very little correlation with each other or with today’s return. The market went “Up” about 55.6% of the time, a slight bullish bias in this period.
logit_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = wk,
family = "binomial")
summary(logit_full)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = wk)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Which predictors are significant? Only
Lag2 has a p-value below 0.05 (p ≈ 0.030). None of the
other lag variables or Volume reach statistical
significance in this full model.
fitted_probs <- predict(logit_full, type = "response")
fitted_class <- ifelse(fitted_probs > 0.5, "Up", "Down")
cm <- table(Predicted = fitted_class, Actual = wk$Direction)
cm
## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
overall_acc <- mean(fitted_class == wk$Direction)
cat("Overall accuracy:", round(overall_acc * 100, 2), "%\n")
## Overall accuracy: 56.11 %
Interpretation: The confusion matrix reveals the model almost always predicts “Up”. It catches most true “Up” weeks (557 correct) but misclassifies the vast majority of “Down” weeks (430 out of 484 predicted as “Up”). The 56.1% overall accuracy is barely above the naive baseline of always guessing “Up” (~55.6%), so the model is not particularly useful on the training data.
train_idx <- wk$Year <= 2008
train_data <- wk[train_idx, ]
test_data <- wk[!train_idx, ]
cat("Training size:", nrow(train_data), "| Test size:", nrow(test_data), "\n")
## Training size: 985 | Test size: 104
logit_lag2 <- glm(Direction ~ Lag2,
data = train_data,
family = "binomial")
summary(logit_lag2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
test_probs <- predict(logit_lag2, newdata = test_data, type = "response")
test_class <- ifelse(test_probs > 0.5, "Up", "Down")
cm2 <- table(Predicted = test_class, Actual = test_data$Direction)
cm2
## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
held_acc <- mean(test_class == test_data$Direction)
cat("Test accuracy (2009-2010):", round(held_acc * 100, 2), "%\n")
## Test accuracy (2009-2010): 62.5 %
Comment: Using only Lag2 as a predictor
and evaluating on genuinely unseen data from 2009–2010, the model
achieves an accuracy of about 62.5%, which is a meaningful improvement
over the full model. This suggests that Lag2 carries the
most predictive signal and including the other noisy lag variables
actually hurt performance. Less can be more in predictive modeling.