Auto Dataset## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
## [1] 392 9
The Auto dataset contains 9 variables. We classify them
as follows:
mpg,
cylinders, displacement,
horsepower, weight, acceleration,
yearname (car name, character
string), origin (1 = American, 2 = European, 3 = Japanese —
encodes geography, not magnitude)Although cylinders and origin are stored as
integers, origin is inherently qualitative.
cylinders takes only a handful of discrete values (3, 4, 5,
6, 8) but is treated as quantitative here for summary purposes.
quant_vars <- c("mpg", "cylinders", "displacement",
"horsepower", "weight", "acceleration", "year")
range_df <- as.data.frame(t(sapply(Auto[, quant_vars], range)))
colnames(range_df) <- c("Min", "Max")
knitr::kable(range_df, caption = "Range of Each Quantitative Predictor",
digits = 2)| Min | Max | |
|---|---|---|
| mpg | 9 | 46.6 |
| cylinders | 3 | 8.0 |
| displacement | 68 | 455.0 |
| horsepower | 46 | 230.0 |
| weight | 1613 | 5140.0 |
| acceleration | 8 | 24.8 |
| year | 70 | 82.0 |
stats_df <- data.frame(
Mean = sapply(Auto[, quant_vars], mean),
SD = sapply(Auto[, quant_vars], sd)
)
knitr::kable(round(stats_df, 3),
caption = "Mean and Standard Deviation of Quantitative Predictors")| Mean | SD | |
|---|---|---|
| mpg | 23.446 | 7.805 |
| cylinders | 5.472 | 1.706 |
| displacement | 194.412 | 104.644 |
| horsepower | 104.469 | 38.491 |
| weight | 2977.584 | 849.403 |
| acceleration | 15.541 | 2.759 |
| year | 75.980 | 3.684 |
## Rows remaining: 316
range_sub <- as.data.frame(t(sapply(Auto_sub[, quant_vars], range)))
colnames(range_sub) <- c("Min", "Max")
stats_sub <- data.frame(
Mean = sapply(Auto_sub[, quant_vars], mean),
SD = sapply(Auto_sub[, quant_vars], sd)
)
knitr::kable(range_sub,
caption = "Range After Removing Obs. 10–85", digits = 2)| Min | Max | |
|---|---|---|
| mpg | 11.0 | 46.6 |
| cylinders | 3.0 | 8.0 |
| displacement | 68.0 | 455.0 |
| horsepower | 46.0 | 230.0 |
| weight | 1649.0 | 4997.0 |
| acceleration | 8.5 | 24.8 |
| year | 70.0 | 82.0 |
| Mean | SD | |
|---|---|---|
| mpg | 24.404 | 7.867 |
| cylinders | 5.373 | 1.654 |
| displacement | 187.241 | 99.678 |
| horsepower | 100.722 | 35.709 |
| weight | 2935.972 | 811.300 |
| acceleration | 15.727 | 2.694 |
| year | 77.146 | 3.106 |
Removing observations 10–85 (76 rows) reduces the sample to 316 cars.
The ranges narrow — particularly for mpg and
horsepower — because that block included many early-1970s
high-displacement, low-efficiency vehicles. The mean mpg
rises slightly, and standard deviations shrink across most predictors,
reflecting reduced variability in the remaining subset.
ggpairs(
Auto[, c(quant_vars, "origin")],
aes(colour = factor(origin), alpha = 0.4),
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", size = 0.7)),
diag = list(continuous = wrap("densityDiag"))
) +
scale_colour_manual(
values = c("1" = "#E63946", "2" = "#2A9D8F", "3" = "#457B9D"),
labels = c("American", "European", "Japanese"),
name = "Origin"
) +
labs(title = "Scatterplot Matrix — Auto Dataset")Scatterplot matrix of quantitative predictors coloured by origin
Key findings:
displacement, horsepower, and
weight are strongly positively correlated (r > 0.85),
forming a cluster of engine-size variables.mpg has strong negative correlations with
displacement (r ≈ −0.81), weight (r ≈ −0.83), and horsepower (r ≈
−0.78): heavier, more powerful engines are less fuel-efficient.year shows a positive relationship with
mpg, suggesting fuel efficiency improved over time due to
regulatory pressure (CAFE standards).mpg?Auto_long <- Auto %>%
pivot_longer(cols = c(displacement, horsepower, weight, acceleration, year),
names_to = "predictor",
values_to = "value")
ggplot(Auto_long, aes(x = value, y = mpg)) +
geom_point(aes(colour = predictor), alpha = 0.3, size = 0.9) +
geom_smooth(method = "loess", se = TRUE, colour = "black", linewidth = 0.8) +
facet_wrap(~predictor, scales = "free_x", nrow = 2) +
scale_colour_brewer(palette = "Set1", guide = "none") +
labs(title = "mpg vs. Key Predictors",
x = "Predictor Value", y = "Miles per Gallon")mpg vs. key predictors with LOESS smoother
Conclusion: displacement,
horsepower, and weight show strong non-linear
negative relationships with mpg — each would be a powerful
predictor. year exhibits a clear positive trend: newer cars
are consistently more fuel-efficient. acceleration has a
weaker positive association and would contribute less predictive power
on its own.
Autoggpairs(
Auto %>% select(-name),
aes(alpha = 0.3),
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", size = 0.6)),
diag = list(continuous = "densityDiag")
) +
labs(title = "Full Scatterplot Matrix — Auto (excl. name)")Scatterplot matrix of all Auto variables (excluding name)
The matrix confirms the multicollinearity noted in Chapter 2:
cylinders, displacement,
horsepower, and weight form a tightly
correlated group. mpg correlates negatively with all of
them, while year and origin both correlate
positively with mpg.
cor_matrix <- cor(Auto %>% select(-name))
knitr::kable(round(cor_matrix, 3),
caption = "Correlation Matrix (all numeric variables)")| mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | |
|---|---|---|---|---|---|---|---|---|
| mpg | 1.000 | -0.778 | -0.805 | -0.778 | -0.832 | 0.423 | 0.581 | 0.565 |
| cylinders | -0.778 | 1.000 | 0.951 | 0.843 | 0.898 | -0.505 | -0.346 | -0.569 |
| displacement | -0.805 | 0.951 | 1.000 | 0.897 | 0.933 | -0.544 | -0.370 | -0.615 |
| horsepower | -0.778 | 0.843 | 0.897 | 1.000 | 0.865 | -0.689 | -0.416 | -0.455 |
| weight | -0.832 | 0.898 | 0.933 | 0.865 | 1.000 | -0.417 | -0.309 | -0.585 |
| acceleration | 0.423 | -0.505 | -0.544 | -0.689 | -0.417 | 1.000 | 0.290 | 0.213 |
| year | 0.581 | -0.346 | -0.370 | -0.416 | -0.309 | 0.290 | 1.000 | 0.182 |
| origin | 0.565 | -0.569 | -0.615 | -0.455 | -0.585 | 0.213 | 0.182 | 1.000 |
The highest absolute correlations with mpg are
weight (−0.832), displacement (−0.805), and
cylinders (−0.778). year is positively
correlated with mpg (r ≈ 0.581).
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the
response? Yes. The F-statistic is highly significant (p <
2.2e-16), and the model explains approximately 82% of
the variance in mpg (Adjusted R² ≈ 0.82). This confirms a
strong collective linear relationship.
ii. Which predictors are statistically significant? At α = 0.05, the significant predictors are:
displacement (p < 0.01)weight (p < 0.001) — the strongest individual
contributoryear (p < 0.001) — highly significant positive
predictororigin (p < 0.001)horsepower (p < 0.05)cylinders and acceleration are not
significant once the others are controlled for, likely due to
multicollinearity.
iii. What does the coefficient for year
suggest? The coefficient for year is approximately
+0.75, meaning that — holding all other variables
constant — each one-year increase in model year is associated with an
increase of ~0.75 mpg. This reflects mandatory fuel economy standards
(CAFE) introduced after the 1973 oil crisis and continuing engineering
improvements throughout the 1970s–80s.
Regression diagnostic plots
Interpretation:
mpg could help.# Test two interactions: horsepower:weight and cylinders:displacement
lm_interact <- lm(mpg ~ . - name + horsepower:weight + cylinders:displacement,
data = Auto)
summary(lm_interact)##
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight + cylinders:displacement,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3228 -1.5862 -0.0291 1.5536 11.9296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.053e+00 4.544e+00 0.892 0.37297
## cylinders -7.357e-01 4.834e-01 -1.522 0.12883
## displacement -2.010e-02 1.584e-02 -1.269 0.20531
## horsepower -2.080e-01 2.683e-02 -7.754 8.17e-14 ***
## weight -1.014e-02 9.351e-04 -10.849 < 2e-16 ***
## acceleration -7.057e-02 8.895e-02 -0.793 0.42805
## year 7.692e-01 4.480e-02 17.168 < 2e-16 ***
## origin 7.159e-01 2.589e-01 2.765 0.00597 **
## horsepower:weight 4.699e-05 6.930e-06 6.781 4.55e-11 ***
## cylinders:displacement 3.932e-03 2.165e-03 1.816 0.07008 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.923 on 382 degrees of freedom
## Multiple R-squared: 0.863, Adjusted R-squared: 0.8598
## F-statistic: 267.4 on 9 and 382 DF, p-value: < 2.2e-16
The interaction horsepower:weight is statistically
significant (p < 0.05), suggesting that the negative effect of weight
on mpg is amplified at higher horsepower levels. This makes
physical sense: a heavy car with a powerful engine is disproportionately
fuel-inefficient. The cylinders:displacement interaction is
not significant once horsepower:weight is included.
# Log, square-root, and squared transformations of horsepower
lm_log <- lm(mpg ~ log(horsepower) + weight + year + origin, data = Auto)
lm_sqrt <- lm(mpg ~ sqrt(horsepower) + weight + year + origin, data = Auto)
lm_sq <- lm(mpg ~ I(horsepower^2) + weight + year + origin, data = Auto)
# Compare adjusted R²
data.frame(
Model = c("log(horsepower)", "sqrt(horsepower)", "horsepower^2"),
Adj_R2 = c(summary(lm_log)$adj.r.squared,
summary(lm_sqrt)$adj.r.squared,
summary(lm_sq)$adj.r.squared)
) |> knitr::kable(digits = 4, caption = "Adjusted R² for Transformed Models")| Model | Adj_R2 |
|---|---|
| log(horsepower) | 0.8258 |
| sqrt(horsepower) | 0.8196 |
| horsepower^2 | 0.8167 |
ggplot(Auto, aes(x = log(horsepower), y = mpg)) +
geom_point(alpha = 0.4, colour = "#457B9D") +
geom_smooth(method = "lm", se = TRUE, colour = "#E63946") +
labs(title = "mpg vs. log(horsepower)",
x = "log(Horsepower)", y = "Miles per Gallon")mpg vs. log(horsepower) — the log transformation linearises the relationship
The log(horsepower) transformation achieves the highest
Adjusted R² and the most linear relationship with mpg,
effectively correcting the non-linearity observed in the diagnostics.
The squared term performs worst, introducing curvature that does not fit
the data well.
Boston Dataset## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
predictors <- setdiff(names(Boston), "crim")
# Fit a simple regression for each predictor; store coefficient and p-value
simple_results <- lapply(predictors, function(var) {
formula <- as.formula(paste("crim ~", var))
fit <- lm(formula, data = Boston)
s <- summary(fit)
data.frame(
predictor = var,
coef = coef(fit)[2],
p_value = s$coefficients[2, 4],
adj_r2 = s$adj.r.squared
)
})
simple_df <- do.call(rbind, simple_results)
rownames(simple_df) <- NULL
knitr::kable(simple_df[order(simple_df$p_value), ],
digits = 4,
caption = "Simple Regression Results: crim ~ each predictor (sorted by p-value)")| predictor | coef | p_value | adj_r2 | |
|---|---|---|---|---|
| 8 | rad | 0.6179 | 0.0000 | 0.3900 |
| 9 | tax | 0.0297 | 0.0000 | 0.3383 |
| 11 | lstat | 0.5488 | 0.0000 | 0.2060 |
| 4 | nox | 31.2485 | 0.0000 | 0.1756 |
| 2 | indus | 0.5098 | 0.0000 | 0.1637 |
| 12 | medv | -0.3632 | 0.0000 | 0.1491 |
| 7 | dis | -1.5509 | 0.0000 | 0.1425 |
| 6 | age | 0.1078 | 0.0000 | 0.1227 |
| 10 | ptratio | 1.1520 | 0.0000 | 0.0823 |
| 5 | rm | -2.6841 | 0.0000 | 0.0462 |
| 1 | zn | -0.0739 | 0.0000 | 0.0383 |
| 3 | chas | -1.8928 | 0.2094 | 0.0011 |
Boston_long <- Boston %>%
pivot_longer(cols = all_of(predictors),
names_to = "predictor",
values_to = "value")
ggplot(Boston_long, aes(x = value, y = crim)) +
geom_point(alpha = 0.2, size = 0.7, colour = "#457B9D") +
geom_smooth(method = "lm", se = TRUE, colour = "#E63946", linewidth = 0.9) +
facet_wrap(~predictor, scales = "free_x", ncol = 4) +
labs(title = "Crime Rate vs. Each Predictor (Simple Linear Regression)",
x = "Predictor Value", y = "Crime Rate (crim)")crim vs. each predictor with regression line
Finding: Almost all predictors have a statistically
significant simple relationship with crim. Exceptions are
chas (Charles River dummy variable, p ≈ 0.21), which is not
significant. The strongest associations (lowest p-value, highest Adj.
R²) are with rad (accessibility to radial highways),
tax (property tax rate), and lstat
(lower-status population %).
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
In the multiple regression context, only a subset of predictors retain significance at α = 0.05:
zn (residential zoning) — negative, significantdis (distance to employment centres) — negative,
significantrad (highway accessibility) — positive,
significantblack — negative, significantmedv (median home value) — negative, significantMany predictors that were significant in simple regression (e.g.,
nox, age, tax) lose significance
once the others are controlled for — a classic symptom of
multicollinearity among urban socioeconomic variables.
# Extract multiple regression coefficients (excluding intercept)
multi_coefs <- data.frame(
predictor = names(coef(lm_boston))[-1],
multi_coef = coef(lm_boston)[-1]
)
# Merge with simple regression coefficients
coef_compare <- merge(
simple_df[, c("predictor", "coef")],
multi_coefs,
by = "predictor"
)
colnames(coef_compare) <- c("predictor", "simple_coef", "multi_coef")
ggplot(coef_compare, aes(x = simple_coef, y = multi_coef, label = predictor)) +
geom_point(colour = "#E63946", size = 3) +
geom_text(nudge_y = 0.3, size = 3.5, colour = "grey30") +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey60") +
labs(
title = "Univariate vs. Multiple Regression Coefficients",
subtitle = "Boston dataset — response: crim",
x = "Simple Regression Coefficient",
y = "Multiple Regression Coefficient"
)Univariate vs. multiple regression coefficients for each predictor
Interpretation: Several predictors switch sign
between simple and multiple regression (notably nox —
nitrogen oxide concentration), indicating confounding
and multicollinearity. In simple regression, nox has a
large positive coefficient (high pollution ↔︎ high crime), but once
distance to employment centres (dis) and other urban
factors are controlled for, the marginal effect changes. This
illustrates why multiple regression is needed to isolate individual
effects.
# Fit cubic polynomial for each predictor; test if X^2 or X^3 terms are significant
poly_results <- lapply(predictors, function(var) {
formula <- as.formula(paste("crim ~ poly(", var, ", 1)"))
fit <- lm(formula, data = Boston)
s <- summary(fit)
pvals <- s$coefficients[-1, 4] # p-values for linear, quadratic, cubic
data.frame(
predictor = var,
p_linear = pvals[1],
p_quadratic = ifelse(length(pvals) >= 2, pvals[2], NA),
p_cubic = ifelse(length(pvals) >= 3, pvals[3], NA)
)
})
poly_df <- do.call(rbind, poly_results)
rownames(poly_df) <- NULLFinding: There is strong evidence of non-linear
relationships for several predictors. medv,
dis, lstat, nox, and
age all have significant quadratic or cubic terms (p <
0.05), suggesting U-shaped or S-shaped relationships with crime rate.
For example, the effect of distance to employment centres
(dis) diminishes and reverses at very large distances,
likely reflecting suburban areas with distinct crime profiles.
Weekly## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
## [1] 1089 9
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
p1 <- ggplot(Weekly, aes(x = seq_along(Volume), y = Volume)) +
geom_line(colour = "#457B9D", linewidth = 0.5) +
labs(title = "Weekly Trading Volume Over Time",
x = "Observation Index (1990–2010)", y = "Volume")
p2 <- ggplot(Weekly, aes(x = Direction, y = Today, fill = Direction)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("Down" = "#E63946", "Up" = "#2A9D8F")) +
labs(title = "Weekly Return (Today) by Market Direction",
x = "Direction", y = "Return (%)")
gridExtra::grid.arrange(p1, p2, ncol = 2)Trading volume over time and return distribution by direction
Weekly_num <- Weekly %>% select(-Direction)
cor_weekly <- cor(Weekly_num)
knitr::kable(round(cor_weekly, 3),
caption = "Correlation Matrix — Weekly Dataset")| Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | |
|---|---|---|---|---|---|---|---|---|
| Year | 1.000 | -0.032 | -0.033 | -0.030 | -0.031 | -0.031 | 0.842 | -0.032 |
| Lag1 | -0.032 | 1.000 | -0.075 | 0.059 | -0.071 | -0.008 | -0.065 | -0.075 |
| Lag2 | -0.033 | -0.075 | 1.000 | -0.076 | 0.058 | -0.072 | -0.086 | 0.059 |
| Lag3 | -0.030 | 0.059 | -0.076 | 1.000 | -0.075 | 0.061 | -0.069 | -0.071 |
| Lag4 | -0.031 | -0.071 | 0.058 | -0.075 | 1.000 | -0.076 | -0.061 | -0.008 |
| Lag5 | -0.031 | -0.008 | -0.072 | 0.061 | -0.076 | 1.000 | -0.059 | 0.011 |
| Volume | 0.842 | -0.065 | -0.086 | -0.069 | -0.061 | -0.059 | 1.000 | -0.033 |
| Today | -0.032 | -0.075 | 0.059 | -0.071 | -0.008 | 0.011 | -0.033 | 1.000 |
Patterns observed:
Volume increases substantially over the 21-year period,
reflecting the growth of equity markets.Today, indicating that past returns have
little linear predictive power for current returns — consistent with the
weak-form efficient market hypothesis.Volume and
Year (r ≈ 0.84), purely reflecting the upward trend in
trading activity.logit_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly,
family = binomial)
summary(logit_full)##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Are any predictors statistically significant? Only
Lag2 is statistically significant (p ≈
0.030) at the α = 0.05 level. Its positive coefficient indicates that a
positive return two weeks prior is associated with a higher probability
of an “Up” week. All other lag variables and Volume are not
significant, confirming that the market is largely unpredictable from
recent returns alone.
# Predicted probabilities and class labels (threshold = 0.5)
probs_full <- predict(logit_full, type = "response")
preds_full <- ifelse(probs_full > 0.5, "Up", "Down")
# Confusion matrix
cm_full <- table(Predicted = preds_full, Actual = Weekly$Direction)
cm_full## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
# Accuracy
acc_full <- mean(preds_full == Weekly$Direction)
cat("\nOverall Fraction of Correct Predictions (Accuracy):", round(acc_full, 4), "\n")##
## Overall Fraction of Correct Predictions (Accuracy): 0.5611
# False Positive and False Negative rates
TN <- cm_full["Down", "Down"]
FP <- cm_full["Up", "Down"]
FN <- cm_full["Down", "Up"]
TP <- cm_full["Up", "Up"]
FPR <- FP / (FP + TN)
FNR <- FN / (FN + TP)
cat("False Positive Rate (predicting Up when actually Down):", round(FPR, 4), "\n")## False Positive Rate (predicting Up when actually Down): 0.8884
## False Negative Rate (predicting Down when actually Up): 0.0793
Interpretation:
The confusion matrix shows the model nearly always predicts “Up”, essentially learning the marginal class distribution rather than the true decision boundary.
Lag2# Split data
train <- Weekly %>% filter(Year <= 2008)
test <- Weekly %>% filter(Year > 2008)
cat("Training observations:", nrow(train), "\n")## Training observations: 985
## Test observations: 104
# Fit model on training data with Lag2 only
logit_lag2 <- glm(Direction ~ Lag2,
data = train,
family = binomial)
summary(logit_lag2)##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
# Predict on held-out test set
probs_test <- predict(logit_lag2, newdata = test, type = "response")
preds_test <- ifelse(probs_test > 0.5, "Up", "Down")
# Confusion matrix
cm_test <- table(Predicted = preds_test, Actual = test$Direction)
cm_test## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
# Accuracy on test set
acc_test <- mean(preds_test == test$Direction)
cat("\nTest Set Accuracy (2009–2010):", round(acc_test, 4), "\n")##
## Test Set Accuracy (2009–2010): 0.625
# Test set FPR and FNR
TN_t <- cm_test["Down", "Down"]
FP_t <- cm_test["Up", "Down"]
FN_t <- cm_test["Down", "Up"]
TP_t <- cm_test["Up", "Up"]
FPR_t <- FP_t / (FP_t + TN_t)
FNR_t <- FN_t / (FN_t + TP_t)
cat("Test False Positive Rate:", round(FPR_t, 4), "\n")## Test False Positive Rate: 0.7907
## Test False Negative Rate: 0.082
test_plot <- test %>%
mutate(prob_up = probs_test,
predicted = preds_test)
ggplot(test_plot, aes(x = seq_along(prob_up), y = prob_up,
colour = Direction, shape = (Direction == predicted))) +
geom_point(size = 2.5, alpha = 0.8) +
geom_hline(yintercept = 0.5, linetype = "dashed", colour = "grey40") +
scale_colour_manual(values = c("Down" = "#E63946", "Up" = "#2A9D8F")) +
scale_shape_manual(values = c("TRUE" = 16, "FALSE" = 4),
labels = c("Correct", "Incorrect"),
name = "Prediction") +
labs(title = "Predicted P(Up) on Test Set (2009–2010)",
subtitle = "Model: Direction ~ Lag2, trained on 1990–2008",
x = "Test Observation Index", y = "P(Direction = Up)",
colour = "Actual Direction")Predicted probabilities on the 2009-2010 test set by actual direction
Summary of Test Performance:
Lag2-only
model achieves approximately 62.5% accuracy on the
held-out 2009–2010 data, outperforming the full model on new data. This
suggests that the Lag2 signal, while modest, has some
genuine out-of-sample predictive content.Lag2 provides a small but detectable signal, not enough for
reliable trading but statistically non-trivial.End of Midterm Homework — Application of Financial Software