This exercise involves the Auto data set. We first
ensure that missing values have been removed.
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
## [1] 392 9
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
## mpg cylinders displacement horsepower weight acceleration
## "numeric" "integer" "numeric" "integer" "integer" "numeric"
## year origin name
## "integer" "integer" "factor"
Quantitative predictors: mpg,
cylinders, displacement,
horsepower, weight, acceleration,
year
Qualitative predictors: origin,
name
Note: origin is coded as numeric (1, 2, 3) but
represents categorical groups (1 = American, 2 = European, 3 =
Japanese), so it is qualitative. cylinders could also be
considered qualitative since it takes only a few discrete values, but we
treat it as quantitative here as is conventional.
quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
"weight", "acceleration", "year")
sapply(Auto[, quant_vars], range)## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
range_df <- data.frame(
Variable = quant_vars,
Min = sapply(Auto[, quant_vars], min),
Max = sapply(Auto[, quant_vars], max)
)
knitr::kable(range_df, row.names = FALSE, caption = "Range of Quantitative Predictors")| Variable | Min | Max |
|---|---|---|
| mpg | 9 | 46.6 |
| cylinders | 3 | 8.0 |
| displacement | 68 | 455.0 |
| horsepower | 46 | 230.0 |
| weight | 1613 | 5140.0 |
| acceleration | 8 | 24.8 |
| year | 70 | 82.0 |
stats_df <- data.frame(
Variable = quant_vars,
Mean = round(sapply(Auto[, quant_vars], mean), 2),
SD = round(sapply(Auto[, quant_vars], sd), 2)
)
knitr::kable(stats_df, row.names = FALSE, caption = "Mean and Standard Deviation")| Variable | Mean | SD |
|---|---|---|
| mpg | 23.45 | 7.81 |
| cylinders | 5.47 | 1.71 |
| displacement | 194.41 | 104.64 |
| horsepower | 104.47 | 38.49 |
| weight | 2977.58 | 849.40 |
| acceleration | 15.54 | 2.76 |
| year | 75.98 | 3.68 |
## Original observations: 392
## After removal: 316
stats_sub <- data.frame(
Variable = quant_vars,
Range_Min = sapply(Auto_sub[, quant_vars], min),
Range_Max = sapply(Auto_sub[, quant_vars], max),
Mean = round(sapply(Auto_sub[, quant_vars], mean), 2),
SD = round(sapply(Auto_sub[, quant_vars], sd), 2)
)
knitr::kable(stats_sub, row.names = FALSE,
caption = "Statistics After Removing Observations 10-85")| Variable | Range_Min | Range_Max | Mean | SD |
|---|---|---|---|---|
| mpg | 11.0 | 46.6 | 24.40 | 7.87 |
| cylinders | 3.0 | 8.0 | 5.37 | 1.65 |
| displacement | 68.0 | 455.0 | 187.24 | 99.68 |
| horsepower | 46.0 | 230.0 | 100.72 | 35.71 |
| weight | 1649.0 | 4997.0 | 2935.97 | 811.30 |
| acceleration | 8.5 | 24.8 | 15.73 | 2.69 |
| year | 70.0 | 82.0 | 77.15 | 3.11 |
Compared to part (c), removing observations 10–85 changes the
statistics slightly. For some variables (e.g.,
displacement, horsepower,
weight), the means increase because many of the removed
observations were smaller, fuel-efficient cars.
par(mfrow = c(2, 3))
plot(Auto$displacement, Auto$mpg, pch = 20, col = "steelblue",
xlab = "Displacement", ylab = "MPG", main = "MPG vs Displacement")
plot(Auto$horsepower, Auto$mpg, pch = 20, col = "darkorange",
xlab = "Horsepower", ylab = "MPG", main = "MPG vs Horsepower")
plot(Auto$weight, Auto$mpg, pch = 20, col = "darkgreen",
xlab = "Weight", ylab = "MPG", main = "MPG vs Weight")
plot(Auto$acceleration, Auto$mpg, pch = 20, col = "purple",
xlab = "Acceleration", ylab = "MPG", main = "MPG vs Acceleration")
plot(Auto$year, Auto$mpg, pch = 20, col = "red",
xlab = "Year", ylab = "MPG", main = "MPG vs Year")
boxplot(mpg ~ origin, data = Auto, col = c("coral", "lightgreen", "lightblue"),
xlab = "Origin (1=US, 2=EU, 3=JP)", ylab = "MPG",
main = "MPG by Origin")Findings:
mpg and displacement, horsepower,
and weight — heavier, more powerful cars tend to have lower
fuel efficiency.mpg and year — newer cars tend to be more fuel
efficient.acceleration shows a weak positive relationship with
mpg.origin 3 and 2) tend to
have higher mpg than American cars (origin
1).displacement, horsepower, and
weight are highly correlated with each other.# Correlation of quantitative predictors with mpg
cor_with_mpg <- cor(Auto[, quant_vars])[, "mpg"]
sort(abs(cor_with_mpg), decreasing = TRUE)## mpg weight displacement horsepower cylinders year
## 1.0000000 0.8322442 0.8051269 0.7784268 0.7776175 0.5805410
## acceleration
## 0.4233285
Based on the scatterplots and correlations:
weight (r = -0.832) has the strongest
negative correlation with mpg and is the most useful
predictor.displacement (r = -0.805) and
horsepower (r = -0.778) are also strongly
negatively correlated.year (r = 0.581) is positively
correlated, indicating newer cars are more efficient.cylinders is also useful but is highly
correlated with displacement.acceleration has the weakest
relationship and may be less useful on its own.The qualitative variable origin also
appears useful, as shown by the boxplot in part (e).
pairs(Auto[, -which(names(Auto) == "name")], pch = 20, col = "steelblue",
main = "Scatterplot Matrix — Auto Data (excluding name)")The scatterplot matrix reveals several strong linear and non-linear relationships among the variables.
auto_numeric <- Auto[, -which(names(Auto) == "name")]
cor_matrix <- cor(auto_numeric)
round(cor_matrix, 3)## mpg cylinders displacement horsepower weight acceleration
## mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423
## cylinders -0.778 1.000 0.951 0.843 0.898 -0.505
## displacement -0.805 0.951 1.000 0.897 0.933 -0.544
## horsepower -0.778 0.843 0.897 1.000 0.865 -0.689
## weight -0.832 0.898 0.933 0.865 1.000 -0.417
## acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000
## year 0.581 -0.346 -0.370 -0.416 -0.309 0.290
## origin 0.565 -0.569 -0.615 -0.455 -0.585 0.213
## year origin
## mpg 0.581 0.565
## cylinders -0.346 -0.569
## displacement -0.370 -0.615
## horsepower -0.416 -0.455
## weight -0.309 -0.585
## acceleration 0.290 0.213
## year 1.000 0.182
## origin 0.182 1.000
corrplot(cor_matrix, method = "color", type = "upper",
addCoef.col = "black", number.cex = 0.7,
tl.col = "black", tl.srt = 45,
title = "Correlation Matrix of Auto Data",
mar = c(0, 0, 2, 0))Key observations:
displacement, horsepower,
weight, and cylinders are highly positively
correlated with each other (r > 0.8).mpg.year has a moderate positive correlation with
mpg.##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
Yes. The F-statistic is very large with a p-value near zero (p <
2.2e-16), so we can reject the null hypothesis that all regression
coefficients are zero. There is a statistically significant relationship
between the predictors and mpg.
ii. Which predictors have a statistically significant relationship to the response?
coef_summary <- summary(lm.fit)$coefficients
sig_predictors <- rownames(coef_summary)[coef_summary[, "Pr(>|t|)"] < 0.05]
sig_predictors## [1] "(Intercept)" "displacement" "weight" "year" "origin"
Looking at the p-values, displacement,
weight, year, and origin appear
to have statistically significant relationships with mpg (p
< 0.05).
iii. What does the coefficient for the year
variable suggest?
## year
## 0.7507727
The coefficient for year is positive (approximately
0.7508), suggesting that, holding all other predictors fixed,
each additional year is associated with an increase of about
0.75 mpg. This indicates that cars have become more
fuel-efficient over time.
Comments:
# Identify high leverage points
hatvalues_lm <- hatvalues(lm.fit)
p <- length(coef(lm.fit))
n <- nrow(Auto)
high_leverage <- which(hatvalues_lm > 2 * p / n)
cat("High leverage observations:", head(high_leverage, 10), "\n")## High leverage observations: 7 8 9 13 14 26 27 28 29 94
## Number of high leverage points: 17
# Test several interaction terms
lm.fit.int1 <- lm(mpg ~ . - name + displacement:weight, data = Auto)
summary(lm.fit.int1)##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.589 -1.617 -0.184 1.541 12.001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.876e+00 4.511e+00 0.638 0.524147
## cylinders -2.955e-02 2.881e-01 -0.103 0.918363
## displacement 5.950e-03 6.750e-03 0.881 0.378610
## horsepower -2.313e-01 2.363e-02 -9.791 < 2e-16 ***
## weight -1.121e-02 7.285e-04 -15.393 < 2e-16 ***
## acceleration -9.019e-02 8.855e-02 -1.019 0.309081
## year 7.695e-01 4.494e-02 17.124 < 2e-16 ***
## origin 8.344e-01 2.513e-01 3.320 0.000986 ***
## horsepower:weight 5.529e-05 5.227e-06 10.577 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.931 on 383 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.859
## F-statistic: 298.6 on 8 and 383 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = mpg ~ . - name + year:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6072 -2.0439 -0.0596 1.7121 12.3368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.492e+00 9.044e+00 0.939 0.348353
## cylinders -5.042e-01 3.192e-01 -1.579 0.115082
## displacement 1.567e-02 7.530e-03 2.081 0.038060 *
## horsepower -1.399e-02 1.364e-02 -1.025 0.305786
## weight -6.352e-03 6.449e-04 -9.851 < 2e-16 ***
## acceleration 9.185e-02 9.766e-02 0.941 0.347546
## year 4.189e-01 1.125e-01 3.723 0.000226 ***
## origin -1.405e+01 4.699e+00 -2.989 0.002978 **
## year:origin 1.989e-01 6.030e-02 3.298 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared: 0.8264, Adjusted R-squared: 0.8228
## F-statistic: 227.9 on 8 and 383 DF, p-value: < 2.2e-16
# Model with multiple interactions
lm.fit.int4 <- lm(mpg ~ (. - name)^2, data = Auto)
sig_int <- summary(lm.fit.int4)$coefficients
sig_interactions <- rownames(sig_int)[sig_int[, "Pr(>|t|)"] < 0.05]
sig_interactions## [1] "displacement" "acceleration" "origin"
## [4] "displacement:year" "acceleration:year" "acceleration:origin"
Findings: Several interaction terms appear to be
statistically significant. The interaction between
displacement:weight is significant, as is
acceleration:year and acceleration:origin
among others. This suggests that the effect of one predictor on
mpg depends on the level of another predictor.
# Log transformation
lm.fit.log <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +
acceleration + year + factor(origin), data = Auto)
summary(lm.fit.log)##
## Call:
## lm(formula = mpg ~ log(displacement) + log(horsepower) + log(weight) +
## acceleration + year + factor(origin), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1406 -1.9581 -0.0011 1.6383 13.0738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.11202 9.87221 11.052 < 2e-16 ***
## log(displacement) 0.68018 1.12701 0.604 0.546516
## log(horsepower) -5.43488 1.57006 -3.462 0.000597 ***
## log(weight) -14.86606 2.26541 -6.562 1.72e-10 ***
## acceleration -0.18354 0.10294 -1.783 0.075386 .
## year 0.74088 0.04821 15.368 < 2e-16 ***
## factor(origin)2 1.64083 0.55679 2.947 0.003405 **
## factor(origin)3 1.87792 0.54392 3.453 0.000617 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.073 on 384 degrees of freedom
## Multiple R-squared: 0.8478, Adjusted R-squared: 0.845
## F-statistic: 305.6 on 7 and 384 DF, p-value: < 2.2e-16
# Square root transformation
lm.fit.sqrt <- lm(mpg ~ sqrt(displacement) + sqrt(horsepower) + sqrt(weight) +
acceleration + year + factor(origin), data = Auto)
summary(lm.fit.sqrt)##
## Call:
## lm(formula = mpg ~ sqrt(displacement) + sqrt(horsepower) + sqrt(weight) +
## acceleration + year + factor(origin), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8811 -1.9810 -0.0654 1.8271 13.2758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.09260 4.96699 0.824 0.410474
## sqrt(displacement) 0.31252 0.16782 1.862 0.063328 .
## sqrt(horsepower) -0.63396 0.30172 -2.101 0.036276 *
## sqrt(weight) -0.67451 0.07897 -8.541 3.15e-16 ***
## acceleration -0.03662 0.10201 -0.359 0.719838
## year 0.75929 0.05027 15.105 < 2e-16 ***
## factor(origin)2 2.17072 0.56392 3.849 0.000139 ***
## factor(origin)3 2.34062 0.54877 4.265 2.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.192 on 384 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8327
## F-statistic: 279 on 7 and 384 DF, p-value: < 2.2e-16
# Quadratic terms
lm.fit.sq <- lm(mpg ~ displacement + I(displacement^2) +
horsepower + I(horsepower^2) +
weight + I(weight^2) +
acceleration + year + factor(origin), data = Auto)
summary(lm.fit.sq)##
## Call:
## lm(formula = mpg ~ displacement + I(displacement^2) + horsepower +
## I(horsepower^2) + weight + I(weight^2) + acceleration + year +
## factor(origin), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9905 -1.6264 -0.0559 1.4085 12.2330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.115e+00 4.756e+00 0.865 0.387448
## displacement -1.261e-02 1.648e-02 -0.765 0.444903
## I(displacement^2) 3.581e-05 3.214e-05 1.114 0.266004
## horsepower -1.789e-01 4.080e-02 -4.386 1.50e-05 ***
## I(horsepower^2) 4.943e-04 1.381e-04 3.581 0.000387 ***
## weight -1.261e-02 2.511e-03 -5.022 7.89e-07 ***
## I(weight^2) 1.283e-06 3.344e-07 3.837 0.000146 ***
## acceleration -1.526e-01 1.003e-01 -1.521 0.129089
## year 7.860e-01 4.645e-02 16.923 < 2e-16 ***
## factor(origin)2 1.286e+00 5.344e-01 2.406 0.016601 *
## factor(origin)3 1.353e+00 5.197e-01 2.603 0.009600 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.901 on 381 degrees of freedom
## Multiple R-squared: 0.8654, Adjusted R-squared: 0.8619
## F-statistic: 245 on 10 and 381 DF, p-value: < 2.2e-16
# Diagnostic plots for log model
par(mfrow = c(2, 2))
plot(lm.fit.log, main = "Log-transformed Model")Findings:
horsepower and
weight are statistically significant, confirming a
non-linear relationship.This problem uses the Boston data set to predict per
capita crime rate (crim).
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
predictors <- names(Boston)[-1] # All except 'crim'
par(mfrow = c(4, 4))
slr_results <- data.frame(
Predictor = character(),
Coefficient = numeric(),
P_Value = numeric(),
R_Squared = numeric(),
Significant = character(),
stringsAsFactors = FALSE
)
for (pred in predictors) {
fit <- lm(as.formula(paste("crim ~", pred)), data = Boston)
s <- summary(fit)
slr_results <- rbind(slr_results, data.frame(
Predictor = pred,
Coefficient = round(coef(fit)[2], 4),
P_Value = signif(s$coefficients[2, 4], 4),
R_Squared = round(s$r.squared, 4),
Significant = ifelse(s$coefficients[2, 4] < 0.05, "Yes", "No")
))
plot(Boston[[pred]], Boston$crim, pch = 20, col = "steelblue",
xlab = pred, ylab = "crim",
main = paste("crim vs", pred))
abline(fit, col = "red", lwd = 2)
}
par(mfrow = c(1, 1))knitr::kable(slr_results, row.names = FALSE,
caption = "Simple Linear Regression Results for Each Predictor")| Predictor | Coefficient | P_Value | R_Squared | Significant |
|---|---|---|---|---|
| zn | -0.0739 | 0.0000055 | 0.0402 | Yes |
| indus | 0.5098 | 0.0000000 | 0.1653 | Yes |
| chas | -1.8928 | 0.2094000 | 0.0031 | No |
| nox | 31.2485 | 0.0000000 | 0.1772 | Yes |
| rm | -2.6841 | 0.0000006 | 0.0481 | Yes |
| age | 0.1078 | 0.0000000 | 0.1244 | Yes |
| dis | -1.5509 | 0.0000000 | 0.1441 | Yes |
| rad | 0.6179 | 0.0000000 | 0.3913 | Yes |
| tax | 0.0297 | 0.0000000 | 0.3396 | Yes |
| ptratio | 1.1520 | 0.0000000 | 0.0841 | Yes |
| black | -0.0363 | 0.0000000 | 0.1483 | Yes |
| lstat | 0.5488 | 0.0000000 | 0.2076 | Yes |
| medv | -0.3632 | 0.0000000 | 0.1508 | Yes |
Findings: For most predictors, there is a
statistically significant association with crim (p <
0.05). The predictor chas (Charles River dummy variable) is
the only one that is not statistically significant.
Among the significant predictors, rad (index of
accessibility to radial highways) and tax (property tax
rate) have the highest R² values.
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
coef_all <- summary(lm.fit.all)$coefficients
sig_multi <- rownames(coef_all)[coef_all[, "Pr(>|t|)"] < 0.05]
cat("Significant predictors in multiple regression:\n")## Significant predictors in multiple regression:
## [1] "zn" "dis" "rad" "black" "medv"
In the multiple regression, we can reject H₀: βⱼ = 0 for only a few
predictors (typically zn, dis,
rad, black, and medv). Many
predictors that were significant in simple regression are no longer
significant here due to multicollinearity.
# Get simple regression coefficients
simple_coefs <- sapply(predictors, function(pred) {
fit <- lm(as.formula(paste("crim ~", pred)), data = Boston)
coef(fit)[2]
})
# Get multiple regression coefficients
multi_coefs <- coef(lm.fit.all)[-1] # Exclude intercept
# Create comparison plot
plot(simple_coefs, multi_coefs,
pch = 19, col = "steelblue", cex = 1.5,
xlab = "Simple Regression Coefficients",
ylab = "Multiple Regression Coefficients",
main = "Simple vs. Multiple Regression Coefficients")
text(simple_coefs, multi_coefs, labels = predictors,
pos = 3, cex = 0.8)
abline(0, 1, lty = 2, col = "red")
abline(h = 0, lty = 3, col = "gray")
abline(v = 0, lty = 3, col = "gray")Findings: The coefficients from simple and multiple
regression differ substantially for several predictors. For example,
nox has a very large positive coefficient in simple
regression but a much smaller (possibly negative) coefficient in
multiple regression. This discrepancy occurs because the simple
regression coefficient captures both the direct effect and indirect
effects through correlated predictors. The multiple regression adjusts
for other variables and isolates the partial effect.
par(mfrow = c(4, 4))
nonlinear_results <- data.frame(
Predictor = character(),
Linear_P = numeric(),
Quadratic_P = numeric(),
Cubic_P = numeric(),
Evidence_Nonlinear = character(),
stringsAsFactors = FALSE
)
for (pred in predictors) {
# Skip chas (binary variable)
if (pred == "chas") {
next
}
fit_poly <- lm(as.formula(paste("crim ~ poly(", pred, ", 3)")), data = Boston)
s <- summary(fit_poly)
pvals <- s$coefficients[, "Pr(>|t|)"]
nonlinear_results <- rbind(nonlinear_results, data.frame(
Predictor = pred,
Linear_P = signif(pvals[2], 4),
Quadratic_P = signif(pvals[3], 4),
Cubic_P = signif(pvals[4], 4),
Evidence_Nonlinear = ifelse(pvals[3] < 0.05 | pvals[4] < 0.05, "Yes", "No")
))
# Plot with polynomial fit
x_range <- seq(min(Boston[[pred]]), max(Boston[[pred]]), length.out = 100)
pred_vals <- predict(fit_poly, newdata = setNames(data.frame(x_range), pred))
plot(Boston[[pred]], Boston$crim, pch = 20, col = "steelblue",
xlab = pred, ylab = "crim",
main = paste("crim vs", pred, "(cubic fit)"))
lines(x_range, pred_vals, col = "red", lwd = 2)
}
par(mfrow = c(1, 1))knitr::kable(nonlinear_results, row.names = FALSE,
caption = "Evidence of Non-linear Association (Polynomial Degree 3)")| Predictor | Linear_P | Quadratic_P | Cubic_P | Evidence_Nonlinear |
|---|---|---|---|---|
| zn | 4.7e-06 | 0.0044210 | 0.229500 | Yes |
| indus | 0.0e+00 | 0.0010860 | 0.000000 | Yes |
| nox | 0.0e+00 | 0.0000774 | 0.000000 | Yes |
| rm | 5.0e-07 | 0.0015090 | 0.508600 | Yes |
| age | 0.0e+00 | 0.0000023 | 0.006680 | Yes |
| dis | 0.0e+00 | 0.0000000 | 0.000000 | Yes |
| rad | 0.0e+00 | 0.0091210 | 0.482300 | Yes |
| tax | 0.0e+00 | 0.0000037 | 0.243900 | Yes |
| ptratio | 0.0e+00 | 0.0024050 | 0.006301 | Yes |
| black | 0.0e+00 | 0.4566000 | 0.543600 | No |
| lstat | 0.0e+00 | 0.0378000 | 0.129900 | Yes |
| medv | 0.0e+00 | 0.0000000 | 0.000000 | Yes |
Findings: There is evidence of non-linear
association for several predictors. Specifically, predictors such as
indus, nox, age,
dis, ptratio, and medv show
significant quadratic or cubic terms (p < 0.05), indicating that a
simple linear model may not adequately capture the relationship between
these predictors and crim.
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
## [1] 1089 9
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
# Correlation matrix (numeric variables only)
cor_weekly <- cor(Weekly[, -which(names(Weekly) == "Direction")])
round(cor_weekly, 3)## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## Year 1.000 -0.032 -0.033 -0.030 -0.031 -0.031 0.842 -0.032
## Lag1 -0.032 1.000 -0.075 0.059 -0.071 -0.008 -0.065 -0.075
## Lag2 -0.033 -0.075 1.000 -0.076 0.058 -0.072 -0.086 0.059
## Lag3 -0.030 0.059 -0.076 1.000 -0.075 0.061 -0.069 -0.071
## Lag4 -0.031 -0.071 0.058 -0.075 1.000 -0.076 -0.061 -0.008
## Lag5 -0.031 -0.008 -0.072 0.061 -0.076 1.000 -0.059 0.011
## Volume 0.842 -0.065 -0.086 -0.069 -0.061 -0.059 1.000 -0.033
## Today -0.032 -0.075 0.059 -0.071 -0.008 0.011 -0.033 1.000
corrplot(cor_weekly, method = "color", type = "upper",
addCoef.col = "black", number.cex = 0.7,
tl.col = "black", tl.srt = 45,
title = "Correlation Matrix — Weekly Data",
mar = c(0, 0, 2, 0))par(mfrow = c(2, 3))
for (var in c("Lag1", "Lag2", "Lag3", "Lag4", "Lag5")) {
boxplot(Weekly[[var]] ~ Weekly$Direction, col = c("coral", "lightgreen"),
xlab = "Direction", ylab = var,
main = paste(var, "by Direction"))
}
par(mfrow = c(1, 1))plot(Weekly$Volume, type = "l", col = "steelblue",
xlab = "Observation", ylab = "Volume",
main = "Trading Volume Over Time")Patterns observed:
Year and
Volume (r ≈ 0.84), indicating that trading volume has
increased over time.Up and Down directions based on the
boxplots.glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm.fit)##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
coef_glm <- summary(glm.fit)$coefficients
sig_glm <- rownames(coef_glm)[coef_glm[, "Pr(>|z|)"] < 0.05]
cat("Statistically significant predictors:\n")## Statistically significant predictors:
## [1] "(Intercept)" "Lag2"
Findings: Only Lag2
appears to be statistically significant (p < 0.05). Its positive
coefficient suggests that a positive return two weeks ago is associated
with a higher probability of the market going up this week. None of the
other lag variables or Volume are statistically significant.
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")
# Confusion matrix
confusion <- table(Predicted = glm.pred, Actual = Weekly$Direction)
confusion## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
# Overall accuracy
accuracy <- mean(glm.pred == Weekly$Direction)
cat("Overall accuracy:", round(accuracy * 100, 2), "%\n")## Overall accuracy: 56.11 %
# Detailed metrics
TP <- confusion["Up", "Up"]
TN <- confusion["Down", "Down"]
FP <- confusion["Up", "Down"]
FN <- confusion["Down", "Up"]
cat("\nTrue Positive (correctly predicted Up):", TP, "\n")##
## True Positive (correctly predicted Up): 557
## True Negative (correctly predicted Down): 54
## False Positive (predicted Up, actual Down): 430
## False Negative (predicted Down, actual Up): 48
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
cat("\nSensitivity (True Positive Rate):", round(sensitivity * 100, 2), "%\n")##
## Sensitivity (True Positive Rate): 92.07 %
## Specificity (True Negative Rate): 11.16 %
Interpretation: The overall accuracy is about 56.1%, which is slightly better than random guessing (50%). However, the confusion matrix reveals that the model has very high sensitivity (correctly predicting “Up”) but very low specificity (poorly predicting “Down”). The model tends to predict “Up” for most observations, which makes sense since the market went up more often than down during this period. The model’s ability to predict downturns is very poor — it makes many false positive errors (predicting “Up” when the actual direction is “Down”).
# Training data: 1990 to 2008
train <- Weekly$Year <= 2008
Weekly.train <- Weekly[train, ]
Weekly.test <- Weekly[!train, ]
cat("Training set size:", nrow(Weekly.train), "\n")## Training set size: 985
## Test set size: 104
## Test set years: 2009 2010
# Fit logistic regression using only Lag2
glm.fit2 <- glm(Direction ~ Lag2, data = Weekly.train, family = binomial)
summary(glm.fit2)##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
# Predict on test data
glm.probs2 <- predict(glm.fit2, newdata = Weekly.test, type = "response")
glm.pred2 <- ifelse(glm.probs2 > 0.5, "Up", "Down")
# Confusion matrix for test data
confusion2 <- table(Predicted = glm.pred2, Actual = Weekly.test$Direction)
confusion2## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
# Test set accuracy
accuracy2 <- mean(glm.pred2 == Weekly.test$Direction)
cat("Test set accuracy:", round(accuracy2 * 100, 2), "%\n")## Test set accuracy: 62.5 %
# Detailed metrics
TP2 <- confusion2["Up", "Up"]
TN2 <- confusion2["Down", "Down"]
FP2 <- confusion2["Up", "Down"]
FN2 <- confusion2["Down", "Up"]
sensitivity2 <- TP2 / (TP2 + FN2)
specificity2 <- TN2 / (TN2 + FP2)
cat("Sensitivity:", round(sensitivity2 * 100, 2), "%\n")## Sensitivity: 91.8 %
## Specificity: 20.93 %
Findings: Using only Lag2 as a
predictor and evaluating on the held-out test set (2009–2010), the model
achieves an accuracy of approximately 62.5%. This is an improvement over
the full model’s apparent accuracy because:
Lag2), which reduces noise from irrelevant variables.## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Asia/Shanghai
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MASS_7.3-60.2 corrplot_0.95 GGally_2.4.0 ggplot2_4.0.2 ISLR2_1.3-2
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 dplyr_1.2.0 compiler_4.4.1
## [5] tidyselect_1.2.1 ggstats_0.13.0 tidyr_1.3.2 jquerylib_0.1.4
## [9] scales_1.4.0 yaml_2.3.10 fastmap_1.2.0 R6_2.6.1
## [13] generics_0.1.3 knitr_1.51 tibble_3.3.1 bslib_0.10.0
## [17] pillar_1.10.2 RColorBrewer_1.1-3 rlang_1.1.7 cachem_1.1.0
## [21] xfun_0.52 sass_0.4.10 S7_0.2.0 otel_0.2.0
## [25] cli_3.6.5 withr_3.0.2 magrittr_2.0.3 digest_0.6.37
## [29] grid_4.4.1 rstudioapi_0.17.1 lifecycle_1.0.5 vctrs_0.7.1
## [33] evaluate_1.0.5 glue_1.8.0 farver_2.1.2 rmarkdown_2.31
## [37] purrr_1.2.1 tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1