Assignment #2 Deep Stats

## Multiple Linear Regression on the Auto Data Set
library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.5.3

data(Auto)
Auto <- na.omit(Auto)

(a) Scatterplot matrix

pairs(Auto)

(b) Correlation matrix

cor(Auto[, -9])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Multiple linear regression

auto_lm <- lm(mpg ~ . - name, data = Auto)

summary(auto_lm)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

(d) Diagnostic plots

par(mfrow = c(2, 2))
plot(auto_lm)

par(mfrow = c(1, 1))

The Residuals vs Fitted plot displays a curved pattern rather than a completely random scatter, indicating that the relationship between the predictors and mpg may not be perfectly linear. The Scale-Location plot indicates that the variance of the residuals is not completely constant across fitted values, providing some evidence of heteroscedasticity.

(e) Interaction effects

interaction_lm <- lm(mpg ~ (cylinders + displacement + horsepower + weight + acceleration + year + origin)^2, data = Auto)
summary(interaction_lm)

## 
## Call:
## lm(formula = mpg ~ (cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin)^2, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

Several interaction terms appear to be statistically significant. In particular, the interactions between displacement and year, acceleration and year, and acceleration and origin all have p-values below 0.05.Overall, the presence of significant interaction terms indicates that some predictor variables do not act independently when explaining variation in miles per gallon.

(f) Transformations

transform_lm <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +sqrt(acceleration) + year + origin + I(horsepower^2) + I(weight^2), data = Auto)

summary(transform_lm)

## 
## Call:
## lm(formula = mpg ~ log(displacement) + log(horsepower) + log(weight) + 
##     sqrt(acceleration) + year + origin + I(horsepower^2) + I(weight^2), 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1230 -1.5110 -0.1676  1.4112 12.1136 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.598e+02  1.919e+01   8.328 1.48e-15 ***
## log(displacement)  -8.725e-01  1.018e+00  -0.857   0.3921    
## log(horsepower)    -1.242e+01  2.075e+00  -5.984 4.99e-09 ***
## log(weight)        -1.653e+01  3.579e+00  -4.620 5.26e-06 ***
## sqrt(acceleration) -2.016e+00  7.827e-01  -2.576   0.0104 *  
## year                7.679e-01  4.485e-02  17.123  < 2e-16 ***
## origin              6.528e-01  2.598e-01   2.513   0.0124 *  
## I(horsepower^2)     2.182e-04  5.148e-05   4.238 2.83e-05 ***
## I(weight^2)         2.614e-07  1.376e-07   1.901   0.0581 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.899 on 383 degrees of freedom
## Multiple R-squared:  0.8649, Adjusted R-squared:  0.8621 
## F-statistic: 306.5 on 8 and 383 DF,  p-value: < 2.2e-16

Although the transformed model performed well, the interaction model achieved a slightly higher adjusted R², suggesting that interactions among predictors may explain variation in mpg more effectively than the transformations considered here.Overall, the transformed model suggests that allowing nonlinear relationships improves the flexibility of the regression model and provides evidence that some predictors affect mpg in a nonlinear manner.

par(mfrow = c(2, 2))
plot(transform_lm)

par(mfrow = c(1, 1))

Exercise 10: Carseats Data

library(ISLR2)

data(Carseats)
str(Carseats)

## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

(a) Multiple regression model

model_full <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_full)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Interpretation of coefficients

The fitted model uses Sales as the response variable and Price, Urban, and US as predictors. Since Urban and US are qualitative variables.

The intercept represents the predicted sales when Price = 0, Urban = No, and US = No.
The coefficient for Price represents the expected change in sales for a one-unit increase in price, holding Urban and US constant.
The coefficient for UrbanYes represents the difference in sales between urban and non-urban stores, holding price and US status constant.
The coefficient for USYes represents the difference in sales between stores in the US and stores outside the US, holding price and urban status constant.

(c) Model equation

coef(model_full)

## (Intercept)       Price    UrbanYes       USYes 
## 13.04346894 -0.05445885 -0.02191615  1.20057270

The model can be written as:

\[ \widehat{Sales} = 13.04 - 0.054(Price) - 0.022(UrbanYes) + 1.201(USYes) \]

where:

UrbanYes = 1 if the store is urban, and 0 otherwise.
USYes = 1 if the store is in the US, and 0 otherwise.

(d) Significant predictors

Based on the regression output, we reject \(H_0: \beta_j = 0\) for predictors with small p-values.

summary(model_full)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

The predictors Price and US appear to be statistically significant. The variable Urban does not appear to be statistically significant because its p-value is large.

(e) Smaller model

model_small <- lm(Sales ~ Price + US, data = Carseats)
summary(model_small)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) Model fit comparison

summary(model_full)$r.squared

## [1] 0.2392754

summary(model_full)$adj.r.squared

## [1] 0.2335123

summary(model_small)$r.squared

## [1] 0.2392629

summary(model_small)$adj.r.squared

## [1] 0.2354305

The full model and smaller model fit the data similarly. Since Urban was not statistically significant, removing it does not meaningfully reduce model performance. The smaller model is preferred because it is simpler and uses only predictors with evidence of association with Sales.

(g) 95% confidence intervals

confint(model_small)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Outliers and high leverage observations

par(mfrow = c(2, 2))
plot(model_small)

par(mfrow = c(1, 1))

# Studentized residuals
stud_res <- rstudent(model_small)
which(abs(stud_res) > 3)

## named integer(0)

# Leverage values
lev <- hatvalues(model_small)
which(lev > 2 * mean(lev))

##  43 126 156 157 160 166 172 175 192 204 209 270 273 314 316 357 366 368 384 387 
##  43 126 156 157 160 166 172 175 192 204 209 270 273 314 316 357 366 368 384 387

The diagnostic plots and leverage values can be used to check for unusual observations. Points with very large studentized residuals may be possible outliers, while observations with high leverage may have unusual predictor values. Overall, the diagnostic plots should be examined to determine whether any observations are unusually influential.

Simple Linear Regression Without an Intercept

(a)

For regression of \(Y\) onto \(X\) without an intercept,

\[ \hat{\beta}_{Y|X} = \frac{\sum x_i y_i}{\sum x_i^2} \]

For regression of \(X\) onto \(Y\) without an intercept,

\[ \hat{\beta}_{X|Y} = \frac{\sum x_i y_i}{\sum y_i^2} \]

These two coefficient estimates are equal when:

\[ \sum x_i^2 = \sum y_i^2 \]

In other words, the two estimates are the same when the sum of squared values for \(X\) equals the sum of squared values for \(Y\).

(b) Example where the estimates are different

set.seed(1)

x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Regression of Y onto X without intercept
fit_y_on_x <- lm(y ~ x + 0)

# Regression of X onto Y without intercept
fit_x_on_y <- lm(x ~ y + 0)

coef(fit_y_on_x)

##        x 
## 1.993876

coef(fit_x_on_y)

##         y 
## 0.3911145

The two coefficient estimates are different in this example because \(\sum x_i^2\) is not equal to \(\sum y_i^2\).

sum(x^2)

## [1] 81.05509

sum(y^2)

## [1] 413.2135

(c) Example where the estimates are the same

set.seed(2)

x <- rnorm(100)
y <- x

# Regression of Y onto X without intercept
fit_y_on_x_same <- lm(y ~ x + 0)

# Regression of X onto Y without intercept
fit_x_on_y_same <- lm(x ~ y + 0)

coef(fit_y_on_x_same)

## x 
## 1

coef(fit_x_on_y_same)

## y 
## 1

In this example, \(x\) and \(y\) are identical, so: