Problem 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

KNN classification and KNN regression differ in the type of response they predict and in how each prediction is formed. KNN classification is used when the response variable is categorical: the prediction for a point is the most common class among its k nearest neighbors (a majority vote). KNN regression is used when the response variable is numerical: the prediction is the average of the responses of the k nearest neighbors. The two are also evaluated differently; classification models are typically assessed with accuracy, precision, recall, or a confusion matrix, while regression models are evaluated with mean squared error (MSE) or R-squared. KNN classification results in distinct class boundaries, whereas KNN regression produces smooth, continuous predictions. Finally, classification can be sensitive to class imbalance: a dominant class can skew predictions, and ties can occur when the neighbors are split evenly among classes. KNN regression is not affected by class imbalance because it averages continuous values rather than relying on a majority vote.
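
As a minimal sketch of the two prediction rules (the toy data and variable names here are my own, not from the text), the same set of nearest neighbors can feed either a majority vote or an average:

# Toy training data: 2-D points with a class label and a numeric response
train <- data.frame(x1 = c(0, 1, 1, 4, 5, 5),
                    x2 = c(0, 0, 1, 4, 4, 5),
                    class = c("A", "A", "A", "B", "B", "B"),
                    y = c(1.0, 1.2, 0.9, 4.1, 3.8, 4.3))
new_pt <- c(x1 = 1, x2 = 1)
k <- 3

# Euclidean distances from the new point to every training point
d <- sqrt((train$x1 - new_pt["x1"])^2 + (train$x2 - new_pt["x2"])^2)
nbrs <- order(d)[1:k]                       # indices of the k nearest neighbors

# KNN classification: majority vote among the neighbors' classes
names(which.max(table(train$class[nbrs]))) # "A"

# KNN regression: average of the neighbors' numeric responses
mean(train$y[nbrs])                        # about 1.03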

Problem 9

This question involves the use of multiple linear regression on the Auto data set.

library(ISLR)
library(GGally)
library(ggplot2)
attach(Auto)

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(Auto)
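
Since GGally is loaded above, ggpairs() is an optional alternative that adds correlation and density panels; restricting it to the quantitative columns keeps it readable:

# Optional richer scatterplot matrix via the already-loaded GGally package
ggpairs(Auto[, 1:7])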

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

auto_new <- Auto[, 1:7] # keep the quantitative columns (this drops origin as well as name)
cor(auto_new)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.

auto_lm <- lm(mpg ~ ., data = auto_new)
summary(auto_lm)
## 
## Call:
## lm(formula = mpg ~ ., data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6927 -2.3864 -0.0801  2.0291 14.3607 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.454e+01  4.764e+00  -3.051  0.00244 ** 
## cylinders    -3.299e-01  3.321e-01  -0.993  0.32122    
## displacement  7.678e-03  7.358e-03   1.044  0.29733    
## horsepower   -3.914e-04  1.384e-02  -0.028  0.97745    
## weight       -6.795e-03  6.700e-04 -10.141  < 2e-16 ***
## acceleration  8.527e-02  1.020e-01   0.836  0.40383    
## year          7.534e-01  5.262e-02  14.318  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 385 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8063 
## F-statistic: 272.2 on 6 and 385 DF,  p-value: < 2.2e-16

Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

There is a statistically significant relationship between the predictors and the response variable mpg. The p-value of the model's F-statistic is below 2.2e-16, indicating that at least one predictor is associated with mpg.

ii. Which predictors appear to have a statistically significant relationship to the response?

Looking back at the model output, weight and year show strong statistical significance; their p-values are both below 2e-16 and are marked with three asterisks. The remaining predictors (cylinders, displacement, horsepower, and acceleration) are not significant in this model.

iii. What does the coefficient for the year variable suggest?

The coefficient for year suggests that, holding the other predictors fixed, each one-year increase in model year is associated with an increase of approximately 0.7534 in mpg. In other words, cars became more fuel efficient over time by roughly 0.75 mpg per model year, which amounts to about 9 mpg across the 12-year span of the data.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(auto_lm)

There are a few concerns with the fit of this model. The residuals-versus-fitted plot shows a curved pattern, suggesting some non-linearity in the relationship, along with a few unusually large residuals (potential outliers). The Q-Q plot is close to the reference line for most observations, but the points deviate upward on the right side, again indicating some large positive residuals. Lastly, the residuals-versus-leverage plot identifies a few observations with unusually high leverage.
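
To back up the visual impressions with numbers, a quick hedged check (standard stats helpers; the thresholds are common rules of thumb, not from the text) lists the largest studentized residuals and the highest-leverage observations:

# Largest studentized residuals; |value| > 3 is a common outlier flag
head(sort(abs(rstudent(auto_lm)), decreasing = TRUE))

# Leverage values; points far above the average leverage (p + 1) / n are suspect
lev <- hatvalues(auto_lm)
head(sort(lev, decreasing = TRUE))
which(lev > 3 * mean(lev))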

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Below I have explored a number of interactions within the model. Many of them appear to be statistically significant, which suggests that the effect of some predictors on mpg depends on the level of others:

interaction_1 <- lm(mpg ~ . + year:weight, data = auto_new)
summary(interaction_1)
## 
## Call:
## lm(formula = mpg ~ . + year:weight, data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5341 -2.0577 -0.0967  1.6299 12.6653 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.233e+02  1.367e+01  -9.025  < 2e-16 ***
## cylinders     3.626e-02  3.087e-01   0.117    0.907    
## displacement  2.544e-03  6.798e-03   0.374    0.708    
## horsepower   -1.615e-02  1.287e-02  -1.255    0.210    
## weight        3.271e-02  4.740e-03   6.900 2.15e-11 ***
## acceleration  1.529e-01  9.424e-02   1.623    0.105    
## year          2.178e+00  1.762e-01  12.356  < 2e-16 ***
## weight:year  -5.214e-04  6.203e-05  -8.405 8.44e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.161 on 384 degrees of freedom
## Multiple R-squared:  0.8389, Adjusted R-squared:  0.836 
## F-statistic: 285.6 on 7 and 384 DF,  p-value: < 2.2e-16
interaction_2 <- lm(mpg ~ . + horsepower:acceleration, data = auto_new)
summary(interaction_2)
## 
## Call:
## lm(formula = mpg ~ . + horsepower:acceleration, data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3439 -1.8710 -0.0569  1.8487 13.3207 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -3.246e+01  5.011e+00  -6.477 2.87e-10 ***
## cylinders                2.656e-01  3.190e-01   0.833   0.4055    
## displacement            -1.940e-02  7.704e-03  -2.519   0.0122 *  
## horsepower               1.558e-01  2.402e-02   6.487 2.70e-10 ***
## weight                  -3.901e-03  7.285e-04  -5.354 1.48e-07 ***
## acceleration             1.094e+00  1.618e-01   6.763 5.05e-11 ***
## year                     7.583e-01  4.903e-02  15.465  < 2e-16 ***
## horsepower:acceleration -1.358e-02  1.762e-03  -7.708 1.11e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.201 on 384 degrees of freedom
## Multiple R-squared:  0.8348, Adjusted R-squared:  0.8318 
## F-statistic: 277.2 on 7 and 384 DF,  p-value: < 2.2e-16
interaction_3 <- lm(mpg ~ . + cylinders:weight, data = auto_new)
summary(interaction_3)
## 
## Call:
## lm(formula = mpg ~ . + cylinders:weight, data = auto_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5978  -1.9357  -0.1113   1.6254  13.0384 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.8911583  4.9231384   2.212   0.0275 *  
## cylinders        -5.3506277  0.5765724  -9.280   <2e-16 ***
## displacement      0.0088614  0.0065444   1.354   0.1765    
## horsepower       -0.0240166  0.0125236  -1.918   0.0559 .  
## weight           -0.0159580  0.0010825 -14.742   <2e-16 ***
## acceleration      0.1048820  0.0907620   1.156   0.2486    
## year              0.7854098  0.0469005  16.746   <2e-16 ***
## cylinders:weight  0.0016390  0.0001617  10.139   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.055 on 384 degrees of freedom
## Multiple R-squared:  0.8495, Adjusted R-squared:  0.8468 
## F-statistic: 309.7 on 7 and 384 DF,  p-value: < 2.2e-16
interaction_4 <- lm(mpg ~ . + acceleration:weight, data = auto_new)
summary(interaction_4)
## 
## Call:
## lm(formula = mpg ~ . + acceleration:weight, data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4682 -2.1692 -0.0169  1.7400 12.9914 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.488e+01  5.905e+00  -7.601 2.27e-13 ***
## cylinders           -6.820e-02  3.109e-01  -0.219  0.82649    
## displacement        -7.281e-03  7.112e-03  -1.024  0.30662    
## horsepower          -3.282e-02  1.353e-02  -2.425  0.01578 *  
## weight               5.037e-03  1.643e-03   3.065  0.00233 ** 
## acceleration         1.814e+00  2.416e-01   7.508 4.24e-13 ***
## year                 7.876e-01  4.916e-02  16.020  < 2e-16 ***
## weight:acceleration -6.509e-04  8.365e-05  -7.781 6.72e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.197 on 384 degrees of freedom
## Multiple R-squared:  0.8352, Adjusted R-squared:  0.8322 
## F-statistic: 278.1 on 7 and 384 DF,  p-value: < 2.2e-16
interaction_5 <- lm(mpg ~ . + displacement:weight, data = auto_new)
summary(interaction_5)
## 
## Call:
## lm(formula = mpg ~ . + displacement:weight, data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6095 -1.7946 -0.0321  1.5551 12.5628 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -3.622e+00  4.240e+00  -0.854   0.3936    
## cylinders            2.176e-01  2.919e-01   0.745   0.4565    
## displacement        -7.882e-02  9.946e-03  -7.925  2.5e-14 ***
## horsepower          -2.809e-02  1.224e-02  -2.295   0.0223 *  
## weight              -1.105e-02  6.914e-04 -15.977  < 2e-16 ***
## acceleration         6.796e-02  8.845e-02   0.768   0.4428    
## year                 7.885e-01  4.571e-02  17.249  < 2e-16 ***
## displacement:weight  2.428e-05  2.142e-06  11.334  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.978 on 384 degrees of freedom
## Multiple R-squared:  0.8571, Adjusted R-squared:  0.8545 
## F-statistic: 328.9 on 7 and 384 DF,  p-value: < 2.2e-16
interaction_6 <- lm(mpg ~ . + year:horsepower, data = auto_new)
summary(interaction_6)
## 
## Call:
## lm(formula = mpg ~ . + year:horsepower, data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8434 -1.9803 -0.0588  1.6753 12.5900 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.971e+01  1.032e+01  -9.658   <2e-16 ***
## cylinders        2.484e-01  3.083e-01   0.806    0.421    
## displacement    -5.640e-03  6.842e-03  -0.824    0.410    
## horsepower       8.686e-01  9.646e-02   9.005   <2e-16 ***
## weight          -5.480e-03  6.256e-04  -8.760   <2e-16 ***
## acceleration    -3.150e-02  9.357e-02  -0.337    0.737    
## year             1.895e+00  1.344e-01  14.098   <2e-16 ***
## horsepower:year -1.206e-02  1.327e-03  -9.087   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.121 on 384 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.8402 
## F-statistic: 294.6 on 7 and 384 DF,  p-value: < 2.2e-16
interaction_7 <- lm(mpg ~ . + displacement:horsepower, data = auto_new)
summary(interaction_7)
## 
## Call:
## lm(formula = mpg ~ . + displacement:horsepower, data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2387 -1.6434 -0.0255  1.4588 12.9850 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3.400e-01  4.258e+00   0.080  0.93641    
## cylinders                7.965e-01  2.991e-01   2.663  0.00808 ** 
## displacement            -8.672e-02  1.010e-02  -8.588 2.23e-16 ***
## horsepower              -2.022e-01  2.062e-02  -9.807  < 2e-16 ***
## weight                  -3.067e-03  6.522e-04  -4.703 3.58e-06 ***
## acceleration            -2.307e-01  9.115e-02  -2.531  0.01176 *  
## year                     7.382e-01  4.500e-02  16.404  < 2e-16 ***
## displacement:horsepower  5.588e-04  4.676e-05  11.951  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.937 on 384 degrees of freedom
## Multiple R-squared:  0.861,  Adjusted R-squared:  0.8584 
## F-statistic: 339.7 on 7 and 384 DF,  p-value: < 2.2e-16
interaction_8 <- lm(mpg ~ . + year:cylinders, data = auto_new)
summary(interaction_8)
## 
## Call:
## lm(formula = mpg ~ . + year:cylinders, data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9542 -2.1263 -0.0914  1.8070 13.3174 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -9.213e+01  1.297e+01  -7.102 6.00e-12 ***
## cylinders       1.521e+01  2.455e+00   6.197 1.49e-09 ***
## displacement   -2.150e-04  7.113e-03  -0.030    0.976    
## horsepower     -1.195e-02  1.330e-02  -0.899    0.369    
## weight         -6.546e-03  6.391e-04 -10.242  < 2e-16 ***
## acceleration    1.268e-01  9.736e-02   1.302    0.194    
## year            1.760e+00  1.654e-01  10.640  < 2e-16 ***
## cylinders:year -1.996e-01  3.126e-02  -6.384 4.97e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.271 on 384 degrees of freedom
## Multiple R-squared:  0.8276, Adjusted R-squared:  0.8244 
## F-statistic: 263.3 on 7 and 384 DF,  p-value: < 2.2e-16
interaction_9 <- lm(mpg ~ . + cylinders:horsepower, data = auto_new)
summary(interaction_9)
## 
## Call:
## lm(formula = mpg ~ . + cylinders:horsepower, data = auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6700 -1.7839 -0.0665  1.4760 12.4198 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          15.2786145  4.8861467   3.127   0.0019 ** 
## cylinders            -4.4626857  0.4631642  -9.635  < 2e-16 ***
## displacement         -0.0102587  0.0065626  -1.563   0.1188    
## horsepower           -0.3256486  0.0309855 -10.510  < 2e-16 ***
## weight               -0.0039166  0.0006328  -6.189 1.56e-09 ***
## acceleration         -0.1842980  0.0914657  -2.015   0.0446 *  
## year                  0.7401290  0.0455736  16.240  < 2e-16 ***
## cylinders:horsepower  0.0429016  0.0037692  11.382  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.974 on 384 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8548 
## F-statistic: 329.8 on 7 and 384 DF,  p-value: < 2.2e-16
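
Beyond the individual t-tests, a nested-model F-test with anova() is one way to confirm that an interaction improves the fit; as a sketch, comparing the base model to the displacement:horsepower model:

# F-test of the base model against the same model plus the interaction term
anova(auto_lm, interaction_7)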

(f) Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.

To start, the pairs(Auto) scatterplot matrix gave some insight into which predictors to consider for transformation: it suggests that weight, horsepower, and displacement have nonlinear, skewed relationships with mpg.

auto_log <- auto_new

auto_log$log_weight <- log(auto_new$weight)
auto_log$log_horsepower <- log(auto_new$horsepower)

model_log <- lm(mpg ~ . + log_weight + log_horsepower, data = auto_log)
summary(model_log)
## 
## Call:
## lm(formula = mpg ~ . + log_weight + log_horsepower, data = auto_log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5282 -1.6266 -0.1555  1.6021 12.7377 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    225.666934  31.890853   7.076 7.10e-12 ***
## cylinders       -0.122498   0.289851  -0.423   0.6728    
## displacement    -0.006184   0.006804  -0.909   0.3640    
## horsepower       0.119829   0.028465   4.210 3.19e-05 ***
## weight           0.003913   0.001800   2.174   0.0303 *  
## acceleration    -0.217486   0.100795  -2.158   0.0316 *  
## year             0.773966   0.045562  16.987  < 2e-16 ***
## log_weight     -24.575145   5.689745  -4.319 2.00e-05 ***
## log_horsepower -18.384528   3.527522  -5.212 3.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.937 on 383 degrees of freedom
## Multiple R-squared:  0.8613, Adjusted R-squared:  0.8584 
## F-statistic: 297.3 on 8 and 383 DF,  p-value: < 2.2e-16
auto_sqrt <- auto_new

auto_sqrt$sqrt_weight <- sqrt(auto_new$weight)
auto_sqrt$sqrt_displacement <- sqrt(auto_new$displacement)

model_sqrt <- lm(mpg ~ . + sqrt_weight + sqrt_displacement, data = auto_sqrt)
summary(model_sqrt)
## 
## Call:
## lm(formula = mpg ~ . + sqrt_weight + sqrt_displacement, data = auto_sqrt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.546  -1.707  -0.088   1.533  12.460 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       52.243031  10.802332   4.836 1.92e-06 ***
## cylinders         -0.033668   0.304569  -0.111 0.912038    
## displacement       0.093917   0.025990   3.614 0.000342 ***
## horsepower        -0.033643   0.013203  -2.548 0.011219 *  
## weight             0.012290   0.004442   2.767 0.005931 ** 
## acceleration       0.007888   0.089637   0.088 0.929920    
## year               0.797226   0.046036  17.317  < 2e-16 ***
## sqrt_weight       -1.929397   0.509414  -3.787 0.000177 ***
## sqrt_displacement -2.704823   0.787241  -3.436 0.000655 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.987 on 383 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8535 
## F-statistic: 285.8 on 8 and 383 DF,  p-value: < 2.2e-16
auto_sq <- auto_new

auto_sq$sq_weight <- auto_new$weight^2
auto_sq$sq_horsepower <- auto_new$horsepower^2

model_sq <- lm(mpg ~ . + sq_weight + sq_horsepower, data = auto_sq)
summary(model_sq)
## 
## Call:
## lm(formula = mpg ~ . + sq_weight + sq_horsepower, data = auto_sq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4313 -1.6631 -0.0658  1.5147 12.6518 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.927e+00  4.527e+00   1.972   0.0493 *  
## cylinders      2.562e-01  2.991e-01   0.857   0.3922    
## displacement  -7.373e-03  7.001e-03  -1.053   0.2930    
## horsepower    -2.017e-01  4.031e-02  -5.003 8.60e-07 ***
## weight        -1.467e-02  2.099e-03  -6.990 1.23e-11 ***
## acceleration  -1.825e-01  1.016e-01  -1.796   0.0733 .  
## year           7.776e-01  4.562e-02  17.043  < 2e-16 ***
## sq_weight      1.601e-06  2.793e-07   5.731 2.02e-08 ***
## sq_horsepower  6.231e-04  1.299e-04   4.797 2.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.939 on 383 degrees of freedom
## Multiple R-squared:  0.8611, Adjusted R-squared:  0.8582 
## F-statistic: 296.9 on 8 and 383 DF,  p-value: < 2.2e-16

All three transformations improved model performance. The original multiple R-squared was 80.93%; the log transformation raised it to 86.13%, the square-root transformation to 85.65%, and the squared terms to 86.11%. Accounting for these nonlinear relationships noticeably improved every model, with the log and squared transformations showing the largest gains.
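
Because each transformed model adds predictors, adjusted R-squared is the fairer yardstick; a quick comparison across the fitted objects above makes the ranking explicit:

# Adjusted R-squared for the baseline and the three transformed models
sapply(list(base = auto_lm, log = model_log, sqrt = model_sqrt, squared = model_sq),
       function(m) summary(m)$adj.r.squared)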

Problem 10

This question should be answered using the Carseats data set.

library(ISLR)
attach(Carseats)

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

fit <- lm(Sales ~ Price + Urban + US)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

In the output table above, Price and US are significant predictors of Sales. Since Sales is measured in thousands of units, the Price coefficient of -0.0545 means that each $1 increase in price is associated with roughly 54.5 fewer units sold, holding the other predictors fixed. The US coefficient means that stores in the US sell about 1,201 more units than stores outside the US, at a given price and urban status. Urban has no statistically significant effect on Sales (p = 0.936).

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\(Sales = 13.043469 - 0.054459\,Price - 0.021916\,Urban_{Yes} + 1.200573\,US_{Yes}\), where \(Urban_{Yes} = 1\) if the store is in an urban location (0 otherwise) and \(US_{Yes} = 1\) if the store is in the US (0 otherwise).
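
As a quick illustration of how the dummy variables enter the equation (the price of $100 is a hypothetical value of my choosing), predict() reproduces the hand calculation for a US, non-urban store:

# Hypothetical store: Price = 100, non-urban, located in the US
predict(fit, newdata = data.frame(Price = 100, Urban = "No", US = "Yes"))
# By hand: 13.043469 - 0.054459 * 100 - 0.021916 * 0 + 1.200573 * 1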

(d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?

We can reject the null hypothesis for Price and US.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fit_sm <- lm(Sales ~ Price + US)
summary(fit_sm)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

Neither model fits the data especially well: each explains only about 23.9% of the variance in Sales (multiple R-squared of 0.2393 for both). The smaller model fits at least as well, with a slightly higher adjusted R-squared (0.2354 versus 0.2335) and a marginally lower residual standard error (2.469 versus 2.472).
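
A short hedged check (reusing the two fitted objects) pulls both comparison numbers directly:

# Adjusted R-squared and residual standard error for the full and reduced models
sapply(list(full = fit, reduced = fit_sm),
       function(m) c(adj.r.sq = summary(m)$adj.r.squared, rse = summary(m)$sigma))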

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(fit_sm)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2,2))
plot(fit_sm)

To look for outliers or high-leverage observations in the model from (e), we can use residual plots and influence measures. In the residuals-versus-leverage plot above, no observation crosses the usual Cook's distance threshold of 0.5 (the largest Cook's distance in the influence table below is about 0.03), but several points have leverage well above the average of \((p+1)/n = 3/400 \approx 0.0075\), marking them as high-leverage observations.

summary(influence.measures(fit_sm))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00

The influence measures flag multiple potentially influential observations, including 26, 29, 43, 50, and the others listed above.
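
Rather than typing those indices by hand, the flagged rows can also be pulled programmatically; this is a hedged equivalent of the manual vector used below:

# Rows flagged as influential on at least one measure
infl <- influence.measures(fit_sm)
flagged <- which(apply(infl$is.inf, 1, any))
flagged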

# Observations flagged by influence.measures() above
outlying.obs <- c(26, 29, 43, 50, 51, 58, 69, 126, 160, 166, 172, 175, 210,
                  270, 298, 314, 353, 357, 368, 377, 384, 387, 396)
Carseats_sm <- Carseats[-outlying.obs, ]
fit2 <- lm(Sales ~ Price + US, data = Carseats_sm)
summary(fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats_sm)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.263 -1.605 -0.039  1.590  5.428 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.925232   0.665259  19.429  < 2e-16 ***
## Price       -0.053973   0.005511  -9.794  < 2e-16 ***
## USYes        1.255018   0.248856   5.043 7.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared:  0.2387, Adjusted R-squared:  0.2347 
## F-statistic: 58.64 on 2 and 374 DF,  p-value: < 2.2e-16

To check these observations' impact on the model, I removed them and refit. The result is a slight decrease in multiple R-squared, from 23.93% to 23.87%. Their removal did not improve the model's fit; instead, these observations likely represent meaningful variation in the data rather than anomalies.

Problem 12

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of \(Y\) onto \(X\) without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of \(X\) onto \(Y\) the same as the coefficient estimate for the regression of \(Y\) onto \(X\)?

From (3.38), regressing \(Y\) onto \(X\) gives \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\), while regressing \(X\) onto \(Y\) gives \(\hat{\beta}' = \sum_i x_i y_i / \sum_i y_i^2\). The numerators are identical, so the two estimates coincide only when the denominators are equal, that is, when the sum of squares of \(X\) equals the sum of squares of \(Y\): \(\sum_i x_i^2 = \sum_i y_i^2\).

(b) Generate an example in R with \(n = 100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is different from the coefficient estimate for the regression of \(Y\) onto \(X\).

First, we need to generate the 100 observations.

set.seed(1)

n <- 100
X <- rnorm(n)
Y <- 2 * X + rnorm(n) # Add noise to Y to mimic imperfections

Next, create the regression models of \(Y\) onto \(X\) and \(X\) onto \(Y\) and extract the coefficient estimates. Remember there is no intercept.

# Y onto X
model_y_x <- lm(Y ~ X + 0)
coef_y_x <- coef(model_y_x)

# X onto Y
model_x_y <- lm(X ~ Y + 0)
coef_x_y <- coef(model_x_y)

# Print both estimates (each is labeled by its model's predictor)
cat("X =", coef_y_x, "\n")
cat("Y =", coef_x_y, "\n")
## X = 1.993876 
## Y = 0.3911145

The results show different coefficient estimates (1.9939 versus 0.3911), demonstrating that the two regressions differ when the sums of squares of \(X\) and \(Y\) are not equal.
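
A quick check ties both slopes back to the formula from part (a), using the same simulated \(X\) and \(Y\):

# Both estimates share the numerator sum(X * Y); only the denominator differs
sum(X * Y) / sum(X^2)  # reproduces coef_y_x
sum(X * Y) / sum(Y^2)  # reproduces coef_x_y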

(c) Generate an example in R with \(n = 100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient estimate for the regression of \(Y\) onto \(X\).

Again, generate 100 observations, but this time set \(Y\) exactly equal to \(X\).

set.seed(2)

n <- 100
X <- rnorm(n)
Y <- X  

Next, create the two regression models and coefficient estimates.

model2_y_x <- lm(Y ~ X + 0)
model2_x_y <- lm(X ~ Y + 0)

# Coefficients
coef2_y_x <- coef(model2_y_x)
coef2_x_y <- coef(model2_x_y)

cat("X =", coef2_y_x, "\n")
cat("Y =", coef2_x_y, "\n")
## X = 1 
## Y = 1

The results show the same coefficient estimate. This confirms that if the sums of squares of \(X\) and \(Y\) are equal, the coefficient estimates of the two regressions are equal as well.
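
Setting \(Y = X\) is the simplest case, but equality only requires matching sums of squares. As a hedged variant of my own construction, a random permutation of \(X\) also satisfies the condition without \(Y\) being identical to \(X\):

set.seed(3)
X <- rnorm(100)
Y <- sample(X)       # same values in a new order, so sum(Y^2) equals sum(X^2)
coef(lm(Y ~ X + 0))  # the two estimates match...
coef(lm(X ~ Y + 0))  # ...because the denominators are equal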