Assignment 2

Problem 2.

Carefully explain the differences between the KNN classifier and KNN regression methods.

KNN classification, the model assigns a class label based on the majority vote among the nearest neighbors, making it suitable for categorical outcomes.
KNN regression, the algorithm predicts a continuous value by averaging the numerical outputs of the nearest neighbors, making it effective for continuous target variables.

Problem 9.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.4.2

plot(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, cor() which is qualitative.

Auto1 <- Auto
Auto1$name = NULL
cor(Auto1)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

Auto_lm <- lm(mpg ~ . -name, data =Auto)
summary(Auto_lm)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

There is a strong relationship between the predictors and mpg. The very low p-value (< 2.2e-16) means that at least one of the predictors significantly affects mpg. Also, the R-squared value (0.8215) shows that about 82% of the variation in mpg is explained by the predictors.

ii. Which predictors appear to have a statistically significant relationship to the response?

Displacement
Weight
Year
Origin

iii. What does the coefficient for the year variable suggest?

The coefficient for the year variable indicates that it is statistically significant and suggests that, assuming all other variables remain constant, mpg increases by 0.75 for each additional year.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2,2))
plot(Auto_lm)

The Residuals vs Fitted plot suggests that the errors might not have constant spread, which can be a problem. The Q-Q plot shows that some points don’t follow a normal pattern, meaning there could be outliers. The Scale-Location plot confirms that the spread of errors changes, which means the model might not be treating all values equally. The Residuals vs Leverage plot shows that observation 14 has high leverage, meaning it could have too much influence on the model.

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Auto_lm2 <- lm(mpg ~.:., data = Auto1)
summary(Auto_lm2)

## 
## Call:
## lm(formula = mpg ~ .:., data = Auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

Auto_lm3 <- lm(mpg ~.*., data = Auto1)
summary(Auto_lm3)

## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

There are interactions between both symbols * and : due to the p-value. The statistically significant interactions are displacement:year, acceleration:year, and acceleration:origin.

(f) Try a few different transformations of the variables, such as log(X), √X, X2. Comment on your findings.

Auto_lm4 <- lm(mpg ~ weight + I(sqrt(weight)), data = Auto)
summary(Auto_lm4)

## 
## Call:
## lm(formula = mpg ~ weight + I(sqrt(weight)), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5660  -2.6552  -0.4161   1.7373  16.1001 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     109.218284  11.573797   9.437  < 2e-16 ***
## weight            0.013191   0.003828   3.446 0.000631 ***
## I(sqrt(weight))  -2.314535   0.424250  -5.456  8.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.181 on 389 degrees of freedom
## Multiple R-squared:  0.7145, Adjusted R-squared:  0.713 
## F-statistic: 486.7 on 2 and 389 DF,  p-value: < 2.2e-16

Both predictors (weight and I(sqrt(weight))) are statistically significant at a very low p-value, indicating that both have meaningful contributions to predicting mpg. The model explains about 71.45% of the variation in mpg.

Auto_lm5 <- lm(mpg ~ weight+ I(weight^2), data = Auto)
summary(Auto_lm5)

## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6246  -2.7134  -0.3485   1.8267  16.0866 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.226e+01  2.993e+00  20.800  < 2e-16 ***
## weight      -1.850e-02  1.972e-03  -9.379  < 2e-16 ***
## I(weight^2)  1.697e-06  3.059e-07   5.545 5.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.7137 
## F-statistic: 488.3 on 2 and 389 DF,  p-value: < 2.2e-16

(weight) and the squared term (I(weight^2)) are statistically significant, which suggests a relationship between weight and mpg. The model explains 71.51% of the variation in mpg, a very similar R-squared value to the first model, indicating a comparable model fit.

Problem 10.

library(ISLR2)
attach(Carseats)
head(Carseats)

##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

summary(Carseats)

##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
##

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

Sale_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Sale_lm)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

coef(Sale_lm)

## (Intercept)       Price    UrbanYes       USYes 
## 13.04346894 -0.05445885 -0.02191615  1.20057270

Coefficient for ‘price’: -0.054459
This means that for every dollar increase in the price of my Carseats, store’s sales decrease by $54 on average.
Coefficient for ‘US = Yes’: 1.200573
This means, on average, stores in the US sell $1,200 more Carseats compared to stores outside the US.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

$Sales = 13.04 - 0.05Price - 0.22Urban + 1.2US$

(d) For which of the predictors can you reject the null hypothesis H0 : βj =0?

The null hypothesis can be rejected for the Price and US variables, since their p-values is 2e-16 and 4.86e-06.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

Sale_lm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(Sale_lm2)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

Both models (a) and (e) performed poorly, with the adjusted R-squared for Model A being 0.2335 and for Model E being 0.2354. Both models explained around 23% of the variance. Even after removing the “Urban” variable, it did not affect the model.

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(Sale_lm2)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow=c(2,2))
plot(Sale_lm2)

There are some possible outliers in the Residuals vs Fitted plot and slight deviations in the Q-Q plot. A few high-leverage points appear in the Residuals vs Leverage plot, but they don’t seem very influential.

Problem 12.

(a) Recall that the coefficient estimate ˆ β for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The coefficient estimates for both X and Y are equal when we set up and compare their respective quotient equations. This happens when the sum of X^2 is equal to the sum of Y^2, ensuring both variables have the same spread.

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(99)
x <- rnorm(100)
y <- 3 * x + rnorm(100)

df1 <- lm(y~x)
summary(df1)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7247 -0.7219 -0.1007  0.7467  2.2239 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.07963    0.10765   -0.74    0.461    
## x            3.07747    0.11931   25.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.069 on 98 degrees of freedom
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8703 
## F-statistic: 665.3 on 1 and 98 DF,  p-value: < 2.2e-16

df2 <- lm(x~y)
summary(df2)

## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72090 -0.24436 -0.00792  0.23757  0.75347 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.009198   0.032735   0.281    0.779    
## y           0.283223   0.010980  25.794   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3244 on 98 degrees of freedom
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8703 
## F-statistic: 665.3 on 1 and 98 DF,  p-value: < 2.2e-16

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(99)
x <- rnorm(100)
y <- x


df3 <- lm(y~x)
summary(df3)

## Warning in summary.lm(df3): essentially perfect fit: summary may be unreliable

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.974e-15  2.010e-17  3.680e-17  6.080e-17  3.933e-16 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -4.441e-17  4.142e-17 -1.072e+00    0.286    
## x            1.000e+00  4.591e-17  2.178e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.115e-16 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.744e+32 on 1 and 98 DF,  p-value: < 2.2e-16

df4 <- lm(x~y)
summary(df4)

## Warning in summary.lm(df4): essentially perfect fit: summary may be unreliable

## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.974e-15  2.010e-17  3.680e-17  6.080e-17  3.933e-16 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -4.441e-17  4.142e-17 -1.072e+00    0.286    
## y            1.000e+00  4.591e-17  2.178e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.115e-16 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.744e+32 on 1 and 98 DF,  p-value: < 2.2e-16