Assignment 2

Question 2: Carefully explain the differences between the KNN classifier and KNN regression methods.

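Answer: Both methods start by finding the K training observations closest to a test point. The KNN classifier predicts a class label, such as blue/red or spam/not spam, by taking the majority class among those K neighbors. KNN regression predicts a continuous value, such as a house price or a temperature, by averaging the neighbors' responses. They also differ in their evaluation metrics: classification relies on accuracy, precision, recall, and F1 score, while regression relies on MSE (mean squared error), RMSE, and MAE (mean absolute error).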
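
The contrast between voting and averaging can be sketched in a few lines of base R. This is a minimal illustration rather than a production implementation: the helper names (knn_neighbors, knn_classify, knn_regress) and the toy data are made up for this example, and Euclidean distance is assumed.

```r
# Minimal KNN sketch in base R (hypothetical helper names, toy data).
knn_neighbors <- function(train_x, x0, k) {
  # Euclidean distance from x0 to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  order(d)[seq_len(k)]
}

knn_classify <- function(train_x, train_y, x0, k) {
  idx <- knn_neighbors(train_x, x0, k)
  votes <- table(train_y[idx])
  names(votes)[which.max(votes)]   # majority vote (ties go to the first level)
}

knn_regress <- function(train_x, train_y, x0, k) {
  idx <- knn_neighbors(train_x, x0, k)
  mean(train_y[idx])               # average of the neighbors' responses
}

train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    6, 7,
                    7, 6), ncol = 2, byrow = TRUE)
labels <- c("blue", "blue", "blue", "red", "red", "red")
prices <- c(100, 110, 120, 300, 310, 320)

x0 <- c(1.5, 1.5)
pred_class <- knn_classify(train_x, labels, x0, k = 3)
pred_value <- knn_regress(train_x, prices, x0, k = 3)
print(pred_class)
print(pred_value)
```

The neighbor search is identical in both functions; only the final step, a vote versus a mean, changes.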

Question 9: This question involves the use of multiple linear regression on the Auto data set.

A. Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.3.3

pairs(~ ., data = Auto,
      main = "Auto Dataset Scatterplot Matrix")

B. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

auto_cor <- cor(subset(Auto, select = -name))
print(auto_cor)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

C. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

fit <- lm(mpg ~ . - name, data = Auto)
summary(fit)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Is there a relationship between the predictors and the response? Yes. The F-statistic of 252.4 has a p-value below 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero; at least one predictor is related to mpg.
Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin, since each has a p-value below 0.05. cylinders, horsepower, and acceleration do not.
What does the coefficient for the year variable suggest? Holding all other variables constant, each one-year increase in model year is associated with an increase of about 0.75 mpg, so newer cars tend to be more fuel efficient.
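
The overall and per-coefficient tests can also be pulled out of the fitted model programmatically rather than read off the printed summary. The sketch below refits the same model; the 0.05 cutoff and the variable names p_overall and signif_terms are choices made for this example.

```r
library(ISLR2)  # provides the Auto data set used above

fit <- lm(mpg ~ . - name, data = Auto)

# Overall F-test: H0 is that every slope coefficient equals zero
f <- summary(fit)$fstatistic            # named vector: value, numdf, dendf
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(f)
print(p_overall)

# Per-coefficient t-tests, filtered at the 0.05 level
coefs <- summary(fit)$coefficients
signif_terms <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
signif_terms <- setdiff(signif_terms, "(Intercept)")
print(signif_terms)
```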

D. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2,2))
plot(fit)

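The Residuals vs Fitted plot shows a U-shaped pattern rather than a random scatter of points, which suggests non-linearity that the model does not capture. In the Normal Q-Q plot, several points depart from the reference line, indicating possible outliers; this is consistent with the largest residual of about 13 mpg in the summary. In the Residuals vs Leverage plot, observation 14 sits far to the right of the other points, so it has unusually high leverage. High leverage alone does not make it an outlier, which would also require a large standardized residual.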
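
Leverage and outlyingness are separate quantities, and both can be checked numerically instead of by eye. The sketch below is one way to do that; the cutoffs (three times the average leverage, and an absolute studentized residual above 3) are common rules of thumb chosen for this example, not hard rules.

```r
library(ISLR2)

fit <- lm(mpg ~ . - name, data = Auto)

lev  <- hatvalues(fit)   # leverage of each observation
rstu <- rstudent(fit)    # studentized residuals

n <- nrow(Auto)
p <- length(coef(fit)) - 1          # number of predictors
avg_lev <- (p + 1) / n              # average leverage

# Observation with the highest leverage, and its studentized residual
top <- which.max(lev)
print(lev[top])
print(rstu[top])

# Rule-of-thumb flags (both cutoffs are conventions, not hard rules)
high_leverage <- names(lev)[lev > 3 * avg_lev]
outliers      <- names(rstu)[abs(rstu) > 3]
print(high_leverage)
print(outliers)
```

An observation that appears in only one of the two lists is either a high-leverage point or an outlier, but not both.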

E. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

fit1 <- lm(mpg ~ displacement:year, data = Auto)
fit2 <- lm(mpg ~ horsepower * weight, data = Auto)
fit3 <- lm(mpg ~ acceleration:year, data = Auto)

summary(fit1)

## 
## Call:
## lm(formula = mpg ~ displacement:year, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1566  -3.0276  -0.6339   2.5802  20.1066 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.530e+01  5.314e-01   66.42   <2e-16 ***
## displacement:year -8.100e-04  3.227e-05  -25.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.832 on 390 degrees of freedom
## Multiple R-squared:  0.6177, Adjusted R-squared:  0.6168 
## F-statistic: 630.2 on 1 and 390 DF,  p-value: < 2.2e-16

summary(fit2)

## 
## Call:
## lm(formula = mpg ~ horsepower * weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7725  -2.2074  -0.2708   1.9973  14.7314 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.356e+01  2.343e+00  27.127  < 2e-16 ***
## horsepower        -2.508e-01  2.728e-02  -9.195  < 2e-16 ***
## weight            -1.077e-02  7.738e-04 -13.921  < 2e-16 ***
## horsepower:weight  5.355e-05  6.649e-06   8.054 9.93e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared:  0.7484, Adjusted R-squared:  0.7465 
## F-statistic: 384.8 on 3 and 388 DF,  p-value: < 2.2e-16

summary(fit3)

## 
## Call:
## lm(formula = mpg ~ acceleration:year, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4083  -4.9868  -0.9834   4.6751  22.5613 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.561770   1.758112   1.457    0.146    
## acceleration:year 0.017642   0.001458  12.103   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.663 on 390 degrees of freedom
## Multiple R-squared:  0.273,  Adjusted R-squared:  0.2712 
## F-statistic: 146.5 on 1 and 390 DF,  p-value: < 2.2e-16

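All three product terms are statistically significant (p < 0.001 in each fit). Note, however, that fit1 and fit3 use only the : operator, so each contains the product term without the corresponding main effects; their tests show that the product is a useful predictor on its own rather than that an interaction exists beyond the main effects. fit2 uses *, which includes both main effects, and there the horsepower:weight interaction remains significant (p = 9.93e-15).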
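
To test an interaction over and above the main effects, the * operator can be used in place of : for the other two pairs as well. The sketch below shows the refit; the names fit1_full and fit3_full are made up for this example, and no particular result is assumed.

```r
library(ISLR2)

# `a * b` expands to a + b + a:b, so the main effects are included
fit1_full <- lm(mpg ~ displacement * year, data = Auto)
fit3_full <- lm(mpg ~ acceleration * year, data = Auto)

# The last row of each table is the interaction term
coef1 <- summary(fit1_full)$coefficients
coef3 <- summary(fit3_full)$coefficients
print(coef1)
print(coef3)
```

The p-value in the last row of each table is the one that answers whether the interaction is significant once both main effects are in the model.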

F. Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

fit_log <- lm(log(mpg) ~ cylinders + log(displacement) + log(horsepower) + log(weight) + acceleration + year + origin, data = Auto)

fit_sqrt <- lm(mpg ~ cylinders + sqrt(displacement) + sqrt(horsepower) + sqrt(weight) + acceleration + year + origin, data = Auto)

fit_quad <- lm(mpg ~ cylinders + displacement + horsepower + I(horsepower^2) + weight + I(weight^2) + acceleration + year + origin, data = Auto)


summary(fit_log)

## 
## Call:
## lm(formula = log(mpg) ~ cylinders + log(displacement) + log(horsepower) + 
##     log(weight) + acceleration + year + origin, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40203 -0.06561 -0.00048  0.05823  0.38672 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.939090   0.376026  18.454  < 2e-16 ***
## cylinders         -0.013813   0.010580  -1.306   0.1925    
## log(displacement)  0.003877   0.052652   0.074   0.9413    
## log(horsepower)   -0.254083   0.057923  -4.387 1.49e-05 ***
## log(weight)       -0.599957   0.081224  -7.386 9.47e-13 ***
## acceleration      -0.008336   0.003800  -2.194   0.0289 *  
## year               0.029594   0.001743  16.983  < 2e-16 ***
## origin             0.023370   0.010359   2.256   0.0246 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1139 on 384 degrees of freedom
## Multiple R-squared:  0.8899, Adjusted R-squared:  0.8879 
## F-statistic: 443.4 on 7 and 384 DF,  p-value: < 2.2e-16

summary(fit_sqrt)

## 
## Call:
## lm(formula = mpg ~ cylinders + sqrt(displacement) + sqrt(horsepower) + 
##     sqrt(weight) + acceleration + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4030 -1.9807 -0.1672  1.7124 12.9777 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.97092    4.94260   1.006   0.3152    
## cylinders           0.11130    0.32178   0.346   0.7296    
## sqrt(displacement)  0.14430    0.22341   0.646   0.5187    
## sqrt(horsepower)   -0.64976    0.30327  -2.143   0.0328 *  
## sqrt(weight)       -0.63983    0.07765  -8.240 2.75e-15 ***
## acceleration       -0.04568    0.10247  -0.446   0.6560    
## year                0.73646    0.04927  14.946  < 2e-16 ***
## origin              1.13268    0.28152   4.023 6.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.21 on 384 degrees of freedom
## Multiple R-squared:  0.8339, Adjusted R-squared:  0.8309 
## F-statistic: 275.5 on 7 and 384 DF,  p-value: < 2.2e-16

summary(fit_quad)

## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + I(horsepower^2) + 
##     weight + I(weight^2) + acceleration + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8713 -1.6140 -0.1788  1.4667 12.0738 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.110e+00  4.586e+00   1.332  0.18359    
## cylinders        1.600e-01  2.981e-01   0.537  0.59164    
## displacement    -9.982e-04  7.271e-03  -0.137  0.89087    
## horsepower      -2.086e-01  3.999e-02  -5.216 3.01e-07 ***
## I(horsepower^2)  6.217e-04  1.286e-04   4.833 1.96e-06 ***
## weight          -1.339e-02  2.125e-03  -6.303 8.07e-10 ***
## I(weight^2)      1.420e-06  2.835e-07   5.010 8.35e-07 ***
## acceleration    -1.830e-01  1.006e-01  -1.818  0.06979 .  
## year             7.724e-01  4.522e-02  17.081  < 2e-16 ***
## origin           7.372e-01  2.530e-01   2.914  0.00378 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.91 on 382 degrees of freedom
## Multiple R-squared:  0.8642, Adjusted R-squared:  0.861 
## F-statistic:   270 on 9 and 382 DF,  p-value: < 2.2e-16

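The non-linear transformations improve the fit. R-squared rises from 0.8215 for the original model to 0.8339 with square-root terms and 0.8642 with quadratic terms. The log model reports 0.8899, but its response is log(mpg) rather than mpg, so that value is not directly comparable with the others. The quadratic terms for horsepower and weight are both highly significant, which is consistent with the parabolic shape seen in the Residuals vs Fitted plot of the original model.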
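
Because the log model is fit to log(mpg), one way to compare it with the others is to map its fitted values back to the mpg scale and compute the same error measure for every model. The sketch below does this with in-sample RMSE; the rmse helper is defined here for the example, and the simple exp() back-transformation ignores retransformation bias.

```r
library(ISLR2)

fit      <- lm(mpg ~ . - name, data = Auto)
fit_log  <- lm(log(mpg) ~ cylinders + log(displacement) + log(horsepower) +
                 log(weight) + acceleration + year + origin, data = Auto)
fit_quad <- lm(mpg ~ cylinders + displacement + horsepower + I(horsepower^2) +
                 weight + I(weight^2) + acceleration + year + origin, data = Auto)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# exp() maps the log-scale fitted values back to mpg
# (a simple back-transformation that ignores retransformation bias)
rmse_mpg <- c(
  original  = rmse(Auto$mpg, fitted(fit)),
  log       = rmse(Auto$mpg, exp(fitted(fit_log))),
  quadratic = rmse(Auto$mpg, fitted(fit_quad))
)
print(rmse_mpg)
```

These are training errors, so they favor the more flexible models; a held-out comparison would be needed to judge predictive accuracy.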

Question 10: This question should be answered using the Carseats data set.

data("Carseats")
head(Carseats)

##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

A. Fit a multiple regression model to predict Sales using Price, Urban, and US.

car_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(car_fit)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

B. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

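Price - holding Urban and US fixed, each one unit increase in Price is associated with a decrease in Sales of about 0.0545 thousand units (roughly 54 units).

Urban - holding Price and US fixed, stores in urban areas sell about 0.0219 thousand units (roughly 22 units) fewer than non-urban stores, but this difference is not statistically significant (p = 0.936).

US - holding Price and Urban fixed, stores located in the US sell about 1.2006 thousand units (roughly 1,201 units) more than stores outside the US.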

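C. Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.0435 - 0.0545(Price) - 0.0219(Urban) + 1.2006(US), where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.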
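
To make the dummy coding concrete, the fitted equation can be written out for each combination of the two qualitative variables. This is a worked restatement using the coefficients from summary(car_fit), rounded to four decimals, and it assumes Urban and US are coded 1 for Yes and 0 for No.

```latex
\widehat{\text{Sales}} =
\begin{cases}
13.0435 - 0.0545\,\text{Price} & \text{Urban} = \text{No},\ \text{US} = \text{No} \\
13.0216 - 0.0545\,\text{Price} & \text{Urban} = \text{Yes},\ \text{US} = \text{No} \\
14.2440 - 0.0545\,\text{Price} & \text{Urban} = \text{No},\ \text{US} = \text{Yes} \\
14.2221 - 0.0545\,\text{Price} & \text{Urban} = \text{Yes},\ \text{US} = \text{Yes}
\end{cases}
```

The slope on Price is the same in every case; the qualitative variables only shift the intercept.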

D. For which of the predictors can you reject the null hypothesis H0 : βj =0?

We can reject the null hypothesis for Price (p < 2e-16) and US (p = 4.86e-06). We cannot reject it for Urban (p = 0.936).

E. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

new_model <- lm(Sales ~ Price + US, data = Carseats)

summary(new_model)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

F. How well do the models in (a) and (e) fit the data?

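Both models explain only about 24% of the variance in Sales (R-squared = 0.2393 for each), so neither fits the data especially well. The smaller model in (e) is marginally better, with a higher adjusted R-squared (0.2354 vs 0.2335) and a slightly lower residual standard error (2.469 vs 2.472).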
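
Since the model in (e) is nested inside the model in (a), the two can also be compared with a partial F-test. The sketch below refits both models and tabulates their fit statistics; the fit_stats name is made up for this example. With a single dropped term, the F-test is equivalent to the t-test on UrbanYes.

```r
library(ISLR2)

car_fit   <- lm(Sales ~ Price + Urban + US, data = Carseats)
new_model <- lm(Sales ~ Price + US, data = Carseats)

# Partial F-test: H0 is that the extra term (Urban) has a zero coefficient
print(anova(new_model, car_fit))

# Side-by-side fit summaries
fit_stats <- sapply(list(full = car_fit, reduced = new_model), function(m) {
  s <- summary(m)
  c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared, rse = s$sigma)
})
print(round(fit_stats, 4))
```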

G. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(new_model)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
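
The intervals reported by confint() can be reproduced by hand as the estimate plus or minus a t critical value times the standard error. The sketch below uses the rounded estimates and standard errors printed by summary(new_model), so it should agree with confint() only up to rounding.

```r
# Estimates and standard errors copied from summary(new_model) above
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)
df  <- 397                      # residual degrees of freedom

t_crit <- qt(0.975, df)         # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```

Neither interval contains zero, which matches the conclusion that both coefficients are statistically significant.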

H. Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2,2))
plot(new_model)

The Residuals vs Leverage plot shows a few observations with higher leverage than the rest. In the Normal Q-Q plot, the points fall close to the reference line, so there is no clear evidence of outliers.
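
Cook's distance combines residual size and leverage into a single influence measure, which gives a numeric complement to the plots. The sketch below lists the five most influential observations for the model from (e); the choice of five and the influence_tbl name are arbitrary choices for this example.

```r
library(ISLR2)

new_model <- lm(Sales ~ Price + US, data = Carseats)

cook <- cooks.distance(new_model)   # influence of each observation on the fit
top5 <- order(cook, decreasing = TRUE)[1:5]

influence_tbl <- data.frame(
  obs      = names(cook)[top5],
  cooks_d  = cook[top5],
  leverage = hatvalues(new_model)[top5],
  rstudent = rstudent(new_model)[top5],
  row.names = NULL
)
print(influence_tbl)
```

Reading the leverage and rstudent columns side by side shows whether a given observation is influential because of its predictor values, its residual, or both.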

Question 12: This problem involves simple linear regression without an intercept.

A. Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

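Without an intercept, the estimate for Y onto X is sum(x_i y_i) / sum(x_i^2), and the estimate for X onto Y is sum(x_i y_i) / sum(y_i^2). The numerators are identical, so the two estimates are exactly the same when the sum of squares of the X values equals the sum of squares of the Y values. The other circumstance is when the numerator is 0, that is, when X and Y are orthogonal; both estimates are then 0.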
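
The condition can be derived in a few lines from the no-intercept estimate. The subscript notation below is introduced for this derivation only, and both sums of squares are assumed to be nonzero.

```latex
\hat{\beta}_{Y \mid X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \mid Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}

\hat{\beta}_{Y \mid X} = \hat{\beta}_{X \mid Y}
\iff
\Big(\sum_{i=1}^{n} x_i y_i\Big)\Big(\sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i^2\Big) = 0
\iff
\sum_{i=1}^{n} x_i y_i = 0
\ \text{ or } \
\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2
```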

B. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(42)

X <- rnorm(100, mean = 0, sd = 1)
Y <- 2 * X + rnorm(100, mean = 0, sd = 0.5)

model_Y_onto_X <- lm(Y ~ X - 1)
beta_Y_onto_X  <- coef(model_Y_onto_X)

model_X_onto_Y <- lm(X ~ Y - 1)
beta_X_onto_Y  <- coef(model_X_onto_Y)

print(beta_X_onto_Y)

##         Y 
## 0.4746934

print(beta_Y_onto_X)

##        X 
## 2.012243

C. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(123)

X <- rnorm(100, mean = 5, sd = 2)

Y <- rev(X)

model_Y_onto_X2 <- lm(Y ~ X - 1)
beta_Y_onto_X2  <- coef(model_Y_onto_X2)

model_X_onto_Y2 <- lm(X ~ Y - 1)
beta_X_onto_Y2  <- coef(model_X_onto_Y2)

print(beta_Y_onto_X2)

##         X 
## 0.9087063

print(beta_X_onto_Y2)

##         Y 
## 0.9087063
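
The two estimates agree because rev(X) only reorders the values, which leaves the sum of squares unchanged. The sketch below checks that directly and recomputes both slopes from the closed-form no-intercept estimate rather than from lm(); the variable names are chosen for this example.

```r
set.seed(123)
X <- rnorm(100, mean = 5, sd = 2)
Y <- rev(X)

# Reordering the values leaves the sum of squares unchanged
print(all.equal(sum(X^2), sum(Y^2)))

# No-intercept slopes computed directly from the closed-form estimate
beta_Y_onto_X <- sum(X * Y) / sum(X^2)
beta_X_onto_Y <- sum(X * Y) / sum(Y^2)
print(c(beta_Y_onto_X, beta_X_onto_Y))
```

Any other rearrangement of X, such as sample(X), would satisfy the same condition.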

Jason Villa

2026-06-23