Problem 2: Carefully explain the differences between the KNN classifier and KNN regression methods.

Both the KNN classifier and KNN regression make a prediction at a point \(x_0\) by using the K training observations closest to \(x_0\). They differ in the type of response they handle: KNN classification works with a qualitative (categorical) response and assigns the prediction point to a class, whereas KNN regression works with a quantitative response and estimates its numerical value. In KNN classification, the neighborhood of \(x_0\) is identified and the conditional probability \(P(Y=j|X=x_0)\) for class \(j\) is estimated as the proportion of points in the neighborhood whose responses equal \(j\); \(x_0\) is then assigned to the class with the highest estimated probability. In KNN regression, the same neighborhood of \(x_0\) is identified and \(f(x_0)\) is estimated as the average of the training responses in that neighborhood.
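
The contrast can be made concrete with a small base-R sketch (toy data and illustrative variable names such as x.train, y.class, y.num, x0 and K; none of these come from the text):

set.seed(1)
x.train <- matrix(rnorm(40), ncol = 2)            # 20 training points in two dimensions
y.class <- factor(sample(c("A", "B"), 20, TRUE))  # qualitative response (classification)
y.num   <- rnorm(20)                              # quantitative response (regression)
x0 <- c(0, 0)                                     # prediction point
K  <- 5

# indices of the K training points closest to x0 (Euclidean distance)
d  <- sqrt(rowSums((x.train - matrix(x0, 20, 2, byrow = TRUE))^2))
nb <- order(d)[1:K]

# KNN classification: estimate P(Y = j | X = x0) by the class proportions
# among the neighbors, then predict the majority class
table(y.class[nb]) / K
names(which.max(table(y.class[nb])))

# KNN regression: estimate f(x0) by the average response among the neighbors
mean(y.num[nb])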

Problem 9: This question involves the use of multiple linear regression on the Auto data set.

library(ISLR2)

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

Auto$name = as.factor(Auto$name)
pairs(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

auto.mlr = lm(mpg ~ . -name, data=Auto)
summary(auto.mlr)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?
Yes. The F-statistic of 252.4, with a p-value below 2.2e-16, lets us reject the null hypothesis that all of the slope coefficients are zero, so there is a relationship between the predictors and the response. Several individual predictors also have very small p-values; a p-value below the conventional 0.05 cutoff means the observed association would be very unlikely if that coefficient were truly zero. (A short numerical check of this, and of item iii, appears after item iii.)
ii. Which predictors appear to have a statistically significant relationship to the response?
Generally, we take a variable to be significant, i.e. to have some relationship with the response, if its p-value is less than 0.05. In this case all predictors are statistically significant except “cylinders”, “horsepower” and “acceleration”.
iii. What does the coefficient for the year variable suggest?
The year coefficient is about 0.75. Holding the other predictors fixed, mpg is predicted to increase by roughly 0.75 for each additional model year, i.e. by about 3 mpg every four years.
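
As promised above, a quick check (a sketch using the fitted object auto.mlr; output omitted here) pulls the overall F-test p-value and the four-year mpg gain directly from the fit:

# overall F-test of H0: all slope coefficients are zero (item i)
fstat <- summary(auto.mlr)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)

# estimated mpg gain over four model years, other predictors held fixed (item iii)
4 * coef(auto.mlr)["year"]
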
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(auto.mlr)

The plot of residuals versus fitted values shows a slight nonlinearity in the data. The plot of standardized residuals versus leverage shows one high-leverage point (observation 14) and a few possible outliers (standardized residuals above 2 or below -2).
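
These impressions can be backed up numerically (a sketch based on the fitted object auto.mlr; output not shown):

# observation with the largest leverage (hat value)
which.max(hatvalues(auto.mlr))

# observations whose standardized residuals fall outside (-2, 2)
which(abs(rstandard(auto.mlr)) > 2)
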
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

interact.fit = lm(mpg ~ . -name + horsepower*displacement, data=Auto)
origin.hp = lm(mpg ~ . -name + horsepower*origin, data=Auto)
summary(origin.hp)
## 
## Call:
## lm(formula = mpg ~ . - name + horsepower * origin, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.277 -1.875 -0.225  1.570 12.080 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.196e+01  4.396e+00  -4.996 8.94e-07 ***
## cylinders         -5.275e-01  3.028e-01  -1.742   0.0823 .  
## displacement      -1.486e-03  7.607e-03  -0.195   0.8452    
## horsepower         8.173e-02  1.856e-02   4.404 1.38e-05 ***
## weight            -4.710e-03  6.555e-04  -7.186 3.52e-12 ***
## acceleration      -1.124e-01  9.617e-02  -1.168   0.2434    
## year               7.327e-01  4.780e-02  15.328  < 2e-16 ***
## origin             7.695e+00  8.858e-01   8.687  < 2e-16 ***
## horsepower:origin -7.955e-02  1.074e-02  -7.405 8.44e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.116 on 383 degrees of freedom
## Multiple R-squared:  0.8438, Adjusted R-squared:  0.8406 
## F-statistic: 258.7 on 8 and 383 DF,  p-value: < 2.2e-16

Based on the correlations in (b), the interaction terms examined are displacement:horsepower and horsepower:origin. In the model above, the horsepower:origin interaction is highly significant (p ≈ 8.4e-13).

inter.fit = lm(mpg ~ .-name + horsepower:origin + horsepower:displacement, data=Auto)
summary(inter.fit)
## 
## Call:
## lm(formula = mpg ~ . - name + horsepower:origin + horsepower:displacement, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7222 -1.5251 -0.0968  1.3553 12.8419 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -4.706e+00  4.686e+00  -1.004   0.3159    
## cylinders                5.142e-01  3.139e-01   1.638   0.1022    
## displacement            -6.970e-02  1.143e-02  -6.098 2.63e-09 ***
## horsepower              -1.540e-01  3.547e-02  -4.342 1.81e-05 ***
## weight                  -3.084e-03  6.478e-04  -4.761 2.73e-06 ***
## acceleration            -2.276e-01  9.099e-02  -2.501   0.0128 *  
## year                     7.349e-01  4.460e-02  16.478  < 2e-16 ***
## origin                   2.281e+00  1.090e+00   2.092   0.0371 *  
## horsepower:origin       -1.918e-02  1.278e-02  -1.500   0.1343    
## displacement:horsepower  4.665e-04  6.127e-05   7.614 2.10e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.908 on 382 degrees of freedom
## Multiple R-squared:  0.8644, Adjusted R-squared:  0.8612 
## F-statistic: 270.6 on 9 and 382 DF,  p-value: < 2.2e-16

Adding more interaction terms changes the significance of the terms that were significant before: with both interactions included, displacement:horsepower remains highly significant, while horsepower:origin is no longer significant (p ≈ 0.13), and the significance of several main effects shifts as well.
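
One way to judge whether the interactions add explanatory power beyond the main-effects model from (c) is a nested-model F-test (a sketch; output omitted):

# compare the main-effects fit with the fit that adds both interaction terms
anova(auto.mlr, inter.fit)
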
(f) Try a few different transformations of the variables, such as log(X), √X, \(X^2\). Comment on your findings.

par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
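
To go beyond visual inspection, one option (a sketch, under the assumption that a simple mpg-on-horsepower regression is the comparison of interest; output not shown) is to compare the \(R^2\) of single-predictor fits under each transformation:

# R^2 of mpg regressed on each transformation of horsepower
summary(lm(mpg ~ log(horsepower),  data = Auto))$r.squared
summary(lm(mpg ~ sqrt(horsepower), data = Auto))$r.squared
summary(lm(mpg ~ I(horsepower^2),  data = Auto))$r.squared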

Problem 10: This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
carseat.fit = lm(Sales ~ Price + Urban + US, data=Carseats)
summary(carseat.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
As per the multiple regression model from 10(a), and recalling that Sales is recorded in thousands of units, the Price coefficient means that, with all other predictors held constant, a one-dollar increase in price is associated with a decrease in sales of about 54.46 units. The Urban coefficient means that, other things equal, urban stores sell on average about 21.92 fewer units than rural stores, although this effect is not statistically significant. The US coefficient means that, other things equal, US stores sell on average about 1200.57 more units than non-US stores.
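
These interpretations rest on how R dummy-codes the factors; assuming Urban and US are stored as factors (as in the ISLR2 version of Carseats), the coding can be confirmed directly (a sketch; output omitted):

# treatment coding: "No" is the baseline level, and "Yes" is the dummy variable
# that appears as UrbanYes / USYes in the regression output
contrasts(Carseats$Urban)
contrasts(Carseats$US)
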
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
\(\text{Sales} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} + 1.200573 \times \text{US} + \varepsilon\)
where US=1 if the store is in the US and 0 otherwise, and Urban=1 if the store is in an urban area and 0 otherwise.
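
As a check that the equation handles the dummy variables correctly, the hand-computed value can be compared with predict() (a sketch with an arbitrary example price of 120; output omitted):

# hand-computed prediction for a non-urban US store at Price = 120
13.043469 - 0.054459 * 120 - 0.021916 * 0 + 1.200573 * 1

# the same prediction from the fitted model in (a)
predict(carseat.fit, newdata = data.frame(Price = 120, Urban = "No", US = "Yes"))
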
(d) For which of the predictors can you reject the null hypothesis \(H_0 : β_j = 0\)?
We can reject the null hypothesis for the predictors Price and US; for Urban (p = 0.936) we cannot.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

carseat.fit2 = lm(Sales ~ Price + US, data=Carseats)
summary(carseat.fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?
Both models explain about 23.93% of the variability in Sales (\(R^2 \approx 0.2393\)). The smaller model in (e) fits slightly better than the bigger model by adjusted \(R^2\) (0.2354 vs. 0.2335) and has a marginally lower residual standard error (2.469 vs. 2.472), so dropping Urban costs essentially nothing.
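
A compact way to place the two fits side by side (a sketch; output omitted) is to pull the \(R^2\) and adjusted \(R^2\) out of each summary:

# R^2 and adjusted R^2 for the full model (a) and the reduced model (e)
c(summary(carseat.fit)$r.squared,  summary(carseat.fit)$adj.r.squared)
c(summary(carseat.fit2)$r.squared, summary(carseat.fit2)$adj.r.squared)
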
(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(carseat.fit2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2,2))
plot(carseat.fit2)

In the residuals-versus-leverage plot, one observation lies far to the right of the rest, indicating very high leverage; a few other observations have considerable leverage as well.
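
The visual impression can be checked numerically (a sketch; one common rule of thumb flags hat values well above the average leverage (p + 1)/n and studentized residuals beyond ±3; output omitted):

# average leverage (p + 1)/n for the model in (e), then flag extreme observations
p <- length(coef(carseat.fit2)) - 1
n <- nrow(Carseats)
(p + 1) / n
which(hatvalues(carseat.fit2) > 3 * (p + 1) / n)  # well above the average leverage
which(abs(rstudent(carseat.fit2)) > 3)            # candidate outliers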

Problem 12: This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X whenever the sum of squares of the observed x values equals the sum of squares of the observed y values. One simple case is a perfect linear relationship with no irreducible error, Y = X.
The coefficient estimate of Y onto X is:
\(\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_j x_j^2}\)
The coefficient estimate of X onto Y is:
\(\hat{\beta}' = \frac{\sum_i x_i y_i}{\sum_j y_j^2}\)
The numerators are identical, so the two coefficients are the same if and only if:
\(\sum_j x_j^2 = \sum_j y_j^2\)
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(0)
x = rnorm(100)
y = 2 * x + rnorm(100)  # sum(x^2) and sum(y^2) differ, so the two slopes will differ
fit.Y <- lm(y ~ x + 0)  # regression of Y onto X without an intercept
fit.X <- lm(x ~ y + 0)  # regression of X onto Y without an intercept
summary(fit.Y)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6391 -0.8650 -0.2032  0.5898  2.7879 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   2.1374     0.1092   19.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9589 on 99 degrees of freedom
## Multiple R-squared:  0.7948, Adjusted R-squared:  0.7927 
## F-statistic: 383.4 on 1 and 99 DF,  p-value: < 2.2e-16
summary(fit.X)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22971 -0.24830  0.04216  0.34170  0.71230 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.37185    0.01899   19.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4 on 99 degrees of freedom
## Multiple R-squared:  0.7948, Adjusted R-squared:  0.7927 
## F-statistic: 383.4 on 1 and 99 DF,  p-value: < 2.2e-16
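
Consistent with part (a), the two slope estimates differ here precisely because the sums of squares differ; this can be verified directly (a quick check; output omitted):

sum(x^2)  # not equal to sum(y^2), so the two no-intercept slopes are not equal
sum(y^2)

# the product of the two slopes, (sum(x*y))^2 / (sum(x^2) * sum(y^2)),
# equals the shared Multiple R-squared of 0.7948 reported above
coef(fit.Y) * coef(fit.X)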

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

y = x                   # now sum(x^2) == sum(y^2), so both regressions give slope 1
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6.121e-16 -3.665e-17 -8.400e-19  4.368e-17  2.976e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 1.000e+00  1.058e-17 9.449e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.297e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.928e+33 on 1 and 99 DF,  p-value: < 2.2e-16
summary(fit.X)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6.121e-16 -3.665e-17 -8.400e-19  4.368e-17  2.976e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## y 1.000e+00  1.058e-17 9.449e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.297e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.928e+33 on 1 and 99 DF,  p-value: < 2.2e-16
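
Since y = x here, sum(x^2) equals sum(y^2), and the two estimates coincide at exactly 1; this can be confirmed directly (a quick check; output omitted):

# both no-intercept slopes are identical when y = x
all.equal(unname(coef(fit.Y)), unname(coef(fit.X)))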