Exercise 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

The main difference between the KNN classifier and KNN regression is the type of response they predict: the KNN classifier assigns a test observation to the most prevalent class among its K nearest neighbors, whereas KNN regression predicts the average of the response values of those K nearest neighbors. The classifier is therefore used for qualitative responses, while KNN regression is used for quantitative responses.
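
A minimal sketch contrasting the two (illustrative only; it assumes the class and FNN packages are installed, neither of which is used elsewhere in these solutions):

set.seed(1)
train_X <- matrix(rnorm(100), ncol = 2)  # 50 training points in two dimensions
test_X  <- matrix(rnorm(10), ncol = 2)   # 5 test points
cls     <- factor(sample(c("A", "B"), 50, replace = TRUE))  # qualitative response
num_y   <- rnorm(50)                                        # quantitative response

# Classification: predict the most prevalent class among the K nearest neighbors
class::knn(train_X, test_X, cl = cls, k = 5)
# Regression: predict the average response of the K nearest neighbors
FNN::knn.reg(train_X, test_X, y = num_y, k = 5)$pred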

Exercise 9

  (a) Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
data(Auto)
pairs(Auto)

  (b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
  (c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
lm_w_mpg <- lm(mpg ~ . - name, data = Auto)
summary(lm_w_mpg)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
    i. Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4 has a p-value below 2.2e-16, so we can reject the null hypothesis that all coefficients are zero and conclude that there is a relationship between at least one of the predictors and the response.

    ii. Which predictors appear to have a statistically significant relationship to the response?

Examining the individual p-values, displacement, weight, year, and origin are statistically significant at the 0.05 level, while cylinders, horsepower, and acceleration are not, since their p-values exceed 0.05.

    iii. What does the coefficient for the year variable suggest?

The coefficient for year (0.750773) suggests that, holding the other predictors fixed, mpg increases by about 0.75 for each additional model year; in other words, cars have become more fuel-efficient over time.
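
A quick numerical check of this interpretation (an illustrative sketch; the predictor values below are arbitrary):

# Predict mpg for two otherwise-identical cars built one model year apart
new_cars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                       weight = 2500, acceleration = 15, year = c(76, 77),
                       origin = 1)
diff(predict(lm_w_mpg, new_cars))  # equals the year coefficient, ~0.75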

  (d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(lm_w_mpg)

Looking at the residuals vs. fitted plot, there appears to be some non-linearity in the data. The residuals vs. leverage plot also flags a few unusual observations: points 327, 394, and 14.

  (e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
part9e <- lm(mpg ~ horsepower * displacement + displacement * cylinders, data = Auto[, 1:8])
summary(part9e)
## 
## Call:
## lm(formula = mpg ~ horsepower * displacement + displacement * 
##     cylinders, data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1114  -2.1683  -0.4345   2.0054  18.2391 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.408e+01  2.564e+00  21.094  < 2e-16 ***
## horsepower              -2.318e-01  2.285e-02 -10.144  < 2e-16 ***
## displacement            -1.241e-01  1.444e-02  -8.592  < 2e-16 ***
## cylinders                1.224e-01  7.419e-01   0.165    0.869    
## horsepower:displacement  5.544e-04  8.214e-05   6.750 5.44e-11 ***
## displacement:cylinders   3.055e-03  2.957e-03   1.033    0.302    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.93 on 386 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7464 
## F-statistic: 231.2 on 5 and 386 DF,  p-value: < 2.2e-16

From above, we can see that the interaction between horsepower and displacement is significant, but the interaction between displacement and cylinders is not.
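
As a further check, the interaction terms can be tested jointly against a main-effects-only model with a nested F-test (a sketch reusing part9e from above):

no_int <- lm(mpg ~ horsepower + displacement + cylinders, data = Auto)
anova(no_int, part9e)  # a small p-value supports keeping the interaction terms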

  (f) Try a few different transformations of the variables, such as log(X), √X, and X². Comment on your findings.
par(mfrow = c(2, 2))
plot(log(Auto$displacement), Auto$mpg)
plot(sqrt(Auto$displacement), Auto$mpg)
plot((Auto$displacement)^2, Auto$mpg)

Among these transformations of displacement, the log transformation produces the plot that looks most linear.
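
A rough numeric comparison supports this (a sketch; each line fits a simple regression of mpg on one transformed predictor and reports its R-squared):

summary(lm(mpg ~ log(displacement), data = Auto))$r.squared
summary(lm(mpg ~ sqrt(displacement), data = Auto))$r.squared
summary(lm(mpg ~ I(displacement^2), data = Auto))$r.squared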

Exercise 10

  (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
data(Carseats)
carseats_model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseats_model)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  (b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

Sales is measured in thousands of units, so the Price coefficient (-0.054459) means that a $1 increase in price is associated with a decrease of about 54.46 units sold, holding the other predictors fixed. The UrbanYes coefficient (-0.021916) says that stores in urban locations sell about 21.92 fewer units than stores in rural locations, all else equal, although this effect is not statistically significant. The USYes coefficient (1.200573) says that stores in the US sell about 1200.57 more units than stores outside the US, all else equal.
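
R encodes the qualitative predictors as 0/1 dummy variables; contrasts() confirms the coding these interpretations rely on:

contrasts(Carseats$Urban)  # Yes = 1, No = 0
contrasts(Carseats$US)     # Yes = 1, No = 0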

  (c) Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.043469 + (−0.054459) × Price + (−0.021916) × Urban + (1.200573) × US + ε

where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
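
The equation can be sanity-checked against predict() (an illustrative example; the store below, priced at $100, urban, and in the US, is hypothetical):

new_store <- data.frame(Price = 100, Urban = "Yes", US = "Yes")
13.043469 + (-0.054459) * 100 + (-0.021916) * 1 + 1.200573 * 1  # by hand
predict(carseats_model, new_store)                              # via the model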

  (d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

We can reject the null hypothesis for the “Price” and “US” variables.

  (e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
model10e <- lm(Sales ~ Price + US, data = Carseats)
summary(model10e)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  (f) How well do the models in (a) and (e) fit the data?

Both models explain about 23.93% of the variability in Sales (R² = 0.2393). The smaller model in (e) fits marginally better on an adjusted basis (adjusted R² of 0.2354 vs. 0.2335, residual standard error of 2.469 vs. 2.472), so dropping Urban costs nothing. In absolute terms, though, neither model fits the data particularly well.

  (g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(model10e)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
  (h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(model10e)

Looking at the residuals vs. leverage plot, there appear to be some outliers in the data, as well as a few observations with relatively high leverage.
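
A numeric follow-up (a sketch using the usual rule of thumb that hat values far above the average (p + 1)/n indicate high leverage):

hv <- hatvalues(model10e)
which(hv > 3 * mean(hv))  # observations with leverage well above average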

Exercise 12

  (a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), regressing Y onto X gives β̂ = Σxᵢyᵢ / Σxᵢ², while regressing X onto Y gives Σxᵢyᵢ / Σyᵢ². The two estimates are therefore the same exactly when Σxᵢ² = Σyᵢ², that is, when the sum of squares of the observed x values equals the sum of squares of the observed y values.

  (b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(100)

x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.5)
sum(y^2)
## [1] 1353011
model.Y <- lm(y ~ x + 0)
model.X <- lm(x ~ y + 0)
summary(model.Y)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11910 -0.30013 -0.01178  0.34459  1.31061 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x 1.9996933  0.0008768    2281   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.51 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.201e+06 on 1 and 99 DF,  p-value: < 2.2e-16
summary(model.X)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6542 -0.1713  0.0067  0.1504  0.5607 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## y 0.5000672  0.0002193    2281   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2551 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.201e+06 on 1 and 99 DF,  p-value: < 2.2e-16
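
The two estimates can also be verified directly from the closed form in (3.38):

sum(x * y) / sum(x^2)  # matches coef(model.Y), ~2.00
sum(x * y) / sum(y^2)  # matches coef(model.X), ~0.50
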
  (c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
modely2 <- lm(y ~ x + 0)
modelx2 <- lm(x ~ y + 0)
summary(modely2)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
summary(modelx2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

As expected from part (a), sum(x^2) and sum(y^2) are equal (both 338350), so the two regressions yield the same coefficient estimate, 0.5075.