Question 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

  • The KNN classifier is used for classification problems, where the target variable is categorical. It assigns the class label that wins the majority vote among the k nearest neighbors.

  • KNN regression is used for regression problems, where the target variable is continuous. It predicts the output by averaging the target values of the k nearest neighbors.
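
The contrast between the two bullets can be sketched in a few lines of base R. This is only an illustration under simple assumptions (Euclidean distance, toy one-dimensional data); `knn_predict` is a hypothetical helper written for this note, not a function from any package.

```r
# Hypothetical helper: one KNN prediction for a single query point x0.
knn_predict <- function(train_x, train_y, x0, k) {
  # Euclidean distance from x0 to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  nn <- order(d)[seq_len(k)]            # indices of the k nearest neighbors
  if (is.factor(train_y)) {
    # Classification: majority vote among the neighbors' labels
    names(which.max(table(train_y[nn])))
  } else {
    # Regression: average of the neighbors' numeric responses
    mean(train_y[nn])
  }
}

# Toy data, made up for illustration
train_x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
y_class <- factor(c("a", "a", "a", "b", "b", "b"))
y_num   <- c(1, 2, 3, 10, 11, 12)

pred_class <- knn_predict(train_x, y_class, x0 = 2.5, k = 3)
pred_num   <- knn_predict(train_x, y_num,   x0 = 2.5, k = 3)
print(pred_class)
print(pred_num)
```

The neighbor search is identical in both cases; only the final step differs (vote versus mean).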

Question 9

pairs(Auto[, -9])

cor_matrix <- cor(Auto[, -9])
print(cor_matrix)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

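cylinders, displacement, horsepower and weight are strongly correlated with one another (every pairwise correlation is above 0.84), which indicates multicollinearity among these predictors. mpg, the response, is strongly negatively correlated with each of them (between -0.78 and -0.83).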
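
One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is a minimal base-R version; `vif_from_cor` is a hypothetical helper name, and the built-in mtcars data is used purely as a stand-in so the block runs without extra packages. For the data above, the call would be `vif_from_cor(Auto[, c("cylinders", "displacement", "horsepower", "weight")])`.

```r
# Hypothetical helper: VIF_j is the j-th diagonal entry of the inverse of
# the predictors' correlation matrix, which equals 1 / (1 - R_j^2).
vif_from_cor <- function(predictors) {
  setNames(diag(solve(cor(predictors))), colnames(predictors))
}

# Stand-in data (mtcars ships with base R); this is not the Auto data set.
X <- mtcars[, c("cyl", "disp", "hp", "wt")]
vif <- vif_from_cor(X)
print(round(vif, 2))

# Cross-check one entry against the definition 1 / (1 - R^2),
# where R^2 comes from regressing wt on the other predictors.
r2_wt <- summary(lm(wt ~ cyl + disp + hp, data = mtcars))$r.squared
print(all.equal(unname(vif["wt"]), 1 / (1 - r2_wt)))
```

A frequently quoted rule of thumb treats VIF values above 5 or 10 as a sign of problematic collinearity.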

model1 <- lm(mpg ~ . -name, data = Auto)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
anova(model1)
## Analysis of Variance Table
## 
## Response: mpg
##               Df  Sum Sq Mean Sq   F value    Pr(>F)    
## cylinders      1 14403.1 14403.1 1300.6838 < 2.2e-16 ***
## displacement   1  1073.3  1073.3   96.9293 < 2.2e-16 ***
## horsepower     1   403.4   403.4   36.4301 3.731e-09 ***
## weight         1   975.7   975.7   88.1137 < 2.2e-16 ***
## acceleration   1     1.0     1.0    0.0872    0.7679    
## year           1  2419.1  2419.1  218.4609 < 2.2e-16 ***
## origin         1   291.1   291.1   26.2912 4.666e-07 ***
## Residuals    384  4252.2    11.1                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Is there a relationship between the predictors and the response?
  • Yes. The F-statistic is 252.4 with a p-value below 2.2e-16, so the regression output strongly suggests that at least one predictor significantly affects mpg.
  2. Which predictors appear to have a statistically significant relationship to the response?
  • displacement, weight, year and origin have a statistically significant relationship to mpg because their p-values are below 0.05.
  3. What does the coefficient for the year variable suggest?
  • The coefficient of year is 0.750773, which means that for each additional model year, mpg increases by approximately 0.75 with the other predictors held fixed, so newer cars tend to be more fuel efficient.
par(mfrow = c(2,2))
plot(model1)

The residuals vs fitted plot shows notable non-linearity, which suggests the linear fit misses some curvature in the data. The Q-Q plot shows approximately normal residuals, with some points deviating at the right tail.

  1. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Interaction terms are fitted between the two most highly correlated predictor pairs: cylinders and displacement (0.95), and displacement and weight (0.93).

model_interact <- lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto[, 1:8])
summary(model_interact)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement * 
##     weight, data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2934  -2.5184  -0.3476   1.8399  17.7723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.262e+01  2.237e+00  23.519  < 2e-16 ***
## cylinders               7.606e-01  7.669e-01   0.992    0.322    
## displacement           -7.351e-02  1.669e-02  -4.403 1.38e-05 ***
## weight                 -9.888e-03  1.329e-03  -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03  3.426e-03  -0.872    0.384    
## displacement:weight     2.128e-05  5.002e-06   4.254 2.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7237 
## F-statistic: 205.8 on 5 and 386 DF,  p-value: < 2.2e-16

Of the two interactions, only displacement:weight is statistically significant (p = 2.64e-05); cylinders:displacement is not (p = 0.384).
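
As a reminder of how the two formula operators relate: `a * b` is shorthand for `a + b + a:b`, while `a:b` adds only the product term. The sketch below checks this on the built-in mtcars data, used here as a stand-in for Auto so the block runs without extra packages; the variable choice is illustrative only.

```r
# Stand-in data: mtcars (base R), not the Auto data set.
fit_star  <- lm(mpg ~ disp * wt, data = mtcars)
fit_colon <- lm(mpg ~ disp + wt + disp:wt, data = mtcars)

# Both formulas expand to the same design matrix, so the fits agree.
same_fit <- all.equal(coef(fit_star), coef(fit_colon))
print(same_fit)

# p-value of the interaction term alone, pulled from the coefficient table
p_int <- coef(summary(fit_star))["disp:wt", "Pr(>|t|)"]
print(p_int)
```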

model_log <- lm(log(mpg) ~ . -name, data = Auto)
summary(model_log)
## 
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40955 -0.06533  0.00079  0.06785  0.33925 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.751e+00  1.662e-01  10.533  < 2e-16 ***
## cylinders    -2.795e-02  1.157e-02  -2.415  0.01619 *  
## displacement  6.362e-04  2.690e-04   2.365  0.01852 *  
## horsepower   -1.475e-03  4.935e-04  -2.989  0.00298 ** 
## weight       -2.551e-04  2.334e-05 -10.931  < 2e-16 ***
## acceleration -1.348e-03  3.538e-03  -0.381  0.70339    
## year          2.958e-02  1.824e-03  16.211  < 2e-16 ***
## origin        4.071e-02  9.955e-03   4.089 5.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared:  0.8795, Adjusted R-squared:  0.8773 
## F-statistic: 400.4 on 7 and 384 DF,  p-value: < 2.2e-16
model_sqrt <- lm(sqrt(mpg) ~ . -name, data = Auto)
summary(model_sqrt)
## 
## Call:
## lm(formula = sqrt(mpg) ~ . - name, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98891 -0.18946  0.00505  0.16947  1.02581 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.075e+00  4.290e-01   2.506   0.0126 *  
## cylinders    -5.942e-02  2.986e-02  -1.990   0.0474 *  
## displacement  1.752e-03  6.942e-04   2.524   0.0120 *  
## horsepower   -2.512e-03  1.274e-03  -1.972   0.0493 *  
## weight       -6.367e-04  6.024e-05 -10.570  < 2e-16 ***
## acceleration  2.738e-03  9.131e-03   0.300   0.7644    
## year          7.381e-02  4.709e-03  15.675  < 2e-16 ***
## origin        1.217e-01  2.569e-02   4.735 3.09e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3074 on 384 degrees of freedom
## Multiple R-squared:  0.8561, Adjusted R-squared:  0.8535 
## F-statistic: 326.3 on 7 and 384 DF,  p-value: < 2.2e-16
model_2 <- lm((mpg)^2 ~ . -name, data = Auto)
summary(model_2)
## 
## Call:
## lm(formula = (mpg)^2 ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483.45 -141.87  -19.62  103.58 1042.84 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.878e+03  2.928e+02  -6.412 4.22e-10 ***
## cylinders    -1.436e+01  2.038e+01  -0.704  0.48157    
## displacement  1.328e+00  4.738e-01   2.802  0.00534 ** 
## horsepower   -3.587e-01  8.693e-01  -0.413  0.68009    
## weight       -3.522e-01  4.111e-02  -8.567 2.62e-16 ***
## acceleration  9.278e+00  6.232e+00   1.489  0.13740    
## year          4.081e+01  3.214e+00  12.698  < 2e-16 ***
## origin        9.509e+01  1.754e+01   5.422 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.8 on 384 degrees of freedom
## Multiple R-squared:  0.7292, Adjusted R-squared:  0.7243 
## F-statistic: 147.8 on 7 and 384 DF,  p-value: < 2.2e-16

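When mpg is squared, the model has the lowest R^2 (0.7292), compared with 0.8795 for the log transformation and 0.8561 for the square root; the untransformed model has 0.8215. Because each R^2 is computed on a different response scale, this comparison is only indicative.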
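
Since each R^2 is measured on its own transformed scale, one way to put the fits on a common footing is to back-transform the fitted values and compute the error on the original mpg scale. The sketch below does this on the built-in mtcars data as a stand-in for Auto; `rmse_original` is a hypothetical helper, the predictors are chosen only for illustration, and the simple back-transformation ignores retransformation bias.

```r
# Stand-in data: mtcars (base R); predictors chosen only for illustration.
fit_raw  <- lm(mpg ~ wt + hp, data = mtcars)
fit_log  <- lm(log(mpg) ~ wt + hp, data = mtcars)
fit_sqrt <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)
fit_sq   <- lm(I(mpg^2) ~ wt + hp, data = mtcars)

# Hypothetical helper: RMSE between observed mpg and back-transformed fits.
rmse_original <- function(pred_mpg) sqrt(mean((mtcars$mpg - pred_mpg)^2))

rmse <- c(
  raw    = rmse_original(fitted(fit_raw)),
  log    = rmse_original(exp(fitted(fit_log))),           # undo log
  sqrt   = rmse_original(fitted(fit_sqrt)^2),             # undo sqrt
  square = rmse_original(sqrt(pmax(fitted(fit_sq), 0)))   # undo square
)
print(round(rmse, 3))
```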

Question 10

data("Carseats")
model2 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model2)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

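Price: for every $1 increase in price, Sales decrease by about 0.054 thousand units (roughly 54 units), holding Urban and US fixed. The p-value indicates that the effect is significant.

UrbanYes: the high p-value (0.936) indicates that the variable is not significant, so there is no evidence that urban location affects sales.

USYes: US stores sell on average about 1.2 thousand units (roughly 1,200 units) more than stores outside the US, holding the other predictors fixed. This effect is statistically significant and suggests US stores perform better. (Sales is recorded in thousands of units in the Carseats data.)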

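  1. Sales = 13.04 - 0.0545 * Price - 0.0219 * UrbanYes + 1.2006 * USYes, where UrbanYes is 1 for a store in an urban location and 0 otherwise, and USYes is 1 for a store in the US and 0 otherwise.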

  2. Price and USYes have statistically significant effects on Sales, so we can reject the null hypothesis H0: beta_j = 0 for those predictors. We cannot reject it for UrbanYes (p = 0.936).
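
To make the fitted equation concrete, the sketch below evaluates it by hand for one store. The coefficients are copied from the `summary(model2)` output above; `predict_sales` is a hypothetical helper and the input values are made up for illustration.

```r
# Coefficients copied from the summary(model2) output above.
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: urban and us are 0/1 dummy indicators.
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price +
           b["urban_yes"] * urban + b["us_yes"] * us)
}

# Illustrative inputs: an urban US store charging $100.
pred <- predict_sales(price = 100, urban = 1, us = 1)
print(pred)
```

With the fitted model object available, `predict(model2, newdata = ...)` would do the same calculation without rounding the coefficients.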

model3 <- lm(Sales ~ Price + US, data = Carseats)
summary(model3)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Comparing the model from (a) with the model in (e), the adjusted R^2 improves slightly (0.2335 to 0.2354), the multiple R^2 is unchanged at 0.2393, and the RSE decreases slightly (2.472 to 2.469). The second model, which drops Urban, fits the data slightly better and is preferable because it retains model performance while removing an unnecessary variable. Both models still explain only about 24% of the variance in Sales.

confint(model3, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
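
The interval for Price can be reproduced by hand as estimate plus or minus a t quantile times the standard error. The sketch below uses the rounded estimate and standard error printed by `summary(model3)` above, so it matches the `confint` output only to about four decimal places.

```r
# Estimate and standard error for Price, copied from summary(model3) above.
est <- -0.05448
se  <- 0.00523
df  <- 397          # residual degrees of freedom reported by summary(model3)

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```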
par(mfrow = c(2, 2))
plot(model3)

The residuals vs leverage plot for the model in (e) shows evidence of some outliers and high-leverage observations.
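
The visual impression can be backed up numerically. The sketch below flags observations using two common rules of thumb, which are assumptions rather than fixed standards: an absolute studentized residual above 3, and leverage above twice the average (p + 1) / n. `flag_unusual` is a hypothetical helper, and mtcars is a stand-in so the block runs without the Carseats data; with the model above, the call would be `flag_unusual(model3)`.

```r
# Hypothetical helper: flag rows by studentized residual and leverage.
flag_unusual <- function(fit, resid_cut = 3, lev_mult = 2) {
  h <- hatvalues(fit)
  avg_lev <- length(coef(fit)) / length(h)   # average leverage, (p + 1) / n
  data.frame(
    outlier       = abs(rstudent(fit)) > resid_cut,
    high_leverage = h > lev_mult * avg_lev
  )
}

# Stand-in data: mtcars (base R), not the Carseats data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)
flags <- flag_unusual(fit)
print(colSums(flags))   # number of flagged rows per criterion
```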

Question 12

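For regression without an intercept, the coefficient of Y onto X is sum(x_i * y_i) / sum(x_i^2) and the coefficient of X onto Y is sum(x_i * y_i) / sum(y_i^2). The two coefficients are therefore equal only if X and Y have the same sum of squares, that is, sum(x_i^2) = sum(y_i^2).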
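
The closed-form slope for regression through the origin can be checked directly against `lm`. The sketch below is a minimal illustration; `slope_no_intercept` is a hypothetical helper, and the simulated data and seed are arbitrary.

```r
# Closed-form slope for regression through the origin (no intercept).
slope_no_intercept <- function(response, predictor) {
  sum(predictor * response) / sum(predictor^2)
}

set.seed(1)               # arbitrary seed for this illustration
x <- rnorm(50, sd = 5)
y <- rnorm(50, sd = 3)

b_yx <- slope_no_intercept(y, x)   # Y onto X
b_xy <- slope_no_intercept(x, y)   # X onto Y

# The closed form matches lm() without an intercept.
print(all.equal(b_yx, unname(coef(lm(y ~ x - 1)))))

# The ratio of the slopes equals the ratio of the sums of squares.
print(all.equal(b_yx / b_xy, sum(y^2) / sum(x^2)))
```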

set.seed(123)
X <- rnorm(100, mean = 0, sd = 5)
Y <- rnorm(100, mean = 0, sd = 3)

#  Y onto X
model_YX <- lm(Y ~ X - 1)  

#  X onto Y
model_XY <- lm(X ~ Y - 1) 

summary(model_YX)
## 
## Call:
## lm(formula = Y ~ X - 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.003 -2.370 -0.540  1.408  9.529 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## X -0.03818    0.06385  -0.598    0.551
## 
## Residual standard error: 2.914 on 99 degrees of freedom
## Multiple R-squared:  0.003598,   Adjusted R-squared:  -0.006466 
## F-statistic: 0.3575 on 1 and 99 DF,  p-value: 0.5512
summary(model_XY)
## 
## Call:
## lm(formula = X ~ Y - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5274  -2.5380   0.1912   3.3317  11.1065 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## Y -0.09426    0.15764  -0.598    0.551
## 
## Residual standard error: 4.578 on 99 degrees of freedom
## Multiple R-squared:  0.003598,   Adjusted R-squared:  -0.006466 
## F-statistic: 0.3575 on 1 and 99 DF,  p-value: 0.5512

Here X and Y have different sums of squares, so the two coefficient estimates differ (-0.03818 for Y onto X, -0.09426 for X onto Y). Setting Y equal to X makes the sums of squares identical, and both coefficients are then equal:

set.seed(123)
X <- rnorm(100, mean = 0, sd = 5)
Y <- X

#  Y onto X
model_YX <- lm(Y ~ X - 1)  

#  X onto Y
model_XY <- lm(X ~ Y - 1) 

summary(model_YX)
## Warning in summary.lm(model_YX): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = Y ~ X - 1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.366e-15 -1.360e-16  1.320e-17  2.015e-16  1.256e-14 
## 
## Coefficients:
##    Estimate Std. Error  t value Pr(>|t|)    
## X 1.000e+00  2.874e-17 3.48e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.312e-15 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.211e+33 on 1 and 99 DF,  p-value: < 2.2e-16
summary(model_XY)
## Warning in summary.lm(model_XY): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = X ~ Y - 1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.366e-15 -1.360e-16  1.320e-17  2.015e-16  1.256e-14 
## 
## Coefficients:
##    Estimate Std. Error  t value Pr(>|t|)    
## Y 1.000e+00  2.874e-17 3.48e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.312e-15 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.211e+33 on 1 and 99 DF,  p-value: < 2.2e-16
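
The example above uses Y = X, but equal sums of squares do not require identical vectors. As a sketch of a less trivial case, the block below sets Y to a random permutation of X, which keeps the sum of squares unchanged; the two coefficients should then agree even though Y differs from X.

```r
set.seed(123)
X <- rnorm(100, mean = 0, sd = 5)
Y <- sample(X)                        # same values in a different order

b_yx <- unname(coef(lm(Y ~ X - 1)))   # Y onto X
b_xy <- unname(coef(lm(X ~ Y - 1)))   # X onto Y

print(all.equal(sum(X^2), sum(Y^2)))  # sums of squares agree
print(all.equal(b_yx, b_xy))          # so the coefficients agree
```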