Assignment 2 - ISLR Chapter 3

Problem 2

The primary differences between the KNN classification and KNN regression models are as follows:

KNN classification is used for qualitative response variables, and its purpose is to predict which class an observation belongs to by estimating local class probabilities. The KNN classifier first identifies the K points in the training data closest to \(x_{0}\) (this set is represented by \(N_{0}\)). The conditional probability for class \(j\) is then estimated as the fraction of points in \(N_{0}\) whose response values equal \(j\). Lastly, KNN classifies the test observation \(x_{0}\) to the class with the largest estimated probability.

KNN regression is used for quantitative response variables, and its purpose is to predict the value of the response by computing a local average. KNN regression first identifies the K training observations that are closest to \(x_{0}\) (this set is represented by \(N_{0}\)). Secondly, it estimates \(f(x_{0})\) using the average of all the training responses in \(N_{0}\).
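The two procedures described above can be sketched in a few lines of base R. This is only an illustrative sketch: the function name knn_predict, the toy training data, and the choice of Euclidean distance are assumptions made for the example and are not part of the assignment.

```r
# Minimal KNN sketch for a single query point x0 (illustrative, not optimized).
# knn_predict and the toy data below are hypothetical names/values.
knn_predict <- function(train_x, train_y, x0, k,
                        type = c("classification", "regression")) {
  type <- match.arg(type)
  train_x <- as.matrix(train_x)
  # Euclidean distance from every training observation to x0
  dists <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  # N0: indices of the K training observations closest to x0
  n0 <- order(dists)[seq_len(k)]
  if (type == "classification") {
    # Estimated probability of class j = fraction of N0 with response j
    probs <- table(train_y[n0]) / k
    names(probs)[which.max(probs)]
  } else {
    # Regression: average of the training responses in N0
    mean(train_y[n0])
  }
}

# Toy training set: two well-separated clusters
train_x <- matrix(c(0, 0,  0, 1,  1, 0,  5, 5,  5, 6,  6, 5),
                  ncol = 2, byrow = TRUE)
labels <- c("a", "a", "a", "b", "b", "b")
values <- c(1, 2, 3, 10, 11, 12)

class_pred <- knn_predict(train_x, labels, x0 = c(0.2, 0.2), k = 3,
                          type = "classification")
reg_pred   <- knn_predict(train_x, values, x0 = c(5.5, 5.5), k = 3,
                          type = "regression")
```

The only difference between the two branches is the last step: a majority vote over \(N_{0}\) for classification versus a mean over \(N_{0}\) for regression.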

Problem 9 - Part A

In this chunk, I am reading in the data set and converting the variable “name” from character to factor so the pairs plot will generate.

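library(dplyr)  # provides %>%, mutate_all() and mutate_if()
# Read the data, treating "?" as missing, then drop incomplete rows
auto <- read.csv("C:/Users/rdfar/OneDrive/Documents/UTSA/Summer 2020/Algorithms II/Homework/Auto.csv", na.strings="?")
auto <- na.omit(auto)
# Convert any remaining character columns (such as name) to factors
auto <- auto %>%
  mutate_all(type.convert) %>%
  mutate_if(is.character, as.factor)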

In this chunk, I am generating a pairs plot of all variables in the auto data set.

pairs(auto)

Problem 9 - Part B

In this chunk, I am computing the matrix of correlations between the variables and removing the qualitative name variable.

cor(subset(auto, select = -name))

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
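As a side note, the qualitative column does not have to be dropped by name. The sketch below keeps only the numeric columns before calling cor(). It uses the built-in iris data as a stand-in (an assumption for illustration, since the Auto file is not available here), so the numbers it produces are unrelated to the matrix above.

```r
# Stand-in data: iris has four numeric columns and one factor (Species)
numeric_cols <- sapply(iris, is.numeric)

# Correlation matrix of the numeric columns only
cor_mat <- cor(iris[, numeric_cols])

# Rank the other variables by absolute correlation with one chosen variable
target <- "Sepal.Length"
ranked <- sort(abs(cor_mat[target, colnames(cor_mat) != target]),
               decreasing = TRUE)
```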

Problem 9 - Part C

In this chunk, I am generating a multiple linear regression model with mpg as the response and all variables except name as the predictors.

lmfit <- lm(mpg~.-name, auto)
summary(lmfit)

## 
## Call:
## lm(formula = mpg ~ . - name, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Problem 9 - Part C (I)

The results of the summary suggest that there is a relationship between the predictors and the response variable. Since the p-value for the F-statistic is small (< 2.2e-16), there is strong evidence that at least one predictor has an effect on the variable mpg.
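The F-statistic and its p-value can be reproduced approximately from the rounded values printed in the summary. The sketch below uses only those printed numbers (R-squared 0.8215, 7 predictors, 384 residual degrees of freedom), so small rounding differences from the reported 252.4 are expected.

```r
r2 <- 0.8215     # Multiple R-squared from the summary (rounded)
p  <- 7          # number of predictors
n  <- 384 + p + 1  # observations = residual df + predictors + intercept

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_from_r2 <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability of the reported F-statistic under H0
p_value <- pf(252.4, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
```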

Problem 9 - Part C (II)

The small p-values (< 0.05) for the coefficients in the summary table provide evidence to suggest that the variables displacement, weight, year and origin have a statistically significant relationship with the response variable mpg. In contrast, the large p-values (> 0.05) for the variables cylinders, horsepower, and acceleration mean there is insufficient evidence that these variables have a statistically significant relationship with mpg once the other predictors are in the model.

Problem 9 - Part C (III)

The coefficient for the year variable suggests that, with the other predictors held fixed, for every one unit increase in year, on average mpg increases by 0.750773. This further suggests that on average new cars are becoming more fuel efficient each year.

Problem 9 - Part D

The diagnostic plots for the linear fit suggest that the residuals are approximately normal except for some observations in the right tail of the Normal Q-Q plot. The Residuals vs Fitted plot shows evidence of non-linearity in the data.

The residual plots suggest that there are some unusually large outliers in the data. For example, the Residuals vs Leverage plot shows several standardized residuals below -2 and above 2 on the Y axis. The leverage plot also suggests that observation 14 is a high leverage point in comparison to the other observations.

par(mfrow = c(2,2))
plot(lmfit)
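The same checks can be made numerically rather than read off the plots. The sketch below is a rough illustration on the built-in mtcars data (a stand-in, since the Auto file is not available here). The cutoff of 2 for standardized residuals mirrors the discussion above, and flagging leverage above twice the average is a common rule of thumb that is assumed here, not something taken from the assignment.

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

lev      <- hatvalues(fit)   # leverage statistic for each observation
std_res  <- rstandard(fit)   # standardized residuals (Y axis of the leverage plot)

n <- nrow(mtcars)
p <- length(coef(fit)) - 1   # number of predictors
avg_lev <- (p + 1) / n       # average leverage is always (p + 1) / n

# Assumed rules of thumb for flagging observations
possible_outliers <- names(std_res)[abs(std_res) > 2]
high_leverage     <- names(lev)[lev > 2 * avg_lev]
```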

Problem 9 - Part E

In the linear regression model with the interaction effect of weight*displacement included, there is evidence to suggest that the interaction is statistically significant since the corresponding p-value is small (much less than 0.05).

lmfit_int1 <- lm(mpg~ year+origin+weight*displacement, auto)
summary(lmfit_int1)

## 
## Call:
## lm(formula = mpg ~ year + origin + weight * displacement, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6119  -1.7290  -0.0115   1.5609  12.5584 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -8.007e+00  3.798e+00  -2.108   0.0357 *  
## year                 8.194e-01  4.518e-02  18.136  < 2e-16 ***
## origin               3.567e-01  2.574e-01   1.386   0.1666    
## weight              -1.054e-02  6.530e-04 -16.146  < 2e-16 ***
## displacement        -7.148e-02  9.176e-03  -7.790 6.27e-14 ***
## weight:displacement  2.104e-05  2.214e-06   9.506  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.016 on 386 degrees of freedom
## Multiple R-squared:  0.8526, Adjusted R-squared:  0.8507 
## F-statistic: 446.5 on 5 and 386 DF,  p-value: < 2.2e-16
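One way to read the interaction term is that the effect of weight on mpg now depends on displacement: the slope for weight is the weight coefficient plus the interaction coefficient times displacement. The sketch below plugs in the rounded estimates printed above. The helper weight_slope and the two displacement values are chosen only for illustration.

```r
# Rounded estimates copied from the summary of lmfit_int1
b_weight      <- -1.054e-02
b_interaction <-  2.104e-05

# Change in predicted mpg per one-unit increase in weight,
# at a given displacement (hypothetical helper)
weight_slope <- function(displacement) {
  b_weight + b_interaction * displacement
}

slope_small <- weight_slope(100)  # small-displacement engine
slope_large <- weight_slope(400)  # large-displacement engine
```

Under these estimates the weight slope is negative at both values but closer to zero at the larger displacement.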

In the linear regression model with the interaction effect of year*origin included, there is evidence to suggest that the interaction is statistically significant since the corresponding p-value is small (0.000904).

lmfit_int2 <- lm(mpg~ year*origin+weight+displacement, auto)
summary(lmfit_int2)

## 
## Call:
## lm(formula = mpg ~ year * origin + weight + displacement, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7541 -1.8722 -0.0936  1.6900 12.4650 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.927e+00  8.873e+00   0.893 0.372229    
## year          4.313e-01  1.130e-01   3.818 0.000157 ***
## origin       -1.449e+01  4.707e+00  -3.079 0.002225 ** 
## weight       -6.394e-03  5.526e-04 -11.571  < 2e-16 ***
## displacement  1.551e-03  4.859e-03   0.319 0.749735    
## year:origin   2.023e-01  6.047e-02   3.345 0.000904 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.303 on 386 degrees of freedom
## Multiple R-squared:  0.8232, Adjusted R-squared:  0.8209 
## F-statistic: 359.5 on 5 and 386 DF,  p-value: < 2.2e-16

In the linear regression model with the interaction effect of year*weight included, there is evidence to suggest that the interaction is statistically significant since the corresponding p-value is small (much less than 0.05).

lmfit_int3 <- lm(mpg~ origin+year*weight+displacement, auto)
summary(lmfit_int3)

## 
## Call:
## lm(formula = mpg ~ origin + year * weight + displacement, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9402 -1.8736 -0.0966  1.5924 12.2125 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.076e+02  1.290e+01  -8.339 1.34e-15 ***
## origin        9.116e-01  2.547e-01   3.579 0.000388 ***
## year          1.962e+00  1.716e-01  11.436  < 2e-16 ***
## weight        2.605e-02  4.552e-03   5.722 2.12e-08 ***
## displacement -4.020e-04  4.558e-03  -0.088 0.929767    
## year:weight  -4.305e-04  5.967e-05  -7.214 2.89e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.145 on 386 degrees of freedom
## Multiple R-squared:  0.8397, Adjusted R-squared:  0.8376 
## F-statistic: 404.4 on 5 and 386 DF,  p-value: < 2.2e-16

Problem 9 - Part F

In this chunk, I generated four multiple regression models for mpg. The first model includes origin, year and weight as predictors. In the second model, I performed a log transformation of the weight variable. In the third model, I performed a square root transformation of the weight variable. Lastly, I squared the weight variable in the fourth model.

lmfit_trans1 <- lm(mpg~ origin+year+weight, auto)
lmfit_trans2 <- lm(mpg~ origin+year+log(weight), auto)
lmfit_trans3 <- lm(mpg~ origin+year+sqrt(weight), auto)
lmfit_trans4 <- lm(mpg~ origin+year+I(weight^2), auto)

summary(lmfit_trans1)

## 
## Call:
## lm(formula = mpg ~ origin + year + weight, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9440 -2.0948 -0.0389  1.7255 13.2722 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.805e+01  4.001e+00  -4.510 8.60e-06 ***
## origin       1.150e+00  2.591e-01   4.439 1.18e-05 ***
## year         7.571e-01  4.832e-02  15.668  < 2e-16 ***
## weight      -5.994e-03  2.541e-04 -23.588  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.348 on 388 degrees of freedom
## Multiple R-squared:  0.8175, Adjusted R-squared:  0.816 
## F-statistic: 579.2 on 3 and 388 DF,  p-value: < 2.2e-16

summary(lmfit_trans2)

## 
## Call:
## lm(formula = mpg ~ origin + year + log(weight), data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9120 -1.9384 -0.0257  1.5961 13.1033 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 115.76550    7.53529  15.363  < 2e-16 ***
## origin        0.75026    0.24722   3.035  0.00257 ** 
## year          0.77969    0.04477  17.417  < 2e-16 ***
## log(weight) -19.19080    0.72701 -26.397  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.123 on 388 degrees of freedom
## Multiple R-squared:  0.8411, Adjusted R-squared:  0.8398 
## F-statistic: 684.5 on 3 and 388 DF,  p-value: < 2.2e-16

summary(lmfit_trans3)

## 
## Call:
## lm(formula = mpg ~ origin + year + sqrt(weight), data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9430 -2.0520 -0.0312  1.7145 13.1901 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.90095    4.23678   0.213 0.831713    
## origin        0.92951    0.25244   3.682 0.000264 ***
## year          0.76605    0.04635  16.528  < 2e-16 ***
## sqrt(weight) -0.68713    0.02734 -25.131  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.222 on 388 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8296 
## F-statistic: 635.6 on 3 and 388 DF,  p-value: < 2.2e-16

summary(lmfit_trans4)

## 
## Call:
## lm(formula = mpg ~ origin + year + I(weight^2), data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8810 -2.2688 -0.0881  1.9049 13.3968 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.824e+01  4.189e+00  -6.743 5.66e-11 ***
## origin       1.661e+00  2.739e-01   6.064 3.15e-09 ***
## year         7.531e-01  5.282e-02  14.259  < 2e-16 ***
## I(weight^2) -8.503e-07  4.191e-08 -20.288  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.638 on 388 degrees of freedom
## Multiple R-squared:  0.7844, Adjusted R-squared:  0.7827 
## F-statistic: 470.5 on 3 and 388 DF,  p-value: < 2.2e-16

Of the four models generated, the model which includes the log transformation of weight has the highest \(R^{2}\) value (0.8411). Alternatively, the model which includes the squared transformation of weight has the lowest \(R^{2}\) value (0.7844). For each transformation performed, the models and the individual predictors remain significant but the percent of variability explained is slightly different.
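The comparison above can also be done programmatically instead of reading each summary. The sketch below fits the same four kinds of transformation on the built-in mtcars data (a stand-in with the assumed predictor wt), so its values will differ from the Auto results reported above.

```r
# Named list of formulas: untransformed, log, square root, squared
formulas <- list(
  linear = mpg ~ wt,
  log    = mpg ~ log(wt),
  sqrt   = mpg ~ sqrt(wt),
  square = mpg ~ I(wt^2)
)

# Fit each model and extract its R-squared
r2 <- sapply(formulas, function(f) summary(lm(f, data = mtcars))$r.squared)

# Name of the transformation with the highest R-squared
best <- names(which.max(r2))
```

Because all four models share the same response, their \(R^{2}\) values can be compared directly.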

Problem 10 - Part A

library(ISLR)  # the Carseats data set ships with the ISLR package
carseats <- Carseats

In this chunk, I am generating a multiple regression model to predict Sales using the Price, Urban and US variables.

lmfit2 <- lm(Sales~ Price+Urban+US, carseats)
summary(lmfit2)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Problem 10 - Part B

The interpretations of each coefficient in the model are as follows:

Price - There is evidence to suggest that this coefficient is statistically significant since the corresponding p-value is small (<2e-16). With the other predictors fixed, for every one unit increase in “Price,” “Sales” decrease on average by 0.054459 thousand units, or about 54.459 units (Sales is recorded in thousands of units).
UrbanYes - There is no evidence to suggest that this coefficient is statistically significant since the corresponding p-value is quite large (0.936). With the other predictors fixed, the estimate implies that sales are about 21.916 units lower in urban locations than in rural locations, but this difference cannot be distinguished from zero.
USYes - There is evidence to suggest that this coefficient is statistically significant since the corresponding p-value is small (4.86e-06). With the other predictors fixed, on average sales are about 1200.573 units higher in US stores than in stores outside of the US.

Problem 10 - Part C

\(Sales = 13.043469 - 0.054459 \times Price - 0.021916 \times Urban + 1.200573 \times US + \epsilon\)

Here Urban and US are indicator variables equal to 1 for “Yes” and 0 for “No.”
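To make the fitted equation concrete, the sketch below evaluates it for a single store, treating the two qualitative predictors as 0/1 indicators. The helper predict_sales and the example store (Price of 100, urban, in the US) are assumptions chosen for illustration. The conversion to units relies on Sales being recorded in thousands of units.

```r
# Fitted equation from Part A, with Urban and US coded 1 for "Yes", 0 for "No"
predict_sales <- function(price, urban, us) {
  13.043469 - 0.054459 * price -
    0.021916 * (urban == "Yes") +
    1.200573 * (us == "Yes")
}

# Predicted Sales (in thousands of units) for a hypothetical store
sales_example <- predict_sales(price = 100, urban = "Yes", us = "Yes")

# Price coefficient expressed in units rather than thousands of units
units_per_price_step <- -0.054459 * 1000
```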

Problem 10 - Part D

The null hypothesis \(H_{0}: \beta_{j} = 0\) can be rejected for Price and US since the corresponding p-values are small (much less than 0.05).

Problem 10 - Part E

In this chunk, I generated a multiple linear regression model with the insignificant predictor from Part A (Urban) removed.

lmfit3 <- lm(Sales~ Price+US, carseats)
summary(lmfit3)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Problem 10 - Part F

The models in Parts A and E have identical \(R^{2}\) values (to four decimal places), which suggest that about 23.93% of the variability in Sales can be explained by either model. However, the model in Part E is slightly better given that an insignificant predictor is removed and its residual standard error (RSE) value (2.469) is less than the RSE value of the Part A model (2.472). Additionally, the adjusted \(R^{2}\) value in Part E (0.2354) is slightly higher than the value in Part A (0.2335).
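The reason adjusted \(R^{2}\) rises when Urban is dropped, even though \(R^{2}\) is unchanged, is its penalty for the number of predictors. The sketch below applies the standard formula to the rounded values printed in the two summaries. The helper adj_r2 is a hypothetical name, and small rounding differences from the reported values are expected.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 400  # 396 residual df + 3 predictors + intercept

adj_a <- adj_r2(0.2393, n, p = 3)  # Part A: Price, Urban, US
adj_e <- adj_r2(0.2393, n, p = 2)  # Part E: Price, US
```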

Problem 10 - Part G

In this chunk, I generated 95% confidence intervals for the coefficients in the problem E model.

confint(lmfit3)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
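These intervals follow the usual form, estimate plus or minus a t critical value times the standard error. The sketch below rebuilds them from the rounded estimates and standard errors printed in the Part E summary, with 397 residual degrees of freedom, so it should match the confint() output only up to rounding.

```r
# Rounded estimates and standard errors from summary(lmfit3)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)

# Critical value for a 95% interval with 397 residual degrees of freedom
t_crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
```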

Problem 10 - Part H

The Residuals vs Leverage plot below suggests that there are some outliers in the data since several points are shown to be below -2 and above 2 on the Y axis. The plot also suggests that there are high leverage points, in particular the point which exceeds 0.04 on the X axis scale.

par(mfrow = c(2,2))
plot(lmfit3)

Problem 12 - Part A

The coefficient estimate for the regression of X onto Y (without an intercept) is the same as the coefficient estimate for the regression of Y onto X when \(\sum x_{i}^{2}\) is equal to \(\sum y_{i}^{2}\).
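This condition follows from the closed form of the no-intercept slope: regressing y onto x gives \(\sum x_{i} y_{i} / \sum x_{i}^{2}\), while regressing x onto y gives \(\sum x_{i} y_{i} / \sum y_{i}^{2}\). The sketch below checks this on simulated data in which y is a permutation of x, so the two sums of squares agree by construction. The variable names and the seed are arbitrary choices for the illustration.

```r
set.seed(1)                 # arbitrary seed for reproducibility
x_demo <- rnorm(50)
y_demo <- sample(x_demo)    # a permutation of x has the same sum of squares

# Closed-form no-intercept slopes
beta_yx <- sum(x_demo * y_demo) / sum(x_demo^2)  # y onto x
beta_xy <- sum(x_demo * y_demo) / sum(y_demo^2)  # x onto y

# The same slopes as estimated by lm() without an intercept
coef_yx <- unname(coef(lm(y_demo ~ x_demo + 0)))
coef_xy <- unname(coef(lm(x_demo ~ y_demo + 0)))
```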

Problem 12 - Part B

set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100)

lmfitYX <- lm(y~x+0)
summary(lmfitYX)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9815 -0.5947 -0.0741  0.4498  2.7669 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.0245     0.0876    11.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9081 on 99 degrees of freedom
## Multiple R-squared:  0.5801, Adjusted R-squared:  0.5759 
## F-statistic: 136.8 on 1 and 99 DF,  p-value: < 2.2e-16

lmfitXY <- lm(x~y+0)
summary(lmfitXY)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68218 -0.35933  0.06067  0.55728  1.33612 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.56625    0.04842    11.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6751 on 99 degrees of freedom
## Multiple R-squared:  0.5801, Adjusted R-squared:  0.5759 
## F-statistic: 136.8 on 1 and 99 DF,  p-value: < 2.2e-16

Problem 12 - Part C

set.seed(42)
x2 <- rnorm(100)
y2 <- abs(x2)

# Fit both no-intercept regressions on the new pair (x2, y2)
lmfitYX2 <- lm(y2~x2+0)
lmfitXY2 <- lm(x2~y2+0)

summary(lmfitYX2)

summary(lmfitXY2)

Because y2 is the absolute value of x2, \(\sum x_{i}^{2}\) equals \(\sum y_{i}^{2}\), so the two regressions share the same coefficient estimate.

Ryan Farias

6/14/2020