options(repos = list(CRAN="https://cran.rstudio.com/"))
The KNN classifier predicts the class an observation belongs to by taking a majority vote among the K training observations nearest to it, whereas KNN regression predicts a quantitative value by averaging the responses of the K training observations closest to the prediction point.
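As an illustration of the difference (not part of the assignment), here is a minimal sketch on made-up data; it assumes the class and FNN packages, which provide knn() and knn.reg() respectively.

library(class) # knn(): classification by majority vote
library(FNN)   # knn.reg(): regression by neighbor averaging

set.seed(1)
train.x  <- matrix(rnorm(100), ncol = 2)                    # 50 toy training points
test.x   <- matrix(rnorm(10), ncol = 2)                     # 5 toy test points
train.cl <- factor(sample(c("A", "B"), 50, replace = TRUE)) # toy class labels
train.y  <- rnorm(50)                                       # toy numeric responses

knn(train.x, test.x, cl = train.cl, k = 5)        # majority vote among 5 nearest neighbors
knn.reg(train.x, test.x, y = train.y, k = 5)$pred # average of the 5 nearest responses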
install.packages("ISLR")
library(ISLR)
plot(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
i. Is there a relationship between the predictors and the response? Yes. The F-statistic of 252.4 (p-value < 2.2e-16) indicates that at least one of the predictors is related to mpg, and four of the predictors are individually significant.
ii. Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin all have p-values less than .05, which means they have a statistically significant relationship with mpg.
iii. What does the coefficient for the year variable suggest? There is a statistically significant positive relationship between mpg and the year of the vehicle: holding the other predictors fixed, mpg increases by about 0.75 per model year, with a standard error of about 0.05.
lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
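To see the year coefficient in action, one can compare the model's predictions for two hypothetical cars that differ only in model year; the predictor values below are made up for illustration and are not rows of Auto.

newcars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77),
                      origin = 1)
diff(predict(lm.fit, newcars)) # about 0.75 mpg, the year coefficient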
The residuals vs. fitted plot shows the residuals scattered fairly evenly around zero, with some outliers at fitted values between 30 and 35. The normal Q-Q plot shows that the residuals are approximately normally distributed, except at the tails of the line. The scale-location plot shows a red line that runs mostly horizontally across the plot, which means the assumption of equal variance is likely met. The residuals vs. leverage plot shows no observations falling outside Cook's distance (the dashed red line), which means there are no high-leverage points that unduly influence the regression model.
par(mfrow = c(2,2))
plot(lm.fit)
The interactions that are statistically significant are horsepower:weight and acceleration:year, both with p-values less than .05.
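A broader screen that fits every pairwise interaction at once is also possible before settling on the model below; a sketch (output omitted):

summary(lm(mpg ~ (. - name)^2, data = Auto)) # all pairwise interactions among the predictors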
lm.fit2 <- lm(mpg ~ cylinders * displacement + horsepower * weight + acceleration * year + cylinders * origin, data = Auto)
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + horsepower * weight +
## acceleration * year + cylinders * origin, data = Auto)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6734 -1.5452 -0.0555  1.2971 11.3844 
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.145e+02  1.850e+01   6.188 1.58e-09 ***
## cylinders              -1.088e+00  7.457e-01  -1.459  0.14534    
## displacement           -2.203e-02  1.574e-02  -1.400  0.16232    
## horsepower             -2.266e-01  2.590e-02  -8.749  < 2e-16 ***
## weight                 -9.916e-03  9.050e-04 -10.957  < 2e-16 ***
## acceleration           -6.815e+00  1.149e+00  -5.929 6.84e-09 ***
## year                   -6.300e-01  2.395e-01  -2.630  0.00888 ** 
## origin                 -1.670e+00  1.293e+00  -1.292  0.19713    
## cylinders:displacement  3.223e-03  2.310e-03   1.395  0.16381    
## horsepower:weight       4.927e-05  6.771e-06   7.276 1.98e-12 ***
## acceleration:year       8.773e-02  1.490e-02   5.888 8.62e-09 ***
## cylinders:origin        5.450e-01  2.991e-01   1.822  0.06917 .  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.79 on 380 degrees of freedom
## Multiple R-squared: 0.8759, Adjusted R-squared: 0.8723
## F-statistic: 243.7 on 11 and 380 DF, p-value: < 2.2e-16
Based on the residuals vs. leverage plot, there are no high-leverage points that unduly affect the regression model. The residuals vs. fitted plot shows the residuals scattered in a roughly even band around zero.
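The diagnostics described above come from applying the standard plot method to the interaction model:

par(mfrow = c(2,2))
plot(lm.fit2)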
lm.fit3 <- lm(mpg ~ horsepower + I(horsepower^2), data=Auto)
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lm.fit3)
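Whether the quadratic term earns its keep can also be checked with a partial F-test against the purely linear fit; a sketch:

lm.lin <- lm(mpg ~ horsepower, data = Auto)
anova(lm.lin, lm.fit3) # F-test for the added I(horsepower^2) term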
lm.fit4 <- lm(Sales ~ Price + Urban + US, data=Carseats)
summary(lm.fit4)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
There is a significant negative relationship between Price and Sales. There is no significant relationship between UrbanYes and Sales, which indicates that sales are not affected by whether the store is in an urban location. Lastly, there is a significant positive relationship between USYes and Sales: stores in the US have higher sales.
10c. Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.0434689 - 0.0544588 × Price - 0.0219162 × Urban + 1.2005727 × US + ε, where Urban = 1 if the store is in an urban location and 0 if not, and US = 1 if the store is in the US and 0 if not.
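R's dummy coding of the qualitative variables can be verified directly from the design matrix; a sketch:

head(model.matrix(lm.fit4)) # shows the 0/1 indicator columns UrbanYes and USYes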
10d. For which of the predictors can you reject the null hypothesis H0 : βj = 0?
The null hypothesis can be rejected for Price and USYes, both of which have p-values far below .05.
10e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
smaller.fit <- lm(Sales ~ Price + US, data = Carseats)
summary(smaller.fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
10f. How well do the models in (a) and (e) fit the data?
Both models fit the data to about the same, fairly modest degree: R² is 0.2393 for each, so each explains roughly 24% of the variance in Sales. Dropping Urban leaves R² essentially unchanged and slightly increases the adjusted R² (from 0.2335 to 0.2354), indicating that Urban provides no real improvement to the model fit.
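The two models can also be compared with a formal partial F-test; a sketch:

anova(smaller.fit, lm.fit4) # tests whether Urban adds any explanatory value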
10g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(smaller.fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
10h. Is there evidence of outliers or high leverage observations in the model from (e)?
There is no evidence of any major outliers or high leverage points.
par(mfrow = c(2,2))
plot(smaller.fit)
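Beyond the plots, the same questions can be checked numerically; a sketch, using the common rules of thumb that studentized residuals beyond about ±3 flag outliers and leverage above twice the average (p + 1)/n = 3/400 flags high-leverage points:

sum(abs(rstudent(smaller.fit)) > 3)       # count of outlier candidates
sum(hatvalues(smaller.fit) > 2 * 3 / 400) # count of high-leverage candidates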
## Problem 12
12a. This problem involves simple linear regression without an intercept. Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), the coefficient estimate for the regression of Y onto X is β̂ = (Σ xᵢyᵢ)/(Σ xᵢ²), while for X onto Y it is (Σ xᵢyᵢ)/(Σ yᵢ²). The two estimates are therefore the same exactly when Σ xᵢ² = Σ yᵢ² (for example, when Y = X).
12b. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(15)
x1 <- rnorm(100)
y1 <- 2 * x1 + rnorm(100)
yinx <- lm(y1 ~ x1 + 0)
xiny <- lm(x1 ~ y1 + 0)
summary(yinx)
##
## Call:
## lm(formula = y1 ~ x1 + 0)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58356 -0.68052  0.02862  0.61750  2.32039 
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x1   2.0291     0.1111   18.26   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.105 on 99 degrees of freedom
## Multiple R-squared: 0.7711, Adjusted R-squared: 0.7687
## F-statistic: 333.4 on 1 and 99 DF, p-value: < 2.2e-16
summary(xiny)
##
## Call:
## lm(formula = x1 ~ y1 + 0)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8082 -0.2606  0.0386  0.3557  1.5124 
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## y1  0.38000    0.02081   18.26   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4781 on 99 degrees of freedom
## Multiple R-squared: 0.7711, Adjusted R-squared: 0.7687
## F-statistic: 333.4 on 1 and 99 DF, p-value: < 2.2e-16
12c. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(15)
x2 <- rnorm(100)
y2 <- rnorm(100)
yinx2 <- lm(y2 ~ x2 + 0)
xiny2 <- lm(x2 ~ y2 + 0)
summary(yinx2)
##
## Call:
## lm(formula = y2 ~ x2 + 0)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58356 -0.68052  0.02862  0.61750  2.32039 
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## x2  0.02909    0.11112   0.262    0.794
##
## Residual standard error: 1.105 on 99 degrees of freedom
## Multiple R-squared: 0.0006919, Adjusted R-squared: -0.009402
## F-statistic: 0.06854 on 1 and 99 DF, p-value: 0.794
summary(xiny2)
##
## Call:
## lm(formula = x2 ~ y2 + 0)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.42014 -0.53089  0.02137  0.86571  2.54250 
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## y2  0.02378    0.09084   0.262    0.794
##
## Residual standard error: 0.9989 on 99 degrees of freedom
## Multiple R-squared: 0.0006919, Adjusted R-squared: -0.009402
## F-statistic: 0.06854 on 1 and 99 DF, p-value: 0.794
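Note that with two independent draws, sum(x2^2) and sum(y2^2) are almost never exactly equal, which is why the two estimates above (0.02909 vs. 0.02378) still differ slightly. A construction that makes them identical is to let y be a permutation of x, so the sum of squares is preserved exactly; a sketch:

set.seed(15)
x3 <- rnorm(100)
y3 <- sample(x3, 100)           # a permutation of x3: same values, reshuffled
all.equal(sum(x3^2), sum(y3^2)) # sums of squares agree by construction
coef(lm(y3 ~ x3 + 0))           # the two coefficient estimates now match
coef(lm(x3 ~ y3 + 0))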