library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
data(Auto)
Auto <- read.csv("Auto.csv", stringsAsFactors = TRUE, na.strings = "?")
Auto <- na.omit(Auto)  # drop the 5 rows with missing horsepower, matching data(Auto)
Carefully explain the differences between the KNN classifier and KNN regression methods. The KNN classifier handles a qualitative response: it predicts the class of a test observation as the most common class among its K nearest training observations, assigning each observation to a specific group. KNN regression handles a quantitative response: it fits the data non-parametrically, predicting the value of a test observation as the average of the responses of its K nearest training observations.
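A minimal sketch of the contrast, assuming the class and FNN packages are installed (the simulated data here is purely illustrative):
library(class) # provides knn() for classification
library(FNN)   # provides knn.reg() for regression
set.seed(1)
train.X <- matrix(rnorm(100), ncol = 1)
test.X <- matrix(rnorm(20), ncol = 1)
# Classification: the predicted label is the majority class among the k neighbors
cl <- factor(ifelse(train.X > 0, "pos", "neg"))
class::knn(train.X, test.X, cl, k = 3)
# Regression: the predicted value is the average response among the k neighbors
y <- as.vector(2 * train.X) + rnorm(100, sd = 0.1)
FNN::knn.reg(train.X, test.X, y, k = 3)$pred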
This question involves the use of multiple linear regression on the Auto data set.
A)Produce a scatterplot matrix which includes all of the variables in the data set.
library("ISLR")
pairs(Auto)
B)Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[, names(Auto) !="name"])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
C)Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
model = lm(mpg ~. -name, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response? Yes: the F-statistic of 252.4 has a p-value below 2.2e-16, so at least one predictor is related to mpg, although not every individual predictor has a statistically significant effect. The R-squared value tells us that about 82% of the variance in the response is explained by the predictors.
Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin (each with a p-value below 0.01).
What does the coefficient for the year variable suggest? Holding all the other predictors constant, mpg increases with each passing model year, by about 0.75 mpg per year.
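A quick check of that estimate and its 95% confidence interval (a sketch using the fit above):
coef(model)["year"]    # point estimate of the year effect
confint(model, "year") # 95% confidence interval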
D)Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(model)
The first plot (residuals vs. fitted) shows a non-linear relationship between the response and the predictors; the second (normal Q-Q) shows that the residuals are roughly normally distributed but right-skewed; the third (scale-location) shows that the constant-variance-of-errors assumption does not hold for this model; the fourth (residuals vs. leverage) shows no clearly influential points, although one observation stands out as a potential high-leverage point (labeled 14 on the plot).
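One way to confirm that potential high-leverage observation numerically (a sketch; hatvalues() is base R):
lev <- hatvalues(model)
which.max(lev)        # index of the highest-leverage observation
max(lev)              # its leverage, to compare against the average (p + 1)/n
(7 + 1) / length(lev) # average leverage with p = 7 predictors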
E)Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
model = lm(mpg ~.-name+displacement:weight, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
model = lm(mpg ~.-name+displacement:cylinders+displacement:weight+acceleration:horsepower, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight +
## acceleration:horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3344 -1.6333 0.0188 1.4740 11.9723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.725e+01 5.328e+00 -3.237 0.00131 **
## cylinders 6.354e-01 6.106e-01 1.041 0.29870
## displacement -6.805e-02 1.337e-02 -5.088 5.68e-07 ***
## horsepower 6.026e-02 2.601e-02 2.317 0.02105 *
## weight -8.864e-03 1.097e-03 -8.084 8.43e-15 ***
## acceleration 6.257e-01 1.592e-01 3.931 0.00010 ***
## year 7.845e-01 4.470e-02 17.549 < 2e-16 ***
## origin 4.668e-01 2.595e-01 1.799 0.07284 .
## cylinders:displacement -1.337e-03 2.726e-03 -0.490 0.62415
## displacement:weight 2.071e-05 3.638e-06 5.694 2.49e-08 ***
## horsepower:acceleration -7.467e-03 1.784e-03 -4.185 3.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.905 on 381 degrees of freedom
## Multiple R-squared: 0.865, Adjusted R-squared: 0.8615
## F-statistic: 244.2 on 10 and 381 DF, p-value: < 2.2e-16
model = lm(mpg ~.-name+displacement:cylinders+displacement:weight+year:origin+acceleration:horsepower, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight +
## year:origin + acceleration:horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6504 -1.6476 0.0381 1.4254 12.7893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.287e+00 9.074e+00 0.583 0.560429
## cylinders 4.249e-01 6.079e-01 0.699 0.485011
## displacement -7.322e-02 1.334e-02 -5.490 7.38e-08 ***
## horsepower 5.252e-02 2.586e-02 2.031 0.042913 *
## weight -8.689e-03 1.086e-03 -7.998 1.54e-14 ***
## acceleration 5.796e-01 1.582e-01 3.665 0.000283 ***
## year 5.116e-01 9.976e-02 5.129 4.66e-07 ***
## origin -1.220e+01 4.161e+00 -2.933 0.003560 **
## cylinders:displacement -4.368e-04 2.712e-03 -0.161 0.872156
## displacement:weight 1.992e-05 3.608e-06 5.522 6.21e-08 ***
## year:origin 1.630e-01 5.341e-02 3.051 0.002440 **
## horsepower:acceleration -6.735e-03 1.781e-03 -3.781 0.000181 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.874 on 380 degrees of freedom
## Multiple R-squared: 0.8683, Adjusted R-squared: 0.8644
## F-statistic: 227.7 on 11 and 380 DF, p-value: < 2.2e-16
model = lm(mpg ~.-name-cylinders-acceleration+year:origin+displacement:weight+
             acceleration:horsepower+acceleration:weight, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name - cylinders - acceleration + year:origin +
##     displacement:weight + acceleration:horsepower + acceleration:weight,
##     data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5074 -1.6324 0.0599 1.4577 12.7376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.868e+01 7.796e+00 2.396 0.017051 *
## displacement -7.794e-02 9.026e-03 -8.636 < 2e-16 ***
## horsepower 8.719e-02 3.167e-02 2.753 0.006183 **
## weight -1.350e-02 1.287e-03 -10.490 < 2e-16 ***
## year 4.911e-01 9.825e-02 4.998 8.83e-07 ***
## origin -1.262e+01 4.109e+00 -3.071 0.002288 **
## year:origin 1.686e-01 5.277e-02 3.195 0.001516 **
## displacement:weight 2.253e-05 2.184e-06 10.312 < 2e-16 ***
## horsepower:acceleration -9.164e-03 2.222e-03 -4.125 4.56e-05 ***
## weight:acceleration 2.784e-04 7.087e-05 3.929 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.861 on 382 degrees of freedom
## Multiple R-squared: 0.8687, Adjusted R-squared: 0.8656
## F-statistic: 280.8 on 9 and 382 DF, p-value: < 2.2e-16
After reviewing all of the models tested, the last one is the only model in which every term is statistically significant. Its R-squared estimates that about 87% of the variance in the response is explained by this set of predictors (main effects and interactions); no higher value was obtained in these trials.
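A compact way to put the candidate fits side by side (a sketch that refits the first and last interaction models under new names, since model was overwritten at each step):
fit.first <- lm(mpg ~ . - name + displacement:weight, data = Auto)
fit.last <- lm(mpg ~ . - name - cylinders - acceleration + year:origin +
                 displacement:weight + acceleration:horsepower +
                 acceleration:weight, data = Auto)
# Adjusted R-squared penalizes extra terms, so it is the fairer comparison
c(first = summary(fit.first)$adj.r.squared,
  last = summary(fit.last)$adj.r.squared)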
This question should be answered using the Carseats data set.
A)Fit a multiple regression model to predict Sales using Price, Urban, and US.
library("ISLR")
?Carseats
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
lm.fit = lm(Sales ~ Price+Urban+US, data= Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
B)Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative! Sales are recorded in thousands of units, so for each $1 increase in Price, sales decrease by about 0.0545 thousand, i.e. roughly 54 car seats, holding the other predictors fixed. The Urban coefficient is not statistically significant, so whether or not the store is in an urban location has no measurable effect on sales. A store in the US sells about 1.2 thousand (roughly 1,200) more car seats on average than a comparable store abroad.
C)Write out the model in equation form, being careful to handle the qualitative variables properly. Sales = 13.0434689 − 0.0544588 × Price − 0.0219162 × Urban + 1.2005727 × US + ε, with Urban = 1 if the location is urban and 0 if not, and US = 1 if the store is in the US and 0 if not.
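A quick check that this equation reproduces the fitted values (a sketch using the first row of Carseats):
b <- coef(lm.fit)
obs <- Carseats[1, ]
b["(Intercept)"] + b["Price"] * obs$Price +
  b["UrbanYes"] * (obs$Urban == "Yes") + b["USYes"] * (obs$US == "Yes")
predict(lm.fit, obs) # should agree with the hand computation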
D)For which of the predictors can you reject the null hypothesis H0 : βj = 0? Price and US, whose p-values are far below 0.05. We cannot reject the null hypothesis for Urban, whose p-value of 0.936 is not statistically significant.
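The p-values behind that answer can be pulled directly from the summary table (a sketch):
summary(lm.fit)$coefficients[, "Pr(>|t|)"]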
E)On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit2 = lm(Sales ~ Price+US, data= Carseats)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
F)How well do the models in (a) and (e) fit the data? Comparing their respective R-squared values in the summary tables, both models fit the data poorly: each explains only about 24% of the variance in Sales.
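Because the model in (e) is nested inside the model in (a), the two can also be compared with a direct F-test (a sketch):
anova(lm.fit2, lm.fit) # tests whether adding Urban improves the fit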
G)Using the model from (e), obtain 95 % confidence intervals for the coefficient(s)
confint(lm.fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
H)Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(lm.fit2)
The normal Q-Q plot and the residuals vs. leverage plot show no evidence of such points.
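Those visual impressions can be double-checked numerically (a sketch; a studentized residual beyond ±3 would flag an outlier, and leverage far above the average (p + 1)/n would flag a high-leverage point):
sum(abs(rstudent(lm.fit2)) > 3) # count of large studentized residuals
max(hatvalues(lm.fit2))         # largest leverage
(2 + 1) / nrow(Carseats)        # average leverage with p = 2 predictors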
This problem involves simple linear regression without an intercept.
A)Recall that the coefficient estimate $\hat{\beta}$ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X? The coefficient estimate for the regression of Y onto X is $\hat{\beta} = \sum_i x_i y_i \big/ \sum_j x_j^2$; the coefficient estimate for the regression of X onto Y is $\hat{\beta}' = \sum_i x_i y_i \big/ \sum_j y_j^2$. The two estimates are the same if and only if $\sum_j x_j^2 = \sum_j y_j^2$.
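A generic numeric check of this condition, ahead of the specific examples below (a sketch: any y rescaled so that its sum of squares matches that of x will do):
set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
y <- y * sqrt(sum(x^2) / sum(y^2))          # force sum(y^2) == sum(x^2)
c(coef(lm(y ~ x + 0)), coef(lm(x ~ y + 0))) # the two slopes now agree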
B)Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.1)
sum(y^2)
## [1] 1353606
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223590 -0.062560 0.004426 0.058507 0.230926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0001514 0.0001548 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115418 -0.029231 -0.002186 0.031322 0.111795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.87e-05 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
C)Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08