install.packages("ISLR", repos = "https://www.statlearning.com")
## Installing package into 'C:/Users/danny/OneDrive/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
library(ISLR)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

Problem 2.

Carefully explain the differences between the KNN classifier and KNN regression methods.

The difference lies in the type of response being predicted. In KNN classification the goal is to assign a point to a class: the method looks at the K training points nearest to it, checks which class those neighbors belong to, and assigns the point to the class with the highest local estimated probability, i.e. the most common class among the neighbors. In KNN regression the goal is to predict a numerical value: the method finds the K nearest training points and uses the average of their response values as the prediction.
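To make the contrast concrete, here is a minimal sketch on simulated data, assuming the class and FNN packages are installed (all variable names here are made up for illustration):

library(class)  # provides knn() for classification
library(FNN)    # provides knn.reg() for regression

set.seed(1)
train.X <- matrix(rnorm(100), ncol = 2)   # 50 training points
test.X  <- matrix(rnorm(20),  ncol = 2)   # 10 test points

# KNN classification: majority vote among the 3 nearest neighbors' labels
train.cl <- factor(ifelse(train.X[, 1] > 0, "A", "B"))
knn(train.X, test.X, train.cl, k = 3)

# KNN regression: average of the 3 nearest neighbors' response values
train.y <- train.X[, 1] + rnorm(50)
knn.reg(train.X, test.X, train.y, k = 3)$pred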

Problem 9.

This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

Here is the scatterplot matrix:

pairs(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Here is the correlation matrix:

cor(Auto[sapply(Auto, is.numeric)])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

ii. Which predictors appear to have a statistically significant relationship to the response?

iii. What does the coefficient for the year variable suggest?

lm.fit <- lm(mpg~cylinders+displacement+horsepower+weight+acceleration+year+origin, data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
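As an aside, the same model can be written more compactly with the . formula shorthand; the call below should be equivalent to the one above:

lm(mpg ~ . - name, data = Auto)  # same fit as lm.fit above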

i. Is there a relationship between the predictors and the response?

Based on this summary of the multiple linear regression, with an F-statistic of 252.4 and a corresponding p-value of < 2.2e-16, at a 5% significance level we reject the null hypothesis that there is no relationship between the predictors and the response. This result suggests that at least one of the predictors has a relationship with the response variable.

ii. Which predictors appear to have a statistically significant relationship to the response?

Based on this summary of the multiple linear regression, we can see that displacement is significant at the .01 level, and weight, year, and origin are significant at the .001 level. The intercept is also significant at the .001 level. Cylinders, horsepower, and acceleration do not appear to have a statistically significant relationship with mpg in this model.

iii. What does the coefficient for the year variable suggest?

We can see from the multiple linear regression summary that the coefficient of the year variable is 0.750773. This suggests that, holding the other predictors fixed, predicted mpg increases by about 0.75 for each one-year increase in model year for cars in this dataset.
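To make this concrete, here is a small sketch predicting mpg for two hypothetical cars that differ only in model year (the other predictor values are invented for illustration):

new.cars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                       weight = 2500, acceleration = 15,
                       year = c(70, 80), origin = 1)
predict(lm.fit, newdata = new.cars)  # predictions differ by 10 * 0.750773, about 7.5 mpg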

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(lm.fit)

Based on the residuals vs. fitted plot above, it appears that there are a few large outliers in the data. Based on the leverage plot, it appears that there are a few observations with unusually high leverage.
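The suspect observations can also be identified numerically. Here is a sketch using two common rules of thumb: studentized residuals beyond +/-3 as potential outliers, and leverage above twice the average leverage as potentially high-leverage points:

which(abs(rstudent(lm.fit)) > 3)  # candidate outliers
hv <- hatvalues(lm.fit)
which(hv > 2 * mean(hv))          # candidate high-leverage observations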

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

lm.fit2 <- lm(mpg~weight:year+horsepower+acceleration, data = Auto)
summary(lm.fit2)
## 
## Call:
## lm(formula = mpg ~ weight:year + horsepower + acceleration, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.754  -3.028  -0.607   2.531  16.785 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.891e+01  2.539e+00  19.266  < 2e-16 ***
## horsepower   -1.091e-01  1.512e-02  -7.215 2.85e-12 ***
## acceleration -2.338e-01  1.298e-01  -1.801   0.0725 .  
## weight:year  -4.632e-05  7.442e-06  -6.225 1.25e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.542 on 388 degrees of freedom
## Multiple R-squared:  0.6639, Adjusted R-squared:  0.6613 
## F-statistic: 255.5 on 3 and 388 DF,  p-value: < 2.2e-16

Based on this summary of the regression model, it appears that there is a significant interaction between the weight of the car in pounds and the model year. (One caveat: this model includes the weight:year interaction without the weight and year main effects, which makes the interaction term harder to interpret.)

lm.fit3 <- lm(mpg~displacement*origin, data = Auto)
summary(lm.fit3)
## 
## Call:
## lm(formula = mpg ~ displacement * origin, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.1742  -2.8223  -0.5893   2.2531  18.8420 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         28.41854    1.53883  18.468  < 2e-16 ***
## displacement        -0.01887    0.01082  -1.745  0.08183 .  
## origin               4.79247    1.13249   4.232  2.9e-05 ***
## displacement:origin -0.03476    0.01010  -3.442  0.00064 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.526 on 388 degrees of freedom
## Multiple R-squared:  0.6664, Adjusted R-squared:  0.6638 
## F-statistic: 258.3 on 3 and 388 DF,  p-value: < 2.2e-16

Based on the summary of this multiple regression model, it appears that there is a significant interaction between the engine displacement and the origin of the car.
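One caveat: origin is coded 1, 2, 3 (American, European, Japanese), so the model above treats a categorical variable as numeric. A sketch of an alternative fit that treats origin as a factor, giving each region its own displacement slope (output omitted):

lm.fit3b <- lm(mpg ~ displacement * factor(origin), data = Auto)
summary(lm.fit3b)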

(f) Try a few different transformations of the variables, such as log(X), √X, and X^2. Comment on your findings.

lm.fit4 <- lm(mpg~log(cylinders+displacement+horsepower+weight+acceleration+year+origin), data = Auto)
summary(lm.fit4)
## 
## Call:
## lm(formula = mpg ~ log(cylinders + displacement + horsepower + 
##     weight + acceleration + year + origin), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6312  -2.6711  -0.4064   1.9455  16.1346 
## 
## Coefficients:
##                                                                                    Estimate
## (Intercept)                                                                        210.9189
## log(cylinders + displacement + horsepower + weight + acceleration + year + origin) -23.1930
##                                                                                    Std. Error
## (Intercept)                                                                            5.9362
## log(cylinders + displacement + horsepower + weight + acceleration + year + origin)     0.7339
##                                                                                    t value
## (Intercept)                                                                          35.53
## log(cylinders + displacement + horsepower + weight + acceleration + year + origin)  -31.60
##                                                                                    Pr(>|t|)
## (Intercept)                                                                          <2e-16
## log(cylinders + displacement + horsepower + weight + acceleration + year + origin)   <2e-16
##                                                                                       
## (Intercept)                                                                        ***
## log(cylinders + displacement + horsepower + weight + acceleration + year + origin) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.142 on 390 degrees of freedom
## Multiple R-squared:  0.7191, Adjusted R-squared:  0.7184 
## F-statistic: 998.6 on 1 and 390 DF,  p-value: < 2.2e-16

Based on this result, it appears that the log transformation made our model worse by decreasing the amount of variance in mpg we can explain with our predictors. Note, though, that this model takes the log of the sum of all the predictors (a single transformed term) rather than log-transforming each predictor individually, so it is not directly comparable to the full model above.
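A sketch of what the exercise more likely intends, log-transforming each continuous predictor separately (leaving cylinders, year, and origin untransformed is an arbitrary choice here; output omitted):

lm.fit4b <- lm(mpg ~ cylinders + log(displacement) + log(horsepower) +
                 log(weight) + log(acceleration) + year + origin, data = Auto)
summary(lm.fit4b)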

lm.fit5 <- lm(mpg~cylinders+displacement+horsepower+weight+acceleration+year+origin+I(cylinders^1/2)+I(displacement^1/2)+ I(horsepower^1/2) + I(weight^1/2) +I(acceleration^1/2) +I(year^1/2) +I(origin^1/2), data = Auto)
summary(lm.fit5)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin + I(cylinders^1/2) + I(displacement^1/2) + 
##     I(horsepower^1/2) + I(weight^1/2) + I(acceleration^1/2) + 
##     I(year^1/2) + I(origin^1/2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients: (7 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -17.218435   4.644294  -3.707  0.00024 ***
## cylinders            -0.493376   0.323282  -1.526  0.12780    
## displacement          0.019896   0.007515   2.647  0.00844 ** 
## horsepower           -0.016951   0.013787  -1.230  0.21963    
## weight               -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration          0.080576   0.098845   0.815  0.41548    
## year                  0.750773   0.050973  14.729  < 2e-16 ***
## origin                1.426141   0.278136   5.127 4.67e-07 ***
## I(cylinders^1/2)            NA         NA      NA       NA    
## I(displacement^1/2)         NA         NA      NA       NA    
## I(horsepower^1/2)           NA         NA      NA       NA    
## I(weight^1/2)               NA         NA      NA       NA    
## I(acceleration^1/2)         NA         NA      NA       NA    
## I(year^1/2)                 NA         NA      NA       NA    
## I(origin^1/2)               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

This result is misleading: in R, I(cylinders^1/2) is parsed as (cylinders^1)/2, i.e. half of cylinders, not its square root. Each of these terms is therefore perfectly collinear with the corresponding untransformed predictor, which is why all seven show up as NA ("not defined because of singularities") and why the R-squared is identical to the original fit; the model is effectively unchanged.
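To fit an actual square-root transformation, the exponent needs parentheses, or sqrt() can be used directly; a corrected sketch (output omitted):

lm.fit5b <- lm(mpg ~ cylinders + displacement + horsepower + weight +
                 acceleration + year + origin +
                 sqrt(cylinders) + sqrt(displacement) + sqrt(horsepower) +
                 sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin),
               data = Auto)
summary(lm.fit5b)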

lm.fit6 <- lm(mpg~displacement+cylinders+horsepower+weight+acceleration+year+origin+I(cylinders^2)+I(displacement^2)+ I(horsepower^2) + I(weight^2) +I(acceleration^2) +I(year^2) +I(origin^2), data = Auto)
summary(lm.fit6)
## 
## Call:
## lm(formula = mpg ~ displacement + cylinders + horsepower + weight + 
##     acceleration + year + origin + I(cylinders^2) + I(displacement^2) + 
##     I(horsepower^2) + I(weight^2) + I(acceleration^2) + I(year^2) + 
##     I(origin^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6457 -1.5810  0.0953  1.3132 12.2519 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.048e+02  6.920e+01   5.850 1.07e-08 ***
## displacement      -2.559e-02  2.250e-02  -1.137  0.25615    
## cylinders          9.603e-01  1.413e+00   0.679  0.49728    
## horsepower        -1.545e-01  4.153e-02  -3.719  0.00023 ***
## weight            -1.322e-02  2.681e-03  -4.929 1.24e-06 ***
## acceleration      -1.677e+00  5.552e-01  -3.021  0.00269 ** 
## year              -9.562e+00  1.840e+00  -5.196 3.34e-07 ***
## origin             2.534e+00  1.822e+00   1.391  0.16506    
## I(cylinders^2)    -4.655e-02  1.142e-01  -0.407  0.68392    
## I(displacement^2)  3.714e-05  3.882e-05   0.957  0.33933    
## I(horsepower^2)    3.448e-04  1.414e-04   2.438  0.01522 *  
## I(weight^2)        1.523e-06  3.643e-07   4.179 3.64e-05 ***
## I(acceleration^2)  4.519e-02  1.640e-02   2.756  0.00614 ** 
## I(year^2)          6.801e-02  1.209e-02   5.626 3.59e-08 ***
## I(origin^2)       -4.762e-01  4.446e-01  -1.071  0.28480    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.756 on 377 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.8753 
## F-statistic:   197 on 14 and 377 DF,  p-value: < 2.2e-16

Based on this result, it appears that adding squared (X^2) terms improves the model by increasing the amount of variance in mpg we can explain with our set of predictors. The significant quadratic terms for horsepower, weight, acceleration, and year suggest non-linear relationships with mpg.
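Quadratic terms can also be written with poly(); for example, a sketch fitting a quadratic in horsepower alone (output omitted):

lm.fit6b <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(lm.fit6b)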

Problem 10.

This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm.Carseats <- lm(Sales~Price+Urban+US, data = Carseats)
summary(lm.Carseats)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Based on the summary of the multiple regression model, we can see that the intercept is 13.043: a store with Price = 0 that is neither urban nor in the US would be predicted to sell approximately 13,043 car seats (Sales is recorded in thousands of units). The coefficient of Price is -0.0545, meaning that predicted sales decrease by approximately 54.5 units for every one-unit increase in Price, holding the other variables fixed. The UrbanYes coefficient is -0.0219, meaning predicted sales are approximately 21.9 units lower if the store is in an urban area as opposed to a rural one (though this effect is not statistically significant). The coefficient of USYes is 1.200, meaning predicted sales are approximately 1,200 units higher if the store is located in the US as opposed to outside of it.
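The dummy coding R uses for the qualitative variables can be checked directly with contrasts(), which confirms that "Yes" is coded as 1 and "No" is the baseline:

contrasts(Carseats$Urban)
contrasts(Carseats$US)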

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

Here is the equation for the model:

y = 13.043469 - 0.054459x1 - 0.021916x2 + 1.200573x3

where y represents unit sales in thousands, x1 represents the price the company charges for car seats at each site, x2 is an indicator variable equal to 1 if the store is in an urban area and 0 otherwise, and x3 is an indicator variable equal to 1 if the store is in the US and 0 otherwise.
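As a sanity check, the equation can be evaluated by hand and compared against predict(); the Price value here is made up for illustration, and the two results should agree up to rounding of the printed coefficients:

# hypothetical urban US store charging 120 for car seats
13.043469 - 0.054459 * 120 - 0.021916 * 1 + 1.200573 * 1
predict(lm.Carseats, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))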

(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Based on the summary table above, we can see that the t-value of Price is -10.389, with a corresponding p-value of < 2e-16. We can also see that the t-value of USYes is 4.635, with a corresponding p-value of 4.86e-06. In both cases, at a 5% level of significance we can reject the null hypothesis H0 : βj = 0. For UrbanYes (p-value 0.936) we cannot reject the null hypothesis. This result suggests that Price and US, but not Urban, have a significant relationship with Sales.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

Here is the model using only Price and USYes as predictors:

lm.Carseats2 <- lm(Sales~Price+US, data = Carseats)
summary(lm.Carseats2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

We can see from the summary tables above that the adjusted R-squared for the first model, which includes the Urban variable as a predictor, is .2335. The adjusted R-squared for the second model, with only Price and US as predictors, is .2354. Neither the first nor the second model fits the data well: each explains only about 24% of the variance in Sales, although the smaller model does marginally better on adjusted R-squared.
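A formal way to compare the two nested models is a partial F-test with anova(); since Urban was far from significant, dropping it should not significantly worsen the fit:

anova(lm.Carseats2, lm.Carseats)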

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

Here are the 95% confidence intervals:

confint(lm.Carseats2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

From this result we can see that the 95% confidence interval of the intercept is 11.790 to 14.271, the 95% confidence interval for the Price is -0.0648 to -0.0442, and the 95% confidence interval for USYes is 0.692 to 1.708.

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow=c(2,2))
plot(lm.Carseats2)

Based on these plots, there appear to be only a few outliers and high-leverage values in the data.
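Here is a sketch that counts such observations numerically, using the same rough rules as in Problem 9 (studentized residuals beyond +/-3, leverage above twice the average):

sum(abs(rstudent(lm.Carseats2)) > 3)  # candidate outliers
hv2 <- hatvalues(lm.Carseats2)
sum(hv2 > 2 * mean(hv2))              # candidate high-leverage observations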

Problem 12.

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From equation 3.38, the coefficient estimate for the regression of Y onto X without an intercept is βˆ = (Σ x_i y_i) / (Σ x_i^2), while for the regression of X onto Y it is (Σ x_i y_i) / (Σ y_i^2). The numerators are identical, so the two coefficient estimates are the same exactly when the denominators are equal, that is, when Σ x_i^2 = Σ y_i^2.

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

X <- rnorm(100)
Y <- X + rnorm(100)  # added noise makes the sum of squares of Y differ from that of X

coef(lm(Y~X))
## (Intercept)           X 
## -0.01169708  0.85540419
coef(lm(X~Y))
## (Intercept)           Y 
##  0.06545589  0.52088379

Here we see that the coefficient estimates for Y~X and X~Y are different.

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

X <- rnorm(100)
Y <- X  # Y is identical to X, so sum(X^2) == sum(Y^2) and the estimates match

coef(lm(Y~X))
## (Intercept)           X 
##           0           1
coef(lm(X~Y))
## (Intercept)           Y 
##           0           1

Here we see that the coefficient estimates for Y~X and X~Y are the same.
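A less trivial example that satisfies the condition from part (a): permuting X leaves sum(X^2) unchanged, so the two no-intercept slope estimates agree even though Y is not identical to X (the common slope value will vary with the random draw):

X <- rnorm(100)
Y <- sample(X)  # same values in a different order, so sum(Y^2) == sum(X^2)

coef(lm(Y ~ X + 0))  # regression without an intercept, as in equation 3.38
coef(lm(X ~ Y + 0))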