2. Carefully explain the differences between the KNN classifier and KNN regression methods

KNN classifier: used for classification problems (qualitative response). It finds the K training observations nearest to x_0 and assigns x_0 to the most common class among those neighbors.

KNN regression: used for regression problems (quantitative response). It finds the K training observations nearest to x_0 and estimates f(x_0) as the average of their responses.
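The contrast can be sketched in base R on a toy data set. The data and the helper names (nearest_k, knn_classify, knn_regress) are made up for illustration and are not part of the exercise; the only difference between the two methods is the last step, a majority vote versus an average.

```r
# Toy training data (hypothetical): one predictor, two kinds of response.
train_x     <- c(1, 2, 3, 6, 7, 8)
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)    # quantitative response
train_class <- c("a", "a", "a", "b", "b", "b")    # qualitative response

# Indices of the k training points closest to x0 (absolute distance in 1-D).
nearest_k <- function(x0, x, k) {
  order(abs(x - x0))[seq_len(k)]
}

# KNN classifier: majority vote among the k nearest neighbors
# (ties go to the class that comes first alphabetically).
knn_classify <- function(x0, x, labels, k) {
  votes <- table(labels[nearest_k(x0, x, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k nearest neighbors.
knn_regress <- function(x0, x, y, k) {
  mean(y[nearest_k(x0, x, k)])
}

knn_classify(2.5, train_x, train_class, k = 3)   # a class label
knn_regress(2.5, train_x, train_y, k = 3)        # a number
```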

library(ISLR)
data(Auto)

Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(Auto)

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[1:8])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
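To read off the most strongly correlated predictor pairs, the matrix can be ranked rather than scanned by eye. The sketch below re-enters the correlations printed above for cylinders, displacement, horsepower and weight only (every other pair of predictors in the output has an absolute correlation below 0.7); the names cor_mat, pairs_df and ranked are illustrative.

```r
vars <- c("cylinders", "displacement", "horsepower", "weight")

# Correlations copied from the cor() output above.
cor_mat <- matrix(
  c(1.0000000, 0.9508233, 0.8429834, 0.8975273,
    0.9508233, 1.0000000, 0.8972570, 0.9329944,
    0.8429834, 0.8972570, 1.0000000, 0.8645377,
    0.8975273, 0.9329944, 0.8645377, 1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by looking only at the upper triangle.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  cor  = cor_mat[idx]
)

# Sort by absolute correlation, largest first.
ranked <- pairs_df[order(-abs(pairs_df$cor)), ]
ranked
```

With the full data set loaded, the same ranking could be applied directly to the result of cor() instead of a hand-entered matrix.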

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.

lm.auto <- lm(mpg~. -name,data = Auto)

summary(lm.auto)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Is there a relationship between the predictors and the response?

Yes. The F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero and conclude that there is a relationship between the predictors and the response.
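As a rough check, the overall F-statistic can be recovered from the R-squared and degrees of freedom reported in the summary above. The inputs are the rounded values from that output, so the result only agrees with 252.4 up to rounding; the variable names are illustrative.

```r
r2       <- 0.8215   # Multiple R-squared from the summary
p        <- 7        # number of predictors
df_resid <- 384      # residual degrees of freedom, n - p - 1

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r2 / p) / ((1 - r2) / df_resid)
f_stat

# Upper-tail probability under the F(p, n - p - 1) distribution.
p_value <- pf(f_stat, df1 = p, df2 = df_resid, lower.tail = FALSE)
p_value
```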

Which predictors appear to have a statistically significant relationship to the response?

Displacement, weight, year, and origin.

What does the coefficient for the year variable suggest?

When year increases by one (one model year), with all other variables held constant, mpg increases by about 0.75 on average. In other words, newer cars tend to be more fuel efficient.

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2,2))

plot(lm.auto)

A few identifiable outliers appear on the right side of the plots, indicating slight right skewness in the residuals.

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

The model below uses interactions between the five most highly correlated pairs of predictors from the cor() output.

Lm.auto.interact <- lm(mpg ~ cylinders * displacement + displacement * weight + cylinders * weight + displacement * horsepower + weight * horsepower, data = Auto[, 1:8])

summary(Lm.auto.interact)

## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement * 
##     weight + cylinders * weight + displacement * horsepower + 
##     weight * horsepower, data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2352  -2.1592  -0.3998   1.8286  17.1431 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.860e+01  6.608e+00   8.868  < 2e-16 ***
## cylinders               -8.234e-01  2.074e+00  -0.397  0.69157    
## displacement            -8.021e-02  4.045e-02  -1.983  0.04805 *  
## weight                  -5.139e-03  2.944e-03  -1.746  0.08166 .  
## horsepower              -1.636e-01  6.146e-02  -2.662  0.00809 ** 
## cylinders:displacement  -1.273e-03  5.574e-03  -0.228  0.81947    
## displacement:weight      1.985e-06  1.026e-05   0.193  0.84676    
## cylinders:weight         5.441e-04  8.071e-04   0.674  0.50062    
## displacement:horsepower  5.095e-04  1.732e-04   2.941  0.00347 ** 
## weight:horsepower       -1.225e-05  2.587e-05  -0.474  0.63604    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.873 on 382 degrees of freedom
## Multiple R-squared:  0.7594, Adjusted R-squared:  0.7538 
## F-statistic:   134 on 9 and 382 DF,  p-value: < 2.2e-16

The interaction between displacement and horsepower is statistically significant (p-value 0.00347). None of the other interaction terms is significant at the 0.05 level.

Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

displacement

par(mfrow = c(2,2))
plot(log(Auto$displacement), Auto$mpg)
plot(sqrt(Auto$displacement), Auto$mpg)
plot((Auto$displacement)^2, Auto$mpg)

horsepower

par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)

Plotted against mpg, the log transformations of horsepower and displacement show a roughly linear trend.
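The visual comparison can be backed up with a number by fitting a simple regression for each transformation and comparing R-squared. The sketch below uses simulated stand-in data, not the Auto data: y is linear in log(x) by construction, so it only illustrates the mechanics. The helper r_squared is hypothetical.

```r
# Simulated stand-in data (NOT the Auto data set).
set.seed(1)
x <- runif(200, min = 50, max = 230)
y <- 60 - 8 * log(x) + rnorm(200, sd = 0.5)

# R-squared of a simple linear fit for a given formula.
r_squared <- function(formula) summary(lm(formula))$r.squared

r2 <- c(
  raw    = r_squared(y ~ x),
  log    = r_squared(y ~ log(x)),
  sqrt   = r_squared(y ~ sqrt(x)),
  square = r_squared(y ~ I(x^2))
)
round(r2, 3)
```

With the real data, x and y would be replaced by, for example, Auto$horsepower and Auto$mpg.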

10.

Fit a multiple regression model to predict Sales using Price, Urban, and US.

data("Carseats")

lm.carseats <- lm(Sales ~ Price+Urban+ US, data= Carseats)

summary(lm.carseats)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Price: a one-unit increase in price is associated with an average decrease of 0.0545 units in sales, with all other predictors held constant.

Urban: on average, sales are 0.0219 units lower in urban locations than in non-urban locations, with all other predictors held constant. This difference is not statistically significant (p-value 0.936).

US: on average, sales are 1.2006 units higher for stores in the US than for stores outside the US, with all other predictors held constant.

Write out the model in equation form, being careful to handle the qualitative variables properly.

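Sales = 13.04 - 0.054 * Price - 0.022 * Urban + 1.201 * US + error

where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.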
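To see how the dummy variables enter the equation, the fitted coefficients from the summary above can be plugged in by hand. The function predicted_sales is a hypothetical helper written for this illustration; with the fitted model available, predict(lm.carseats, newdata = ...) would do the same job.

```r
# Coefficients copied from summary(lm.carseats).
predicted_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * (urban == "Yes") +   # dummy: 1 if urban, 0 otherwise
    1.200573 * (us == "Yes")        # dummy: 1 if in the US, 0 otherwise
}

# Urban US store versus non-urban non-US store at the same price.
predicted_sales(price = 100, urban = "Yes", us = "Yes")
predicted_sales(price = 100, urban = "No",  us = "No")
```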

For which of the predictors can you reject the null hypothesis H0 : βj = 0?

We can reject the null hypothesis for the Price and US variables.

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm.carseats1 <- lm(Sales ~ Price +US, data =Carseats)

summary(lm.carseats1)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

How well do the models in (a) and (e) fit the data?

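R^2 is essentially the same for the smaller and the larger model (0.2393 in both), so only about 24% of the variation in Sales is explained by either model. The smaller model has a slightly higher adjusted R^2 (0.2354 vs 0.2335) and a slightly lower residual standard error (2.469 vs 2.472).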

Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(lm.carseats1)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
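The intervals above are estimate ± t quantile × standard error. A small check for the Price coefficient, using the rounded estimate and standard error from summary(lm.carseats1), so agreement with confint() is only approximate; the variable names are illustrative.

```r
est      <- -0.05448   # Price estimate from the summary
se       <- 0.00523    # its standard error
df_resid <- 397        # residual degrees of freedom

# 97.5% quantile of the t distribution for a two-sided 95% interval.
t_crit <- qt(0.975, df = df_resid)

ci <- est + c(-1, 1) * t_crit * se
ci
```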

Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2,2))
plot(lm.carseats1)
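Besides the diagnostic plots, the same question can be checked numerically with studentized residuals and hat values. The sketch below is one possible approach under stated assumptions: the cutoffs (absolute studentized residual above 3, leverage above three times the average) are common rules of thumb, the helper flag_unusual is hypothetical, and the demonstration uses simulated data with one deliberately extreme x value rather than the Carseats fit.

```r
flag_unusual <- function(fit, resid_cutoff = 3, leverage_mult = 3) {
  stud <- rstudent(fit)    # studentized residuals
  lev  <- hatvalues(fit)   # leverage; mean(lev) equals (p + 1) / n
  list(
    outliers      = which(abs(stud) > resid_cutoff),
    high_leverage = which(lev > leverage_mult * mean(lev))
  )
}

# Simulated example: observation 100 sits far from the other x values.
set.seed(1)
x <- c(rnorm(99), 10)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

flag_unusual(fit)
```

Calling flag_unusual(lm.carseats1) would apply the same rules to the model from (e).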

12.

Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The coefficient estimate for the regression of Y onto X without an intercept is sum(x_i * y_i) / sum(x_i^2).

The coefficient estimate for the regression of X onto Y without an intercept is sum(x_i * y_i) / sum(y_i^2).

The numerators are identical, so, provided sum(x_i * y_i) is not zero, the two estimates are the same if and only if sum(x_i^2) = sum(y_i^2). If sum(x_i * y_i) = 0, both estimates are zero regardless.
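The two closed-form expressions can be checked against lm() on simulated data. The seed and the data-generating line are arbitrary choices for illustration.

```r
set.seed(42)
x <- rnorm(100)
y <- 3 * x + rnorm(100)

# Closed-form no-intercept estimates.
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# Compare with the coefficients that lm() reports without an intercept.
all.equal(beta_y_on_x, unname(coef(lm(y ~ x + 0))))
all.equal(beta_x_on_y, unname(coef(lm(x ~ y + 0))))
```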

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

x <- 1:100
sum(x^2)

## [1] 338350

y <-2*x +rnorm(100, sd = 0.1)
sum(y^2)

## [1] 1353049

lm.x <- lm(x~y+0)
summary(lm.x)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.149611 -0.035205  0.006287  0.031466  0.106609 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## y 5.001e-01  4.187e-05   11943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0487 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.426e+08 on 1 and 99 DF,  p-value: < 2.2e-16

lm.y <- lm(y~x+0)
summary(lm.y)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21315 -0.06283 -0.01253  0.07047  0.29919 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x 1.9997397  0.0001674   11943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09739 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.426e+08 on 1 and 99 DF,  p-value: < 2.2e-16

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x <- 1:100 
sum(x^2)

## [1] 338350

y <- 100:1 
sum(y^2)

## [1] 338350

lm.x <- lm(x~y +0)
lm.y <- lm(y~x +0)

summary(lm.x)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

summary(lm.y)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
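Reversing the sequence is only one way to make the sums of squares match. Any reordering of x has the same sum of squares, so a random permutation also gives equal coefficients. A small sketch with an arbitrary seed:

```r
set.seed(7)
x <- rnorm(100)
y <- sample(x)   # same values in random order, so sum(y^2) matches sum(x^2)

b_x_on_y <- unname(coef(lm(x ~ y + 0)))
b_y_on_x <- unname(coef(lm(y ~ x + 0)))

# Equal up to floating-point rounding.
all.equal(b_x_on_y, b_y_on_x)
```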

homework 2

Sudeep Jacob

2/26/2021
