Homework #2

Problem 2
Problem 9
Problem 10
Problem 12

Problem 2

Carefully explain the difference between the KNN classifier and KNN regression methods.

The KNN classifier is used when the response variable is qualitative. For a test point it finds the K nearest training observations, estimates the conditional probability of each class as the fraction of those neighbors that belong to it, and predicts the most common class. KNN regression is used when the response variable is quantitative: it finds the same K nearest neighbors but predicts the average of their response values.
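To make the contrast concrete, here is a minimal base-R sketch on a tiny made-up data set. The helper names `nearest_k`, `knn_classify`, and `knn_regress` are hypothetical (they are not from a package); both methods share the neighbor search and differ only in the final summary step.

```r
# Hypothetical helpers for illustration only; not from any package.
nearest_k <- function(x_train, x0, k) {
  order(abs(x_train - x0))[seq_len(k)]  # indices of the k closest training points
}

knn_classify <- function(x_train, y_train, x0, k) {
  idx <- nearest_k(x_train, x0, k)
  probs <- table(y_train[idx]) / k      # estimated conditional class probabilities
  names(probs)[which.max(probs)]        # predict the most common class
}

knn_regress <- function(x_train, y_train, x0, k) {
  idx <- nearest_k(x_train, x0, k)
  mean(y_train[idx])                    # predict the average response
}

# Made-up training data
x_train <- c(1, 2, 3, 10, 11, 12)
labels  <- c("a", "a", "b", "b", "b", "b")
values  <- c(1, 2, 3, 10, 11, 12)

knn_classify(x_train, labels, x0 = 2.2, k = 3)  # majority vote among the 3 neighbors
knn_regress(x_train, values, x0 = 2.2, k = 3)   # mean of the 3 neighbors' values
```

With x0 = 2.2 and k = 3, the neighbors are the training points at 1, 2, and 3, so the classifier votes among their labels and the regression averages their values.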

Problem 9

This question involves the use of multiple linear regression on the Auto data set.

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.5.2

library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.5.3

## 
## Attaching package: 'ISLR2'

## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit

Produce a scatterplot matrix which includes all of the variables in the data set.

# drop the non-numeric "name" column, which made pairs() throw an error
pairs(Auto[, -which(names(Auto) == "name")])

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[, -which(names(Auto) == "name")])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

Is there a relationship between the predictors and the response?
Which predictors appear to have a statistically significant relationship to the response?
What does the coefficient for the year variable suggest?

lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

-There is a relationship between the predictors and the response: the F-statistic is large (252.4) with a p-value < 2.2e-16, and the model explains about 82% of the variance in mpg (R^2 = 0.8215, adjusted R^2 = 0.8182).
-displacement, weight, year, and origin have a statistically significant relationship to mpg because their p-values are below 0.05. cylinders, horsepower, and acceleration do not.
-The year coefficient suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg. Newer cars have higher mpg.

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

plot(lm.fit)

The residuals vs. fitted plot shows a U-shaped trend, which suggests the relationship is non-linear. The leverage plot flags observation 14 as a high leverage point, and a few observations have standardized residuals around 4, which indicates outliers.
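The visual reading of the diagnostic plots can be cross-checked numerically. This is a minimal sketch using `hatvalues()` and `rstudent()` from base R; the cutoff of 3 for studentized residuals is a rule of thumb assumed here, not a hard threshold.

```r
library(ISLR2)

lm.fit <- lm(mpg ~ . - name, data = Auto)

lev  <- hatvalues(lm.fit)   # leverage statistic for each observation
stud <- rstudent(lm.fit)    # studentized residuals

# Average leverage is (p + 1) / n; points far above it are high leverage
avg_lev <- length(coef(lm.fit)) / nrow(Auto)

which.max(lev)              # observation with the highest leverage
lev[which.max(lev)] / avg_lev  # how many times the average leverage it has

# Assumed rule of thumb: |studentized residual| > 3 marks a possible outlier
names(stud)[abs(stud) > 3]
```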

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

lm.fit1 <- lm(mpg ~ . - name, cylinders:displacement, data = Auto)

## Warning in cylinders:displacement: numerical expression has 392 elements: only
## the first used

summary(lm.fit1)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto, subset = cylinders:displacement)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3945 -1.7041 -0.0186  1.6198 13.3258 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.1345737  5.1622721   0.220   0.8262    
## cylinders    -0.4608452  0.3128285  -1.473   0.1418    
## displacement  0.0070389  0.0070386   1.000   0.3181    
## horsepower   -0.0183184  0.0125424  -1.461   0.1452    
## weight       -0.0052788  0.0005963  -8.853  < 2e-16 ***
## acceleration -0.0936875  0.0959226  -0.977   0.3295    
## year          0.5353830  0.0619708   8.639 3.77e-16 ***
## origin        0.7307611  0.3023433   2.417   0.0163 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.792 on 292 degrees of freedom
## Multiple R-squared:  0.8204, Adjusted R-squared:  0.8161 
## F-statistic: 190.5 on 7 and 292 DF,  p-value: < 2.2e-16

lm.fit2 <- lm(mpg ~ . - name, horsepower:weight, data = Auto)

## Warning in horsepower:weight: numerical expression has 392 elements: only the
## first used

summary(lm.fit2)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto, subset = horsepower:weight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6166 -2.0557 -0.2415  1.5404 12.0351 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -29.231835   7.641106  -3.826 0.000164 ***
## cylinders      0.143802   0.438423   0.328 0.743182    
## displacement   0.009192   0.011106   0.828 0.408630    
## horsepower    -0.061881   0.022917  -2.700 0.007394 ** 
## weight        -0.005994   0.001013  -5.916 1.05e-08 ***
## acceleration   0.045217   0.136117   0.332 0.740014    
## year           0.932636   0.088183  10.576  < 2e-16 ***
## origin         1.226859   0.324440   3.781 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.387 on 255 degrees of freedom
##   (3112 observations deleted due to missingness)
## Multiple R-squared:  0.8078, Adjusted R-squared:  0.8025 
## F-statistic: 153.1 on 7 and 255 DF,  p-value: < 2.2e-16

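The two fits above do not actually test an interaction. Because each interaction was passed as a separate argument instead of inside the formula, lm() treated it as subset (see the Call lines and the warnings), so each model was refit on a subset of the rows and contains no interaction coefficient. The interaction terms need to be added to the formula, e.g. mpg ~ . - name + cylinders:displacement, before their significance can be judged.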
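In lm(), an interaction is specified inside the formula; a second positional argument is interpreted as subset, as the Call lines above show. A minimal sketch of the in-formula syntax follows, with `lm.int1` and `lm.int2` as hypothetical object names. No output is shown because these models were not part of the original run.

```r
library(ISLR2)

# `a:b` adds only the product term; `a * b` is shorthand for a + b + a:b.
# The main effects are already included through `.`, so `:` is enough here.
lm.int1 <- lm(mpg ~ . - name + cylinders:displacement, data = Auto)
lm.int2 <- lm(mpg ~ . - name + horsepower:weight, data = Auto)

# Pull out just the interaction rows of the coefficient tables
# (estimate, standard error, t value, p-value)
coef(summary(lm.int1))["cylinders:displacement", ]
coef(summary(lm.int2))["horsepower:weight", ]
```

The p-value in each extracted row is what answers whether that interaction is statistically significant.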

Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

plot(log(Auto$displacement), Auto$mpg)

plot(sqrt(Auto$displacement), Auto$mpg)

plot((Auto$displacement)^2, Auto$mpg)

The log and square root transformations make the relationship between displacement and mpg more linear, whereas squaring displacement does not.
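The visual comparison can be backed up by fitting a simple regression with each transformation and comparing R^2. This is a sketch under the assumption that R^2 is an adequate summary of linearity here; the object names `fit_raw`, `fit_log`, `fit_sqrt`, and `fit_sq` are hypothetical.

```r
library(ISLR2)

fit_raw  <- lm(mpg ~ displacement, data = Auto)
fit_log  <- lm(mpg ~ log(displacement), data = Auto)
fit_sqrt <- lm(mpg ~ sqrt(displacement), data = Auto)
fit_sq   <- lm(mpg ~ I(displacement^2), data = Auto)  # I() protects ^ inside a formula

# R^2 of each fit, side by side
r2 <- sapply(
  list(raw = fit_raw, log = fit_log, sqrt = fit_sqrt, squared = fit_sq),
  function(f) summary(f)$r.squared
)
r2
```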

Problem 10

This question should be answered using the Carseats data set.

library(ISLR2)
data("Carseats")

Fit a multiple regression model to predict Sales using Price, Urban, and US.

Q10_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Q10_fit)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

-Price: holding Urban and US fixed, a one-unit increase in price is associated with an average decrease in Sales of 0.054459 (Sales is recorded in thousands of units, so about 54 car seats). This is statistically significant, with a p-value < 0.05.
-UrbanYes: holding the other predictors fixed, stores in urban locations sell 0.021916 fewer on average than stores in non-urban locations. This is not statistically significant because the p-value is 0.936.
-USYes: holding the other predictors fixed, stores in the US sell 1.200573 more on average than stores outside the US. This is statistically significant, with a p-value < 0.05.
-Intercept: when Price is 0 and both Urban and US are 'No', predicted Sales is 13.043469.

Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.043469 - 0.054459 * Price - 0.021916 * 1(UrbanYes) + 1.200573 * 1(USYes)

Here 1(UrbanYes) equals 1 for a store in an urban location and 0 otherwise, and 1(USYes) equals 1 for a store in the US and 0 otherwise.
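One way to check that the equation handles the qualitative variables correctly is to evaluate it by hand for a single store and compare against predict(). The store below (Price = 100, urban, in the US) is a made-up example, and `new_store`, `by_hand`, and `by_predict` are hypothetical names.

```r
library(ISLR2)

Q10_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
b <- coef(Q10_fit)

# Made-up store: Price = 100, urban location, inside the US
new_store <- data.frame(
  Price = 100,
  Urban = factor("Yes", levels = levels(Carseats$Urban)),
  US    = factor("Yes", levels = levels(Carseats$US))
)

# Equation form: both indicators equal 1 for this store
by_hand <- b["(Intercept)"] + b["Price"] * 100 + b["UrbanYes"] * 1 + b["USYes"] * 1

# Same prediction via the fitted model object
by_predict <- predict(Q10_fit, newdata = new_store)

all.equal(unname(by_hand), unname(by_predict))
```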

For which of the predictors can you reject the null hypothesis H0 : βj = 0?

-We can reject the null hypothesis for Price and US, since both p-values are below 0.05. We cannot reject it for Urban (p-value = 0.936).

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

Q10_fit1 <- lm(Sales ~ Price + US, data = Carseats)
summary(Q10_fit1)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

How well do the models in (a) and (e) fit the data?

The models in (a) and (e) have nearly identical R^2 values of about 0.2393, so each explains only about 23.93% of the variance in Sales. Neither model fits the data very well.

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(Q10_fit1)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

-Neither the interval for Price nor the interval for USYes contains 0, so both coefficients remain statistically significant at the 5% level.

Is there evidence of outliers or high leverage observations in the model from (e)?

plot(Q10_fit1)

There is no evidence of outliers: no standardized residual lies beyond +/-3. There are a few high leverage points, particularly one with leverage above 0.04, which is well over the average leverage of (p + 1)/n = 3/400 = 0.0075.

Problem 12

This problem involves simple linear regression without an intercept.

Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

-When the sums of squares of X and Y are equal, i.e. Σ x_i^2 = Σ y_i^2.
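The reasoning behind this condition can be written out from the no-intercept estimate in (3.38). The subscripts below are notation introduced here to tell the two regressions apart:

```latex
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
```

The numerators are identical, so the two estimates agree exactly when the denominators agree, that is, when the sums of squares are equal. They also agree trivially, both being 0, when the numerator is 0.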

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

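set.seed(1)
n <- 100

x <- rnorm(n)
y <- 2 * x + rnorm(n)

# no-intercept regressions; the predictor has to appear in the formula
Q12_fit <- lm(y ~ x + 0)
Q12_fit1 <- lm(x ~ y + 0)

coef(Q12_fit)

coef(Q12_fit1)

Because y = 2x + noise, the sum of squares of y is much larger than that of x, so the two estimates differ: the slope of y onto x should be close to 2, while the slope of x onto y should be well below 1.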

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(1)

x1 <- rnorm(100)
yy <- x1  # copy of x1, so both sums of squares are equal

Q12_fit2 <-lm(yy ~ x1 + 0)
Q12_fit3 <- lm(x1 ~ yy + 0)

coef(Q12_fit2)

## x1 
##  1

coef(Q12_fit3)

## yy 
##  1
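Setting the response equal to the predictor is the simplest case. A less trivial sketch uses a random permutation of the predictor: it keeps the sum of squares unchanged but the common slope is generally not 1. The names `x2`, `y2`, `b_yx`, and `b_xy` are hypothetical and not part of the original answer.

```r
set.seed(1)

x2 <- rnorm(100)
y2 <- sample(x2)  # same values in a different order, so the sums of squares match

b_yx <- coef(lm(y2 ~ x2 + 0))  # regression of y2 onto x2
b_xy <- coef(lm(x2 ~ y2 + 0))  # regression of x2 onto y2

# The two slopes agree up to floating-point tolerance
all.equal(unname(b_yx), unname(b_xy))
```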