Question 2:

“Carefully explain the differences between the KNN classifier and KNN regression methods.”

Answer: “The KNN classifier is used to solve classification problems. It identifies the neighborhood of x0 and then estimates the conditional probability P(Y=j|X=x0) for class j as the fraction of points in the neighborhood whose response values equal j; x0 is then assigned to the class with the largest estimated probability. The KNN regression method is used to solve regression problems. It also identifies the neighborhood of x0, but instead estimates f(x0) as the average of all the training responses in the neighborhood.”
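A minimal sketch contrasting the two methods on toy data (the training data, the test point x0 and the choice k = 5 are made up for illustration; assumes the class and FNN packages are installed):

set.seed(1)
train.X <- matrix(rnorm(100), ncol = 2)               # 50 training points in 2 dimensions
cl <- factor(ifelse(rowSums(train.X) > 0, "A", "B"))  # qualitative response (class labels)
y <- train.X[, 1] + rnorm(50)                         # quantitative response
x0 <- matrix(c(0.5, 0.5), ncol = 2)                   # test point
class::knn(train.X, x0, cl, k = 5)                    # majority class among the 5 nearest neighbors
FNN::knn.reg(train.X, x0, y, k = 5)$pred              # average response of the 5 nearest neighbors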

Question 9:

“This question involves the use of multiple linear regression on the Auto data set.”

library(ISLR)
data(Auto)

Part A: “Produce a scatterplot matrix which includes all of the variables in the data set.”

pairs(Auto)

Part B: “Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the “name” variable, which is qualitative.”

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Part C: “Use the lm() function to perform a multiple linear regression with “mpg” as the response and all other variables except “name” as the predictors. Use the summary() function to print the results. Comment on the output. For instance:”

i. Is there a relationship between the predictors and the response?

fit1 <- lm(mpg ~ . - name, data = Auto)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Comment: “We can answer this question by testing the hypothesis H0: βi = 0 for all i. The p-value corresponding to the F-statistic is about 2.0371059 × 10^{-139}, which indicates clear evidence of a relationship between “mpg” and the other predictors.”
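The quoted p-value is not printed directly by summary() (it only shows “< 2.2e-16”); it can be recovered from the stored F-statistic, for example:

fstat <- summary(fit1)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # p-value of the overall F-test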

ii. Which predictors appear to have a statistically significant relationship to the response?

Answer: We can answer this question by checking the p-values associated with each predictor's t-statistic. We may conclude that all predictors are significant except “cylinders”, “horsepower” and “acceleration”.
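The same conclusion can be read off programmatically from the coefficient table (a sketch):

pvals <- summary(fit1)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
names(pvals)[pvals < 0.05]                           # displacement, weight, year, origin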

iii. What does the coefficient for the “year” variable suggest?

Answer: The coefficient of the “year” variable suggests that the average effect of an increase of one year is an increase of 0.7507727 in “mpg”, all other predictors held fixed. In other words, cars become more fuel efficient by roughly 0.75 mpg per year.
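To see this concretely, one can predict mpg for two hypothetical cars that are identical except for a one-year difference (the predictor values below are made up for illustration):

newcars <- data.frame(cylinders = 4, displacement = 200, horsepower = 100,
                      weight = 3000, acceleration = 15, year = c(76, 77),
                      origin = 1)
diff(predict(fit1, newcars))  # equals the "year" coefficient: ~0.75 mpg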

Part D: “Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?”

par(mfrow = c(2, 2))
plot(fit1)

Comment: The plot of residuals versus fitted values shows mild non-linearity in the data. The plot of standardized residuals versus leverage shows a few possible outliers (standardized residuals above 2 or below -2) and one high-leverage point (observation 14).
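A numeric complement to the plots (|standardized residual| > 2 and leverage well above the average (p + 1)/n are rules of thumb, not hard cutoffs):

sum(abs(rstandard(fit1)) > 2)  # number of candidate outliers
which.max(hatvalues(fit1))     # the highest-leverage observation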

Part E: Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Comment: From the p-values below, we can see that the interaction between displacement and weight is significant, while the interaction between cylinders and displacement is not.

fit2 <- lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto[, 1:8])
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement * 
##     weight, data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2934  -2.5184  -0.3476   1.8399  17.7723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.262e+01  2.237e+00  23.519  < 2e-16 ***
## cylinders               7.606e-01  7.669e-01   0.992    0.322    
## displacement           -7.351e-02  1.669e-02  -4.403 1.38e-05 ***
## weight                 -9.888e-03  1.329e-03  -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03  3.426e-03  -0.872    0.384    
## displacement:weight     2.128e-05  5.002e-06   4.254 2.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7237 
## F-statistic: 205.8 on 5 and 386 DF,  p-value: < 2.2e-16

Part F: Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)

Comment: We limit ourselves to examining “horsepower” as the sole predictor. The log transformation gives the most linear-looking plot.
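A rough numeric check of the same comparison, using the R² of a simple regression under each transformation (a sketch; R² is only a crude yardstick here):

sapply(list(log = lm(mpg ~ log(horsepower), data = Auto),
            sqrt = lm(mpg ~ sqrt(horsepower), data = Auto),
            square = lm(mpg ~ I(horsepower^2), data = Auto)),
       function(m) summary(m)$r.squared)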

Question 10

This question should be answered using the “Carseats” data set.

library(ISLR)
data(Carseats)

Part A: Fit a multiple regression model to predict “Sales” using “Price”, “Urban” and “US”.

fitQ10 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fitQ10)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Part B: Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are qualitative.

Answer: The coefficient of the “Price” variable says that, all other predictors held fixed, a price increase of 1 dollar is associated with an average decrease of about 54.5 units in sales (the coefficient is -0.054459 and “Sales” is recorded in thousands of units). The coefficient of the “Urban” variable says that, on average, unit sales in an urban location are about 21.9 units lower than in a rural location, all other predictors held fixed. The coefficient of the “US” variable says that, on average, unit sales in a US store are about 1200.6 units higher than in a non-US store, all other predictors held fixed.
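The unit conversions behind these numbers can be checked directly, since “Sales” in the Carseats data is measured in thousands of units:

coef(fitQ10)["Price"] * 1000     # about -54.5 units of sales per extra dollar
coef(fitQ10)["UrbanYes"] * 1000  # about -21.9 units for an urban location
coef(fitQ10)["USYes"] * 1000     # about +1200.6 units for a US store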

Part C: Write out the model in equation form, being careful to handle the qualitative variables properly.

Answer: Sales = 13.0434689 − 0.0544588 × Price − 0.0219162 × Urban + 1.2005727 × US + ε, where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
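The 0/1 coding that R uses for the two qualitative variables can be inspected with contrasts():

contrasts(Carseats$Urban)  # Yes = 1, No = 0
contrasts(Carseats$US)     # Yes = 1, No = 0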

Part D: For which of the predictors can you reject the null hypothesis H0:βj=0?

Answer: We can reject the null hypothesis for the “Price” and “US” variables.

Part E: On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fitparte <- lm(Sales ~ Price + US, data = Carseats)
summary(fitparte)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Part F: How well do the models in (a) and (e) fit the data?

Answer: Both models fit the data comparably: R² is essentially identical (0.2393), while the adjusted R² of the smaller model is marginally better (0.2354 vs. 0.2335). Either way, the predictors explain only about 24% of the variance in “Sales”.
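A compact side-by-side view of the two fits (a sketch):

sapply(list(full = fitQ10, reduced = fitparte),
       function(m) c(r.squared = summary(m)$r.squared,
                     adj.r.squared = summary(m)$adj.r.squared,
                     RSE = summary(m)$sigma))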

Part G: Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(fitparte)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Part H: Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2, 2))
plot(fitparte)
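The plots can be supplemented with a numeric check (the thresholds are rules of thumb, not hard cutoffs):

p1 <- length(coef(fitparte))                          # p + 1 = 3 (intercept, Price, USYes)
sum(abs(rstudent(fitparte)) > 3)                      # studentized residuals beyond |3| would flag outliers
which(hatvalues(fitparte) > 2 * p1 / nrow(Carseats))  # leverage above twice the average (p + 1)/n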

Question 12:

This problem involves simple linear regression without an intercept.

Part A: Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

Answer: From (3.38), the coefficient for the regression of Y onto X is β̂ = (∑ xᵢyᵢ)/(∑ xᵢ²), while the coefficient for the regression of X onto Y is (∑ xᵢyᵢ)/(∑ yᵢ²). The two estimates are therefore the same exactly when ∑ xᵢ² = ∑ yᵢ².
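A toy illustration of the condition (a sketch): permuting the values of a vector leaves its sum of squares unchanged, so the two no-intercept slopes coincide.

set.seed(2)
u <- rnorm(100)
v <- sample(u)        # same values in a different order, so sum(v^2) == sum(u^2)
coef(lm(v ~ u + 0))   # identical ...
coef(lm(u ~ v + 0))   # ... slopes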

Part B: Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.1)
sum(y^2)
## [1] 1353606
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.223590 -0.062560  0.004426  0.058507  0.230926 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x 2.0001514  0.0001548   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16
summary(fit.X)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.115418 -0.029231 -0.002186  0.031322  0.111795 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y 5.00e-01   3.87e-05   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16
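Consistent with Part A, the two estimates differ because sum(x^2) = 338350 while sum(y^2) = 1353606; each slope can be recovered directly from these sums:

sum(x * y) / sum(x^2)  # slope of y ~ x + 0, about 2
sum(x * y) / sum(y^2)  # slope of x ~ y + 0, about 0.5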