2. Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier and KNN regression are closely related methods, but they pursue different objectives. The classifier handles a qualitative response: it assigns an observation to the class that is most common among its K nearest neighbours (a classification problem). KNN regression handles a quantitative response: it predicts the value of the target variable as the average of the responses of the K nearest neighbours (a regression problem).
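A minimal sketch of the two methods on simulated data (assuming the class and FNN packages, which provide knn() and knn.reg() respectively; the data here are illustrative only):
library(class) # knn() for classification
library(FNN) # knn.reg() for regression
set.seed(1)
train.X = matrix(rnorm(100), ncol = 2) # 50 training observations
test.X = matrix(rnorm(20), ncol = 2) # 10 test observations
# Classification: predict the majority class among the K nearest neighbours
cls = factor(ifelse(train.X[,1] + train.X[,2] > 0, "A", "B"))
knn(train.X, test.X, cl = cls, k = 5)
# Regression: predict the average response of the K nearest neighbours
y.train = train.X[,1] + rnorm(50)
knn.reg(train.X, test.X, y = y.train, k = 5)$pred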
library(ISLR) # provides the Auto and Carseats data sets used below
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[,-9]) # column 9 is the qualitative name variable
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
lm.model = lm(mpg ~ . - name, data = Auto)
summary(lm.model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
For instance:
i. Is there a relationship between the predictors and the response?
Yes. The F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero: there is a relationship between the predictors and the response.
ii. Which predictors appear to have a statistically significant relationship to the response?
Based on the p-values, displacement, weight, year and origin appear to have a statistically significant relationship to the response.
iii. What does the coefficient for the year variable suggest?
The coefficient for year is 0.750773. Therefore, in this data set newer cars tend to have better gas mileage when everything else is held constant. In particular, each additional model year is associated with about 0.75 more miles per gallon.
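This can be checked directly from the fitted model (an illustrative sketch: the first car is duplicated with its year shifted by one):
nd = Auto[c(1, 1), ] # two copies of the first observation
nd$year[2] = nd$year[1] + 1 # same car, one model year later
diff(predict(lm.model, newdata = nd)) # approximately 0.75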
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(lm.model)
The residuals vs. fitted and normal Q-Q plots show the presence of outliers at high values of gas mileage. The leverage plot indicates that there are influential points with unusually high leverage.
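The suspect points can also be flagged numerically (a sketch; the cutoff 2(p + 1)/n is a common rule of thumb, not the only choice):
h = hatvalues(lm.model) # leverage of each observation
p = length(coef(lm.model)) - 1 # number of predictors
head(sort(h, decreasing = TRUE)) # largest leverages
which(h > 2 * (p + 1) / nrow(Auto)) # observations above the cutoff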
(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
par(mfrow=c(2,2))
plot(mpg ~ log(displacement), data = Auto)
plot(mpg ~ sqrt(displacement), data = Auto)
plot(mpg ~ I(displacement^2), data = Auto) # I() is needed so ^ is arithmetic, not a formula operator
The sqrt transformation seems to give the most linear plot.
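One way to go beyond eyeballing the plots (a sketch) is to fit a simple regression under each transformation and compare the R² values:
summary(lm(mpg ~ displacement, data = Auto))$r.squared
summary(lm(mpg ~ log(displacement), data = Auto))$r.squared
summary(lm(mpg ~ sqrt(displacement), data = Auto))$r.squared
summary(lm(mpg ~ I(displacement^2), data = Auto))$r.squared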
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
Carseats$Sales = Carseats$Sales * 1000 # Sales is recorded in thousands of units; rescale to single units
m1 = lm(Sales ~ Price + Urban + US, data = Carseats)
(b) Provide an interpretation of each coefficient in the model. Be careful some of the variables in the model are qualitative!
summary(m1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6920.6 -1622.0 -56.4 1578.6 7058.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13043.469 651.012 20.036 < 2e-16 ***
## Price -54.459 5.242 -10.389 < 2e-16 ***
## UrbanYes -21.916 271.650 -0.081 0.936
## USYes 1200.573 259.042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Units sold decrease by about 54 for each unit increase in Price. Stores in urban areas sell about 22 units fewer than stores in rural areas, although this coefficient is not statistically significant. Stores in the United States sell about 1200 units more than stores outside of the country.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13043.469 - 54.459·Price - 21.916·Urban + 1200.573·US, where Urban = 1 if the store is in an urban area (0 otherwise) and US = 1 if the store is in the US (0 otherwise).
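R encodes the two qualitative predictors as 0/1 dummy variables with "No" as the baseline level, which can be inspected directly:
contrasts(Carseats$Urban) # Yes = 1, No = 0 (baseline)
contrasts(Carseats$US)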
(d) For which of the predictors can you reject the null hypothesis H0 :βj =0?
Price and US: both have p-values far below 0.05, while the p-value for Urban (0.936) provides no evidence against H0.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
m2 = lm(Sales ~ Price + US, data = Carseats)
summary(m2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6926.9 -1628.6 -57.4 1576.6 7051.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13030.79 630.98 20.652 < 2e-16 ***
## Price -54.48 5.23 -10.416 < 2e-16 ***
## USYes 1199.64 258.46 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
Both models explain about 23.93% of the variance in Sales (R² = 0.2393). The smaller model in (e) has a slightly higher adjusted R² (0.2354 vs. 0.2335), so dropping Urban costs essentially nothing.
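The two nested models can also be compared directly with an F-test (a sketch; m1 is the full model from (a), m2 the reduced model from (e)):
anova(m2, m1) # tests whether Urban adds anything beyond Price and US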
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(m2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11790.32020 14271.26531
## Price -64.75984 -44.19543
## USYes 691.51957 1707.76632
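These intervals can be reproduced by hand (a sketch) as estimate ± t(0.975, df) · standard error:
est = coef(summary(m2)) # Estimate and Std. Error columns
tval = qt(0.975, df.residual(m2)) # df = 397 here
cbind(lower = est[,1] - tval * est[,2], upper = est[,1] + tval * est[,2])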
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(m2)
The residual plots do not suggest unusually large outliers, but the leverage plot indicates that there are points with unusually high leverage.
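A numerical check (a sketch, using the usual rules of thumb of |studentized residual| > 3 for outliers and leverage above 2(p + 1)/n):
sum(abs(rstudent(m2)) > 3) # count of clear outliers
which(hatvalues(m2) > 2 * 3 / nrow(Carseats)) # p + 1 = 3 for this model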
(a) Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), regressing Y onto X gives \(\hat{\beta} = \sum_{i} x_i y_i / \sum_{i} x_i^2\), while regressing X onto Y gives \(\hat{\beta}' = \sum_{i} x_i y_i / \sum_{i} y_i^2\). The numerators are identical, so the two coefficient estimates are the same exactly when \(\sum_{i} x_i^2 = \sum_{i} y_i^2\).
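A quick numerical check of (3.38) (a sketch; note the no-intercept fit y ~ x + 0, matching the form of the estimator):
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
sum(x*y) / sum(x^2) # (3.38) computed directly
coef(lm(y ~ x + 0)) # matches the lm() estimate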
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x = 1:100
y = 4*x + rnorm(100) # slope of y on x is about 4, so the slope of x on y is about 1/4
m1 = lm(x ~ y)
m2 = lm(y ~ x)
summary(m1)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57377 -0.14607 -0.00247 0.15111 0.58286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0298930 0.0454995 -0.657 0.513
## y 0.2500132 0.0001955 1278.991 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2257 on 98 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.636e+06 on 1 and 98 DF, p-value: < 2.2e-16
summary(m2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.34005 -0.60584 0.01551 0.58514 2.29747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.131666 0.181897 0.724 0.471
## x 3.999549 0.003127 1278.991 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9027 on 98 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.636e+06 on 1 and 98 DF, p-value: < 2.2e-16
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x = 1:100
y = x # identical values, so trivially sum(x^2) == sum(y^2)
m1 = lm(x ~ y)
m2 = lm(y ~ x)
summary(m1)
## Warning in summary.lm(m1): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.680e-13 -4.300e-16 2.850e-15 5.302e-15 3.575e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.684e-14 5.598e-15 -1.015e+01 <2e-16 ***
## y 1.000e+00 9.624e-17 1.039e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778e-14 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.08e+32 on 1 and 98 DF, p-value: < 2.2e-16
summary(m2)
## Warning in summary.lm(m2): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.680e-13 -4.300e-16 2.850e-15 5.302e-15 3.575e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.684e-14 5.598e-15 -1.015e+01 <2e-16 ***
## x 1.000e+00 9.624e-17 1.039e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778e-14 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.08e+32 on 1 and 98 DF, p-value: < 2.2e-16
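A less degenerate example (a sketch) also works: any y whose sum of squares matches that of x, for instance a permutation of x, gives identical no-intercept slopes:
set.seed(1)
x = 1:100
y = sample(x) # same values reshuffled, so sum(x^2) == sum(y^2)
coef(lm(y ~ x + 0)) # slope of Y onto X, as in (3.38)
coef(lm(x ~ y + 0)) # identical slope for X onto Y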