Exercise 2

2. Carefully explain the differences between the KNN classifier and KNN regression methods.

The KNN classifier and KNN regression are closely related methods, but they solve different problems. The KNN classifier handles a qualitative response (a classification problem): it predicts the class of a test observation by a majority vote among its K nearest neighbours in the training set. KNN regression handles a quantitative response (a regression problem): it predicts the value of the target variable as the average of the responses of the K nearest training observations.
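
A minimal sketch contrasting the two methods in R (illustrative only; it assumes the class and FNN packages are installed, and all data below is made up):

library(class)   # knn(): classification by majority vote among neighbours
library(FNN)     # knn.reg(): regression by averaging neighbours' responses

set.seed(1)
train.X = matrix(rnorm(100), ncol = 2)   # 50 training observations, 2 features
test.X  = matrix(rnorm(10),  ncol = 2)   # 5 test observations

# Classification: qualitative response, predict the most common class among the K neighbours
cl = factor(sample(c("A", "B"), 50, replace = TRUE))
knn(train.X, test.X, cl, k = 3)

# Regression: quantitative response, predict the average response of the K neighbours
y = rnorm(50)
knn.reg(train.X, test.X, y, k = 3)$pred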

Exercise 9

library(ISLR)   # provides the Auto (and, later, Carseats) data sets

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[, -9])   # column 9 is the qualitative name variable
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
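
Equivalently, the name column can be excluded by name rather than by position, which is less fragile if the column order ever changes:

cor(subset(Auto, select = -name))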

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

lm.model = lm(mpg ~ . - name, data = Auto)
summary(lm.model)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

For instance:

i. Is there a relationship between the predictors and the response?
Yes. The F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero: there is a relationship between the predictors and mpg.

ii. Which predictors appear to have a statistically significant relationship to the response?
Based on their p-values, displacement, weight, year and origin have a statistically significant relationship with the response, while cylinders, horsepower and acceleration do not.

iii. What does the coefficient for the year variable suggest?
The coefficient for year is 0.750773. Therefore, in this data set newer cars tend to have better gas mileage when everything else is held constant: each additional model year is associated with an increase of about 0.75 miles per gallon.
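
As a quick sanity check of this interpretation, we can compare predictions for two otherwise identical cars one model year apart (the predictor values below are hypothetical):

new.cars = data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77),
                      origin = 1)
diff(predict(lm.model, new.cars))   # the difference equals the year coefficient, about 0.75 mpg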

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(lm.model)

The residuals vs. fitted and normal Q-Q plots show the presence of outliers at high values of gas mileage, and the residuals display a mild nonlinear pattern. The leverage plot indicates that there are influential points with unusually high leverage.
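
The observation flagged in the leverage plot can also be identified numerically (a quick check, not shown in the original output):

which.max(hatvalues(lm.model))   # index of the observation with the highest leverage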

(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

par(mfrow=c(2,2))
plot(mpg ~ log(displacement), data = Auto)
plot(mpg ~ sqrt(displacement), data = Auto)
plot(mpg ~ I(displacement^2), data = Auto)   # I() is needed so ^ squares the variable instead of acting as a formula operator

The sqrt transformation seems to give the most linear plot.
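
One rough way to check this impression numerically is to compare the R² values of the corresponding simple regressions (output omitted here):

summary(lm(mpg ~ log(displacement), data = Auto))$r.squared
summary(lm(mpg ~ sqrt(displacement), data = Auto))$r.squared
summary(lm(mpg ~ I(displacement^2), data = Auto))$r.squared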

Exercise 10

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

Carseats$Sales = Carseats$Sales * 1000   # Sales is recorded in thousands of units; rescale to single units
m1 = lm(Sales ~ Price + Urban + US, data = Carseats)

(b) Provide an interpretation of each coefficient in the model. Be careful some of the variables in the model are qualitative!

summary(m1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6920.6 -1622.0   -56.4  1578.6  7058.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13043.469    651.012  20.036  < 2e-16 ***
## Price         -54.459      5.242 -10.389  < 2e-16 ***
## UrbanYes      -21.916    271.650  -0.081    0.936    
## USYes        1200.573    259.042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Units sold decrease by about 54 for each unit increase in Price. Stores in urban areas sell about 22 units fewer than stores in rural areas, although this coefficient is not statistically significant. Stores in the United States sell about 1200 units more than stores outside the country.
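
The 0/1 coding that R uses for the qualitative variables (with No as the baseline level) can be verified with contrasts():

contrasts(Carseats$Urban)
contrasts(Carseats$US)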

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
\(Sales = 13043.469 - 54.459 \cdot Price - 21.916 \cdot Urban_{Yes} + 1200.573 \cdot US_{Yes}\), where \(Urban_{Yes} = 1\) if the store is in an urban area and 0 otherwise, and \(US_{Yes} = 1\) if the store is in the US and 0 otherwise.

(d) For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?
Price and US, whose p-values are well below 0.05.
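
The p-values behind this answer can be extracted directly from the fitted model:

summary(m1)$coefficients[, "Pr(>|t|)"]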

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

m2 = lm(Sales ~ Price+US, data= Carseats)
summary(m2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6926.9 -1628.6   -57.4  1576.6  7051.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13030.79     630.98  20.652  < 2e-16 ***
## Price         -54.48       5.23 -10.416  < 2e-16 ***
## USYes        1199.64     258.46   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?
Both models explain about 23.93% of the variance in Sales (R² = 0.2393); the smaller model from (e) has a slightly higher adjusted R² (0.2354 vs. 0.2335).
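
Since the two models are nested, an F-test via anova() can confirm that dropping Urban does not significantly worsen the fit:

anova(m2, m1)   # tests whether Urban adds explanatory power beyond Price and US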

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(m2, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11790.32020 14271.26531
## Price         -64.75984   -44.19543
## USYes         691.51957  1707.76632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow=c(2,2))
plot(m2)

The residual plots do not suggest any unusually large outliers, while the leverage plot indicates a few observations with unusually high leverage.
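
These impressions can be checked numerically; as a rough rule of thumb, hat values above twice their mean, \(2(p+1)/n\), flag high leverage:

sum(abs(rstudent(m2)) > 3)                       # studentized residuals beyond |3|
which(hatvalues(m2) > 2 * mean(hatvalues(m2)))   # rule-of-thumb high-leverage points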

Exercise 12

(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), regressing \(Y\) onto \(X\) without an intercept gives \(\hat{\beta} = \sum_{i} x_i y_i / \sum_{i} x_i^2\), while regressing \(X\) onto \(Y\) gives \(\sum_{i} x_i y_i / \sum_{i} y_i^2\). The numerators are identical, so the two coefficient estimates are the same exactly when \(\sum_{i} x_i^2 = \sum_{i} y_i^2\).

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
x = 1:100
y = 4*x + rnorm(100)
m1 = lm(x ~ y)
m2 = lm(y ~ x)
summary(m1)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57377 -0.14607 -0.00247  0.15111  0.58286 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -0.0298930  0.0454995   -0.657    0.513    
## y            0.2500132  0.0001955 1278.991   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2257 on 98 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 1.636e+06 on 1 and 98 DF,  p-value: < 2.2e-16
summary(m2)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34005 -0.60584  0.01551  0.58514  2.29747 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 0.131666   0.181897    0.724    0.471    
## x           3.999549   0.003127 1278.991   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9027 on 98 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 1.636e+06 on 1 and 98 DF,  p-value: < 2.2e-16

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x = 1:100
y = x
m1 = lm(x ~ y)
m2 = lm(y ~ x)
summary(m1)
## Warning in summary.lm(m1): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.680e-13 -4.300e-16  2.850e-15  5.302e-15  3.575e-14 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -5.684e-14  5.598e-15 -1.015e+01   <2e-16 ***
## y            1.000e+00  9.624e-17  1.039e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.778e-14 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.08e+32 on 1 and 98 DF,  p-value: < 2.2e-16
summary(m2)
## Warning in summary.lm(m2): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.680e-13 -4.300e-16  2.850e-15  5.302e-15  3.575e-14 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -5.684e-14  5.598e-15 -1.015e+01   <2e-16 ***
## x            1.000e+00  9.624e-17  1.039e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.778e-14 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.08e+32 on 1 and 98 DF,  p-value: < 2.2e-16
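
The warning above arises because y equals x exactly. A permutation of x also satisfies the condition from (a), since it has the same sum of squares as x, without producing a degenerate fit; a sketch using the no-intercept form of (3.38):

set.seed(1)
x = 1:100
y = sample(x)         # same values in a different order, so sum(x^2) == sum(y^2)
coef(lm(y ~ x + 0))   # slope of Y onto X, no intercept
coef(lm(x ~ y + 0))   # slope of X onto Y: identical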