Chapter 03: Problems 2, 9, 10, 12

Chapter 3 Problem 2:

Carefully explain the differences between the KNN classifier and KNN regression methods.

Answer:

The KNN classifier is used for classification tasks (a categorical response), whereas KNN regression is used for regression tasks (a continuous/numerical response). Given a test point, the KNN classifier assigns the majority class among its k nearest training neighbors, while KNN regression predicts the average of the responses of those k neighbors. In both methods, a larger k produces a smoother, less variable fit at the cost of more bias (and in classification can also favor majority classes). The performance metrics differ as well: classification is evaluated with accuracy, precision, recall, and F1-score, while regression is evaluated with MSE and R-squared.
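
To make the contrast concrete, here is a minimal sketch on made-up data (it assumes the class and FNN packages are installed; all variables below are purely illustrative):

set.seed(1)
x_train <- matrix(rnorm(100), ncol = 2)   # 50 training points in 2 dimensions
x_test  <- matrix(rnorm(20),  ncol = 2)   # 10 test points

# Classification: predict the majority class among the k nearest neighbors
y_class    <- factor(ifelse(x_train[, 1] + x_train[, 2] > 0, "A", "B"))
pred_class <- class::knn(train = x_train, test = x_test, cl = y_class, k = 5)

# Regression: predict the average response of the k nearest neighbors
y_num    <- x_train[, 1] + rnorm(50)
pred_num <- FNN::knn.reg(train = x_train, test = x_test, y = y_num, k = 5)$pred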

Chapter 3 Problem 9:

This question involves the use of multiple linear regression on the Auto data set.

library(ISLR2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Auto_quantitative <- Auto[, sapply(Auto, is.numeric)]  # keep only the numeric columns (drops name)
cor(Auto_quantitative)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

mlr1 <- lm(mpg ~ ., data = Auto_quantitative)
summary(mlr1)
## 
## Call:
## lm(formula = mpg ~ ., data = Auto_quantitative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Answers:

i. Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4 has a p-value < 2.2e-16, so we reject the null hypothesis that all coefficients are zero; the predictors jointly explain ~82% of the variance in mpg based on the adjusted R-squared value.

ii. Which predictors appear to have a statistically significant relationship to the response?

Displacement, weight, year, and origin are statistically significant because their p-values are < 0.05; cylinders, horsepower, and acceleration are not.

iii. What does the coefficient for the year variable suggest?

The coefficient for the year variable suggests that, holding the other predictors fixed, each one-year increase in model year is associated with an increase of about 0.75 mpg. This suggests that newer cars have higher fuel efficiency.
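
As a quick sanity check (a sketch; the car's specifications below are hypothetical), predicting the same car one model year apart reproduces the coefficient:

same_car <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                       weight = 2500, acceleration = 15, year = c(76, 77),
                       origin = 1)
diff(predict(mlr1, newdata = same_car))  # ~0.75 mpg per additional model year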

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow= c(2,2))
plot(mlr1)

Answer:

The residuals vs. fitted plot shows a slight pattern and increasing spread as the fitted values grow, which may indicate non-linearity and possible heteroscedasticity; the scale-location plot shows the same trend. The normal Q-Q plot suggests the residuals are not normally distributed, since there is deviation from the line in the tails (right-skewness). The residuals vs. leverage plot flags observation 14 as having noticeably higher leverage than the rest of the points.
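
A quick numerical check of the leverage values (a sketch using base R's hatvalues()):

hv <- hatvalues(mlr1)
which.max(hv)                # the observation flagged in the residuals vs. leverage panel
hv[which.max(hv)] / mean(hv) # how many times the average leverage it carries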

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

interaction1 <- lm(mpg~.+ displacement:horsepower, data = Auto_quantitative)
summary(interaction1)
## 
## Call:
## lm(formula = mpg ~ . + displacement:horsepower, data = Auto_quantitative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7010 -1.6009 -0.0967  1.4119 12.6734 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.894e+00  4.302e+00  -0.440  0.66007    
## cylinders                6.466e-01  3.017e-01   2.143  0.03275 *  
## displacement            -7.487e-02  1.092e-02  -6.859 2.80e-11 ***
## horsepower              -1.975e-01  2.052e-02  -9.624  < 2e-16 ***
## weight                  -3.147e-03  6.475e-04  -4.861 1.71e-06 ***
## acceleration            -2.131e-01  9.062e-02  -2.351  0.01921 *  
## year                     7.379e-01  4.463e-02  16.534  < 2e-16 ***
## origin                   6.891e-01  2.527e-01   2.727  0.00668 ** 
## displacement:horsepower  5.236e-04  4.813e-05  10.878  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.912 on 383 degrees of freedom
## Multiple R-squared:  0.8636, Adjusted R-squared:  0.8608 
## F-statistic: 303.1 on 8 and 383 DF,  p-value: < 2.2e-16

In this model, the interaction term between displacement and horsepower is highly significant. This suggests that the effect of displacement on mpg depends on the level of horsepower. The R-squared value is also higher in this model, suggesting that adding the interaction term improves the fit.

interaction2 <- lm(mpg ~ . + cylinders*displacement + displacement*horsepower, data = Auto_quantitative)
summary(interaction2)
## 
## Call:
## lm(formula = mpg ~ . + cylinders * displacement + displacement * 
##     horsepower, data = Auto_quantitative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0692 -1.5453 -0.1215  1.4814 12.7232 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.006e+00  4.426e+00  -0.227   0.8203    
## cylinders                2.309e-01  5.702e-01   0.405   0.6858    
## displacement            -8.149e-02  1.336e-02  -6.099 2.61e-09 ***
## horsepower              -1.873e-01  2.367e-02  -7.915 2.71e-14 ***
## weight                  -3.235e-03  6.557e-04  -4.933 1.21e-06 ***
## acceleration            -1.923e-01  9.382e-02  -2.050   0.0410 *  
## year                     7.402e-01  4.472e-02  16.550  < 2e-16 ***
## origin                   6.435e-01  2.583e-01   2.492   0.0131 *  
## cylinders:displacement   1.997e-03  2.325e-03   0.859   0.3908    
## displacement:horsepower  4.812e-04  6.897e-05   6.976 1.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.913 on 382 degrees of freedom
## Multiple R-squared:  0.8639, Adjusted R-squared:  0.8607 
## F-statistic: 269.4 on 9 and 382 DF,  p-value: < 2.2e-16

Adding cylinders:displacement did not change the model’s explanatory power much, because this interaction is not statistically significant (p ≈ 0.39).
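
A partial F-test comparing the nested models makes the same point (a sketch using base R's anova()):

# interaction1 is nested inside interaction2, so anova() tests whether
# cylinders:displacement adds explanatory power beyond displacement:horsepower
anova(interaction1, interaction2)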

interaction3 <- lm(mpg ~ . + horsepower:cylinders + displacement*horsepower, data = Auto_quantitative)
summary(interaction3)
## 
## Call:
## lm(formula = mpg ~ . + horsepower:cylinders + displacement * 
##     horsepower, data = Auto_quantitative)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.891 -1.627 -0.089  1.353 12.506 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.7245533  5.4406102   0.868  0.38573    
## cylinders               -1.4390172  1.0981296  -1.310  0.19084    
## displacement            -0.0476961  0.0175411  -2.719  0.00684 ** 
## horsepower              -0.2572895  0.0365449  -7.040 8.96e-12 ***
## weight                  -0.0033093  0.0006502  -5.089 5.65e-07 ***
## acceleration            -0.2088022  0.0903086  -2.312  0.02130 *  
## year                     0.7378654  0.0444594  16.596  < 2e-16 ***
## origin                   0.7459615  0.2533706   2.944  0.00344 ** 
## cylinders:horsepower     0.0173764  0.0087998   1.975  0.04903 *  
## displacement:horsepower  0.0003226  0.0001125   2.867  0.00437 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.901 on 382 degrees of freedom
## Multiple R-squared:  0.865,  Adjusted R-squared:  0.8618 
## F-statistic: 271.9 on 9 and 382 DF,  p-value: < 2.2e-16

Adding the cylinders:horsepower interaction raised the adjusted R-squared slightly. This interaction is significant (p ≈ 0.049), suggesting that the effect of horsepower on mpg changes based on both cylinders and displacement.

(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

# log, square-root, and squared transformations of displacement and horsepower
Auto_quantitative$log_displacement  <- log(Auto_quantitative$displacement)
Auto_quantitative$sqrt_displacement <- sqrt(Auto_quantitative$displacement)
Auto_quantitative$sq_displacement   <- Auto_quantitative$displacement^2

Auto_quantitative$log_horsepower  <- log(Auto_quantitative$horsepower)
Auto_quantitative$sqrt_horsepower <- sqrt(Auto_quantitative$horsepower)
Auto_quantitative$sq_horsepower   <- Auto_quantitative$horsepower^2

model_log <- lm(mpg ~ log_displacement + log_horsepower + weight + acceleration + year + origin + log_displacement:log_horsepower, data = Auto_quantitative)
summary(model_log)
## 
## Call:
## lm(formula = mpg ~ log_displacement + log_horsepower + weight + 
##     acceleration + year + origin + log_displacement:log_horsepower, 
##     data = Auto_quantitative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1112 -1.6669 -0.1139  1.4579 12.4210 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.159e+02  2.154e+01  10.026  < 2e-16 ***
## log_displacement                -3.970e+01  4.212e+00  -9.426  < 2e-16 ***
## log_horsepower                  -4.983e+01  4.705e+00 -10.589  < 2e-16 ***
## weight                          -3.424e-03  6.294e-04  -5.440 9.49e-08 ***
## acceleration                    -2.279e-01  9.362e-02  -2.435   0.0154 *  
## year                             7.506e-01  4.421e-02  16.977  < 2e-16 ***
## origin                           6.280e-01  2.599e-01   2.416   0.0162 *  
## log_displacement:log_horsepower  8.249e+00  9.040e-01   9.125  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.896 on 384 degrees of freedom
## Multiple R-squared:  0.8648, Adjusted R-squared:  0.8623 
## F-statistic: 350.9 on 7 and 384 DF,  p-value: < 2.2e-16

Taking the log of displacement and horsepower made every variable in the model significant, and the adjusted R-squared increased slightly.

par(mfrow = c(2,2))
plot(model_log)

model_sqrt <- lm(mpg ~ sqrt_displacement + sqrt_horsepower + weight + acceleration + year + origin + sqrt_displacement:sqrt_horsepower, data = Auto_quantitative)
summary(model_sqrt)
## 
## Call:
## lm(formula = mpg ~ sqrt_displacement + sqrt_horsepower + weight + 
##     acceleration + year + origin + sqrt_displacement:sqrt_horsepower, 
##     data = Auto_quantitative)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.047 -1.572 -0.115  1.433 12.452 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       36.5625343  6.3010566   5.803 1.37e-08 ***
## sqrt_displacement                 -3.0920551  0.3129674  -9.880  < 2e-16 ***
## sqrt_horsepower                   -5.5087076  0.4903298 -11.235  < 2e-16 ***
## weight                            -0.0032206  0.0006332  -5.086 5.72e-07 ***
## acceleration                      -0.2244416  0.0914854  -2.453  0.01460 *  
## year                               0.7438615  0.0441035  16.866  < 2e-16 ***
## origin                             0.6717378  0.2533701   2.651  0.00835 ** 
## sqrt_displacement:sqrt_horsepower  0.2792940  0.0250597  11.145  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.884 on 384 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8634 
## F-statistic: 354.2 on 7 and 384 DF,  p-value: < 2.2e-16

Taking the square root of displacement and horsepower also made every variable in the model significant, and the adjusted R-squared increased slightly beyond the log model.

par(mfrow = c(2,2))
plot(model_sqrt)

model_sq <- lm(mpg ~ sq_displacement + sq_horsepower + weight + acceleration + year + origin + sq_displacement:sq_horsepower, data = Auto_quantitative)
summary(model_sq)
## 
## Call:
## lm(formula = mpg ~ sq_displacement + sq_horsepower + weight + 
##     acceleration + year + origin + sq_displacement:sq_horsepower, 
##     data = Auto_quantitative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4488 -1.8420 -0.0478  1.7963 12.4703 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -1.799e+01  4.047e+00  -4.445 1.15e-05 ***
## sq_displacement               -1.639e-05  1.560e-05  -1.051    0.294    
## sq_horsepower                 -3.518e-04  8.012e-05  -4.392 1.46e-05 ***
## weight                        -5.445e-03  6.511e-04  -8.363 1.14e-15 ***
## acceleration                   4.939e-02  9.036e-02   0.547    0.585    
## year                           7.611e-01  4.821e-02  15.789  < 2e-16 ***
## origin                         1.122e+00  2.523e-01   4.447 1.14e-05 ***
## sq_displacement:sq_horsepower  2.342e-09  4.455e-10   5.258 2.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.14 on 384 degrees of freedom
## Multiple R-squared:  0.8411, Adjusted R-squared:  0.8382 
## F-statistic: 290.3 on 7 and 384 DF,  p-value: < 2.2e-16

Squaring displacement and horsepower made sq_displacement and acceleration insignificant, and the adjusted R-squared dropped below that of every model except the original mlr1.

par(mfrow = c(2,2))
plot(model_sq)

The diagnostic plots did not change substantially under any of the transformations.
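
Collecting the adjusted R-squared values side by side summarizes the comparison (a sketch):

fits <- list(base = mlr1, log = model_log, sqrt = model_sqrt, squared = model_sq)
sapply(fits, function(m) summary(m)$adj.r.squared)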

Chapter 3 Problem 10:

This question should be answered using the Carseats data set.

library(ISLR2)
attach(Carseats)
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

model1 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

From the table above, Price and USYes are significant predictors of Sales, while UrbanYes is not.

The coefficient for Price is -0.054459: holding Urban and US fixed, a one-dollar increase in price is associated with a decrease in sales of about 0.054 thousand units (roughly 54 carseats) on average.

The coefficient for Urban = Yes is -0.021916, but its p-value (0.936) gives no evidence that sales differ between urban and non-urban stores.

The coefficient for US = Yes is 1.200573: US stores sell, on average, about 1.2 thousand (roughly 1,201) more carseats than stores outside the US, holding the other predictors fixed.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\(\widehat{Sales} = 13.043 - 0.054 \cdot Price - 0.022 \cdot I(Urban = Yes) + 1.201 \cdot I(US = Yes)\), where \(I(\cdot)\) is an indicator equal to 1 when the condition holds and 0 otherwise.

(d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?

The p-values for Price and US = Yes are well below 0.05, so we can reject the null hypothesis \(H_0 : \beta_j = 0\) for those predictors; we cannot reject it for Urban.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

model_significant <- lm(Sales ~ Price + US, data = Carseats)
summary(model_significant)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

Neither model fits the data especially well: the adjusted R-squared is 0.2335 for the model in (a) and 0.2354 for the model in (e), meaning each explains only about 23% of the variance in Sales. Preferably the adjusted R-squared would be substantially higher (e.g., above 0.7).
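
Pulling the fit statistics from both models makes the comparison direct (a sketch):

data.frame(
  model  = c("Price + Urban + US", "Price + US"),
  adj_r2 = c(summary(model1)$adj.r.squared, summary(model_significant)$adj.r.squared),
  rse    = c(summary(model1)$sigma, summary(model_significant)$sigma)
)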

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(model_significant)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2,2))
plot(model_significant)

summary(influence.measures(model_significant))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00

Quite a few observations are flagged by influence.measures() as potentially influential. Several of them (e.g., observations 43, 126, 166, 175, 314, and 368) have hat values well above the average, indicating unusually high leverage, while the residual plots show no points extreme enough to be clear outliers.
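
The usual rules of thumb can be checked directly (a sketch): studentized residuals beyond ±3 suggest outliers, and hat values above 2(p+1)/n suggest high leverage.

n <- nrow(Carseats)
p <- 2  # predictors in the model from (e)
sum(abs(rstudent(model_significant)) > 3)            # count of outlier candidates
sum(hatvalues(model_significant) > 2 * (p + 1) / n)  # count of high-leverage points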

Chapter 3 Problem 12:

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

Answer:

The coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X when the sum of squared values of \(x_i\) is equal to the sum of squared values of \(y_i\) (i.e., when X and Y have the same total sum of squares).
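
Spelling this out from (3.38), the two no-intercept estimates are

\[
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{j=1}^{n} x_j^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{j=1}^{n} y_j^2},
\]

and since the numerators are identical, they coincide exactly when \(\sum_j x_j^2 = \sum_j y_j^2\).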

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)

n <- 100
X <- rnorm(n)
Y <- 3 * X + rnorm(n)

XonY <- lm(X~Y-1)
summary(XonY)
## 
## Call:
## lm(formula = X ~ Y - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63420 -0.16066  0.07099  0.18507  0.59841 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## Y  0.29684    0.01056   28.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 99 degrees of freedom
## Multiple R-squared:  0.8887, Adjusted R-squared:  0.8876 
## F-statistic: 790.6 on 1 and 99 DF,  p-value: < 2.2e-16
YonX <- lm(Y~X-1)
summary(YonX)
## 
## Call:
## lm(formula = Y ~ X - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## X   2.9939     0.1065   28.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.8887, Adjusted R-squared:  0.8876 
## F-statistic: 790.6 on 1 and 99 DF,  p-value: < 2.2e-16

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(1)

X <- rnorm(100)
Y <- X

XonY2 <- lm(X ~ Y - 1)
summary(XonY2)
## Warning in summary.lm(XonY2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = X ~ Y - 1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.888e-16 -1.689e-17  1.339e-18  3.057e-17  2.552e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## Y 1.000e+00  6.479e-18 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
YonX2<- lm(Y~X-1)
summary(YonX2)
## Warning in summary.lm(YonX2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = Y ~ X - 1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.888e-16 -1.689e-17  1.339e-18  3.057e-17  2.552e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## X 1.000e+00  6.479e-18 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
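
A less trivial construction also works (a sketch): permuting X preserves the sum of squares, so the two estimates match even though Y is not elementwise equal to X.

set.seed(1)
X <- rnorm(100)
Y <- sample(X)  # same values in a different order, so sum(X^2) == sum(Y^2)
coef(lm(Y ~ X - 1))
coef(lm(X ~ Y - 1))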