A) The KNN classifier and KNN regression methods are closely related. The KNN classifier is used when the response is qualitative (e.g., binary: yes or no; classes: high, medium, low), while KNN regression is used when the response is quantitative (e.g., income, salary, mpg).
Another difference is in how each method predicts the response. For a test observation x0, the KNN classifier identifies the K points in the training data closest to x0 (represented as N0), estimates the conditional probability of class j as the fraction of points in the neighborhood whose response equals j, and then classifies x0 to the class with the largest estimated probability (the Bayes rule applied to these estimates). This is represented as: \[ \Pr(Y=j \mid X=x_0)= \frac{1}{K}\sum_{i\in N_0}I(y_i=j) \] KNN regression selects the K neighboring points (represented as N0) the same way, but the estimate for the test observation is the average of the responses of the K points in the neighborhood: \[ \hat{f}(x_0)=\frac{1}{K}\sum_{i\in N_0}y_i \]
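To make the two prediction rules concrete, here is a minimal R sketch (not part of the original answer; knn_predict is a hypothetical helper, and X is assumed to be a numeric feature matrix):
knn_predict <- function(x0, X, y, K, type = c("classification", "regression")) {
  type <- match.arg(type)
  # Euclidean distances from x0 to every training point, then the K nearest (N0)
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))
  N0 <- order(d)[1:K]
  if (type == "regression") {
    mean(y[N0])                     # average response over the neighborhood
  } else {
    names(which.max(table(y[N0])))  # class with the largest estimated probability
  }
}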
A)
my_Auto <- Auto
pairs(my_Auto)
A)
# get the column index of "name", since we don't want it in the correlation matrix
colnum <- -which(colnames(my_Auto) == "name")
# Correlation excluding "name"
cor(my_Auto[ ,colnum])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
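Since mpg is the response of interest, the predictors can also be ranked by the absolute value of their correlation with mpg (a small sketch, not part of the original output; mpg's correlation with itself is 1):
# rank variables by absolute correlation with mpg
sort(abs(cor(my_Auto[, colnum])[, "mpg"]), decreasing = TRUE)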
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
A)
# fit multiple linear model excluding "name"
# Convert "origin" as factor with labels as in the data description
my_Auto$origin <- as.factor(my_Auto$origin)
# labels are "American","European","Japanese"
lm_fit_x_name <- lm(mpg ~ .-name, data = my_Auto)
summary(lm_fit_x_name)
##
## Call:
## lm(formula = mpg ~ . - name, data = my_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## origin2 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## origin3 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
A) Before we comment on the output, we need to set up our hypotheses:
H0: There is no relationship between the predictors and the response (mpg), i.e., all β's are zero.
Ha: At least one predictor has a relationship with the response (mpg), i.e., at least one β is non-zero.
The F-test answers this. Assuming a significance level (alpha) of 0.05, the p-value for the F-test is less than alpha, so we reject H0, indicating that there is a relationship between the predictors and the response.
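As a check, the F-statistic and its p-value can be pulled directly from the fitted model (a small sketch, not part of the original):
# recompute the overall F-test p-value from the summary object
fstat <- summary(lm_fit_x_name)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)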
A) To answer this, we look at the t-test results in the output. Assuming a significance level (alpha) of 0.05, we set up hypotheses for each predictor, for example:
H0: Predictor "displacement" has no relationship with the response (mpg), i.e., βdisplacement = 0.
Ha: Predictor "displacement" has a relationship with the response (mpg), i.e., βdisplacement ≠ 0.
The output shows that the t-test p-value for displacement is less than alpha (0.05), indicating that displacement has a significant relationship with mpg; we reject H0, so the coefficient for displacement is not zero. In addition to displacement, the following are also significant: weight, year, and origin.
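The significant terms can also be listed programmatically from the coefficient table (sketch, not part of the original):
# list terms with t-test p-values below alpha = 0.05
ct <- summary(lm_fit_x_name)$coefficients
rownames(ct)[ct[, "Pr(>|t|)"] < 0.05]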
A) From the output above, βyear = 0.78, which can be interpreted as follows: for every one-unit (one-year) increase in year, average mpg increases by about 0.78, holding all other predictors constant. In other words, this is the year-over-year change in average mpg with everything else held fixed.
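This interpretation can be verified with predict() on two hypothetical observations that differ only in year (sketch, not part of the original):
# duplicate one observation, increase year by 1, and compare predictions
new_obs <- my_Auto[c(1, 1), ]
new_obs$year[2] <- new_obs$year[1] + 1
diff(predict(lm_fit_x_name, newdata = new_obs))  # ~0.777, the year coefficient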
par(mfrow = c(2,2))
plot(lm_fit_x_name)
# Influential points identification
inf_pts_all <- which(cooks.distance(lm_fit_x_name) > 0.015)
inf_pts_all
## 14 45 112 156 167 245 276 278 310 323 326 327 328 330 335 387 394
## 14 44 111 154 165 243 274 276 308 321 324 325 326 328 332 382 389
par(mfrow=c(1,1)) # reset back
1) Zooming in on the 'Residuals vs Fitted' plot, we can see non-linearity, as the residuals show a pattern.
2) There is also heteroscedasticity: the variance of the residuals increases with the fitted values.
3) The normal Q-Q plot shows a right-tailed deviation with some outliers, rather than a perfect (conditionally) normal distribution.
4) The 'Residuals vs Leverage' plot shows that observation 14 has high leverage.
5) Above is the list of influential points (using a Cook's distance cutoff of 0.015), which need to be examined.
# We exclude the predictor name (column 9) before fitting interaction effects
lm_fit_itrt <- lm(mpg ~ . * ., data = my_Auto[, -9])
# alternate syntax for adding specific interaction terms
lm_fit_itrt2 <- lm(mpg ~ . + cylinders:acceleration + acceleration:year, data = my_Auto[, -9])
summary(lm_fit_itrt)
##
## Call:
## lm(formula = mpg ~ . * ., data = my_Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6008 -1.2863 0.0813 1.2082 12.0382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+01 5.147e+01 0.855 0.393048
## cylinders 3.302e+00 8.187e+00 0.403 0.686976
## displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
## horsepower 5.312e-01 3.390e-01 1.567 0.117970
## weight -3.259e-03 1.820e-02 -0.179 0.857980
## acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
## year 4.833e-01 5.923e-01 0.816 0.415119
## origin2 -3.517e+01 1.260e+01 -2.790 0.005547 **
## origin3 -3.765e+01 1.426e+01 -2.640 0.008661 **
## cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
## cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
## cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
## cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
## cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
## cylinders:origin2 -7.210e-01 1.088e+00 -0.662 0.508100
## cylinders:origin3 1.226e+00 1.007e+00 1.217 0.224379
## displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
## displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
## displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
## displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
## displacement:origin2 -3.364e-02 4.220e-02 -0.797 0.425902
## displacement:origin3 5.375e-02 4.145e-02 1.297 0.195527
## horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
## horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
## horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
## horsepower:origin2 -4.869e-03 5.061e-02 -0.096 0.923408
## horsepower:origin3 2.289e-02 6.252e-02 0.366 0.714533
## weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
## weight:year -8.065e-05 2.184e-04 -0.369 0.712223
## weight:origin2 2.277e-03 2.685e-03 0.848 0.397037
## weight:origin3 -4.498e-03 3.481e-03 -1.292 0.197101
## acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
## acceleration:origin2 9.234e-01 2.641e-01 3.496 0.000531 ***
## acceleration:origin3 7.159e-01 3.258e-01 2.198 0.028614 *
## year:origin2 2.932e-01 1.444e-01 2.031 0.043005 *
## year:origin3 3.139e-01 1.483e-01 2.116 0.035034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
## F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
# summary(lm_fit_itrt2)
A) Shown above are two different syntaxes for specifying interactions within lm(). The output shows that adding interaction effects among the predictors increased the model-fit metric (adjusted R2) from 0.8205 to 0.8866.
Using alpha = 0.05 and the t-statistic p-values, the following interactions appear significant: cylinders:acceleration, acceleration:year, acceleration:origin2, acceleration:origin3, year:origin2, and year:origin3.
Note: origin2 is "European", origin3 is "Japanese".
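These terms can also be extracted programmatically (sketch, not part of the original):
# pull the significant interaction terms from the coefficient table
ct2 <- summary(lm_fit_itrt)$coefficients
grep(":", rownames(ct2)[ct2[, "Pr(>|t|)"] < 0.05], value = TRUE)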
A) Square transformation
# Square transformations of acceleration, displacement, and horsepower
lm_square_fit <- lm(mpg ~ . - name + I(acceleration^2) + I(displacement^2) + I(horsepower^2), data = my_Auto)
summary(lm_square_fit)
##
## Call:
## lm(formula = mpg ~ . - name + I(acceleration^2) + I(displacement^2) +
##     I(horsepower^2), data = my_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6033 -1.5451 -0.0509 1.5556 11.8887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.754e+00 6.328e+00 1.225 0.2212
## cylinders 7.415e-01 3.116e-01 2.380 0.0178 *
## displacement -7.040e-02 1.672e-02 -4.211 3.18e-05 ***
## horsepower -2.222e-01 3.947e-02 -5.630 3.51e-08 ***
## weight -2.907e-03 6.949e-04 -4.184 3.57e-05 ***
## acceleration -1.365e+00 5.540e-01 -2.463 0.0142 *
## year 7.485e-01 4.594e-02 16.293 < 2e-16 ***
## origin2 5.266e-01 5.538e-01 0.951 0.3423
## origin3 1.143e+00 5.393e-01 2.119 0.0347 *
## I(acceleration^2) 3.342e-02 1.612e-02 2.073 0.0388 *
## I(displacement^2) 1.168e-04 2.872e-05 4.066 5.83e-05 ***
## I(horsepower^2) 5.803e-04 1.371e-04 4.232 2.91e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.922 on 380 degrees of freedom
## Multiple R-squared: 0.8637, Adjusted R-squared: 0.8598
## F-statistic: 219 on 11 and 380 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_square_fit)
A) Square root transformation
lm_sqr_fit <- lm(mpg ~ . - name + I(sqrt(acceleration)) + I(displacement^2) + I(sqrt(horsepower)), data = my_Auto)
summary(lm_sqr_fit)
##
## Call:
## lm(formula = mpg ~ . - name + I(sqrt(acceleration)) + I(displacement^2) +
## I(sqrt(horsepower)), data = my_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.402 -1.549 0.013 1.499 11.857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.840e+01 1.692e+01 2.861 0.004463 **
## cylinders 5.175e-01 3.139e-01 1.648 0.100079
## displacement -6.184e-02 1.730e-02 -3.574 0.000397 ***
## horsepower 2.346e-01 6.608e-02 3.550 0.000433 ***
## weight -2.993e-03 6.877e-04 -4.352 1.74e-05 ***
## acceleration 1.172e+00 9.877e-01 1.186 0.236314
## year 7.504e-01 4.578e-02 16.391 < 2e-16 ***
## origin2 5.371e-01 5.496e-01 0.977 0.329039
## origin3 1.129e+00 5.370e-01 2.102 0.036214 *
## I(sqrt(acceleration)) -1.155e+01 7.986e+00 -1.447 0.148771
## I(displacement^2) 1.035e-04 2.973e-05 3.480 0.000559 ***
## I(sqrt(horsepower)) -6.615e+00 1.433e+00 -4.615 5.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.915 on 380 degrees of freedom
## Multiple R-squared: 0.8645, Adjusted R-squared: 0.8605
## F-statistic: 220.3 on 11 and 380 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_sqr_fit)
A) Log transformation
lm_log_fit <- lm(log(mpg) ~ . - name + I(log(acceleration)), data = my_Auto)
summary(lm_log_fit)
##
## Call:
## lm(formula = log(mpg) ~ . - name + I(log(acceleration)), data = my_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40961 -0.06432 0.00662 0.07307 0.36336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.816e+00 5.512e-01 6.923 1.88e-11 ***
## cylinders -2.079e-02 1.141e-02 -1.822 0.069192 .
## displacement 3.351e-04 2.914e-04 1.150 0.250915
## horsepower -2.078e-03 5.008e-04 -4.148 4.13e-05 ***
## weight -2.229e-04 2.516e-05 -8.860 < 2e-16 ***
## acceleration 6.767e-02 1.761e-02 3.843 0.000143 ***
## year 3.033e-02 1.818e-03 16.685 < 2e-16 ***
## origin2 6.436e-02 2.055e-02 3.131 0.001876 **
## origin3 7.499e-02 1.946e-02 3.853 0.000137 ***
## I(log(acceleration)) -1.163e+00 2.907e-01 -4.000 7.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.116 on 382 degrees of freedom
## Multiple R-squared: 0.8862, Adjusted R-squared: 0.8836
## F-statistic: 330.7 on 9 and 382 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_log_fit)
Comments: The square, square-root, and log transformations all improved the R-squared metric, and the residuals-vs-fitted diagnostic plots show much less pattern (closer to linearity).
As shown in the last set of plots, for the log transformation the linearity (residuals vs. fitted plot) and conditional normality (normal Q-Q plot) assumptions are close to being satisfied.
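For a side-by-side view, the adjusted R2 of each fit can be collected in one call (sketch, not part of the original; note the log model's R2 is on the log(mpg) scale and is not strictly comparable to the others):
# compare adjusted R-squared across the fitted models
sapply(list(base = lm_fit_x_name, square = lm_square_fit,
            sqrt = lm_sqr_fit, log = lm_log_fit),
       function(m) summary(m)$adj.r.squared)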
Fit a multiple regression model on the Carseats data set to predict Sales using Price, Urban, and US.
A)
my_carseats <- Carseats
lm_cseat_fit <- lm(Sales~Price+Urban+US,data = my_carseats)
summary(lm_cseat_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = my_carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
A) Before interpreting the coefficients above, we need to understand the coding that R used for the categorical variables:
contrasts(my_carseats$US)
## Yes
## No 0
## Yes 1
contrasts(my_carseats$Urban)
## Yes
## No 0
## Yes 1
coef(lm_cseat_fit)
## (Intercept) Price UrbanYes USYes
## 13.04346894 -0.05445885 -0.02191615 1.20057270
Coefficient interpretation:
βPrice = -0.054459: for every one-unit increase in price, average sales decrease by about 0.054 units, holding the other predictors constant.
βUrbanYes = -0.021916: this predictor is not significant, but the interpretation is that, on average, urban locations (coded 'Yes') have about 0.022 units lower sales than non-urban locations, holding the other predictors constant.
βUSYes = 1.2006: on average, US locations (coded 'Yes') have about 1.2006 units higher sales than non-US locations (coded 'No'), holding the other predictors constant.
A) Note: non-significant terms are also included.
Sales(hat) ≈ 13.0434 - (0.054 * Price) - (0.0219 * Urban) + (1.2006 * US)
If Urban = 'No' (coded as 0) and US = 'No' (coded as 0):
Sales(hat) ≈ 13.0434 - (0.054 * Price)
If Urban = 'Yes' (coded as 1) and US = 'No' (coded as 0):
Sales(hat) ≈ (13.0434 - 0.0219) - (0.054 * Price)
If Urban = 'No' (coded as 0) and US = 'Yes' (coded as 1):
Sales(hat) ≈ (13.0434 + 1.2006) - (0.054 * Price)
If Urban = 'Yes' (coded as 1) and US = 'Yes' (coded as 1):
Sales(hat) ≈ (13.0434 - 0.0219 + 1.2006) - (0.054 * Price)
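The four intercepts can be checked by predicting at Price = 0 for each Urban/US combination (sketch, not part of the original):
# intercepts for the four Urban/US combinations via predict()
grid <- expand.grid(Price = 0, Urban = c("No", "Yes"), US = c("No", "Yes"))
cbind(grid[, -1], intercept = predict(lm_cseat_fit, newdata = grid))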
A) Assuming a significance level of 0.05: the t-test p-values for Price and US are below the significance level, so for these predictors we reject H0 (βPrice = 0 and βUS = 0, respectively).
Based on the model fitted in (a), the significant predictors are Price and US; Urban is not.
A) Removed Urban from the model due to its non-significance.
lm_sel_fit <- lm(Sales ~ Price+US,data = my_carseats )
summary(lm_sel_fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = my_carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
A) Looking at the model-fit measures:
Although model (e) has the same R2 of 23.93% as model (a), its adjusted R2 (0.2354 vs. 0.2335) and RSE (2.469 vs. 2.472) are slightly better, so model (e) fits marginally better than model (a).
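Because model (e) is nested inside model (a), the two can also be compared with a formal F-test (sketch, not part of the original):
# F-test comparing the reduced model (e) against the full model (a)
anova(lm_sel_fit, lm_cseat_fit)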
confint(lm_sel_fit,level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
A) To check for outliers, I used studentized residuals above 2.8. The diagnostic plots also point to observation 377 as an outlier, and this is confirmed below:
par(mfrow = c(2,2))
plot(lm_sel_fit)
par(mfrow = c(1,1))
df_plt <- data.frame(fit = predict(lm_sel_fit), stud = rstudent(lm_sel_fit))
df_plt <- cbind(obs=rownames(df_plt),df_plt)
which(df_plt$stud>2.8)
## [1] 377
# hchart() and hcaes() come from the highcharter package (assumed loaded)
d_hc <- hchart(
  df_plt,
  "scatter",
  hcaes(x = fit, y = stud)
)
d_hc %>%
hc_add_theme(hc_theme_economist()) %>%
hc_xAxis(title = list(text = "Fitted Values")) %>%
hc_yAxis(title = list(text = "Studentized Residuals")) %>%
hc_title(text = "Identifying Outliers with Studentized residuals above +/-2.8")
The outlier identified by the studentized residuals is observation 377.
For high-leverage points, I used the hatvalues() function; observation 43 has the highest leverage:
which.max(hatvalues(lm_sel_fit))
## 43
## 43
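A hat value is usually considered large when it well exceeds the average leverage (p + 1)/n (sketch, not part of the original):
# compare the largest hat value with the average leverage (p + 1)/n
max(hatvalues(lm_sel_fit))                    # leverage of observation 43
length(coef(lm_sel_fit)) / nrow(my_carseats)  # (p + 1)/n = 3/400 = 0.0075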
A) Referring to the simple linear regression slope from ISLR Chapter 3 (equation 3.4):
\[
\hat{\beta}_1 = \frac{\sum_{i = 1}^{n}(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i = 1}^{n}(x_i-\overline{x})^2}
\]
Regressing Y on X without an intercept gives
\[ \hat{y}_i=\hat{\beta}_1x_i \]
Assuming x and y are centered, so that
\[ \overline{x}=0;\quad\overline{y}=0 \]
the slope reduces to
\[ \hat{\beta}_1 = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{i = 1}^{n}x_i^2} \quad \text{(equation a)} \]
Regressing X on Y gives
\[ \hat{x}_i=\hat{\beta}_2y_i, \qquad \hat{\beta}_2 = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{i = 1}^{n}y_i^2} \quad \text{(equation b)} \]
For (a) to equal (b), we need:
\[ \sum_{i = 1}^{n}y_i^2=\sum_{i = 1}^{n}x_i^2 \]
A) I simulated 100 observations from a normal distribution. The sum of x^2 (83.30737) and the sum of y^2 (1384.24) are not equal, so based on the conclusion in 12(a) we expect the two coefficient estimates to differ.
# simulated data
set.seed(123)
x <- rnorm(100)
y <- 4*x+rnorm(100)
# end of simulated data: no intercept, error term with mean zero
df_not_equal <- data.frame(x = x * x, y = y * y)  # columns hold x_i^2 and y_i^2
df_x2_y2 <- df_not_equal %>% summarise( x = sum(x),
y = sum(y),
.groups = "drop")
df_x2_y2
## x y
## 1 83.30737 1384.24
lm_y_x <- lm(y~x+0) #Regressing Y onto X
summary(lm_y_x)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0010 -0.7901 -0.1800 0.4693 3.1762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 3.9364 0.1064 36.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9713 on 99 degrees of freedom
## Multiple R-squared: 0.9325, Adjusted R-squared: 0.9319
## F-statistic: 1368 on 1 and 99 DF, p-value: < 2.2e-16
lm_x_y <- lm(x~y+0) #Regressing X onto Y
summary(lm_x_y)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82117 -0.12311 0.04998 0.18452 0.52946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.236902 0.006404 36.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2383 on 99 degrees of freedom
## Multiple R-squared: 0.9325, Adjusted R-squared: 0.9319
## F-statistic: 1368 on 1 and 99 DF, p-value: < 2.2e-16
\[ {\sum_{i = 1}^{n}(x_i)^2}=83.30737 \]
\[ {\sum_{i = 1}^{n}(y_i)^2}=1384.24 \]
As per the results above, coefficient beta 1 = 3.9364 and coefficient beta 2 = 0.2369:
\[ \hat{\beta}_1 = 3.9364 \] \[ \hat{\beta}_2 = 0.2369 \]
# simulated data
set.seed(123)
x <- rnorm(100)
y <- rev(x)  # reverse x: the sums of squares match, but the x-y pairing differs
# end of simulated data without intercept
df_equal <- data.frame(x=(x*x),y= y*y)
df_x2_y2 <- df_equal %>% summarise( x = sum(x),
y = sum(y),
.groups = "drop")
df_x2_y2
## x y
## 1 83.30737 83.30737
lm_y_x <- lm(y~x+0) #Regressing Y onto X
summary(lm_y_x)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1108 -0.5523 0.0173 0.7728 2.4388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.17425 0.09897 1.761 0.0814 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9033 on 99 degrees of freedom
## Multiple R-squared: 0.03036, Adjusted R-squared: 0.02057
## F-statistic: 3.1 on 1 and 99 DF, p-value: 0.08137
lm_x_y <- lm(x~y+0) #Regressing X onto Y
summary(lm_x_y)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1108 -0.5523 0.0173 0.7728 2.4388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.17425 0.09897 1.761 0.0814 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9033 on 99 degrees of freedom
## Multiple R-squared: 0.03036, Adjusted R-squared: 0.02057
## F-statistic: 3.1 on 1 and 99 DF, p-value: 0.08137
\[ {\sum_{i = 1}^{n}(x_i)^2}=83.30737 \]
\[ {\sum_{i = 1}^{n}(y_i)^2}=83.30737 \]
As per the results above, coefficient beta 1 = 0.17425 and coefficient beta 2 = 0.17425:
\[ \hat{\beta}_1 = 0.17425 \] \[ \hat{\beta}_2 = 0.17425 \]
After regressing Y onto X and vice versa, the two coefficient estimates are identical (0.17425), confirming the result from 12(a).