pnt130_ Assignment2

Exercise 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

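Answer: Both the KNN classifier and KNN regression base their prediction on the K nearest neighbors of a test point. The difference is in the response and in how the neighbors are combined. The KNN classifier is used for a qualitative response: it predicts the category that is most common among the K nearest neighbors. KNN regression is used for a quantitative response: it predicts a value by averaging the response values of the K nearest neighbors.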
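The difference described above can be sketched in a few lines of base R. This is a minimal illustration with one numeric predictor and made-up toy data; the function names (knn_neighbors, knn_classify, knn_regress) are hypothetical helpers, not part of any package.

```r
# Indices of the k training points closest to x0 (one numeric predictor).
knn_neighbors <- function(x_train, x0, k) {
  order(abs(x_train - x0))[seq_len(k)]
}

# Classification: majority vote among the k nearest neighbors.
# Ties are broken by which.max(), i.e. the first label in table() order.
knn_classify <- function(x_train, y_train, x0, k) {
  idx <- knn_neighbors(x_train, x0, k)
  votes <- table(y_train[idx])
  names(votes)[which.max(votes)]
}

# Regression: average response of the k nearest neighbors.
knn_regress <- function(x_train, y_train, x0, k) {
  idx <- knn_neighbors(x_train, x0, k)
  mean(y_train[idx])
}

# Toy data (assumed for illustration only).
x_train <- c(1, 2, 3, 6, 7, 8)
class_labels <- c("low", "low", "low", "high", "high", "high")
numeric_response <- c(1.0, 1.5, 2.0, 6.5, 7.0, 9.0)

pred_class <- knn_classify(x_train, class_labels, x0 = 2.4, k = 3)
pred_value <- knn_regress(x_train, numeric_response, x0 = 2.4, k = 3)
print(pred_class)  # "low"
print(pred_value)  # 1.5
```

The same three neighbors (x = 1, 2, 3) are used in both cases; only the way their responses are combined differs.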

Exercise 9

This question involves the use of multiple linear regression on the Auto data set.

Exercise 9 (a)

Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR)
data(Auto)

pairs(Auto)

Exercise 9 (b)

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Auto_quant <- Auto[, -which(names(Auto) == "name")]

cor_matrix = cor(Auto_quant)
cor_matrix

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Exercise 9 (c)

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

Answer:

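Is there a relationship between the predictors and the response? => Yes. The F-statistic is 252.4 with a p-value < 2.2e-16, so we reject the null hypothesis that all coefficients are zero.
Which predictors appear to have a statistically significant relationship to the response? => displacement, weight, year, and origin (all with p-values below 0.01). cylinders, horsepower, and acceleration are not significant.
What does the coefficient for the year variable suggest? => Holding the other predictors fixed, mpg increases by about 0.75 per model year.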

lm_model <- lm(mpg~.-name, data=Auto)
summary(lm_model)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
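The p-value on the last line of the summary can be checked by hand from the F-statistic and its degrees of freedom. The sketch below uses the numbers printed above (252.4 on 7 and 384 DF) rather than the fitted object, so it runs on its own; with the model available, the same three values are stored in summary(lm_model)$fstatistic.

```r
# Values copied from the summary output above.
f_stat <- 252.4
df_num <- 7     # number of predictors
df_den <- 384   # residual degrees of freedom

# Upper-tail probability of the F distribution = p-value of the overall F-test.
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
print(p_value < 2.2e-16)  # TRUE, consistent with "p-value: < 2.2e-16"
```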

Exercise 9 (d)

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.

Do the residual plots suggest any unusually large outliers? => In the Residuals vs Fitted plot, some points at high fitted mpg values have large residuals.

Does the leverage plot identify any observations with unusually high leverage? => In the Residuals vs Leverage plot, some points have high leverage values.

plot(lm_model)
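Reading outliers and leverage off the plots can be backed up with numbers. The sketch below is one possible check, not a definitive rule: flag_unusual is a hypothetical helper, the cutoff of 3 for studentized residuals follows the usual rule of thumb, and the multiplier of 3 on the average leverage (p + 1) / n is an assumption. It is demonstrated on simulated data so that it runs without the Auto data.

```r
# Hypothetical helper: flag possible outliers and high-leverage points.
flag_unusual <- function(model, resid_cutoff = 3, leverage_mult = 3) {
  r <- rstudent(model)    # studentized residuals
  h <- hatvalues(model)   # leverage statistics
  avg_leverage <- length(coef(model)) / length(h)  # (p + 1) / n
  list(
    outliers = which(abs(r) > resid_cutoff),
    high_leverage = which(h > leverage_mult * avg_leverage)
  )
}

# Simulated stand-in data (not the Auto data).
set.seed(1)
sim <- data.frame(x = c(rnorm(49), 10))  # observation 50 has an extreme x
sim$y <- 1 + 2 * sim$x + rnorm(50)
sim$y[1] <- sim$y[1] + 15                # observation 1 gets an extreme y

fit <- lm(y ~ x, data = sim)
flags <- flag_unusual(fit)
print(flags)
```

For the fit in this exercise the call would be flag_unusual(lm_model).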

Exercise 9 (e)

Use the * and : symbols to fit linear regression models with interaction effects.

* symbol => The coefficient of year:weight is -4.579e-04, with a t-value of -7.752 and a p-value of 8.02e-14. The interaction between year and weight has a statistically significant effect on mpg.

model_lm_inter <- lm(mpg ~ year * weight, data = Auto)
summary(model_lm_inter)

## 
## Call:
## lm(formula = mpg ~ year * weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0397 -1.9956 -0.0983  1.6525 12.9896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.105e+02  1.295e+01  -8.531 3.30e-16 ***
## year         2.040e+00  1.718e-01  11.876  < 2e-16 ***
## weight       2.755e-02  4.413e-03   6.242 1.14e-09 ***
## year:weight -4.579e-04  5.907e-05  -7.752 8.02e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.193 on 388 degrees of freedom
## Multiple R-squared:  0.8339, Adjusted R-squared:  0.8326 
## F-statistic: 649.3 on 3 and 388 DF,  p-value: < 2.2e-16

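: symbol => The coefficient of year:weight is -9.882e-05, with a t-value of -24.07 and a p-value < 2e-16. This model contains only the interaction term and omits the main effects of year and weight, so its coefficient is not directly comparable to the one from the * model, and its R-squared is lower (0.5977 vs 0.8339).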

model_lm_inter2 <- lm(mpg ~ year : weight, data = Auto)
summary(model_lm_inter2)

## 
## Call:
## lm(formula = mpg ~ year:weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3849  -3.3041  -0.5901   2.6158  17.5737 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.571e+01  9.581e-01   47.71   <2e-16 ***
## year:weight -9.882e-05  4.105e-06  -24.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.957 on 390 degrees of freedom
## Multiple R-squared:  0.5977, Adjusted R-squared:  0.5967 
## F-statistic: 579.4 on 1 and 390 DF,  p-value: < 2.2e-16

Do any interactions appear to be statistically significant? => Yes, the year:weight interaction is statistically significant in both models.
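The two models above differ because * and : expand differently in an R formula. A quick way to see this, without fitting anything, is to inspect the term labels of each formula:

```r
# year * weight expands to both main effects plus the interaction.
terms_star <- attr(terms(mpg ~ year * weight), "term.labels")

# year:weight is the interaction term only, with no main effects.
terms_colon <- attr(terms(mpg ~ year:weight), "term.labels")

print(terms_star)   # "year" "weight" "year:weight"
print(terms_colon)  # "year:weight"
```

So lm(mpg ~ year + weight + year:weight, data = Auto) would fit the same model as the * version.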

Exercise 9 (f)

Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

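=> log(weight) fits mpg better than the untransformed weight: R-squared is 0.7127 vs 0.6926, the F-statistic is 967.3 vs 878.8, and the t-value is -31.10 vs -29.64. sqrt(weight) falls in between (R-squared 0.7058), and weight^2 fits worst (R-squared 0.6507).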

# log(x)
model_log <- lm(mpg ~ log(weight), data = Auto)

# sqrt(x)
model_sqrt <- lm(mpg ~ sqrt(weight), data = Auto)

# x^2 (I() keeps ^ as arithmetic inside the formula)
model_sq <- lm(mpg ~ I(weight^2), data = Auto)

# Original (untransformed)
model_orig <- lm(mpg ~ weight, data = Auto)

summary(model_log)

## 
## Call:
## lm(formula = mpg ~ log(weight), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4315  -2.6752  -0.2888   1.9429  16.0136 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 209.9433     6.0002   34.99   <2e-16 ***
## log(weight) -23.4317     0.7534  -31.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.189 on 390 degrees of freedom
## Multiple R-squared:  0.7127, Adjusted R-squared:  0.7119 
## F-statistic: 967.3 on 1 and 390 DF,  p-value: < 2.2e-16

summary(model_sqrt)

## 
## Call:
## lm(formula = mpg ~ sqrt(weight), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2402  -2.9005  -0.3708   2.0791  16.2296 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.67218    1.52649   45.64   <2e-16 ***
## sqrt(weight) -0.85560    0.02797  -30.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.239 on 390 degrees of freedom
## Multiple R-squared:  0.7058, Adjusted R-squared:  0.705 
## F-statistic: 935.4 on 1 and 390 DF,  p-value: < 2.2e-16

summary(model_sq)

## 
## Call:
## lm(formula = mpg ~ I(weight^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2813  -3.1744  -0.4708   2.2708  17.2506 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.447e+01  4.708e-01   73.22   <2e-16 ***
## I(weight^2) -1.150e-06  4.266e-08  -26.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.619 on 390 degrees of freedom
## Multiple R-squared:  0.6507, Adjusted R-squared:  0.6498 
## F-statistic: 726.6 on 1 and 390 DF,  p-value: < 2.2e-16

summary(model_orig)

## 
## Call:
## lm(formula = mpg ~ weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9736  -2.7556  -0.3358   2.1379  16.5194 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.216524   0.798673   57.87   <2e-16 ***
## weight      -0.007647   0.000258  -29.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.333 on 390 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6918 
## F-statistic: 878.8 on 1 and 390 DF,  p-value: < 2.2e-16
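Comparing four summaries by eye is error-prone, so the key fit statistics can be collected into one table. The sketch below is one way to do that; compare_fits is a hypothetical helper, and the data are simulated stand-ins (not the Auto data) so that the block runs on its own.

```r
# Hypothetical helper: one row per model with R-squared and residual standard error.
compare_fits <- function(models) {
  data.frame(
    model = names(models),
    r_squared = sapply(models, function(m) summary(m)$r.squared),
    rse = sapply(models, function(m) summary(m)$sigma),
    row.names = NULL
  )
}

# Simulated stand-in data in which y is linear in log(x).
set.seed(1)
x <- runif(200, 1500, 5000)
y <- 210 - 23 * log(x) + rnorm(200, sd = 4)

fits <- list(
  original = lm(y ~ x),
  log = lm(y ~ log(x)),
  sqrt = lm(y ~ sqrt(x)),
  squared = lm(y ~ I(x^2))
)

result <- compare_fits(fits)
print(result[order(-result$r_squared), ])
```

With the models from this exercise, the call would be compare_fits(list(original = model_orig, log = model_log, sqrt = model_sqrt, squared = model_sq)).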

Exercise 10

This question should be answered using the Carseats data set.

Exercise 10 (a)

Fit a multiple regression model to predict Sales using Price, Urban, and US.

data("Carseats")
#head(Carseats)

model_Carseats = lm(Sales ~ Price + Urban + US, data=Carseats)
summary(model_Carseats)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Exercise 10 (b)

Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

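- Price: the coefficient is -0.0545 (t-value -10.389, p-value < 2e-16). Holding Urban and US fixed, each one-unit increase in Price is associated with a decrease in Sales of about 0.054 units. Sales is recorded in thousands of units in the Carseats data, so this is roughly 54 car seats.
- UrbanYes: the coefficient is -0.0219 with a p-value of 0.936, so there is no statistically significant difference in Sales between urban and non-urban stores.
- USYes: the coefficient is 1.2006 with a p-value of 4.86e-06, so it is statistically significant. Holding the other predictors fixed, stores in the US have Sales about 1.2 units higher than stores outside the US.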

Exercise 10 (c)

Write out the model in equation form, being careful to handle the qualitative variables properly.

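Sales = 13.0435 - 0.0545*Price - 0.0219*UrbanYes + 1.2006*USYes

where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.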
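To make the handling of the qualitative variables concrete, the equation can be evaluated by hand. In the sketch below, predict_sales is a hypothetical helper that uses the coefficient estimates printed in part (a); the comparisons (urban == "Yes") and (us == "Yes") play the role of the 0/1 dummy variables.

```r
# Hypothetical helper built from the estimates in the summary output of part (a).
predict_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * (urban == "Yes") +  # dummy: 1 if urban, 0 otherwise
    1.200573 * (us == "Yes")       # dummy: 1 if in the US, 0 otherwise
}

# Example: an urban US store charging a price of 100.
example_sales <- predict_sales(price = 100, urban = "Yes", us = "Yes")
print(example_sales)  # 8.776226

# The same store outside the US differs by exactly the USYes coefficient.
us_gap <- example_sales - predict_sales(price = 100, urban = "Yes", us = "No")
print(us_gap)         # 1.200573
```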

Exercise 10 (d)

For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Price and USYes have very small p-values, so for those predictors we can reject the null hypothesis H0 : βj = 0. For UrbanYes (p-value 0.936) we cannot reject it.

Exercise 10 (e)

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

model_Carseats_smaller = lm(Sales ~ Price + US, data=Carseats)
summary(model_Carseats_smaller)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Exercise 10 (f)

How well do the models in (a) and (e) fit the data?

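Both models explain about 24% of the variance in Sales (R-squared = 0.2393), so neither fits the data particularly well. The smaller model is slightly better: it has a higher adjusted R-squared (0.2354 vs 0.2335), a lower residual standard error (2.469 vs 2.472), and a higher F-statistic (62.43 vs 41.52).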
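Because the two models differ only by the single predictor Urban, the formal comparison between them is a partial F-test, and for one dropped coefficient the partial F-statistic equals the square of that coefficient's t-value. The sketch below reproduces this from the rounded t-value printed in part (a), so the result only approximately matches the reported p-value; with both fits available, anova(model_Carseats_smaller, model_Carseats) performs the same test.

```r
# Values copied from the summary of the larger model in part (a).
t_urban <- -0.081   # t-value of UrbanYes (rounded in the output)
df_resid <- 396     # residual degrees of freedom of the larger model

# Partial F-statistic for dropping one predictor is t^2 on (1, df_resid) DF.
f_partial <- t_urban^2
p_partial <- pf(f_partial, 1, df_resid, lower.tail = FALSE)

# Equivalent two-sided t-test p-value, for comparison.
p_from_t <- 2 * pt(-abs(t_urban), df_resid)

print(round(p_partial, 2))  # 0.94
```

A p-value this large means dropping Urban does not make the fit significantly worse.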

Exercise 10 (g)

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(model_Carseats_smaller)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
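The intervals returned by confint() follow the form estimate ± t-quantile × standard error. As a check, the sketch below rebuilds the Price interval from the rounded estimate and standard error printed in part (e); because of that rounding it matches the confint() output only to about five decimal places.

```r
# Values copied from the summary of the smaller model in part (e).
est_price <- -0.05448
se_price <- 0.00523
df_resid <- 397

# 97.5% quantile of the t distribution gives a two-sided 95% interval.
t_crit <- qt(0.975, df_resid)
ci_price <- c(lower = est_price - t_crit * se_price,
              upper = est_price + t_crit * se_price)
print(round(ci_price, 4))  # lower -0.0648, upper -0.0442
```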

Exercise 10 (h)

Is there evidence of outliers or high leverage observations in the model from (e)?

In the Q-Q plot of the residuals, some points have large standardized residuals. In the Residuals vs Leverage plot, some points have unusually high leverage.

plot(model_Carseats_smaller)

Exercise 12

This problem involves simple linear regression without an intercept.

Exercise 12 (a)

Recall that the coefficient estimate β for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

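Without an intercept, the estimate for the regression of Y onto X is Σ x_i y_i / Σ x_i^2, and the estimate for the regression of X onto Y is Σ x_i y_i / Σ y_i^2. The numerators are identical, so the two estimates are the same when the sums of squares are equal, Σ x_i^2 = Σ y_i^2 (or, trivially, when Σ x_i y_i = 0 and both estimates are zero).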

Exercise 12 (b)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

n <- 100

x <- rnorm(n)
y <- 2 * x + rnorm(n)  # Slope of 2 plus noise, so sum(y^2) differs from sum(x^2)

# Regression of Y onto X
beta_y_on_x <- sum(x * y) / sum(x^2)

# Regression of X onto Y
beta_x_on_y <- sum(x * y) / sum(y^2)

cat("β (Y ~ X):", beta_y_on_x, "\n")

## β (Y ~ X): 2.037694

cat("β (X ~ Y):", beta_x_on_y, "\n")

## β (X ~ Y): 0.3753051

Exercise 12 (c)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(130)
n <- 100
x <- rnorm(n)
# The two square-root factors cancel, so this is simply y = x and sum(y^2) equals sum(x^2)
y <- x / sqrt(sum(x^2)) * sqrt(sum(x^2))

# Regression of Y onto X
beta_y_on_x <- sum(x * y) / sum(x^2)

# Regression of X onto Y
beta_x_on_y <- sum(x * y) / sum(y^2)

cat("β (Y ~ X):", beta_y_on_x, "\n")

## β (Y ~ X): 1

cat("β (X ~ Y):", beta_x_on_y, "\n")

## β (X ~ Y): 1
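The example above makes y identical to x. A less trivial case, sketched below as an alternative, takes y to be a random permutation of x: the values are paired differently, but the sums of squares still match, so the two estimates agree up to floating-point rounding without being equal to 1.

```r
set.seed(130)
n <- 100
x <- rnorm(n)
y <- sample(x)  # same values in a different order, so sum(y^2) equals sum(x^2)

# Regression of Y onto X (no intercept)
beta_y_on_x <- sum(x * y) / sum(x^2)

# Regression of X onto Y (no intercept)
beta_x_on_y <- sum(x * y) / sum(y^2)

cat("β (Y ~ X):", beta_y_on_x, "\n")
cat("β (X ~ Y):", beta_x_on_y, "\n")
print(isTRUE(all.equal(beta_y_on_x, beta_x_on_y)))  # TRUE
```

The same estimates come from lm(y ~ x + 0) and lm(x ~ y + 0).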