Assignment2

Q2

Carefully explain the differences between the KNN classifier and KNN regression methods.

The KNN classifier and KNN regression are both non-parametric methods that predict the value at a point from the K training observations closest to it. They differ in the type of response they handle and in how they combine the neighbors. The KNN classifier is used when the response is qualitative: it estimates the probability of each class as the fraction of the K neighbors that belong to that class, and assigns the point to the class with the highest estimated probability. KNN regression is used when the response is quantitative: instead of choosing the most probable class, it predicts the response at the point as the average of the responses of its K neighbors.
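The contrast can be sketched in a few lines of base R. This is only an illustration on made-up toy data with one predictor; the names `x_train`, `y_class`, `y_num`, `nearest`, `knn_classify` and `knn_regress` are hypothetical helpers, not functions from any package.

```r
# Toy training data (made up for illustration): one predictor, two kinds of response.
x_train <- c(1, 2, 3, 6, 7, 8)
y_class <- c("A", "A", "B", "B", "B", "B")   # qualitative response
y_num   <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)   # quantitative response

# Indices of the k training points closest to x0 (ties keep their original order).
nearest <- function(x0, k) order(abs(x_train - x0))[seq_len(k)]

# Classifier: estimate class probabilities from the neighbors, pick the largest.
knn_classify <- function(x0, k) {
  votes <- table(y_class[nearest(x0, k)])
  probs <- votes / k
  names(probs)[which.max(probs)]
}

# Regression: average the quantitative responses of the same neighbors.
knn_regress <- function(x0, k) {
  mean(y_num[nearest(x0, k)])
}

knn_classify(2.5, k = 3)   # "A": two of the three neighbors are in class A
knn_regress(2.5, k = 3)    # 1.5: the mean of 1.5, 2.0 and 1.0
```

Both functions use the same neighbors; only the final step (majority class versus average) differs.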

Q9

This question involves the use of multiple linear regression on the Auto data set.

library(ISLR)
library(MASS)

Produce a scatterplot matrix which includes all of the variables in the data set.

plot(Auto)

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Auto_new <- Auto[,-c(9)]
cor(Auto_new)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

lm_fit<- lm(mpg ~ ., data = Auto_new)
summary(lm_fit)

## 
## Call:
## lm(formula = mpg ~ ., data = Auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Is there a relationship between the predictors and the response?

The F-statistic of 252.4 (on 7 and 384 degrees of freedom), with a p-value < 2.2e-16, provides strong evidence against the null hypothesis that all coefficients are zero, so at least one of the predictors is related to the response.
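As a minimal sketch of where that p-value comes from, the F-statistic stored in a model summary can be converted to a p-value with pf(). The data below are simulated purely for illustration (`x1`, `x2` and `y` are made-up names), so the numbers do not correspond to the Auto fit.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)                 # unrelated to y by construction
y  <- 1 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# summary()$fstatistic is a named vector: value, numdf, dendf.
f <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution gives the overall p-value.
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
p_value
```

A small p-value here only says that at least one coefficient is non-zero; it does not say which one.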

Which predictors appear to have a statistically significant relationship to the response?

Judging by their p-values, the predictors that appear to have a statistically significant relationship to the response at the 5% level are displacement, weight, year and origin. The p-values for cylinders, horsepower and acceleration are all above 0.1.

What does the coefficient for the year variable suggest?

The coefficient for year is positive (0.750773): holding the other predictors fixed, each additional model year is associated with an average increase of about 0.75 mpg.

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(lm_fit)

In the Residuals vs Fitted plot, the smoothed line through the residuals is curved, which suggests that the relationship between the predictors and mpg is not fully linear. In the Residuals vs Leverage plot, observation 14 has high leverage, but its residual does not mark it as an outlier.

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

summary(lm(mpg ~ .:., data = Auto_new))

## 
## Call:
## lm(formula = mpg ~ .:., data = Auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

summary(lm(mpg ~ .*., data = Auto_new))

## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

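Judging by their p-values, the interactions that appear to be statistically significant at the 5% level are acceleration:origin, acceleration:year and displacement:year. The two formulas, .:. and .*., produce the same fit here, as the identical outputs show.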
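For a pair of named predictors the two operators are not interchangeable, which the following sketch on simulated data illustrates (the data frame `d` and its columns `a`, `b`, `y` are made up). With the dot shorthand used above, the output suggests that R reduces a variable interacted with itself to the variable, which would explain why .:. still contains the main effects.

```r
set.seed(2)
d   <- data.frame(a = rnorm(50), b = rnorm(50))
d$y <- 1 + d$a - d$b + 0.5 * d$a * d$b + rnorm(50)

fit_star  <- lm(y ~ a * b, data = d)        # main effects plus interaction
fit_long  <- lm(y ~ a + b + a:b, data = d)  # the same model written out
fit_colon <- lm(y ~ a:b, data = d)          # interaction term only

names(coef(fit_star))    # "(Intercept)" "a" "b" "a:b"
names(coef(fit_colon))   # "(Intercept)" "a:b"

# a * b and a + b + a:b give identical coefficient estimates.
all.equal(coef(fit_star), coef(fit_long))
```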

Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

summary(lm(mpg ~ acceleration+I(acceleration^2), data = Auto_new))

## 
## Call:
## lm(formula = mpg ~ acceleration + I(acceleration^2), data = Auto_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0877  -5.5700  -0.8524   4.3827  22.9813 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -15.26045    7.79899  -1.957 0.051095 .  
## acceleration        3.79787    0.98283   3.864 0.000131 ***
## I(acceleration^2)  -0.08156    0.03056  -2.669 0.007934 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.025 on 389 degrees of freedom
## Multiple R-squared:  0.194,  Adjusted R-squared:  0.1898 
## F-statistic:  46.8 on 2 and 389 DF,  p-value: < 2.2e-16

summary(lm(mpg ~ acceleration+I(sqrt(acceleration)), data = Auto_new))

## 
## Call:
## lm(formula = mpg ~ acceleration + I(sqrt(acceleration)), data = Auto_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9559  -5.5979  -0.8015   4.6222  22.8777 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)            -72.877     29.051  -2.509  0.01253 * 
## acceleration            -3.845      1.885  -2.040  0.04203 * 
## I(sqrt(acceleration))   39.750     14.823   2.682  0.00764 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.025 on 389 degrees of freedom
## Multiple R-squared:  0.1941, Adjusted R-squared:   0.19 
## F-statistic: 46.85 on 2 and 389 DF,  p-value: < 2.2e-16
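Both fits above explain only about 19% of the variance in mpg (R-squared of 0.194), and the exercise also suggests log(X), which was not tried. One possible way to compare several transformations side by side is to collect the adjusted R-squared of each candidate model. The sketch below uses simulated data in which the true relationship is logarithmic by construction; `x`, `y`, `d` and `fits` are made-up names, and the same pattern could be applied to a predictor in Auto_new.

```r
set.seed(5)
n <- 200
x <- runif(n, min = 1, max = 100)
y <- 2 + 3 * log(x) + rnorm(n, sd = 0.5)   # log relationship, by assumption
d <- data.frame(x, y)

# One model per candidate transformation of the predictor.
fits <- list(
  linear    = lm(y ~ x, data = d),
  log       = lm(y ~ log(x), data = d),
  sqrt      = lm(y ~ sqrt(x), data = d),
  quadratic = lm(y ~ x + I(x^2), data = d)
)

# Adjusted R-squared penalises the extra term in the quadratic model.
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
round(adj_r2, 3)
```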

Q10

This question should be answered using the Carseats data set.

Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm_sales <- lm(Sales ~ Price + Urban + US, Carseats)
summary(lm_sales)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

The data set contains sales of child car seats, with Sales measured in thousands of units. Price is a continuous variable, while Urban and US are categorical.

- Price: the coefficient (-0.054459) indicates that, holding the other predictors fixed, each one-unit increase in price is associated with a decrease in sales of about 54 units. The p-value (< 2e-16) shows strong evidence of a relationship between Price and Sales.
- Urban: the variable takes two values, Yes (1) or No (0), indicating whether the store is in an urban location. The coefficient (-0.021916) indicates that sales in urban stores are about 22 units lower than in non-urban stores, holding the other predictors fixed. Its p-value (0.936), however, shows that there is not enough evidence of a relationship between Urban and Sales.
- US: the variable takes two values, Yes (1) or No (0), indicating whether the store is in the US. The coefficient (1.200573) indicates that sales are about 1,200 units higher for stores located in the US, holding the other predictors fixed. The p-value (4.86e-06) shows evidence of a relationship between US and Sales.

Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.043469 - 0.054459 × Price - 0.021916 × Urban + 1.200573 × US, where Urban = 1 if the store is in an urban area and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
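To see how the dummy variables enter the equation, it can be evaluated by hand for one store. The function below is a hypothetical helper (`predict_sales` is not part of any package) that hard-codes the coefficients from the summary output above; the example store with a price of 100 is made up.

```r
# Coefficients copied from the fitted model; Sales is in thousands of units.
predict_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * as.numeric(urban) +   # 1 if urban, 0 otherwise
    1.200573 * as.numeric(us)        # 1 if in the US, 0 otherwise
}

# An urban US store charging 100: about 8.78, i.e. roughly 8,776 units.
predict_sales(price = 100, urban = TRUE, us = TRUE)

# The same store outside the US differs by exactly the US coefficient.
predict_sales(100, TRUE, TRUE) - predict_sales(100, TRUE, FALSE)
```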

For which of the predictors can you reject the null hypothesis H0 : βj = 0?

The null hypothesis can be rejected for the Price and US variables, since their p-values are < 2e-16 and 4.86e-06, respectively. It cannot be rejected for Urban, whose p-value is 0.936.

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm_sales_new <- lm(Sales ~ Price + US, Carseats)
summary(lm_sales_new)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

How well do the models in (a) and (e) fit the data?

Model (a) has an adjusted R-squared of 0.2335 and model (e) has an adjusted R-squared of 0.2354. Both have a multiple R-squared of 0.2393, so each explains only about 24% of the variance in Sales. Neither model fits the data particularly well, and dropping the Urban variable did not worsen the fit.

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(lm_sales_new, level = 0.95)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
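The intervals reported by confint() for a linear model are the estimate plus or minus a t quantile times the standard error. The sketch below checks this on simulated data (`x`, `y` and `fit` are made-up names, not the Carseats fit).

```r
set.seed(3)
x <- rnorm(60)
y <- 2 - 0.5 * x + rnorm(60)
fit <- lm(y ~ x)

coefs  <- coef(summary(fit))
est    <- coefs[, "Estimate"]
se     <- coefs[, "Std. Error"]

# 97.5% quantile of the t distribution with the residual degrees of freedom.
t_crit <- qt(0.975, df = df.residual(fit))

manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual
confint(fit, level = 0.95)   # same numbers, different column labels
```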

Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(3, 2))
plot(hatvalues(lm_sales_new))
plot(lm_sales_new)

Neither the plot of the hat values nor the Residuals vs Leverage plot seems to indicate the presence of observations with high leverage.
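A numerical check can complement the visual one. The sketch below compares each hat value with the average leverage (p + 1)/n and flags studentized residuals larger than 3 in absolute value. The cut-off of three times the average leverage is an assumed rule of thumb, and the data are simulated with one deliberately extreme x value, so the result says nothing about Carseats itself.

```r
set.seed(4)
n <- 100
x <- c(rnorm(n - 1), 8)        # last point is far from the rest by construction
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

h <- hatvalues(fit)
p <- length(coef(fit)) - 1     # number of predictors
avg_lev <- (p + 1) / n         # average leverage

# Assumed rule of thumb: flag leverage above three times the average.
high_lev <- which(h > 3 * avg_lev)

# Studentized residuals beyond +/- 3 are candidate outliers.
outliers <- which(abs(rstudent(fit)) > 3)

high_lev
outliers
```

The same lines could be run with lm_sales_new in place of `fit` to list any flagged stores.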

Q12

This problem involves simple linear regression without an intercept.

Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

After equating the two quotients that give the coefficient estimates for the regression of Y onto X and of X onto Y, we can see that the estimates are equal when the sum of the squared x values is equal to the sum of the squared y values.
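Written out, the argument looks as follows. The subscripts on the two estimates are notation introduced here only to tell them apart.

```latex
\[
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
\]
\[
\hat{\beta}_{Y \sim X} = \hat{\beta}_{X \sim Y}
\iff
\Bigl(\sum_{i=1}^{n} x_i y_i\Bigr)
\Bigl(\sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i^2\Bigr) = 0
\]
```

So the two estimates coincide when the sums of squares are equal, and also in the degenerate case where the sum of the products x_i y_i is zero, in which both estimates are zero.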

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(99)
x <- rnorm(100)
y <- 3 * x + rnorm(100)
df <- data.frame(x, y)

fit1 <- lm(y ~ x + 0, data = df)
fit2 <- lm(x ~ y + 0, data = df)

summary(fit1)$coefficients

##   Estimate Std. Error t value     Pr(>|t|)
## x 3.087646  0.1182444 26.1124 3.415688e-46

summary(fit2)$coefficients

##    Estimate Std. Error t value     Pr(>|t|)
## y 0.2828097 0.01083047 26.1124 3.415688e-46

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(89)
x1 <- rnorm(100)
y1 <- x1
df1 <- data.frame(x1, y1)

fit2_1 <- lm(y1 ~ x1 + 0, data = df1)
fit2_2 <- lm(x1 ~ y1 + 0, data = df1)

summary(fit2_1)$coefficients

## Warning in summary.lm(fit2_1): essentially perfect fit: summary may be
## unreliable

##    Estimate   Std. Error     t value Pr(>|t|)
## x1        1 4.787301e-18 2.08886e+17        0

summary(fit2_2)$coefficients

## Warning in summary.lm(fit2_2): essentially perfect fit: summary may be
## unreliable

##    Estimate   Std. Error     t value Pr(>|t|)
## y1        1 4.787301e-18 2.08886e+17        0
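Setting y equal to x satisfies the condition but gives a perfect fit, hence the warnings. A less degenerate example, sketched here under the assumption that any y with the same sum of squares as x will do, is to take y as a random permutation of x. The names `x2`, `y2`, `b_yx` and `b_xy` are made up for this illustration.

```r
set.seed(89)
x2 <- rnorm(100)
y2 <- sample(x2)   # same values in a different order, so sum(y2^2) equals sum(x2^2)

b_yx <- unname(coef(lm(y2 ~ x2 + 0)))   # regression of y onto x, no intercept
b_xy <- unname(coef(lm(x2 ~ y2 + 0)))   # regression of x onto y, no intercept

c(b_yx, b_xy)                            # the two estimates agree

# Both match the closed form sum(x * y) / sum(x^2).
sum(x2 * y2) / sum(x2^2)
```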

Jose Rodriguez

2023-02-17
