2. Carefully explain the differences between the KNN classifier and KNN regression methods.

The main difference lies in the type of response variable. KNN regression is used when the target is quantitative (continuous), so the predicted y can be any number, while the KNN classifier is used when the target is qualitative (discrete), e.g. an output of 0 or 1. Given K and a prediction point, KNN regression finds the K training observations nearest to that point and predicts the average of their responses. The KNN classifier instead uses those K nearest training points to estimate the conditional probability of each class, and then assigns the test point to the class with the highest estimated probability.
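The question is conceptual, but a minimal base-R sketch may make the contrast concrete. All names here (knn_predict, train, y, x0, k) are hypothetical, not from the text:

knn_predict = function(train, y, x0, k, type = c("regression", "classification")) {
  type = match.arg(type)
  # Euclidean distance from the query point x0 to every training row
  d = sqrt(rowSums((train - matrix(x0, nrow(train), ncol(train), byrow = TRUE))^2))
  nn = order(d)[1:k]                    # indices of the k nearest neighbors
  if (type == "regression") {
    mean(y[nn])                         # regression: average the neighbors' responses
  } else {
    names(which.max(table(y[nn])))      # classification: majority vote over labels
  }
}

set.seed(42)
train = matrix(rnorm(40), ncol = 2)
y_num = rowSums(train) + rnorm(20)
knn_predict(train, y_num, x0 = c(0, 0), k = 5)                                    # a number
knn_predict(train, ifelse(y_num > 0, "yes", "no"), c(0, 0), 5, "classification")  # a label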

9. This question involves the use of multiple linear regression on the Auto data set.

auto <- read.csv("~/R-Studio/Predictive Modeling/ALL CSV FILES - 2nd Edition/Auto.csv")
  a. Produce a scatterplot matrix which includes all of the variables in the data set.
plot(auto)  # plot() on a data frame draws a scatterplot matrix, as pairs() would

  b. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
library(dplyr)  # provides the %>% pipe and select()
auto_no_name = auto %>% 
  dplyr::select(-name)
# horsepower is read in as character because missing values are coded "?"
auto_no_name$horsepower = as.numeric(auto_no_name$horsepower)
## Warning: NAs introduced by coercion
cor(auto_no_name)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7762599   -0.8044430         NA -0.8317389
## cylinders    -0.7762599  1.0000000    0.9509199         NA  0.8970169
## displacement -0.8044430  0.9509199    1.0000000         NA  0.9331044
## horsepower           NA         NA           NA          1         NA
## weight       -0.8317389  0.8970169    0.9331044         NA  1.0000000
## acceleration  0.4222974 -0.5040606   -0.5441618         NA -0.4195023
## year          0.5814695 -0.3467172   -0.3698041         NA -0.3079004
## origin        0.5636979 -0.5649716   -0.6106643         NA -0.5812652
##              acceleration       year     origin
## mpg             0.4222974  0.5814695  0.5636979
## cylinders      -0.5040606 -0.3467172 -0.5649716
## displacement   -0.5441618 -0.3698041 -0.6106643
## horsepower             NA         NA         NA
## weight         -0.4195023 -0.3079004 -0.5812652
## acceleration    1.0000000  0.2829009  0.2100836
## year            0.2829009  1.0000000  0.1843141
## origin          0.2100836  0.1843141  1.0000000
which(is.na(auto_no_name$horsepower))  # locate the NAs introduced by coercion
## [1]  33 127 331 337 355
# these five horsepower entries are coded "?" in the original data
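Alternatively, to obtain a complete correlation matrix despite the missing horsepower values, cor() can drop the incomplete rows (a sketch):

cor(auto_no_name, use = "complete.obs")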
  c. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?
auto_lm = lm(mpg ~ ., data = auto_no_name)
summary(auto_lm)
## 
## Call:
## lm(formula = mpg ~ ., data = auto_no_name)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Given the F-statistic's p-value of < 2.2e-16, we can reject the null hypothesis and conclude that at least one predictor is related to mpg (assuming an alpha level of 0.05). Displacement, weight, year, and origin are the significant predictors. The coefficient for year is 0.750773: holding all other variables constant, a one-unit increase in year is associated with an increase of about 0.75 mpg.
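One way to see the year interpretation concretely is to predict mpg for two hypothetical cars that are identical except for a one-year difference (the predictor values below are made up for illustration):

base = data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                  weight = 2500, acceleration = 15, year = 76, origin = 1)
bumped = transform(base, year = year + 1)
predict(auto_lm, bumped) - predict(auto_lm, base)  # equals the year coefficient, ~0.7508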

  d. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(auto_lm)

The residuals may not be normally distributed: in the Q-Q plot, many points at the upper end depart from the reference line. The model also appears to violate homoskedasticity (equal variance): in the Residuals vs Fitted plot, the points fan out in a "<" shape rather than staying equally spread with no pattern. Looking at Cook's distance in the Residuals vs Leverage plot, no points cross the 0.5 threshold, so there appear to be no highly influential observations.
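To back the visual read with numbers, one could flag studentized residuals beyond +/-3 as unusually large outliers and hat values above 2(p+1)/n as high leverage (a sketch using the usual rules of thumb):

rstud = rstudent(auto_lm)
which(abs(rstud) > 3)                    # unusually large outliers
hv = hatvalues(auto_lm)
p = length(coef(auto_lm)) - 1            # number of predictors
which(hv > 2*(p + 1)/nobs(auto_lm))      # high-leverage observations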

  e. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
auto_lm_int = lm(mpg ~ . * ., data = auto_no_name)
summary(auto_lm_int)
## 
## Call:
## lm(formula = mpg ~ . * ., data = auto_no_name)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

Using an alpha level of 0.05, the following interactions are significant: displacement:year, acceleration:year, and acceleration:origin.
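To test whether the interaction terms help as a group rather than one at a time, one could compare the nested models with an F-test (a sketch; both fits drop the same 5 rows with missing horsepower, so they are comparable):

anova(auto_lm, auto_lm_int)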

  f. Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
auto_lm_trans = lm(mpg ~ log(horsepower) + horsepower, data = auto_no_name)
summary(auto_lm_trans)
## 
## Call:
## lm(formula = mpg ~ log(horsepower) + horsepower, data = auto_no_name)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5118  -2.5018  -0.2533   2.4446  15.3102 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     156.04057   12.08267  12.914  < 2e-16 ***
## log(horsepower) -31.59815    3.28363  -9.623  < 2e-16 ***
## horsepower        0.11846    0.02929   4.044 6.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.415 on 389 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6801 
## F-statistic: 416.6 on 2 and 389 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(auto_lm_trans)

Previously, horsepower was not significant in the full model, but with the log transformation both horsepower terms become significant. The diagnostic plots for this fit also look closer to normality and equal variance than those for the full model.
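A couple of the other transformations the prompt suggests, compared on adjusted R-squared (a sketch):

auto_lm_sqrt = lm(mpg ~ sqrt(horsepower), data = auto_no_name)
auto_lm_sq = lm(mpg ~ horsepower + I(horsepower^2), data = auto_no_name)
summary(auto_lm_sqrt)$adj.r.squared
summary(auto_lm_sq)$adj.r.squared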

10. This question should be answered using the Carseats data set.

carseats <- read.csv("~/R-Studio/Predictive Modeling/ALL CSV FILES - 2nd Edition/Carseats.csv")
  a. Fit a multiple regression model to predict Sales using Price, Urban, and US.
carseats_lm = lm(Sales ~ Price + Urban + US, data = carseats)
summary(carseats_lm)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  b. Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

Price (p < 0.05): holding all other variables constant, a one-unit increase in Price is associated with a decrease of 0.054459 in Sales (about 54 fewer car seats sold, since Sales is recorded in thousands of units).

Urban (p > 0.05): Urban is not a significant predictor of Sales (p = 0.936), so there is no evidence that a store being in an urban area affects sales. Its coefficient would imply that, all else held constant, an urban store sells 0.021916 fewer units.

US (p < 0.05): holding all other variables constant, a store located in the US is associated with an increase of 1.200573 in Sales.

  c. Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.043469 - 0.054459*(Price) - 0.021916*(Urban) + 1.200573*(US), where Urban = 1 if the store is in an urban area and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise; a dummy set to 0 drops its term from the equation.
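As a check on the dummy coding, one could inspect the design matrix lm() actually builds; UrbanYes and USYes appear as 0/1 columns (a sketch):

head(model.matrix(carseats_lm))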

  d. For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Price and US, assuming an alpha level of 0.05.

  e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
carseat_e = lm(Sales ~ Price + US, data = carseats)
summary(carseat_e)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  f. How well do the models in (a) and (e) fit the data?

Comparing adjusted R-squared, model (a) gives 0.2335 and model (e) gives 0.2354, so both models explain roughly 23% of the variance in Sales. Model (e) fits slightly better after adjusting for model size: dropping the non-significant Urban term raises the adjusted R-squared and the F-statistic (62.43 vs. 41.52) while losing essentially nothing in fit.
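A formal way to compare the two nested fits is an F-test on the dropped Urban term (a sketch; it should agree with Urban's t-test from model (a)):

anova(carseat_e, carseats_lm)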

  g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(carseat_e)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

We are 95% confident that the true coefficient for Price lies in (-0.06475984, -0.04419543) and that the true coefficient for USYes lies in (0.69151957, 1.70776632); that is, intervals constructed this way capture the true coefficient 95% of the time.
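For reference, what confint() computes under the hood is estimate +/- t-critical * standard error (a sketch):

est = coef(carseat_e)
se = summary(carseat_e)$coefficients[, "Std. Error"]
tcrit = qt(0.975, df = df.residual(carseat_e))
cbind(lower = est - tcrit*se, upper = est + tcrit*se)  # matches confint(carseat_e)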

  h. Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(carseat_e)

The Residuals vs Leverage plot does not show any observations with unusually high leverage (no points fall beyond the Cook's distance bands), and the residual plots do not suggest any unusually large outliers.

12. This problem involves simple linear regression without an intercept.

  a. Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), the estimate for the regression of Y onto X is sum(x*y) / sum(x^2), while for X onto Y it is sum(x*y) / sum(y^2). The two estimates are equal exactly when sum(x^2) == sum(y^2), i.e., when the two variables have the same sum of squares; Y = X (coefficient 1) is one special case.
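A quick check of (3.38) in R (a sketch; the + 0 in the formula suppresses the intercept):

set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
c(coef(lm(y ~ x + 0)), sum(x*y) / sum(x^2))  # regression through the origin matches the closed form
c(coef(lm(x ~ y + 0)), sum(x*y) / sum(y^2))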

  b. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
x = rnorm(100)
y = 0.42*x + rnorm(100)
q12_a = lm(y~x)
q12_b = lm(x~y)
summary(q12_a)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5025 -0.6111  0.1161  0.6228  2.3254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02470    0.09485   0.260    0.795    
## x            0.46252    0.09405   4.918 3.52e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9477 on 98 degrees of freedom
## Multiple R-squared:  0.1979, Adjusted R-squared:  0.1898 
## F-statistic: 24.19 on 1 and 98 DF,  p-value: 3.522e-06
summary(q12_b)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.27960 -0.73804 -0.06571  0.73459  1.88715 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02204    0.09124   0.242     0.81    
## y            0.42798    0.08702   4.918 3.52e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9116 on 98 degrees of freedom
## Multiple R-squared:  0.1979, Adjusted R-squared:  0.1898 
## F-statistic: 24.19 on 1 and 98 DF,  p-value: 3.522e-06
coefficients(q12_a)
## (Intercept)           x 
##  0.02469722  0.46251559
coefficients(q12_b)
## (Intercept)           y 
##  0.02203504  0.42797841

The two coefficient estimates are different: 0.4625 for the regression of y onto x versus 0.4280 for x onto y.

  c. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x = rnorm(100)
y = x 
q12_c = lm(y~x)
q12_d = lm(x~y)
summary(q12_c)
## Warning in summary.lm(q12_c): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.931e-15 -9.390e-18  1.784e-17  5.380e-17  1.842e-16 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 2.220e-17  2.082e-17 1.067e+00    0.289    
## x           1.000e+00  2.029e-17 4.929e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.064e-16 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.43e+33 on 1 and 98 DF,  p-value: < 2.2e-16
summary(q12_d)
## Warning in summary.lm(q12_d): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.931e-15 -9.390e-18  1.784e-17  5.380e-17  1.842e-16 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 2.220e-17  2.082e-17 1.067e+00    0.289    
## y           1.000e+00  2.029e-17 4.929e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.064e-16 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.43e+33 on 1 and 98 DF,  p-value: < 2.2e-16
coefficients(q12_c)
##  (Intercept)            x 
## 2.220446e-17 1.000000e+00
coefficients(q12_d)
##  (Intercept)            y 
## 2.220446e-17 1.000000e+00

Now the two estimates are the same: since y = x, sum(x^2) == sum(y^2), and both regressions recover a coefficient of 1 with an essentially perfect fit.
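y = x is the trivial case; any y with sum(y^2) == sum(x^2) also works. For instance (a sketch), a permutation of x, which avoids the degenerate perfect fit:

set.seed(2)
x = rnorm(100)
y = sample(x)            # same values in a different order, so sum(y^2) == sum(x^2)
coef(lm(y ~ x + 0))      # regression through the origin, per (3.38)
coef(lm(x ~ y + 0))      # identical slope estimate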