Both the KNN classifier and KNN regression are non-parametric approaches to modeling data. Both depend on the parameter k, the number of nearest neighbors the algorithm uses to estimate the response. The classifier looks at the proportion of each class among the k nearest neighbors to decide which class an observation should belong to, which implicitly defines a decision boundary for classifying new values. KNN regression, instead of taking a proportion, averages the responses of the k nearest neighbors, producing a step-like prediction function for new values.
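As a rough illustration of the difference, here is a minimal base-R sketch (one-dimensional predictor, absolute distance; the function names are made up for this example): the regression version averages the neighbors' responses, while the classifier takes a majority vote.
knn_reg_predict <- function(x_train, y_train, x0, k = 3) {
  nn <- order(abs(x_train - x0))[1:k]     # indices of the k nearest neighbors
  mean(y_train[nn])                       # regression: average their responses
}
knn_class_predict <- function(x_train, cl_train, x0, k = 3) {
  nn <- order(abs(x_train - x0))[1:k]
  names(which.max(table(cl_train[nn])))   # classification: majority vote among neighbors
}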
Auto <- Auto  # work with a local copy of the Auto data set
pairs(Auto)
cor(select(Auto,-name))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Autolm <- lm(mpg ~.-name, Auto)
summary(Autolm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response? Yes: the F-statistic of 252.4 has a p-value below 2.2e-16, so at least one predictor is related to mpg.
Which predictors appear to have a statistically significant relationship to the response? Displacement, weight, year, and origin all have p-values well below 0.05; cylinders, horsepower, and acceleration do not.
What does the coefficient for the year variable suggest? Holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg.
par(mfrow=c(2,2))
plot(Autolm)
In the diagnostic plots, the residuals-vs-fitted panel shows an upward curve where we would like to see a flat line, suggesting some non-linearity. In the Q-Q plot the points in the right tail do not follow the line exactly. For leverage, R flags point 14, and there are points with very large residuals (possible outliers) such as 327, 394, 326, and 323.
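These visual impressions can be checked numerically; a quick sketch using the Autolm fit above:
which.max(hatvalues(Autolm))                          # observation with the largest leverage
sort(abs(rstudent(Autolm)), decreasing = TRUE)[1:4]   # largest studentized residuals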
AutolmInter <- lm(mpg ~ .+ weight*horsepower, select(Auto, -name, -cylinders, -acceleration))
summary(AutolmInter)
##
## Call:
## lm(formula = mpg ~ . + weight * horsepower, data = select(Auto,
## -name, -cylinders, -acceleration))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4091 -1.7734 -0.1386 1.5039 11.9502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.012e-01 3.966e+00 0.177 0.859757
## displacement 6.504e-03 4.884e-03 1.331 0.183822
## horsepower -2.202e-01 2.076e-02 -10.605 < 2e-16 ***
## weight -1.142e-02 6.891e-04 -16.571 < 2e-16 ***
## year 7.727e-01 4.476e-02 17.263 < 2e-16 ***
## origin 8.399e-01 2.488e-01 3.376 0.000809 ***
## horsepower:weight 5.438e-05 5.068e-06 10.731 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.928 on 385 degrees of freedom
## Multiple R-squared: 0.8615, Adjusted R-squared: 0.8593
## F-statistic: 399 on 6 and 385 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(AutolmInter)
From exploring interaction terms, displacement:horsepower and horsepower:weight are consistently significant, depending on which combination of variables is included; removing one of the two interactions increases the significance of the other. horsepower:weight has the larger effect on R-squared (see the comparison below). Including the interaction term also flattens the residuals.
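One way to compare the two candidate interactions directly (a sketch using the same reduced predictor set as above; the fit names here are arbitrary) is to fit each and look at its R-squared:
fit_hp_wt <- lm(mpg ~ . + horsepower:weight, select(Auto, -name, -cylinders, -acceleration))
fit_dp_hp <- lm(mpg ~ . + displacement:horsepower, select(Auto, -name, -cylinders, -acceleration))
c(hp_wt = summary(fit_hp_wt)$r.squared, dp_hp = summary(fit_dp_hp)$r.squared)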
Auto %>%
select(-name) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
The density plots show whether any variables might benefit from a transformation toward normality. Origin should be treated as a categorical variable, not a continuous one.
Autolmtrans <- lm(mpg ~ . + poly(horsepower, 2) - weight + log(weight) + as.factor(origin) - origin, select(Auto,-name, -cylinders, -acceleration, -displacement))
summary(Autolmtrans)
##
## Call:
## lm(formula = mpg ~ . + poly(horsepower, 2) - weight + log(weight) +
## as.factor(origin) - origin, data = select(Auto, -name, -cylinders,
## -acceleration, -displacement))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5229 -1.8585 -0.0073 1.5756 12.3550
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.104181 9.631414 8.836 < 2e-16 ***
## horsepower -0.026151 0.008622 -3.033 0.00258 **
## year 0.776688 0.045725 16.986 < 2e-16 ***
## poly(horsepower, 2)1 NA NA NA NA
## poly(horsepower, 2)2 22.988399 3.367983 6.826 3.42e-11 ***
## log(weight) -14.885036 1.263899 -11.777 < 2e-16 ***
## as.factor(origin)2 1.293095 0.462423 2.796 0.00543 **
## as.factor(origin)3 1.536449 0.474980 3.235 0.00132 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.948 on 385 degrees of freedom
## Multiple R-squared: 0.8595, Adjusted R-squared: 0.8573
## F-statistic: 392.6 on 6 and 385 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(Autolmtrans)
Transforming the variables increased the R-squared relative to the additive model and improved the residuals by flattening their mean. Searching for the combination of transformations that maximizes R-squared could go on endlessly; I based these transformations on the density plots (to make variables more normal) and on the pairs plot (to look for polynomial behavior).
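A quick side-by-side of the three fits so far (additive, interaction, and transformed models) makes the comparison concrete:
sapply(list(additive = Autolm, interaction = AutolmInter, transformed = Autolmtrans),
       function(m) summary(m)$adj.r.squared)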
lmcs <- Carseats %>%
select(Sales,Price, Urban, US) %>%
lm(Sales ~ ., .)
summary(lmcs)
##
## Call:
## lm(formula = Sales ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Be careful: some of the variables in the model are qualitative!
This model has four coefficients. The intercept does not have a strong interpretation because the domain of Price does not contain zero. The Price coefficient is negative, indicating that when price goes up, sales go down; this is intuitive, since demand decreases as price increases. The Urban coefficient says that being in an urban location lowers sales slightly, perhaps because people in urban environments rely more on public transit, though this variable is not significant. The positive US coefficient means that stores in the US have higher sales, perhaps because US laws and safety measures require car seats.
\[
\widehat{Sales}_{i} = 13.043469 - 0.054459 \, Price_{i} - 0.021916 \, Urban_{i} + 1.200573 \, US_{i},
\qquad
\begin{array}{l}
Urban_{i} = \begin{cases} 1 & \text{if store $i$ is in an urban location}\\ 0 & \text{if store $i$ is in a rural location} \end{cases} \\
US_{i} = \begin{cases} 1 & \text{if store $i$ is in the United States}\\ 0 & \text{if store $i$ is not in the United States} \end{cases}
\end{array}
\]
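To see the dummy coefficients in action, a small prediction sketch (the Price value of 120 is arbitrary): for two otherwise identical urban stores, the predicted Sales differ by the USYes coefficient, about 1.2 units.
predict(lmcs, newdata = data.frame(Price = 120, Urban = "Yes", US = c("Yes", "No")))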
We can reject the null hypothesis that \(\beta_j = 0\) for the variables Price and US, since their p-values are very close to zero.
lmcssimple <- Carseats %>%
select(Sales,Price, US) %>%
lm(Sales ~ ., .)
summary(lmcssimple)
##
## Call:
## lm(formula = Sales ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The two models are very similar. The multiple R-squared is unchanged to four decimal places between (a) and (e), while the adjusted R-squared improves slightly (from 0.2335 to 0.2354), and the parameter estimates \(\beta\) change only a little. The benefit of model (e) is that it is simpler and easier to explain. Neither model explains much of the variance in Sales, with R-squared values of about 0.24. A formal comparison of the two nested models is shown below.
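Since model (e) is nested inside model (a), an F-test comparing the two shows directly whether dropping Urban loses anything:
anova(lmcssimple, lmcs)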
confint(lmcssimple)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow=c(2,2))
plot(lmcssimple)
In the model from (e), points with large residuals (potential outliers) include observations 377, 69, and 51, and observation 368 has high leverage and influence on the model.
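A quick numerical check of these claims (a sketch using the lmcssimple fit above):
which.max(hatvalues(lmcssimple))                            # observation with the largest leverage
sort(abs(rstudent(lmcssimple)), decreasing = TRUE)[1:3]     # largest studentized residuals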
The coefficients are the same if \(\sum_i (x_i - \bar{x})^2 = \sum_i (y_i - \bar{y})^2\).
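This follows from the two least-squares slope formulas (with intercepts included):
\[
\hat{\beta}_{y \sim x} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{\beta}_{x \sim y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (y_i - \bar{y})^2}.
\]
Both slopes share the same numerator, so they are equal exactly when the two denominators are equal.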
An example where the slope coefficients differ, because x and y have different spreads:
set.seed(1)
x <- jitter(1:100, 0)    # essentially no jitter
y <- jitter(1:100, 100)  # add a large amount of jitter
TSSX <- sum((x-mean(x))^2)
TSSY <- sum((y-mean(y))^2)
print(TSSX == TSSY)
## [1] FALSE
coef(lm(y~x))
## (Intercept) x
## 0.7955612 0.9983826
coef(lm(x~y))
## (Intercept) y
## 5.366281 0.881279
plot(x,y)
An example where the slope coefficients are the same, because the sums of squares are equal:
x <- 1:100
y <- 101:200
TSSX <- sum((x-mean(x))^2)
TSSY <- sum((y-mean(y))^2)
print(TSSX == TSSY)
## [1] TRUE
coef(lm(y~x))
## (Intercept) x
## 100 1
coef(lm(x~y))
## (Intercept) y
## -100 1
plot(x,y)