The KNN classifier is used when the target variable is discrete, while KNN regression is used when y is continuous. The classifier assigns an observation to a category, whereas the regression predicts a numeric value.
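A minimal sketch of both variants, using class::knn() and FNN::knn.reg() on the built-in iris data (the train/test split and k = 5 are illustrative assumptions, not part of the original analysis):

library(FNN)  # knn.reg() for regression; the class package provides knn() for classification

set.seed(1)
train.idx = sample(nrow(iris), 100)

# Classification: predict the discrete Species label
knn.class = class::knn(train = iris[train.idx, 1:4],
                       test  = iris[-train.idx, 1:4],
                       cl    = iris$Species[train.idx], k = 5)

# Regression: predict the continuous Sepal.Length value
knn.regr = FNN::knn.reg(train = iris[train.idx, 2:4],
                        test  = iris[-train.idx, 2:4],
                        y     = iris$Sepal.Length[train.idx], k = 5)
head(knn.regr$pred)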
auto = read.csv("Auto.csv", na.strings = "?")
auto = na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
auto$horsepower = as.numeric(auto$horsepower)
pairs(auto[1:8])
cor(auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
auto.lm = lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = auto)
summary(auto.lm)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, the linear model appears to have four significant variables for predicting mpg: displacement, weight, year, and origin. All of these except weight have a positive relationship with mpg. This makes sense, as weight tends to slow a car down.
The most significant predictors are weight and year. Both have p-values below 2e-16, far below the .05 level of significance.
The coefficient for year is .7508. This indicates that mpg increases by about .7508 for every one-unit (one-year) increase in year, holding the other predictors fixed. In other words, the newer the car, the higher the mpg.
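As a quick programmatic check of the claims above, the coefficient table can be filtered by p-value (a convenience sketch, not part of the original output):

coefs = summary(auto.lm)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]  # terms significant at the .05 level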
par(mfrow=c(2,2))
plot(auto.lm, which=c(1:4))
The residuals vs. fitted plot seems to have a slight pattern, which could indicate heteroscedasticity. The red line plotted in the scale-location chart is roughly horizontal and therefore seems fine. The points on the Q-Q plot mostly follow the Q-Q line, which indicates normality. Lastly, the Cook's distance plot indicates that there might be three outliers. The one with the highest Cook's distance has a value of about .08 and could be removed.
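To locate those influential points programmatically, the distances can be sorted directly (the 4/n cutoff is a common rule of thumb, assumed here rather than taken from the original analysis):

cd = cooks.distance(auto.lm)
head(sort(cd, decreasing = TRUE), 3)  # the three most influential observations
which(cd > 4 / nrow(auto))            # rule-of-thumb cutoff for influence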
auto.lm2 = lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin +
              cylinders*displacement + cylinders*horsepower + cylinders*weight + cylinders*acceleration + cylinders*year + cylinders*origin +
              horsepower*weight + horsepower*acceleration + horsepower*year + horsepower*origin +
              weight*acceleration + weight*year + weight*origin +
              acceleration*year + acceleration*origin + year*origin, data = auto)
summary(auto.lm2)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin + cylinders * displacement +
## cylinders * horsepower + cylinders * weight + cylinders *
## acceleration + cylinders * year + cylinders * origin + horsepower *
## weight + horsepower * acceleration + horsepower * year +
## horsepower * origin + weight * acceleration + weight * year +
## weight * origin + acceleration * year + acceleration * origin +
## year * origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5120 -1.4640 -0.0339 1.3639 11.3647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.575e+01 5.013e+01 1.711 0.08801 .
## cylinders -6.267e+00 5.546e+00 -1.130 0.25919
## displacement 1.199e-02 3.019e-02 0.397 0.69140
## horsepower 3.396e-01 3.299e-01 1.029 0.30396
## weight -2.183e-02 1.514e-02 -1.441 0.15038
## acceleration -5.227e+00 2.124e+00 -2.461 0.01433 *
## year 1.174e-01 5.786e-01 0.203 0.83932
## origin -1.734e+01 6.923e+00 -2.504 0.01271 *
## cylinders:displacement -1.999e-03 4.488e-03 -0.445 0.65628
## cylinders:horsepower -8.357e-03 1.486e-02 -0.562 0.57429
## cylinders:weight 1.537e-03 5.757e-04 2.670 0.00793 **
## cylinders:acceleration 7.697e-02 9.863e-02 0.780 0.43566
## cylinders:year 8.582e-03 6.453e-02 0.133 0.89427
## cylinders:origin 7.531e-01 3.931e-01 1.916 0.05618 .
## horsepower:weight 2.088e-05 1.796e-05 1.163 0.24557
## horsepower:acceleration -6.810e-03 3.221e-03 -2.114 0.03516 *
## horsepower:year -4.214e-03 3.812e-03 -1.105 0.26974
## horsepower:origin 3.354e-03 2.888e-02 0.116 0.90763
## weight:acceleration 2.388e-04 1.992e-04 1.199 0.23147
## weight:year 2.183e-05 1.755e-04 0.124 0.90108
## weight:origin 4.048e-04 1.133e-03 0.357 0.72102
## acceleration:year 5.187e-02 2.496e-02 2.079 0.03835 *
## acceleration:origin 4.785e-01 1.456e-01 3.286 0.00111 **
## year:origin 7.480e-02 7.128e-02 1.049 0.29468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.719 on 368 degrees of freedom
## Multiple R-squared: 0.8858, Adjusted R-squared: 0.8786
## F-statistic: 124.1 on 23 and 368 DF, p-value: < 2.2e-16
At the .05 level there appear to be four significant interaction effects: cylinders:weight, horsepower:acceleration, acceleration:year, and acceleration:origin (cylinders:origin is only marginally significant, at the .1 level). The most significant is acceleration:origin.
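As an aside, R's formula syntax can request every pairwise interaction compactly. Note that this sketch also includes the displacement interactions that auto.lm2 above omits, so its output would differ slightly:

# All main effects plus all 21 pairwise interactions among the 7 predictors
auto.lm2b = lm(mpg ~ (cylinders + displacement + horsepower + weight +
                      acceleration + year + origin)^2, data = auto)
summary(auto.lm2b)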
auto.lm3 = lm(mpg ~ log(cylinders) + log(displacement) + log(horsepower) + log(weight) + log(acceleration) + log(year) + log(origin), data = auto)
summary(auto.lm3)
##
## Call:
## lm(formula = mpg ~ log(cylinders) + log(displacement) + log(horsepower) +
## log(weight) + log(acceleration) + log(year) + log(origin),
## data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5987 -1.8172 -0.0181 1.5906 12.8132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66.5643 17.5053 -3.803 0.000167 ***
## log(cylinders) 1.4818 1.6589 0.893 0.372273
## log(displacement) -1.0551 1.5385 -0.686 0.493230
## log(horsepower) -6.9657 1.5569 -4.474 1.01e-05 ***
## log(weight) -12.5728 2.2251 -5.650 3.12e-08 ***
## log(acceleration) -4.9831 1.6078 -3.099 0.002082 **
## log(year) 54.9857 3.5555 15.465 < 2e-16 ***
## log(origin) 1.5822 0.5083 3.113 0.001991 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.069 on 384 degrees of freedom
## Multiple R-squared: 0.8482, Adjusted R-squared: 0.8454
## F-statistic: 306.5 on 7 and 384 DF, p-value: < 2.2e-16
auto.lm4 = lm(mpg ~ sqrt(cylinders) + sqrt(displacement) + sqrt(horsepower) + sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin), data = auto)
summary(auto.lm4)
##
## Call:
## lm(formula = mpg ~ sqrt(cylinders) + sqrt(displacement) + sqrt(horsepower) +
## sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin),
## data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5250 -1.9822 -0.1111 1.7347 13.0681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -49.79814 9.17832 -5.426 1.02e-07 ***
## sqrt(cylinders) -0.23699 1.53753 -0.154 0.8776
## sqrt(displacement) 0.22580 0.22940 0.984 0.3256
## sqrt(horsepower) -0.77976 0.30788 -2.533 0.0117 *
## sqrt(weight) -0.62172 0.07898 -7.872 3.59e-14 ***
## sqrt(acceleration) -0.82529 0.83443 -0.989 0.3233
## sqrt(year) 12.79030 0.85891 14.891 < 2e-16 ***
## sqrt(origin) 3.26036 0.76767 4.247 2.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.21 on 384 degrees of freedom
## Multiple R-squared: 0.8338, Adjusted R-squared: 0.8308
## F-statistic: 275.3 on 7 and 384 DF, p-value: < 2.2e-16
# Note: ^ and ** inside an R formula are formula operators, not arithmetic,
# so these terms reduce to the plain predictors (see the discussion below)
auto.lm5 = lm(mpg ~ (cylinders)**2 + (displacement)**2 + (horsepower)**2 + (weight)**2 + (acceleration)**2 + (year)**2 + (origin)**2, data = auto)
summary(auto.lm5)
##
## Call:
## lm(formula = mpg ~ (cylinders)^2 + (displacement)^2 + (horsepower)^2 +
## (weight)^2 + (acceleration)^2 + (year)^2 + (origin)^2, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The best-predicting model created during this part was the log model, with an R-squared value of .8482. Note that auto.lm5 did not actually fit squared terms: inside an R formula, ^ (and **) are formula operators rather than arithmetic, so (weight)**2 reduces to plain weight, and the model is identical to auto.lm, as the identical output confirms. The best model overall was the interaction model created in the previous part, with an R-squared of .8858.
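To genuinely square the predictors, each term must be wrapped in I() (or supplied via poly()); a minimal corrected sketch:

# I() protects ^ from being interpreted as a formula operator
auto.lm5b = lm(mpg ~ I(cylinders^2) + I(displacement^2) + I(horsepower^2) +
                 I(weight^2) + I(acceleration^2) + I(year^2) + I(origin^2),
               data = auto)
summary(auto.lm5b)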
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
car = Carseats
car = na.omit(car)
car.lm = lm(Sales ~ Price + Urban + US, data = car)
summary(car.lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
According to the model, only Price and US are significant in predicting child car seat sales. As the price of the car seats increases, sales decrease: for every one-unit increase in Price, Sales decreases by .0545 units. Sales are slightly lower if the store is in an urban location (though this effect is not significant) and higher if the store is in the US.
Yhat = 13.0435 - 0.0545*Price - 0.0219*UrbanYes + 1.2006*USYes
The null hypotheses for Price and USYes are rejected because both have p-values below .05, meaning these two variables are significant.
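Plugging values into the fitted equation is equivalent to calling predict(); the Price of 120 below is a hypothetical value chosen purely for illustration:

# Expected sales for a $120 seat sold in a non-urban US store
predict(car.lm, newdata = data.frame(Price = 120, Urban = "No", US = "Yes"))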
car.lm2 = lm(Sales ~ Price + US, data = car)
summary(car.lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models have low R-squared values. The model in part E) has a value of .2393, equal to the one in part A). Model E) has a slightly higher adjusted R-squared, .2354 versus .2335 in model A). The coefficients of the variables barely change.
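Because car.lm2 is simply car.lm with Urban dropped, the two are nested and can also be compared formally with a partial F-test (not part of the original output):

anova(car.lm2, car.lm)  # tests whether Urban adds explanatory power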
confint(car.lm2, level = .95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
cook.model = cooks.distance(car.lm2)
plot(cook.model, col = "red", pch = 20, cex = 1)
According to the Cook's distance plot, no outliers or high-leverage observations appear in the model. All the points seem to be close to one another.
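Leverage can also be checked directly via the hat values; the 2p/n cutoff below is a common rule of thumb, assumed here for illustration:

hat = hatvalues(car.lm2)
sum(hat > 2 * length(coef(car.lm2)) / nrow(car))  # observations above the 2p/n cutoff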
The coefficient estimate for the regression of Y onto X equals the estimate for the regression of X onto Y when the sum of the squared values of x equals the sum of the squared values of y: the slope of y ~ x is sum(x*y)/sum(x^2), while the slope of x ~ y is sum(x*y)/sum(y^2). This holds, for example, when y is a permutation of x, as in the second example below.
set.seed(70)
x = runif(100)
set.seed(58)
y = rnorm(100)
model = lm(y~x)
model2 = lm(x~y)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91100 -0.76273 0.02784 0.61826 2.30196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3556 0.1963 1.811 0.0731 .
## x -0.7723 0.3365 -2.295 0.0238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.935 on 98 degrees of freedom
## Multiple R-squared: 0.05102, Adjusted R-squared: 0.04133
## F-statistic: 5.268 on 1 and 98 DF, p-value: 0.02385
summary(model2)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52975 -0.19776 0.01285 0.22857 0.51232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.51039 0.02737 18.649 <2e-16 ***
## y -0.06606 0.02878 -2.295 0.0238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2734 on 98 degrees of freedom
## Multiple R-squared: 0.05102, Adjusted R-squared: 0.04133
## F-statistic: 5.268 on 1 and 98 DF, p-value: 0.02385
plot(x,y)
abline(model)
plot(y,x)
abline(model2)
set.seed(70)
x2 = sample.int(100,100)
y2 = sample.int(100,100)
model3 = lm(y2~x2)
model4 = lm(x2~y2)
summary(model3)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.136 -23.523 0.382 23.836 49.518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.9915 5.8408 7.703 1.09e-11 ***
## x2 0.1091 0.1004 1.086 0.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.99 on 98 degrees of freedom
## Multiple R-squared: 0.0119, Adjusted R-squared: 0.001816
## F-statistic: 1.18 on 1 and 98 DF, p-value: 0.28
summary(model4)
##
## Call:
## lm(formula = x2 ~ y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.754 -24.723 0.173 24.509 51.300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.9915 5.8408 7.703 1.09e-11 ***
## y2 0.1091 0.1004 1.086 0.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.99 on 98 degrees of freedom
## Multiple R-squared: 0.0119, Adjusted R-squared: 0.001816
## F-statistic: 1.18 on 1 and 98 DF, p-value: 0.28
plot(x2,y2)
abline(model3)
plot(y2,x2)
abline(model4)
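A quick check that the equal-sum-of-squares condition behind the matching coefficients holds for the permuted data:

sum(x2^2) == sum(y2^2)  # TRUE: x2 and y2 are both permutations of 1..100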