The KNN classifier is used for classification problems, where the response is qualitative, while KNN regression is used for problems with a quantitative response. Both methods predict from the K closest training observations: the classifier takes a majority vote of their class labels, while the regression method averages their response values.
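As a small illustration (a sketch assuming the class and FNN packages are installed; the data and variable names here are simulated for illustration, not part of the exercise):
library(class)  # provides knn() for classification
library(FNN)    # provides knn.reg() for regression
set.seed(1)
train.X = matrix(rnorm(100), ncol = 2)            # 50 training observations
test.X = matrix(rnorm(20), ncol = 2)              # 10 test observations
cls = factor(ifelse(train.X[, 1] > 0, "A", "B"))  # qualitative response
knn(train.X, test.X, cls, k = 5)                  # majority vote among the 5 nearest neighbors
num = train.X[, 1] + rnorm(50, sd = 0.1)          # quantitative response
knn.reg(train.X, test.X, num, k = 5)$pred         # average of the 5 nearest neighbors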
library(ISLR)  # the Auto and Carseats data sets used below come from the ISLR package
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
fit1 = lm(mpg ~ . - name, data = Auto)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
1. The p-value of the F-statistic is below 2.2e-16, so we can conclude that there is a relationship between the predictors and the response.
2. To determine which predictors are statistically significant, we check the p-value associated with each one. From the output, the only predictors that are not significant are cylinders, horsepower, and acceleration.
3. The coefficient of year indicates that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg. This suggests that newer cars are becoming more fuel efficient.
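The significant predictors can also be pulled out of the coefficient table programmatically; a minimal sketch using base R:
ctab = coef(summary(fit1))              # coefficient table; column 4 holds the p-values
rownames(ctab)[-1][ctab[-1, 4] < 0.05]  # predictors significant at the 5% level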
par(mfrow = c(2, 2))
plot(fit1)
The residuals vs. fitted plot shows a clear pattern, suggesting some non-linearity in the relationship. The residuals vs. leverage plot also shows a few potential outliers, with standardized residuals below -2 or above 2, as well as some high-leverage observations.
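These points can also be flagged numerically; a quick sketch using base R's rstudent() and hatvalues() with conventional cutoffs:
which(abs(rstudent(fit1)) > 2)                      # possible outliers
which(hatvalues(fit1) > 2 * mean(hatvalues(fit1)))  # high-leverage observations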
fit2 = lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto[, 1:8])
summary(fit2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2934 -2.5184 -0.3476 1.8399 17.7723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262e+01 2.237e+00 23.519 < 2e-16 ***
## cylinders 7.606e-01 7.669e-01 0.992 0.322
## displacement -7.351e-02 1.669e-02 -4.403 1.38e-05 ***
## weight -9.888e-03 1.329e-03 -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03 3.426e-03 -0.872 0.384
## displacement:weight 2.128e-05 5.002e-06 4.254 2.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
From the p-values we can see that the interaction between cylinders and displacement is not significant, while the interaction between displacement and weight is significant.
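A nested-model F-test makes the same point; a sketch comparing a main-effects-only fit against fit2 (fit2.main is an illustrative name):
fit2.main = lm(mpg ~ cylinders + displacement + weight, data = Auto[, 1:8])
anova(fit2.main, fit2)  # joint F-test on the two interaction terms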
par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
The top-left panel shows mpg against log(horsepower), which appears to be the most linear of the three. The top-right panel shows mpg against sqrt(horsepower), and the bottom-left panel shows mpg against horsepower squared.
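To back up the visual impression, the transformations can be compared numerically; a sketch comparing the R-squared of a simple regression under each one:
summary(lm(mpg ~ log(horsepower), data = Auto))$r.squared
summary(lm(mpg ~ sqrt(horsepower), data = Auto))$r.squared
summary(lm(mpg ~ I(horsepower^2), data = Auto))$r.squared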
data("Carseats")
cr1 = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(cr1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
B. The coefficient of Price means that, holding the other predictors fixed, each $1 increase in price is associated with a decrease in sales of about 54.5 units (Sales is measured in thousands of units). The coefficient of Urban means that, on average, unit sales in urban locations are about 22 units lower than in rural locations. The coefficient of US means that unit sales in US stores are about 1,201 units higher than in non-US stores.
C. Sales = 13.0435 - 0.0545 × Price - 0.0219 × UrbanYes + 1.2006 × USYes, where UrbanYes and USYes are dummy variables equal to 1 if the store is in an urban location or in the US, respectively, and 0 otherwise.
D. We reject the null hypothesis H0: βj = 0 for Price and US; we cannot reject it for Urban.
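To see the dummy coding in action, one can predict sales for a hypothetical store (the Price value of 120 is illustrative):
predict(cr1, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))  # predicted sales, in thousands of units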
f4 = lm(Sales ~ Price + US, data = Carseats)
summary(f4)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
F. The two models fit the data about equally well: both have a multiple R-squared of 0.2393. However, the smaller model has a slightly higher adjusted R-squared (0.2354 vs. 0.2335) and a slightly lower residual standard error (2.469 vs. 2.472), so it is preferred.
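Since the models are nested, this can be checked formally with an F-test; a sketch using base R's anova():
anova(f4, cr1)  # tests whether adding Urban significantly improves the fit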
confint(f4)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
These are the 95% confidence intervals for the coefficients of the smaller model.
par(mfrow = c(2,2))
plot(f4)
The residuals vs. leverage plot shows a few potential outliers, with standardized residuals above 2 or below -2, along with some high-leverage observations.
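As before, these observations can be identified numerically; a sketch using the same cutoffs:
which(abs(rstudent(f4)) > 2)                    # possible outliers
which(hatvalues(f4) > 2 * mean(hatvalues(f4)))  # high-leverage observations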
A. For regression without an intercept, the coefficient estimate for y ~ x is sum(x*y) / sum(x^2), and for x ~ y it is sum(x*y) / sum(y^2). The two estimates are therefore the same exactly when the sum of x^2 equals the sum of y^2.
x = 1:100
sum(x^2)
## [1] 338350
y = 2*x + rnorm(100, sd = .1)
sum(y^2)
## [1] 1353448
ft.y = lm(y ~ x + 0)
ft.x = lm(x ~ y + 0)
summary(ft.y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.183244 -0.065632 -0.000408 0.064997 0.219022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.00003 0.00016 12500 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09307 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.563e+08 on 1 and 99 DF, p-value: < 2.2e-16
summary(ft.x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10949 -0.03248 0.00025 0.03283 0.09165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5e-01 4e-05 12500 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04653 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.563e+08 on 1 and 99 DF, p-value: < 2.2e-16
The code above creates a vector x with 100 observations and then builds y as 2 * x plus a small amount of normally distributed noise. Because sum(y^2) differs from sum(x^2), the two coefficient estimates differ (about 2.0 for y ~ x versus 0.5 for x ~ y). Next, y is instead chosen so that sum(y^2) equals sum(x^2), which should make the two estimates identical.
x = 1:100
sum(x^2)
## [1] 338350
y = 100:1
sum(y^2)
## [1] 338350
ft.y = lm(y ~ x + 0)
ft.x = lm(x ~ y + 0)
summary(ft.y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(ft.x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
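Here the two estimates agree (both 0.5075) because sum(x^2) = sum(y^2) = 338350. The closed-form coefficients confirm this directly:
sum(x * y) / sum(x^2)  # coefficient for y ~ x + 0
sum(x * y) / sum(y^2)  # coefficient for x ~ y + 0; equal because sum(x^2) == sum(y^2)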