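A KNN classifier predicts the class of a given test observation by identifying the K training observations that are nearest to it and assigning the most common class among them (ISLR, pg. 165). KNN regression predicts a quantitative value for a given test observation using a similar method: it averages the responses of the K nearest training observations.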
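To make the difference concrete, here is a minimal sketch of both variants for a single numeric predictor. The helper name `knn_predict` and the toy data are my own illustration, not something from ISLR or a package.

```r
# Hypothetical helper: KNN for one numeric predictor, base R only
knn_predict <- function(x_train, y_train, x0, k = 3, type = c("class", "reg")) {
  type <- match.arg(type)
  # Distance from the test point to every training point
  d <- abs(x_train - x0)
  # Indices of the k nearest training observations
  nn <- order(d)[seq_len(k)]
  if (type == "class") {
    # Classification: majority vote among the neighbours
    votes <- table(y_train[nn])
    names(votes)[which.max(votes)]
  } else {
    # Regression: average response of the neighbours
    mean(y_train[nn])
  }
}

# Toy training data (made up for illustration)
x_train <- c(1, 2, 3, 10, 11, 12)
cls <- c("a", "a", "a", "b", "b", "b")
val <- c(1, 2, 3, 10, 11, 12)

pred_class <- knn_predict(x_train, cls, x0 = 2.5, k = 3, type = "class")
pred_value <- knn_predict(x_train, val, x0 = 2.5, k = 3, type = "reg")
```

The only difference between the two is the last step: a vote for the classifier, a mean for the regression.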
Auto<-read.table('http://www-bcf.usc.edu/~gareth/ISL/Auto.data', header = TRUE)
pairs(Auto[,1:8])
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.2
## corrplot 0.84 loaded
Auto[1:8] <- sapply(Auto[1:8],as.numeric)
AutoCorter<- cor(Auto[,1:8])
corrplot(AutoCorter, method ='square')
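One caveat about the as.numeric conversion above, sketched on toy data. As far as I know, Auto.data marks missing horsepower values with "?", and in R versions before 4.0 read.table turns such a column into a factor by default. Calling as.numeric on a factor returns the internal level codes, not the printed values. The toy column below is made up; reading with na.strings = "?" is the approach I would assume avoids the problem.

```r
# Toy column mimicking a numeric field that uses "?" for missing values
hp_raw <- factor(c("130", "165", "?", "95"))

# as.numeric on a factor gives level codes (small integers), not 130, 165, ...
codes <- as.numeric(hp_raw)

# Going through as.character recovers the values; "?" becomes NA
values <- suppressWarnings(as.numeric(as.character(hp_raw)))

# Declaring the missing-value marker when reading avoids the issue at the source
toy <- read.table(text = "mpg horsepower\n18 130\n15 165\n25 ?\n24 95",
                  header = TRUE, na.strings = "?")
toy <- na.omit(toy)  # drop the row with the missing horsepower
```

If this applies to the data here, the horsepower estimates in the fits that follow should be read with caution.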
lm.fit<-lm(mpg~.-name, data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.629 -2.034 -0.046 1.801 13.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.128e+01 4.259e+00 -4.998 8.78e-07 ***
## cylinders -2.927e-01 3.382e-01 -0.865 0.3874
## displacement 1.603e-02 7.284e-03 2.201 0.0283 *
## horsepower 7.942e-03 6.809e-03 1.166 0.2442
## weight -6.870e-03 5.799e-04 -11.846 < 2e-16 ***
## acceleration 1.539e-01 7.750e-02 1.986 0.0477 *
## year 7.734e-01 4.939e-02 15.661 < 2e-16 ***
## origin 1.346e+00 2.691e-01 5.004 8.52e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.331 on 389 degrees of freedom
## Multiple R-squared: 0.822, Adjusted R-squared: 0.8188
## F-statistic: 256.7 on 7 and 389 DF, p-value: < 2.2e-16
Displacement, acceleration, weight, year, and origin all have p-values < .05 and are significant. Together the predictors account for about 82% of the variation in mpg (multiple R-squared 0.822, adjusted R-squared 0.8188). Cylinders and horsepower are not significant.
The newer the car, the better the gas mileage. The model suggests that for every one-year increase in year, mpg increases by about 0.77 (7.734e-01), holding the other predictors fixed.
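The same reading of the summary table can be done programmatically. This is a sketch on R's built-in mtcars data, which I use only because it ships with R; the predictors chosen are arbitrary.

```r
# Fit a small multiple regression on a built-in data set (illustration only)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Names of the terms whose p-value is below .05
sig_terms <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]

# Share of variance explained, raw and adjusted for the number of predictors
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
```

Each estimate is the expected change in the response for a one-unit increase in that predictor with the others held fixed.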
par(mfrow=c(2,2))
plot(lm.fit)
No observation with unusually high leverage falls outside the Cook’s distance contours. Observation 14 sits in the bottom right of the residuals vs. leverage plot, so it has high leverage, but it is not outside Cook’s distance.
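The quantities behind the residuals vs. leverage plot can also be pulled out directly. This is a sketch on the built-in mtcars data; the leverage cutoff of twice the average is a common rule of thumb that I am assuming here, and the Cook's distance cutoff of 1 matches the outer contour that plot.lm draws by default.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

lev <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)  # Cook's distance of each observation

n_coef <- length(coef(fit))  # number of coefficients, intercept included
n <- nrow(mtcars)

# Rule-of-thumb flag (assumption): leverage above twice the average n_coef / n
high_leverage <- names(lev)[lev > 2 * n_coef / n]

# Observations beyond the outer Cook's distance contour
influential <- names(cook)[cook > 1]
```

A point is worrying only when it combines high leverage with a large residual, which is what Cook's distance summarizes.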
lm.fit<-lm(mpg~.-name+cylinders*acceleration, data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders * acceleration, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6385 -2.0148 0.0181 1.8525 12.5798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.997857 5.777865 -6.576 1.56e-10 ***
## cylinders 2.603860 0.767972 3.391 0.000769 ***
## displacement 0.003954 0.007698 0.514 0.607805
## horsepower 0.007197 0.006672 1.079 0.281377
## weight -0.006220 0.000589 -10.560 < 2e-16 ***
## acceleration 1.077460 0.233595 4.613 5.41e-06 ***
## year 0.800046 0.048790 16.398 < 2e-16 ***
## origin 1.135397 0.268341 4.231 2.90e-05 ***
## cylinders:acceleration -0.178164 0.042618 -4.181 3.60e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.263 on 388 degrees of freedom
## Multiple R-squared: 0.8297, Adjusted R-squared: 0.8262
## F-statistic: 236.3 on 8 and 388 DF, p-value: < 2.2e-16
lm.fit<-lm(mpg~.-name+cylinders*acceleration+displacement*horsepower, data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders * acceleration + displacement *
## horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9058 -2.0680 -0.0239 1.9590 12.5716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.871e+01 5.838e+00 -6.631 1.13e-10 ***
## cylinders 2.537e+00 7.721e-01 3.285 0.00111 **
## displacement 8.395e-03 9.256e-03 0.907 0.36500
## horsepower 2.266e-02 1.909e-02 1.187 0.23592
## weight -6.222e-03 5.892e-04 -10.560 < 2e-16 ***
## acceleration 1.071e+00 2.338e-01 4.580 6.28e-06 ***
## year 8.016e-01 4.884e-02 16.413 < 2e-16 ***
## origin 1.122e+00 2.689e-01 4.173 3.71e-05 ***
## cylinders:acceleration -1.760e-01 4.270e-02 -4.123 4.59e-05 ***
## displacement:horsepower -8.778e-05 1.015e-04 -0.865 0.38780
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.264 on 387 degrees of freedom
## Multiple R-squared: 0.83, Adjusted R-squared: 0.8261
## F-statistic: 210 on 9 and 387 DF, p-value: < 2.2e-16
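The interaction between cylinders and acceleration is significant with a p-value of 3.60e-05 < .05. Adding this interaction made the previously significant displacement variable insignificant (p = 0.608), while the previously insignificant cylinders variable became significant (p = 0.000769). Horsepower was not significant before and remains insignificant.
The interaction between displacement and horsepower is not significant (p = 0.388), and adding it did not change the significance of the individual variables or of the cylinders:acceleration interaction.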
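A note on the formula syntax used for these interactions, sketched on the built-in mtcars data: `a * b` expands to `a + b + a:b`, so writing both main effects and the `*` term is redundant but harmless.

```r
# The * operator adds both main effects and their interaction
fit_star <- lm(mpg ~ wt * hp, data = mtcars)

# Equivalent model written out explicitly with the : operator
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Model without the interaction, for comparison
fit_main <- lm(mpg ~ wt + hp, data = mtcars)

# F-test of whether the interaction term improves the fit
interaction_test <- anova(fit_main, fit_star)
```

With a single added term, the F-test here gives the same p-value as the t-test on the interaction row of the summary.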
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.2
##
## Attaching package: 'ISLR'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
head(Carseats)
lm.10<-lm(Sales~Price+Urban+US, data = Carseats)
summary(lm.10)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price is significant with a p-value of < 2e-16. For every one-unit increase in Price, Sales decreases by about 0.054 (Sales is recorded in thousands of units in the Carseats data, so roughly 54 units). The UrbanYes dummy variable is not significant, with a p-value of 0.936, so whether or not a store is in an urban area proved to be irrelevant. The USYes dummy variable is significant with a p-value of 4.86e-06. For a store in the US, Sales is higher by about 1.20 (roughly 1,200 units) than for a store outside the US, holding the other predictors fixed.
Sales = 13.04 - 0.054 x Price - 0.022 x UrbanYes + 1.20 x USYes, where UrbanYes and USYes equal 1 for Yes and 0 for No.
I can reject the null hypothesis that beta is equal to 0 for Price and for USYes, but not for UrbanYes.
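To show how a Yes/No predictor enters the equation as a 0/1 dummy, here is a sketch on simulated data. The data frame `toy` and its coefficients are made up for illustration and only loosely imitate the Carseats fit.

```r
set.seed(1)
# Simulated stand-in data: a price and a two-level factor
toy <- data.frame(Price = runif(50, 50, 150),
                  US = factor(rep(c("No", "Yes"), 25)))
toy$Sales <- 13 - 0.05 * toy$Price + 1.2 * (toy$US == "Yes") + rnorm(50)

fit <- lm(Sales ~ Price + US, data = toy)
b <- coef(fit)  # named (Intercept), Price, USYes; "No" is the baseline level

# Prediction by hand for Price = 100 and US = Yes (dummy equals 1)
by_hand <- unname(b["(Intercept)"] + b["Price"] * 100 + b["USYes"] * 1)

# The same prediction through predict()
new_obs <- data.frame(Price = 100, US = factor("Yes", levels = levels(toy$US)))
from_predict <- unname(predict(fit, new_obs))
```

For US = No the dummy is 0 and the USYes term drops out, leaving only the intercept and the Price term.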
lm.10<-lm(Sales~Price+US, data = Carseats)
summary(lm.10)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
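Adjusted R-squared is slightly better for model (e) (0.2354 versus 0.2335), and its residual standard error is slightly lower (2.469 versus 2.472). Both models explain only about 24% of the variation in Sales.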
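Nested models like these can also be compared with an F-test. This is a sketch on simulated data in which the second predictor has no real effect by construction; the variable names are my own.

```r
set.seed(2)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)  # x2 plays no role in generating y

full <- lm(y ~ x1 + x2, data = d)
reduced <- lm(y ~ x1, data = d)

# Adjusted R-squared penalizes the extra predictor
adj <- c(full = summary(full)$adj.r.squared,
         reduced = summary(reduced)$adj.r.squared)

# F-test for dropping x2 from the full model
cmp <- anova(reduced, full)
f_stat <- cmp$F[2]
t_stat <- coef(summary(full))["x2", "t value"]
```

When a single term is dropped, the F statistic equals the square of that term's t statistic in the full model, so the two tests agree.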
confint(lm.10)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
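For reference, these intervals are the estimate plus or minus a t quantile times the standard error. Here is a sketch that reproduces confint by hand on the built-in mtcars data.

```r
fit <- lm(mpg ~ wt, data = mtcars)

est <- coef(summary(fit))[, "Estimate"]
se <- coef(summary(fit))[, "Std. Error"]

# Critical value for a 95% interval with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
builtin_ci <- confint(fit)
```

An interval that excludes 0 corresponds to a coefficient that is significant at the 5% level.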
par(mfrow=c(2,2))
plot(lm.10)
The first three diagnostic plots show no obvious problems, and the fourth plot (residuals vs. leverage) gives us our answer: there are no points outside Cook’s distance for model (e).
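The slope from regressing y onto x is cov(x, y)/var(x) and the slope from regressing x onto y is cov(x, y)/var(y), so the two estimates are equal when x and y have the same sample variance. The simplest case is when beta is 1 and there is no noise, so that x is equal to y.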
x=rnorm(100)
y=.696969*x+rnorm(100)
coefficients(lm(x~y))
## (Intercept) y
## 0.02868183 0.54224052
coefficients(lm(y~x))
## (Intercept) x
## 0.07144916 0.71084865
x=rnorm(100)
y=1*x
coefficients(lm(x~y))
## (Intercept) y
## 3.330669e-17 1.000000e+00
coefficients(lm(y~x))
## (Intercept) x
## 3.330669e-17 1.000000e+00
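The two slopes can also match without y being equal to x. In this sketch (my own example, not part of the output above) y is a shuffled copy of x, so the two variables have the same sample variance and the slopes agree.

```r
set.seed(3)
x <- rnorm(100)
y <- sample(x)  # same values in a different order, so var(y) equals var(x)

slope_y_on_x <- unname(coef(lm(y ~ x))[2])
slope_x_on_y <- unname(coef(lm(x ~ y))[2])
```

Both slopes equal cov(x, y) divided by the common variance, which is the sample correlation.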
Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers?
The residual plot suggests the relationship is not linear: the residuals are U-shaped rather than randomly scattered around zero. There appear to be outliers at the smaller fitted values and even more outliers at the larger fitted values.
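A U-shaped residual pattern like this usually means a curved relationship is being fit with a straight line. The sketch below uses simulated data (my own, not the data from this problem) to show the pattern and how adding a squared term removes it.

```r
set.seed(4)
x <- seq(-3, 3, length.out = 100)
y <- 1 + x^2 + rnorm(100, sd = 0.5)  # truly quadratic relationship

linear_fit <- lm(y ~ x)
quad_fit <- lm(y ~ x + I(x^2))

# Residuals of the linear fit still track x^2 (the U-shape);
# residuals of the quadratic fit do not
u_shape_linear <- cor(resid(linear_fit), x^2)
u_shape_quad <- cor(resid(quad_fit), x^2)

# plot(fitted(linear_fit), resid(linear_fit)) would show the U-shape
```

Whether a squared term or another transformation is the right fix for the model above would need to be checked against its own residual plots.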