options(repos = list(CRAN="https://cran.rstudio.com/"))
The KNN classifier predicts the class an observation belongs to by taking a majority vote among the K training observations nearest to it, whereas KNN regression predicts a quantitative value by averaging the responses of the K training observations closest to the prediction point.
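As an illustration of the difference (not part of the assignment), here is a minimal sketch on made-up data; it assumes the class and FNN packages, which provide knn() and knn.reg() respectively.

library(class) # knn(): classification by majority vote
library(FNN)   # knn.reg(): regression by neighbor averaging

set.seed(1)
train.x  <- matrix(rnorm(100), ncol = 2)                    # 50 toy training points
test.x   <- matrix(rnorm(10), ncol = 2)                     # 5 toy test points
train.cl <- factor(sample(c("A", "B"), 50, replace = TRUE)) # toy class labels
train.y  <- rnorm(50)                                       # toy numeric responses

knn(train.x, test.x, cl = train.cl, k = 5)        # majority vote among 5 nearest neighbors
knn.reg(train.x, test.x, y = train.y, k = 5)$pred # average of the 5 nearest responses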
install.packages("ISLR")
library(ISLR)
plot(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
i. Is there a relationship between the predictors and the response? Yes. The F-statistic of 252.4 (p-value < 2.2e-16) indicates that at least one of the predictors is related to mpg, and four of the predictors are individually significant.
ii. Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin all have p-values less than .05, which means they have a statistically significant relationship with mpg.
iii. What does the coefficient for the year variable suggest? There is a statistically significant positive relationship between mpg and the year of the vehicle: holding the other predictors fixed, mpg increases by about 0.75 per model year, with a standard error of about 0.05.
lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
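To see the year coefficient in action, one can compare the model's predictions for two hypothetical cars that differ only in model year; the predictor values below are made up for illustration and are not rows of Auto.

newcars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77),
                      origin = 1)
diff(predict(lm.fit, newcars)) # about 0.75 mpg, the year coefficient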
The residuals vs. fitted plot shows the residuals scattered fairly evenly around zero, with some outliers at fitted values between 30 and 35. The normal Q-Q plot shows that the residuals are approximately normally distributed, except at the tails of the line. The scale-location plot shows a red line that runs mostly horizontally across the plot, which means the assumption of equal variance is likely met. The residuals vs. leverage plot shows no observations falling outside Cook's distance (the dashed red line), which means there are no high-leverage points that unduly influence the regression model.
par(mfrow = c(2,2))
plot(lm.fit)
The interactions that are statistically significant are horsepower:weight and acceleration:year, both with p-values less than .05.
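A broader screen that fits every pairwise interaction at once is also possible before settling on the model below; a sketch (output omitted):

summary(lm(mpg ~ (. - name)^2, data = Auto)) # all pairwise interactions among the predictors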
lm.fit2 <- lm(mpg ~ cylinders * displacement + horsepower * weight + acceleration * year + cylinders * origin, data = Auto)
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + horsepower * weight +
## acceleration * year + cylinders * origin, data = Auto)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6734 -1.5452 -0.0555  1.2971 11.3844 
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.145e+02  1.850e+01   6.188 1.58e-09 ***
## cylinders              -1.088e+00  7.457e-01  -1.459  0.14534    
## displacement           -2.203e-02  1.574e-02  -1.400  0.16232    
## horsepower             -2.266e-01  2.590e-02  -8.749  < 2e-16 ***
## weight                 -9.916e-03  9.050e-04 -10.957  < 2e-16 ***
## acceleration           -6.815e+00  1.149e+00  -5.929 6.84e-09 ***
## year                   -6.300e-01  2.395e-01  -2.630  0.00888 ** 
## origin                 -1.670e+00  1.293e+00  -1.292  0.19713    
## cylinders:displacement  3.223e-03  2.310e-03   1.395  0.16381    
## horsepower:weight       4.927e-05  6.771e-06   7.276 1.98e-12 ***
## acceleration:year       8.773e-02  1.490e-02   5.888 8.62e-09 ***
## cylinders:origin        5.450e-01  2.991e-01   1.822  0.06917 .  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.79 on 380 degrees of freedom
## Multiple R-squared: 0.8759, Adjusted R-squared: 0.8723
## F-statistic: 243.7 on 11 and 380 DF, p-value: < 2.2e-16
Based on the residuals vs. leverage plot, there are no high-leverage points that unduly affect the regression model. The residuals vs. fitted plot shows the residuals scattered in a roughly even band around zero.
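The diagnostics described above come from applying the standard plot method to the interaction model:

par(mfrow = c(2,2))
plot(lm.fit2)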
lm.fit3 <- lm(mpg ~ horsepower + I(horsepower^2), data=Auto)
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lm.fit3)
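Whether the quadratic term earns its keep can also be checked with a partial F-test against the purely linear fit; a sketch:

lm.lin <- lm(mpg ~ horsepower, data = Auto)
anova(lm.lin, lm.fit3) # F-test for the added I(horsepower^2) term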
lm.fit4 <- lm(Sales ~ Price + Urban + US, data=Carseats)
summary(lm.fit4)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
There is a significant negative relationship between Price and Sales. There is no significant relationship between UrbanYes and Sales, which indicates that sales are not affected by whether the store is in an urban location. Lastly, there is a significant positive relationship between USYes and Sales: stores in the US have higher sales.
10c. Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.0434689 - 0.0544588 × Price - 0.0219162 × Urban + 1.2005727 × US + ε, where Urban = 1 if the store is in an urban location and 0 if not, and US = 1 if the store is in the US and 0 if not.
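R's dummy coding of the qualitative variables can be verified directly from the design matrix; a sketch:

head(model.matrix(lm.fit4)) # shows the 0/1 indicator columns UrbanYes and USYes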
10d. For which of the predictors can you reject the null hypothesis H0 : βj = 0?
The null hypothesis can be rejected for Price and USYes, both of which have p-values far below .05.
10e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
smaller.fit <- lm(Sales ~ Price + US, data = Carseats)
summary(smaller.fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
10f. How well do the models in (a) and (e) fit the data?
Both models fit the data to about the same, fairly modest degree: R² is 0.2393 for each, so each explains roughly 24% of the variance in Sales. Dropping Urban leaves R² essentially unchanged and slightly increases the adjusted R² (from 0.2335 to 0.2354), indicating that Urban provides no real improvement to the model fit.
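The two models can also be compared with a formal partial F-test; a sketch:

anova(smaller.fit, lm.fit4) # tests whether Urban adds any explanatory value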
10g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(smaller.fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
10h. Is there evidence of outliers or high leverage observations in the model from (e)?
There is no evidence of any major outliers or high leverage points.
par(mfrow = c(2,2))
plot(smaller.fit)
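Beyond the plots, the same questions can be checked numerically; a sketch, using the common rules of thumb that studentized residuals beyond about ±3 flag outliers and leverage above twice the average (p + 1)/n = 3/400 flags high-leverage points:

sum(abs(rstudent(smaller.fit)) > 3)       # count of outlier candidates
sum(hatvalues(smaller.fit) > 2 * 3 / 400) # count of high-leverage candidates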
## Problem 12
12a. This problem involves simple linear regression without an intercept. Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), the coefficient estimate for the regression of Y onto X is β̂ = (Σ xᵢyᵢ)/(Σ xᵢ²), while for X onto Y it is (Σ xᵢyᵢ)/(Σ yᵢ²). The two estimates are therefore the same exactly when Σ xᵢ² = Σ yᵢ² (for example, when Y = X).
12b. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(15)
x1 <- rnorm(100)
y1 <- 2 * x1 + rnorm(100)
yinx <- lm(y1 ~ x1 + 0)
xiny <- lm(x1 ~ y1 + 0)
summary(yinx)
##
## Call:
## lm(formula = y1 ~ x1 + 0)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58356 -0.68052  0.02862  0.61750  2.32039 
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x1   2.0291     0.1111   18.26   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.105 on 99 degrees of freedom
## Multiple R-squared: 0.7711, Adjusted R-squared: 0.7687
## F-statistic: 333.4 on 1 and 99 DF, p-value: < 2.2e-16
summary(xiny)
##
## Call:
## lm(formula = x1 ~ y1 + 0)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8082 -0.2606  0.0386  0.3557  1.5124 
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## y1  0.38000    0.02081   18.26   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4781 on 99 degrees of freedom
## Multiple R-squared: 0.7711, Adjusted R-squared: 0.7687
## F-statistic: 333.4 on 1 and 99 DF, p-value: < 2.2e-16
12c. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(15)
x2 <- rnorm(100)
y2 <- rnorm(100)
yinx2 <- lm(y2 ~ x2 + 0)
xiny2 <- lm(x2 ~ y2 + 0)
summary(yinx2)
##
## Call:
## lm(formula = y2 ~ x2 + 0)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58356 -0.68052  0.02862  0.61750  2.32039 
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## x2  0.02909    0.11112   0.262    0.794
##
## Residual standard error: 1.105 on 99 degrees of freedom
## Multiple R-squared: 0.0006919, Adjusted R-squared: -0.009402
## F-statistic: 0.06854 on 1 and 99 DF, p-value: 0.794
summary(xiny2)
##
## Call:
## lm(formula = x2 ~ y2 + 0)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.42014 -0.53089  0.02137  0.86571  2.54250 
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## y2  0.02378    0.09084   0.262    0.794
##
## Residual standard error: 0.9989 on 99 degrees of freedom
## Multiple R-squared: 0.0006919, Adjusted R-squared: -0.009402
## F-statistic: 0.06854 on 1 and 99 DF, p-value: 0.794
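Note that with two independent draws, sum(x2^2) and sum(y2^2) are almost never exactly equal, which is why the two estimates above (0.02909 vs. 0.02378) still differ slightly. A construction that makes them identical is to let y be a permutation of x, so the sum of squares is preserved exactly; a sketch:

set.seed(15)
x3 <- rnorm(100)
y3 <- sample(x3, 100)           # a permutation of x3: same values, reshuffled
all.equal(sum(x3^2), sum(y3^2)) # sums of squares agree by construction
coef(lm(y3 ~ x3 + 0))           # the two coefficient estimates now match
coef(lm(x3 ~ y3 + 0))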