library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
data(Auto)
Auto <- read.csv("Auto.csv", stringsAsFactors = TRUE, na.strings = "?")
Auto <- na.omit(Auto)  # drop the 5 rows with missing horsepower, matching data(Auto)
Carefully explain the differences between the KNN classifier and KNN regression methods. The KNN classifier handles a qualitative response: it predicts the class of a test observation as the most common class among its K nearest training observations, assigning each observation to a specific group. KNN regression handles a quantitative response: it fits the data non-parametrically, predicting the value of a test observation as the average of the responses of its K nearest training observations.
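A minimal sketch of the contrast, assuming the class and FNN packages are installed (the simulated data here is purely illustrative):
library(class) # provides knn() for classification
library(FNN)   # provides knn.reg() for regression
set.seed(1)
train.X <- matrix(rnorm(100), ncol = 1)
test.X <- matrix(rnorm(20), ncol = 1)
# Classification: the predicted label is the majority class among the k neighbors
cl <- factor(ifelse(train.X > 0, "pos", "neg"))
class::knn(train.X, test.X, cl, k = 3)
# Regression: the predicted value is the average response among the k neighbors
y <- as.vector(2 * train.X) + rnorm(100, sd = 0.1)
FNN::knn.reg(train.X, test.X, y, k = 3)$pred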
This question involves the use of multiple linear regression on the Auto data set.
A)Produce a scatterplot matrix which includes all of the variables in the data set.
library("ISLR")
pairs(Auto)
B)Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[, names(Auto) !="name"])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
C)Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
model = lm(mpg ~. -name, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response? Yes: the F-statistic of 252.4 has a p-value below 2.2e-16, so at least one predictor is related to mpg, although not every individual predictor has a statistically significant effect. The R-squared value tells us that about 82% of the variance in the response is explained by the predictors.
Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin (each with a p-value below 0.01).
What does the coefficient for the year variable suggest? Holding all the other predictors constant, mpg increases with each passing model year, by about 0.75 mpg per year.
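A quick check of that estimate and its 95% confidence interval (a sketch using the fit above):
coef(model)["year"]    # point estimate of the year effect
confint(model, "year") # 95% confidence interval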
D)Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(model)
The first plot (residuals vs. fitted) shows a non-linear relationship between the response and the predictors; the second (normal Q-Q) shows that the residuals are roughly normally distributed but right-skewed; the third (scale-location) shows that the constant-variance-of-errors assumption does not hold for this model; the fourth (residuals vs. leverage) shows no clearly influential points, although one observation stands out as a potential high-leverage point (labeled 14 on the plot).
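One way to confirm that potential high-leverage observation numerically (a sketch; hatvalues() is base R):
lev <- hatvalues(model)
which.max(lev)        # index of the highest-leverage observation
max(lev)              # its leverage, to compare against the average (p + 1)/n
(7 + 1) / length(lev) # average leverage with p = 7 predictors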
E)Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
model = lm(mpg ~.-name+displacement:weight, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
model = lm(mpg ~.-name+displacement:cylinders+displacement:weight+acceleration:horsepower, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight +
## acceleration:horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3344 -1.6333 0.0188 1.4740 11.9723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.725e+01 5.328e+00 -3.237 0.00131 **
## cylinders 6.354e-01 6.106e-01 1.041 0.29870
## displacement -6.805e-02 1.337e-02 -5.088 5.68e-07 ***
## horsepower 6.026e-02 2.601e-02 2.317 0.02105 *
## weight -8.864e-03 1.097e-03 -8.084 8.43e-15 ***
## acceleration 6.257e-01 1.592e-01 3.931 0.00010 ***
## year 7.845e-01 4.470e-02 17.549 < 2e-16 ***
## origin 4.668e-01 2.595e-01 1.799 0.07284 .
## cylinders:displacement -1.337e-03 2.726e-03 -0.490 0.62415
## displacement:weight 2.071e-05 3.638e-06 5.694 2.49e-08 ***
## horsepower:acceleration -7.467e-03 1.784e-03 -4.185 3.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.905 on 381 degrees of freedom
## Multiple R-squared: 0.865, Adjusted R-squared: 0.8615
## F-statistic: 244.2 on 10 and 381 DF, p-value: < 2.2e-16
model = lm(mpg ~.-name+displacement:cylinders+displacement:weight+year:origin+acceleration:horsepower, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight +
## year:origin + acceleration:horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6504 -1.6476 0.0381 1.4254 12.7893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.287e+00 9.074e+00 0.583 0.560429
## cylinders 4.249e-01 6.079e-01 0.699 0.485011
## displacement -7.322e-02 1.334e-02 -5.490 7.38e-08 ***
## horsepower 5.252e-02 2.586e-02 2.031 0.042913 *
## weight -8.689e-03 1.086e-03 -7.998 1.54e-14 ***
## acceleration 5.796e-01 1.582e-01 3.665 0.000283 ***
## year 5.116e-01 9.976e-02 5.129 4.66e-07 ***
## origin -1.220e+01 4.161e+00 -2.933 0.003560 **
## cylinders:displacement -4.368e-04 2.712e-03 -0.161 0.872156
## displacement:weight 1.992e-05 3.608e-06 5.522 6.21e-08 ***
## year:origin 1.630e-01 5.341e-02 3.051 0.002440 **
## horsepower:acceleration -6.735e-03 1.781e-03 -3.781 0.000181 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.874 on 380 degrees of freedom
## Multiple R-squared: 0.8683, Adjusted R-squared: 0.8644
## F-statistic: 227.7 on 11 and 380 DF, p-value: < 2.2e-16
model = lm(mpg ~.-name-cylinders-acceleration+year:origin+displacement:weight+
             acceleration:horsepower+acceleration:weight, data=Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name - cylinders - acceleration + year:origin +
##     displacement:weight + acceleration:horsepower + acceleration:weight,
##     data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5074 -1.6324 0.0599 1.4577 12.7376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.868e+01 7.796e+00 2.396 0.017051 *
## displacement -7.794e-02 9.026e-03 -8.636 < 2e-16 ***
## horsepower 8.719e-02 3.167e-02 2.753 0.006183 **
## weight -1.350e-02 1.287e-03 -10.490 < 2e-16 ***
## year 4.911e-01 9.825e-02 4.998 8.83e-07 ***
## origin -1.262e+01 4.109e+00 -3.071 0.002288 **
## year:origin 1.686e-01 5.277e-02 3.195 0.001516 **
## displacement:weight 2.253e-05 2.184e-06 10.312 < 2e-16 ***
## horsepower:acceleration -9.164e-03 2.222e-03 -4.125 4.56e-05 ***
## weight:acceleration 2.784e-04 7.087e-05 3.929 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.861 on 382 degrees of freedom
## Multiple R-squared: 0.8687, Adjusted R-squared: 0.8656
## F-statistic: 280.8 on 9 and 382 DF, p-value: < 2.2e-16
After reviewing all of the models tested, the last one is the only model in which every term is statistically significant. Its R-squared estimates that about 87% of the variance in the response is explained by this set of predictors (main effects and interactions); no higher value was obtained in these trials.
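A compact way to put the candidate fits side by side (a sketch that refits the first and last interaction models under new names, since model was overwritten at each step):
fit.first <- lm(mpg ~ . - name + displacement:weight, data = Auto)
fit.last <- lm(mpg ~ . - name - cylinders - acceleration + year:origin +
                 displacement:weight + acceleration:horsepower +
                 acceleration:weight, data = Auto)
# Adjusted R-squared penalizes extra terms, so it is the fairer comparison
c(first = summary(fit.first)$adj.r.squared,
  last = summary(fit.last)$adj.r.squared)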
This question should be answered using the Carseats data set.
A)Fit a multiple regression model to predict Sales using Price, Urban, and US.
library("ISLR")
?Carseats
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
lm.fit = lm(Sales ~ Price+Urban+US, data= Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
B)Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative! Sales are recorded in thousands of units, so for each $1 increase in Price, sales decrease by about 0.0545 thousand, i.e. roughly 54 car seats, holding the other predictors fixed. The Urban coefficient is not statistically significant, so whether or not the store is in an urban location has no measurable effect on sales. A store in the US sells about 1.2 thousand (roughly 1,200) more car seats on average than a comparable store abroad.
C)Write out the model in equation form, being careful to handle the qualitative variables properly. Sales = 13.0434689 − 0.0544588 × Price − 0.0219162 × Urban + 1.2005727 × US + ε, with Urban = 1 if the location is urban and 0 if not, and US = 1 if the store is in the US and 0 if not.
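A quick check that this equation reproduces the fitted values (a sketch using the first row of Carseats):
b <- coef(lm.fit)
obs <- Carseats[1, ]
b["(Intercept)"] + b["Price"] * obs$Price +
  b["UrbanYes"] * (obs$Urban == "Yes") + b["USYes"] * (obs$US == "Yes")
predict(lm.fit, obs) # should agree with the hand computation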
D)For which of the predictors can you reject the null hypothesis H0 : βj = 0? Price and US, whose p-values are far below 0.05. We cannot reject the null hypothesis for Urban, whose p-value of 0.936 is not statistically significant.
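The p-values behind that answer can be pulled directly from the summary table (a sketch):
summary(lm.fit)$coefficients[, "Pr(>|t|)"]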
E)On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit2 = lm(Sales ~ Price+US, data= Carseats)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
F)How well do the models in (a) and (e) fit the data? Comparing their respective R-squared values in the summary tables, both models fit the data poorly: each explains only about 24% of the variance in Sales.
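Because the model in (e) is nested inside the model in (a), the two can also be compared with a direct F-test (a sketch):
anova(lm.fit2, lm.fit) # tests whether adding Urban improves the fit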
G)Using the model from (e), obtain 95 % confidence intervals for the coefficient(s)
confint(lm.fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
H)Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(lm.fit2)
The normal Q-Q plot and the residuals vs. leverage plot show no evidence of such points.
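Those visual impressions can be double-checked numerically (a sketch; a studentized residual beyond ±3 would flag an outlier, and leverage far above the average (p + 1)/n would flag a high-leverage point):
sum(abs(rstudent(lm.fit2)) > 3) # count of large studentized residuals
max(hatvalues(lm.fit2))         # largest leverage
(2 + 1) / nrow(Carseats)        # average leverage with p = 2 predictors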
This problem involves simple linear regression without an intercept.
A)Recall that the coefficient estimate $\hat{\beta}$ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X? The coefficient estimate for the regression of Y onto X is $\hat{\beta} = \sum_i x_i y_i \big/ \sum_j x_j^2$; the coefficient estimate for the regression of X onto Y is $\hat{\beta}' = \sum_i x_i y_i \big/ \sum_j y_j^2$. The two estimates are the same if and only if $\sum_j x_j^2 = \sum_j y_j^2$.
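A generic numeric check of this condition, ahead of the specific examples below (a sketch: any y rescaled so that its sum of squares matches that of x will do):
set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
y <- y * sqrt(sum(x^2) / sum(y^2))          # force sum(y^2) == sum(x^2)
c(coef(lm(y ~ x + 0)), coef(lm(x ~ y + 0))) # the two slopes now agree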
B)Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.1)
sum(y^2)
## [1] 1353606
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223590 -0.062560 0.004426 0.058507 0.230926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0001514 0.0001548 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115418 -0.029231 -0.002186 0.031322 0.111795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.87e-05 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
C)Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08