KNN (k-nearest neighbors) is a non-parametric method: it makes no assumptions about the underlying data distribution. For classification, an object is assigned to the class most common among its k nearest neighbors (a majority vote). For regression, the predicted value of the object is the average (or median) of the values of its k nearest neighbors.
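As a rough illustration of both uses, here is a minimal base-R sketch. The helper `knn_predict` and the toy data are hypothetical names made up for this example, and Euclidean distance is an assumed choice of metric.

```r
# Minimal k-nearest-neighbours sketch (hypothetical helper, Euclidean distance assumed).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  new_x <- as.numeric(new_x)
  # Distance from the new point to every training point
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  neighbours <- train_y[order(d)[seq_len(k)]]
  if (is.factor(neighbours)) {
    # Classification: majority vote among the k nearest neighbours
    votes <- table(neighbours)
    names(votes)[which.max(votes)]
  } else {
    # Regression: average of the k nearest neighbours' values
    mean(neighbours)
  }
}

# Toy data: two well-separated clusters
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels  <- factor(c("a", "a", "a", "b", "b", "b"))
values  <- c(1, 2, 3, 10, 11, 12)

knn_class <- knn_predict(train_x, labels, c(0.5, 0.5), k = 3)  # majority class: "a"
knn_value <- knn_predict(train_x, values, c(5.5, 5.5), k = 3)  # mean of 10, 11, 12
print(knn_class)
print(knn_value)
```

Swapping `mean` for `median` in the regression branch gives the median variant mentioned above.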
This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.2
attach(Auto)
summary(Auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
##
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## amc matador : 5
## ford pinto : 5
## toyota corolla : 5
## amc gremlin : 4
## amc hornet : 4
## chevrolet chevette: 4
## (Other) :365
a)
plot(Auto)
b)
cor(subset(Auto, select = -name))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
c)
fit.lm<-lm(mpg~.-name,data=Auto)
summary(fit.lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
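There is a relationship between the predictors and the response: the F-statistic is 252.4 with a p-value below 2.2e-16.
The response has statistically significant relationships (p < 0.05) with the following predictors: displacement, weight, year, and origin.
The positive coefficient for year (0.75) suggests that later model-year cars have higher mpg: each additional model year is associated with roughly 0.75 more mpg, holding the other predictors fixed.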
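The reading of the coefficient table above can also be done programmatically. The sketch below assumes the ISLR package is installed and uses a 0.05 cutoff, which is a conventional choice rather than something the exercise prescribes.

```r
library(ISLR)  # provides the Auto data set

fit.lm <- lm(mpg ~ . - name, data = Auto)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit.lm))

# Predictors (intercept excluded) whose p-value is below the assumed 0.05 cutoff
pvals <- coefs[-1, "Pr(>|t|)"]
significant <- names(pvals)[pvals < 0.05]
print(significant)

# Estimated change in mpg per additional model year, other predictors held fixed
year_effect <- coefs["year", "Estimate"]
print(year_effect)
```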
d)
plot(fit.lm)
The residuals-versus-fitted plot shows evidence of non-linearity.
Observation 14 has high leverage.
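The diagnostic plots can be backed up with numbers. The following sketch assumes the ISLR package is installed. The thresholds (three times the average leverage, studentized residuals beyond 3 in absolute value) are common rules of thumb, not values taken from the exercise.

```r
library(ISLR)

fit.lm <- lm(mpg ~ . - name, data = Auto)

# Leverage (hat) values, named by the row names of Auto
lev <- hatvalues(fit.lm)

# Average leverage is (number of coefficients) / (number of observations)
avg_lev <- length(coef(fit.lm)) / nrow(Auto)

# Observations with the largest leverage, and those above the assumed 3x cutoff
top_lev <- head(sort(lev, decreasing = TRUE), 3)
high_lev <- names(lev)[lev > 3 * avg_lev]
print(top_lev)
print(high_lev)

# Studentized residuals; |value| > 3 is an assumed cutoff for possible outliers
stud <- rstudent(fit.lm)
possible_outliers <- names(stud)[abs(stud) > 3]
print(possible_outliers)
```

The names printed are row names of `Auto`, which is also how `plot(fit.lm)` labels points.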
e)
fit.lm0 <- lm(mpg~displacement+weight+year+origin, data=Auto)
fit.lm1 <- lm(mpg~displacement+weight+year*origin, data=Auto)
fit.lm2 <- lm(mpg~displacement+origin+year*weight, data=Auto)
fit.lm3 <- lm(mpg~year+origin+displacement*weight, data=Auto)
summary(fit.lm0)
##
## Call:
## lm(formula = mpg ~ displacement + weight + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8102 -2.1129 -0.0388 1.7725 13.2085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.861e+01 4.028e+00 -4.620 5.25e-06 ***
## displacement 5.588e-03 4.768e-03 1.172 0.242
## weight -6.575e-03 5.571e-04 -11.802 < 2e-16 ***
## year 7.714e-01 4.981e-02 15.486 < 2e-16 ***
## origin 1.226e+00 2.670e-01 4.593 5.92e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.346 on 387 degrees of freedom
## Multiple R-squared: 0.8181, Adjusted R-squared: 0.8162
## F-statistic: 435.1 on 4 and 387 DF, p-value: < 2.2e-16
summary(fit.lm1)
##
## Call:
## lm(formula = mpg ~ displacement + weight + year * origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7541 -1.8722 -0.0936 1.6900 12.4650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.927e+00 8.873e+00 0.893 0.372229
## displacement 1.551e-03 4.859e-03 0.319 0.749735
## weight -6.394e-03 5.526e-04 -11.571 < 2e-16 ***
## year 4.313e-01 1.130e-01 3.818 0.000157 ***
## origin -1.449e+01 4.707e+00 -3.079 0.002225 **
## year:origin 2.023e-01 6.047e-02 3.345 0.000904 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.303 on 386 degrees of freedom
## Multiple R-squared: 0.8232, Adjusted R-squared: 0.8209
## F-statistic: 359.5 on 5 and 386 DF, p-value: < 2.2e-16
summary(fit.lm2)
##
## Call:
## lm(formula = mpg ~ displacement + origin + year * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9402 -1.8736 -0.0966 1.5924 12.2125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.076e+02 1.290e+01 -8.339 1.34e-15 ***
## displacement -4.020e-04 4.558e-03 -0.088 0.929767
## origin 9.116e-01 2.547e-01 3.579 0.000388 ***
## year 1.962e+00 1.716e-01 11.436 < 2e-16 ***
## weight 2.605e-02 4.552e-03 5.722 2.12e-08 ***
## year:weight -4.305e-04 5.967e-05 -7.214 2.89e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.145 on 386 degrees of freedom
## Multiple R-squared: 0.8397, Adjusted R-squared: 0.8376
## F-statistic: 404.4 on 5 and 386 DF, p-value: < 2.2e-16
summary(fit.lm3)
##
## Call:
## lm(formula = mpg ~ year + origin + displacement * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6119 -1.7290 -0.0115 1.5609 12.5584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.007e+00 3.798e+00 -2.108 0.0357 *
## year 8.194e-01 4.518e-02 18.136 < 2e-16 ***
## origin 3.567e-01 2.574e-01 1.386 0.1666
## displacement -7.148e-02 9.176e-03 -7.790 6.27e-14 ***
## weight -1.054e-02 6.530e-04 -16.146 < 2e-16 ***
## displacement:weight 2.104e-05 2.214e-06 9.506 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.016 on 386 degrees of freedom
## Multiple R-squared: 0.8526, Adjusted R-squared: 0.8507
## F-statistic: 446.5 on 5 and 386 DF, p-value: < 2.2e-16
All three interaction terms (year:origin, year:weight, and displacement:weight) are statistically significant.
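Another way to check an interaction is a partial F-test between the model without it and the model with it. This is a sketch, assuming the ISLR package is installed, that compares the main-effects model with the displacement:weight model.

```r
library(ISLR)

# Main-effects model and the same model plus the displacement:weight interaction
fit.lm0 <- lm(mpg ~ displacement + weight + year + origin, data = Auto)
fit.lm3 <- lm(mpg ~ year + origin + displacement * weight, data = Auto)

# Partial F-test: does adding the interaction reduce the residual sum of squares?
cmp <- anova(fit.lm0, fit.lm3)
print(cmp)

# With a single added term, F equals the square of that term's t value
f_interaction <- cmp[2, "F"]
p_interaction <- cmp[2, "Pr(>F)"]
```

Because the two models differ by one term, this test agrees with the t-test on the interaction coefficient in `summary(fit.lm3)`.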
f)
fit.lm4 <- lm(mpg~poly(displacement,3)+weight+year+origin, data=Auto)
fit.lm5 <- lm(mpg~displacement+I(log(weight))+year+origin, data=Auto)
fit.lm6 <- lm(mpg~displacement+I(weight^2)+year+origin, data=Auto)
summary(fit.lm4)
##
## Call:
## lm(formula = mpg ~ poly(displacement, 3) + weight + year + origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8131 -1.8012 0.0788 1.5566 12.3181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.342e+01 3.802e+00 -6.160 1.84e-09 ***
## poly(displacement, 3)1 -1.701e+01 9.820e+00 -1.732 0.0840 .
## poly(displacement, 3)2 2.840e+01 3.610e+00 7.866 3.74e-14 ***
## poly(displacement, 3)3 -7.996e+00 3.164e+00 -2.527 0.0119 *
## weight -5.285e-03 5.419e-04 -9.753 < 2e-16 ***
## year 8.189e-01 4.660e-02 17.572 < 2e-16 ***
## origin 2.422e-01 2.761e-01 0.877 0.3810
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.102 on 385 degrees of freedom
## Multiple R-squared: 0.8445, Adjusted R-squared: 0.842
## F-statistic: 348.4 on 6 and 385 DF, p-value: < 2.2e-16
summary(fit.lm5)
##
## Call:
## lm(formula = mpg ~ displacement + I(log(weight)) + year + origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7136 -1.9214 0.0447 1.5790 12.9864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 131.274483 11.082986 11.845 < 2e-16 ***
## displacement 0.007711 0.004052 1.903 0.057810 .
## I(log(weight)) -21.584745 1.451851 -14.867 < 2e-16 ***
## year 0.804835 0.046532 17.296 < 2e-16 ***
## origin 0.836143 0.250485 3.338 0.000925 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.113 on 387 degrees of freedom
## Multiple R-squared: 0.8425, Adjusted R-squared: 0.8409
## F-statistic: 517.7 on 4 and 387 DF, p-value: < 2.2e-16
summary(fit.lm6)
##
## Call:
## lm(formula = mpg ~ displacement + I(weight^2) + year + origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0988 -2.2549 -0.1057 1.8704 13.4702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.609e+01 4.349e+00 -5.999 4.56e-09 ***
## displacement -9.114e-03 5.118e-03 -1.781 0.0757 .
## I(weight^2) -7.068e-07 9.075e-08 -7.789 6.28e-14 ***
## year 7.336e-01 5.380e-02 13.635 < 2e-16 ***
## origin 1.488e+00 2.900e-01 5.132 4.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.628 on 387 degrees of freedom
## Multiple R-squared: 0.7861, Adjusted R-squared: 0.7839
## F-statistic: 355.7 on 4 and 387 DF, p-value: < 2.2e-16
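Among the transformed terms, the quadratic component of poly(displacement, 3) is strongly significant (t = 7.87). The polynomial and log(weight) models fit slightly better than the untransformed model (adjusted R-squared 0.842 and 0.841 versus 0.816), while replacing weight with weight^2 fits worse (0.784).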
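To compare the transformations side by side, the adjusted R-squared of each fit can be collected in one vector. This sketch assumes the ISLR package is installed. The list names (`base`, `poly3`, `logw`, `sqw`) are labels chosen for this example.

```r
library(ISLR)

# The untransformed model and the three transformed variants
fits <- list(
  base  = lm(mpg ~ displacement + weight + year + origin, data = Auto),
  poly3 = lm(mpg ~ poly(displacement, 3) + weight + year + origin, data = Auto),
  logw  = lm(mpg ~ displacement + I(log(weight)) + year + origin, data = Auto),
  sqw   = lm(mpg ~ displacement + I(weight^2) + year + origin, data = Auto)
)

# Adjusted R-squared penalises extra terms, so it is a fairer comparison than R-squared
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(round(sort(adj_r2, decreasing = TRUE), 4))
```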
This question should be answered using the Carseats data set.
library(ISLR)
attach(Carseats)
a)
fit<-lm(Sales~Price+Urban+US)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b) Based on the fitted model, Price and US are significant predictors of Sales. Sales is measured in thousands of units, so each $1 increase in price is associated with a decrease of about 54 units sold (0.054 thousand), holding the other predictors fixed. Stores in the US sell about 1,200 more units on average than stores outside the US. There is no evidence that Urban is related to Sales (p = 0.936).
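c) The equation form: Sales = 13.043469 - 0.054459 × Price - 0.021916 × UrbanYes + 1.200573 × USYes, where UrbanYes and USYes equal 1 if the store is urban or in the US, respectively, and 0 otherwise.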
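The fitted coefficients can be turned into a prediction by hand and checked against `predict()`. In this sketch the store (price 100, urban, in the US) is a made-up example, and the ISLR package is assumed to be installed.

```r
library(ISLR)

fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
b <- coef(fit)

# Hypothetical store: price of 100, urban location, in the US
price <- 100
by_hand <- b[["(Intercept)"]] + b[["Price"]] * price +
  b[["UrbanYes"]] * 1 + b[["USYes"]] * 1

# The same prediction through predict(); factor levels are taken from Carseats
new_store <- data.frame(
  Price = price,
  Urban = factor("Yes", levels = levels(Carseats$Urban)),
  US    = factor("Yes", levels = levels(Carseats$US))
)
by_predict <- unname(predict(fit, newdata = new_store))

print(c(by_hand = by_hand, by_predict = by_predict))  # in thousands of units
```

The intercept has to be included in the hand calculation. Leaving it out shifts every prediction by about 13 (thousand units).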
d) We can reject the null hypothesis (that the coefficient is zero) for Price and US, but not for Urban.
e)
fit<-lm(Sales~Price+US)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
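f) Neither model fits the data well: both explain only about 24% of the variance in Sales (R-squared of 0.2393; adjusted R-squared of 0.2335 and 0.2354).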
g)
fit<-lm(Sales~Price+US)
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
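The intervals reported by `confint()` can be reproduced from the coefficient table as estimate ± t-quantile × standard error. This is a sketch, assuming the ISLR package is installed. `manual_ci` is a name chosen for this example.

```r
library(ISLR)

fit <- lm(Sales ~ Price + US, data = Carseats)
est <- coef(summary(fit))

# 97.5% quantile of the t distribution with the residual degrees of freedom
tcrit <- qt(0.975, df = df.residual(fit))

# 95% confidence interval: estimate -/+ tcrit * standard error
manual_ci <- cbind(
  lower = est[, "Estimate"] - tcrit * est[, "Std. Error"],
  upper = est[, "Estimate"] + tcrit * est[, "Std. Error"]
)
print(manual_ci)
print(confint(fit))
```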
h)
plot(fit)
No evidence of strong outliers.
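a) For regression without an intercept, the coefficient estimate for y onto x is \(\sum x_{i}y_{i}/\sum x_{i}^2\) and for x onto y it is \(\sum x_{i}y_{i}/\sum y_{i}^2\). The two are equal when the denominators are equal, \(\sum x_{i}^2=\sum y_{i}^2\); \(x_{i}=y_{i}\) is a special case.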
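The no-intercept coefficient formulas behind this condition can be checked numerically. The sketch below uses simulated data and fits the models with `+ 0` to drop the intercept. The variable names are chosen for this example.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept estimates: shared numerator, different denominators
b_yx <- sum(x * y) / sum(x^2)  # y onto x
b_xy <- sum(x * y) / sum(y^2)  # x onto y

# The same estimates from lm(), with "+ 0" removing the intercept
lm_yx <- unname(coef(lm(y ~ x + 0)))
lm_xy <- unname(coef(lm(x ~ y + 0)))

print(c(b_yx = b_yx, lm_yx = lm_yx))
print(c(b_xy = b_xy, lm_xy = lm_xy))
```

Since the numerators are identical, the two estimates can only coincide (for a non-zero numerator) when the sums of squares match.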
b)
set.seed(1)
x <- rnorm(100)
y <- 2*x + rnorm(100)
fit.lmY <- lm(y ~ x)
fit.lmX <- lm(x ~ y)
summary(fit.lmY)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
summary(fit.lmX)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
The two slope estimates differ: 1.99894 for y onto x versus 0.38942 for x onto y.
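Although the two slopes differ, they are linked: in simple regression with an intercept, their product equals the shared R-squared (1.99894 × 0.38942 ≈ 0.7784). A short sketch that checks this on the same simulated data:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Slopes from the two with-intercept regressions
slope_yx <- coef(lm(y ~ x))[["x"]]
slope_xy <- coef(lm(x ~ y))[["y"]]

# R-squared is the same for both fits and equals the squared correlation
r2 <- summary(lm(y ~ x))$r.squared

print(c(product = slope_yx * slope_xy, r2 = r2, cor2 = cor(x, y)^2))
```

This explains why both summaries report identical R-squared, t, and F values despite different coefficients.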
c)
set.seed(1)
x <- rnorm(100, mean=1000, sd=0.1)
y <- rnorm(100, mean=1000, sd=0.1)
fit.lmY <- lm(y ~ x)
fit.lmX <- lm(x ~ y)
summary(fit.lmY)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18768 -0.06138 -0.01395 0.05394 0.23462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1001.05662 107.72820 9.292 4.16e-15 ***
## x -0.00106 0.10773 -0.010 0.992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09628 on 98 degrees of freedom
## Multiple R-squared: 9.887e-07, Adjusted R-squared: -0.0102
## F-statistic: 9.689e-05 on 1 and 98 DF, p-value: 0.9922
summary(fit.lmX)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.232416 -0.060361 0.000536 0.058305 0.229316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.001e+03 9.472e+01 10.57 <2e-16 ***
## y -9.324e-04 9.472e-02 -0.01 0.992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09028 on 98 degrees of freedom
## Multiple R-squared: 9.887e-07, Adjusted R-squared: -0.0102
## F-statistic: 9.689e-05 on 1 and 98 DF, p-value: 0.9922
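The two slope estimates are not equal here: -0.00106 for y onto x and -0.00093 for x onto y. Both are close to zero because x and y are generated independently, and they differ slightly because the sample variances of x and y differ. Exact equality requires \(\sum x_{i}^2=\sum y_{i}^2\) in the no-intercept case (or equal sample variances when an intercept is included), for example when y is a permutation of x.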
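One simple way to get matching coefficients is to make y a random permutation of x, so both vectors contain the same values and therefore have the same sum of squares, mean, and variance. This is a sketch of that idea and not the only possible construction.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)  # same values as x in a random order

# Without an intercept: shared numerator and equal denominators
b_yx <- unname(coef(lm(y ~ x + 0)))
b_xy <- unname(coef(lm(x ~ y + 0)))

# With an intercept: equal sample variances give equal slopes as well
s_yx <- coef(lm(y ~ x))[["x"]]
s_xy <- coef(lm(x ~ y))[["y"]]

print(c(b_yx = b_yx, b_xy = b_xy))
print(c(s_yx = s_yx, s_xy = s_xy))
```

Each pair agrees up to floating-point rounding, since the sums are accumulated in a different order for x and y.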