KNN classifier: solves classification (qualitative) problems by assigning an observation x_0 to the class most common among its K nearest neighbors.
KNN regression: solves regression (quantitative) problems by identifying the K observations closest to x_0 and estimating f(x_0) as the average of their responses.
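The two definitions above can be sketched in a few lines of base R. This is only an illustration: `knn_predict` is a hypothetical helper written for this note (not a function from any package), it uses Euclidean distance, and the toy data are made up.

```r
# Hypothetical helper: K-nearest-neighbour prediction for one query point x0.
# train_x: numeric matrix (rows = observations)
# train_y: numeric vector (regression) or factor (classification)
knn_predict <- function(train_x, train_y, x0, k = 3) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from every training row to x0
  dists <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  nearest <- order(dists)[seq_len(k)]
  if (is.factor(train_y)) {
    # Classification: majority vote among the k neighbours
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  } else {
    # Regression: average response of the k neighbours
    mean(train_y[nearest])
  }
}

# Toy data, invented for illustration only
x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
y_num <- c(1, 2, 3, 10, 11, 12)
y_cls <- factor(c("low", "low", "low", "high", "high", "high"))

knn_reg <- knn_predict(x, y_num, x0 = 2.2, k = 3)   # averages y at x = 1, 2, 3
knn_cls <- knn_predict(x, y_cls, x0 = 10.6, k = 3)  # votes among x = 10, 11, 12
print(knn_reg)
print(knn_cls)
```

The same neighbour search serves both tasks; only the final step (vote versus average) differs.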
# 9.
library(ISLR)  # assumption: Auto and Carseats are taken from the ISLR package
data(Auto)
pairs(Auto)
cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
lm.auto <- lm(mpg ~ . - name, data = Auto)
summary(lm.auto)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
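The p-value of the F-statistic (< 2.2e-16) is far below 0.05, so we reject the null hypothesis that all coefficients are zero and conclude that there is a relationship between the predictors and the response.
The predictors with a statistically significant relationship to the response are displacement, weight, year, and origin.
When year increases by one (all other variables held constant), mpg increases by about 0.75 on average.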
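As a check on which predictors count as significant, the t-values and p-values in the coefficient table can be recomputed from the estimates and standard errors alone. The sketch below copies those numbers from the `summary(lm.auto)` output and assumes a 0.05 significance level; the variable names are chosen here for illustration.

```r
# Estimates and standard errors copied from the summary(lm.auto) output
est <- c(cylinders = -0.493376, displacement = 0.019896, horsepower = -0.016951,
         weight = -0.006474, acceleration = 0.080576, year = 0.750773,
         origin = 1.426141)
se <- c(0.323282, 0.007515, 0.013787, 0.000652, 0.098845, 0.050973, 0.278136)
df_resid <- 384  # residual degrees of freedom reported by summary()

t_value <- est / se
p_value <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided p-value

# Predictors whose p-value falls below the assumed 0.05 threshold
significant <- names(p_value)[p_value < 0.05]
print(round(cbind(t_value, p_value), 4))
print(significant)
```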
par(mfrow = c(2,2))
plot(lm.auto)
The diagnostic plots show a few identifiable outliers on the right side, which indicates slight skewness in the residuals.
Lm.auto.interact <- lm(mpg ~ cylinders * displacement + displacement * weight + cylinders * weight + displacement * horsepower + weight * horsepower, data = Auto[, 1:8])
summary(Lm.auto.interact)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight + cylinders * weight + displacement * horsepower +
## weight * horsepower, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2352 -2.1592 -0.3998 1.8286 17.1431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.860e+01 6.608e+00 8.868 < 2e-16 ***
## cylinders -8.234e-01 2.074e+00 -0.397 0.69157
## displacement -8.021e-02 4.045e-02 -1.983 0.04805 *
## weight -5.139e-03 2.944e-03 -1.746 0.08166 .
## horsepower -1.636e-01 6.146e-02 -2.662 0.00809 **
## cylinders:displacement -1.273e-03 5.574e-03 -0.228 0.81947
## displacement:weight 1.985e-06 1.026e-05 0.193 0.84676
## cylinders:weight 5.441e-04 8.071e-04 0.674 0.50062
## displacement:horsepower 5.095e-04 1.732e-04 2.941 0.00347 **
## weight:horsepower -1.225e-05 2.587e-05 -0.474 0.63604
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.873 on 382 degrees of freedom
## Multiple R-squared: 0.7594, Adjusted R-squared: 0.7538
## F-statistic: 134 on 9 and 382 DF, p-value: < 2.2e-16
The interaction between displacement and horsepower is significant (p = 0.00347); none of the other interaction terms is significant at the 0.05 level.
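It may help to see how R expands the `*` operator in a model formula: `a * b` stands for both main effects plus their interaction, while `a:b` is the interaction alone. The sketch below inspects the term labels only, so no data are needed; `y`, `a`, and `b` are placeholder names.

```r
# Term labels generated by the * operator: main effects plus interaction
star_terms <- attr(terms(y ~ a * b), "term.labels")

# The : operator gives the interaction only
colon_terms <- attr(terms(y ~ a:b), "term.labels")

# The formula used for the interaction model above
f_auto <- mpg ~ cylinders * displacement + displacement * weight +
  cylinders * weight + displacement * horsepower + weight * horsepower
auto_terms <- attr(terms(f_auto), "term.labels")

print(star_terms)
print(colon_terms)
print(auto_terms)
```

Repeated main effects are collapsed, which is why the summary lists four main effects and five interaction terms.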
displacement
par(mfrow = c(2,2))
plot(log(Auto$displacement), Auto$mpg)
plot(sqrt(Auto$displacement), Auto$mpg)
plot((Auto$displacement)^2, Auto$mpg)
horsepower
par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
For both horsepower and displacement, the log transformation displays an approximately linear trend against mpg.
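The visual impression from the plots can be backed up by comparing R-squared for the raw and log-transformed predictor. The helper below, `compare_log_fit`, is hypothetical and written for this note. So that the sketch runs without extra packages, it is demonstrated on the built-in `mtcars` data as a stand-in; with ISLR loaded, the analogous call would be `compare_log_fit(Auto, "mpg", "horsepower")`.

```r
# Hypothetical helper: R-squared of response ~ predictor versus
# response ~ log(predictor) for any data frame
compare_log_fit <- function(data, response, predictor) {
  raw_fit <- lm(reformulate(predictor, response), data = data)
  log_fit <- lm(reformulate(sprintf("log(%s)", predictor), response), data = data)
  c(raw = summary(raw_fit)$r.squared, log = summary(log_fit)$r.squared)
}

# mtcars ships with R and stands in for Auto here (hp plays the role of horsepower)
r2 <- compare_log_fit(mtcars, "mpg", "hp")
print(r2)
```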
data("Carseats")
lm.carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price: a one-unit increase in price is associated with a decrease of about 0.0545 units in sales (Sales is recorded in thousands of units), holding all other predictors constant.
Urban: sales at an urban location are about 0.0219 units lower than at a non-urban location, holding all other predictors constant; this effect is not statistically significant (p = 0.936).
US: on average, sales are about 1.2006 units higher at a US location, holding all other predictors constant.
Sales = 13.04 - 0.054 * Price - 0.022 * Urban + 1.201 * US + error, where Urban and US equal 1 for "Yes" and 0 otherwise.
We can reject the null hypothesis for the Price and US variables.
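To make the dummy-variable coding concrete, the fitted equation can be evaluated by hand for a few store types. The coefficients below are copied from the `summary(lm.carseats)` output; `predict_sales` is a hypothetical helper, and the price of 100 is an arbitrary illustrative value.

```r
# Coefficients copied from the summary(lm.carseats) output
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: predicted Sales for a store; urban and us are TRUE/FALSE
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price +
           b["urban_yes"] * as.numeric(urban) + b["us_yes"] * as.numeric(us))
}

urban_us    <- predict_sales(price = 100, urban = TRUE,  us = TRUE)
rural_us    <- predict_sales(price = 100, urban = FALSE, us = TRUE)
urban_nonus <- predict_sales(price = 100, urban = TRUE,  us = FALSE)

print(c(urban_us = urban_us, rural_us = rural_us, urban_nonus = urban_nonus))
```

The gap between `urban_us` and `urban_nonus` is exactly the USYes coefficient, and the gap between `urban_us` and `rural_us` is the UrbanYes coefficient.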
lm.carseats1 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.carseats1)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
R^2 is essentially the same for the smaller and larger models (0.2393 in both); only about 24% of the variation in Sales is explained by either model.
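Although R^2 is unchanged, the adjusted R^2 is slightly higher for the smaller model, because it penalizes the extra predictor. A minimal sketch of that calculation, using the rounded R^2 from the output and n = 400 (residual degrees of freedom plus the number of estimated coefficients); `adj_r2` is a hypothetical helper name.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400  # 396 residual df + 4 coefficients in the larger model
larger  <- adj_r2(0.2393, n, p = 3)  # Price, Urban, US
smaller <- adj_r2(0.2393, n, p = 2)  # Price, US

print(round(c(larger = larger, smaller = smaller), 4))
```

Because the input R^2 is rounded to four digits, the results agree with the reported 0.2335 and 0.2354 only approximately.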
confint(lm.carseats1)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
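The intervals returned by `confint()` can be reproduced approximately by hand as estimate ± t-quantile × standard error. The sketch below copies the rounded estimates and standard errors from the `summary(lm.carseats1)` output, so the last digits may differ slightly from the output above.

```r
# Estimates and standard errors copied from the summary(lm.carseats1) output
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523, USYes = 0.25846)

# 97.5% quantile of the t distribution with 397 residual degrees of freedom
crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - crit * se, upper = est + crit * se)
print(round(ci, 4))
```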
par(mfrow = c(2,2))
plot(lm.carseats1)
# 12.
The coefficient estimate for the regression of Y onto X (without an intercept) is the sum of the products x*y divided by the sum of x^2,
and the coefficient estimate for the regression of X onto Y is the sum of the products x*y divided by the sum of y^2.
Therefore, provided the sum of x*y is nonzero, the two coefficients are equal if and only if the sum of x^2 equals the sum of y^2.
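Written out in symbols, the argument above reads as follows. The subscript notation is chosen here for illustration, and both regressions are assumed to have no intercept.

```latex
\hat{\beta}_{Y \text{ onto } X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \text{ onto } Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}

% The numerators are identical, so when \sum_i x_i y_i \neq 0:
\hat{\beta}_{Y \text{ onto } X} = \hat{\beta}_{X \text{ onto } Y}
\iff
\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2
```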
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.1)  # no seed is set, so the numbers below vary between runs
sum(y^2)
## [1] 1353049
lm.x <- lm(x~y+0)
summary(lm.x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.149611 -0.035205 0.006287 0.031466 0.106609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.001e-01 4.187e-05 11943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0487 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.426e+08 on 1 and 99 DF, p-value: < 2.2e-16
lm.y <- lm(y~x+0)
summary(lm.y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.21315 -0.06283 -0.01253 0.07047 0.29919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9997397 0.0001674 11943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09739 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.426e+08 on 1 and 99 DF, p-value: < 2.2e-16
In the example above, sum(x^2) = 338350 and sum(y^2) = 1353049 differ, so the two coefficient estimates differ (about 0.5 versus about 2). In the example that follows, both sums equal 338350, so the two estimates coincide.
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
lm.x <- lm(x~y +0)
lm.y <- lm(y~x +0)
summary(lm.x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(lm.y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
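The estimate of 0.5075 in both fits can be checked directly against the closed-form expression for a no-intercept regression. This is a small verification sketch; the variable names are chosen here for illustration.

```r
x <- 1:100
y <- 100:1

# Closed-form slope estimates for regression through the origin
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# The same quantities as estimated by lm()
fit_y <- lm(y ~ x + 0)
fit_x <- lm(x ~ y + 0)

print(c(beta_y_on_x = beta_y_on_x, lm_y_on_x = unname(coef(fit_y))))
print(c(beta_x_on_y = beta_x_on_y, lm_x_on_y = unname(coef(fit_x))))
```

Here sum(x*y) = 171700 and both sums of squares are 338350, which gives 171700 / 338350 ≈ 0.5075 in each direction.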