Carefully explain the difference between the KNN classifier and KNN regression methods.
The KNN classifier is used when the response variable is qualitative. For a test point, it finds the K nearest training observations, estimates the conditional probability of each class as the fraction of those neighbors that belong to it, and predicts the most common class. The KNN regression method is used when the response variable is quantitative; it predicts the average response of the K nearest training observations.
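A minimal sketch of the distinction in base R, on made-up one-dimensional data. The helper name `knn_predict` and the toy values are assumptions for illustration only; they do not come from the textbook or from any package.

```r
# Toy KNN for a single numeric predictor (hypothetical helper, base R only).
# type = "class": majority vote among the k nearest neighbours.
# type = "reg":   mean response of the k nearest neighbours.
knn_predict <- function(x_train, y_train, x0, k = 3, type = c("class", "reg")) {
  type <- match.arg(type)
  nearest <- order(abs(x_train - x0))[seq_len(k)]  # indices of the k closest points
  if (type == "class") {
    votes <- table(y_train[nearest])               # class counts in the neighbourhood
    names(votes)[which.max(votes)]                 # most common class
  } else {
    mean(y_train[nearest])                         # average response
  }
}

x_train <- c(1, 2, 3, 6, 7, 8)
y_class <- c("a", "a", "a", "b", "b", "b")         # qualitative response
y_num   <- c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0)         # quantitative response

pred_class <- knn_predict(x_train, y_class, x0 = 2.4, k = 3, type = "class")
pred_reg   <- knn_predict(x_train, y_num,   x0 = 2.4, k = 3, type = "reg")
pred_class
pred_reg
```

The same neighbourhood is used in both cases; only the summary of the neighbours' responses changes (a vote versus a mean).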
This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.5.2
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.5.3
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
# pairs() threw an error for the non-numeric "name" column, so exclude it
pairs(Auto[, -which(names(Auto) == "name")])
cor(Auto[, -which(names(Auto) == "name")])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
-There is a statistically significant relationship with displacement, weight, year, and origin because their p-values are below 0.05. Cylinders, horsepower, and acceleration are not significant at that level.
-There is a relationship between the predictors and the response: the F-statistic is high (252.4) with a statistically significant p-value (< 2.2e-16), so we reject the null hypothesis that all coefficients are 0. The R^2 is 0.8215 (adjusted R^2 of 0.8182).
-The coefficient for year suggests that each additional model year is associated with an increase of about 0.75 mpg, holding the other predictors fixed. Newer cars have higher mpg.
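As a quick check on the overall fit, the F-test p-value and the adjusted R^2 can be recomputed from the numbers printed in the summary. The values below are copied from the output above; the variable names are only illustrative.

```r
# Values copied from summary(lm.fit) above
f_stat <- 252.4   # F-statistic
df1    <- 7       # number of predictors
df2    <- 384     # residual degrees of freedom
r2     <- 0.8215  # multiple R-squared
n      <- 392     # observations (df1 + df2 + 1)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value < 2.2e-16

# Adjusted R-squared penalises the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - df1 - 1)
round(adj_r2, 4)
```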
plot(lm.fit)
The residuals vs. fitted plot shows a U-shaped trend, which suggests a
non-linear relationship that the linear model does not capture. The
leverage plot shows a high-leverage point (observation 14), and a few
points have standardized residuals near 4, which indicates outliers.
lm.fit1 <- lm(mpg ~ . - name, cylinders:displacement, data = Auto)
## Warning in cylinders:displacement: numerical expression has 392 elements: only
## the first used
summary(lm.fit1)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto, subset = cylinders:displacement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3945 -1.7041 -0.0186 1.6198 13.3258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1345737 5.1622721 0.220 0.8262
## cylinders -0.4608452 0.3128285 -1.473 0.1418
## displacement 0.0070389 0.0070386 1.000 0.3181
## horsepower -0.0183184 0.0125424 -1.461 0.1452
## weight -0.0052788 0.0005963 -8.853 < 2e-16 ***
## acceleration -0.0936875 0.0959226 -0.977 0.3295
## year 0.5353830 0.0619708 8.639 3.77e-16 ***
## origin 0.7307611 0.3023433 2.417 0.0163 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.792 on 292 degrees of freedom
## Multiple R-squared: 0.8204, Adjusted R-squared: 0.8161
## F-statistic: 190.5 on 7 and 292 DF, p-value: < 2.2e-16
lm.fit2 <- lm(mpg ~ . - name, horsepower:weight, data = Auto)
## Warning in horsepower:weight: numerical expression has 392 elements: only the
## first used
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto, subset = horsepower:weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6166 -2.0557 -0.2415 1.5404 12.0351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.231835 7.641106 -3.826 0.000164 ***
## cylinders 0.143802 0.438423 0.328 0.743182
## displacement 0.009192 0.011106 0.828 0.408630
## horsepower -0.061881 0.022917 -2.700 0.007394 **
## weight -0.005994 0.001013 -5.916 1.05e-08 ***
## acceleration 0.045217 0.136117 0.332 0.740014
## year 0.932636 0.088183 10.576 < 2e-16 ***
## origin 1.226859 0.324440 3.781 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.387 on 255 degrees of freedom
## (3112 observations deleted due to missingness)
## Multiple R-squared: 0.8078, Adjusted R-squared: 0.8025
## F-statistic: 153.1 on 7 and 255 DF, p-value: < 2.2e-16
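The two fits above do not actually test the interactions. Because cylinders:displacement and horsepower:weight were passed as a separate argument, lm() matched them to its subset argument (see the Call lines and the warnings), so both models are the original main-effects model refit on a subset of the rows, and neither coefficient table contains an interaction term. To test an interaction, the term has to be added to the formula, e.g. mpg ~ . - name + cylinders:displacement, and its significance read from its own row in the coefficient table.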
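For reference, an interaction term enters an lm() fit through the formula rather than as a separate argument. A sketch, assuming the ISLR2 package is installed as in the session above; the object names `fit_cd` and `fit_hw` are illustrative, and no output is shown because these fits were not run as part of this write-up.

```r
library(ISLR2)  # provides the Auto data set

# Main effects for every predictor except name, plus one interaction each
fit_cd <- lm(mpg ~ . - name + cylinders:displacement, data = Auto)
fit_hw <- lm(mpg ~ . - name + horsepower:weight, data = Auto)

# The interaction has its own row in the coefficient table;
# its p-value is in the "Pr(>|t|)" column
coef(summary(fit_cd))["cylinders:displacement", ]
coef(summary(fit_hw))["horsepower:weight", ]
```

Unlike the subset-based calls, these fits use all 392 observations and report one extra coefficient.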
plot(log(Auto$displacement), Auto$mpg)
plot(sqrt(Auto$displacement), Auto$mpg)
plot((Auto$displacement)^2, Auto$mpg)
Log and sqrt make the relationship between displacement and mpg more
linear, whereas squaring displacement does not.
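The visual impression can be backed with a number. One rough option is to compare the squared correlation between mpg and each transformed version of displacement (this equals the R^2 of the corresponding simple linear regression). A sketch, assuming ISLR2 is installed; the list name `transforms` is illustrative.

```r
library(ISLR2)  # provides the Auto data set

# Candidate transformations of displacement
transforms <- list(
  raw     = function(d) d,
  log     = log,
  sqrt    = sqrt,
  squared = function(d) d^2
)

# Squared correlation with mpg = R^2 of mpg regressed on the transformed predictor
r2 <- sapply(transforms, function(f) cor(f(Auto$displacement), Auto$mpg)^2)
round(r2, 3)
```

A higher value means a straight line describes the transformed relationship better, although the scatterplots remain the better tool for spotting curvature.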
This question should be answered using the Carseats data set.
library(ISLR2)
data("Carseats")
Q10_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Q10_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
-Each one-unit increase in Price is associated with a decrease in Sales of 0.054459 on average, holding Urban and US fixed (Sales is recorded in thousands of units, so about 54 units). This is statistically significant with a p-value of < 0.05.
-If the store is in an urban area, Sales is lower by 0.021916 on average. However, this is not statistically significant because the p-value is 0.936.
-If the store is in the US, Sales is higher by 1.200573 on average. This is statistically significant with a p-value of < 0.05.
-The intercept shows that when Price is 0 and both Urban and US are 'No', predicted Sales is 13.043469.
Sales = 13.043469 - 0.054459 * Price - 0.021916 * 1(UrbanYes) + 1.200573 * 1(USYes)
Here 1(UrbanYes) equals 1 if the store is urban and 0 otherwise, and 1(USYes) equals 1 if the store is in the US and 0 otherwise.
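The equation can be evaluated by hand to see how the indicator terms switch on and off. A small sketch using the coefficients from the summary above; the function name `predict_sales` and the example price of 100 are made up for illustration.

```r
# Fitted equation from summary(Q10_fit), written out explicitly.
# (urban == "Yes") and (us == "Yes") act as the 0/1 indicators.
predict_sales <- function(price, urban, us) {
  13.043469 - 0.054459 * price -
    0.021916 * (urban == "Yes") +
    1.200573 * (us == "Yes")
}

sales_urban_us <- predict_sales(price = 100, urban = "Yes", us = "Yes")
sales_baseline <- predict_sales(price = 100, urban = "No",  us = "No")
sales_urban_us
sales_baseline
```

With the fitted model still in the session, predict(Q10_fit, newdata = ...) should give the same values up to rounding of the printed coefficients.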
-We can reject the null hypothesis for Price and US, since their p-values are below 0.05.
Q10_fit1 <- lm(Sales ~ Price + US, data = Carseats)
summary(Q10_fit1)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
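How well do the models in (a) and (e) fit the data? The models in (a) and (e) have nearly the same R^2 of about 0.2393, so each explains only about 23.93% of the variance in Sales and neither fits the data very well. The model in (e) has a slightly higher adjusted R^2 (0.2354 vs. 0.2335) and a slightly lower residual standard error (2.469 vs. 2.472) while using one fewer predictor.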
Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(Q10_fit1)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
-Neither the interval for Price nor the interval for USYes contains 0, so both coefficients remain statistically significant at the 5% level.
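The interval for Price can be reproduced from the estimate and standard error in the summary, as estimate plus or minus the t critical value times the standard error. A sketch using the rounded values printed above, so the result matches confint() only to a few decimal places; the variable names are illustrative.

```r
# Values copied from summary(Q10_fit1)
est <- -0.05448   # Price estimate
se  <- 0.00523    # Price standard error
df  <- 397        # residual degrees of freedom

t_crit   <- qt(0.975, df)                 # two-sided 95% critical value
ci_price <- est + c(-1, 1) * t_crit * se  # lower and upper bounds
ci_price
```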
plot(Q10_fit1)
There is no evidence of outliers: no standardized residuals fall beyond
+/-3. There are a few high-leverage points, particularly one with
leverage above 0.04.
This problem involves simple linear regression without an intercept.
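-The coefficient estimate for the regression of X onto Y is the same as for the regression of Y onto X when the sum of squares of X equals the sum of squares of Y.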
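For regression through the origin, the least-squares slope of y on x is sum(x*y)/sum(x^2), and the slope of x on y is sum(x*y)/sum(y^2), so the two agree exactly when the denominators agree. A small check of these closed forms on simulated data; the seed and the names `x_demo` and `y_demo` are arbitrary choices for illustration.

```r
set.seed(42)
x_demo <- rnorm(100)
y_demo <- 2 * x_demo + rnorm(100)

# Closed-form slopes for regression without an intercept
slope_y_on_x <- sum(x_demo * y_demo) / sum(x_demo^2)
slope_x_on_y <- sum(x_demo * y_demo) / sum(y_demo^2)

# Both agree with lm() fitted without an intercept
all.equal(slope_y_on_x, unname(coef(lm(y_demo ~ x_demo + 0))))
all.equal(slope_x_on_y, unname(coef(lm(x_demo ~ y_demo + 0))))
```

The numerators are identical, so any difference between the two slopes comes entirely from sum(x^2) versus sum(y^2).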
set.seed(1)
x <- rnorm(100)
y <- 2*x + rnorm(100)
# regress y onto x, then x onto y, both without an intercept
Q12_fit <- lm(y ~ x + 0)
Q12_fit1 <- lm(x ~ y + 0)
coef(Q12_fit)
coef(Q12_fit1)
set.seed(1)
x1 <- rnorm(100)
yy <- x1
Q12_fit2 <-lm(yy ~ x1 + 0)
Q12_fit3 <- lm(x1 ~ yy + 0)
coef(Q12_fit2)
## x1
## 1
coef(Q12_fit3)
## yy
## 1
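Setting the response equal to the predictor gives matching coefficients, but only in the trivial case where both are 1. A less trivial sketch, under the assumption that any reordering of the same values will do: permuting x keeps the sum of squares unchanged, so the two slopes still match without being 1. The names `x_perm` and `y_perm` are illustrative.

```r
set.seed(1)
x_perm <- rnorm(100)
y_perm <- sample(x_perm)  # same values in a different order, so sum(y^2) equals sum(x^2)

b_yx <- coef(lm(y_perm ~ x_perm + 0))  # y onto x, no intercept
b_xy <- coef(lm(x_perm ~ y_perm + 0))  # x onto y, no intercept

all.equal(unname(b_yx), unname(b_xy))
```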