Both are non-parametric methods
-KNN Classifier is when the problem is of a classification type, which has a qualitative response. KNN Classifier does this by identifying the neighborhood of x0 and then estimating the conditional probability P(Y=j|X=x0) for class j as the fraction of points in the neighborhood whose response values equal j.
-KNN Regression is used with quantitative response by identifying the neighborhood of x0 and then estimating f(x0) as the average of all the training responses in the neighborhood.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
library(MASS)
data(Auto)
pairs(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Is there a relationship between the predictors and the response? After reviewing the model’s output of the F Statistic and the p-value, it appears that there is a relationship with “mpg” and the predictors.
Which predictors appear to have a statistically significant relationship to the response? Displacement, weight, year, and origin are statistically significant. Cylinder, horsepower and acceleration were not significant.
What does the coefficient for the year variable suggest? Year suggest that for every year newer the auto, its mpg increases by 0.750773. This shows that autos are becoming more efficient over time.
AFit <- lm(mpg ~ . - name, data = Auto)
summary(AFit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Residuals vs. Fitted show that there maybe a Homoscedasticity issue because of the pattern being formed. Points are not randomly located around zero. Q-Q Plot shows that there is deviation at one end of the tail, this could possibily be due to a normality violation. Residuals vs. Leverage show oultiers in the data set. Point 14 and the points outside of 2 and -2 appear to be outleirs.
plot(AFit)
The cor() function shows which interactions have the greastest correlation so I’ll use that info. Displacement and cylinder cor = 0.9508233 Displacement and weight cor = 0.9329944
AFit2 <- lm(mpg ~ cylinders * displacement+displacement * weight, data = Auto[, 1:8])
summary(AFit2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2934 -2.5184 -0.3476 1.8399 17.7723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262e+01 2.237e+00 23.519 < 2e-16 ***
## cylinders 7.606e-01 7.669e-01 0.992 0.322
## displacement -7.351e-02 1.669e-02 -4.403 1.38e-05 ***
## weight -9.888e-03 1.329e-03 -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03 3.426e-03 -0.872 0.384
## displacement:weight 2.128e-05 5.002e-06 4.254 2.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
Displacement and Weight are significant with a p-value of 2.64e-05. This model has an R-square of 0.72 which explains the variance in the model which is good. Displacement and cylinders are not significant.
AS the weight increases so does horsepower. Sqrt shows the most linear graph while x2 displays the same relationship but shows horsepower leveling off while weight can continue increasing.
plot(log(Auto$weight), Auto$horsepower)
plot(sqrt(Auto$weight), Auto$horsepower)
plot((Auto$weight)^2, Auto$horsepower)
data("Carseats")
CS <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(CS)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative! Price- sales decrease 0.054459 units as price increases by 1 dollar Urban- sales decrease 0.021916 units if in a urban location US- sales increase in US stores by 1.200573 units
Write out the model in equation form, being careful to handle the qualitative variables properly. Sales=13.0434689+(−0.0544588)×Price+(−0.0219162)×Urban+(1.2005727)×US+ε with Urban=1 if the store is urban and 0 if its false. For domestic sales US=1 true and US=0 if false.
For which of the predictors can you reject the null hypothesis H0 : βj = 0? The null can be rejected for “Price” and “US” because of the low p-values.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
CS1 <- lm(Sales ~ Price + US, data = Carseats)
summary(CS1)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data? The R-square is low in both models, explaining about 23% of the variance in both. More data would be ideal to build a better model to forcast Sales.
Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(CS1)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(CS1)
Plots overall appear good. Residuals vs leverage chart do suggest that there are some high leverage points and a few outleirs.
The coefficient estimate for the regression of Y onto X has x in the denominator The coefficient estimate for the regression of X onto Y has y in the denominator
the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.1)
sum(y^2)
## [1] 1353606
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223590 -0.062560 0.004426 0.058507 0.230926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0001514 0.0001548 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115418 -0.029231 -0.002186 0.031322 0.111795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.87e-05 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08