Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier and KNN regression are both non-parametric methods that predict the response at a point from the K training observations nearest to it. They differ in the type of response and in how the neighbors are combined. The KNN classifier is used when the response is qualitative: it estimates the probability of each class as the fraction of the K nearest neighbors that belong to that class, and assigns the point to the class with the highest estimated probability. KNN regression is used when the response is quantitative: instead of choosing the most probable class, it averages the response values of the K nearest neighbors and assigns that average to the point.
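The contrast can be sketched in a few lines of base R. This is a minimal illustration on a made-up toy data set, not a replacement for a package implementation; the helper names `nearest_k`, `knn_classify` and `knn_regress` are hypothetical, and Euclidean distance is an assumption.

```r
# Hypothetical helper: indices of the k training rows closest to x0
# (Euclidean distance is assumed here).
nearest_k <- function(train_x, x0, k) {
  x0_mat <- matrix(x0, nrow(train_x), ncol(train_x), byrow = TRUE)
  d <- sqrt(rowSums((train_x - x0_mat)^2))
  order(d)[seq_len(k)]
}

# Classifier: estimate class probabilities as neighbor fractions,
# then return the most probable class (ties go to the first class).
knn_classify <- function(train_x, train_y, x0, k) {
  idx <- nearest_k(train_x, x0, k)
  probs <- table(train_y[idx]) / k
  names(probs)[which.max(probs)]
}

# Regression: return the average response of the k nearest neighbors.
knn_regress <- function(train_x, train_y, x0, k) {
  idx <- nearest_k(train_x, x0, k)
  mean(train_y[idx])
}

# Made-up toy data: two well-separated clusters.
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
labels <- c("A", "A", "A", "B", "B", "B")      # qualitative response
values <- c(1.0, 1.5, 2.0, 9.0, 10.0, 11.0)    # quantitative response

knn_class <- knn_classify(train_x, labels, c(1.5, 1.5), k = 3)
knn_value <- knn_regress(train_x, values, c(1.5, 1.5), k = 3)
```

The same neighbors are used in both calls; only the way their responses are combined changes.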
This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
library(MASS)
plot(Auto)
Auto_new <- Auto[,-c(9)]
cor(Auto_new)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm_fit<- lm(mpg ~ ., data = Auto_new)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ ., data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The F-statistic (252.4) and its p-value (< 2.2e-16) provide strong evidence that at least one of the predictors is related to the response.
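As a rough check, the F-statistic can be recomputed from the rounded figures printed in the summary (R-squared 0.8215, 7 predictors, 384 residual degrees of freedom). The variable names below are illustrative, and the result only agrees with the printed value up to rounding.

```r
# Figures copied from the summary output (rounded as printed).
r2       <- 0.8215   # Multiple R-squared
p        <- 7        # number of predictors
df_resid <- 384      # residual degrees of freedom, n - p - 1

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat  <- (r2 / p) / ((1 - r2) / df_resid)

# Upper-tail probability under the F(p, n - p - 1) distribution.
p_value <- pf(f_stat, df1 = p, df2 = df_resid, lower.tail = FALSE)
```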
Judging by their p-values, the predictors that appear to have a statistically significant relationship to the response are displacement, weight, year and origin.
The coefficient for year is positive (0.750773): holding the other predictors fixed, each additional model year is associated with an average increase of about 0.75 mpg.
par(mfrow = c(2, 2))
plot(lm_fit)
In the Residuals vs Fitted plot, the smoothed curve through the residuals
is visibly bent, which suggests that the relationship between the
predictors and mpg is not purely linear. In the Residuals vs Leverage
plot, observation 14 has high leverage, but its residual does not mark it
as an outlier.
summary(lm(mpg ~ .:., data = Auto_new))
##
## Call:
## lm(formula = mpg ~ .:., data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
summary(lm(mpg ~ .*., data = Auto_new))
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
The interactions that appear to be statistically significant at the 0.05 level, judging by their p-values, are acceleration:origin, acceleration:year and displacement:year.
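The two summaries above are identical because `.:.` and `.*.` expand to the same set of terms: crossing a variable with itself gives back the variable, so the main effects appear in both. A small sketch on a made-up data frame (the name `toy` and its columns are hypothetical) compares the expanded term labels.

```r
# Made-up data frame with a response y and three predictors.
set.seed(1)
toy <- data.frame(y = rnorm(10), a = rnorm(10), b = rnorm(10), c = rnorm(10))

# Expand the dot in each formula and collect the resulting term labels.
terms_colon <- attr(terms(y ~ .:., data = toy), "term.labels")
terms_star  <- attr(terms(y ~ .*., data = toy), "term.labels")

# TRUE when both formulas describe the same model terms.
same_terms <- setequal(terms_colon, terms_star)
```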
summary(lm(mpg ~ acceleration+I(acceleration^2), data = Auto_new))
##
## Call:
## lm(formula = mpg ~ acceleration + I(acceleration^2), data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0877 -5.5700 -0.8524 4.3827 22.9813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.26045 7.79899 -1.957 0.051095 .
## acceleration 3.79787 0.98283 3.864 0.000131 ***
## I(acceleration^2) -0.08156 0.03056 -2.669 0.007934 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.025 on 389 degrees of freedom
## Multiple R-squared: 0.194, Adjusted R-squared: 0.1898
## F-statistic: 46.8 on 2 and 389 DF, p-value: < 2.2e-16
summary(lm(mpg ~ acceleration+I(sqrt(acceleration)), data = Auto_new))
##
## Call:
## lm(formula = mpg ~ acceleration + I(sqrt(acceleration)), data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9559 -5.5979 -0.8015 4.6222 22.8777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -72.877 29.051 -2.509 0.01253 *
## acceleration -3.845 1.885 -2.040 0.04203 *
## I(sqrt(acceleration)) 39.750 14.823 2.682 0.00764 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.025 on 389 degrees of freedom
## Multiple R-squared: 0.1941, Adjusted R-squared: 0.19
## F-statistic: 46.85 on 2 and 389 DF, p-value: < 2.2e-16
This question should be answered using the Carseats data set.
lm_sales <- lm(Sales ~ Price + Urban + US, Carseats)
summary(lm_sales)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The data set contains sales of child car seats, with Sales measured in thousands of units. Price is a continuous variable, while Urban and US are categorical.
- The Price coefficient (-0.054459) indicates that, holding the other predictors fixed, each one-unit increase in price is associated with a decrease in sales of about 54 units. Its p-value shows strong evidence of a relationship between Price and Sales.
- The Urban variable takes two values, Yes (1) or No (0), to indicate whether the store is in an urban location. Its coefficient (-0.021916) suggests that sales are about 22 units lower in urban stores, but its p-value (0.936) shows that there is not enough evidence of a relationship between Urban and Sales.
- The US variable takes two values, Yes (1) or No (0), to indicate whether the store is in the US. Its coefficient (1.200573) indicates that sales are about 1,200 units higher for stores located in the US. Its p-value shows evidence of a relationship between US and Sales.
Sales = 13.043469 - 0.054459 * Price - 0.021916 * Urban + 1.200573 * US, where Urban = 1 if the store is in an urban area (0 otherwise) and US = 1 if the store is in the US (0 otherwise).
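To see how the fitted equation is used, the sketch below plugs in one hypothetical store: a price of 100, in an urban area, in the US. The coefficients are copied from the summary above; the function name `predict_sales` and the input values are illustrative assumptions.

```r
# Coefficients copied from the fitted model summary.
coefs <- c(intercept = 13.043469, price = -0.054459,
           urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: evaluate the fitted equation by hand.
# urban and us are logicals that are converted to 0/1 dummies.
predict_sales <- function(price, urban, us) {
  coefs[["intercept"]] +
    coefs[["price"]] * price +
    coefs[["urban_yes"]] * as.numeric(urban) +
    coefs[["us_yes"]] * as.numeric(us)
}

# Hypothetical store: Price = 100, urban, in the US.
pred <- predict_sales(price = 100, urban = TRUE, us = TRUE)
# pred is about 8.776, i.e. roughly 8,776 car seats (Sales is in thousands)
```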
The null hypothesis can be rejected for the Price and US variables, since their p-values are < 2e-16 and 4.86e-06, respectively. It cannot be rejected for Urban (p-value 0.936).
lm_sales_new <- lm(Sales ~ Price + US, Carseats)
summary(lm_sales_new)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Model (a) has an adjusted R-squared of 0.2335 and model (b) has an adjusted R-squared of 0.2354. Both models explain roughly 24% of the variance in Sales (multiple R-squared of 0.2393 in each), which confirms that dropping the Urban variable did not worsen the fit of the model to the data.
confint(lm_sales_new, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
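The interval for Price can be reproduced by hand as estimate ± t-quantile × standard error, using the rounded estimate and standard error printed in the summary and the 397 residual degrees of freedom. The variable names are illustrative, and small differences from `confint()` are expected because the inputs are rounded.

```r
# Figures copied from summary(lm_sales_new), rounded as printed.
est      <- -0.05448   # Price estimate
se       <- 0.00523    # Price standard error
df_resid <- 397        # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t_{0.975, df} * SE
t_crit <- qt(0.975, df = df_resid)
ci     <- est + c(-1, 1) * t_crit * se
```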
par(mfrow = c(3, 2))
plot(hatvalues(lm_sales_new))
plot(lm_sales_new)
Neither the hat values nor the Residuals vs Leverage plot seems to indicate the presence of observations with unusually high leverage.
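A numeric complement to reading the plots is to compare each hat value with the average leverage, which always equals the number of fitted coefficients divided by n. The sketch below uses simulated data rather than Carseats, and the cutoff of three times the average is a rule-of-thumb assumption, not something taken from the analysis above.

```r
# Simulated data (not Carseats): two numeric predictors.
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
h   <- hatvalues(fit)

# Average leverage is (number of coefficients) / n.
avg_leverage <- length(coef(fit)) / n

# Assumed rule of thumb: flag points above 3x the average leverage.
flagged <- which(h > 3 * avg_leverage)
```

For `lm_sales_new`, which has 3 coefficients and 400 observations, the corresponding average leverage would be 3/400 = 0.0075.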
This problem involves simple linear regression without an intercept.
Writing out the no-intercept least-squares estimates for the regression of y onto x and of x onto y, and equating the two quotients, shows that the estimates are equal when the sum of the squared x values equals the sum of the squared y values.
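The step can be written out explicitly. The notation below (subscripts on the estimates) is introduced here for illustration.

```latex
\hat{\beta}_{y \sim x} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{x \sim y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}

\hat{\beta}_{y \sim x} = \hat{\beta}_{x \sim y}
\iff
\Big(\sum_{i=1}^{n} x_i y_i\Big)\Big(\sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i^2\Big) = 0
```

So, apart from the trivial case where the cross-product sum is zero (both estimates are then zero), the two estimates coincide exactly when the two sums of squares are equal.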
set.seed(99)
x <- rnorm(100)
y <- 3 * x + rnorm(100)
df <- data.frame(x, y)
fit1 <- lm(y ~ x + 0, data = df)
fit2 <- lm(x ~ y + 0, data = df)
summary(fit1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## x 3.087646 0.1182444 26.1124 3.415688e-46
summary(fit2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## y 0.2828097 0.01083047 26.1124 3.415688e-46
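The two estimates can also be checked against the closed-form quotients. This sketch regenerates data with the same seed and commands as above; the variable names `b_yx`, `b_xy`, `fit_yx` and `fit_xy` are illustrative.

```r
set.seed(99)
x <- rnorm(100)
y <- 3 * x + rnorm(100)

# Closed-form no-intercept estimates.
b_yx <- sum(x * y) / sum(x^2)   # y regressed onto x
b_xy <- sum(x * y) / sum(y^2)   # x regressed onto y

# The same quantities as estimated by lm().
fit_yx <- unname(coef(lm(y ~ x + 0)))
fit_xy <- unname(coef(lm(x ~ y + 0)))

# The sums of squares differ here, so the two estimates differ.
c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
```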
set.seed(89)
x1 <- rnorm(100)
y1 <- x1
df1 <- data.frame(x1, y1)
fit2_1 <- lm(y1 ~ x1 + 0, data = df1)
fit2_2 <- lm(x1 ~ y1 + 0, data = df1)
summary(fit2_1)$coefficients
## Warning in summary.lm(fit2_1): essentially perfect fit: summary may be
## unreliable
## Estimate Std. Error t value Pr(>|t|)
## x1 1 4.787301e-18 2.08886e+17 0
summary(fit2_2)$coefficients
## Warning in summary.lm(fit2_2): essentially perfect fit: summary may be
## unreliable
## Estimate Std. Error t value Pr(>|t|)
## y1 1 4.787301e-18 2.08886e+17 0
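Setting y equal to x is the simplest way to make the sums of squares match, but it produces a perfect fit. A less degenerate sketch, offered as an alternative illustration rather than as part of the original answer, takes y to be a random permutation of x: the values are reshuffled, so the sum of squares is unchanged while the fit is no longer perfect. The names `x2`, `y2`, `b_yx` and `b_xy` are hypothetical.

```r
set.seed(1)
x2 <- rnorm(100)
y2 <- sample(x2)   # same values in a random order, so sum(y2^2) equals sum(x2^2)

# No-intercept regressions in both directions.
b_yx <- unname(coef(lm(y2 ~ x2 + 0)))
b_xy <- unname(coef(lm(x2 ~ y2 + 0)))

# The two estimates agree up to floating-point error.
c(b_yx = b_yx, b_xy = b_xy)
```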