Both the KNN classifier and KNN regression make predictions using the K training observations closest to the prediction point \(x_0\). They differ in the type of response: KNN classification handles a qualitative (categorical) response and assigns the prediction point to a class, whereas KNN regression handles a quantitative response and estimates its numerical value. In KNN classification, the neighborhood of \(x_0\) is identified and the conditional probability \(P(Y=j|X=x_0)\) for class \(j\) is estimated as the proportion of points in the neighborhood whose response equals \(j\); \(x_0\) is then assigned to the class with the highest estimated probability. In KNN regression, the neighborhood of \(x_0\) is again identified and \(f(x_0)\) is estimated as the average of the training responses in that neighborhood.
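To make the distinction concrete, here is a minimal hand-rolled sketch in R (illustrative only; the simulated data, the choice K = 5, and the point x0 = 0.4 are made up for this example):
set.seed(1)
x  <- runif(50)                          # one-dimensional training predictor
y  <- 2 * x + rnorm(50, sd = 0.3)        # quantitative response (regression)
cl <- factor(ifelse(x > 0.5, "A", "B"))  # qualitative response (classification)
x0 <- 0.4                                # prediction point
K  <- 5
nbrs <- order(abs(x - x0))[1:K]          # indices of the K nearest neighbors of x0
# KNN regression: estimate f(x0) by the average response in the neighborhood
f.hat <- mean(y[nbrs])
# KNN classification: estimate P(Y = j | X = x0) by the class proportions in the
# neighborhood, then assign x0 to the most probable class
p.hat <- table(cl[nbrs]) / K
class.hat <- names(which.max(p.hat))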
This question involves the use of the Auto data set.
library(ISLR2)
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
# Treat the qualitative name variable as a factor
Auto$name = as.factor(Auto$name)
# Scatterplot matrix of all variables in the data set
pairs(Auto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
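For a quick view of which variables are most strongly associated with mpg, the mpg column of this matrix can be sorted (a convenience step, not required by the exercise):
sort(cor(Auto[1:8])[, "mpg"])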
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
auto.mlr = lm(mpg ~ . -name, data=Auto)
summary(auto.mlr)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the
response?
Yes. The F-statistic is 252.4 with a p-value below 2.2e-16, so we can reject the null hypothesis that all of the coefficients are zero: there is a relationship between the predictors and the response. For individual predictors, the p-value measures the evidence against the null hypothesis that that coefficient is zero; with the conventional cutoff of 0.05, a p-value below 0.05 indicates the observed association would be very unlikely if the true coefficient were zero.
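The overall F-test behind this conclusion can also be pulled directly out of the summary object (a convenience, output not shown):
# F-statistic value and its numerator/denominator degrees of freedom
summary(auto.mlr)$fstatistic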
ii. Which predictors appear to have a statistically
significant relationship to the response?
Generally, we treat a variable as significant, i.e. as having some relationship with the response, if its p-value is less than 0.05. By that standard, all predictors are statistically significant except “cylinders”, “horsepower” and “acceleration”.
iii. What does the coefficient for the year variable
suggest?
The year coefficient is 0.7508, or roughly 3/4. Holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg, i.e. roughly 3 mpg every four years.
(d) Use the plot() function to produce diagnostic
plots of the linear regression fit. Comment on any problems you see with
the fit. Do the residual plots suggest any unusually large outliers?
Does the leverage plot identify any observations with unusually high
leverage?
par(mfrow = c(2, 2))
plot(auto.mlr)
The plot of residuals vs fitted values shows mild curvature, indicating some nonlinearity in the data. The plot of standardized residuals vs leverage shows one high leverage point (observation 14) and a few potential outliers (standardized residuals above 2 or below -2).
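These visual impressions can be checked numerically with hatvalues() and rstandard() (a quick check, not part of the original output):
# Observation with the largest leverage
which.max(hatvalues(auto.mlr))
# Observations with standardized residuals outside (-2, 2)
which(abs(rstandard(auto.mlr)) > 2)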
(e) Use the * and : symbols to fit linear regression models with
interaction effects. Do any interactions appear to be statistically
significant?
# Interaction between horsepower and displacement
interact.fit = lm(mpg ~ . -name + horsepower*displacement, data=Auto)
# Interaction between horsepower and origin
origin.hp = lm(mpg ~ . -name + horsepower*origin, data=Auto)
summary(origin.hp)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower * origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.277 -1.875 -0.225 1.570 12.080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.196e+01 4.396e+00 -4.996 8.94e-07 ***
## cylinders -5.275e-01 3.028e-01 -1.742 0.0823 .
## displacement -1.486e-03 7.607e-03 -0.195 0.8452
## horsepower 8.173e-02 1.856e-02 4.404 1.38e-05 ***
## weight -4.710e-03 6.555e-04 -7.186 3.52e-12 ***
## acceleration -1.124e-01 9.617e-02 -1.168 0.2434
## year 7.327e-01 4.780e-02 15.328 < 2e-16 ***
## origin 7.695e+00 8.858e-01 8.687 < 2e-16 ***
## horsepower:origin -7.955e-02 1.074e-02 -7.405 8.44e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.116 on 383 degrees of freedom
## Multiple R-squared: 0.8438, Adjusted R-squared: 0.8406
## F-statistic: 258.7 on 8 and 383 DF, p-value: < 2.2e-16
Based on the correlation matrix, two interaction terms were tried: horsepower with displacement and horsepower with origin. In the model above, the horsepower:origin interaction is statistically significant (p ≈ 8.4e-13).
inter.fit = lm(mpg ~ .-name + horsepower:origin + horsepower:displacement, data=Auto)
summary(inter.fit)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower:origin + horsepower:displacement,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7222 -1.5251 -0.0968 1.3553 12.8419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.706e+00 4.686e+00 -1.004 0.3159
## cylinders 5.142e-01 3.139e-01 1.638 0.1022
## displacement -6.970e-02 1.143e-02 -6.098 2.63e-09 ***
## horsepower -1.540e-01 3.547e-02 -4.342 1.81e-05 ***
## weight -3.084e-03 6.478e-04 -4.761 2.73e-06 ***
## acceleration -2.276e-01 9.099e-02 -2.501 0.0128 *
## year 7.349e-01 4.460e-02 16.478 < 2e-16 ***
## origin 2.281e+00 1.090e+00 2.092 0.0371 *
## horsepower:origin -1.918e-02 1.278e-02 -1.500 0.1343
## displacement:horsepower 4.665e-04 6.127e-05 7.614 2.10e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.908 on 382 degrees of freedom
## Multiple R-squared: 0.8644, Adjusted R-squared: 0.8612
## F-statistic: 270.6 on 9 and 382 DF, p-value: < 2.2e-16
When both interactions are included, displacement:horsepower remains highly significant (p ≈ 2.1e-13), while horsepower:origin is no longer significant. Adding more interaction terms can reduce the apparent significance of terms that were significant in the smaller models.
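A side note on the * versus : syntax: because horsepower and origin already enter the model through the . term, horsepower*origin and horsepower:origin specify the same model here. This can be verified directly (a quick check under that assumption; output not shown):
# Refit using ':' only; the coefficients should match origin.hp fitted with '*'
origin.hp.colon = lm(mpg ~ . -name + horsepower:origin, data=Auto)
all.equal(coef(origin.hp.colon), coef(origin.hp))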
(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
par(mfrow = c(2, 2))
# mpg plotted against three transformations of horsepower
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
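Beyond the scatterplots, one could also fit simple models with each transformed predictor and compare the fits numerically (a sketch; the resulting numbers are not shown here):
# R-squared for mpg against horsepower, log(horsepower), sqrt(horsepower), horsepower^2
fits <- list(
  raw  = lm(mpg ~ horsepower, data = Auto),
  log  = lm(mpg ~ log(horsepower), data = Auto),
  sqrt = lm(mpg ~ sqrt(horsepower), data = Auto),
  sq   = lm(mpg ~ I(horsepower^2), data = Auto)
)
sapply(fits, function(f) summary(f)$r.squared)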
This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
carseat.fit = lm(Sales ~ Price + Urban + US, data=Carseats)
summary(carseat.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the
model. Be careful—some of the variables in the model are
qualitative!
In the multiple regression model from (a), Sales is measured in thousands of units. The coefficient on Price means that, with all other predictors held fixed, a one-dollar increase in price is associated with an average decrease in sales of about 54.46 units. The coefficient on Urban means that, other predictors held fixed, stores in urban locations sell about 21.92 fewer units on average than stores in rural locations (this effect is not statistically significant). The coefficient on US means that, other predictors held fixed, stores in the US sell about 1200.57 more units on average than stores outside the US.
(c) Write out the model in equation form, being careful to
handle the qualitative variables properly.
\(\text{Sales} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} + 1.200573 \times \text{US} + \varepsilon\)
where US=1 if the store is in the US and 0 otherwise, and Urban=1 if the
store is in an urban area and 0 otherwise.
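As a sanity check of this equation, the fitted value for a hypothetical store (Price = 100, urban, located in the US; values made up for illustration) can be computed both with predict() and by hand:
# Model prediction for the hypothetical store
predict(carseat.fit, data.frame(Price = 100, Urban = "Yes", US = "Yes"))
# Same value by plugging into the equation above
13.043469 - 0.054459 * 100 - 0.021916 * 1 + 1.200573 * 1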
(d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?
For the predictors Price and US we can reject the null hypothesis; for Urban we cannot (p = 0.936).
(e) On the basis of your response to the previous question, fit
a smaller model that only uses the predictors for which there is
evidence of association with the outcome.
carseat.fit2 = lm(Sales ~ Price + US, data=Carseats)
summary(carseat.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the
data?
Both models explain about 23.93% of the variability in Sales (\(R^2 \approx 0.2393\)). The smaller model in (e) fits slightly better than the larger model in (a): it has a marginally higher adjusted \(R^2\) (0.2354 vs 0.2335) and a slightly lower residual standard error, while using one fewer predictor.
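The two fits can be put side by side (a convenience comparison of the adjusted \(R^2\) values already reported above):
c(full = summary(carseat.fit)$adj.r.squared,
  reduced = summary(carseat.fit2)$adj.r.squared)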
(g) Using the model from (e), obtain 95 % confidence intervals
for the coefficient(s).
confint(carseat.fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2,2))
plot(carseat.fit2)
In the residuals vs leverage plot, one observation sits at the extreme right, indicating it has very high leverage; a few other points also show considerable leverage. The standardized residuals stay within roughly ±3, so there is no strong evidence of outliers.
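The high-leverage points can also be flagged numerically; one common rule of thumb compares each leverage to a multiple of the average leverage \((p+1)/n\) (a quick check, not in the original):
hv <- hatvalues(carseat.fit2)
# Points with leverage more than twice the average (p + 1)/n
which(hv > 2 * length(coef(carseat.fit2)) / nrow(Carseats))
# Potential outliers: studentized residuals outside (-2, 2)
which(abs(rstudent(carseat.fit2)) > 2)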
(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of
Y onto X without an intercept is given by (3.38).
Under what circumstance is the coefficient estimate for the regression
of X onto Y the same as the coefficient estimate for
the regression of Y onto X?
The coefficient estimate for the regression of Y onto X (without an intercept) is
\(\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_j x^2_j}\),
while the coefficient estimate for the regression of X onto Y is
\(\hat{\beta}' = \frac{\sum_i x_i y_i}{\sum_j y^2_j}\).
The numerators are identical, so the two estimates are equal exactly when the denominators are equal, i.e. when \(\sum_j x^2_j = \sum_j y^2_j\). A perfect linear relationship with Y = X is one simple case, but any data for which the sums of squares of the observed x and y values coincide will do.
(b) Generate an example in R with n = 100 observations
in which the coefficient estimate for the regression of X onto
Y is different from the coefficient estimate for the regression
of Y onto X.
set.seed(0)
x = rnorm(100)
# y is x plus noise, so sum(x^2) and sum(y^2) differ and the two slope estimates will not match
y = 2 * x + rnorm(100)
fit.Y <- lm(y ~ x + 0)  # regression of Y onto X, no intercept
fit.X <- lm(x ~ y + 0)  # regression of X onto Y, no intercept
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6391 -0.8650 -0.2032 0.5898 2.7879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.1374 0.1092 19.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9589 on 99 degrees of freedom
## Multiple R-squared: 0.7948, Adjusted R-squared: 0.7927
## F-statistic: 383.4 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22971 -0.24830 0.04216 0.34170 0.71230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.37185 0.01899 19.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4 on 99 degrees of freedom
## Multiple R-squared: 0.7948, Adjusted R-squared: 0.7927
## F-statistic: 383.4 on 1 and 99 DF, p-value: < 2.2e-16
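The reason the two estimates differ is visible in the denominators of the two formulas from (a); with this simulated data the sums of squares do not match (a quick check, output not shown):
# Shared numerator, different denominators
c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
c(beta_y_on_x = sum(x * y) / sum(x^2), beta_x_on_y = sum(x * y) / sum(y^2))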
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
y = x  # y equals x, so sum(x^2) == sum(y^2) and the two slope estimates coincide
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.121e-16 -3.665e-17 -8.400e-19 4.368e-17 2.976e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.000e+00 1.058e-17 9.449e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.297e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.928e+33 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.121e-16 -3.665e-17 -8.400e-19 4.368e-17 2.976e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 1.000e+00 1.058e-17 9.449e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.297e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.928e+33 on 1 and 99 DF, p-value: < 2.2e-16
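Here y = x, so \(\sum_j x^2_j = \sum_j y^2_j\) and the two slope estimates coincide, as the summaries above confirm (both equal 1). A direct check of the condition from (a):
all.equal(sum(x^2), sum(y^2))  # TRUE, since y and x are identical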