The KNN classifier and the KNN regressor predict different kinds of responses: the KNN classifier predicts categorical variables, while the KNN regressor predicts continuous numerical variables.
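For contrast, here is a minimal sketch of the two in R, assuming the class and FNN packages are installed (the toy data below is invented purely for illustration and is not from the Auto data set):

set.seed(1)
train_x <- matrix(rnorm(100), ncol = 2)        # 50 training points, 2 features
test_x <- matrix(rnorm(10), ncol = 2)          # 5 new points to predict
cls <- factor(rep(c("A", "B"), each = 25))     # categorical response
num <- rowSums(train_x) + rnorm(50, sd = 0.1)  # continuous response

class::knn(train_x, test_x, cl = cls, k = 3)        # classifier: returns class labels
FNN::knn.reg(train_x, test_x, y = num, k = 3)$pred  # regressor: returns numeric predictions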
Auto <- read.table("Auto.data", header = T, na.strings = "?",
    stringsAsFactors = T)
Auto <- na.omit(Auto)
View(Auto)
pairs(Auto[, -9])  # scatterplot matrix, excluding the qualitative name column
cor(Auto[, -9])    # correlation matrix of the quantitative variables
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
There is a relationship between the predictors and the response. The multiple R-squared value (0.8215) shows that around 82.15% of the variance in mpg is explained by the predictors in the model. Also, since the p-value of the F-statistic is very small (< 2.2e-16), at least one of the predictors has a linear relationship with the mpg variable.
The variables that have a statistically significant relationship to the response are weight, year, origin, and displacement.
The coefficient for the year variable is 0.750773. This suggests that, holding the other predictors fixed, each one-year increase in model year is associated with an increase of about 0.75 miles per gallon.
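These conclusions can be read straight off the coefficient table; a small sketch pulling out the predictors with p-values below 0.05:

# Keep only the rows of the coefficient table whose p-value is below 0.05
coefs <- summary(lm.fit)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]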
plot(lm.fit)  # residuals vs. fitted, Q-Q, scale-location, and residuals-vs-leverage plots
We can see in the Q-Q plot that the residuals fall fairly close to the reference line, but there are deviations, especially in the tails. The residuals-vs-leverage plot does show an observation with unusually high leverage, and that is observation 14.
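One way to confirm this numerically is to look at the hat values, which measure leverage (a sketch using base R's hatvalues()):

# The largest hat value identifies the highest-leverage observation
lev <- hatvalues(lm.fit)
which.max(lev)       # index of the flagged observation
lev[which.max(lev)]  # its leverage, compared with the average (p + 1)/n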
lm.fit.inter <- lm(mpg ~ . - name + cylinders:displacement + displacement:weight, data = Auto)
summary(lm.fit.inter)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders:displacement + displacement:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0609 -1.7589 -0.0494 1.5790 12.1496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.795e+00 4.515e+00 -1.062 0.28883
## cylinders -1.091e-01 5.965e-01 -0.183 0.85502
## displacement -7.186e-02 1.363e-02 -5.273 2.25e-07 ***
## horsepower -3.457e-02 1.304e-02 -2.651 0.00836 **
## weight -1.030e-02 1.064e-03 -9.680 < 2e-16 ***
## acceleration 6.618e-02 8.817e-02 0.751 0.45334
## year 7.840e-01 4.566e-02 17.171 < 2e-16 ***
## origin 5.475e-01 2.643e-01 2.071 0.03901 *
## cylinders:displacement 1.186e-03 2.715e-03 0.437 0.66251
## displacement:weight 2.141e-05 3.712e-06 5.768 1.66e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 382 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8555
## F-statistic: 258.2 on 9 and 382 DF, p-value: < 2.2e-16
Yes, there are interactions that are statistically significant. The interaction displacement:weight is statistically significant because it has a very small p-value (1.66e-08). This means that the effect of displacement on mpg depends on the level of weight (and vice versa).
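A partial F-test comparing the original fit with the interaction model makes the same point formally (a sketch using base R's anova(); it assumes lm.fit from above is still in the workspace):

# Does adding the two interaction terms significantly improve the fit?
anova(lm.fit, lm.fit.inter)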
lm.fit.log <- lm(mpg ~ log(displacement) + cylinders + horsepower + weight + acceleration + year + origin, data = Auto)
summary(lm.fit.log)
##
## Call:
## lm(formula = mpg ~ log(displacement) + cylinders + horsepower +
## weight + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6594 -1.8712 -0.0741 1.6427 12.8462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3922996 7.1102490 0.336 0.736709
## log(displacement) -5.2475829 1.3910486 -3.772 0.000187 ***
## cylinders 0.8052759 0.3081112 2.614 0.009312 **
## horsepower -0.0048428 0.0130659 -0.371 0.711106
## weight -0.0044886 0.0006912 -6.494 2.58e-10 ***
## acceleration -0.0047404 0.0986602 -0.048 0.961703
## year 0.7437614 0.0503990 14.757 < 2e-16 ***
## origin 0.6282457 0.3011778 2.086 0.037642 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.297 on 384 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.8215
## F-statistic: 258.1 on 7 and 384 DF, p-value: < 2.2e-16
lm.fit.sqrt <- lm(sqrt(mpg) ~ displacement + cylinders + horsepower + weight + acceleration + year + origin, data = Auto)
summary(lm.fit.sqrt)
##
## Call:
## lm(formula = sqrt(mpg) ~ displacement + cylinders + horsepower +
## weight + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98891 -0.18946 0.00505 0.16947 1.02581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.075e+00 4.290e-01 2.506 0.0126 *
## displacement 1.752e-03 6.942e-04 2.524 0.0120 *
## cylinders -5.942e-02 2.986e-02 -1.990 0.0474 *
## horsepower -2.512e-03 1.274e-03 -1.972 0.0493 *
## weight -6.367e-04 6.024e-05 -10.570 < 2e-16 ***
## acceleration 2.738e-03 9.131e-03 0.300 0.7644
## year 7.381e-02 4.709e-03 15.675 < 2e-16 ***
## origin 1.217e-01 2.569e-02 4.735 3.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3074 on 384 degrees of freedom
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.8535
## F-statistic: 326.3 on 7 and 384 DF, p-value: < 2.2e-16
lm.fit.sq <- lm(mpg ~ I(horsepower^2) + displacement + cylinders + horsepower + weight + acceleration + year + origin, data = Auto)
summary(lm.fit.sq)
##
## Call:
## lm(formula = mpg ~ I(horsepower^2) + displacement + cylinders +
## horsepower + weight + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5497 -1.7311 -0.2236 1.5877 11.9955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3236564 4.6247696 0.286 0.774872
## I(horsepower^2) 0.0010060 0.0001065 9.449 < 2e-16 ***
## displacement -0.0075649 0.0073733 -1.026 0.305550
## cylinders 0.3489063 0.3048310 1.145 0.253094
## horsepower -0.3194633 0.0343447 -9.302 < 2e-16 ***
## weight -0.0032712 0.0006787 -4.820 2.07e-06 ***
## acceleration -0.3305981 0.0991849 -3.333 0.000942 ***
## year 0.7353414 0.0459918 15.989 < 2e-16 ***
## origin 1.0144130 0.2545545 3.985 8.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.001 on 383 degrees of freedom
## Multiple R-squared: 0.8552, Adjusted R-squared: 0.8522
## F-statistic: 282.8 on 8 and 383 DF, p-value: < 2.2e-16
Based on these results, we can see that adding squared horsepower clearly improved the fit (R-squared of 0.8552 versus 0.8215 for the original model), and the log transformation of displacement also improved the model slightly (R-squared of 0.8247). The sqrt(mpg) model reports the highest R-squared (0.8561), but since its response is on a different scale, that value is not directly comparable to the models that predict mpg itself.
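To line up the fits in one place, one could collect the adjusted R-squared values (a sketch; recall the caveat that lm.fit.sqrt models sqrt(mpg) rather than mpg):

# Adjusted R-squared for each candidate model
models <- list(base = lm.fit, log_disp = lm.fit.log,
               sqrt_mpg = lm.fit.sqrt, hp_sq = lm.fit.sq)
sapply(models, function(m) summary(m)$adj.r.squared)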
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
data(Carseats)
lm.fit.full <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit.full)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The intercept is the estimated value of Sales when all of the predictors are zero; its estimate is 13.043469. So if Price is zero, the store is not in an urban area, and the store is not in the US, predicted Sales would be about 13.04 units. (In the Carseats data, Sales is recorded in thousands of units, so this is roughly 13,043 car seats.)
The estimated coefficient for the Price variable is -0.054459: for every one-dollar increase in Price, Sales are predicted to decrease by about 0.054 units, holding Urban and US fixed. This shows an inverse relationship between Price and Sales.
Next is Urban. Urban is a qualitative variable, and UrbanYes is its dummy variable. The estimated coefficient for UrbanYes is -0.021916, the difference in Sales between stores in urban areas and stores that are not. Taken at face value, it says urban stores have sales lower by about 0.022 units, but we shouldn't read anything into this estimate because it's not statistically significant (p = 0.936).
Lastly, we have US. Similar to before, US is a qualitative variable, and USYes is its dummy variable. The estimated coefficient for USYes is 1.200573, the difference in Sales between stores in the US and stores outside it: US stores have sales higher by about 1.2 units. We should take this estimate seriously because it is statistically significant (p = 4.86e-06).
Sales = β0 + β1 * Price + β2 * UrbanYes + β3 * USYes
β0 = intercept
β1 = coefficient for Price
β2 = coefficient for UrbanYes (1 if the store is in an urban area, 0 otherwise)
β3 = coefficient for USYes (1 if the store is in the US, 0 otherwise)
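As a quick check on the fitted equation, one could predict Sales for a hypothetical store directly (a sketch; the Price value of 120 is arbitrary):

# Predicted Sales for a non-urban, non-US store priced at 120
predict(lm.fit.full, newdata = data.frame(Price = 120, Urban = "No", US = "No"))
# The same value computed by hand from the coefficients
coef(lm.fit.full)["(Intercept)"] + coef(lm.fit.full)["Price"] * 120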
I can reject the null hypothesis H0: βj = 0 for Price and USYes, since both have very small p-values; for UrbanYes, I cannot reject it.
lm.fit.reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit.reduced)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
summary(lm.fit.full)$r.squared
## [1] 0.2392754
summary(lm.fit.reduced)$r.squared
## [1] 0.2392629
We can see from these results that neither model fits the data very well (R-squared is about 0.24 for both). The full model's R-squared is only trivially higher (0.2392754 vs. 0.2392629), and the reduced model actually has the higher adjusted R-squared (0.2354 vs. 0.2335), so dropping Urban costs essentially nothing.
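A partial F-test gives a formal version of the comparison (a sketch using base R's anova(), assuming both fits are still in the workspace):

# A large p-value would mean Urban adds nothing beyond Price and US
anova(lm.fit.reduced, lm.fit.full)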
confint(lm.fit.reduced)  # 95% confidence intervals for the coefficient estimates
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(lm.fit.reduced)  # diagnostic plots for the reduced model
As we can see from these plots, there are some potential outliers in the model from part (e), along with a few observations whose leverage is well above the average (p + 1)/n.
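To back this up numerically, one could check studentized residuals and hat values (a sketch; the |rstudent| > 3 cutoff and the 3 * (p + 1)/n leverage threshold are common rules of thumb):

# Potential outliers: observations with large studentized residuals
which(abs(rstudent(lm.fit.reduced)) > 3)
# High-leverage observations: hat values far above the average (p + 1)/n
p <- length(coef(lm.fit.reduced)) - 1
n <- nrow(Carseats)
which(hatvalues(lm.fit.reduced) > 3 * (p + 1) / n)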
Σ(Xi^2) = Σ(Yi^2)
For regression without an intercept, the coefficient estimate for Y onto X is β_YX = Σ(Xi * Yi) / Σ(Xi^2), while the estimate for X onto Y is β_XY = Σ(Xi * Yi) / Σ(Yi^2). The numerators are identical, so the two coefficient estimates will be the same exactly when the denominators agree, that is, when the sum of the squared X values is equal to the sum of the squared Y values.
set.seed(123)
n <- 100
X <- rnorm(n, mean = 5, sd = 2)
Y <- 2 * X + rnorm(n, mean = 0, sd = 1)  # Y is roughly 2X, so sum(Y^2) != sum(X^2)
beta_YX <- sum(X * Y) / sum(X^2)  # no-intercept slope estimate for Y onto X
print(paste("Beta (Y onto X):", beta_YX))
## [1] "Beta (Y onto X): 1.97864171794108"
beta_XY <- sum(X * Y) / sum(Y^2)  # no-intercept slope estimate for X onto Y
print(paste("Beta (X onto Y):", beta_XY))
## [1] "Beta (X onto Y): 0.501472437357611"
print(paste("Are betas different:", beta_YX != beta_XY))
## [1] "Are betas different: TRUE"
set.seed(456)
n <- 100
Z <- rnorm(n, mean = 0, sd = 1)
X_same <- Z * sqrt(runif(n, 0.5, 2))  # both variables share the same underlying Z
Y_same <- Z * sqrt(runif(n, 0.5, 2))
X_same <- X_same * sqrt(sum(Y_same^2) / sum(X_same^2))  # rescale so sum(X_same^2) == sum(Y_same^2)
beta_YX_same <- sum(X_same * Y_same) / sum(X_same^2)
print(paste("Beta (Y_same onto X_same):", beta_YX_same))
## [1] "Beta (Y_same onto X_same): 0.963116954969162"
beta_XY_same <- sum(X_same * Y_same) / sum(Y_same^2)
print(paste("Beta (X_same onto Y_same):", beta_XY_same))
## [1] "Beta (X_same onto Y_same): 0.963116954969162"
print(paste("Are betas the same:", abs(beta_YX_same - beta_XY_same) < 1e-10))
## [1] "Are betas the same: TRUE"