Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN Classifier - A non-parametric approach used to solve classification problems, i.e., to predict qualitative responses. The KNN classifier works by identifying the K training observations in the neighborhood of \(x_0\) and then estimating the conditional probability \(P(Y = j \mid X = x_0)\) for class \(j\) as the fraction of points in the neighborhood whose response values equal \(j\); \(x_0\) is then assigned to the class with the largest estimated probability.
KNN Regression - A non-parametric method used to solve regression problems, i.e., to predict quantitative responses. KNN regression works by identifying the K training observations closest to \(x_0\) (denoted \(N_0\)) and then estimating \(f(x_0)\) as the average of the training responses in this "neighborhood": \(\hat{f}(x_0) = \frac{1}{K}\sum_{i \in N_0} y_i\).
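To make the contrast concrete, here is a minimal sketch on simulated data, assuming the class and FNN packages are available (the data and variable names below are hypothetical, not part of the assignment):
library(class)  # provides knn() for classification
library(FNN)    # provides knn.reg() for regression
set.seed(1)
train.X = matrix(rnorm(100), ncol = 2)                # 50 training observations
test.X  = matrix(rnorm(10),  ncol = 2)                # 5 test points (the x0's)
cl = factor(sample(c("A", "B"), 50, replace = TRUE))  # qualitative response
y  = rnorm(50)                                        # quantitative response
knn(train.X, test.X, cl, k = 3)          # classifier: majority vote among the K = 3 neighbors
knn.reg(train.X, test.X, y, k = 3)$pred  # regression: average of the K = 3 neighbors' responses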
This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
Auto = read.csv("C:/Users/selen/OneDrive/Documents/Summer 2020 - MSDA/DA 6543 Algorithms II/Data/Auto.csv", header=T, na.strings="?")
Auto = na.omit(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int 33 127 331 337 355
## ..- attr(*, "names")= chr "33" "127" "331" "337" ...
pairs(Auto)
Upon reviewing the scatterplot matrix and the correlation matrix below, there appears to be at least a moderate association between almost all of the variables. Notably, there is a persistently strong association among mpg, cylinders, displacement, horsepower, and weight.
auto.cor = cor(Auto[,names(Auto) !="name"])
auto.cor
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
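To rank the predictors by the strength of their correlation with mpg, the relevant row of the matrix can be sorted (a small sketch):
sort(auto.cor["mpg", -1])  # mpg's correlations with the other variables, ascending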
attach(Auto)
auto.lm = lm(mpg~.-name, data=Auto)
summary(auto.lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Looking at the F-statistic’s p-value of <2.2e-16 (essentially zero), there is strong evidence that at least one of the predictors is associated with our response variable, miles per gallon (mpg). With such a small p-value, we have sufficient evidence to reject \(H_0\): \(\beta_1 = \beta_2 = \cdots = \beta_p = 0\) and conclude that at least one predictor is related to the response. Consistent with \(H_a\) being true, the F-statistic of 252.4 is far greater than 1.
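As a sanity check, the overall F-test p-value can be recovered from the summary object, which stores the statistic and its degrees of freedom in $fstatistic (a sketch):
fs = summary(auto.lm)$fstatistic             # c(value, numdf, dendf)
pf(fs[1], fs[2], fs[3], lower.tail = FALSE)  # overall F-test p-value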
Looking at the t-test p-values, it appears that displacement, weight, year, and origin have statistically significant relationships with the response variable, mpg, as their p-values are close to 0.
The coefficient of the year variable suggests that, on average, each additional model year is associated with an increase in fuel efficiency of approximately 0.750773 mpg, holding all other predictors fixed.
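As a quick illustration of this interpretation (a sketch using hypothetical predictor values; any fixed values would do, since the model is linear):
newcars = data.frame(cylinders = 4, displacement = 150, horsepower = 100,
                     weight = 2800, acceleration = 15.5, year = c(75, 76), origin = 1)
diff(predict(auto.lm, newcars))  # two cars identical except for year; difference is ~0.750773 mpg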
Plot 1: Residuals vs Fitted
The residuals exhibit a slight U-shape, which suggests non-linearity in the data.
There is also evidence of heteroscedasticity (non-constant variance in the errors), as a funnel shape is apparent: the variance of the error terms increases with the value of the response.
R flags observations 323, 326, and 327 as unusually large outliers, as seen in the top right-hand corner of the plot.
Plot 2: Normal Q-Q
The top right-hand corner of the Q-Q plot suggests that the residuals may be slightly right-skewed.
Similar to the findings from the Residuals vs Fitted plot, the Q-Q plot identifies observations 323, 326, and 327 as unusually large outliers.
Plot 3: Scale-Location
Plot 4: Residuals vs Leverage
par(mfrow = c(1,1))
plot(auto.lm)
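Optionally, all four diagnostic plots can be shown in a single panel by changing the layout before plotting:
par(mfrow = c(2, 2))  # 2x2 grid: all four diagnostics at once
plot(auto.lm)
par(mfrow = c(1, 1))  # restore the default single-plot layout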
For this problem, I ran four linear regression models, two of which include interaction effects. The goal of running the four models (two without interaction effects and two with) is to test whether the additive assumption is realistic in this case. To measure the effectiveness of the interaction effects, I will use the change in \(R^2\) as an indicator.
The first model is a linear regression on the top three most highly correlated pairs (displacement & cylinders, displacement & weight, and displacement & horsepower) without interaction effects.
The second model includes the same three pairs with interaction effects. For this model, only the interaction term displacement:horsepower was statistically significant, while displacement:cylinders and displacement:weight were not.
The third model includes the top two most highly correlated pairs (displacement & cylinders and displacement & weight) without interaction effects.
The fourth model includes the same two pairs with interaction effects. For this model, only the interaction term displacement:weight is statistically significant.
Based on a review of the \(R^2\) values for the two sets of models, introducing interaction terms increases \(R^2\), which suggests it may be best to relax the additive assumption of linear regression by including interaction terms; a formal comparison follows the model summaries below.
auto.lm2 = lm(mpg~displacement + weight + cylinders + horsepower, data = Auto)
auto.lm2i = lm(mpg~displacement + weight + cylinders + horsepower + displacement:cylinders + displacement:weight + displacement:horsepower, data = Auto)
summary(auto.lm2)
##
## Call:
## lm(formula = mpg ~ displacement + weight + cylinders + horsepower,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5248 -2.7964 -0.3568 2.2577 16.3221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.7567705 1.5200437 30.102 < 2e-16 ***
## displacement 0.0001389 0.0090099 0.015 0.987709
## weight -0.0052772 0.0007166 -7.364 1.08e-12 ***
## cylinders -0.3932854 0.4095522 -0.960 0.337513
## horsepower -0.0428125 0.0128699 -3.327 0.000963 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.242 on 387 degrees of freedom
## Multiple R-squared: 0.7077, Adjusted R-squared: 0.7046
## F-statistic: 234.2 on 4 and 387 DF, p-value: < 2.2e-16
summary(auto.lm2i)
##
## Call:
## lm(formula = mpg ~ displacement + weight + cylinders + horsepower +
## displacement:cylinders + displacement:weight + displacement:horsepower,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1308 -2.1597 -0.3652 1.9001 16.9864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.584e+01 2.569e+00 21.733 < 2e-16 ***
## displacement -9.524e-02 1.605e-02 -5.935 6.59e-09 ***
## weight -3.803e-03 1.589e-03 -2.394 0.0172 *
## cylinders 3.330e-01 8.190e-01 0.407 0.6845
## horsepower -1.844e-01 2.855e-02 -6.460 3.18e-10 ***
## displacement:cylinders 1.569e-03 3.581e-03 0.438 0.6615
## displacement:weight 4.258e-06 5.555e-06 0.766 0.4439
## displacement:horsepower 4.238e-04 9.786e-05 4.331 1.90e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.865 on 384 degrees of freedom
## Multiple R-squared: 0.7591, Adjusted R-squared: 0.7547
## F-statistic: 172.9 on 7 and 384 DF, p-value: < 2.2e-16
auto.lm3 = lm(mpg~displacement + weight + cylinders, data = Auto)
auto.lm3i = lm(mpg~displacement + weight + cylinders + displacement*cylinders + displacement*weight, data=Auto)
summary(auto.lm3)
##
## Call:
## lm(formula = mpg ~ displacement + weight + cylinders, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5568 -2.8703 -0.3649 2.2708 16.4338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.3709616 1.4806851 29.967 < 2e-16 ***
## displacement -0.0126740 0.0082501 -1.536 0.125
## weight -0.0057079 0.0007139 -7.995 1.5e-14 ***
## cylinders -0.2677968 0.4130673 -0.648 0.517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.297 on 388 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.697
## F-statistic: 300.8 on 3 and 388 DF, p-value: < 2.2e-16
summary(auto.lm3i)
##
## Call:
## lm(formula = mpg ~ displacement + weight + cylinders + displacement *
## cylinders + displacement * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2934 -2.5184 -0.3476 1.8399 17.7723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262e+01 2.237e+00 23.519 < 2e-16 ***
## displacement -7.351e-02 1.669e-02 -4.403 1.38e-05 ***
## weight -9.888e-03 1.329e-03 -7.438 6.69e-13 ***
## cylinders 7.606e-01 7.669e-01 0.992 0.322
## displacement:cylinders -2.986e-03 3.426e-03 -0.872 0.384
## displacement:weight 2.128e-05 5.002e-06 4.254 2.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
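As a complement to comparing \(R^2\) values, each nested pair of models can be compared formally with a partial F-test (a sketch using anova()):
anova(auto.lm2, auto.lm2i)  # jointly tests the three interaction terms
anova(auto.lm3, auto.lm3i)  # jointly tests the two interaction terms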
For this problem, I will try transforming the least significant variable using \(\log(x)\), \(\sqrt{x}\), and \(x^2\) transformations.
As you may recall, the base linear regression model (auto.lm) gave acceleration a p-value of 0.41548. As acceleration had the highest p-value, I began my transformations with this variable. With the log and square-root transformations, the p-value worsens, increasing to 0.9368 and 0.70343, respectively. The squared transformation substantially improves the significance of acceleration, as its p-value drops close to 0.
For the \(x^2\) transformation of acceleration, we can also see that \(R^2\) increases slightly (from 0.8215 to 0.8316).
summary(auto.lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
auto.lm4 = lm(mpg~cylinders + displacement + horsepower + weight + log(acceleration) + year + origin, data = Auto)
summary(auto.lm4)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## log(acceleration) + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7774 -2.1790 -0.1636 1.8434 13.1268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.174273 6.443614 -2.355 0.0190 *
## cylinders -0.507167 0.323203 -1.569 0.1174
## displacement 0.019166 0.007595 2.524 0.0120 *
## horsepower -0.024622 0.014198 -1.734 0.0837 .
## weight -0.006190 0.000676 -9.157 < 2e-16 ***
## log(acceleration) -0.129499 1.631402 -0.079 0.9368
## year 0.747224 0.050993 14.654 < 2e-16 ***
## origin 1.428083 0.278370 5.130 4.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.331 on 384 degrees of freedom
## Multiple R-squared: 0.8212, Adjusted R-squared: 0.8179
## F-statistic: 251.9 on 7 and 384 DF, p-value: < 2.2e-16
auto.lm5 = lm(mpg~cylinders + displacement + horsepower + weight + sqrt(acceleration) + year + origin, data = Auto)
summary(auto.lm5)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## sqrt(acceleration) + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6792 -2.1496 -0.1413 1.8603 13.0920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.696e+01 5.556e+00 -3.052 0.00243 **
## cylinders -5.022e-01 3.233e-01 -1.553 0.12117
## displacement 1.966e-02 7.550e-03 2.604 0.00958 **
## horsepower -2.052e-02 1.401e-02 -1.464 0.14395
## weight -6.347e-03 6.639e-04 -9.560 < 2e-16 ***
## sqrt(acceleration) 3.086e-01 8.101e-01 0.381 0.70343
## year 7.490e-01 5.100e-02 14.687 < 2e-16 ***
## origin 1.428e+00 2.783e-01 5.131 4.58e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.33 on 384 degrees of freedom
## Multiple R-squared: 0.8212, Adjusted R-squared: 0.818
## F-statistic: 252 on 7 and 384 DF, p-value: < 2.2e-16
auto.lm6 = lm(mpg~cylinders + displacement + horsepower + weight + acceleration + I(acceleration^2) + year + origin, data = Auto)
summary(auto.lm6)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + I(acceleration^2) + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9680 -1.9266 -0.0124 1.9153 13.2722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1088174 6.4930423 0.787 0.4319
## cylinders -0.3181584 0.3165577 -1.005 0.3155
## displacement 0.0090446 0.0076528 1.182 0.2380
## horsepower -0.0346411 0.0139094 -2.490 0.0132 *
## weight -0.0054113 0.0006719 -8.053 1.03e-14 ***
## acceleration -2.6374431 0.5758788 -4.580 6.30e-06 ***
## I(acceleration^2) 0.0790472 0.0165131 4.787 2.42e-06 ***
## year 0.7535781 0.0495815 15.199 < 2e-16 ***
## origin 1.3265929 0.2713219 4.889 1.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.237 on 383 degrees of freedom
## Multiple R-squared: 0.8316, Adjusted R-squared: 0.828
## F-statistic: 236.3 on 8 and 383 DF, p-value: < 2.2e-16
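For a side-by-side look at the four fits, the adjusted \(R^2\) values can be collected in a single call (a sketch):
sapply(list(base = auto.lm, log = auto.lm4, sqrt = auto.lm5, squared = auto.lm6),
       function(m) summary(m)$adj.r.squared)  # adjusted R-squared for each model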
However, even with the transformation of acceleration, we are still violating some of the linear regression assumptions:
The Residuals vs Fitted plot indicates heteroscedasticity (non-constant variance of the errors) in the model.
The Q-Q plot indicates some non-normality of the residuals, as skewness is visible on the right side of the plot.
Additionally, the same issues of unusually large outliers and high-leverage points persist, as seen in the residual and leverage plots.
To conclude, a better transformation of the data may be required to improve the soundness and fit of our model; one possibility is sketched after the diagnostic plots below.
plot(auto.lm6)
detach(Auto)
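One possibility (a sketch only; not evaluated here) is transforming the response itself, since a log transform often tames a funnel-shaped residual pattern:
auto.lm.log = lm(log(mpg) ~ cylinders + displacement + horsepower + weight +
                 acceleration + year + origin, data = Auto)  # log-transform the response
plot(auto.lm.log)  # re-check the diagnostic plots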
This question should be answered using the Carseats data set.
library(ISLR)
fix(Carseats)
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
attach(Carseats)
car.lm = lm(Sales~Price+Urban+US, data = Carseats)
summary(car.lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price: The \(\beta\) value for Price of -0.0545 can be interpreted as the average effect on Carseat Sales of a one-unit increase in Price, holding all other variables fixed.
UrbanYes: The \(\beta_0\) of 13.043 can be interpreted as the average Carseat Sales in non-urban areas. \(\beta_0 + \beta_{UrbanYes}\) is the average Sales for Carseats in urban areas, and \(\beta_{UrbanYes}\) is the average difference in Carseat Sales between urban and non-urban areas, holding all other variables fixed. The high p-value of UrbanYes suggests that there is no statistical evidence of a difference in Carseat Sales between urban and non-urban areas.
USYes: Likewise, the \(\beta_0\) of 13.043 can be interpreted as the average Carseat Sales among non-US stores. \(\beta_0 + \beta_{USYes}\) is the average Sales for Carseats in stores located in the US, and \(\beta_{USYes}\) is the average difference in Carseat Sales between US and non-US stores, holding all other variables fixed.
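R's dummy (0/1) coding for these qualitative predictors can be inspected directly (a small sketch):
contrasts(Carseats$Urban)  # UrbanYes = 1 for urban locations, 0 otherwise
contrasts(Carseats$US)     # USYes = 1 for US stores, 0 otherwise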
\[\widehat{Sales} = 13.043469 - 0.054459 \times Price - 0.021916 \times UrbanYes + 1.200573 \times USYes\]
We can reject the null hypothesis \(H_0\): \(\beta_j = 0\) for the variables Price and USYes, as their p-values are close to zero. A p-value this close to 0 indicates clear evidence of a relationship between each of these predictors and Sales.
car.lm2 = lm(Sales~Price+US, data = Carseats)
summary(car.lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The RSE for the Sales ~ Price + Urban + US model is 2.472. In other words, actual carseat Sales at each store will deviate from the true regression line by approximately 2,472 units, on average (Sales is recorded in thousands of units). The \(R^{2}\) for this model is 0.2393, meaning that 23.93% of the variance in Sales can be explained by Price, Urban, and US.
For Sales ~ Price + US, we see a slightly lower RSE of 2.469, meaning actual Sales will deviate from the true regression line by approximately 2,469 units, on average. Like the model that includes the Urban variable, the \(R^{2}\) is 0.2393, meaning that 23.93% of the variance in Sales can be explained by the Price and US variables.
Neither model fits the data well, considering the small \(R^{2}\) values and high RSE. It may be best to consider a non-linear regression model, interaction terms, or some type of transformation of the predictor variables; a formal check on dropping Urban is sketched below.
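Because the two models are nested, a partial F-test makes the comparison formal (a sketch):
anova(car.lm2, car.lm)  # tests whether Urban adds explanatory power beyond Price + US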
confint(car.lm2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The Residuals vs Fitted, Q-Q, and Scale-Location plots show evidence that observations 51, 69, and 377 may be potential outliers. Observation 368 is a high-leverage point. The Residuals vs Leverage plot also flags observations 26 and 50 as notable points, but I believe this is due to their high standardized residuals rather than leverage; a numeric leverage check follows the plots below.
plot(car.lm2)
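The leverage claim can be checked numerically; a common rule of thumb flags points whose leverage exceeds twice the average \((p+1)/n\) (a sketch):
hv = hatvalues(car.lm2)   # leverage (hat values) for each observation
which(hv > 2 * mean(hv))  # observations above twice the average leverage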
detach(Carseats)
This problem involves simple linear regression without an intercept.
The coefficient estimate of the regression of Y onto X is - \[\hat{\beta} = \frac{\sum_{i=1}^{n}x_{i}y_i}{\sum_{j=1}^{n}x^{2}_{j}}\]
The coefficient estimate of the regression of X onto Y is - \[\hat{\beta'} = \frac{\sum_{i=1}^{n}x_{i}y_i}{\sum_{j=1}^{n}y^{2}_{j}}\]
The coefficient estimate for the regression of X onto Y will be the same as the coefficient estimate for the regression of Y onto X when the sum of squares of the observed y-values equals the sum of squares of the observed x-values. We can see that this holds because the two \(\hat{\beta}\) estimates are identical except for their denominators:
\[{\sum_{j=1}^{n}x^{2}_{j}} = {\sum_{j=1}^{n}y^{2}_{j}}\]
As seen below, the sums of squares of X and Y are different. In this case, the regressions of X onto Y and Y onto X are expected to have different coefficient estimates.
set.seed(12)
x = rnorm(100)
y = 2*x
sum(x^2)
## [1] 74.18993
sum(y^2)
## [1] 296.7597
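The closed-form estimates above can be computed directly and compared with the lm() fits that follow (a sketch):
sum(x*y) / sum(x^2)  # beta-hat for the regression of Y onto X (should be 2)
sum(x*y) / sum(y^2)  # beta-hat for the regression of X onto Y (should be 0.5)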
lm.fit = lm(y~x+1)
lm.fit2 = lm(x~y+0)
summary(lm.fit)
## Warning in summary.lm(lm.fit): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x + 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.404e-15 -1.100e-17 5.550e-17 1.754e-16 5.476e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.441e-17 7.756e-17 -5.730e-01 0.568
## x 2.000e+00 9.005e-17 2.221e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.751e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.933e+32 on 1 and 98 DF, p-value: < 2.2e-16
summary(lm.fit2)
## Warning in summary.lm(lm.fit2): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.474e-16 -3.990e-17 -3.100e-18 3.530e-17 4.977e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.000e-01 2.964e-17 1.687e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.106e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.845e+32 on 1 and 99 DF, p-value: < 2.2e-16
As seen below, the sums of squares of X and Y are the same; thus, we can expect the coefficient estimate for the regression of X onto Y to be the same as the coefficient estimate for the regression of Y onto X.
set.seed(12)
x = rnorm(100)
y = - sample(x, 100)
sum(x^2)
## [1] 74.18993
sum(y^2)
## [1] 74.18993
lm.fit.x = lm(y~x+0)
lm.fit.y = lm(x~y+0)
summary(lm.fit.x)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1608 -0.4869 0.1308 0.5857 2.1343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.04395 0.10041 0.438 0.663
##
## Residual standard error: 0.8648 on 99 degrees of freedom
## Multiple R-squared: 0.001931, Adjusted R-squared: -0.00815
## F-statistic: 0.1916 on 1 and 99 DF, p-value: 0.6626
summary(lm.fit.y)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2370 -0.5915 -0.1117 0.4835 2.1114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.04395 0.10041 0.438 0.663
##
## Residual standard error: 0.8648 on 99 degrees of freedom
## Multiple R-squared: 0.001931, Adjusted R-squared: -0.00815
## F-statistic: 0.1916 on 1 and 99 DF, p-value: 0.6626