Before starting, I always load the tidyverse to make sure I’m covering all my bases.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ISLR2)
#While similar, the KNN classifier predicts which class an observation belongs to (the most common class among its K nearest neighbors), whereas KNN regression predicts a quantitative value by averaging the responses of the K nearest training observations.
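To make the difference concrete, here is a minimal base-R sketch on made-up one-dimensional data. The data and the helper names (`nearest`, `knn_classify`, `knn_regress`) are hypothetical and only illustrate the idea; they are not from any package.

```r
# Toy 1-D training data (invented for illustration)
train_x <- c(1, 2, 3, 6, 7, 8)
train_class <- c("low", "low", "low", "high", "high", "high")
train_y <- c(1.1, 1.9, 3.2, 6.1, 6.8, 8.3)

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[1:k]

# Classification: majority vote among the k nearest neighbors
knn_classify <- function(x0, k = 3) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# Regression: average response of the k nearest neighbors
knn_regress <- function(x0, k = 3) {
  mean(train_y[nearest(x0, k)])
}

knn_classify(2.5)  # returns a class label
knn_regress(2.5)   # returns a number
```

Both functions use the same neighbors; only the way the neighbors' responses are combined differs.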
#Auto <- read.csv("Auto.csv")
attach(Auto)
## The following object is masked from package:ggplot2:
##
## mpg
plot(Auto)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Auto$horsepower <- as.numeric(Auto$horsepower)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
mlr <- lm(mpg~.-name, data=Auto)
summary(mlr)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
#I. There is an overall relationship between the predictors and the response in this model: the F-statistic is 252.4 with a p-value < 2.2e-16.
#II. There is a relationship between the response (mpg) and the following predictors, as shown by p-values below 0.05: displacement, weight, year, origin.
#III. The year coefficient (0.75) suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg (i.e., mpg is increasing as time goes on).
par(mfrow = c(2, 2))
plot(mlr)
#These plots show outliers and observations with high leverage, especially at the higher fitted values. The residuals follow a roughly U-shaped pattern, with more spread at the ends than in the middle, which may indicate that the relationship is not linear.
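The visual impression from the diagnostic plots can be checked numerically. The sketch below refits the same model and flags observations using two common rules of thumb; the cutoffs (leverage above twice the average, absolute studentized residual above 3) are assumptions, not hard rules.

```r
library(ISLR2)

# Refit the full model so this block runs on its own
mlr <- lm(mpg ~ . - name, data = Auto)

n <- nrow(Auto)
p <- length(coef(mlr)) - 1   # number of predictors

lev <- hatvalues(mlr)        # leverage of each observation
stud <- rstudent(mlr)        # studentized residuals

# Rule-of-thumb cutoffs (assumptions for illustration)
high_leverage <- which(lev > 2 * (p + 1) / n)
outliers <- which(abs(stud) > 3)

# Inspect the flagged cars
Auto[high_leverage, c("name", "mpg", "horsepower", "weight")]
Auto[outliers, c("name", "mpg", "horsepower", "weight")]
```

The average leverage is always (p + 1)/n, so the first cutoff flags points with at least twice the typical influence on their own fitted value.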
interaction effects. Do any interactions appear to be statistically significant?
mlr2 <- lm(mpg~acceleration*horsepower + weight:acceleration, data=Auto)
summary(mlr2)
##
## Call:
## lm(formula = mpg ~ acceleration * horsepower + weight:acceleration,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1722 -2.4537 -0.3763 2.1673 15.3486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.759e+01 3.268e+00 8.442 6.34e-16 ***
## acceleration 1.245e+00 2.050e-01 6.073 3.00e-09 ***
## horsepower -1.036e-02 2.575e-02 -0.403 0.6875
## acceleration:horsepower -5.676e-03 2.257e-03 -2.515 0.0123 *
## acceleration:weight -3.004e-04 3.823e-05 -7.858 3.90e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.116 on 387 degrees of freedom
## Multiple R-squared: 0.7248, Adjusted R-squared: 0.7219
## F-statistic: 254.8 on 4 and 387 DF, p-value: < 2.2e-16
#Choosing the interactions acceleration:horsepower and acceleration:weight, both appear to be statistically significant at the 0.05 level (p = 0.0123 and p = 3.90e-14, respectively).
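Besides the individual t-tests, the two interaction terms can be tested jointly with a partial F-test. This is a sketch, assuming the main-effects model `mpg ~ acceleration + horsepower` as the baseline; the object names `main_only` and `with_int` are mine.

```r
library(ISLR2)

# Baseline: main effects only
main_only <- lm(mpg ~ acceleration + horsepower, data = Auto)

# Same model plus the two interaction terms used above
with_int <- lm(mpg ~ acceleration * horsepower + weight:acceleration,
               data = Auto)

# Partial F-test: do the interactions jointly improve the fit?
anova(main_only, with_int)
```

A small p-value in the second row would indicate that the interactions, taken together, explain variation that the main effects alone do not.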
log(X), √X, X^2. Comment on your findings.
par(mfrow = c(2, 2))
# The response column is named mpg (lowercase); Auto$MPG would be NULL
plot(Auto$acceleration, Auto$mpg)
plot(log(Auto$acceleration), Auto$mpg)
plot(sqrt(Auto$acceleration), Auto$mpg)
plot((Auto$acceleration)^2, Auto$mpg)
plot(Auto$horsepower, Auto$mpg)
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
detach(Auto)
#When applying log, sqrt, and square transformations to acceleration, the plots don't show much difference: the data doesn't become noticeably more or less linear. For horsepower, the log-transformed data may give a more linear plot than the untransformed data. And don't forget to detach when you are finished with these questions!
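One way to back up the visual comparison is to fit a simple regression for each transformation and compare R-squared. This is only a sketch for horsepower; the list name `fits` and its labels are my own.

```r
library(ISLR2)

# One simple regression per transformation of horsepower
fits <- list(
  raw    = lm(mpg ~ horsepower, data = Auto),
  log    = lm(mpg ~ log(horsepower), data = Auto),
  sqrt   = lm(mpg ~ sqrt(horsepower), data = Auto),
  square = lm(mpg ~ I(horsepower^2), data = Auto)
)

# Proportion of variance in mpg explained by each version
r2 <- sapply(fits, function(f) summary(f)$r.squared)
round(r2, 3)
```

Note the `I()` wrapper around `horsepower^2`: inside a formula, `^` has a special meaning, so the square has to be protected. Swapping `horsepower` for `acceleration` gives the same comparison for the other predictor.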
#Carseats <- read_csv("Carseats.csv")
view(Carseats)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Carseats$US <- as.factor(Carseats$US)
Carseats$Urban <- as.factor(Carseats$Urban)
Carseats$ShelveLoc <- as.factor(Carseats$ShelveLoc)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
attach(Carseats)
carSeatMRM <- lm(Sales~Price+Urban+US, data=Carseats)
summary(carSeatMRM)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
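#Price: as price increases, sales decrease. Sales is measured in thousands of units, so the coefficient shows that each one-dollar increase in price is associated with a decrease in sales of about 0.054 thousand units (roughly 54 car seats), holding Urban and US fixed.
#Urban: This variable is not statistically significant (p = 0.936), so there is no evidence that sales differ between urban and non-urban stores.
#US: Stores located in the US sell about 1.2 thousand more units on average than non-US stores, holding the other predictors fixed.
#Sales = β0 + β1 × Price + β2 × UrbanYes + β3 × USYes, where UrbanYes and USYes equal 1 when the store is urban / in the US and 0 otherwise.
#The null hypotheses for Price and US are rejected, as their p-values are less than 0.05 (statistically significant).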
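To see how the equation handles the Yes/No variables, the sketch below computes a prediction by hand from the coefficients and compares it with `predict()`. The store (price 100, urban, in the US) is a hypothetical example, and the object names are mine.

```r
library(ISLR2)

fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
b <- coef(fit)

# Hypothetical store: Price = 100, Urban = Yes, US = Yes
# The dummy variables UrbanYes and USYes are both 1 here
by_hand <- b["(Intercept)"] + b["Price"] * 100 +
  b["UrbanYes"] * 1 + b["USYes"] * 1

# The same store, passed to predict()
new_store <- data.frame(
  Price = 100,
  Urban = factor("Yes", levels = levels(Carseats$Urban)),
  US    = factor("Yes", levels = levels(Carseats$US))
)
from_predict <- predict(fit, newdata = new_store)

all.equal(unname(by_hand), unname(from_predict))
```

For a non-urban or non-US store the corresponding dummy is 0, so that term simply drops out of the sum.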
evidence of association with the outcome.
carSeatMRM2 <- lm(Sales~Price+US, data=Carseats)
summary(carSeatMRM2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
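#Sales = β0 + β1 × Price + β2 × USYes
#The R-squared value is the same for both models (0.2393), which shows that Urban added essentially nothing to the model. The adjusted R-squared rose slightly (0.2335 to 0.2354) and the residual standard error fell (2.472 to 2.469), so the smaller model fits marginally better. The larger F-statistic in the second model mostly reflects having one fewer predictor.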
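The two models are nested, so they can also be compared directly with a partial F-test. A sketch, refitting both models under the names `full` and `reduced` (my labels):

```r
library(ISLR2)

full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
reduced <- lm(Sales ~ Price + US, data = Carseats)

# Does adding Urban significantly reduce the residual sum of squares?
anova(reduced, full)

# Adjusted R-squared penalizes the extra predictor
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

With a single dropped term, the p-value from `anova()` matches the t-test p-value for UrbanYes in the full model's summary.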
confint(carSeatMRM2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(carSeatMRM2)
#Yes, the plots show some evidence of outliers and high leverage for the model using only Price and US. It looks like there are two outliers above the data and one outlier beneath the data, and there looks to be at least one high-leverage point.
detach(Carseats)
circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
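#Without an intercept, the coefficient for the regression of Y onto X is ∑j xj yj / ∑j xj^2, and for X onto Y it is ∑j xj yj / ∑j yj^2. The numerators are identical, so the coefficients are equal whenever the denominators are equal, i.e. when ∑j xj^2 = ∑j yj^2.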
the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X
x <- 1:100
y <- 10:109
regressionX <- lm(x~y+0)
regressionY <- lm(y~x+0)
coef(regressionX)
## y
## 0.877558
coef(regressionY)
## x
## 1.134328
the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X
x2 <- 1:100
y2 <- x2
regressionX2 <- lm(x2~y2+0)
regressionY2 <- lm(y2~x2+0)
coef(regressionX2)
## y2
## 1
coef(regressionY2)
## x2
## 1
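The closed-form expressions for the no-intercept slopes can be checked against `lm()` directly. The sketch below reuses the first example and then adds one more case, `y3 <- rev(x3)`, which is my own addition: it shows that the two coefficients can match even when y is not equal to x, as long as the sums of squares agree.

```r
x <- 1:100
y <- 10:109

# Closed-form slopes for regression through the origin
slope_y_on_x <- sum(x * y) / sum(x^2)
slope_x_on_y <- sum(x * y) / sum(y^2)

# They agree with lm() fitted without an intercept
all.equal(slope_y_on_x, unname(coef(lm(y ~ x + 0))))
all.equal(slope_x_on_y, unname(coef(lm(x ~ y + 0))))

# Reversing x changes the pairing but not the sum of squares,
# so both regressions give the same coefficient
x3 <- 1:100
y3 <- rev(x3)
coef(lm(x3 ~ y3 + 0))
coef(lm(y3 ~ x3 + 0))
```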