This is the R portion of your mid-term exam. You will analyze the
Auto dataset, which contains information about various car models
(similar to mtcars). Follow the instructions carefully and
write your R code in the provided chunks. You will be graded on the
correctness of your code, the quality of your analysis, and your
interpretation of the results.
Total points: 10
Time allowed: 45 minutes
Good luck!
Auto, and display the first few rows. (1 point)
library(ggplot2)
Auto <- read.csv("Auto.csv")
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
# Show the structure of the Auto dataset (variables and data types)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
The dataset includes 9 variables with 392 observations.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.1
## ✔ purrr 1.2.1 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
corr_matrix <- Auto %>%
  select(where(is.numeric)) %>%
  cor()
print(round(corr_matrix, 2))
## mpg cylinders displacement horsepower weight acceleration year
## mpg 1.00 -0.78 -0.81 -0.78 -0.83 0.42 0.58
## cylinders -0.78 1.00 0.95 0.84 0.90 -0.50 -0.35
## displacement -0.81 0.95 1.00 0.90 0.93 -0.54 -0.37
## horsepower -0.78 0.84 0.90 1.00 0.86 -0.69 -0.42
## weight -0.83 0.90 0.93 0.86 1.00 -0.42 -0.31
## acceleration 0.42 -0.50 -0.54 -0.69 -0.42 1.00 0.29
## year 0.58 -0.35 -0.37 -0.42 -0.31 0.29 1.00
## origin 0.57 -0.57 -0.61 -0.46 -0.59 0.21 0.18
## origin
## mpg 0.57
## cylinders -0.57
## displacement -0.61
## horsepower -0.46
## weight -0.59
## acceleration 0.21
## year 0.18
## origin 1.00
plot() or ggplot()). Add a title and proper
axis labels. You don’t need to interpret the result here but you should
know how. (1 point)
library(ggplot2)
ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs Weight",
       x = "Weight",
       y = "Miles per Gallon")
# Multiple linear regression of mpg on weight, horsepower, and year
Auto_lm_simple <- lm(mpg ~ weight + horsepower + year, data = Auto)
summary(Auto_lm_simple)
##
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7911 -2.3220 -0.1753 2.0595 14.3527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.372e+01 4.182e+00 -3.281 0.00113 **
## weight -6.448e-03 4.089e-04 -15.768 < 2e-16 ***
## horsepower -5.000e-03 9.439e-03 -0.530 0.59663
## year 7.487e-01 5.212e-02 14.365 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.8068
## F-statistic: 545.4 on 3 and 388 DF, p-value: < 2.2e-16
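weight. What
do they tell us about the relationship between the predictors and ‘mpg’?
(1 point)
The intercept (-13.72) is the predicted mpg when weight, horsepower, and year are all zero. That point lies far outside the observed data, so the intercept has no practical meaning on its own. The weight coefficient of -0.0064 means that, holding horsepower and year constant, a one-unit increase in weight is associated with a decrease of about 0.0064 mpg. Heavier cars get lower gas mileage.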
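To make the size of the weight effect concrete, the slope can be scaled to a larger change in weight. The sketch below uses the coefficient reported in the model summary above. The 1,000-unit increment is an arbitrary choice for illustration, and it assumes weight is recorded in pounds.

```r
# Weight coefficient copied from the model summary above
b_weight <- -6.448e-03

# Illustrative increment (hypothetical choice): 1,000 units of weight,
# assumed here to be pounds
delta_weight <- 1000

# Expected change in mpg, holding horsepower and year fixed
delta_mpg <- b_weight * delta_weight
delta_mpg
```

Under these assumptions, a car that is 1,000 units heavier is predicted to get roughly 6.4 fewer miles per gallon, other predictors being equal.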
plot(Auto_lm_simple)
Residuals vs Fitted: the points are generally scattered around zero, so the assumptions appear to be mostly met. However, the red smoothing line shows a slight curve, which suggests some mild nonlinearity.
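The Residuals vs Fitted panel can also be requested on its own, or drawn by hand from the fitted values and residuals. The following is a minimal sketch that uses the built-in mtcars data as a stand-in so that it runs without Auto.csv. The model `fit` is hypothetical and is not the exam model.

```r
# Stand-in model on the built-in mtcars data (NOT the exam's Auto data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# which = 1 requests only the Residuals vs Fitted panel
plot(fit, which = 1)

# The same information drawn by hand
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)  # reference line at zero
```

A roughly flat band of points around the dashed line is consistent with linearity; a visible bend, like the one described above, hints that a transformation or an extra term may help.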
#R-squared
r2 <- summary(Auto_lm_simple)$r.squared
r2
## [1] 0.8083189
#Adjusted R-squared
adj_r2 <- summary(Auto_lm_simple)$adj.r.squared
adj_r2
## [1] 0.8068368
The R-squared is 0.8083, indicating that the model explains about 80.83% of the variance in mpg. The adjusted R-squared is 0.8068, so the model still explains about 80.68% of the variation in mpg after adjusting for the number of predictors.
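The adjustment can be reproduced by hand from the standard formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1). The sketch below plugs in the values reported above. The variable names are illustrative only.

```r
r2 <- 0.8083189  # multiple R-squared from the output above
n  <- 392        # number of observations, from str(Auto)
p  <- 3          # number of predictors: weight, horsepower, year

# Penalize R-squared for the number of predictors used
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2_manual
```

The result should agree, up to rounding, with the adjusted R-squared returned by `summary()`.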
weight and horsepower added to the ‘weight’,
‘horsepower’, and ‘year’ as predictors (X) and report the adjusted
R-squared. (1 point)
Auto_lm_int <- lm(mpg ~ weight * horsepower + year, data = Auto)
summary(Auto_lm_int)
##
## Call:
## lm(formula = mpg ~ weight * horsepower + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9146 -1.8987 -0.0386 1.5536 12.6333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.577e+00 3.911e+00 0.915 0.361
## weight -1.185e-02 5.868e-04 -20.198 <2e-16 ***
## horsepower -2.236e-01 2.063e-02 -10.837 <2e-16 ***
## year 7.749e-01 4.508e-02 17.190 <2e-16 ***
## weight:horsepower 5.790e-05 5.020e-06 11.534 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.963 on 387 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8559
## F-statistic: 581.5 on 4 and 387 DF, p-value: < 2.2e-16
The adjusted R-squared for the interaction model is 0.8559.
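With an interaction term in the model, the effect of weight is no longer a single number: the slope of mpg with respect to weight is the weight coefficient plus the interaction coefficient times horsepower. The sketch below evaluates that slope using the coefficients from the summary above. The two horsepower values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the interaction model summary above
b_weight      <- -1.185e-02
b_interaction <-  5.790e-05

# Hypothetical horsepower values chosen for illustration
hp <- c(100, 200)

# Slope of mpg with respect to weight at each horsepower level
slope_weight <- b_weight + b_interaction * hp
data.frame(horsepower = hp, slope_weight = slope_weight)
```

Under these coefficients the weight slope stays negative at both values but moves closer to zero as horsepower rises, which is what the positive interaction coefficient implies.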
# Original model (no interaction)
#R-squared
r2 <- summary(Auto_lm_simple)$r.squared
r2
## [1] 0.8083189
#Adjusted R-squared
adj_r2 <- summary(Auto_lm_simple)$adj.r.squared
adj_r2
## [1] 0.8068368
# Interaction model
# R-squared
r2 <- summary(Auto_lm_int)$r.squared
r2
## [1] 0.8573517
#Adjusted R-squared
adj_r2 <- summary(Auto_lm_int)$adj.r.squared
adj_r2
## [1] 0.8558773
Yes, adding the interaction term improved the model. The R-squared and adjusted R-squared increased to about 0.8574 and 0.8559, respectively, and the interaction term is highly significant (p < 2e-16). The new model explains about 86% of the variance in mpg, compared with about 81% for the original model.
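Because the original model is nested inside the interaction model, the improvement can also be checked with a partial F-test via `anova()`, alongside the adjusted R-squared comparison. This is a minimal sketch on the built-in mtcars data, used as a stand-in so the code runs without Auto.csv. The models `fit_main` and `fit_int` are hypothetical and are not the exam models.

```r
# Stand-in nested models on mtcars (NOT the exam's Auto data)
fit_main <- lm(mpg ~ wt + hp, data = mtcars)  # main effects only
fit_int  <- lm(mpg ~ wt * hp, data = mtcars)  # adds the wt:hp interaction

# Partial F-test: does the interaction term reduce residual error
# by more than chance would explain?
comparison <- anova(fit_main, fit_int)
comparison

# Adjusted R-squared for both models, side by side
c(main        = summary(fit_main)$adj.r.squared,
  interaction = summary(fit_int)$adj.r.squared)
```

A small p-value in the second row of the `anova()` table supports keeping the interaction term, which complements the adjusted R-squared comparison used in the answer.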
End of Exam. Please submit this RMD file along with a knitted HTML report. Failure to submit the HTML report will result in a 1-point deduction.