This is the R portion of your mid-term exam. You will analyze the
Auto dataset, which contains information about various car models
(similar to mtcars). Follow the instructions carefully and
write your R code in the provided chunks. You will be graded on the
correctness of your code, the quality of your analysis, and your
interpretation of the results.
Total points: 10
Time allowed: 45 minutes
Good luck!
Auto, and display the first few rows. (1 point)
# Your code here
library(readr)
auto <- read_csv("auto.csv")
## Rows: 392 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (8): mpg, cylinders, displacement, horsepower, weight, acceleration, yea...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ stringr 1.5.1
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
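The question also asks for the first few rows, which `head()` prints. Below is a minimal sketch that uses the built-in `mtcars` data as a stand-in, because `auto.csv` is not bundled here. With the exam data the call would presumably be `head(auto)`.

```r
# Stand-in data: mtcars ships with base R (assumption; the exam uses auto.csv).
cars <- mtcars

# head() returns the first six rows by default.
first_rows <- head(cars)
print(first_rows)

# The n argument controls how many rows are returned.
first_three <- head(cars, n = 3)
print(first_three)
```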
# Your code here
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
[Your answer here] There are 392 observations and 9 variables.
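The counts can also be confirmed directly in code rather than read off the `read_csv()` message. This is a small sketch on `mtcars`, which is only a stand-in; for the exam data, `cars` would be replaced by `auto`.

```r
cars <- mtcars  # stand-in for the exam's `auto` data frame (assumption)

# dim() returns c(number of rows, number of columns).
size <- dim(cars)
n_obs <- nrow(cars)
n_vars <- ncol(cars)

cat("Observations:", n_obs, "Variables:", n_vars, "\n")

# str() lists each column with its type, which complements summary().
str(cars)
```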
# Your code here
auto_num <- select(auto, mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin)
corr_matrix <- cor(auto_num)
print(corr_matrix)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
library(corrplot)
## corrplot 0.94 loaded
corrplot(corr_matrix, method="circle", type="upper", order="hclust",
tl.col="black", tl.srt=45)
plot() or ggplot()). Add a title and proper
axis labels. You don’t need to interpret the result here but you should
know how. (1 point)
# Your code here
ggplot(auto_num, aes(x = mpg, y = weight)) +
  geom_point() +
  labs(title = "Mpg vs Weight", x = "Miles per gallon", y = "Weight")
boxplot() or ggplot()). You don’t need to
interpret the result here but you should know how. (1 point)
# Your code here
ggplot(auto_num, aes(x = factor(origin), y = mpg)) +
  geom_boxplot(aes(fill = factor(origin)))
# Your code here
auto_num_lm <- lm(mpg ~ weight + horsepower + year, data = auto_num)
summary(auto_num_lm)
##
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = auto_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7911 -2.3220 -0.1753 2.0595 14.3527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.372e+01 4.182e+00 -3.281 0.00113 **
## weight -6.448e-03 4.089e-04 -15.768 < 2e-16 ***
## horsepower -5.000e-03 9.439e-03 -0.530 0.59663
## year 7.487e-01 5.212e-02 14.365 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.8068
## F-statistic: 545.4 on 3 and 388 DF, p-value: < 2.2e-16
weight. What
do they tell us about the relationship between the predictors and ‘mpg’?
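(1 point)
[Your interpretation here] The intercept and the coefficients for
weight and year are statistically significant (p < 0.01), while the
coefficient for horsepower is not (p = 0.597). The coefficients for
weight and horsepower are both negative: holding the other predictors
fixed, a one-unit increase in weight is associated with a decrease of
about 0.0064 in mpg, and a one-unit increase in horsepower with a
decrease of about 0.005 in mpg. Because the horsepower coefficient is
not significant, its effect cannot be distinguished from zero in this
model.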
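Significance and direction can be read programmatically from the coefficient table instead of by eye. The sketch below is fitted on `mtcars`, with `wt` and `hp` standing in for `weight` and `horsepower`. These names are assumptions for illustration; the exam model is `auto_num_lm`.

```r
# Stand-in model on built-in data (assumption: not the exam's Auto data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# coef(summary()) returns a matrix with the columns
# Estimate, Std. Error, t value, and Pr(>|t|).
coef_table <- coef(summary(fit))
print(coef_table)

# Flag which terms are significant at the 5% level.
significant <- coef_table[, "Pr(>|t|)"] < 0.05
print(significant)

# 95% confidence intervals; an interval that contains 0 corresponds to a
# coefficient that is not significant at the 5% level.
ci <- confint(fit, level = 0.95)
print(ci)
```

The sign of each estimate gives the direction of the association, and its magnitude is the expected change in the response per one-unit change in that predictor with the others held fixed.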
# Your code here
par(mfrow = c(2, 2))
plot(auto_num_lm)
[Your comments here] Residuals vs Fitted: there is some pattern, but overall the linearity assumption looks reasonable. Q-Q Residuals: the points lie mostly on a straight line, with some deviation in the tails that is not too concerning, so the residuals appear approximately normal. Scale-Location: the spread looks roughly constant, which supports homoscedasticity. Residuals vs Leverage: there are some outliers, and many points lie away from the center, which may indicate influential observations worth checking.
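The visual checks can be backed up with numeric ones. The sketch below applies a Shapiro-Wilk test to the residuals and computes Cook's distance for each observation. It uses a stand-in model on `mtcars` (an assumption, not the exam model), and the `4 / n` cutoff is a common rule of thumb rather than a formal threshold.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model (assumption)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
normality <- shapiro.test(residuals(fit))
print(normality)

# Cook's distance measures how much each observation influences the fit.
cooks <- cooks.distance(fit)

# 4 / n is a rule-of-thumb cutoff, not a formal test.
cutoff <- 4 / nrow(mtcars)
influential <- names(cooks)[cooks > cutoff]
print(influential)
```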
# Your code here
auto_sum_lm <- summary(auto_num_lm)
auto_sum_lm$r.squared
## [1] 0.8083189
auto_sum_lm$adj.r.squared
## [1] 0.8068368
[Your interpretation here] R-squared (0.808) and adjusted R-squared (0.807) both indicate that the model explains about 81% of the variance in mpg, so the model fits the data reasonably well.
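For reference, R-squared is 1 - RSS/TSS, and adjusted R-squared is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below recomputes both by hand and compares them with the values from `summary()`. It uses a stand-in model on `mtcars`, which is an assumption and not the exam model.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model (assumption)

y <- mtcars$mpg
rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
n <- length(y)                 # number of observations
p <- length(coef(fit)) - 1     # number of predictors, excluding the intercept

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both values should agree with what summary() reports.
fit_summary <- summary(fit)
print(c(manual = r2, summary = fit_summary$r.squared))
print(c(manual = adj_r2, summary = fit_summary$adj.r.squared))
```

Adjusted R-squared penalizes additional predictors, which is why it is the fairer measure when comparing models with different numbers of terms.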
weight and
horsepower and report whether your model improved based on
adjusted R-squared. (1 point)
# Your code here
auto_num_lm2 <- lm(mpg ~ weight + horsepower + year + weight*horsepower, data = auto_num)
summary(auto_num_lm2)
##
## Call:
## lm(formula = mpg ~ weight + horsepower + year + weight * horsepower,
## data = auto_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9146 -1.8987 -0.0386 1.5536 12.6333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.577e+00 3.911e+00 0.915 0.361
## weight -1.185e-02 5.868e-04 -20.198 <2e-16 ***
## horsepower -2.236e-01 2.063e-02 -10.837 <2e-16 ***
## year 7.749e-01 4.508e-02 17.190 <2e-16 ***
## weight:horsepower 5.790e-05 5.020e-06 11.534 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.963 on 387 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8559
## F-statistic: 581.5 on 4 and 387 DF, p-value: < 2.2e-16
[Your observation here] The model improved: adjusted R-squared rose from 0.807 to 0.856, so adding the weight:horsepower interaction gives a better fit.
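The adjusted R-squared comparison can be complemented by an F-test for nested models via `anova()`. The sketch below uses `mtcars`, with `wt`, `hp`, and `qsec` standing in for `weight`, `horsepower`, and `year` (assumed substitutes for illustration only). In an R formula, `a * b` expands to `a + b + a:b`, so the main effects do not need to be listed separately.

```r
# Stand-in models on mtcars (assumption: not the exam's Auto data).
fit_main <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_int  <- lm(mpg ~ wt * hp + qsec, data = mtcars)  # wt * hp = wt + hp + wt:hp

# Compare the adjusted R-squared of the two models.
adj_main <- summary(fit_main)$adj.r.squared
adj_int  <- summary(fit_int)$adj.r.squared
print(c(main = adj_main, interaction = adj_int))

# F-test for nested models: does the interaction term add explanatory power?
comparison <- anova(fit_main, fit_int)
print(comparison)
```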
End of Exam. Please submit this RMD file along with a knitted HTML report.