This is the R portion of your mid-term exam. You will analyze the
Auto dataset, which contains information about various car models
(similar to mtcars). Follow the instructions carefully and
write your R code in the provided chunks. You will be graded on the
correctness of your code, the quality of your analysis, and your
interpretation of the results.
Total points: 10 Time allowed: 45 minutes
Good luck!
Auto, and display the first few rows. (1 point)
# Your code here
library(ggplot2)
library(MASS)
Auto <- read.csv("Auto.csv")
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
# Your code here
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
nrow(Auto)
## [1] 392
The dataset has 9 variables, from mpg through name, and 392
observations.
# Your code here
Auto2 <- Auto[,-9]
cor(Auto2)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
plot() or ggplot()). Add a title and proper
axis labels. You don’t need to interpret the result here but you should
know how. (1 point)
# Your code here
ggplot(Auto, aes(x = mpg, y = weight)) +
  geom_point() +
  labs(title = "Miles per gallon by Weight", x = "Miles per gallon", y = "Weight")
boxplot() or ggplot()). You don’t need to
interpret the result here but you should know how. (1 point)
# Your code here
# origin is on the x-axis and mpg on the y-axis, so the labels must match.
ggplot(Auto, aes(x = factor(origin), y = mpg)) +
  geom_boxplot(aes(fill = factor(origin))) +
  ggtitle("Miles per gallon based on origin") +
  xlab("Origin") +
  ylab("Miles per Gallon")
# Your code here
Auto_lm <- lm(mpg ~ weight + horsepower + year , data = Auto)
summary(Auto_lm)
##
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7911 -2.3220 -0.1753 2.0595 14.3527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.372e+01 4.182e+00 -3.281 0.00113 **
## weight -6.448e-03 4.089e-04 -15.768 < 2e-16 ***
## horsepower -5.000e-03 9.439e-03 -0.530 0.59663
## year 7.487e-01 5.212e-02 14.365 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.8068
## F-statistic: 545.4 on 3 and 388 DF, p-value: < 2.2e-16
weight. What
do they tell us about the relationship between the predictors and ‘mpg’?
(1 point)
The intercept, -1.372e+01, is the predicted mpg when weight,
horsepower, and year are all 0. No car has those values, so the
intercept has no practical meaning on its own. For weight, each
one-unit increase is associated with a 6.448e-03 decrease in predicted
mpg, holding horsepower and year constant. The weight coefficient is
statistically significant (p < 2e-16).
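The "holding all other variables constant" reading of a slope can be checked numerically. The sketch below is an illustration only: it uses the built-in mtcars data as a stand-in for Auto, and the names fit, two_cars, and pred are hypothetical. It predicts mpg for two cars that differ by one unit of weight and compares the difference with the fitted coefficient.

```r
# Stand-in data: the built-in mtcars set, not the exam's Auto data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ only in weight (one unit apart).
two_cars <- data.frame(wt = c(3, 4), hp = c(110, 110))
pred <- predict(fit, newdata = two_cars)

# The change in predicted mpg equals the wt coefficient.
slope_from_predictions <- unname(pred[2] - pred[1])
slope_from_model <- unname(coef(fit)["wt"])
print(c(slope_from_predictions, slope_from_model))
```

The two numbers agree because a linear model with no interaction term changes its prediction by exactly the slope for each one-unit change in that predictor.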
# Your code here
par(mfrow = c(2, 2))
plot(Auto_lm)
In the Normal Q-Q plot of the residuals from Auto_lm, most points follow the dotted reference line. They drift away from it at both tails but stay close through the middle, so the normality assumption appears reasonable.
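A visual reading of the Q-Q plot can be supplemented with a numeric check. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for Auto; the names fit, res, sw, and std_res are hypothetical. It runs a Shapiro-Wilk test on the residuals and lists any observations with large standardized residuals.

```r
# Stand-in data: the built-in mtcars set, not the exam's Auto data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
print(sw)

# Standardized residuals far from 0 point to the observations that
# pull the tails of the Q-Q plot away from the reference line.
std_res <- rstandard(fit)
print(which(abs(std_res) > 3))
```

A formal test like this is sensitive to sample size, so it is best treated as a complement to the plot rather than a replacement for it.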
# Your code here
summary(Auto_lm)
##
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7911 -2.3220 -0.1753 2.0595 14.3527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.372e+01 4.182e+00 -3.281 0.00113 **
## weight -6.448e-03 4.089e-04 -15.768 < 2e-16 ***
## horsepower -5.000e-03 9.439e-03 -0.530 0.59663
## year 7.487e-01 5.212e-02 14.365 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.8068
## F-statistic: 545.4 on 3 and 388 DF, p-value: < 2.2e-16
The R^2 and adjusted R^2 are 0.8083 and 0.8068. R^2 is the share of
the variance in mpg that the model explains, so with only 3 predictors
the model accounts for about 81% of the variance in mpg, which
indicates a good fit to the observed data.
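To make the two statistics concrete, they can be recomputed from the residuals. This is a sketch under stated assumptions: it uses the built-in mtcars data as a stand-in for Auto, and the names fit, rss, tss, r2, and adj_r2 are hypothetical.

```r
# Stand-in data: the built-in mtcars set, not the exam's Auto data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
n <- nrow(mtcars)
p <- length(coef(fit)) - 1     # number of predictors, excluding the intercept

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(r2 = r2, adj_r2 = adj_r2))
```

Adjusted R^2 penalizes each extra predictor, which is why it is the fairer number to use when comparing models of different sizes.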
weight and
horsepower and report whether your model improved based on
adjusted R-squared. (1 point)
# Your code here
Auto_lm_int <- lm(mpg ~ weight + horsepower + year + weight*horsepower , data = Auto)
summary(Auto_lm_int)
##
## Call:
## lm(formula = mpg ~ weight + horsepower + year + weight * horsepower,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9146 -1.8987 -0.0386 1.5536 12.6333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.577e+00 3.911e+00 0.915 0.361
## weight -1.185e-02 5.868e-04 -20.198 <2e-16 ***
## horsepower -2.236e-01 2.063e-02 -10.837 <2e-16 ***
## year 7.749e-01 4.508e-02 17.190 <2e-16 ***
## weight:horsepower 5.790e-05 5.020e-06 11.534 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.963 on 387 degrees of freedom
## Multiple R-squared: 0.8574, Adjusted R-squared: 0.8559
## F-statistic: 581.5 on 4 and 387 DF, p-value: < 2.2e-16
The model has improved. The adjusted R^2 rose from 0.8068 to 0.8559, so the model explains more of the variance in mpg. The interaction term is significant, and with it in the model horsepower is now a significant predictor as well. The model is slightly more complex, but it fits noticeably better.
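A comparison like this can also be backed by a nested-model F-test. The sketch below assumes the built-in mtcars data as a stand-in for Auto, with wt, hp, and qsec playing the roles of weight, horsepower, and year; the names base_fit, int_fit, and adj are hypothetical.

```r
# Stand-in data: the built-in mtcars set, not the exam's Auto data.
base_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# wt * hp expands to wt + hp + wt:hp, so the main effects are included.
int_fit <- lm(mpg ~ wt * hp + qsec, data = mtcars)
print(names(coef(int_fit)))

# Adjusted R-squared for both models, side by side.
adj <- c(base = summary(base_fit)$adj.r.squared,
         interaction = summary(int_fit)$adj.r.squared)
print(adj)

# F-test of whether the interaction term improves the fit.
print(anova(base_fit, int_fit))
```

Because `*` already expands to the main effects plus the interaction, a formula such as `mpg ~ weight + horsepower + year + weight*horsepower` fits the same model as `mpg ~ weight*horsepower + year`.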
End of Exam. Please submit this RMD file along with a knitted HTML report.