library(datasets)
library(tidyverse)
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) as the outcome. They are particularly interested in two questions: whether an automatic or a manual transmission is better for MPG, and how large the MPG difference between the two transmission types is.
Regression analysis using transmission type, weight, and 1/4 mile time as explanatory variables leads to the conclusion that manual cars get on average 2.9 more mpg than automatic cars when weight and 1/4 mile time are held constant.
The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973/74 models).
The data set consists of 32 observations on 11 variables.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
To make the data more readable, the variable am is recoded with 0 as auto and 1 as manual. To use it as a categorical predictor in regression models, am is also converted to a factor. The variables vs, gear, and carb are converted to factors as well.
# Recoding am (0 = auto, 1 = manual) and converting it to a factor in one step
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("auto", "manual"))
# Factoring the remaining categorical variables
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
dim(mtcars) ## 32 observations and 11 variables
## [1] 32 11
head(mtcars) ## some observations to better understand mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 manual 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 manual 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 manual 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 auto 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 auto 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 auto 3 1
str(mtcars) ## variable types after coercion
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "auto","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
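As a quick sanity check on the recoding, the number of cars in each transmission group can be tabulated. The sketch below works on a fresh copy of the data so that it does not depend on the session state; the names `cars` and `am_counts` are introduced here for illustration only.

```r
# Fresh copy of the built-in data set (hypothetical name `cars`).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Count the cars in each transmission group.
am_counts <- table(cars$am)
print(am_counts)
```

The two groups are not the same size, which is worth keeping in mind when comparing their means.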
To get a first look at a possible difference between manual and automatic cars, we start with a boxplot of mpg by transmission type.
boxplot1 <- mtcars %>%
  ggplot(aes(x = am,
             y = mpg,
             fill = am)) +
  geom_boxplot() +
  labs(title = "Relation between manual and automatic cars",
       x = "Type of car")
boxplot1
According to the graph, manual cars appear to get better mileage on average than automatic cars. We can verify this using statistical inference.
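The numbers behind the boxplot can also be tabulated directly. The following is a minimal sketch on a fresh copy of the data; `cars`, `mpg_by_am`, and `group_means` are illustrative names that do not appear in the analysis itself.

```r
# Fresh copy of the data with am labelled as in the analysis.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Minimum, quartiles, median, mean, and maximum of mpg per transmission type.
mpg_by_am <- tapply(cars$mpg, cars$am, summary)
print(mpg_by_am)

# Group means only, for later comparison with the t-test output.
group_means <- tapply(cars$mpg, cars$am, mean)
print(group_means)
```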
We will use the t.test function to test our hypothesis that manual cars get better gas mileage than automatic cars.
t.test(mpg ~ am, mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group auto and group manual is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group auto mean in group manual
## 17.14737 24.39231
The 95% confidence interval shown in the output of t.test does not contain 0, so we can conclude that the difference in mean mpg between manual and automatic transmissions is statistically significant.
The p-value for this comparison is 0.001374, which is smaller than 0.05.
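To make the t.test output less of a black box, the Welch statistic, its degrees of freedom, and the two-sided p-value can be reproduced by hand. This is a sketch using the standard Welch formulas on a fresh copy of the data; all variable names below are introduced for illustration.

```r
# Split mpg by transmission type (am is 0 = automatic, 1 = manual in the raw data).
cars <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]
manual <- cars$mpg[cars$am == 1]

# Squared standard error of each group mean.
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic (auto minus manual, matching the group order in t.test).
t_stat <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value.
p_value <- 2 * pt(-abs(t_stat), df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

The three values should agree with the t, df, and p-value lines of the t.test output above.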
However, other variables, such as cyl, disp, hp, and wt, may also play a role in determining MPG. For example, it is reasonable to expect that a heavier car consumes more fuel.
Refer to the appendix.
The graph shows that MPG is correlated with variables other than am. To obtain a more accurate model, we need to predict MPG using other variables in addition to am.
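The visual impression from the graph can be backed up with correlation coefficients. The sketch below computes the Pearson correlation between mpg and a handful of candidate predictors on the raw data, where am is still coded 0/1; the selection of predictors and the names `predictors` and `mpg_cor` are choices made here for illustration.

```r
cars <- datasets::mtcars

# Candidate predictors; am is numeric 0/1 here, so its coefficient is a
# point-biserial correlation.
predictors <- c("cyl", "disp", "hp", "wt", "am")

# Correlation of mpg with each candidate predictor.
mpg_cor <- sapply(cars[predictors], function(x) cor(cars$mpg, x))
print(round(mpg_cor, 3))
```

A predictor whose correlation with mpg is larger in magnitude than that of am is a natural candidate for inclusion in the model.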
Let's use some models to evaluate these relationships.
Let's start by regressing mpg on just am.
am_model <- lm(mpg ~ am, mtcars)
summary(am_model)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The R-squared value for this model is only 0.3598, which means that regressing mpg on am alone explains only about 36% of the variance in mpg. The p-value of the model is nonetheless low (0.000285).
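Two features of this simple model can be checked by hand: R-squared is one minus the ratio of the residual to the total sum of squares, and with a single two-level factor the slope equals the difference between the group means. The sketch below refits the model on a fresh copy of the data; `fit_am` and the other names are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
fit_am <- lm(mpg ~ am, data = cars)

# R-squared = 1 - RSS / TSS.
rss <- sum(residuals(fit_am)^2)
tss <- sum((cars$mpg - mean(cars$mpg))^2)
r_squared <- 1 - rss / tss

# The intercept is the mean of the reference group (auto); the slope is the
# difference between the manual and auto group means.
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["manual"] - group_means["auto"])
slope <- unname(coef(fit_am)["ammanual"])

print(c(r_squared = r_squared, slope = slope, mean_diff = mean_diff))
```

This also shows why the coefficient of 7.245 is an unadjusted difference: it is exactly the gap between the two sample means from the t-test.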
Before drawing any conclusions about the effect of transmission type on fuel efficiency, we need to account for the other variables in the dataset.
Building a model that regresses mpg on all other variables in the dataset will explain more of the variance.
full_model <- lm(mpg ~ ., mtcars)
summary(full_model)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6533 -1.3325 -0.5166 0.7643 4.7284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.31994 23.88164 1.060 0.3048
## cyl -1.02343 1.48131 -0.691 0.4995
## disp 0.04377 0.03058 1.431 0.1716
## hp -0.04881 0.03189 -1.531 0.1454
## drat 1.82084 2.38101 0.765 0.4556
## wt -4.63540 2.52737 -1.834 0.0853 .
## qsec 0.26967 0.92631 0.291 0.7747
## vs1 1.04908 2.70495 0.388 0.7032
## ammanual 0.96265 3.19138 0.302 0.7668
## gear4 1.75360 3.72534 0.471 0.6442
## gear5 1.87899 3.65935 0.513 0.6146
## carb2 -0.93427 2.30934 -0.405 0.6912
## carb3 3.42169 4.25513 0.804 0.4331
## carb4 -0.99364 3.84683 -0.258 0.7995
## carb6 1.94389 5.76983 0.337 0.7406
## carb8 4.36998 7.75434 0.564 0.5809
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.823 on 16 degrees of freedom
## Multiple R-squared: 0.8867, Adjusted R-squared: 0.7806
## F-statistic: 8.352 on 15 and 16 DF, p-value: 6.044e-05
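As expected, the full model has a higher R-squared value (0.8867). However, the output of summary shows that none of the coefficients is significant at the 0.05 level, and the adjusted R-squared (0.7806) is noticeably lower than the R-squared, which suggests that the model contains unnecessary regressors.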
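Rather than reading the p-values off the printed table, they can be extracted from the summary object and checked programmatically. This is a minimal sketch on a fresh copy of the data with the same factor coding as in the analysis; `fit_full`, `coef_table`, and `p_values` are names introduced here.

```r
cars <- datasets::mtcars
cars$am   <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
cars$vs   <- factor(cars$vs)
cars$gear <- factor(cars$gear)
cars$carb <- factor(cars$carb)

fit_full <- lm(mpg ~ ., data = cars)

# Coefficient table as a matrix; drop the intercept row.
coef_table <- coef(summary(fit_full))
p_values <- coef_table[-1, "Pr(>|t|)"]

# Term with the smallest p-value, and that p-value.
print(names(p_values)[which.min(p_values)])
print(min(p_values))

# TRUE if at least one term is significant at the 0.05 level.
print(any(p_values < 0.05))
```

A high R-squared combined with no individually significant terms is often a sign of strongly correlated predictors, although this sketch does not test for that directly.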
Excluding variables that are correlated with both transmission type and mpg will bias the estimated coefficient of am. However, including unnecessary regressors will inflate the variance of the coefficient estimates. We will use the step function in R, which adds and removes terms to minimize AIC, to determine which variables to include in our final model.
step_model <- step(full_model,
                   direction = "both",
                   trace = FALSE)
summary(step_model)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ammanual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
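The residual standard error of this model is 2.459 on 28 degrees of freedom.
The adjusted R-squared has increased to 0.8336, and the coefficients of wt, qsec, and am are all significant at the 0.05 level. Holding weight and 1/4 mile time constant, manual cars get on average about 2.94 more mpg than automatic cars.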
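The three models can be compared side by side. The sketch below refits them on a fresh copy of the data, using the formula reported by step for the selected model, and compares them by AIC and with a nested-model F test; the object names are illustrative, and the F test is an additional check rather than part of the selection procedure above.

```r
cars <- datasets::mtcars
cars$am   <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
cars$vs   <- factor(cars$vs)
cars$gear <- factor(cars$gear)
cars$carb <- factor(cars$carb)

fit_am   <- lm(mpg ~ am, data = cars)
fit_step <- lm(mpg ~ wt + qsec + am, data = cars)  # formula reported by step()
fit_full <- lm(mpg ~ ., data = cars)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(fit_am, fit_step, fit_full)
print(aic_table)

# F test for the nested pair: do wt and qsec add explanatory power beyond am?
nested_test <- anova(fit_am, fit_step)
print(nested_test)
```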
confint(step_model)['ammanual', ]
## 2.5 % 97.5 %
## 0.04573031 5.82594408
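The interval excludes 0, although its lower bound is close to it, so the adjusted advantage of manual transmissions is significant at the 5% level but estimated with considerable uncertainty. As a sketch of where these bounds come from, the interval can be rebuilt from the estimate, its standard error, and the t quantile for the residual degrees of freedom; the names below are introduced for illustration.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
fit_step <- lm(mpg ~ wt + qsec + am, data = cars)

# Estimate and standard error of the manual-transmission coefficient.
coef_table <- coef(summary(fit_step))
est <- coef_table["ammanual", "Estimate"]
se  <- coef_table["ammanual", "Std. Error"]

# Critical value of the t distribution with the residual degrees of freedom.
t_crit <- qt(0.975, df = df.residual(fit_step))

# 95% confidence interval: estimate plus or minus t_crit standard errors.
ci_manual <- c(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci_manual)
```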
Diagnostic plots using base graphics show that the residuals are uncorrelated with the fitted values. The quantile-quantile plot indicates that the distribution of the residuals is roughly normal.
par(mfrow = c(2,2))
plot(step_model)
According to the residual plots, the following underlying assumptions appear to hold:
1. The Residuals vs. Fitted plot shows no consistent pattern, supporting the independence assumption.
2. The Normal Q-Q plot indicates that the residuals are approximately normally distributed, because the points lie close to the line.
3. The Scale-Location plot supports the constant variance assumption, as the points are randomly distributed.
4. The Residuals vs. Leverage plot suggests that no influential outliers are present, as all values fall well within the 0.5 Cook's distance bands.
1. Pair Graph of Motor Trend Car Road Test
pairs(mtcars,
      panel = panel.smooth,
      main = "Pair Graph of Motor Trend Car Road Test")
2. Scatter Plot of MPG vs. Weight by Transmission
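# height and width are not point aesthetics, so they are dropped from aes();
# color = am is enough to distinguish the two transmission types
ggplot(mtcars,
       aes(x = wt,
           y = mpg,
           color = am)) +
  geom_point() +
  scale_colour_discrete(labels = c("Automatic",
                                   "Manual")) +
  xlab("Weight (1000 lbs)") +
  ggtitle("Scatter Plot of MPG vs. Weight by Transmission")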