Motor Trend, an automobile industry magazine, is interested in exploring the relationship between a set of car variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
The “mtcars” dataset, which ships with R, is used for this analysis. It contains 32 observations of 11 variables.
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Some of the variables include:
A. Check which transmission is better.
First, convert the am variable to a factor with descriptive labels (0 = Automatic, 1 = Manual):
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <-c("Automatic", "Manual")
The boxplot in Appendix 2 and the group means reported in this section suggest
that manual cars have higher mpg on average than automatic cars
(if other variables are ignored).
We therefore test the following hypotheses:
i. Null hypothesis: there is no difference in mean mpg between manual and
automatic transmissions.
ii. Alternative hypothesis: mean mpg differs between manual and automatic
transmissions. The test is two-sided; the direction of the difference is read
from the group means.
ttest<- t.test(mpg~am, data=mtcars)
ttest
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group Automatic and group Manual is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual
##                17.14737                24.39231
pvalue<- ttest$p.value
# Fit a single-predictor model to get its R-squared value
fit2 <-lm(mpg~am, data=mtcars)
r_squared<- summary(fit2)$r.squared
As the p-value of 0.0013736 is less than 0.05, we reject the null hypothesis that there is no difference in mean mpg between cars with manual and automatic transmissions. Manual cars average 24.39 mpg against 17.15 mpg for automatic cars, a difference of about 7.24 mpg (95% confidence interval: 3.21 to 11.28), so manual transmission appears better for MPG when other variables are not considered. The caveat is the low R-squared value of 0.3597989: transmission type alone explains only about 36% of the variance in mpg.
Appendix 1 shows that several other variables correlate more strongly with mpg, whether positively or negatively. These variables may have contributed to the significant difference found by the t-test above, which is explored further in the second section.
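The size and direction of the difference can be made explicit. The following is a minimal sketch, not part of the original analysis: it works on a fresh copy of the data (the names `cars`, `two_sided`, `one_sided`, `diff_means`, and `ci_manual_minus_auto` are illustrative), reports the Manual minus Automatic difference with its confidence interval, and adds a one-sided Welch test as one possible way to test the directional claim.

```r
# Work on a fresh copy so the objects used elsewhere are not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Two-sided Welch test, as in the analysis above.
two_sided <- t.test(mpg ~ am, data = cars)

# t.test() reports Automatic minus Manual (first level minus second).
# Flip the sign to express the difference as Manual minus Automatic.
diff_means <- unname(two_sided$estimate[2] - two_sided$estimate[1])
ci_manual_minus_auto <- -rev(as.numeric(two_sided$conf.int))

# One-sided alternative: mean(Automatic) - mean(Manual) < 0,
# i.e. manual cars have the higher mean mpg.
one_sided <- t.test(mpg ~ am, data = cars, alternative = "less")

print(diff_means)
print(ci_manual_minus_auto)
print(one_sided$p.value)
```

Because the observed difference lies in the hypothesised direction, the one-sided p-value is half the two-sided one, so the conclusion does not change.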
library(dplyr)
mpg_ave <- mtcars %>%
  group_by(am) %>%
  summarise(average_mpg = mean(mpg, na.rm = TRUE),
            sd_mpg = sd(mpg, na.rm = TRUE))
mpg_ave
## # A tibble: 2 x 3
## am average_mpg sd_mpg
## <fct> <dbl> <dbl>
## 1 Automatic 17.1 3.83
## 2 Manual 24.4 6.17
Next, we check whether other variables contribute to the statistically significant difference in mpg between manual and automatic cars. The remaining categorical variables are first converted to factors:
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
Perform multiple linear regression: first fit the full model with all predictors, then use stepwise selection (step(), which compares models by AIC) to choose a reduced model.
fit <- lm(mpg~., data = mtcars)
best_fit<- step(fit, direction = "both", trace = FALSE)
summary(best_fit)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *
## cyl8        -2.16368    2.28425  -0.947  0.35225
## hp          -0.03211    0.01369  -2.345  0.02693 *
## wt          -2.49683    0.88559  -2.819  0.00908 **
## amManual     1.80921    1.39630   1.296  0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
best_rsquared <- summary(best_fit)$r.squared
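To see what the stepwise search does, the full and reduced models can be compared directly. This is a minimal sketch under the same factor conversions as above, run on a fresh copy of the data; the names `cars`, `full_model`, `reduced_model`, and `kept_terms` are illustrative and not from the original analysis.

```r
# Rebuild the modelling data on a fresh copy of mtcars.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
for (v in c("cyl", "vs", "gear", "carb")) {
  cars[[v]] <- factor(cars[[v]])
}

# Full model with every predictor, then AIC-based stepwise selection.
full_model <- lm(mpg ~ ., data = cars)
reduced_model <- step(full_model, direction = "both", trace = FALSE)

# Terms retained by the search.
kept_terms <- attr(terms(reduced_model), "term.labels")
print(kept_terms)

# A lower AIC means a better trade-off between fit and complexity.
print(AIC(full_model, reduced_model))

# The reduced model gives up a little R-squared for far fewer parameters.
print(c(full = summary(full_model)$r.squared,
        reduced = summary(reduced_model)$r.squared))
```

step() keeps a term only while removing it would raise the AIC, which is why am stays in the model even though its coefficient is not individually significant.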
Based on the stepwise algorithm, the selected prediction model for mpg contains the following variables:
+. cyl: number of cylinders
+. hp: gross horsepower
+. wt: weight (in units of 1000 lbs)
+. am: transmission (automatic or manual).
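Without considering other variables, manual transmission gives higher mpg than automatic transmission, and the difference is statistically significant. Despite the significant p-value, it would be a stretch to conclude that the transmission alone is responsible for the higher mpg. The multiple linear regression, which explains about 87% of the variance in mpg, supports this caution: lower gross horsepower and lower weight are associated with higher mpg, and 6- and 8-cylinder cars have lower mpg than 4-cylinder cars (although the 8-cylinder coefficient is not significant). After adjusting for these variables, manual transmission is associated with an estimated increase of 1.81 mpg over automatic, but this effect is not statistically significant (p = 0.206).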
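The change in the transmission effect once the other predictors are included can be checked directly. The sketch below is illustrative rather than part of the original analysis: it refits the two models on a fresh copy of the data (the names `cars`, `unadjusted`, `adjusted`, and `am_ci` are hypothetical) and compares the amManual coefficient with and without adjustment.

```r
# Fresh copy with the same factor coding used in the analysis.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

# Transmission only, versus the model chosen by the stepwise search.
unadjusted <- lm(mpg ~ am, data = cars)
adjusted <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Estimated manual-vs-automatic difference in each model.
print(coef(unadjusted)["amManual"])
print(coef(adjusted)["amManual"])

# 95% confidence interval for the adjusted effect.
am_ci <- confint(adjusted, "amManual")
print(am_ci)

# F-test: do cyl, hp and wt add explanatory power beyond am alone?
print(anova(unadjusted, adjusted))
```

If the confidence interval for the adjusted coefficient spans zero, the data are consistent with no transmission effect once cylinders, horsepower, and weight are accounted for.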
Appendix 1: Relationship between MPG and the other variables
library(corrplot)
data("mtcars")
mpgcor<- cor(mtcars$mpg,mtcars[,-1])
corrplot(mpgcor, method = "number")
Based on the correlation plot in Appendix 1, the variables cyl, disp, wt, and hp have the strongest correlations with mpg, all of them negative.
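The ranking can also be read off numerically rather than from the plot. A minimal sketch, using a fresh numeric copy of the data; the names `cars`, `mpg_cor`, and `ranked` are illustrative.

```r
# Use the original all-numeric dataset.
cars <- datasets::mtcars

# Correlation of mpg with every other variable, as a named vector.
mpg_cor <- cor(cars$mpg, cars[, -1])[1, ]

# Order by strength of association, ignoring sign.
ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps strong negative correlations at the top, while the printed values retain their signs.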
Appendix 2: Distribution of mpg by transmission type
boxplot(mpg~am, data = mtcars,
xlab = "Transmission Type (am)",
ylab = "Miles/US gallon (mpg)",
main = "MPG by Transmission Type")
Boxplot of the distribution of mpg across all observed cars, grouped by transmission type.
Appendix 3: Residual check and plots
plot(best_fit)