You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions.”
Let’s find out!!!
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Here is the “mtcars” dataset, followed by the variables on which the cars are described and evaluated for their performance.
data(mtcars)
dim(mtcars)
## [1] 32 11
head(mtcars)
There are 32 cars, each described by 11 variables. Some are continuous measurements and some are discrete codes:
mpg - Miles/(US) gallon
cyl - Number of cylinders
disp - Displacement (cu.in.)
hp - Gross horsepower
drat - Rear axle ratio
wt - Weight (1000 lbs)
qsec - 1/4 mile time
vs - Engine (0 = V-shaped, 1 = straight)
am - Transmission (0 = automatic, 1 = manual)
gear - Number of forward gears
carb - Number of carburetors
Converting the variables with discrete values to factor variables
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
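The five conversions above can also be written in one step. What follows is only a sketch, not part of the original analysis: it works on a copy called cars (a name assumed here) so that mtcars itself stays untouched, and the labels "automatic" and "manual" are added purely for readability.

```r
# Work on a copy (hypothetical name `cars`) so the mtcars object used elsewhere is not modified.
cars <- datasets::mtcars

# Columns that hold discrete codes rather than measurements.
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)

# Optional: readable labels for transmission (0 = automatic, 1 = manual).
levels(cars$am) <- c("automatic", "manual")

# Check the result: every listed column should now be a factor.
str(cars[factor_cols])
table(cars$am)
```

Labelled levels make plot axes and regression output easier to read, but the coefficient for transmission would then be named ammanual instead of am1.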
Let us explore the statistics of our response variable mpg (miles per gallon):
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
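# draw_quantiles is an argument of geom_violin(), not ggplot(); it takes the quantiles to mark
g <- ggplot(data = mtcars, aes(x = am, y = mpg))
g + theme_bw() + geom_violin(fill = "darkkhaki", draw_quantiles = c(0.25, 0.5, 0.75)) + labs(title = "Violin Plot", x = "Transmission (0 = automatic, 1 = manual)", y = "Miles Per Gallon")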
The exploratory violin plot above compares MPG for automatic and manual transmissions. The graph leads us to believe that vehicles with a manual transmission have noticeably higher MPG than those with an automatic. The shape of each distribution (widest where the cars are most densely packed) indicates that MPG for automatic cars is concentrated around the median, while MPG for manual cars is concentrated around the first quartile.
ggplot(mtcars, aes(x = hp, y = mpg, color = vs)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The graph above shows that the relationship between horsepower and MPG is roughly linear, with one outlier. For straight engines, lower horsepower yields higher MPG, between the mean and the maximum values. For V-shaped engines, in spite of the higher horsepower, MPG falls below the mean, into the first-quartile range.
ggpairs(data = mtcars %>% select(mpg,hp,disp,vs,am))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that the correlation with mpg is stronger for engine displacement than for horsepower, though the relationship is inverse: as displacement goes up, mpg goes down. The distribution of the response variable mpg is right-skewed.
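The correlations shown in the pairs plot can also be read off numerically. This is a small sketch using base R only; the names cars, num_vars and cor_mat are introduced here for illustration and are not part of the original analysis.

```r
# Pristine copy of the data (hypothetical name `cars`); columns are numeric here.
cars <- datasets::mtcars

# The three numeric variables shown in the pairs plot.
num_vars <- cars[, c("mpg", "hp", "disp")]

# Pearson correlation matrix; the "mpg" row gives each variable's correlation with MPG.
cor_mat <- cor(num_vars)
round(cor_mat["mpg", ], 3)
# hp is about -0.776 and disp about -0.848: both negative, disp the stronger of the two
```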
Ho : The null hypothesis states that there is no difference in mean MPG between the transmission types.
Ha : The alternative hypothesis states that there is a difference in mean MPG between the transmission types.
T-Test transmission type and MPG
testResults <- t.test(mpg ~ am, data=mtcars)
testResults$p.value
## [1] 0.001373638
With a p-value as low as 0.0014, we can confidently reject the null hypothesis that the difference in mean MPG between transmission types is 0.
testResults$estimate
## mean in group 0 mean in group 1
## 17.14737 24.39231
The difference estimate between the 2 transmissions is 7.24494 MPG in favor of manual.
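To quantify the uncertainty around that difference, we can also look at the confidence interval that t.test() returns. The sketch below assumes the default Welch test used above; the names cars, tt, mpg_diff and ci_manual_minus_auto are introduced here for illustration.

```r
# Pristine copy of the data (hypothetical name `cars`).
cars <- datasets::mtcars

# Welch two-sample t-test (the default for t.test), MPG by transmission.
tt <- t.test(mpg ~ am, data = cars)

# Difference in group means, expressed as manual minus automatic.
mpg_diff <- unname(tt$estimate[2] - tt$estimate[1])
mpg_diff

# t.test() reports the interval for group 0 minus group 1 (automatic minus manual),
# so flip the sign and the order to express it as manual minus automatic.
ci_manual_minus_auto <- rev(-as.numeric(tt$conf.int))
ci_manual_minus_auto
```

If the whole interval lies above 0, that agrees with rejecting the null hypothesis at the 5% level.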
Since there are several candidate explanatory variables, we can use a multiple linear regression model and start by fitting the full model to the data. We assume that all samples are independent of each other.
model <- lm(mpg ~ ., data = mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## am1 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
Since none of the coefficients has a p-value below 0.05, the full model does not let us conclude which variables are statistically significant.
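One plausible reason for this (an assumption on our part, not something the output above proves) is that several predictors carry overlapping information, which inflates the standard errors. A quick way to check is to look at the pairwise correlations among a few of the numeric predictors. The names cars, predictors and cor_pred are introduced here for illustration.

```r
# Pristine copy of the data (hypothetical name `cars`); cyl is still numeric here.
cars <- datasets::mtcars

# A few engine- and size-related predictors.
predictors <- cars[, c("cyl", "disp", "hp", "wt")]

# Pairwise Pearson correlations; values near 1 point to overlapping information.
cor_pred <- cor(predictors)
round(cor_pred, 2)
```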
Backward elimination, which here drops variables one at a time for as long as the AIC improves, should give a more reliable model:
red_model <- step(model, direction = "backward", trace = FALSE)
summary(red_model)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## am1 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
The new model has 4 predictors (cylinders, horsepower, weight, transmission). The R-squared value of 0.8659 means that this model explains about 87% of the variance in MPG (adjusted R-squared 0.8401). Not every coefficient is statistically significant: cyl6, hp and wt have p-values below 0.05, while cyl8 (p = 0.35) and am1 (p = 0.21) do not. Each coefficient is the expected change in MPG with the other variables held constant. Relative to a 4-cylinder car, a 6-cylinder car has 3.03 lower MPG and an 8-cylinder car has 2.16 lower MPG. MPG decreases by 3.21 for every 100 additional horsepower and by 2.5 for each 1000 lbs of additional weight. A manual transmission improves MPG by 1.81 over an automatic, but this effect is not statistically significant.
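Confidence intervals make the size and uncertainty of each effect easier to judge than p-values alone. The sketch below refits the same formula that step() selected, on a fresh copy of the data. The names cars, fit and am_ci are introduced here for illustration.

```r
# Fresh copy of the data with cyl and am as factors (hypothetical name `cars`).
cars <- transform(datasets::mtcars, cyl = factor(cyl), am = factor(am))

# Same formula as the model selected by backward elimination.
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 95% confidence intervals for all coefficients.
round(confint(fit, level = 0.95), 3)

# Interval for the manual-transmission effect, other variables held constant.
am_ci <- confint(fit)["am1", ]
am_ci
```

An interval for am1 that contains 0 is the same message as its p-value of 0.21: once cylinders, horsepower and weight are in the model, the data do not pin down a transmission effect.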
Residual Plot
The plots conclude:
sum((abs(dfbetas(red_model)))>1)
## [1] 0
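For reference, the residual plots and the influence check can be reproduced along the following lines. This is a sketch rather than the original code: it refits the selected formula on a fresh copy of the data, and the names cars, fit, dfb and n_influential are introduced here. A cut-off of 1 on the absolute dfbetas is a common rule of thumb for small data sets.

```r
# Fresh copy of the data with cyl and am as factors (hypothetical name `cars`).
cars <- transform(datasets::mtcars, cyl = factor(cyl), am = factor(am))
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Four standard diagnostic plots on one page:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# dfbetas: how much each coefficient moves (in standard errors) when one car is left out.
dfb <- dfbetas(fit)

# Number of car/coefficient pairs whose influence exceeds the cut-off of 1.
n_influential <- sum(abs(dfb) > 1)
n_influential
```

A count of 0, as in the output above, means that no single car shifts any coefficient by more than one standard error.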
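In the raw data there is a clear difference in MPG by transmission type: manual cars average 7.24 MPG more than automatic cars, and the t-test rejects equal means. However, once weight, horsepower and number of cylinders are accounted for, the estimated advantage of a manual transmission shrinks to 1.81 MPG and is no longer statistically significant. Weight, horsepower and number of cylinders appear to matter more than transmission type in determining MPG.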