This document analyzes data on various cars, with a special interest in the relationship between MPG (miles per gallon) and the type of transmission (manual or automatic). Two questions are addressed:
1. Is an automatic or manual transmission better for MPG?
2. What is the MPG difference between automatic and manual transmissions?
The steps taken are:
1. Process the raw data.
2. Explore the data using plots to visualize relationships.
3. Select and examine models to see which one fits the data best and helps answer the questions.
4. Draw conclusions to answer the questions.
# Loading the required libraries from the start
library(datasets)
library(ggplot2)
library(GGally)
library(knitr)
For this exercise the mtcars dataset is used, a well-known dataset that ships with base R. The data is loaded, and the non-continuous variables are then converted to factors. The correlations between mpg and every variable are computed before the conversion, while all columns are still numeric.
# Extracting the data
data(mtcars)
# Getting all the correlations
corr_list <- cor(mtcars$mpg, mtcars)
# Converting the non-continuous variables to factor
# Converting am
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
# Converting vs
mtcars$vs <- as.factor(mtcars$vs)
levels(mtcars$vs) <- c("V-shaped", "Straight")
# Converting cyl
mtcars$cyl <- as.factor(mtcars$cyl)
# Converting gear
mtcars$gear <- as.factor(mtcars$gear)
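The order of the steps above matters because `cor()` only accepts numeric input. The following is a minimal sketch of that point, not part of the original analysis; the names `cars_raw`, `cars_fct`, `corr_numeric` and `res` are hypothetical and are chosen so that the `mtcars` object used in the rest of the document is not overwritten.

```r
# Work on copies so the document's own mtcars object is left untouched
cars_raw <- datasets::mtcars

# Correlation of mpg with every column while all columns are numeric
corr_numeric <- cor(cars_raw$mpg, cars_raw)

# Hypothetical copy with am converted to a factor, as in the text
cars_fct <- cars_raw
cars_fct$am <- as.factor(cars_fct$am)

# cor() needs numeric input, so this call is expected to raise an error;
# tryCatch() captures the condition instead of stopping the script
res <- tryCatch(cor(cars_fct$mpg, cars_fct), error = function(e) e)
print(inherits(res, "error"))
```

An alternative to reordering the steps would be to keep a numeric copy of the data for correlations and a factor copy for plotting and modelling.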
Showing the dimensions and types of data in the dataset.
# Showing info of the data
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "V-shaped","Straight": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Visualizing the first 5 rows of the dataset.
# Visualizing the first rows of data
kable(head(mtcars, 5))
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | V-shaped | Manual | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | V-shaped | Manual | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | Straight | Manual | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | Straight | Automatic | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | V-shaped | Automatic | 3 | 2 |
Plotting the relationship between the main variables for this exercise.
# Plotting the relationship between the variables of interest
p1 <- ggplot(mtcars, aes(x=am, y=mpg))
p1 <- p1 + geom_boxplot(aes(fill = am))
p1
Boxplot of the two main variables
The plot suggests that a relationship between transmission type and MPG exists: cars with manual transmission appear to have a higher MPG than their automatic counterparts. However, other variables may be related to both, so it is good practice to look at the correlations between the variables before fitting a model.
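The visual impression from the boxplot can be put into numbers. The sketch below is an addition to the original analysis: it computes the mean MPG per transmission type and runs a Welch two-sample t-test. The names `cars`, `group_means`, `mean_diff` and `welch` are hypothetical, and the test assumes roughly normal MPG values within each group.

```r
# Work on a copy so the document's own mtcars object is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Mean MPG for each transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

# Raw difference in means (Manual minus Automatic)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# Welch t-test (unequal variances), the default of t.test()
welch <- t.test(mpg ~ am, data = cars)

print(group_means)
print(mean_diff)
print(welch$p.value)
```

This raw difference ignores every other variable, so it only describes the two groups and does not isolate the effect of the transmission itself.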
# Ordering the correlations by absolute value and keeping those at least as strong as am
corr_list <- corr_list[,order(-abs(corr_list[1,]))]
corr_list
## mpg wt cyl disp hp drat vs
## 1.0000000 -0.8676594 -0.8521620 -0.8475514 -0.7761684 0.6811719 0.6640389
## am carb gear qsec
## 0.5998324 -0.5509251 0.4802848 0.4186840
ind <- which(names(corr_list) == "am")
corr_var <- names(corr_list)[1:ind]
corr_var
## [1] "mpg" "wt" "cyl" "disp" "hp" "drat" "vs" "am"
All the variables whose correlation with mpg is at least as strong as that of am are extracted, and their pairwise relationships are then plotted using ggpairs.
# Figure 2
p2 <- ggpairs(data=mtcars[, corr_var], mapping = ggplot2::aes(color = am))
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Pair plot of all the highly correlated quantities
The first model uses mpg as the outcome and am as the only predictor.
fit1 <- lm(mpg ~ am, mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The coefficient for am has a very small p-value (0.000285), which indicates that the difference in mean MPG between the two transmission types is statistically significant. However, the model has a small R-squared (0.3598), so transmission type alone explains only about 36% of the variance in MPG.
Next, a model including all the variables is fitted.
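The quantities discussed for this model can also be read from the fitted object instead of the printed summary. This is a small sketch under the same factor coding as in the text; `cars`, `fit_am`, `r2`, `am_p` and `slope` are hypothetical names.

```r
# Work on a copy so the document's own mtcars object is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Same model as fit1 in the text: mpg explained by transmission only
fit_am <- lm(mpg ~ am, data = cars)
s <- summary(fit_am)

r2    <- s$r.squared                        # share of variance explained
am_p  <- coef(s)["amManual", "Pr(>|t|)"]    # p-value of the am coefficient
slope <- unname(coef(fit_am)["amManual"])   # estimated Manual - Automatic gap

# With a single two-level factor, the slope is the difference in group means
gap <- diff(tapply(cars$mpg, cars$am, mean))

print(c(r2 = r2, p = am_p, slope = slope, gap = unname(gap)))
```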
fit_all <- lm(mpg ~ ., data=mtcars)
summary(fit_all)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2015 -1.2319 0.1033 1.1953 4.3085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.09262 17.13627 0.881 0.3895
## cyl6 -1.19940 2.38736 -0.502 0.6212
## cyl8 3.05492 4.82987 0.633 0.5346
## disp 0.01257 0.01774 0.708 0.4873
## hp -0.05712 0.03175 -1.799 0.0879 .
## drat 0.73577 1.98461 0.371 0.7149
## wt -3.54512 1.90895 -1.857 0.0789 .
## qsec 0.76801 0.75222 1.021 0.3201
## vsStraight 2.48849 2.54015 0.980 0.3396
## amManual 3.34736 2.28948 1.462 0.1601
## gear4 -0.99922 2.94658 -0.339 0.7382
## gear5 1.06455 3.02730 0.352 0.7290
## carb 0.78703 1.03599 0.760 0.4568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.616 on 19 degrees of freedom
## Multiple R-squared: 0.8845, Adjusted R-squared: 0.8116
## F-statistic: 12.13 on 12 and 19 DF, p-value: 1.764e-06
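This model has a high R-squared (0.8845) and explains much of the variance, and its overall F-test is significant (p-value 1.764e-06). However, the p-values of the individual coefficients are all above 0.05, so no single predictor is significant once all the others are included.
To find the model that best fits the data, the step function is used, starting from the fit_all model. By default, step adds and removes predictors and compares the resulting models by AIC.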
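One plausible explanation for a model that fits well overall while none of its coefficients stands out is that the predictors are strongly correlated with each other. The sketch below is an illustration added here, not a step of the original analysis. It uses the raw numeric coding of mtcars (so cyl is treated as a number, unlike in fit_all) and computes a variance inflation factor for wt by hand. The names `cars_raw`, `pred_cor` and `vif_wt` are hypothetical.

```r
# Raw numeric copy of the data (no factor conversion)
cars_raw <- datasets::mtcars

# Pairwise correlations among some of the predictors
pred_cor <- cor(cars_raw[, c("wt", "disp", "hp", "cyl")])
print(round(pred_cor, 2))

# Variance inflation factor for wt: regress wt on all other predictors
# (everything except the outcome mpg) and use 1 / (1 - R^2)
aux <- lm(wt ~ . - mpg, data = cars_raw)
vif_wt <- 1 / (1 - summary(aux)$r.squared)
print(vif_wt)
```

A large inflation factor means the standard error of that coefficient is much wider than it would be with uncorrelated predictors, which is consistent with the high individual p-values.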
fit_best <- step(fit_all, direction="both",trace=FALSE)
summary(fit_best)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## amManual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
This model has an R-squared of 0.8497 (adjusted 0.8336), so it explains a large share of the variance, and the p-values of all three predictors are below 0.05. The variables retained are wt, qsec and am.
To check whether this model is good enough, the residuals are plotted.
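The three models can also be compared directly. The sketch below is an addition, not part of the original analysis: it runs a nested-model F-test between the am-only model and the selected model and lists the AIC of all three. The names `cars`, `fit_am`, `fit_sel`, `fit_full`, `cmp`, `p_nested` and `aic_tab` are hypothetical, and the selected formula is taken from the step output shown earlier.

```r
# Copy of the data with the same factor coding as in the text
cars <- datasets::mtcars
cars$am   <- factor(cars$am, levels = c(0, 1),
                    labels = c("Automatic", "Manual"))
cars$vs   <- factor(cars$vs, levels = c(0, 1),
                    labels = c("V-shaped", "Straight"))
cars$cyl  <- factor(cars$cyl)
cars$gear <- factor(cars$gear)

fit_am   <- lm(mpg ~ am, data = cars)              # transmission only
fit_sel  <- lm(mpg ~ wt + qsec + am, data = cars)  # formula chosen by step
fit_full <- lm(mpg ~ ., data = cars)               # all predictors

# F-test: do wt and qsec improve on the am-only model?
cmp <- anova(fit_am, fit_sel)
p_nested <- cmp[["Pr(>F)"]][2]
print(cmp)

# AIC for the three models (lower is better)
aic_tab <- AIC(fit_am, fit_sel, fit_full)
print(aic_tab)
```

The values from `AIC()` differ from those used internally by `step()` by a constant, so the ranking of models fitted to the same data is unaffected.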
# Plotting residuals
par(mfrow=c(2,2))
# plot() on an lm object draws the diagnostic plots and returns nothing useful to store
plot(fit_best)
The answer to the first question is that manual transmission has a higher MPG than automatic. This can be seen in all three fitted models as a positive coefficient for amManual.
The second question is best answered with the best-fitting model (fit_best), with the formula \(mpg \sim wt + qsec + am\). This model indicates that cars with manual transmission get on average 2.936 more miles per gallon than automatic cars with the same weight (wt) and quarter-mile time (qsec). The p-value of this coefficient (0.0467) is below our threshold of 0.05, so the difference is statistically significant, although only marginally.
In the residuals vs. fitted values plot there is no obvious pattern: the residuals appear to be randomly scattered. However, the residual standard error is still sizeable (2.459 MPG).
Even though the model is a good fit to the data, and the relationship between transmission and MPG seems plausible for the year the data was collected, I cannot draw any strong conclusion because of the low number of observations and how little the data for the two transmission groups overlaps (the overlap in the fitted variables is almost non-existent).
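A confidence interval gives a fuller picture of this estimate than the point value alone. The following is a minimal sketch, refitting the selected model on a copy of the data; `cars`, `fit_sel`, `am_est` and `am_ci` are hypothetical names, and the interval relies on the usual linear-model assumptions.

```r
# Copy of the data with am coded as in the text
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Same formula as fit_best in the text
fit_sel <- lm(mpg ~ wt + qsec + am, data = cars)

# Point estimate and 95% confidence interval for the manual-vs-automatic gap
am_est <- unname(coef(fit_sel)["amManual"])
am_ci  <- confint(fit_sel, "amManual", level = 0.95)

print(am_est)
print(am_ci)
```

An interval whose lower end sits close to zero means the size of the effect is uncertain, even when the p-value is below 0.05.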
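The visual reading of the residual plots can be complemented with a few numbers. This sketch is an addition to the original analysis: it extracts the residual standard error, expresses it relative to the mean MPG, and runs a Shapiro-Wilk normality test on the residuals. The names `cars`, `fit_sel`, `res_se`, `rel_err` and `sw` are hypothetical.

```r
# Copy of the data with am coded as in the text
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Same formula as fit_best in the text
fit_sel <- lm(mpg ~ wt + qsec + am, data = cars)

# Residual standard error, in MPG and relative to the average MPG
res_se  <- sigma(fit_sel)
rel_err <- res_se / mean(cars$mpg)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(residuals(fit_sel))

print(c(res_se = res_se, rel_err = rel_err))
print(sw)
```

With only 32 observations, the normality test has limited power, so it should be read alongside the Q-Q plot rather than in place of it.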