Goal of the Project

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  1. “Is an automatic or manual transmission better for MPG?”
  2. “Quantify the MPG difference between automatic and manual transmissions.”

Let’s find out!!!

Loading the required libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Loading the data

We use the built-in “mtcars” dataset. Its variables, listed below, are the specifications by which the cars are described and evaluated.

data(mtcars)
dim(mtcars)
## [1] 32 11
head(mtcars)

There are 32 cars, each described by 11 variables. Some of these are continuous measurements and some take only a few discrete values.

mpg  - Miles/(US) gallon
cyl  - Number of cylinders
disp - Displacement (cu.in.)
hp   - Gross horsepower
drat - Rear axle ratio
wt   - Weight (1000 lbs)
qsec - 1/4 mile time
vs   - Engine (0 = V-shaped, 1 = straight)
am   - Transmission (0 = automatic, 1 = manual)
gear - Number of forward gears
carb - Number of carburetors

Converting the variables with discrete values to factor variables

mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

Let us look at summary statistics for our response variable, mpg (miles per gallon):

summary(mtcars$mpg) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

Exploratory Data Analysis

Since the main goal of our analysis is to decide which type of transmission is better for obtaining higher mpg, let us check the relation between mpg and am.

g <- ggplot(data = mtcars, aes(x = am, y = mpg))
g + theme_bw() + geom_violin(fill = "darkkhaki") + labs(title = "Violin Plot", x = "Transmission (0 = automatic, 1 = manual)", y = "Miles Per Gallon")

The violin plot above compares MPG for automatic and manual transmissions. It suggests that cars with a manual transmission tend to have higher MPG than cars with an automatic one; whether the difference is statistically significant is tested in the Inference section. The width of each violin shows where the observations concentrate: automatic cars cluster around their median, while manual cars cluster around their first quartile.
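The reading of the violin plot can be cross-checked with a few per-group summary statistics. The following is a minimal sketch, not part of the original analysis: it works on a copy of the data (the name `cars` and the labels "automatic" and "manual" are introduced here for illustration) so that the `mtcars` object used elsewhere is left untouched.

```r
library(dplyr)

# Work on a copy so the factor conversions applied to mtcars are not affected
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Group size, centre and quartiles of mpg for each transmission type
mpg_by_am <- cars %>%
  group_by(am) %>%
  summarise(
    n      = n(),
    mean   = mean(mpg),
    median = median(mpg),
    q1     = unname(quantile(mpg, 0.25)),
    q3     = unname(quantile(mpg, 0.75))
  )

print(mpg_by_am)
```

Comparing the `median` and `q1` columns across the two rows gives a numeric counterpart to the widest parts of each violin.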

Let us check the relationship between horsepower and mpg differentiating between the type of engine.

ggplot(mtcars, aes(x = hp, y = mpg, color = vs)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The plot above shows a roughly linear, negative relationship between horsepower and mpg, with one outlier. Straight engines tend to have lower horsepower and higher mpg, mostly between the overall mean and the maximum, whereas V-shaped engines, in spite of their higher horsepower, mostly have mpg below the mean, in the range of the first quartile.

Let us check the pairwise correlation between our desired variables

ggpairs(data = mtcars %>% select(mpg,hp,disp,vs,am))  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Among these variables, mpg is most strongly correlated with engine displacement (disp), although the relationship is negative: larger displacement goes with lower mpg. The distribution of mpg itself is right-skewed.
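The correlations shown in the pairs plot can also be computed directly. This is a small sketch using the untransformed numeric columns of a copy of the data (the names `cars` and `cor_with_mpg` are introduced here); note that for the 0/1 variables vs and am the Pearson coefficient is only a rough summary.

```r
# Copy of the raw data, where vs and am are still numeric 0/1 codes
cars <- datasets::mtcars

# Pearson correlation of each selected variable with mpg (a 4 x 1 matrix)
cor_with_mpg <- cor(cars[, c("hp", "disp", "vs", "am")], cars$mpg)

# Show as a named vector, sorted by absolute strength
strength <- sort(abs(cor_with_mpg[, 1]), decreasing = TRUE)
print(round(cor_with_mpg[names(strength), 1], 3))
```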

Inference

Hypothesis Test and Confidence Interval

Ho : The null hypothesis states that there is no difference in mean mpg between the transmission types.

Ha : The alternative hypothesis states that mean mpg differs between the transmission types.

T-Test transmission type and MPG

testResults <- t.test(mpg ~ am, data=mtcars)
testResults$p.value
## [1] 0.001373638

With a p-value of about 0.0014, we reject the null hypothesis that the difference in mean MPG between transmission types is 0 at the 5% significance level.

testResults$estimate
## mean in group 0 mean in group 1 
##        17.14737        24.39231

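The estimated difference between the two group means is 24.39231 - 17.14737 = 7.24494 MPG in favor of manual. This comparison does not adjust for other variables such as weight or horsepower.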
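The section title mentions a confidence interval, which the same `t.test` result already contains. The sketch below refits the test on a copy of the data (the names `cars`, `welch` and `diff_manual_minus_auto` are introduced here for illustration) and extracts it. By default `t.test` performs Welch's unequal-variance test, and with the formula interface the interval is for group 0 minus group 1, that is, automatic minus manual.

```r
# Copy of the data with am as a factor, as in the analysis above
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Welch two-sample t-test of mpg by transmission type
welch <- t.test(mpg ~ am, data = cars)

# 95% confidence interval for mean(automatic) - mean(manual)
print(welch$conf.int)

# Point estimate expressed as manual minus automatic
diff_manual_minus_auto <- unname(welch$estimate[2] - welch$estimate[1])
print(diff_manual_minus_auto)
```

An interval that lies entirely below zero is consistent with rejecting the null hypothesis of no difference.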

Modelling

Since there are several candidate explanatory variables, we fit a multiple linear regression model, starting with the full model that contains all of them. We assume the observations are independent of each other.

model <- lm(mpg ~ ., data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vs1          1.93085    2.87126   0.672   0.5115  
## am1          1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124

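None of the individual coefficients has a p-value below 0.05, even though the overall F-test is highly significant (p-value 0.000124). With 16 coefficients estimated from only 32 cars, the full model cannot tell us which variables matter most.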

Backward elimination with step(), which drops terms according to AIC, gives a smaller model:

red_model <- step(model, direction = "backward", trace = FALSE)
summary(red_model)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## am1          1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

The new model has 4 variables (cylinders, horsepower, weight, transmission). The R-squared value of 0.8659 means that this model explains about 87% of the variance in MPG. Not all coefficients are significant at the 0.05 level: cyl6, hp and wt are, while cyl8 (p = 0.352) and am1 (p = 0.206) are not. Holding the other variables fixed, a 6-cylinder car is estimated to get 3.03 MPG less than a 4-cylinder car, and an 8-cylinder car 2.16 MPG less than a 4-cylinder car (both coefficients are relative to the 4-cylinder baseline). MPG decreases by about 3.21 for every additional 100 horsepower and by about 2.5 for each additional 1000 lbs of weight. A manual transmission is associated with an MPG 1.81 higher than an automatic, but this estimate is not statistically significant.
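Confidence intervals make the uncertainty in these coefficients easier to see than p-values alone. The following sketch refits the selected formula on a copy of the data (the names `cars`, `fit` and `ci` are introduced here; `fit` uses the same formula as `red_model`) and prints 95% intervals.

```r
# Copy of the data with the factors used by the selected model
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am)

# Same formula as the model chosen by backward elimination
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 95% confidence intervals for all coefficients
ci <- confint(fit, level = 0.95)
print(round(ci, 3))

# Horsepower effect rescaled to a change of 100 hp
print(round(100 * coef(fit)["hp"], 2))
```

If the interval for `am1` includes zero, the data are compatible with no transmission effect once cylinders, horsepower and weight are held fixed.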

Residuals & Diagnostics

Residual Plot
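The code that produced the diagnostic plots is not shown in this report. A minimal sketch of how such plots are commonly drawn in base R follows, assuming the selected model is refit on a copy of the data (the names `cars` and `fit` are introduced here for illustration).

```r
# Copy of the data with the factors used by the selected model
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am)

fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Arrange the four standard lm diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous graphics settings
```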

The plots suggest:

  1. The absence of a pattern in the Residuals vs. Fitted plot supports the linearity assumption and gives no sign of dependence among the residuals
  2. The points of the Normal Q-Q plot lie close to the line, suggesting that the residuals are approximately normally distributed
  3. The random scatter in the Scale-Location plot supports the constant variance assumption
  4. Since all points are within the 0.5 Cook's distance lines, the Residuals vs. Leverage plot shows no unduly influential observations

As a further check, we count the dfbetas values larger than 1 in absolute value:

sum((abs(dfbetas(red_model))) > 1)
## [1] 0
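A count of zero says that no single car shifts any coefficient by more than one standard error when it is left out. To see how close the most influential cars come to that threshold, the dfbetas matrix can be summarised per coefficient. This is a sketch on a refit copy of the model (the names `cars`, `fit`, `db` and `influence_summary` are introduced here for illustration).

```r
# Copy of the data with the factors used by the selected model
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am)

fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# One row per car, one column per coefficient
db <- dfbetas(fit)

# Largest absolute dfbeta for each coefficient and the car responsible
influence_summary <- data.frame(
  coefficient    = colnames(db),
  max_abs_dfbeta = round(apply(abs(db), 2, max), 3),
  car            = rownames(db)[apply(abs(db), 2, which.max)],
  row.names      = NULL
)

print(influence_summary)
```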

Conclusion

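Taken on its own, transmission type is associated with a difference in MPG: manual cars average 7.24 MPG more than automatic cars (p-value about 0.0014). Once weight, horsepower and number of cylinders are accounted for, the estimated advantage of a manual transmission shrinks to 1.81 MPG and is no longer statistically significant. Weight and horsepower are the more significant predictors of MPG in the selected model.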