1 Executive Summary

This analysis fulfills the requirements of the course project for Regression Models, offered by Johns Hopkins University as part of the Data Science Specialization. In this project, we analyze the mtcars data set and explore the relationship between a set of variables and miles per gallon (MPG), which is our outcome.

The main objectives of this research are as follows

The key takeaway from our analysis was

2 Exploratory analysis

In the following section, we load the data set, convert the categorical variables to factors, and look at the data.

Our main focus is the relationship between the transmission variable “am” (binary: 0 = automatic, 1 = manual) and the miles per US gallon variable “mpg”. For ease of use, we build a “transmission” factor variable out of “am”:

2.1 Load and transform the data

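# Load the built-in mtcars data set
data(mtcars)

# Treat the discrete-valued variables as categorical
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)

# Labelled copy of am (0 = Automatic, 1 = Manual); am itself stays in the data
mtcars$transmission <- factor(mtcars$am, labels = c('Automatic', 'Manual'))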
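As a quick sanity check on the recoding, the sketch below cross-tabulates the original 0/1 codes against the new labels. It works on a separate copy called cars (a name introduced here for illustration) so that it does not touch the mtcars object used in the rest of the report.

```r
# Fresh copy of the built-in data; "cars" is an illustrative name
cars <- datasets::mtcars
cars$transmission <- factor(cars$am, labels = c("Automatic", "Manual"))

# Rows are the original am codes, columns are the new factor labels.
# Every car should fall on the diagonal (0 -> Automatic, 1 -> Manual).
recode_check <- table(am = cars$am, transmission = cars$transmission)
print(recode_check)
```

If any count appeared off the diagonal, the labels would have been assigned in the wrong order.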

2.2 Data visualisation

Visualising the “transmission” and “MPG” variables helps us understand the main patterns in the data. Figure 1 shows that manual cars tend to have higher MPG values. Figure 2 confirms this: the distribution of MPG for manual cars is clearly shifted toward higher values than for automatic cars.

library(ggplot2)
# Figure 1: Histogram of Miles per US Gallon values
plot1 <- ggplot(mtcars, aes(x = mpg, fill = transmission)) +
  geom_histogram(binwidth = 1, col = "black") +
  ggtitle('Figure 1: Histogram of Miles per US Gallon values')

# Figure 2: Boxplot of MPG values for each transmission
plot2 <- ggplot(mtcars, aes(x = transmission, y = mpg, fill = transmission)) +
  geom_boxplot() + geom_jitter(size = 2) +
  ggtitle('Figure 2: Boxplot of MPG values for each transmission')

This gives a first indication of the answer to the first question: manual cars seem to be better for MPG.

However, this MPG difference is not necessarily explained by the transmission alone. The transmission may be related to other variables in the dataset that explain the difference. We can visualise the relationships between pairs of variables with a pair plot. In Figure 3 we notice that the transmission is strongly related to “disp” (displacement), “drat” (rear axle ratio) and “cyl” (number of cylinders), and those variables are all strongly correlated with MPG. We therefore have to be careful when quantifying the effect of transmission on MPG and adjust for the other variables.
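The relationships read off the pair plot can also be checked numerically. The sketch below is one possible way to do so, assuming it is acceptable to treat am and cyl as numeric codes for the purpose of a Pearson correlation; the names cars, vars and cor_matrix are illustrative.

```r
# Untouched numeric copy of the data ("cars" is an illustrative name)
cars <- datasets::mtcars

# Variables discussed in the text: outcome, transmission, and three candidates
# for confounding. am and cyl are treated as numeric codes here (an assumption).
vars <- c("mpg", "am", "disp", "drat", "cyl")
cor_matrix <- round(cor(cars[, vars]), 2)
print(cor_matrix)
```

The am row shows how strongly transmission is tied to each variable, and the mpg row shows how strongly each variable is tied to the outcome. A variable that is large in both rows is a candidate confounder.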

3 Exploratory Data Analysis

In this section, we dive deeper into our data and explore various relationships between variables of interest. Initially, we plot the relationships between all the variables of the dataset (see Figure 3 in the appendix). From the plot, we notice that variables like cyl, disp, hp, drat, wt, vs and transmission seem to be strongly correlated with mpg. We quantify these relationships with linear models in the regression analysis section, starting from an initial model that uses every variable as a predictor.

init_model <- lm(mpg ~ ., data = mtcars)
summary(init_model)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        23.87913   20.06582   1.190   0.2525  
## cyl6               -2.64870    3.04089  -0.871   0.3975  
## cyl8               -0.33616    7.15954  -0.047   0.9632  
## disp                0.03555    0.03190   1.114   0.2827  
## hp                 -0.07051    0.03943  -1.788   0.0939 .
## drat                1.18283    2.48348   0.476   0.6407  
## wt                 -4.52978    2.53875  -1.784   0.0946 .
## qsec                0.36784    0.93540   0.393   0.6997  
## vs1                 1.93085    2.87126   0.672   0.5115  
## am                  1.21212    3.21355   0.377   0.7113  
## gear4               1.11435    3.79952   0.293   0.7733  
## gear5               2.52840    3.73636   0.677   0.5089  
## carb2              -0.97935    2.31797  -0.423   0.6787  
## carb3               2.99964    4.29355   0.699   0.4955  
## carb4               1.09142    4.44962   0.245   0.8096  
## carb6               4.47757    6.38406   0.701   0.4938  
## carb8               7.25041    8.36057   0.867   0.3995  
## transmissionManual       NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124
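The NA row for transmissionManual in the output above arises because transmission is a relabelled copy of am: the two columns carry identical information, so lm can estimate only one of them. A minimal sketch on a smaller model illustrates this (the names cars, fit_both and fit_one are illustrative, and the reduced formula is chosen only to keep the output short):

```r
cars <- datasets::mtcars
cars$transmission <- factor(cars$am, labels = c("Automatic", "Manual"))

# Both am and transmission in the formula: the later term is perfectly
# collinear with the earlier one, so its coefficient is reported as NA
fit_both <- lm(mpg ~ wt + am + transmission, data = cars)
print(coef(fit_both))

# Keeping only one of the two gives a full-rank fit with the same estimate
fit_one <- lm(mpg ~ wt + transmission, data = cars)
print(coef(fit_one))
```

Dropping either am or transmission before fitting mpg ~ . would avoid the singularity without changing the remaining estimates.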

Since we are interested in the effect of transmission type on mpg, we draw boxplots of mpg for Automatic and Manual transmissions (see Figure 2 in the appendix). The plot shows higher mpg when the transmission is Manual. The base model below quantifies this difference using transmission as the only predictor.

base_model <- lm(mpg ~ transmission, data = mtcars)
summary(base_model)
## 
## Call:
## lm(formula = mpg ~ transmission, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          17.147      1.125  15.247 1.13e-15 ***
## transmissionManual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
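With a single two-level factor as predictor, the coefficients have a direct reading: the intercept is the mean mpg of the reference group (Automatic) and the transmissionManual coefficient is the difference between the two group means. The sketch below checks this on a fresh copy of the data; cars, group_means, base_fit and mean_diff are illustrative names.

```r
cars <- datasets::mtcars
cars$transmission <- factor(cars$am, labels = c("Automatic", "Manual"))

# Mean mpg within each transmission group
group_means <- tapply(cars$mpg, cars$transmission, mean)

# Same model as the base model in the text, refitted on the copy
base_fit <- lm(mpg ~ transmission, data = cars)

# Difference in group means, to compare with the slope
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

print(group_means)
print(coef(base_fit))
print(mean_diff)
```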

3.1 Regression Analysis

In this section, we build linear regression models on different sets of variables, select the best-fitting model, and compare it with the base model using anova. After model selection, we also analyze the residuals.

3.2 Model building and selection

best_model <- step(init_model, direction = "both")

As mentioned earlier, the pairs plot suggests that several variables are strongly correlated with mpg. We therefore build an initial model with all the variables as predictors and perform stepwise model selection to choose the predictors for the final model. This is handled by the step function, which refits lm repeatedly, adding and dropping terms (direction = "both"), and keeps the model with the lowest AIC.
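To make the selection criterion concrete, the sketch below reruns the stepwise search on a fresh copy of the data and compares the AIC of the full and selected models. It assumes the redundant transmission column is left out, so the selected formula may be printed slightly differently from the report's; trace = 0 only silences the step-by-step log. The names cars, full_fit and selected_fit are illustrative.

```r
cars <- datasets::mtcars
# Same factor conversions as in section 2.1, without the extra transmission column
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

full_fit <- lm(mpg ~ ., data = cars)
selected_fit <- step(full_fit, direction = "both", trace = 0)

# extractAIC() returns c(edf, AIC); this is the criterion step() minimizes
aic_full <- extractAIC(full_fit)[2]
aic_selected <- extractAIC(selected_fit)[2]

print(formula(selected_fit))
print(c(full = aic_full, selected = aic_selected))
```

Because step only moves to a model when the AIC improves, the selected model's AIC can never be higher than that of the starting model. It is a greedy search, though, so it is not guaranteed to find the best subset overall.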

The best model obtained from the above computations consists of the variables cyl, wt and hp as confounders and transmission (am) as the independent variable. Details of the model are shown below.

summary(best_model)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## am           1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

From the above model details, we observe that the adjusted R-squared value is 0.84, well above that of the full model (0.78) and of the base model (0.34). The multiple R-squared is 0.87, so the model explains about 87% of the variability in mpg.
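The adjusted R-squared penalizes the ordinary R-squared for the number of predictors: adjusted = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of estimated slopes. The sketch below recomputes it by hand for the selected model, refitted on a fresh copy (cars, fit, r2 and adj_r2 are illustrative names).

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

n <- nrow(cars)                # number of observations
p <- length(coef(fit)) - 1     # number of slopes (intercept excluded)
r2 <- summary(fit)$r.squared

# Adjusted R-squared from its definition
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(r_squared = r2, adjusted = adj_r2))
```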

3.3 Base_model VS Best_model

In this section, we compare the base model, which has transmission as its only predictor, with the best model obtained earlier, which also contains the confounder variables.

anova(base_model, best_model)
## Analysis of Variance Table
## 
## Model 1: mpg ~ transmission
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     26 151.03  4    569.87 24.527 1.688e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the above results, the p-value (1.688e-08) is highly significant, so we reject the null hypothesis that the confounder variables cyl, hp and wt do not improve the fit of the model.
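The F statistic in the anova table compares the drop in residual sum of squares per added parameter with the residual variance of the larger model. The sketch below recomputes it by hand on a fresh copy of the data; the names cars, small_fit, large_fit and f_stat are illustrative.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)

small_fit <- lm(mpg ~ am, data = cars)                   # base model
large_fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)   # selected model

rss_small <- sum(resid(small_fit)^2)
rss_large <- sum(resid(large_fit)^2)

# Number of extra parameters in the larger model
df_extra <- df.residual(small_fit) - df.residual(large_fit)

# F = (reduction in RSS per extra parameter) / (residual variance of larger model)
f_stat <- ((rss_small - rss_large) / df_extra) /
  (rss_large / df.residual(large_fit))
p_value <- pf(f_stat, df_extra, df.residual(large_fit), lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```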

3.4 Residuals and Diagnostics

In this section, we study the residual plots of our regression model and compute some regression diagnostics to identify points with high leverage or influence (potential outliers) in the data set.

Residual plots

From the residual plots (Figure 4 in the appendix), we can make the following observations:

The points in the Residuals vs. Fitted plot seem to be randomly scattered, which supports the independence condition.

The points in the Normal Q-Q plot mostly fall on the line, indicating that the residuals are approximately normally distributed.

The points in the Scale-Location plot are scattered in a constant band, indicating constant variance.

There are some distinct points of interest (outliers or leverage points) in the top right of the plots.

We now compute some regression diagnostics for our model to identify these points. For each influence measure, we report the top three cars.

# Hat values measure leverage; show the three largest
leverage <- hatvalues(best_model)
tail(sort(leverage), 3)
##       Toyota Corona Lincoln Continental       Maserati Bora 
##           0.2777872           0.2936819           0.4713671
# dfbetas gives each car's influence on every coefficient; column 6 is the am coefficient
influential <- dfbetas(best_model)
tail(sort(influential[, 6]), 3)
## Chrysler Imperial          Fiat 128     Toyota Corona 
##         0.3507458         0.4292043         0.7305402

Looking at the cars listed above, we notice that they match the cars labelled in the residual plots, which supports our reading of those plots.
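For reference, the sketch below shows one way the diagnostic plots and the leverage ranking could be reproduced together. It refits the selected model on a fresh copy of the data (cars, fit, lev, top_leverage and top_residuals are illustrative names), and the plotting call is guarded so that it only draws in an interactive session.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# The four standard lm diagnostic plots in a 2 x 2 grid
if (interactive()) {
  old_par <- par(mfrow = c(2, 2))
  plot(fit)
  par(old_par)
}

# Leverage: the hat values sum to the number of estimated coefficients
lev <- hatvalues(fit)
top_leverage <- names(tail(sort(lev), 3))

# Cars with the largest absolute standardized residuals
top_residuals <- names(tail(sort(abs(rstandard(fit))), 3))

print(top_leverage)
print(top_residuals)
```

High leverage and a large residual are different properties: a car can sit far from the others in predictor space and still be fitted well, so the two lists need not coincide.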

4 Inference

We also perform a t-test, assuming that mpg is approximately normally distributed within each transmission group. The mean mpg of manual and automatic cars is significantly different (p = 0.0014).

t.test(mpg ~ transmission, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by transmission
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231
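The Welch statistic reported above can be reproduced from the group summaries: the difference in means divided by the square root of the summed per-group variances of the mean, with degrees of freedom from the Welch-Satterthwaite approximation. A minimal sketch, using illustrative names (cars, auto, manual, t_stat, df_welch):

```r
cars <- datasets::mtcars
auto <- cars$mpg[cars$am == 0]     # automatic cars
manual <- cars$mpg[cars$am == 1]   # manual cars

# Variance of each group mean (no pooled variance is assumed)
v_auto <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: Automatic minus Manual, matching the sign in the output
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

print(c(t = t_stat, df = df_welch))
```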

5 Conclusion

Based on the observations from our best fit model, we can conclude the following:

Cars with Manual transmission get about 1.8 more miles per gallon than cars with Automatic transmission (adjusted by hp, cyl, and wt), although this adjusted difference is not statistically significant (p = 0.21).

mpg decreases by about 2.5 for every 1000 lb increase in wt (adjusted by hp, cyl, and transmission).

mpg decreases by about 0.03 for every unit increase in hp (adjusted by cyl, wt, and transmission).

Compared with 4-cylinder cars, mpg is lower by about 3.0 for 6-cylinder cars and by about 2.2 for 8-cylinder cars (adjusted by hp, wt, and transmission).
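One way to gauge the uncertainty around these adjusted estimates is to look at 95% confidence intervals for the coefficients. The sketch below refits the selected model on a fresh copy of the data (cars, fit, ci and am_ci are illustrative names); an interval that spans zero means the data are compatible with no adjusted effect.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 95% confidence intervals for every coefficient
ci <- confint(fit, level = 0.95)
am_ci <- ci["am", ]

print(round(ci, 3))
```

Here the interval for am includes zero while the interval for wt lies entirely below zero, which matches the p-values in the model summary.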

6 Appendix


Figure 4: Best Model