This analysis was carried out for the course project of Regression Models, part of the Data Science Specialization offered by Johns Hopkins University. In this project, we analyze the mtcars data set and explore the relationship between a set of variables and miles per gallon (MPG), which is our outcome.
The main objectives of this research are as follows
The key takeaway from our analysis was
In the following section, we load the data set, convert the categorical variables to factors, and take a first look at the data.
Our main focus is the relationship between transmission “Automatic VS Manual” variable (binary: 0 = automatic, 1 = manual) and the number of miles per US gallon “mpg” variable. For ease of use, let’s build a “transmission” factor variable out of the “Automatic VS Manual” variable:
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$transmission <- factor(mtcars$am,labels=c('Automatic','Manual'))
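As a quick check on the recoding above, the sketch below rebuilds the factor on a fresh copy of the data and cross-tabulates it against the original am column. The name `cars` is only an illustrative copy introduced here, not an object used elsewhere in this report.

```r
# Work on a fresh copy so this check does not depend on earlier steps.
cars <- datasets::mtcars
cars$transmission <- factor(cars$am, labels = c("Automatic", "Manual"))

# Each am value should map to exactly one label: 0 -> Automatic, 1 -> Manual.
mapping <- table(am = cars$am, transmission = cars$transmission)
print(mapping)

# Structure of the data: the original variables plus the new factor.
str(cars)
```

All counts should sit on the diagonal of the table; any off-diagonal count would mean the labels were attached in the wrong order.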
Visualising the “transmission” and “MPG” variables helps us understand the main patterns in the data. Figure 1 shows that manual cars tend to have higher MPG values. Figure 2 confirms this: the distribution of MPG for manual cars sits clearly higher than for automatic cars.
library(ggplot2)
# Figure 1: Histogram of Miles per US Gallon values
plot1 <- ggplot(mtcars, aes(x = mpg, fill = transmission)) +
  geom_histogram(binwidth = 1, col = "black") +
  ggtitle('Figure 1: Histogram of Miles per US Gallon values')
# Figure 2: Boxplot of MPG values for each transmission
# (geom_boxplot has no 'adjust' argument, so it is omitted here)
plot2 <- ggplot(mtcars, aes(x = transmission, y = mpg, fill = transmission)) +
  geom_boxplot() + geom_jitter(size = 2) +
  ggtitle('Figure 2: Boxplot of MPG values for each transmission')
Thus, this gives an indication of the answer to the first question: manual cars seem to be better for MPG.
However, this MPG difference is not necessarily explained by the transmission alone. The transmission may be related to other variables in the dataset that account for the difference. We can visualise the relationships between pairs of variables with a pair plot. In Figure 3 we notice that transmission is strongly related to “disp” (displacement), “drat” (rear axle ratio) and “cyl” (number of cylinders), and that those variables are all highly correlated with MPG. We therefore have to be careful when quantifying the effect of transmission on MPG and must adjust for the other variables.
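The relationships read off the pair plot can also be put into numbers. The following is a minimal sketch, assuming plain Pearson correlations on the untransformed data are an acceptable summary (cyl and am are treated as numeric here, which is a simplification); `cars`, `confounders` and `cor_table` are illustrative names.

```r
# Fresh copy with all columns numeric, so cor() can be applied directly.
cars <- datasets::mtcars

# Pearson correlation of each candidate confounder with transmission (am)
# and with the outcome (mpg).
confounders <- c("disp", "drat", "cyl")
cor_table <- data.frame(
  with_am  = sapply(confounders, function(v) cor(cars[[v]], cars$am)),
  with_mpg = sapply(confounders, function(v) cor(cars[[v]], cars$mpg))
)
print(round(cor_table, 2))
```

A variable that is correlated with both am and mpg is a candidate confounder, which is why the later models adjust for such variables.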
In this section, we dive deeper into our data and explore the relationships between variables of interest. First, we plot the relationships between all the variables of the dataset (see Figure 2 in the appendix). From the plot, we notice that variables like cyl, disp, hp, drat, wt, vs and transmission appear strongly correlated with mpg. We use linear models to quantify this in the regression analysis section.
init_model <- lm(mpg ~ ., data = mtcars)
summary(init_model)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## am 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## transmissionManual NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
Since we are interested in the effect of transmission type on mpg, we draw boxplots of mpg for Automatic and Manual transmissions (see Figure 3 in the appendix). The plot shows clearly higher mpg when the transmission is Manual. We quantify this difference with a base model that uses transmission as the only predictor.
base_model <- lm(mpg ~ transmission, data = mtcars)
summary(base_model)
##
## Call:
## lm(formula = mpg ~ transmission, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## transmissionManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
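With a single two-level factor as predictor, the base model's coefficients are just group means in disguise. The sketch below illustrates this on a fresh copy of the data; `cars`, `base_fit` and `group_means` are names introduced only for this example.

```r
cars <- datasets::mtcars
cars$transmission <- factor(cars$am, labels = c("Automatic", "Manual"))

base_fit <- lm(mpg ~ transmission, data = cars)
group_means <- tapply(cars$mpg, cars$transmission, mean)

# Intercept = mean mpg of the reference level (Automatic).
# Slope     = mean(Manual) - mean(Automatic).
print(coef(base_fit))
print(group_means)
print(unname(group_means["Manual"] - group_means["Automatic"]))
```

This is why the intercept (17.147) matches the Automatic group mean and the transmissionManual estimate (7.245) is the unadjusted difference between the two groups.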
In this section, we build linear regression models on different sets of variables, look for the best-fitting model, and compare it with the base model using anova. After model selection, we also analyze the residuals.
best_model <- step(init_model, direction = "both")
As mentioned earlier, the pairs plot suggests that several variables are highly correlated with mpg. We therefore build an initial model with all the variables as predictors and perform stepwise model selection to choose the predictors for the final model, which we call the best model. This is handled by the step function, which fits lm repeatedly and, with direction = "both", combines forward selection and backward elimination, comparing candidate models by AIC.
The best model obtained from the above computations includes cyl, wt and hp as confounders and transmission (coded as am) as the independent variable. Details of the model are shown below.
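To make the AIC comparison concrete, here is a minimal sketch of the same selection on a fresh copy of the data. It deliberately leaves out the duplicate transmission column, since it carries the same information as am (the initial model's summary reports that coefficient as not defined because of singularities). The names `cars`, `full_fit` and `selected_fit` are illustrative.

```r
cars <- datasets::mtcars
# Same factor conversions as in the data preparation step.
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

full_fit <- lm(mpg ~ ., data = cars)
# trace = 0 suppresses the step-by-step log.
selected_fit <- step(full_fit, direction = "both", trace = 0)

print(formula(selected_fit))
# Lower AIC is better; step() stops when no single add/drop lowers it further.
print(c(full = AIC(full_fit), selected = AIC(selected_fit)))
```

Note that stepwise search only explores models reachable by adding or dropping one term at a time, so it is not guaranteed to find the lowest-AIC model over all possible subsets.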
summary(best_model)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## am 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
From the above model details, we observe that the adjusted R² value is 0.84, well above the 0.34 of the base model. Thus, the model explains about 84% of the variability in mpg after adjusting for the number of predictors.
In the following section, we compare the base model, which has transmission as the only predictor, with the best model obtained earlier, which also contains the confounder variables.
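For reference, the adjusted value in the summary follows from the multiple R-squared through the standard formula, with n = 32 observations and p = 5 estimated slopes (so n - p - 1 = 26 residual degrees of freedom, as reported above):

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.8659)\,\frac{31}{26}
          \approx 0.8401
```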
anova(base_model, best_model)
## Analysis of Variance Table
##
## Model 1: mpg ~ transmission
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the above results, the p-value is highly significant, so we reject the null hypothesis that the confounder variables cyl, hp and wt do not improve the fit of the model.
In this section, we study the residual plots of our regression model and compute some regression diagnostics to identify high-leverage and influential points in the data set.
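The F statistic in the table can be reproduced by hand from the two residual sums of squares and their degrees of freedom (subscript 1 for the base model, 2 for the best model):

```latex
F = \frac{(\mathrm{RSS}_1 - \mathrm{RSS}_2)/(\mathrm{df}_1 - \mathrm{df}_2)}{\mathrm{RSS}_2/\mathrm{df}_2}
  = \frac{(720.90 - 151.03)/4}{151.03/26}
  = \frac{142.47}{5.809}
  \approx 24.53
```

This agrees with the reported value of 24.527 up to the rounding of the RSS values shown in the table.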
Residual plots
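The code that produced these plots is not shown in this report. A minimal sketch of one common way to draw them, assuming base R's plot method for lm objects and refitting the selected model so the example runs on its own (`cars` and `fit` are illustrative names):

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 2 x 2 grid: Residuals vs Fitted, Normal Q-Q, Scale-Location,
# Residuals vs Leverage (the default set for plot.lm).
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```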
From the above plots, we can make the following observations:
The points in the Residuals vs. Fitted plot appear randomly scattered, which supports the independence condition. In the Normal Q-Q plot, the points mostly fall on the line, indicating that the residuals are approximately normally distributed. The Scale-Location plot shows points scattered in a roughly constant band, indicating constant variance. There are some distinct points of interest (outliers or leverage points) in the top right of the plots. We now compute some regression diagnostics to identify these points, reporting the top three for each influence measure.
leverage <- hatvalues(best_model)
tail(sort(leverage),3)
## Toyota Corona Lincoln Continental Maserati Bora
## 0.2777872 0.2936819 0.4713671
influential <- dfbetas(best_model)
tail(sort(influential[,6]),3)
## Chrysler Imperial Fiat 128 Toyota Corona
## 0.3507458 0.4292043 0.7305402
These are the same cars that are labelled in the residual plots, which is consistent with our reading of those plots.
We also perform a t-test, assuming that mpg is approximately normally distributed within each transmission group, and we clearly see that mean mpg for manual and automatic transmissions is significantly different.
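One way to decide which leverage values are large is to compare them with their average. The sketch below uses the common 2p/n rule of thumb, which is an assumption of this example and not a threshold used elsewhere in this report; `cars`, `fit`, `h`, `p` and `n` are illustrative names.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

h <- hatvalues(fit)
p <- length(coef(fit))   # number of estimated coefficients (6 here)
n <- nrow(cars)          # number of observations (32)

# Hat values always sum to p, so their average is p / n.
print(sum(h))
# Rule of thumb (assumed for this sketch): flag h > 2 * p / n.
print(h[h > 2 * p / n])
```

With the leverage values listed above, only the Maserati Bora (0.47) exceeds 2 × 6 / 32 = 0.375.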
t.test(mpg ~ transmission, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by transmission
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
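The Welch statistic reported above can be reproduced by hand, which makes clear that the two group variances are not pooled. A minimal sketch, with illustrative names (`auto`, `manual`, `se_diff`, `t_manual`, `t_builtin`):

```r
cars <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]
manual <- cars$mpg[cars$am == 1]

# Standard error of the difference in means, without pooling the variances.
se_diff <- sqrt(var(auto) / length(auto) + var(manual) / length(manual))
t_manual <- (mean(auto) - mean(manual)) / se_diff

# Compare with the statistic reported by t.test (Welch is its default).
t_builtin <- unname(t.test(auto, manual)$statistic)
print(c(by_hand = t_manual, t.test = t_builtin))
```

The sign is negative because the Automatic mean is subtracted first, matching the order of the groups in the output above.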
Based on the observations from our best fit model, we can conclude the following:
Cars with Manual transmission get about 1.8 more miles per gallon than cars with Automatic transmission (adjusted for hp, cyl, and wt), although this adjusted difference is not statistically significant (p = 0.21).
mpg decreases by about 2.5 for every 1000 lb increase in wt (adjusted for hp, cyl, and transmission).
mpg decreases only slightly with hp, by about 0.03 per unit of horsepower (adjusted for cyl, wt, and transmission).
Compared with 4 cylinders, mpg is about 3.0 lower for 6 cylinders and about 2.2 lower for 8 cylinders (adjusted for hp, wt, and transmission).
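Because the am coefficient in the best model has a p-value of 0.21 (see the model summary), an interval is more informative than the point estimate alone. The sketch below refits the selected model on a fresh copy of the data and extracts a 95% confidence interval; `cars`, `fit` and `ci` are illustrative names.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# 95% confidence intervals for all coefficients; the "am" row is the
# adjusted Manual-vs-Automatic difference in mpg.
ci <- confint(fit, level = 0.95)
print(round(ci["am", ], 2))
```

An interval that includes zero means the data are compatible with no adjusted difference between the two transmission types.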
Figure 4: Best Model