Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). We are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions.”
Our analysis below leads to the following conclusions:
Cars with manual transmission get more miles per gallon (mpg) than cars with automatic transmission: about 1.8 mpg more after adjusting for hp, cyl, and wt, although this adjusted difference is not statistically significant (p = 0.21).
Mpg decreases by about 2.5 for every 1000 lb increase in wt (adjusted for hp, cyl, and am).
Mpg decreases marginally as hp increases, by about 0.03 mpg per unit of horsepower.
Relative to 4-cylinder cars, 6- and 8-cylinder cars get about 3.0 and 2.2 fewer mpg respectively (adjusted for hp, wt, and am).
We will now explore the basis for the above summary and conclusions.
The data for this analysis is mtcars. We begin by loading the data, converting the categorical variables to factors, and inspecting the structure.
data("mtcars")
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am,labels=c("Automatic","Manual"))
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
The first step of the analysis is to create a box plot of the data and to compute the mean and standard deviation of mpg per transmission type.
by(mtcars$mpg,mtcars$am,mean) ##mean in boxplot
## mtcars$am: Automatic
## [1] 17.14737
## --------------------------------------------------------
## mtcars$am: Manual
## [1] 24.39231
by(mtcars$mpg,mtcars$am,sd)
## mtcars$am: Automatic
## [1] 3.833966
## --------------------------------------------------------
## mtcars$am: Manual
## [1] 6.166504
Given the focus of our analysis, we immediately see that the mean MPG is 17.15 miles per gallon for automatic transmissions and 24.39 miles per gallon for manual transmissions.
Interestingly the Automatic mpg data has a standard deviation of 3.83 mpg and the Manual data standard deviation is 6.17 mpg.
While the data may prima facie imply that manual cars are more fuel efficient, further statistical analysis is warranted. It is too simplistic to draw a conclusion from these figures alone, because other variables such as weight and horsepower also affect mpg.
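One way to see why the raw comparison may mislead is to check whether the two transmission groups also differ in other variables. The sketch below is an illustration, not part of the original analysis. It works on a fresh copy of the data (`d`, a name introduced here) so the `mtcars` object prepared above is left untouched, and `group_means` is likewise a hypothetical name.

```r
# Illustrative check: do the transmission groups differ in weight and horsepower?
d <- datasets::mtcars                      # fresh copy; does not modify mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

# Mean weight (in 1000 lb) and mean horsepower per transmission type
group_means <- aggregate(cbind(wt, hp) ~ am, data = d, FUN = mean)
print(group_means)
```

If the groups differ noticeably in weight or horsepower, part of the raw mpg gap may be due to those variables rather than to the transmission itself.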
We begin the formal analysis by assuming that mpg is approximately normally distributed within each transmission group and testing the null hypothesis that the two group means are equal, i.e. that the difference observed above is not significant.
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
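Based on the t-test results (p = 0.0014), we reject the null hypothesis that the mean mpg is the same for manual and automatic transmissions. The 95% confidence interval for the difference in means (automatic minus manual) runs from -11.28 to -3.21 mpg.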
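The t-test rests on the normality assumption stated earlier. A minimal sketch of how that assumption could be checked, and how the comparison could be repeated without it, is given here. This is an addition for illustration rather than part of the original analysis; `d`, `normality` and `rank_test` are names introduced for the example.

```r
# Illustrative check of the t-test's normality assumption
d <- datasets::mtcars                      # fresh copy; does not modify mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

# Shapiro-Wilk p-value for mpg within each transmission group;
# a small p-value would suggest a departure from normality
normality <- tapply(d$mpg, d$am, function(x) shapiro.test(x)$p.value)
print(normality)

# Wilcoxon rank-sum test as a comparison that does not assume normality.
# exact = FALSE requests the normal approximation, since mpg contains ties.
rank_test <- wilcox.test(mpg ~ am, data = d, exact = FALSE)
print(rank_test$p.value)
```

If the rank-based test agrees with the t-test, the conclusion about the unadjusted difference does not hinge on the normality assumption.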
Our initial model includes all variables as predictors of mpg.
model1<-lm(mpg ~ ., data = mtcars)
summary(model1)
However, some variables are not relevant or weaken the model. Consequently, we perform stepwise model selection to choose the predictors for the best model. The step function performs this selection by fitting lm repeatedly, adding and dropping variables (both forward selection and backward elimination) and comparing the candidate models by AIC. This keeps useful variables while omitting ones that do not contribute meaningfully to predicting mpg.
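To make the AIC criterion concrete, the following sketch compares the full model with the stepwise result on the scale that step uses internally. It is an illustration under the same factor conversions as above; `d`, `full`, `selected` and `aic_table` are names introduced here, and `trace = 0` only suppresses the step-by-step log.

```r
# Illustrative comparison of AIC before and after stepwise selection
d <- datasets::mtcars                      # fresh copy; does not modify mtcars
for (v in c("cyl", "vs", "gear", "carb")) d[[v]] <- factor(d[[v]])
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

full <- lm(mpg ~ ., data = d)
selected <- step(full, direction = "both", trace = 0)

# extractAIC() returns c(equivalent degrees of freedom, AIC);
# this is the quantity step() minimises (lower is better)
aic_table <- rbind(full = extractAIC(full), selected = extractAIC(selected))
colnames(aic_table) <- c("edf", "AIC")
print(aic_table)
print(formula(selected))
```

Note that a variable can be retained by AIC without being individually significant at the 5% level, so the selected model should still be inspected with summary.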
bestmodel <- step(model1, direction = "both")
summary(bestmodel)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
anova(bestmodel,model1)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + hp + wt + am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 151.03
## 2 15 120.40 11 30.623 0.3468 0.9588
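The selected model explains approximately 84% of the variability in mpg (adjusted R^2 = 0.84). The ANOVA compares the selected model with the full model; its p-value of 0.9588 is not significant, so we fail to reject the null hypothesis that the additional variables in the full model (disp, drat, qsec, vs, gear and carb) contribute nothing. The smaller model with cyl, hp, wt and am is therefore adequate.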
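To quantify the MPG difference between transmissions with its uncertainty, a confidence interval for the amManual coefficient can be computed. The sketch below refits the selected model on a fresh copy of the data and adds a nested comparison against a transmission-only model. It is an illustration, not part of the original analysis; `d`, `fit`, `base`, `am_ci` and `nested` are names introduced here.

```r
# Illustrative quantification of the adjusted transmission effect
d <- datasets::mtcars                      # fresh copy; does not modify mtcars
d$cyl <- factor(d$cyl)
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ cyl + hp + wt + am, data = d)

# 95% confidence interval for the manual-minus-automatic difference,
# adjusted for cyl, hp and wt
am_ci <- confint(fit, "amManual", level = 0.95)
print(am_ci)

# Nested F-test: do cyl, hp and wt improve on a transmission-only model?
base <- lm(mpg ~ am, data = d)
nested <- anova(base, fit)
print(nested)
```

Because the amManual coefficient has p = 0.206 in the summary above, its 95% interval includes zero: the adjusted estimate of 1.8 mpg is compatible with no difference between transmission types.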
We will now analyse the residual plots of our regression model and also compute some of the regression diagnostics for our model.
par(mfrow=c(2, 2))
plot(bestmodel)
The points in the Residuals vs. Fitted plot appear randomly scattered, which supports the independence condition.
In the Normal Q-Q plot most points fall on the line, suggesting that the residuals are approximately normally distributed. The points in the Scale-Location plot are scattered in a roughly constant band, indicating constant variance. There are some distinct outliers in the top right of the plots.
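The outliers seen in the plots can be examined numerically with standard regression diagnostics. The sketch below computes leverage, influence on the transmission coefficient, and Cook's distance for the selected model. It is a minimal illustration on a fresh copy of the data; `d`, `fit`, `lev`, `infl` and `cook` are names introduced here.

```r
# Illustrative regression diagnostics for the selected model
d <- datasets::mtcars                      # fresh copy; does not modify mtcars
d$cyl <- factor(d$cyl)
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ cyl + hp + wt + am, data = d)

lev  <- hatvalues(fit)                     # leverage of each car
infl <- dfbetas(fit)[, "amManual"]         # standardised influence on the am coefficient
cook <- cooks.distance(fit)                # overall influence on the fit

# The three most extreme cars under each measure
print(head(sort(lev, decreasing = TRUE), 3))
print(head(sort(abs(infl), decreasing = TRUE), 3))
print(head(sort(cook, decreasing = TRUE), 3))
```

Cars that rank highly on several of these measures are the ones worth checking against the labelled points in the residual plots.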
boxplot(mpg ~ am, data = mtcars, col = (c("red","blue")), ylab = "Miles Per Gallon", xlab = "Transmission Type")
pairs(mpg ~ ., data = mtcars)