This document is part of the Regression Models Course Project (offered by Johns Hopkins University on Coursera).
This project focuses on building regression models, and running their diagnostics, for a data set of automobiles.
This analysis is performed for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG) as the outcome. We are particularly interested in how transmission type (automatic or manual) relates to MPG, and in quantifying the MPG difference between the two.
To answer these questions, both univariate and multivariate regression models were created and evaluated after an exploratory data analysis. The most promising model was chosen by step-wise selection of the variables that contribute most to explaining the difference in MPG between automatic and manual transmission cars. These variables were found to be the weight of the car and its quarter-mile time.
According to the selected model, manual transmission cars cover on average about 2.94 more miles per gallon than automatic transmission cars.
The data were obtained from R (CRAN), and the documentation can be found here.
This dataset consists of 32 observations of 11 variables:
1. mpg - Miles/(US) gallon
2. cyl - Number of cylinders
3. disp - Displacement (cu.in.)
4. hp - Gross horsepower
5. drat - Rear axle ratio
6. wt - Weight (1000 lbs)
7. qsec - 1/4 mile time
8. vs - Engine (0 = V-engine, 1 = straight engine)
9. am - Transmission (0 = automatic, 1 = manual)
10. gear - Number of forward gears
11. carb - Number of carburetors
The table below shows descriptive statistics of the dataset. Naturally we will focus on the mpg and am variables (MPG & Transmission).
mpg | cyl | disp | hp | drat |
---|---|---|---|---|
Min. :10.40 | Min. :4.000 | Min. : 71.1 | Min. : 52.0 | Min. :2.760 |
1st Qu.:15.43 | 1st Qu.:4.000 | 1st Qu.:120.8 | 1st Qu.: 96.5 | 1st Qu.:3.080 |
Median :19.20 | Median :6.000 | Median :196.3 | Median :123.0 | Median :3.695 |
Mean :20.09 | Mean :6.188 | Mean :230.7 | Mean :146.7 | Mean :3.597 |
3rd Qu.:22.80 | 3rd Qu.:8.000 | 3rd Qu.:326.0 | 3rd Qu.:180.0 | 3rd Qu.:3.920 |
Max. :33.90 | Max. :8.000 | Max. :472.0 | Max. :335.0 | Max. :4.930 |
wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|
Min. :1.513 | Min. :14.50 | 0:18 | 0:19 | Min. :3.000 | Min. :1.000 |
1st Qu.:2.581 | 1st Qu.:16.89 | 1:14 | 1:13 | 1st Qu.:3.000 | 1st Qu.:2.000 |
Median :3.325 | Median :17.71 | NA | NA | Median :4.000 | Median :2.000 |
Mean :3.217 | Mean :17.85 | NA | NA | Mean :3.688 | Mean :2.812 |
3rd Qu.:3.610 | 3rd Qu.:18.90 | NA | NA | 3rd Qu.:4.000 | 3rd Qu.:4.000 |
Max. :5.424 | Max. :22.90 | NA | NA | Max. :5.000 | Max. :8.000 |
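A minimal sketch of how the descriptive statistics above could be reproduced. It assumes the data frame called `cars` in this report is a copy of R's built-in `mtcars` data set with `vs` and `am` converted to factors, which is what the counts shown for those two columns suggest; the report itself does not show this preparation step.

```r
# Assumption: `cars` is a copy of the built-in mtcars data with `vs` and `am`
# stored as factors (the summary table shows level counts for these columns).
cars <- mtcars
cars$vs <- factor(cars$vs)
cars$am <- factor(cars$am)

dim(cars)       # number of observations and number of variables
summary(cars)   # per-variable descriptive statistics
table(cars$am)  # cars per transmission type (0 = automatic, 1 = manual)
```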
From preliminary evaluation of the graphs we can see several points:
A basic t-test (displayed below) shows a significant difference in fuel consumption between manual and automatic transmissions. The p-value of the test is 0.00137, which supports the observation seen in the boxplot above. However, further investigation is needed to get a more comprehensive picture; the next section is dedicated to this.
t.statistic | df | p.value | lower.CL | upper.CL | automatic.mean | manual.mean |
---|---|---|---|---|---|---|
-3.767 | 18.332 | 0.001 | -11.28 | -3.21 | 17.147 | 24.392 |
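The test in the table can be sketched as follows. The non-integer degrees of freedom suggest a Welch (unequal variance) two-sample t-test, which is the default of `t.test()`. The data preparation is an assumption, since the report does not show it.

```r
# Assumption: `cars` is mtcars with `am` as a factor (0 = automatic, 1 = manual).
cars <- mtcars
cars$am <- factor(cars$am)

# Welch two-sample t-test of mpg by transmission type (var.equal = FALSE is the default).
tt <- t.test(mpg ~ am, data = cars)

tt$statistic  # t statistic
tt$parameter  # Welch-adjusted degrees of freedom
tt$p.value    # two-sided p-value
tt$conf.int   # 95% confidence interval for (automatic mean - manual mean)
tt$estimate   # group means: automatic, then manual
```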
Linear regression is a basic modeling tool, used here in an attempt to find a connection between MPG and transmission type.
First, we fit a univariate linear regression to see the direct effect of transmission on MPG (assuming that no other variable influences the outcome):
##
## Call:
## lm(formula = mpg ~ am, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
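The output above corresponds to a call of the following form. Only the `lm()` call itself appears in the report; the construction of `cars` is an assumption based on the `am1` coefficient name, which implies that `am` is a factor.

```r
# Assumption: `cars` is mtcars with `am` as a factor, so the coefficient is named `am1`.
cars <- mtcars
cars$am <- factor(cars$am)

fit_am <- lm(mpg ~ am, data = cars)
summary(fit_am)

# The intercept is the mean MPG of automatic cars; am1 is the difference
# in mean MPG between manual and automatic cars.
coef(fit_am)
summary(fit_am)$r.squared
```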
Although this model shows a significant connection between the two variables (MPG by transmission), it has a low R² (0.36), which indicates that only about 36% of the variance is explained by this model. Hence, a better model is required to quantify the MPG difference between automatic and manual transmissions more accurately.
To evaluate which variables contribute to explaining the variance, a step-wise selection was applied:
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 9.617781 | 6.9595930 | 1.381946 | 0.1779152 |
wt | -3.916504 | 0.7112016 | -5.506882 | 0.0000070 |
qsec | 1.225886 | 0.2886696 | 4.246676 | 0.0002162 |
am1 | 2.935837 | 1.4109045 | 2.080819 | 0.0467155 |
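The report does not state which selection procedure or criterion was used. As one possible sketch, the following assumes AIC-based selection with `step()` from base R's `stats` package, starting from the model that contains all ten predictors. The data preparation is likewise an assumption.

```r
# Assumption: `cars` is mtcars with `vs` and `am` as factors.
cars <- mtcars
cars$vs <- factor(cars$vs)
cars$am <- factor(cars$am)

# Start from the full model and let step() add or drop terms by AIC.
full <- lm(mpg ~ ., data = cars)
selected <- step(full, direction = "both", trace = 0)

formula(selected)                # the terms that survive selection
summary(selected)$coefficients   # estimates, standard errors, t values, p-values
```

Setting `trace = 1` prints each elimination step, which helps to see the order in which variables are dropped.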
According to this step-wise selection, two variables contribute the most (in addition to transmission) to explaining the variance: the weight of the car (wt) and the quarter-mile time (qsec). A model that includes these three variables explains approximately 85% of the variance (R² of 0.85).
Therefore, the best model of the connection between MPG and transmission should also include these two variables. Adding further variables would tend to inflate the standard errors of the coefficient estimates, which in turn would reduce their significance.
The model:
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 9.617781 | 6.9595930 | 1.381946 | 0.1779152 |
wt | -3.916504 | 0.7112016 | -5.506882 | 0.0000070 |
qsec | 1.225886 | 0.2886696 | 4.246676 | 0.0002162 |
am1 | 2.935837 | 1.4109045 | 2.080819 | 0.0467155 |
As mentioned above, this model explains 85% of the variance in MPG. According to the am coefficient (transmission), we can conclude that, holding weight and quarter-mile time constant, manual transmission cars cover on average 2.94 more miles per gallon than automatic transmission cars.
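The quoted figures can be checked by fitting the three-variable model directly. This is a sketch under the same assumption about how `cars` was prepared. The confidence interval is an addition for illustration and is not part of the original report.

```r
# Assumption: `cars` is mtcars with `am` as a factor.
cars <- mtcars
cars$am <- factor(cars$am)

fit <- lm(mpg ~ wt + qsec + am, data = cars)

coef(fit)["am1"]           # estimated MPG difference, manual vs automatic
summary(fit)$r.squared     # share of variance in MPG explained by the model
confint(fit)["am1", ]      # 95% confidence interval for the am1 coefficient
```

A p-value just under 0.05 for `am1` means the lower bound of this interval sits close to zero, so the size of the transmission effect is estimated with considerable uncertainty.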
Before accepting the selected model, it is crucial to check its assumptions using the residuals:
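A sketch of how standard residual diagnostics could be produced in base R. The report does not say which plots were used; this assumes the default diagnostic panels that `plot()` provides for `lm` objects.

```r
# Assumption: `cars` is mtcars with `am` as a factor.
cars <- mtcars
cars$am <- factor(cars$am)

fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Four default panels: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous graphics settings

res <- residuals(fit)
summary(res)  # residuals should be roughly centered on zero
```

Roughly speaking, a patternless residuals-vs-fitted panel supports the linearity and constant-variance assumptions, and points close to the line in the Q-Q panel support approximate normality of the errors.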
From the plots created above, several observations can be seen:
This report set out to answer questions about the connection between MPG and transmission type using only linear regression modeling. We found that, on average, manual transmission cars cover more miles per gallon than automatic transmission cars.
However, it is important to mention that other types of modeling and/or exploration of the data may reveal additional insights.