Motor Trend is a magazine about the automobile industry. It is interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome), particularly:
Project progression:
This is a linear regression project. In searching for answers to these questions, I use standard statistical analyses to verify, quantify and justify my model selection. At each step I offer some statistical inference and an immediate conclusion about whether my model is a good fit.
At the outset, I did some basic exploratory data analysis (EDA), meaning a little slicing and dicing of the ‘mtcars’ dataset. The data manipulation factors the ‘am’ variable into two levels (Auto, Manual), as per the project instructions.
In my regression model summary, I analyze the summary output in as much detail as possible to justify that manual transmission holds a mileage advantage. I also did a simple residual analysis to check the model fit.
In addition, I did some multivariable model analysis with variable adjustment and interaction to validate that none of the other models offers a better mileage gain than my (fit01) model. Finally, I used the ‘anova’ function to support my lm(mpg ~ am) model as the answer to the project questions.
Load the ‘mtcars’ data set and carry out some exploratory data analysis. Design a regression model and run some detailed statistical analysis.
Our linear model analysis should adhere to these instructed criteria:
Your report should:
# loading 'mtcars' data set
data(mtcars)
# a brief data display
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# displaying 'mtcars' data summary
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
# data dimension
dim(mtcars)
## [1] 32 11
# summarizing 'mpg' values based on list-factor(auto/manual) transmission only
by(mtcars$mpg, INDICES = list(mtcars$am), summary)
## : 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.95 17.30 17.15 19.20 24.40
## --------------------------------------------------------
## : 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 21.00 22.80 24.39 30.40 33.90
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# factoring 'am' variable elements of 'mtcars' datasets
summary(mtcars$am <- factor(mtcars$am))
## 0 1
## 19 13
# creating new levels with factored 'am-variable' data elements
levels(mtcars$am) <- c("Auto", "Manual")
# quick view of the new 'level-set'
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 Manual 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 Manual 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 Manual 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 Auto 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 Auto 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 Auto 3 1
# separating 'auto' levels only into a new 'level-set'
Auto_Data <- mtcars[mtcars$am == "Auto",]
# separating only 'manual' levels into new 'level-set' Manual_Data
Manual_Data <- mtcars[mtcars$am == "Manual",]
# separating 'mpg' mean by 'Auto' and 'Manual' level
summarise(group_by(mtcars, am), mn = mean(mpg))
## # A tibble: 2 x 2
## am mn
## <fctr> <dbl>
## 1 Auto 17.14737
## 2 Manual 24.39231
# doing t-test for verifying level-mpg-mean values
t.test(Auto_Data$mpg, Manual_Data$mpg)
##
## Welch Two Sample t-test
##
## data: Auto_Data$mpg and Manual_Data$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
We run the t-test on the ‘mpg’ values of the ‘auto’ and ‘manual’ groups to examine whether the difference between their means is statistically significant.
t-test analysis:
Our 95% confidence interval (-11.280194 to -3.209684) does not contain zero; it consists entirely of negative values.
The p-value = 0.001374 is close to zero; since 0.001374 < 0.05, the result is statistically significant at the 0.05 level.
We can reject the null hypothesis [ auto_mean == manual_mean ] at the 0.05 level.
Also, Auto_mean = 17.15 < Manual_mean = 24.39, so the direction of the difference favors manual transmission.
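The quantities read off the printed t-test can also be pulled out of the test object directly. The following is a minimal sketch, assuming a hypothetical local copy named `cars` (not part of the original analysis) and using the formula interface of `t.test` instead of the two separate data frames.

```r
# Hypothetical local copy so the 'mtcars' object used elsewhere is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Welch two-sample t-test; 'Auto' is the first factor level,
# so the interval is for mean(Auto) - mean(Manual)
tt <- t.test(mpg ~ am, data = cars)

tt$conf.int   # 95% confidence interval for the difference in means
tt$p.value    # two-sided p-value
tt$estimate   # the two group means

# TRUE when the null hypothesis of equal means is rejected at the 0.05 level
reject_null <- tt$p.value < 0.05
reject_null
```

Working with the stored components avoids retyping rounded numbers from the console output.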
My regression model will try to substantiate the project question: “Is an automatic or manual transmission better for MPG?”
# designing first linear-model with new level with summary
fit01 <- lm(mpg ~ am, mtcars)
summary(fit01)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The intercept, 17.147, is the mean mileage for automatic transmission. The estimated mean for manual transmission is the intercept plus the slope: 17.147 + 7.245 = 24.392.
Coefficient - Estimate:
The intercept in this model is the expected mileage of a car with automatic transmission, while the slope is the expected change in mileage when switching to manual transmission.
So we can surmise that ‘Manual’ transmission has better MPG than ‘Auto’ transmission.
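The link between the coefficients and the group means can be checked numerically. This is a small sketch under the assumption of a hypothetical local copy `cars` and a refitted model `fit`; neither name comes from the original report.

```r
# Hypothetical local copy and refit of the one-predictor model
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = cars)

# Mean mpg per transmission group
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept should equal the 'Auto' mean
intercept <- unname(coef(fit)["(Intercept)"])
# Intercept plus slope should equal the 'Manual' mean
manual_est <- unname(coef(fit)["(Intercept)"] + coef(fit)["amManual"])

all.equal(intercept, unname(group_means["Auto"]))
all.equal(manual_est, unname(group_means["Manual"]))
```

With a single two-level factor as predictor, least squares reproduces the two group means exactly, which is why both comparisons agree.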
Coefficient - t value:
Coefficient - Pr(>|t|):
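The p-value for ‘amManual’ (0.000285) is far below 0.05, so the difference in mean mileage between manual and automatic cars is statistically significant. The intercept's p-value (1.13e-15) only tests whether the automatic mean differs from zero; the model has no separate ‘amAuto’ coefficient.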
Residual - Standard Error:
The residual standard error measures the quality of a linear regression fit. It is the average amount by which the response (mileage) deviates from the regression line. In this model, the actual mileage deviates from the fitted value for its transmission group by approximately 4.902 miles per gallon on average. In other words, given that the mean mileage for automatic cars is 17.147 mpg and that the residual standard error is 4.902, a typical prediction error is about 29% of that mean (4.902 / 17.147).
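The residual standard error can be reproduced from the residuals themselves. Below is a minimal sketch, assuming hypothetical names `cars` and `fit` for a local copy of the data and a refit of the model; it applies the usual formula, the square root of the residual sum of squares divided by the residual degrees of freedom.

```r
# Hypothetical local copy and refit of the one-predictor model
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = cars)

# Residual sum of squares and residual degrees of freedom (n - 2 here)
rss <- sum(resid(fit)^2)
rse <- sqrt(rss / df.residual(fit))

rse                   # computed by hand
summary(fit)$sigma    # value reported by summary()

# Residual standard error relative to the 'Auto' mean (the intercept)
rel_error <- rse / unname(coef(fit)["(Intercept)"])
rel_error
```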
R-squared, Adjusted R-squared:
The R-squared statistic provides a measure of how well the model fits the actual data.
In our calculation, multiple R^2 is 0.3598, so roughly 36% of the variance in the response variable (mpg) can be explained by the predictor variable am (auto/manual).
Adjusted R^2 is 0.3385.
Both values lie between 0 and 1, as any R^2 must. Values of 0.3598 and 0.3385 indicate a moderate association between the two variables: transmission type alone explains about a third of the variation in mpg and leaves the rest unexplained.
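Both R-squared figures can be recomputed from sums of squares. The sketch below assumes hypothetical names `cars` and `fit` and uses the standard definitions: R^2 = 1 - RSS/TSS, and the adjusted version rescales by (n - 1) / (n - p - 1), with p the number of predictors.

```r
# Hypothetical local copy and refit of the one-predictor model
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = cars)

rss <- sum(resid(fit)^2)                      # residual sum of squares
tss <- sum((cars$mpg - mean(cars$mpg))^2)     # total sum of squares

n <- nrow(cars)   # number of observations
p <- 1            # number of predictors (amManual)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```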
# Confidence interval of this model [ fit01 ] for the intercept ('Auto' mean)
sumCoef <- summary(fit01)$coefficients
sumCoef[1, 1] + c(-1, 1) * qt(0.975, df = df.residual(fit01)) * sumCoef[1, 2]
## [1] 14.85062 19.44411
# Now the confidence interval of the 'amManual' slope coefficient
(sumCoef[2, 1] + c(-1, 1) * qt(0.975, df = df.residual(fit01)) * sumCoef[2, 2])
## [1] 3.64151 10.84837
Analysis: With 95% confidence, switching transmission from ‘auto’ to ‘manual’ increases average mileage by between 3.64151 and 10.84837 miles per gallon.
Inference: since this interval lies entirely above zero, manual transmission is associated with better gas mileage than automatic in this model.
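The hand-built intervals can be cross-checked with `confint`, which applies the same t-quantile formula to every coefficient. This is a sketch using hypothetical names `cars` and `fit` for a local copy of the data and a refit of the model.

```r
# Hypothetical local copy and refit of the one-predictor model
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = cars)

# 95% confidence intervals for the intercept and the 'amManual' slope
ci <- confint(fit, level = 0.95)
ci

# The slope interval excludes zero when both bounds are positive
slope_excludes_zero <- all(ci["amManual", ] > 0)
slope_excludes_zero
```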
# resid function returns residuals of the linear model(fit01).
residual <- resid(fit01)
# a numeric summary of the estimated residuals of model 'fit01'
summary(residual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.3920 -3.0920 -0.2974 0.0000 3.2440 9.5080
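Analysis: The mean of the residuals is 0, as it must be for a least-squares fit with an intercept. The quantiles are also nearly symmetric around zero: the negative side (-9.3920 - 3.0920 - 0.2974) adds up to -12.7814 and the positive side (3.2440 + 9.5080) to 12.752, a gap of only -0.0294. This symmetry of the five-number summary is a separate observation from the residuals summing to zero; together they suggest no gross skew in the errors.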
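The sum-to-zero property can be checked on the full residual vector instead of on the summary quantiles. A minimal sketch follows, assuming hypothetical names `cars` and `fit`; the tolerance of 1e-8 is an arbitrary choice to absorb floating-point error.

```r
# Hypothetical local copy and refit of the one-predictor model
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = cars)

res <- resid(fit)

# Sum over all 32 residuals; zero up to floating-point error
res_sum <- sum(res)
abs(res_sum) < 1e-8

# The residuals also sum to zero within each transmission group,
# because each fitted value is simply the group mean
group_sums <- tapply(res, cars$am, sum)
group_sums
```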
# Plotting residuals vs fitted values
par(mfrow = c(1,2))
plot(fitted(fit01), residual, pch='*', xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)
# Normality of residuals(errors)
qqnorm(residual, pch='*')
qqline(residual)
Figure 1: Residual plot
Residual vs. fitted: Residual points are spread roughly symmetrically above and below the 0-line.
Residual Q-Q plot: Our model (fit01) residuals fall roughly on a line in the normal Q-Q plot, which is consistent with approximately normal errors.
These distributions support the design of our model (fit01).
# drawing plots
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
# dotted plot
ggplot(mtcars, aes(x = factor(am), y = mpg, color = factor(am), shape = factor(am))) +
  geom_point(size = 3) +
  geom_line(colour = "black") +
  xlab("Auto and Manual Transmission") +
  ylab("Mileage") +
  ggtitle("Mileage distribution by transmission")
Figure 2: Dot plot: lm( mpg ~ am)
The plot shows that cars with ‘Manual’ transmission tend to get higher mileage.
We know that omitting variables from the regressors may result in bias in the coefficients of interest (unless the regressors are uncorrelated with the omitted ones). So, to avoid bias, I decided to fit a general model of ‘mpg’ against all the other variables of the ‘mtcars’ dataset, regardless of correlations.
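Whether omitted regressors are correlated with ‘am’ can be inspected directly. The sketch below is an illustration only: it uses a hypothetical local copy `cars` in which ‘am’ is still numeric (0 = Auto, 1 = Manual), and the choice of the four variables is an assumption made for the example.

```r
# Hypothetical local copy; 'am' is numeric here (0 = Auto, 1 = Manual)
cars <- datasets::mtcars

# Pearson correlation of 'am' with a few candidate regressors
candidates <- c("cyl", "disp", "hp", "wt")
am_cor <- sapply(cars[candidates], function(v) cor(cars$am, v))

round(am_cor, 2)
```

Correlations that are clearly different from zero suggest that leaving those variables out of the model could bias the ‘am’ coefficient.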
# mpg vs. all relevant regressors variables into a new linear model
fit02 <- lm(mpg ~ ., data = mtcars)
summary(fit02)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## amManual 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
Nested models: Inspecting the ‘Estimate’ column of the ‘fit02’ coefficients, we can see that the variables (cyl, hp, wt, carb) have negative coefficients: mileage decreases as each of them increases, with the other variables held fixed. The rest of the variables have positive coefficients, but none of the coefficients in this model is significant at the 0.05 level.
# Obtaining residual plot of fitted model ( fit02 )
par(mfrow = c(2,2))
residual02 <- resid(fit02)
plot(fit02)
Figure 3: Detailed residual plot ‘model-fit02’
Analysis: The ‘Residuals vs Fitted’ plot does not show a perfectly smooth residual distribution. The residual quantiles of our model (fit02) are slightly asymmetric: (-3.4506 - 1.6044 - 0.1196 + 1.2193 + 4.6271) = 0.6718, although the residuals themselves still sum to zero. Our ‘Normal Q-Q’ plot shows approximately normal residuals. The ‘Scale-Location’ plot shows a roughly linear distribution of residuals. Finally, our ‘Residuals vs Leverage’ plot shows no large outlying data point holding any significant leverage.
Adjustment and interaction between multiple variables with ‘am’:
I will experiment with some nested linear models that add these variables to the factored ‘am’ regressor. This is called adjustment and interaction: adding more regressors to the linear model to investigate the role of a third or fourth variable in the relationship with the outcome variable ‘mpg’. These added variables can distort, or confound, the linear relationship between outcome and regressor, and they offer a renewed perspective on possible variable influence.
# variable adjustment with possible relationship with 'cyl'
fit03 <- lm(mpg ~ am + cyl, data = mtcars)
summary(fit03)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.522443 2.6031842 13.261621 7.694408e-14
## amManual 2.567035 1.2914280 1.987749 5.635445e-02
## cyl -2.500958 0.3608282 -6.931159 1.284560e-07
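We can see from this model that the t-value for ‘cyl’ (-6.93) is far from 0, which indicates a strong negative relationship between ‘cyl’ and ‘mpg’. With ‘cyl’ in the model, the ‘amManual’ estimate drops to 2.57 and its p-value (0.056) is just above the 0.05 level.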
# variable adjustment with possible relationship with 'hp' interactive
fit04 <- lm(mpg ~ am + cyl + hp + cyl * hp, data = mtcars)
summary(fit04)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.11301276 6.05912399 7.280427 7.864211e-08
## amManual 3.76083652 1.19924857 3.135994 4.106089e-03
## cyl -2.91100083 0.94381374 -3.084296 4.668292e-03
## hp -0.17859201 0.06030039 -2.961706 6.309893e-03
## cyl:hp 0.01854061 0.00769141 2.410560 2.300796e-02
In this model (fit04) the t-values for [ cyl, hp ] are [ -3.08, -2.96 ], with p-values below 0.01, so both predictors have a significant negative relationship with mileage. The ‘amManual’ estimate (3.76) remains positive and significant (p = 0.004).
# variable adjustment with possible relationship with 'wt'
fit05 <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)
summary(fit05)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14653575 3.10478079 11.642218 4.944804e-12
## amManual 1.47804771 1.44114927 1.025603 3.141799e-01
## cyl -0.74515702 0.58278741 -1.278609 2.119166e-01
## hp -0.02495106 0.01364614 -1.828433 7.855337e-02
## wt -2.60648071 0.91983749 -2.833632 8.603218e-03
We added another variable, ‘wt’, to this linear model. The estimated slopes and t-values for cyl, hp and wt all stay in negative territory, and only ‘wt’ is significant at the 0.05 level (p = 0.0086). The ‘amManual’ estimate (1.48) is still positive but no longer significant (p = 0.314) after this adjustment.
# On this model(fit06) we added predictor variable 'qsec' with the long list
fit06 <- lm(mpg ~ am + cyl + hp + wt + qsec + wt * qsec, data = mtcars)
summary(fit06)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.19506547 28.08508859 -0.8970976 0.37822499
## amManual 2.33701788 1.67062488 1.3988885 0.17413091
## cyl -0.34220286 0.71814858 -0.4765071 0.63785167
## hp -0.02606609 0.01516391 -1.7189563 0.09798310
## wt 13.97226697 9.64311981 1.4489364 0.15978360
## qsec 3.28252329 1.52838218 2.1477110 0.04162103
## wt:qsec -0.93236290 0.52242265 -1.7846908 0.08645298
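None of these interactions produces a t-value far from 0 for ‘amManual’; in this model only ‘qsec’ is significant at the 0.05 level (p = 0.042). The ‘amManual’ p-value (0.174) is now well above 0.05, so on its own it would not reject the null hypothesis. This shrinking of the transmission effect under adjustment is the kind of confounding associated with ‘Simpson’s paradox’, although here the sign of the effect does not reverse.
Inference: In every adjusted model the ‘amManual’ slope stays positive (between about 1.48 and 3.76), so adding multiple variables to the linear model does not change the direction of the mileage gain, although it reduces its size and, in some models, its significance.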
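To see how the transmission coefficient behaves across the adjusted models, its estimate and p-value can be collected into one table. This is a sketch only: the list name `models`, the copy `cars` and the table `am_effect` are hypothetical, while the formulas are the ones used for fit01 and fit03 to fit06 above.

```r
# Hypothetical local copy of the data with 'am' factored as in the report
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

# Refit the simple model and the four adjusted models
models <- list(
  fit01 = lm(mpg ~ am, data = cars),
  fit03 = lm(mpg ~ am + cyl, data = cars),
  fit04 = lm(mpg ~ am + cyl + hp + cyl * hp, data = cars),
  fit05 = lm(mpg ~ am + cyl + hp + wt, data = cars),
  fit06 = lm(mpg ~ am + cyl + hp + wt + qsec + wt * qsec, data = cars)
)

# One row per model: estimate and p-value of the 'amManual' coefficient
am_effect <- t(sapply(models, function(m) {
  summary(m)$coefficients["amManual", c("Estimate", "Pr(>|t|)")]
}))

round(am_effect, 4)
```

Reading down the `Estimate` column shows the sign of the effect, and the `Pr(>|t|)` column shows in which models it remains significant.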
We know the ‘ANOVA’ test is useful for comparing two or more models for statistical significance. It is conceptually similar to multiple two-sample t-tests.
anova(fit01, fit03, fit04, fit05, fit06)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + hp + cyl * hp
## Model 4: mpg ~ am + cyl + hp + wt
## Model 5: mpg ~ am + cyl + hp + wt + qsec + wt * qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 271.36 1 449.53 79.2791 3.162e-09 ***
## 3 27 181.49 2 89.87 7.9246 0.00216 **
## 4 27 170.00 0 11.50
## 5 25 141.76 2 28.24 2.4902 0.10322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
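Analysis: Comparing fit01 with the four adjusted models using the ‘anova’ function, we see that Model-2 is significant at the highest level (p = 3.162e-09, ‘***’) and Model-3 at the next level (p = 0.00216, ‘**’). Model-4 has the same residual degrees of freedom as Model-3 (Df = 0), so no F-test is reported for it, and Model-5 is not significant (p = 0.10322). The newer adjusted models beyond Model-3 offer no obvious further improvement.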
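The sequential table can be complemented by a pairwise comparison. The sketch below uses hypothetical names (`cars`, `m1` to `m4`) for refits of Models 1 to 4. It runs the F-test for Model 1 against Model 2 alone, and, because Models 3 and 4 have the same residual degrees of freedom and are therefore not nested, it compares those two with `AIC` instead. AIC is a substitute technique introduced here for illustration, not part of the original analysis.

```r
# Hypothetical local copy of the data with 'am' factored as in the report
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Auto", "Manual"))

m1 <- lm(mpg ~ am, data = cars)                       # Model 1
m2 <- lm(mpg ~ am + cyl, data = cars)                 # Model 2
m3 <- lm(mpg ~ am + cyl + hp + cyl * hp, data = cars) # Model 3
m4 <- lm(mpg ~ am + cyl + hp + wt, data = cars)       # Model 4

# F-test for adding 'cyl' to the transmission-only model
cmp <- anova(m1, m2)
cmp

# Models 3 and 4 use the same number of parameters, so neither nests the other
c(m3 = df.residual(m3), m4 = df.residual(m4))

# Information criterion comparison for the non-nested pair (lower is better)
AIC(m3, m4)
```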
The most significant adjustment is ‘fit03 = am + cyl’. Let’s have a box-plot visual of this adjusted Model-2 (fit03).
boxplot(mpg ~ factor(am)+cyl, data=mtcars, col=c("salmon","dodgerblue2"), xlab="Transmission-Cylinder", ylab="Mileage", main="mileage variation with (am+cyl)")
Figure 4: Box plot of ( mpg ~ am + cyl)
Manual transmission still carries higher mileage in the 4-cylinder combination.
Throughout these statistical checks, ‘amManual’ transmission shows a significant mileage gain in comparison to ‘amAuto’ cars in the simple model. Our ‘t-test’, ‘confidence interval’ and residual analysis offer a clear mileage preference for ‘manual-transmission’ cars.
Even the multivariable analysis with variable adjustment and interaction keeps a positive estimate for manual transmission, which supports the conclusion that “A manual transmission car is better for MPG, rather than an automatic one.”