Regression_Model

Executive Summary

Motor Trend is a magazine about the automobile industry. It is interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome), particularly:

“Is an automatic or manual transmission better for MPG”
“Quantify the MPG difference between automatic and manual transmissions”

Project progression:

This is a linear regression project. To answer these questions, I use standard statistical analyses to verify, quantify, and justify my model selection. At each step I offer some statistical inference and an immediate conclusion about whether the model is a good fit.

At the outset, I did some basic exploratory data analysis (EDA), slicing and dicing the ‘mtcars’ dataset. The data manipulation converts the ‘am’ variable into a factor with two levels (Auto, Manual), as per the project instructions.

In the regression model summary, I analyze the summary output in as much detail as possible to justify that manual transmission holds a mileage advantage. I also did a simple residual analysis to check the model fit.

In addition, I fitted some multivariable models with variable adjustment and interaction to check whether any of the other models shows a larger mileage gain than my (fit01) model. Finally, I used the ‘anova’ function to support lm(mpg ~ am) as the right model for the project questions.

Project Question criteria and report writing instruction

Load the ‘mtcars’ data set and implement some exploratory data analysis. Design a regression model and execute some detail statistical analysis.

Our linear model analysis should adhere to these instructed criteria:

Interpreting the coefficients and slopes correctly.
Doing some basic, relevant exploratory data analyses.
Fitting some multivariable linear models and explaining the reasoning for model selection.
Portraying a residual plot with some diagnostic analysis.
Quantifying uncertainty in the models' inferential conclusions and/or performing an inference correctly.
Answering the questions of interest, or detailing why the question(s) is (are) not answerable.

Your report should:

Include an executive summary about project design progression.
Be written as a PDF printout compiled (using knitr) from an R Markdown document.
Be concise, roughly the equivalent of 2 pages or less for the main text.
Include supporting figures in an appendix, up to 5 pages in total.

1. EDA: Exploratory Data Analysis

# loading 'mtcars' data set
data(mtcars)

# a brief data display
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# displaying 'mtcars' data summary
summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

# data dimension
dim(mtcars)

## [1] 32 11

# summarizing 'mpg' values based on list-factor(auto/manual) transmission only
by(mtcars$mpg, INDICES = list(mtcars$am), summary)

## : 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.95   17.30   17.15   19.20   24.40 
## -------------------------------------------------------- 
## : 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   21.00   22.80   24.39   30.40   33.90

2. DM: Data manipulation with ‘t-test’

library(dplyr)

# factoring 'am' variable elements of 'mtcars' datasets
summary(mtcars$am <- factor(mtcars$am))

##  0  1 
## 19 13

# creating new levels with factored 'am-variable' data elements 
levels(mtcars$am) <- c("Auto", "Manual")

# quick view of the new 'level-set'
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs     am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0 Manual    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0 Manual    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1 Manual    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1   Auto    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0   Auto    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1   Auto    3    1

# separating 'auto' levels only into a new 'level-set'
Auto_Data <- mtcars[mtcars$am == "Auto",]

# separating only 'manual' levels into new 'level-set' Manual_Data
Manual_Data <- mtcars[mtcars$am == "Manual",]

# separating 'mpg' mean by 'Auto' and 'Manual' level
summarise(group_by(mtcars, am), mn = mean(mpg))

## # A tibble: 2 x 2
##       am       mn
##   <fctr>    <dbl>
## 1   Auto 17.14737
## 2 Manual 24.39231

# doing t-test for verifying level-mpg-mean values
t.test(Auto_Data$mpg, Manual_Data$mpg)

## 
##  Welch Two Sample t-test
## 
## data:  Auto_Data$mpg and Manual_Data$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

We run a two-sample t-test on ‘mpg’ for the ‘Auto’ and ‘Manual’ groups to check whether the difference between the two group means is statistically significant.

t-test analysis:

Our 95% confidence interval (-11.280194 to -3.209684) does not contain zero; it covers only negative values.
The p-value = 0.001374 is close to zero (0.001374 < 0.05), so the difference is statistically significant at the 0.05 level.
We can reject the null hypothesis [ auto_mean == manual_mean ] at the 0.05 level.
Also, Auto_mean = 17.15 < Manual_mean = 24.39, so the direction of the difference favors manual transmission.
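
The Welch statistic reported above can be rebuilt from the group means and variances. The following is a minimal sketch, assuming the standard Welch formulas; the names `mt`, `auto`, `manual`, `se_diff`, `t_stat`, `df_w` and `p_val` are illustrative and not part of the original analysis.

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
auto   <- mt$mpg[mt$am == 0]   # automatic cars
manual <- mt$mpg[mt$am == 1]   # manual cars

# Variance of each group mean (unequal variances are allowed)
v_auto   <- var(auto)   / length(auto)
v_manual <- var(manual) / length(manual)
se_diff  <- sqrt(v_auto + v_manual)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(auto) - mean(manual)) / se_diff
df_w   <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = df_w)
c(t = t_stat, df = df_w, p = p_val)
```

The three numbers should agree with the `t`, `df` and `p-value` fields printed by `t.test()` above.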

3. Regression analysis with linear model

My regression model will try to substantiate the project question: “Is an automatic or manual transmission better for MPG”

# designing first linear-model with new level with summary 
fit01 <- lm(mpg ~ am, mtcars)
summary(fit01)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Linear model summary-result analysis:

The intercept 17.147 is the mean mileage for Automatic Transmission. The estimated mean for ‘Manual Transmission’ is intercept plus the slope ( 17.147 + 7.245 ) = 24.392.

Coefficient - Estimate:

The intercept of this model is the expected mileage of a car with automatic transmission, while the slope is the expected change in mileage when moving from automatic to manual transmission.

(Intercept): ‘Auto’ is the reference level, so cars with automatic transmission attain 17.147 MPG on average.
The slope coefficient [ amManual ] indicates that mileage increases by 7.245 MPG for manual transmission.

So we can surmise that ‘Manual’ transmission has better MPG than ‘auto-transmission’.
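
With a single two-level factor as predictor, the coefficients are just group means in disguise. A small sketch of that check, assuming the same Auto/Manual coding as above (`mt`, `fit` and `group_means` are illustrative names):

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))

fit <- lm(mpg ~ am, data = mt)
group_means <- tapply(mt$mpg, mt$am, mean)

# Intercept vs. mean of the reference level 'Auto'
c(intercept = coef(fit)[["(Intercept)"]],
  auto_mean = group_means[["Auto"]])

# Slope vs. difference between the two group means
c(slope     = coef(fit)[["amManual"]],
  mean_diff = group_means[["Manual"]] - group_means[["Auto"]])
```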

Coefficient - t value:

The t-statistics are 15.247 for the intercept and 4.106 for the slope [ amManual ]. Each is the estimate divided by its standard error, and both are far from zero. The slope's t value of 4.106 tests the null hypothesis [ auto == manual ], so we can reject it and conclude [ auto != manual ].

Coefficient - Pr(>|t|):

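The p-values for the intercept [1.13e-15] and the slope [0.000285] are both far below 0.05. The intercept's p-value only tells us that the mean mileage of automatic cars differs from zero; the slope's p-value is the one that answers the project question.

We can infer that the ‘amManual’ effect is statistically significant at the 0.05 level (and also at the 0.001 level).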
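
The t value and Pr(>|t|) columns can be recomputed from the estimates and standard errors. A minimal sketch, assuming the usual two-sided t-test on the residual degrees of freedom (`mt`, `fit`, `est`, `t_by_hand` and `p_by_hand` are illustrative names):

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = mt)

est <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t value = estimate / standard error
t_by_hand <- est[, "Estimate"] / est[, "Std. Error"]

# Two-sided p-value on the residual degrees of freedom
p_by_hand <- 2 * pt(-abs(t_by_hand), df = fit$df.residual)

cbind(t_by_hand, p_by_hand)
```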

Residual - Standard Error:

The residual standard error measures the quality of a linear regression fit. It is the typical amount by which the response (mileage) deviates from the fitted regression line. In this model, the observed mileage deviates from the fitted group mean by approximately 4.902 MPG on average. In other words, given that the mean mileage for ‘Auto’ is 17.147 MPG and the residual standard error is 4.902, a typical deviation is about 29% of that mean (4.902 / 17.147).
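
As a sketch of where this number comes from, the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom (`mt`, `fit`, `rss` and `sigma_by_hand` are illustrative names):

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = mt)

# Residual sum of squares and residual degrees of freedom (n - 2 here)
rss <- sum(resid(fit)^2)
sigma_by_hand <- sqrt(rss / fit$df.residual)

c(by_hand = sigma_by_hand, from_summary = summary(fit)$sigma)
```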

R-squared, Adjusted R-squared:

The R-squared statistic measures how well the model fits the actual data.

In our calculation, multiple R^2 is 0.3598, so roughly 36% of the variance in the response variable (mpg) is explained by the predictor variable am (auto/manual).
Adjusted R^2 is 0.3385.
Both values lie between 0 and 1, and they indicate a moderate fit: transmission type alone leaves about 64% of the variance in mileage unexplained.
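
Both statistics follow from the residual and total sums of squares. A short sketch under the standard definitions (`mt`, `fit`, `rss`, `tss`, `r2` and `adj_r2` are illustrative names):

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = mt)

rss <- sum(resid(fit)^2)                  # unexplained variation
tss <- sum((mt$mpg - mean(mt$mpg))^2)     # total variation around the mean
n   <- nrow(mt)                           # number of cars
k   <- length(coef(fit))                  # coefficients, including intercept

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k)   # penalizes extra coefficients

c(r2 = r2, adj_r2 = adj_r2)
```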

Calculating confidence interval for the ‘Intercept-Slope’ of this model:

# Confidence interval for the intercept (mean of 'Auto') of model fit01
sumCoef <- summary(fit01)$coefficients
sumCoef[1, 1] + c(-1, 1) * qt(0.975, df = fit01$df.residual) * sumCoef[1, 2]

## [1] 14.85062 19.44411

# Now the confidence interval for the 'amManual' slope coefficient
sumCoef[2, 1] + c(-1, 1) * qt(0.975, df = fit01$df.residual) * sumCoef[2, 2]

## [1]  3.64151 10.84837

Analysis: With 95% confidence, switching from ‘auto’ to ‘manual’ transmission increases average mileage by 3.64151 to 10.84837 MPG.

Inference: The whole interval lies above zero, so we can say that manual transmission produces better gas mileage than automatic.
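
The same intervals are available from base R's `confint()`, which avoids indexing the coefficient matrix by hand. A minimal sketch (`mt`, `fit`, `ci` and `manual_ci` are illustrative names):

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = mt)

# 95% confidence intervals for both coefficients in one call
ci <- confint(fit, level = 0.95)
ci

# Hand computation for the slope, for comparison
est <- coef(summary(fit))
manual_ci <- est["amManual", "Estimate"] +
  c(-1, 1) * qt(0.975, df = fit$df.residual) * est["amManual", "Std. Error"]
manual_ci
```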

Residual analysis for model selection

# resid function returns residuals of the linear model(fit01).
residual <- resid(fit01)

# numeric summary of the residuals of model 'fit01'
summary(residual)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -9.3920 -3.0920 -0.2974  0.0000  3.2440  9.5080

Analysis: The quartiles of the residuals are roughly symmetric around zero (-9.3920 and -3.0920 below, 3.2440 and 9.5080 above), and the median (-0.2974) is close to zero. Residuals from a least-squares fit with an intercept must sum to 0, which is why the mean is reported as 0.0000. This confirms the fit was computed correctly; by itself it does not measure how accurate the model is.
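
The zero-sum property can be checked directly on the residuals rather than on their quantiles. A small sketch (`mt`, `fit`, `total` and `by_group` are illustrative names); the sums are zero only up to floating-point rounding:

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ am, data = mt)

# Sum over all 32 residuals
total <- sum(resid(fit))

# With a single factor predictor the residuals also cancel within each group,
# because each fitted value is its group's mean
by_group <- tapply(resid(fit), mt$am, sum)

total
by_group
```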

# Plotting residuals vs fitted values
par(mfrow = c(1, 2))
plot(fitted(fit01), residual, pch = '*', xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)

# Normality of residuals (errors)
qqnorm(residual, pch = '*')
qqline(residual)

Figure 1: Residual plot

Residuals vs. fitted: The model has only two fitted values (the ‘Auto’ and ‘Manual’ group means), so the residuals form two vertical bands, each spread roughly symmetrically above and below the 0-line.

Residual Q-Q plot: The residuals (errors) of our model (fit01) fall roughly on a line in the normal Q-Q plot.

These distributions support the design of our model (fit01).

Plots of the regression model

# drawing plots
library(ggplot2)

# dot plot of mileage by transmission ('am' is already a factor)
ggplot(mtcars, aes(x = am, y = mpg, color = am, shape = am)) +
  geom_point(size = 3) +
  geom_line(colour = "black") +
  xlab("Auto   and   Manual Transmission") +
  ylab("Mileage") +
  ggtitle("mileage distribution by transmission")

Figure 2: Dot plot: lm( mpg ~ am)

The plot shows that ‘Manual’ transmission cars attain higher mileage.

4. Multivariable analysis with nested model testing

We know that omitting variables from a regression may bias the coefficients of interest (unless the included regressors are uncorrelated with the omitted ones). So, to avoid bias, I decided to regress ‘mpg’ on all the other variables of the ‘mtcars’ dataset, regardless of their correlations.

# mpg vs. all relevant regressors variables into a new linear model
fit02 <- lm(mpg ~ ., data = mtcars)
summary(fit02)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## amManual     2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Nested models: Inspecting the coefficient estimates of the ‘fit02’ model, the variables (cyl, hp, wt, carb) have negative slopes, meaning mileage decreases as they increase, with ‘wt’ showing the largest loss (-3.715 MPG per 1000 lbs). The rest of the variables have positive slopes. None of the coefficients is significant at the 0.05 level; ‘wt’ comes closest (p = 0.0633).

# Obtaining residual plot of fitted model ( fit02 )
par(mfrow = c(2,2))
residual02 <- resid(fit02)
plot(fit02)

Figure 3: Detailed residual plot ‘model-fit02’

Analysis: The ‘Residuals vs Fitted’ plot does not show a perfectly smooth residual distribution. The five values in the summary (-3.4506, -1.6044, -0.1196, 1.2193, 4.6271) are quantiles of the residuals, not the residuals themselves, so they need not add up to zero; the residuals of the fit do sum to zero. Our ‘Normal Q-Q’ plot visually shows approximate residual normality. The ‘Scale-Location’ plot shows some sort of linear distribution of residuals. Finally, our ‘Residuals vs Leverage’ plot shows no large outlying data point holding any significant leverage.

Adjustment and interaction between multiple variables with ‘am’:

So I will experiment with some nested linear models that add these variables to the factored ‘am’ regressor. This is called adjustment (adding regressors) and interaction (adding product terms): we add more regressors to the linear model to investigate the role of a third or fourth variable in the relationship with the outcome variable ‘mpg’. These added variables can distort, or confound, the linear relationship between outcome and regressor, and they offer a renewed perspective on possible variable influence.
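
One quick way to see why adjustment matters is to compare the candidate confounders across the two transmission groups. The sketch below uses base R's `aggregate()`; `mt` and `group_profile` are illustrative names. In ‘mtcars’ the automatic cars are heavier on average, so part of the raw mileage gap can come from weight rather than from the transmission itself.

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))

# Mean weight, cylinder count and horsepower per transmission group
group_profile <- aggregate(cbind(wt, cyl, hp) ~ am, data = mt, FUN = mean)
group_profile
```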

# variable adjustment with possible relationship with 'cyl'
fit03 <- lm(mpg ~ am + cyl, data = mtcars)
summary(fit03)$coef

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 34.522443  2.6031842 13.261621 7.694408e-14
## amManual     2.567035  1.2914280  1.987749 5.635445e-02
## cyl         -2.500958  0.3608282 -6.931159 1.284560e-07

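We can see from this model that the ‘cyl’ t-value of -6.93 is far from 0 (p = 1.28e-07), which indicates a strong relationship between ‘mpg’ and ‘cyl’. After adjusting for ‘cyl’, the ‘amManual’ estimate drops from 7.245 to 2.567 MPG and is only borderline significant (p = 0.056).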

# variable adjustment with possible relationship with 'hp' interactive
fit04 <- lm(mpg ~ am + cyl + hp + cyl * hp, data = mtcars)
summary(fit04)$coef

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 44.11301276 6.05912399  7.280427 7.864211e-08
## amManual     3.76083652 1.19924857  3.135994 4.106089e-03
## cyl         -2.91100083 0.94381374 -3.084296 4.668292e-03
## hp          -0.17859201 0.06030039 -2.961706 6.309893e-03
## cyl:hp       0.01854061 0.00769141  2.410560 2.300796e-02

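In this model (fit04) the t-values for [ cyl, hp ] are [ -3.08, -2.96 ], with p-values below 0.01, so both variables are significantly related to mileage, and their interaction (cyl:hp) is significant at the 0.05 level (p = 0.023). The ‘amManual’ estimate is 3.761 MPG and remains significant (p = 0.0041).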

# variable adjustment with possible relationship with 'wt'
fit05 <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)
summary(fit05)$coef

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 36.14653575 3.10478079 11.642218 4.944804e-12
## amManual     1.47804771 1.44114927  1.025603 3.141799e-01
## cyl         -0.74515702 0.58278741 -1.278609 2.119166e-01
## hp          -0.02495106 0.01364614 -1.828433 7.855337e-02
## wt          -2.60648071 0.91983749 -2.833632 8.603218e-03

We added another variable, ‘wt’, to this linear model. The slopes of cyl, hp and wt are all negative, but only ‘wt’ is significant at the 0.05 level (t = -2.83, p = 0.0086). The ‘amManual’ estimate shrinks to 1.478 MPG and is no longer significant (p = 0.314).

# On this model(fit06) we added predictor variable 'qsec' with the long list 
fit06 <- lm(mpg ~ am + cyl + hp + wt + qsec + wt * qsec, data = mtcars)
summary(fit06)$coef

##                 Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) -25.19506547 28.08508859 -0.8970976 0.37822499
## amManual      2.33701788  1.67062488  1.3988885 0.17413091
## cyl          -0.34220286  0.71814858 -0.4765071 0.63785167
## hp           -0.02606609  0.01516391 -1.7189563 0.09798310
## wt           13.97226697  9.64311981  1.4489364 0.15978360
## qsec          3.28252329  1.52838218  2.1477110 0.04162103
## wt:qsec      -0.93236290  0.52242265 -1.7846908 0.08645298

In this model only ‘qsec’ is significant at the 0.05 level (p = 0.042); the ‘wt:qsec’ interaction is not (p = 0.086), and the ‘amManual’ estimate of 2.337 MPG has p = 0.174. The shrinking ‘am’ effect reflects confounding rather than a full ‘Simpson’s paradox’, because the sign of the coefficient never reverses.

Inference: In every adjusted model the ‘amManual’ slope stays positive (1.478 to 3.761 MPG), so manual transmission keeps a mileage gain, but the gain is smaller than the unadjusted 7.245 MPG and is significant at the 0.05 level only in fit04.
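
To compare the transmission effect across the models side by side, the ‘amManual’ row can be pulled out of each fit. A minimal sketch that refits the same five formulas used in this report (`mt`, `formulas` and `am_effect` are illustrative names):

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))

# Same model formulas as fit01 and fit03 to fit06 in this report
formulas <- list(
  fit01 = mpg ~ am,
  fit03 = mpg ~ am + cyl,
  fit04 = mpg ~ am + cyl + hp + cyl * hp,
  fit05 = mpg ~ am + cyl + hp + wt,
  fit06 = mpg ~ am + cyl + hp + wt + qsec + wt * qsec
)

# Estimate and p-value of the 'amManual' coefficient, one row per model
am_effect <- t(sapply(formulas, function(f) {
  coef(summary(lm(f, data = mt)))["amManual", c("Estimate", "Pr(>|t|)")]
}))
round(am_effect, 4)
```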

5. ANOVA - test for multiple-model statistical significance

The ‘anova’ function is useful for comparing two or more nested models. For each model in the sequence, it uses an F-test to check whether the added terms significantly reduce the residual sum of squares.

anova(fit01, fit03, fit04, fit05, fit06)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + hp + cyl * hp
## Model 4: mpg ~ am + cyl + hp + wt
## Model 5: mpg ~ am + cyl + hp + wt + qsec + wt * qsec
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 271.36  1    449.53 79.2791 3.162e-09 ***
## 3     27 181.49  2     89.87  7.9246   0.00216 ** 
## 4     27 170.00  0     11.50                      
## 5     25 141.76  2     28.24  2.4902   0.10322    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

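Analysis: Comparing all five models with the ‘anova’ function, Model 2 (adding ‘cyl’) gives the most significant improvement over Model 1 (p = 3.162e-09), and Model 3 improves further (p = 0.00216). Model 4 has the same residual degrees of freedom as Model 3 and is not nested in it, so no F-test is reported for that row. Model 5 adds no significant improvement (p = 0.10322).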
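
The F value in the Model 2 row can be rebuilt from the residual sums of squares. One detail worth knowing: when several models are passed to `anova()`, the denominator of every F statistic is the residual mean square of the largest model in the list, not of the model in that row. The sketch below shows both variants; `mt`, `m1`, `m2`, `m5`, `rss`, `F_pair` and `F_table` are illustrative names.

```r
# Local copy ('mt' is an illustrative name) so the report's 'mtcars' is untouched
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Auto", "Manual"))

m1 <- lm(mpg ~ am, data = mt)                                       # Model 1
m2 <- lm(mpg ~ am + cyl, data = mt)                                 # Model 2
m5 <- lm(mpg ~ am + cyl + hp + wt + qsec + wt * qsec, data = mt)    # Model 5

rss <- function(m) sum(resid(m)^2)   # residual sum of squares

# Drop in RSS from adding 'cyl' (1 extra coefficient)
ss_gain <- rss(m1) - rss(m2)

# Pairwise test: denominator is Model 2's residual mean square
F_pair  <- (ss_gain / 1) / (rss(m2) / m2$df.residual)

# Multi-model table: denominator is the largest model's residual mean square
F_table <- (ss_gain / 1) / (rss(m5) / m5$df.residual)

c(F_pair = F_pair, F_table = F_table)
```

`F_table` corresponds to the value printed in the Model 2 row of the table above, while `F_pair` is what `anova(m1, m2)` alone would report.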

The most significant improvement comes from ‘fit03 = am + cyl’. Let’s have a box-plot visual of this adjusted Model 2 (fit03).

boxplot(mpg ~ factor(am)+cyl, data=mtcars,  col=c("salmon","dodgerblue2"), xlab="Transmission-Cylinder", ylab="Mileage", main="mileage variation with (am+cyl)")

Figure 4: Box plot of ( mpg ~ am + cyl)

Manual transmission still carries higher mileage in the 4-cylinder combination.

Conclusion:

Throughout these statistical verification steps, ‘Manual’ transmission cars show a significant mileage gain over ‘Auto’ cars. Our t-test, confidence interval and residual analysis offer a clear mileage preference for manual-transmission cars: about 7.245 MPG on average, with a 95% confidence interval of 3.64 to 10.85 MPG.

The multivariable analysis with variable adjustment and interaction points in the same direction, although with a smaller estimated gain (1.478 to 3.761 MPG), supporting that “A manual transmission car is better for MPG, rather than an automatic one.”

Regression_Model_Project

Md Ahmed

June 13th, 2017