Executive Summary

In this project, we investigate the relationship between a set of variables and fuel consumption in miles per gallon (MPG), the outcome. After a quick exploratory data analysis, we run a t-test to infer whether there is a significant difference in mileage between transmission types (manual and automatic). The test shows that there is: cars with manual transmission run about 7 MPG more than those with automatic transmission. We then continue with the “mtcars” data set and fit a regression model to explore this relationship further. Several models are tested through a stepwise selection technique. The final model indicates that, holding both weight (wt) and 1/4 mile time (qsec) constant, the apparent increase in mileage associated with manual transmission is reduced to 2.94 MPG.

Brief data description

The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption (mpg) plus 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

  1. mpg => Miles/(US) gallon fuel consumption;
  2. cyl => Number of cylinders;
  3. disp => Displacement (cu.in.);
  4. hp => Gross horsepower;
  5. drat => Rear axle ratio;
  6. wt => Weight (lb/1000);
  7. qsec => 1/4 mile time;
  8. vs => Engine shape (0 = V-shaped, 1 = straight);
  9. am => Transmission (0 = automatic, 1 = manual);
  10. gear => Number of forward gears;
  11. carb => Number of carburetors.

Data Processing

    library(datasets)  
    df <- mtcars
    df[1:4,]  # first 4 rows of the 'mtcars' data
    ##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

    The data set has 32 rows, one for each car in the study.


    Exploratory Data Analysis

    Appendix I gives a summary of all variables together with 6 box plots showing how the continuous variables are distributed.


    Subsetting observations

    To compare the two transmission modes, let’s split the data into two subsets: ‘df_auto’ and ‘df_manual’.

    df_auto <- df[ df$am==0, ]
    df_manual <- df[ df$am==1, ]

    Hypothesis test

    We perform a two-sided t-test on the following hypotheses:

    • H0: mean mpg with automatic transmission is equal to mean mpg with manual transmission
    • Ha: mean mpg with automatic transmission is not equal to mean mpg with manual transmission
    T_test <- t.test(df_auto$mpg, df_manual$mpg)
    print(T_test)
    ## 
    ##  Welch Two Sample t-test
    ## 
    ## data:  df_auto$mpg and df_manual$mpg
    ## t = -3.7671, df = 18.332, p-value = 0.001374
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -11.280194  -3.209684
    ## sample estimates:
    ## mean of x mean of y 
    ##  17.14737  24.39231

    The p-value is 0.00137, which is less than alpha = 0.05 (95% confidence level), so H0 is rejected. Thus, the mean mpg of the am=1 subset (manual) is significantly larger than that of the am=0 subset (auto).

    See Appendix II for a visual comparison. The difference between the mean values of the two subsets is 24.392 - 17.147 = 7.245 mpg.

    Inference => Automatic transmission is not better than manual in terms of mpg.
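The decision rule and the 7.245 mpg difference quoted above can be read directly from the test object. The following is a minimal sketch, assuming the formula interface of `t.test` on the full data frame instead of the two subsets; the names `tt`, `alpha`, `reject_h0`, `group_means` and `mean_diff` are introduced here for illustration only.

```r
# Welch two-sample t-test via the formula interface (groups: am = 0 vs am = 1)
tt <- t.test(mpg ~ am, data = mtcars)

alpha <- 0.05                      # significance level (95% confidence)
reject_h0 <- tt$p.value < alpha    # TRUE means H0 is rejected

# Difference in group means: manual minus automatic
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
mean_diff <- unname(group_means["1"] - group_means["0"])

print(tt$p.value)
print(tt$conf.int)
print(mean_diff)
```

The confidence interval is reported for automatic minus manual, which is why both bounds are negative.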


    Regression Analysis

    It seems that mpg is better with manual transmission than with automatic. However, this conclusion assumes that all other features are similarly distributed in both groups (e.g. that automatic and manual cars have the same distribution of weight, displacement or number of cylinders). This is worth investigating more deeply with regression models. Let’s do it!
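One quick way to probe that premise is to compare a few design variables across the two transmission groups. This is only a sketch: the choice of `wt`, `disp` and `cyl` follows the examples named above, and `by_am` and `wt_mpg_cor` are illustrative names.

```r
# Mean weight, displacement and cylinder count by transmission type
# (am: 0 = automatic, 1 = manual)
by_am <- aggregate(cbind(wt, disp, cyl) ~ am, data = mtcars, FUN = mean)
print(by_am)

# Correlation of weight with mpg: a strong correlation means that a
# weight imbalance between the groups can distort the raw mpg comparison
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
print(wt_mpg_cor)
```

If the group means differ noticeably, part of the raw mpg gap may be due to those variables and not to the transmission itself.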

    But which variables should take part in our model to explain the relationship between mpg and transmission type? This problem can be addressed by a technique known as stepwise selection. See more details and the full sequence of steps in Appendix III - Stepwise Regression.

    Having chosen the final model
    "mpg ~ wt + qsec + am"

    we obtain its main statistics to assess it:

    bestfit <- lm(mpg ~ wt + qsec + am, data=mtcars)
    summary(bestfit)
    ## 
    ## Call:
    ## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   9.6178     6.9596   1.382 0.177915    
    ## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
    ## qsec          1.2259     0.2887   4.247 0.000216 ***
    ## am            2.9358     1.4109   2.081 0.046716 *  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 2.459 on 28 degrees of freedom
    ## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
    ## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

    This model includes 3 variables: wt, qsec and am. It explains almost 85% of the variability in mpg (R2 = 0.8497). The coefficients of wt and qsec (beta1 and beta2) are highly statistically significant (p-values below 0.1%), so both are clearly different from zero and should stay in the model.
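To put the 85% figure in context, the chosen model can be set against a transmission-only model. The sketch below assumes a hypothetical baseline `simple` (mpg ~ am), which is not fitted in the text above; its `am` coefficient is the raw difference in group means.

```r
simple  <- lm(mpg ~ am, data = mtcars)              # transmission only
bestfit <- lm(mpg ~ wt + qsec + am, data = mtcars)  # model chosen in the text

# Share of mpg variability explained by each model
r2 <- c(simple = summary(simple)$r.squared,
        bestfit = summary(bestfit)$r.squared)
print(round(r2, 4))

# F-test for the nested models: do wt and qsec add explanatory power?
print(anova(simple, bestfit))
```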

    The model summary shows that, if “wt” and “qsec” (time to cover 1/4 mile) are held constant, switching from automatic (am = 0) to manual (am = 1) transmission adds 2.94 mpg on average; the intercept (9.62) applies to both transmission types and is not part of this effect. The p-value of this coefficient is 0.047, just under the 5% level.
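The meaning of "holding wt and qsec constant" can be illustrated with two predictions that differ only in `am`. This is a sketch under stated assumptions: the two cars (wt = 3, qsec = 18) are hypothetical values chosen for illustration, and `cars`, `pred`, `am_effect` and `am_ci` are names introduced here.

```r
bestfit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Two hypothetical cars with identical weight and 1/4 mile time,
# differing only in transmission type
cars <- data.frame(wt = 3, qsec = 18, am = c(0, 1))
pred <- predict(bestfit, newdata = cars)
am_effect <- unname(pred[2] - pred[1])   # equals the 'am' coefficient

# 95% confidence interval for the 'am' coefficient
am_ci <- confint(bestfit, "am", level = 0.95)

print(am_effect)
print(am_ci)
```

Whatever values of wt and qsec are chosen, the difference between the two predictions is the same, because the model contains no interaction terms.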

    In Appendix IV we plot the residuals and diagnostics for this best-fit model. The residuals appear to be reasonably close to a normal distribution.

    Conclusion

    The main questions for this study can be answered this way:

    1. “Is an automatic or manual transmission better for MPG?”
      A.: Manual transmission is better: higher mileage, less fuel consumption.
    2. “Can the MPG difference between automatic and manual transmissions be quantified?”
      A.: On average, cars with manual transmission run 7.245 MPG more than automatic ones. But once other variables such as weight and qsec are taken into account, this difference decreases considerably, to 2.94 MPG.

    We obtained a regression model relating mpg to three car variables. Still, it is difficult to choose and interpret the best model precisely in this study: 32 records are too few to discard outliers that could distort the regression fit and inflate the unexplained (residual) part. A good way to continue this analysis is to collect more observations.

    APPENDIX - auxiliary plots and steps to analysis

    ** I. BOX PLOTS OF CONTINUOUS VARIABLES **

    ##       mpg             cyl             disp             hp       
    ##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
    ##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
    ##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
    ##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
    ##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
    ##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
    ##       drat             wt             qsec             vs        
    ##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
    ##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
    ##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
    ##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
    ##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
    ##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
    ##        am              gear            carb      
    ##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
    ##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
    ##  Median :0.0000   Median :4.000   Median :2.000  
    ##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
    ##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
    ##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

    (miles per gallon, displacement, horsepower, rear axle ratio, weight and 1/4 mile time)
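The code behind this appendix is not shown in the report; a minimal sketch that would produce a comparable summary and six box plots could look as follows. The layout (2 x 3 panels) and the name `cont_vars` are assumptions, and the plots are written to a temporary PDF only so that the sketch also runs non-interactively.

```r
# Six continuous variables shown in the box plots
cont_vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# Numerical summary of every column
print(summary(mtcars))

# One box plot per continuous variable
pdf(file = tempfile(fileext = ".pdf"))
par(mfrow = c(2, 3))
for (v in cont_vars) {
  boxplot(mtcars[[v]], main = v)
}
invisible(dev.off())
```

In an interactive session the `pdf()` and `dev.off()` lines can be dropped to draw on the screen device.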

    ** II. BOX PLOT OF mpg X transmission mode **
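The plotting code is not included in the report; one possible sketch of a box plot of mpg by transmission mode is given below. The factor labels, axis titles and the names `trans`, `bp` and `medians` are illustrative choices, not the author's.

```r
# Label transmission types for readable axis ticks
trans <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Draw to a temporary PDF so the sketch also runs non-interactively
pdf(file = tempfile(fileext = ".pdf"))
bp <- boxplot(mtcars$mpg ~ trans, xlab = "Transmission", ylab = "Miles per gallon")
invisible(dev.off())

# Median mpg per group, taken from the box plot statistics
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```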


    ** III. STEPWISE REGRESSION ITERATIONS **

    Selecting a subset of predictor variables from a larger set is a controversial topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package. stepAIC( ) performs stepwise model selection by exact AIC (source: http://www.statmethods.net/stats/regression.html).

    The full sequence of iterations up to the final selection is shown below:

    ## Start:  AIC=70.9
    ## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
    ## 
    ##        Df Sum of Sq    RSS    AIC
    ## - cyl   1    0.0799 147.57 68.915
    ## - vs    1    0.1601 147.66 68.932
    ## - carb  1    0.4067 147.90 68.986
    ## - gear  1    1.3531 148.85 69.190
    ## - drat  1    1.6270 149.12 69.249
    ## - disp  1    3.9167 151.41 69.736
    ## - hp    1    6.8399 154.33 70.348
    ## - qsec  1    8.8641 156.36 70.765
    ## <none>              147.49 70.898
    ## - am    1   10.5467 158.04 71.108
    ## - wt    1   27.0144 174.51 74.280
    ## 
    ## Step:  AIC=68.92
    ## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
    ## 
    ##        Df Sum of Sq    RSS    AIC
    ## - vs    1    0.2685 147.84 66.973
    ## - carb  1    0.5201 148.09 67.028
    ## - gear  1    1.8211 149.40 67.308
    ## - drat  1    1.9826 149.56 67.342
    ## - disp  1    3.9009 151.47 67.750
    ## - hp    1    7.3632 154.94 68.473
    ## <none>              147.57 68.915
    ## - qsec  1   10.0933 157.67 69.032
    ## - am    1   11.8359 159.41 69.384
    ## - wt    1   27.0280 174.60 72.297
    ## 
    ## Step:  AIC=66.97
    ## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
    ## 
    ##        Df Sum of Sq    RSS    AIC
    ## - carb  1    0.6855 148.53 65.121
    ## - gear  1    2.1437 149.99 65.434
    ## - drat  1    2.2139 150.06 65.449
    ## - disp  1    3.6467 151.49 65.753
    ## - hp    1    7.1060 154.95 66.475
    ## <none>              147.84 66.973
    ## - am    1   11.5694 159.41 67.384
    ## - qsec  1   15.6830 163.53 68.200
    ## - wt    1   27.3799 175.22 70.410
    ## 
    ## Step:  AIC=65.12
    ## mpg ~ disp + hp + drat + wt + qsec + am + gear
    ## 
    ##        Df Sum of Sq    RSS    AIC
    ## - gear  1     1.565 150.09 63.457
    ## - drat  1     1.932 150.46 63.535
    ## <none>              148.53 65.121
    ## - disp  1    10.110 158.64 65.229
    ## - am    1    12.323 160.85 65.672
    ## - hp    1    14.826 163.35 66.166
    ## - qsec  1    26.408 174.94 68.358
    ## - wt    1    69.127 217.66 75.350
    ## 
    ## Step:  AIC=63.46
    ## mpg ~ disp + hp + drat + wt + qsec + am
    ## 
    ##        Df Sum of Sq    RSS    AIC
    ## - drat  1     3.345 153.44 62.162
    ## - disp  1     8.545 158.64 63.229
    ## <none>              150.09 63.457
    ## - hp    1    13.285 163.38 64.171
    ## - am    1    20.036 170.13 65.466
    ## - qsec  1    25.574 175.67 66.491
    ## - wt    1    67.572 217.66 73.351
    ## 
    ## Step:  AIC=62.16
    ## mpg ~ disp + hp + wt + qsec + am
    ## 
    ##        Df Sum of Sq    RSS    AIC
    ## - disp  1     6.629 160.07 61.515
    ## <none>              153.44 62.162
    ## - hp    1    12.572 166.01 62.682
    ## - qsec  1    26.470 179.91 65.255
    ## - am    1    32.198 185.63 66.258
    ## - wt    1    69.043 222.48 72.051
    ## 
    ## Step:  AIC=61.52
    ## mpg ~ hp + wt + qsec + am
    ## 
    ##        Df Sum of Sq    RSS    AIC
    ## - hp    1     9.219 169.29 61.307
    ## <none>              160.07 61.515
    ## - qsec  1    20.225 180.29 63.323
    ## - am    1    25.993 186.06 64.331
    ## - wt    1    78.494 238.56 72.284
    ## 
    ## Step:  AIC=61.31
    ## mpg ~ wt + qsec + am
    ## 
    ##        Df Sum of Sq    RSS    AIC
    ## <none>              169.29 61.307
    ## - am    1    26.178 195.46 63.908
    ## - qsec  1   109.034 278.32 75.217
    ## - wt    1   183.347 352.63 82.790
    ## Stepwise Model Path 
    ## Analysis of Deviance Table
    ## 
    ## Initial Model:
    ## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
    ## 
    ## Final Model:
    ## mpg ~ wt + qsec + am
    ## 
    ## 
    ##     Step Df   Deviance Resid. Df Resid. Dev      AIC
    ## 1                             21   147.4944 70.89774
    ## 2  - cyl  1 0.07987121        22   147.5743 68.91507
    ## 3   - vs  1 0.26852280        23   147.8428 66.97324
    ## 4 - carb  1 0.68546077        24   148.5283 65.12126
    ## 5 - gear  1 1.56497053        25   150.0933 63.45667
    ## 6 - drat  1 3.34455117        26   153.4378 62.16190
    ## 7 - disp  1 6.62865369        27   160.0665 61.51530
    ## 8   - hp  1 9.21946935        28   169.2859 61.30730
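The call that produced this trace is not shown. The sketch below swaps in `step()` from the base `stats` package instead of `stepAIC()` so that no extra package is needed, and assumes backward elimination from the full model, since the trace lists only removals. The names `full` and `selected` are illustrative.

```r
# Full model with all ten predictors
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step output
selected <- step(full, direction = "backward", trace = 0)

print(formula(selected))
print(selected$anova)   # one row per removed variable
```

Setting `trace = 1` prints the intermediate tables step by step.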

    ** IV. RESIDUALS AND DIAGNOSTICS FOR FINAL REGRESSION MODEL **

    1. The Residuals vs. Fitted plot shows no consistent pattern, which supports the independence assumption.
    2. The Normal Q-Q plot indicates that the residuals are approximately normally distributed, because the points lie close to the line.
    3. The Scale-Location plot supports the constant variance assumption, as the points are randomly scattered.
    4. The Residuals vs. Leverage plot suggests that no influential outliers are present, as all points fall well within the 0.5 Cook's distance bands.
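The four plots can be complemented with numerical checks. The following sketch is not part of the original analysis: it assumes a Shapiro-Wilk test as one possible normality check and uses Cook's distance to rank influential cars; `sw` and `cd` are illustrative names, and the plots go to a temporary PDF only so the code runs non-interactively.

```r
bestfit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Shapiro-Wilk test on the residuals (H0: residuals are normally distributed)
sw <- shapiro.test(residuals(bestfit))
print(sw)

# Cook's distance per car; the largest values mark the most influential cars
cd <- cooks.distance(bestfit)
print(head(sort(cd, decreasing = TRUE), 3))

# The four standard diagnostic plots for the fitted model
pdf(file = tempfile(fileext = ".pdf"))
par(mfrow = c(2, 2))
plot(bestfit)
invisible(dev.off())
```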