Executive summary

Most car manufacturers today compete to make their cars more efficient, i.e. to increase the mileage, measured in miles per gallon (mpg). This document analyzes the relationship between mileage and other properties of a car, with a focus on transmission: whether a car being automatic or manual makes a difference in mileage. The dataset used is the mtcars dataset from the datasets package in R.

Exploratory data analysis

Basic exploration of the dataset shows that manual cars (1) have a higher mean mileage than automatic cars (0).

##        0        1 
## 17.14737 24.39231
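The group means above can be reproduced with a short sketch like the following. It assumes the unmodified built-in mtcars data frame, where am is coded 0 for automatic and 1 for manual; the name mean_mpg_by_am is only illustrative and does not come from the original analysis.

```r
# Mean mpg per transmission type in the built-in mtcars data.
# Assumption: am is coded 0 = automatic, 1 = manual (as in the text above).
data(mtcars)

# tapply splits mpg by the values of am and applies mean to each group.
mean_mpg_by_am <- tapply(mtcars$mpg, mtcars$am, mean)
print(mean_mpg_by_am)
```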

This difference alone does not imply that the transmission type is what gives manual cars a better mpg, as there could be other factors at play. The relationships between some of the factors that could affect mpg are given below.

Regression analysis

We now try to analyze which of the above factors could be responsible for the mpg difference and to what extent.

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The single-variable linear model with am as the predictor and mpg as the outcome shows that transmission type is statistically significant (p < 0.05) and that manual cars have a higher mileage than automatic cars (7.245 mpg higher on average). However, this model explains only about 36% of the variance in mpg, as shown by the R-squared value of 0.3598.
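A minimal sketch of this single-predictor fit is given below. It uses the built-in mtcars data frame directly, on the assumption that the mtcars1 object in the output above holds the same mpg and am columns; the names fit_am and r2_am are illustrative.

```r
# Simple linear regression of mpg on transmission type.
# Assumption: mtcars stands in for the mtcars1 data frame used in the report.
data(mtcars)

fit_am <- lm(mpg ~ am, data = mtcars)

# The am coefficient is the difference in mean mpg between manual and automatic cars.
print(coef(fit_am))

# Share of the variance in mpg explained by transmission type alone.
r2_am <- summary(fit_am)$r.squared
print(r2_am)
```

With a single 0/1 predictor, the intercept is the mean mpg of the automatic group and the slope is the gap between the two group means, which is why the estimates line up with the exploratory table.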

Thus we include more variables that could be responsible.

## 
## Call:
## lm(formula = mpg ~ am + disp + hp + wt + cyl, data = mtcars1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5952 -1.5864 -0.7157  1.2821  5.5725 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.20280    3.66910  10.412 9.08e-11 ***
## am           1.55649    1.44054   1.080  0.28984    
## disp         0.01226    0.01171   1.047  0.30472    
## hp          -0.02796    0.01392  -2.008  0.05510 .  
## wt          -3.30262    1.13364  -2.913  0.00726 ** 
## cyl         -1.10638    0.67636  -1.636  0.11393    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared:  0.8551, Adjusted R-squared:  0.8273 
## F-statistic:  30.7 on 5 and 26 DF,  p-value: 4.029e-10

Variance inflation factors

##        am      disp        hp        wt       cyl 
##  2.553064 10.401420  4.501859  6.079452  7.209456

The variance inflation factors show that some of the variables used are collinear: disp (10.40) and cyl (7.21) have the highest values. These two variables are therefore removed.
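The report does not say which function produced the variance inflation factors. As a sketch of what the statistic measures, the VIF of a predictor can be computed in base R as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The code below assumes the built-in mtcars data and numeric coding for every predictor; vif_manual is an illustrative name.

```r
# Variance inflation factors computed by hand for the five-predictor model.
# Assumption: all predictors are numeric columns of the built-in mtcars data.
data(mtcars)

predictors <- c("am", "disp", "hp", "wt", "cyl")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression: predictor p explained by the remaining predictors.
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))
```

A large value means the predictor is almost a linear combination of the others, which inflates the standard error of its coefficient.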

## 
## Call:
## lm(formula = mpg ~ am + hp + wt, data = mtcars1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## am           2.083710   1.376420   1.514 0.141268    
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## wt          -2.878575   0.904971  -3.181 0.003574 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

Variance inflation factors

##       am       hp       wt 
## 2.271082 2.088124 3.774838

The variance inflation factors are now all below 4, so the remaining variables are much less collinear. However, the residuals vs fitted plot (refer to the top left plot) shows some heteroskedasticity: the variance increases along the x axis and the red line is curved. This is addressed using the Box-Cox transformation.
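The report does not show how the transformed outcome dist_new was built, so the following is only a sketch of one common way to apply a Box-Cox transformation, not the author's exact procedure. It picks the exponent lambda that maximizes the profile log-likelihood over a grid (the same quantity that MASS::boxcox evaluates) and refits the model on the transformed mpg. The grid range, the helper box_cox, and the names lambda_hat, mpg_bc and fit_bc are all assumptions.

```r
# Box-Cox transformation of mpg for the model mpg ~ am + hp + wt.
# Assumptions: built-in mtcars data; lambda searched on a grid from -2 to 2.
data(mtcars)

lambdas <- seq(-2, 2, by = 0.01)
n <- nrow(mtcars)
sum_log_y <- sum(log(mtcars$mpg))

# Box-Cox transform: log(y) when lambda is zero, (y^lambda - 1) / lambda otherwise.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of each lambda (up to an additive constant).
profile_loglik <- sapply(lambdas, function(lambda) {
  y_t <- box_cox(mtcars$mpg, lambda)
  rss <- sum(resid(lm(y_t ~ am + hp + wt, data = mtcars))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum_log_y
})

lambda_hat <- lambdas[which.max(profile_loglik)]
print(lambda_hat)

# Refit the model on the transformed outcome.
mpg_bc <- box_cox(mtcars$mpg, lambda_hat)
fit_bc <- lm(mpg_bc ~ am + hp + wt, data = mtcars)
print(summary(fit_bc)$coefficients)
```

Coefficients from the refitted model are on the transformed scale, so they are not directly comparable in size to the mpg-scale estimates above.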

## 
## Call:
## lm(formula = dist_new ~ am + hp + wt, data = mtcars1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17137 -0.06955 -0.03865  0.07218  0.26567 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.7491397  0.1165798  32.159  < 2e-16 ***
## am           0.0516749  0.0607202   0.851 0.401970    
## hp          -0.0016850  0.0004237  -3.976 0.000448 ***
## wt          -0.1757558  0.0399224  -4.402 0.000142 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1119 on 28 degrees of freedom
## Multiple R-squared:  0.8724, Adjusted R-squared:  0.8587 
## F-statistic: 63.79 on 3 and 28 DF,  p-value: 1.24e-12

After the transformation, the top left plot shows less heteroskedasticity than before (the red line is less curved).

Results

The single-variable linear model with am as the predictor and mpg as the outcome shows that transmission type is statistically significant (p < 0.05) and that manual cars have a higher mileage than automatic cars (7.245 mpg higher on average), but this model explains only about 36% of the variance in mpg (R-squared = 0.3598). When horsepower (hp) and weight (wt) are included in the model, the transmission type (am) is no longer statistically significant (p = 0.14, and p = 0.40 after the Box-Cox transformation), whereas the other two variables are (p < 0.05). Thus, once horsepower and weight are accounted for, whether a car is automatic or manual is not a major determinant of its mileage.
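The comparison behind this conclusion can be sketched by extracting the p-value of the am coefficient from both models. This again assumes the built-in mtcars data in place of mtcars1; the helper p_am and the variable names are illustrative.

```r
# p-value of the am coefficient with and without adjusting for hp and wt.
# Assumption: mtcars stands in for the mtcars1 data frame used in the report.
data(mtcars)

# Hypothetical helper: fit the model and return the p-value for am.
p_am <- function(formula) {
  summary(lm(formula, data = mtcars))$coefficients["am", "Pr(>|t|)"]
}

p_simple <- p_am(mpg ~ am)
p_adjusted <- p_am(mpg ~ am + hp + wt)

print(c(simple = p_simple, adjusted = p_adjusted))
```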