Regression Models: Final Project

Instructions

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  • Is an automatic or manual transmission better for MPG?
  • Quantify the MPG difference between automatic and manual transmissions?

Executive Summary: An analysis of the automobile industry

This study examines the relationship between miles per gallon (MPG) and transmission type (automatic or manual) in the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine. Comparing these two variables shows how MPG differs between automatic and manual transmission cars.

Exploratory Data Analysis

  • The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
  • The data consist of 32 observations of 11 variables.
data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am, labels = c("Automatic","Manual"))
  • The following observations are based on the figures plotted in Appendix 1 and Appendix 2.
  • Manual transmission cars tend to have higher MPG than automatic transmission cars.
  • Most of the readings for automatic transmission are around 15-20 MPG.
  • In the pair graph (Appendix 3), the cyl, disp, hp and wt variables appear to be correlated with MPG.

  • The mean MPG for automatic transmission is 17.15 miles/gallon
  • The mean MPG for manual transmission is 24.39 miles/gallon
  • The highest MPG for automatic transmission is 24.4 miles/gallon
  • The highest MPG for manual transmission is 33.9 miles/gallon
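
The group summaries above can be reproduced with a short sketch like the one below. It works on a local copy of the data (the name `mt` is an assumption made for this illustration, so the report's own `mtcars` object is left untouched) and also prints the number of cars in each group.

```r
# Local copy of the data; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Number of cars per transmission type
am_counts <- table(mt$am)

# Mean and maximum MPG within each transmission group
mpg_mean <- tapply(mt$mpg, mt$am, mean)
mpg_max  <- tapply(mt$mpg, mt$am, max)

print(am_counts)
print(round(mpg_mean, 2))
print(mpg_max)
```

Using `tapply()` keeps the group labels attached to the results, so the values can be looked up by name (for example `mpg_mean[["Manual"]]`).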

Statistical inference

We use a two-sample t-test to test the null hypothesis that the MPG values of automatic and manual transmission cars come from the same population.

result <- t.test(mpg ~ am,data = mtcars)
result$p.value
## [1] 0.001373638
result$estimate
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

The p-value is 0.0013736, so we reject the null hypothesis at the 5% level: the mean MPG of automatic and manual transmission cars differs significantly, which indicates that the two groups come from different populations.
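
As a complement to the p-value, the size of the difference and its confidence interval can be read from the same test object. The sketch below is illustrative only: it refits the test on a local copy of the data (`mt` and `welch` are names assumed for this example) and relies on the fact that `t.test()` runs Welch's unequal-variance test by default.

```r
# Local copy of the data; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# t.test() defaults to Welch's test (var.equal = FALSE)
welch <- t.test(mpg ~ am, data = mt)

# Difference in group means, manual minus automatic
mean_diff <- unname(welch$estimate[2] - welch$estimate[1])

# 95% confidence interval reported by t.test(), for (automatic - manual)
ci <- as.numeric(welch$conf.int)

print(mean_diff)
print(ci)
```

Because the interval is reported for automatic minus manual, both of its endpoints are negative when manual cars have the higher mean MPG.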

Model Analysis

  1. Fit the full model by taking into account all the variables
model_full <- lm(mpg ~ .,data = mtcars)
summary(model_full)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vs1          1.93085    2.87126   0.672   0.5115  
## amManual     1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124
  • Residual standard error: 2.833 on 15 degrees of freedom.
  • Adjusted R-squared: 0.779
  • The model explains about 77.9% of the variance of the MPG variable.
  2. Fit the base model, which takes into account only the transmission type variable
model_base <- lm(mpg ~ factor(am), data = mtcars)
summary(model_base)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        17.147      1.125  15.247 1.13e-15 ***
## factor(am)Manual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
  • Residual standard error: 4.902 on 30 degrees of freedom.
  • Adjusted R-squared: 0.3385
  • The model explains about 33.8% of the variance of the MPG variable.
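
With a single two-level factor as the only predictor, the base model's coefficients are just the group means in another form: the intercept is the mean MPG of the reference level (Automatic), and the slope is the difference between the two group means. A minimal sketch of that identity, using a local copy of the data (`mt` and `base_fit` are names assumed for this illustration):

```r
# Local copy of the data; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

base_fit    <- lm(mpg ~ am, data = mt)
group_means <- tapply(mt$mpg, mt$am, mean)

# Intercept = mean MPG of the reference level (Automatic)
intercept <- unname(coef(base_fit)["(Intercept)"])
# Slope = mean(Manual) - mean(Automatic)
slope <- unname(coef(base_fit)["amManual"])

print(c(intercept = intercept, slope = slope))
print(group_means)
```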
  3. Fit the extended model, which takes into account the variables that are likely to be correlated with MPG (from the pair graphs). The variables are cyl, hp, wt and am.
model_extended <- lm(mpg ~ am + hp + wt + cyl, data = mtcars)
summary(model_extended)
## 
## Call:
## lm(formula = mpg ~ am + hp + wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## amManual     1.80921    1.39630   1.296  0.20646    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
  • Residual standard error: 2.41 on 26 degrees of freedom.
  • Adjusted R-squared: 0.8401
  • The model explains about 84.0% of the variance of the MPG variable.

With all three regression models fitted, we choose the third model, the “extended model”, because it has the highest adjusted R-squared.
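
The adjusted R-squared values can be collected side by side rather than read from three separate summaries. The sketch below refits the three models on a local copy of the data; the names `mt`, `fits` and `adj_r2` are assumptions made for this illustration.

```r
# Local copy of the data with the same factor coding as in the report;
# `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) mt[[v]] <- factor(mt[[v]])
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

fits <- list(
  full     = lm(mpg ~ ., data = mt),
  base     = lm(mpg ~ am, data = mt),
  extended = lm(mpg ~ am + hp + wt + cyl, data = mt)
)

# Adjusted R-squared of each model, as a named vector
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
best   <- names(which.max(adj_r2))

print(round(adj_r2, 4))
print(best)
```

Adjusted R-squared penalises extra parameters, which is why the full model can score lower than the extended model even though its raw R-squared is higher.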

We now compare all three models.

anova(model_full,model_base,model_extended)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ factor(am)
## Model 3: mpg ~ am + hp + wt + cyl
##   Res.Df    RSS  Df Sum of Sq       F    Pr(>F)    
## 1     15 120.40                                    
## 2     30 720.90 -15   -600.49  4.9874  0.001759 ** 
## 3     26 151.03   4    569.87 17.7489 1.476e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model_extended)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## amManual     1.80921138 1.39630450  1.295714 2.064597e-01
## hp          -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt          -2.49682942 0.88558779 -2.819404 9.081408e-03
## cyl6        -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8        -2.16367532 2.28425172 -0.947214 3.522509e-01

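The last row of the table compares the extended model (am + hp + wt + cyl) with the base model: adding hp, wt and cyl reduces the residual sum of squares from 720.90 to 151.03, with a p-value of 1.476e-05. We therefore reject the null hypothesis that the confounding variables (cyl, hp and wt) do not improve the fit. Note that in the extended model the coefficient for manual transmission falls to 1.81 MPG and is no longer significant at the 5% level (p = 0.206).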
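
`anova()` compares each model with the one listed before it, so the comparison is easiest to read when the models are nested and listed from smallest to largest. The sketch below compares only the base and extended models in that order, on a local copy of the data (`mt`, `base_fit`, `extended_fit` and `nested` are names assumed for this illustration). Its F statistic differs from the three-model table because that table scales every row by the residual variance of the full model.

```r
# Local copy of the data; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))

base_fit     <- lm(mpg ~ am, data = mt)
extended_fit <- lm(mpg ~ am + hp + wt + cyl, data = mt)

# Smaller model first, larger model second
nested <- anova(base_fit, extended_fit)
print(nested)

# p-value of the F test for adding hp, wt and cyl to the base model
p_added <- nested[["Pr(>F)"]][2]
print(p_added)
```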

Residuals Analysis

We now plot the residual diagnostics for the selected model, the extended model (cyl + hp + wt + am).

par(mfrow = c(2, 2))
plot(model_extended)

Observations:

  • Residuals vs Fitted: The points appear randomly scattered, with no significant pattern.
  • Normal Q−Q: Most of the points lie close to the line, which suggests that the residuals are approximately normally distributed.
  • Scale−Location: The points are randomly scattered, which indicates roughly constant variance.
  • Residuals vs Leverage: There are some interesting points, possibly leverage points.
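
The visual reading of the leverage plot can be backed up numerically. The sketch below is one possible approach, not the method used in the report: it refits the extended model on a local copy of the data (`mt` and `fit` are assumed names) and flags cars whose leverage exceeds 2p/n, a common rule-of-thumb cutoff that is an assumption of this example.

```r
# Local copy of the data; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am + hp + wt + cyl, data = mt)

lev  <- hatvalues(fit)        # leverage of each car
cook <- cooks.distance(fit)   # influence of each car on the fitted coefficients

# Rule-of-thumb cutoff (an assumption, not from the report): 2 * p / n
p <- length(coef(fit))
n <- nrow(mt)
high_leverage <- names(lev)[lev > 2 * p / n]

print(high_leverage)
print(head(sort(cook, decreasing = TRUE), 3))
```

Leverage measures how unusual a car's predictor values are, while Cook's distance measures how much the fit changes when that car is dropped, so the two lists need not agree.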

Conclusion

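Based on the observations from the selected model:

  • Manual transmission cars have higher MPG than automatic transmission cars in this sample. The unadjusted difference is 7.245 MPG (base model, p = 0.000285). After adjusting for hp, wt and cyl, the estimated difference is 1.81 MPG and is not statistically significant (p = 0.206).
  • MPG decreases slightly as hp increases, by about 0.032 MPG per unit of horsepower with the other variables held fixed.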
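
To quantify the uncertainty of the transmission effect in the selected model, a confidence interval for the `amManual` coefficient can be computed with `confint()`. This is a minimal sketch on a local copy of the data (`mt`, `fit`, `am_est` and `am_ci` are names assumed for this illustration).

```r
# Local copy of the data; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am + hp + wt + cyl, data = mt)

# Point estimate: adjusted MPG difference, manual minus automatic
am_est <- unname(coef(fit)["amManual"])

# 95% confidence interval for that difference (1 x 2 matrix)
am_ci <- confint(fit, "amManual", level = 0.95)

print(am_est)
print(am_ci)
```

An interval that contains zero is consistent with the coefficient's p-value being above 0.05 in the model summary.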

Appendixes

Appendix 1: Boxplot of Miles/gallon(MPG) by Transmission type
boxplot(mpg~am, data=mtcars, main="Boxplot of Miles/gallon(MPG) by Transmission type", ylab="Miles/gallon", xlab="Transmission type")

Appendix 2: Point plot of Miles/gallon(MPG) vs Transmission type
library(ggplot2)
ggplot(mtcars, aes(x = factor(am), y = mpg)) + geom_point()

Appendix 3: Pair graphs
pairs(mtcars, panel=panel.smooth, main="Pair Graph of 1974 Motor Trend")