Introduction

In this project I have been tasked with pretending that I work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, the magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in two questions: whether an automatic or a manual transmission is better for MPG, and how large the MPG difference between the two transmission types is.

Exploratory Data Analysis

First I will extract the necessary data.

data(mtcars)
dim(mtcars)
## [1] 32 11
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

According to the R documentation for mtcars (Motor Trend Car Road Tests), the data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). The variables are:

[, 1] mpg   Miles/(US) gallon
[, 2] cyl   Number of cylinders
[, 3] disp  Displacement (cu.in.)
[, 4] hp    Gross horsepower
[, 5] drat  Rear axle ratio
[, 6] wt    Weight (1000 lbs)
[, 7] qsec  1/4 mile time
[, 8] vs    V/S
[, 9] am    Transmission (0 = automatic, 1 = manual)
[,10] gear  Number of forward gears
[,11] carb  Number of carburetors

Now I will construct a boxplot for further analysis.

boxplot(mpg ~ am, data = mtcars, col = (c("pink","green")), 
        xlab = "Automatic (0) vs. Manual (1)",
        ylab="Miles per Gallon", 
        main="Miles per Gallon vs. Transmission")
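As a numeric companion to the boxplot, the group sizes and the mean MPG per transmission type can be tabulated. This is only a minimal sketch: the names `transmission`, `group_n` and `group_mean`, and the labels "Automatic" and "Manual", are illustrative choices and not part of the original analysis.

```r
data(mtcars)

# Label the 0/1 codes so the output is self-explanatory (labels are illustrative)
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("Automatic", "Manual"))

# Number of cars in each transmission group
group_n <- table(transmission)

# Mean MPG in each transmission group
group_mean <- tapply(mtcars$mpg, transmission, mean)

print(group_n)
print(round(group_mean, 3))
```

The two groups are not the same size, which is worth keeping in mind when comparing the boxes.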

Next I will fit a simple linear model of mpg on transmission type (am) and look at its summary.

fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
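For a single 0/1 predictor such as am, a regression like the one above should be equivalent to a pooled-variance two-sample t-test. The sketch below checks this on mtcars; the names `slope`, `mean_diff`, `p_lm` and `p_tt` are illustrative and not part of the original analysis.

```r
data(mtcars)

fit1 <- lm(mpg ~ am, data = mtcars)

# Pooled-variance two-sample t-test (var.equal = TRUE matches the lm assumption
# of a common error variance in both groups)
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# Regression slope versus difference in group means (manual minus automatic)
slope <- unname(coef(fit1)["am"])
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])

# p-values from the two approaches
p_lm <- coef(summary(fit1))["am", "Pr(>|t|)"]
p_tt <- tt$p.value

print(c(slope = slope, mean_diff = mean_diff))
print(c(p_lm = p_lm, p_tt = p_tt))
```

Reading the am coefficient as a difference in group means makes it clear that this first model does not adjust for anything else about the cars.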

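From this summary we see that the model explains about 36% of the variance in mpg (R-squared 0.3598, adjusted R-squared 0.3385). On its own, a manual transmission is associated with 7.245 more MPG than an automatic, whose average is 17.147 MPG. Since transmission type alone leaves most of the variance unexplained, it is necessary to check whether any of the other variables play a large role in the outcome.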

fit2 <- lm(mpg ~ ., data = mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
cor(mtcars)[1,]
##        mpg        cyl       disp         hp       drat         wt 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594 
##       qsec         vs         am       gear       carb 
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251
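To make the correlations easier to compare, they can be sorted by absolute value, since a strong negative correlation is as informative as a strong positive one. This is a small sketch; `cor_mpg`, `ranked` and `top5` are illustrative names.

```r
data(mtcars)

# Correlation of every other variable with mpg (drop mpg's correlation with itself)
cor_mpg <- cor(mtcars)["mpg", -1]

# Rank by absolute value so that sign does not affect the ordering
ranked <- sort(abs(cor_mpg), decreasing = TRUE)
print(round(ranked, 3))

# Names of the five variables most strongly correlated with mpg
top5 <- names(ranked)[1:5]
print(top5)
```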

From the correlations we can see that several other variables are strongly correlated with mpg. In the full model, however, no single coefficient is significant at the 0.05 level, which suggests that the predictors overlap heavily with one another. So, I will fit a linear regression of mpg on am together with the five variables most strongly correlated with mpg (by absolute value) to explore their effect. These variables are cyl, disp, hp, drat, and wt.

fit3 <- lm(mpg ~ am + cyl+ disp + hp + drat + wt, data = mtcars)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ am + cyl + disp + hp + drat + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.437 -1.574 -0.688  1.310  5.551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.04938    7.60553   4.740 7.31e-05 ***
## am           1.37506    1.56866   0.877  0.38906    
## cyl         -1.03335    0.72405  -1.427  0.16590    
## disp         0.01257    0.01195   1.052  0.30307    
## hp          -0.02887    0.01444  -1.999  0.05658 .  
## drat         0.48586    1.49495   0.325  0.74788    
## wt          -3.27472    1.15685  -2.831  0.00903 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.549 on 25 degrees of freedom
## Multiple R-squared:  0.8557, Adjusted R-squared:  0.8211 
## F-statistic: 24.72 on 6 and 25 DF,  p-value: 2.266e-09
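One way to judge whether the extra predictors are worth including is an F-test between the nested models, together with a confidence interval for the adjusted am coefficient. The sketch below refits both models so that it runs on its own; `cmp` and `ci_am` are illustrative names, and the 95% level is an assumption rather than something the analysis specifies.

```r
data(mtcars)

fit1 <- lm(mpg ~ am, data = mtcars)
fit3 <- lm(mpg ~ am + cyl + disp + hp + drat + wt, data = mtcars)

# F-test for the nested models: do the five extra predictors improve the fit?
cmp <- anova(fit1, fit3)
print(cmp)

# 95% confidence interval for the transmission effect after adjustment
ci_am <- confint(fit3, "am", level = 0.95)
print(ci_am)
```

Given the p-value of 0.389 for am in the summary above, the interval for am should include zero.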

Conclusion

par(mfrow = c(2, 2))
plot(fit3)

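Based on the analysis above, this model explains about 86% of the variance in mpg (R-squared 0.8557, adjusted R-squared 0.8211). Once cyl, disp, hp, drat and wt are included, the estimated effect of am drops from 7.245 to 1.375, so these variables account for much of the apparent relationship between mpg and am. Holding them constant, manual transmissions are estimated to get approximately 1.38 MPG more than automatics, although this difference is not statistically significant (p = 0.389).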