This project explores the relationship between miles per gallon (MPG) and other variables in the mtcars data set. The analysis attempts to answer the following questions:
a) Is an automatic or manual transmission better for MPG?
b) Can we quantify the MPG difference between automatic and manual transmissions?
data("mtcars")
library(car)
dim(mtcars) # no of rows and columns
## [1] 32 11
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
From this we can see that the data set consists of 32 rows (observations) and 11 variables.
All variables are stored as numeric, which is not appropriate for the categorical ones, so the next step is to convert those variables to factors.
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
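As a quick optional check (not part of the original output), one could confirm the conversions by inspecting the column classes:
sapply(mtcars, class)  # cyl, vs, am, gear and carb should now be factors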
Let's check a basic summary of the data; a corresponding plot can be found in the Appendix below.
table(mtcars$am) # Number of automatic and manual transmission vehicles
##
## Automatic Manual
## 19 13
aggregate(mpg ~ am, data = mtcars, mean) # Mean MPG per transmission type
## am mpg
## 1 Automatic 17.14737
## 2 Manual 24.39231
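To gauge the spread around these means, one could also compute the per-group standard deviations (an optional check, not shown in the original output):
aggregate(mpg ~ am, data = mtcars, sd)  # MPG spread per transmission type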
From this we can say that automatic-transmission cars travel fewer miles per gallon on average than manual-transmission cars, which is also confirmed visually by the boxplot in the Appendix.
Below we test whether the difference is statistically significant.
Null hypothesis: there is no difference in mean MPG between the two transmission types.
Auto <- mtcars[mtcars$am == "Automatic",]$mpg
NonAuto <- mtcars[mtcars$am == "Manual",]$mpg
t.test(Auto, NonAuto)
##
## Welch Two Sample t-test
##
## data: Auto and NonAuto
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
The t-test gives a p-value of 0.001374, so we reject the null hypothesis at the 5% significance level and conclude that the difference is significant.
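The same test can be run through the formula interface, which also makes it easy to pull out the p-value and confidence interval programmatically (a minimal equivalent sketch, not part of the original output):
res <- t.test(mpg ~ am, data = mtcars)
res$p.value   # should match the p-value reported above
res$conf.int  # 95% CI for the automatic-minus-manual MPG difference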
We start with a very simple model: mpg regressed on am (transmission type).
fitone <- lm(mpg ~ am, data = mtcars)
summary(fitone)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
This shows that the average MPG for automatic transmissions is 17.1, while manual transmissions average 7.2 MPG higher. The adjusted R-squared is 33.85%, so the model explains only a small share of the variance.
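To attach an interval to that 7.2 MPG estimate, one could inspect the coefficient confidence intervals (an optional check, not shown in the original output):
confint(fitone)  # 95% CIs for the intercept and the amManual coefficient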
Because the simple model explains so little variance, we examine other variables that might be relevant and build a multivariable linear regression.
fitall <- lm(mpg ~ ., data = mtcars)
vif(fitall)
## GVIF Df GVIF^(1/(2*Df))
## cyl 128.120962 2 3.364380
## disp 60.365687 1 7.769536
## hp 28.219577 1 5.312210
## drat 6.809663 1 2.609533
## wt 23.830830 1 4.881683
## qsec 10.790189 1 3.284842
## vs 8.088166 1 2.843970
## am 9.930495 1 3.151269
## gear 50.852311 2 2.670408
## carb 503.211851 5 1.862838
summary(fitall)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
Including all the variables improves the model: it now explains 77.9% of the variance (adjusted R-squared).
Further, from the variance inflation factors and the model summary we can see that some regressors are highly correlated, so we can eliminate them to reduce the standard errors (an automated cross-check is sketched after this list). The variables dropped are:
* disp - Displacement
* vs - Engine shape (V-shaped or straight)
* gear - Number of forward gears
* drat - Rear axle ratio
* carb - Number of carburetors
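As a cross-check on this manual variable elimination, one could let an AIC-based stepwise search choose the regressors automatically (a sketch only, not part of the original analysis; it assumes the factor conversions above have already been applied):
stepfit <- step(lm(mpg ~ ., data = mtcars), direction = "both", trace = FALSE)
formula(stepfit)  # regressors retained by the AIC search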
bestfit <- lm(mpg ~ cyl + hp + wt + qsec + am, data = mtcars)
summary(bestfit)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9511 -1.4244 -0.1767 1.3666 4.2187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.57617 11.27271 1.914 0.0671 .
## cyl6 -1.90950 1.72992 -1.104 0.2802
## cyl8 -0.22716 2.87047 -0.079 0.9376
## hp -0.02481 0.01515 -1.637 0.1141
## wt -2.96274 0.97728 -3.032 0.0056 **
## qsec 0.61917 0.55987 1.106 0.2793
## amManual 2.83270 1.67020 1.696 0.1023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.4 on 25 degrees of freedom
## Multiple R-squared: 0.8721, Adjusted R-squared: 0.8414
## F-statistic: 28.42 on 6 and 25 DF, p-value: 5.196e-10
As expected, the model improves markedly: the adjusted R-squared rises to 84.14% and the residual standard error drops from 2.8 to 2.4.
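To directly quantify the transmission effect under the adjusted model, and to confirm that dropping the correlated regressors reduced collinearity, one could add these optional checks (not shown in the original output):
confint(bestfit)["amManual", ]  # 95% CI for the adjusted manual-vs-automatic MPG difference
vif(bestfit)                    # GVIFs should be much lower than for fitall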
Let's run an ANOVA comparing the three models to see whether they differ significantly from one another.
anova(fitone, bestfit, fitall)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + qsec + am
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 25 143.98 5 576.91 14.3746 2.886e-05 ***
## 3 15 120.40 10 23.58 0.2938 0.972
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This gives a p-value of 2.886e-05, so we can claim that the bestfit model is significantly better than the simple fitone model. Checking the residuals (Appendix - Diagnostic plots), we see they are approximately normally distributed and homoskedastic.
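Beyond the visual check, formal tests could back up the normality and homoskedasticity claims (a sketch of optional checks; ncvTest comes from the already-loaded car package):
shapiro.test(residuals(bestfit))  # Shapiro-Wilk test for residual normality
ncvTest(bestfit)                  # score test for non-constant variance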
This section contains supporting plots for our analysis.
boxplot(mpg ~ am, data = mtcars, col = c("red", "orange"),
        main = "MPG vs Transmission Type", ylab = "Miles Per Gallon", xlab = "Transmission Type")
The boxplot shows a clear difference, consistent with the t-test carried out above.
Let's produce some diagnostic plots for the best model according to our analysis.
par(mfrow = c(2, 2))
plot(bestfit)
These plots indicate that the linear regression assumptions are reasonably met.