Instructions

You work for ‘Motor Trend’, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions: is an automatic or a manual transmission better for MPG, and how can the MPG difference between automatic and manual transmissions be quantified?

Executive summary

This project uses a data set of a collection of cars to explore the relationship between a set of variables and miles per gallon (MPG). First, we assess the relationship between miles per gallon and the type of transmission. The results show that manual transmission is associated with 7.2 more miles per gallon than automatic transmission. This simple model, however, has a low R-squared, a measure of goodness of fit. The second model adds two variables to the first, namely weight and quarter-mile time. The goodness of fit increases substantially and both added variables are statistically significant. When including all the variables to assess the relationship between mpg and every other regressor, we found that none of them is statistically significant except wt (and only at the 10% level). We then assess the significance of adding regressors to these models and conclude that the second model, with ‘mpg’ as the dependent variable and ‘am’, ‘wt’ and ‘qsec’ as independent variables, is the preferred one.

Data description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The data are stored in R as ‘mtcars’.

#Look at the top 3 rows of the dataset 
head(mtcars,3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Exploratory data analysis

We first examine the features of the data set and its variables. The ‘summary’ function shows descriptive statistics for each variable, as shown below.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Secondly, the ‘str’ function shows the class of each variable and the size of the data set. As indicated below, the data set contains 11 variables collected on 32 observations.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
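All columns are stored as numeric. For plotting and reporting, one could optionally recode the categorical variables as factors; the sketch below works on a copy (the name ‘mt’ is ours, not part of the original analysis) so that the models later in the report keep the original numeric 0/1 coding of ‘am’.

# Optional recode on a copy ('mt' is a hypothetical name) so the original
# numeric coding used in the regression models below stays intact
mt <- mtcars
mt$am  <- factor(mt$am,  labels = c("Automatic", "Manual"))
mt$vs  <- factor(mt$vs,  labels = c("V-shaped", "Straight"))
mt$cyl <- factor(mt$cyl)
str(mt[, c("am", "vs", "cyl")])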

Data visualization

We visualize a scatter plot of miles per US gallon against weight (1000 lbs), colored by the number of cylinders.

#Load library
library(ggplot2)

#Basic scatter plot: wt on x-axis and mpg on y-axis; treat cyl as a factor so colours are discrete
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 4) +
  xlab("Weight (1000 lbs)") + ylab("Miles per US gallon") +
  labs(color = "Cylinders") +
  ggtitle("Scatter plot: Miles per US gallon vs Weight (1000 lbs)")

Fitting and selection of models

We first assess the relationship of our outcome variable ‘mpg’ with the other predictors. The following correlation matrix shows the correlation coefficients between all pairs of variables.

# Correlation between variables
mcorr <- round(cor(mtcars, method = "pearson"), 2)
# Keep only the lower triangle for readability
upper <- mcorr
upper[upper.tri(mcorr)] <- ""
upper <- as.data.frame(upper)
upper
##        mpg   cyl  disp    hp  drat    wt  qsec    vs   am gear carb
## mpg      1                                                         
## cyl  -0.85     1                                                   
## disp -0.85   0.9     1                                             
## hp   -0.78  0.83  0.79     1                                       
## drat  0.68  -0.7 -0.71 -0.45     1                                 
## wt   -0.87  0.78  0.89  0.66 -0.71     1                           
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17     1                     
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74     1               
## am     0.6 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17    1          
## gear  0.48 -0.49 -0.56 -0.13   0.7 -0.58 -0.21  0.21 0.79    1     
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57 0.06 0.27    1

We first model ‘mpg’ as a function of ‘am’ only.

fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

Results: As the p-values are well below the 5 percent significance level, both the intercept (the mean of the reference group, 0 = automatic transmission) and the second term (the mean difference between group 1 = manual transmission and group 0) are statistically significant. The results reveal that manual transmission is associated with 7.2 more miles per gallon than automatic transmission. One would notice, however, how small the R-squared is (0.36), indicating that this regressor alone does not explain the variation in ‘mpg’ well. So we need to add more variables.
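The R-squared quoted above can be extracted directly from the fitted model; this is a quick supplementary check, not part of the original output.

# Goodness of fit for the simple model (R-squared of about 0.36)
summary(fit1)$r.squared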

Adding more variables

Looking at the correlation matrix, many variables are correlated among themselves. If all of them were included in the model, it would suffer from multicollinearity. We thus identify variables that are highly correlated with ‘mpg’ but only weakly correlated with each other.
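One way to support this choice is to rank the predictors by the absolute value of their correlation with ‘mpg’; the sketch below simply reuses the ‘mcorr’ matrix computed above and is not part of the original output.

# Rank the predictors by the absolute value of their correlation with 'mpg'
sort(abs(mcorr["mpg", -1]), decreasing = TRUE)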

fit2 <- lm(mpg ~ am + wt + qsec, data = mtcars)
summary(fit2)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## am           2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04

Results: The two added variables are statistically significant. As expected, by adding these two variables the coefficient for transmission decreases sharply, partly indicating how biased the model fit1 was when these variables were omitted.
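To quantify the uncertainty around these estimates, confidence intervals could also be reported; this is a supplementary sketch, not part of the original output.

# 95% confidence intervals for the coefficients of model 2
round(confint(fit2), 2)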

We also assess this relationship when we include all the regressors.

# Fit the outcome variable 'mpg' against all other variables
fit3 <- lm(mpg ~ ., data = mtcars)
summary(fit3)$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

Results: By including all the variables, only weight is statistically significant at the 10% level of significance; the other regressors are not. The model estimates a decline of about 3.7 miles per gallon for every 1000 lbs increase in weight.
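The weak individual significance despite strong pairwise correlations is a typical symptom of multicollinearity. Assuming the ‘car’ package is installed, variance inflation factors offer one way to check this; the sketch below is illustrative and not part of the original analysis.

# Variance inflation factors for the full model; values well above 5-10
# signal multicollinearity (assumes the 'car' package is installed)
library(car)
vif(fit3)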

We then use ANOVA to assess the significance of the regressors added in the successive models. The null hypothesis, H0, is that the added regressors are not significant.

anova(fit1, fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + qsec
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 169.29  2    551.61 39.2687 8.025e-08 ***
## 3     21 147.49  7     21.79  0.4432    0.8636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results: We can see that adding the two variables ‘wt’ and ‘qsec’ improves the model significantly (p < 0.001), whereas adding the remaining regressors (model 3) does not (p = 0.86). In addition, the R-squared for model 2 is high, about 0.85, showing how much better these variables explain the outcome.
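The R-squared figures cited for model 2 can be extracted directly from the fitted object; this is a quick check, not part of the original output.

# Multiple and adjusted R-squared for model 2
c(r.squared = summary(fit2)$r.squared,
  adj.r.squared = summary(fit2)$adj.r.squared)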

Residual diagnosis

Finally, we assess the residuals of model 2 for normality and inspect the standard residual diagnostic plots.

#Check for normality
shapiro.test(fit2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit2$residuals
## W = 0.9411, p-value = 0.08043

Since the p-value (0.08) is greater than 0.05, we fail to reject the null hypothesis that the residuals are normally distributed.

Residual plots

#Model Residuals 
par(mfrow = c(2, 2))
plot(fit2)
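Beyond the standard plots, leverage and dfbetas could be used to flag influential cars; the sketch below is an additional check, not part of the original report.

# Cars with the highest leverage in model 2
head(sort(hatvalues(fit2), decreasing = TRUE), 3)
# Cars whose removal would change the 'am' coefficient the most
head(sort(abs(dfbetas(fit2)[, "am"]), decreasing = TRUE), 3)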