You work for ‘Motor Trend’, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
This project uses a data set of a collection of cars and we are exploring the relationship between a set of variables and miles per gallon (MPG). First, we assess the relationship between miles per gallon and the type of transformation. Results show that manual transmission is 7.2 miles higher than automatic transmission. This model suffers from lower r-squared, a measure of goodness-of-fit. The second model added to the first model two variables namely weight and mile time. The goodness-of-fit increased significantly and the two variables appear to be statistically significant. When including all the varibales to assess the relationship between mpg and all other regressors, we found that all the variables relationship with outcome, mpg, expect wt. We then assess the significance of adding regressors in these models and found that the second model which comprises of ‘mpg’ as dependent variable and ‘am’, ‘wt’ and ‘qsec’ as independent variables.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).They are stored in R as ‘mtcars’.
#Look at the top 3 rows of the dataset
head(mtcars,3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
We first see the features of the data set and its variables. The ‘summary’ function shows descpiptive statistics for each variables as shown below.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Secondly, the ‘str’ function helps to know more on the class for each variables and the size of the datasets. As indicated below the data set contains 18 variables collected on 32 observations.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
We visualize a scatter plot of miles/US gallon and weight (1000 lbs) by the number of cylinders
#Load library
library(ggplot2)
#Basic scatter plot: wt on x-axis and mpg on y-axis; map cyl to col
ggplot(mtcars, aes(x=wt, y=mpg, color=cyl))+
geom_point(size=4)+
xlab("Weight (1000 lbs") + ylab ("Miles /(USD) gallon") +
ggtitle("Scatter plot: Miles per USD gallon vs Weight (1000 lbs)")
We first assess the relationship of our outcome variable ‘mpg’ with other predictors. The following correlation matrix shows the correlation coefficients between the possible pairs.
# Correlation between variables.
mcorr <- round (cor(mtcars, method = "pearson"), 2)
upper<-mcorr
upper[upper.tri(mcorr)]<-""
upper<-as.data.frame(upper)
upper
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1
## cyl -0.85 1
## disp -0.85 0.9 1
## hp -0.78 0.83 0.79 1
## drat 0.68 -0.7 -0.71 -0.45 1
## wt -0.87 0.78 0.89 0.66 -0.71 1
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1
## vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1
## am 0.6 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1
## gear 0.48 -0.49 -0.56 -0.13 0.7 -0.58 -0.21 0.21 0.79 1
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1
We first model ‘mpg’ as a function of ‘am’ only.
fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
Results: As the p-value is less than 0.5 percent level of signifance, both the intercept (mean of the reference group 0 =automatic transmission) and the second term (mean difference between group 1 = manual transmission with group 0) are statistically significant. The result reveal that manual transmission is 7.2 miles per gallon higher than automatic transmission. One would notice, however, how small R-squared is, 0.36, indicating the regressor doesn’t explain the variation in ‘mpg’ pretty well. So we need to add more variables.
Looking at the correlation matrix, variables are correlated among themselves. If all included in the model, the model would suffer from multicolinearity. We thus identify variables that are highly correted with ‘mpg’ and with small correlation pairs.
fit2 <-lm (mpg ~ am + wt + qsec, mtcars)
summary(fit2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## am 2.935837 1.4109045 2.080819 4.671551e-02
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
Results: The two added variables appear to be statistically significant. As expected by adding two variables, the coefficient for transmission decreased sharply, partly indicating how biased the model fit1 was when these variables were ommitted.
We also assess this relationship when we include all the regressors.
# We first fit our outcome variable 'mpg' with other all other variables
fit3 <- lm (mpg ~ ., mtcars)
summary(fit3)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
Results:By including all the variables, only weight is statistically significant at 10% level of significance, others are not. There is a 3.7 decline in miles per gallon for every 1000 lbs incraese in weight at 10 percent signifinace level.
We then use anova to assess the significance of the regressors that are added in the models. The null hypothesis, Ho is that the added regressors are not significant.
anova(fit1, fit2, fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + qsec
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 39.2687 8.025e-08 ***
## 3 21 147.49 7 21.79 0.4432 0.8636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results: We can see that by adding two variables namely ‘wt’ and ‘qsec’ the model is statistically significant. In addition, the R-squared for model 2 is too high 0.85, explaining how better these variables explain the outcome.
We also assess residuals for normality and overall residuals for model 2.
#Check for normality
shapiro.test(fit2$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit2$residuals
## W = 0.9411, p-value = 0.08043
#Model Residuals
par(mfrow = c (2,2))
plot(fit2)