This project is the Regression Models course project from Coursera. Its goal is to explore the relationship between a set of variables and miles per gallon (mpg), using the mtcars dataset that ships with R. The analysis addresses two questions: which transmission type, automatic or manual, gives better mpg, and how large the mpg difference between the two is.
A Welch two-sample t-test shows that cars with manual transmission have significantly higher miles per gallon than cars with automatic transmission. When transmission type is the only predictor in the model, the average difference is about 7.2 mpg. After adjusting for weight and horsepower, the estimated advantage of manual transmission shrinks to 2.08 mpg with those variables held constant, and that adjusted difference is not statistically significant at the 5% level (p = 0.141).
We first load the data and look at the first few rows
library(datasets)
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We transform some variables into factors and see the structure of the data
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am,labels = c("Automatic","Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
Here we see the statistical summary of miles per gallon (mpg)
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.42 19.20 20.09 22.80 33.90
We compare automatic and manual transmissions by grouping the cars by transmission type and computing the average miles per gallon for each group
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mtcars %>%
group_by(am) %>%
summarize(mean = mean(mpg))
## # A tibble: 2 x 2
## am mean
## <fctr> <dbl>
## 1 Automatic 17.14737
## 2 Manual 24.39231
Manual-transmission cars average 24.392 mpg, while automatic-transmission cars average 17.147 mpg, a difference of about 7.2 mpg in favor of manual.
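As a cross-check on the grouped means, the same numbers can be reproduced in base R. This is only a sketch: the names `mt`, `means` and `gap` are illustrative and not part of the original analysis, and it works on a fresh copy of the data so the report's own `mtcars` object is left untouched.

```r
# Fresh copy of the data, so the factor conversions made earlier are not affected
mt <- datasets::mtcars

# In the raw data am is coded 0 = automatic, 1 = manual
am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Mean mpg per transmission type, then the manual-minus-automatic gap
means <- tapply(mt$mpg, am, mean)
gap <- means[["Manual"]] - means[["Automatic"]]

print(round(means, 3))
print(round(gap, 3))
```

The gap computed this way should match the roughly 7.2 mpg difference reported above.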
We split the data into an automatic-only subset and a manual-only subset and perform a two-sample t-test on mpg
automatic <- mtcars[mtcars$am == "Automatic", ]
manual <- mtcars[mtcars$am == "Manual", ]
t.test(automatic$mpg, manual$mpg)
##
## Welch Two Sample t-test
##
## data: automatic$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
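The t-test gives a p-value of 0.001374, well below 0.05, so we reject the null hypothesis of equal means: the difference in mean mpg between the two transmission types is statistically significant. The 95% confidence interval for the difference (automatic minus manual) runs from -11.28 to -3.21 mpg.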
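The same test can be run without subsetting by using the formula interface of `t.test`. The sketch below assumes a fresh copy of the data named `mt` (an illustrative name) and pulls the pieces of the result out of the returned object instead of reading them off the printout.

```r
# Fresh copy with am recoded as a labelled factor (Automatic is the first level)
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Welch two-sample t-test; var.equal = FALSE is the default
tt <- t.test(mpg ~ am, data = mt)

print(tt$estimate)   # group means
print(tt$conf.int)   # 95% CI for mean(Automatic) - mean(Manual)
print(tt$p.value)
```

Because "Automatic" is the first factor level, the difference is taken as automatic minus manual, which matches the sign convention of the two-sample call used in the report.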
We fit a simple linear regression with mpg as the dependent variable and transmission type (am) as the independent variable
modelFit <- lm(mpg ~ am, data = mtcars)
summary(modelFit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The model has an R^2 of only 0.3598, meaning that transmission type alone explains about 36% of the variability in mpg. The amManual coefficient (7.245) is simply the difference between the two group means. Because most of the variability remains unexplained, we next consider the other variables using multiple linear regression.
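To make the R^2 interpretation concrete, it can be recomputed by hand as 1 - RSS/TSS, where RSS is the residual sum of squares and TSS is the total sum of squares of mpg around its mean. The sketch below uses illustrative names (`mt`, `fit`, `rss`, `tss`, `r2`) and the raw 0/1 coding of `am`, which gives the same fit as the two-level factor.

```r
# Fresh copy of the data; am stays numeric 0/1 here
mt <- datasets::mtcars
fit <- lm(mpg ~ am, data = mt)

# Residual and total sums of squares
rss <- sum(residuals(fit)^2)
tss <- sum((mt$mpg - mean(mt$mpg))^2)

# Share of the variability in mpg explained by the model
r2 <- 1 - rss / tss
print(r2)
print(summary(fit)$r.squared)   # should agree with the hand computation
```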
We choose among candidate models by comparing a sequence of nested models with ANOVA
modelFit1 <- lm(mpg ~ am , data = mtcars)
modelFit2 <- lm(mpg ~ am + wt, data = mtcars)
modelFit3 <- lm(mpg ~ am + wt + hp , data = mtcars)
modelFit4 <- lm(mpg ~ am + wt + hp + cyl, data = mtcars)
modelFit5 <- lm(mpg ~ am + wt + hp + cyl + disp, data = mtcars)
modelFit6 <- lm(mpg ~ ., data = mtcars)
anova(modelFit1, modelFit2, modelFit3, modelFit4, modelFit5, modelFit6)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
## Model 4: mpg ~ am + wt + hp + cyl
## Model 5: mpg ~ am + wt + hp + cyl + disp
## Model 6: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 55.1371 2.129e-06 ***
## 3 28 180.29 1 98.03 12.2126 0.003259 **
## 4 26 151.03 2 29.27 1.8230 0.195569
## 5 25 150.41 1 0.62 0.0768 0.785409
## 6 15 120.40 10 30.01 0.3738 0.939655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
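The ANOVA shows that adding weight (model 2, p = 2.129e-06) and then horsepower (model 3, p = 0.003259) each significantly improves the fit, while the further additions in models 4 to 6 do not (all p > 0.19). We therefore select modelFit3, which has transmission type (am), weight (wt) and horsepower (hp) as the independent variables.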
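The F-test behind each row of the table can be sketched by hand for a single pair of nested models. The example below compares the weight-only adjustment with the weight-plus-horsepower model; the names `mt`, `fit2`, `fit3`, `f_stat` and `p_val` are illustrative. Note that a two-model comparison uses the residual mean square of the larger of the two models as its denominator, whereas the six-model table above uses that of the largest model in the sequence, so the F value here will not equal the 12.21 shown in the table.

```r
# Fresh copy of the data; am as 0/1 gives the same fits as the two-level factor
mt <- datasets::mtcars
fit2 <- lm(mpg ~ am + wt, data = mt)
fit3 <- lm(mpg ~ am + wt + hp, data = mt)

# Residual sums of squares of the smaller and larger model
rss2 <- deviance(fit2)
rss3 <- deviance(fit3)

# F = (drop in RSS / number of added terms) / (residual mean square of larger model)
f_stat <- ((rss2 - rss3) / 1) / (rss3 / df.residual(fit3))
p_val <- pf(f_stat, 1, df.residual(fit3), lower.tail = FALSE)

# Built-in comparison for reference
cmp <- anova(fit2, fit3)
print(c(F = f_stat, p = p_val))
print(cmp)
```

With only one coefficient added, this F statistic equals the square of the t statistic for hp in the larger model, which is why the two tests lead to the same conclusion.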
We see the summary of the selected best model here
summary(modelFit3)
##
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4221 -1.7924 -0.3788 1.2249 5.5317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875 2.642659 12.867 2.82e-13 ***
## amManual 2.083710 1.376420 1.514 0.141268
## wt -2.878575 0.904971 -3.181 0.003574 **
## hp -0.037479 0.009605 -3.902 0.000546 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
The model has an R^2 of 0.8399, so about 84% of the variability in mpg is explained by the model. Both weight and horsepower have p-values below 0.05, indicating a statistically significant relationship with mpg. After adjusting for these variables, manual transmission is associated with an average increase of only 2.08 mpg compared to automatic transmission, with the other variables held constant, and this coefficient is no longer statistically significant (p = 0.141).
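A confidence interval is a useful companion to the point estimate for the transmission effect. The sketch below refits the selected model on a fresh copy of the data (the names `mt`, `fit3` and `ci` are illustrative) and uses `confint` to get a 95% interval for the amManual coefficient.

```r
# Fresh copy with am recoded as a labelled factor, as in the report
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Selected model: transmission type adjusted for weight and horsepower
fit3 <- lm(mpg ~ am + wt + hp, data = mt)

# 95% confidence interval for the adjusted manual-vs-automatic difference
ci <- confint(fit3, "amManual", level = 0.95)
print(ci)
```

The interval spans zero, which is consistent with the p-value of 0.141 for amManual in the summary above.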
# Residual diagnostic plots for the selected model, arranged in a 2 x 2 grid
par(mfrow = c(2,2))
plot(modelFit3)
Here is a boxplot of miles per gallon by transmission type
boxplot(mpg ~ am, data = mtcars, xlab = "Transmission Type", ylab = "Miles per Gallon")
The boxplot also shows that manual-transmission cars tend to have higher mpg than automatic-transmission cars.
Here is a pairs plot showing the pairwise relationships between mpg and all the other variables in the dataset
pairs(mpg ~ ., data = mtcars)