This report analyzes the 1974 Motor Trend US magazine data, which comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Regression models and exploratory data analysis are used to explore the relationship between transmission type (am) and fuel consumption (MPG). Based on the t-test result, we conclude that there is a substantial difference in fuel consumption between cars with automatic (am=0) and manual (am=1) transmissions: on average, cars with manual transmissions get 3 to 11 miles per gallon more than cars with automatic transmissions. Stepwise selection by the Akaike information criterion (AIC) is used to reach a parsimonious model, and adjusted R-squared is used to compare that model with the full model. Based on the selected model output, we conclude that, holding number of cylinders, horsepower and weight constant, cars with manual transmission get an estimated 1.8 miles per gallon (MPG) more than cars with automatic transmission.
Visual analysis of mtcars data (refer to Appendix) lets us identify strong negative correlation in pairs: mpg~cyl, mpg~disp, mpg~hp, mpg~wt. The analysis also shows the need to assign factor levels to some variables in the dataset.
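The correlations read off the scatterplot matrix can also be checked numerically. The following is a minimal sketch, not part of the original analysis; the names `cars` and `mpg_cor` are introduced here for illustration only, and a fresh copy of the dataset is used so the columns are still numeric.

```r
# Work on a fresh copy so the columns are numeric (no factor conversion yet).
cars <- datasets::mtcars

# Pearson correlation of mpg with each of the four predictors named above.
mpg_cor <- sapply(cars[, c("cyl", "disp", "hp", "wt")],
                  function(x) cor(cars$mpg, x))

print(round(mpg_cor, 2))
```

All four coefficients should come out clearly negative, matching the downward trends in the panels of the scatterplot matrix.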
library(ggplot2)
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
pairs(mtcars,panel=panel.smooth)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
Let's perform a hypothesis test (Welch two-sample t-test) on the relationship between transmission type and miles per gallon, with the following hypotheses:
H0: There is no difference in mean MPG between automatic and manual transmission cars.
Ha: There is a true difference in means.
test <- t.test(mtcars$mpg~mtcars$am)
test$statistic;test$parameter;test$p.value
## t
## -3.767123
## df
## 18.33225
## [1] 0.001373638
test$conf.int
## [1] -11.280194 -3.209684
## attr(,"conf.level")
## [1] 0.95
The test gives a t-value of -3.767 with 18.33 degrees of freedom and a p-value of 0.0014. With such a small p-value, we reject the null hypothesis. The 95% confidence interval indicates that automatic transmission cars get, on average, about 3 to 11 miles per gallon less than manual transmission cars.
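To make the direction of the difference explicit, the group means behind the test can be computed directly. This is a hedged sketch on a fresh copy of the data; the labels "automatic" and "manual" and the names `cars`, `group_means`, `mean_diff` and `welch` are added here for readability and do not appear in the original code.

```r
cars <- datasets::mtcars

# In mtcars, am = 0 is automatic and am = 1 is manual.
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Mean mpg per transmission type.
group_means <- tapply(cars$mpg, cars$am, mean)
print(round(group_means, 2))

# Difference in the same order t.test() uses (first level minus second).
mean_diff <- unname(group_means["automatic"] - group_means["manual"])
print(mean_diff)

# Welch two-sample t-test; the interval is for automatic minus manual.
welch <- t.test(mpg ~ am, data = cars)
print(welch$conf.int)
```

Because the interval is for automatic minus manual, its negative bounds mean that automatic cars have the lower average MPG.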
We use the Akaike information criterion (AIC) with backward stepwise elimination to select a parsimonious model, starting from the full model that includes all predictors.
fit0 <- lm(mpg ~ ., data=mtcars)
summary(fit0)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## am1 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
Model1 <- step(fit0,k=2)
## Start: AIC=76.4
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 5 13.5989 134.00 69.828
## - gear 2 3.9729 124.38 73.442
## - am 1 1.1420 121.55 74.705
## - qsec 1 1.2413 121.64 74.732
## - drat 1 1.8208 122.22 74.884
## - cyl 2 10.9314 131.33 75.184
## - vs 1 3.6299 124.03 75.354
## <none> 120.40 76.403
## - disp 1 9.9672 130.37 76.948
## - wt 1 25.5541 145.96 80.562
## - hp 1 25.6715 146.07 80.588
##
## Step: AIC=69.83
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 2 5.0215 139.02 67.005
## - disp 1 0.9934 135.00 68.064
## - drat 1 1.1854 135.19 68.110
## - vs 1 3.6763 137.68 68.694
## - cyl 2 12.5642 146.57 68.696
## - qsec 1 5.2634 139.26 69.061
## <none> 134.00 69.828
## - am 1 11.9255 145.93 70.556
## - wt 1 19.7963 153.80 72.237
## - hp 1 22.7935 156.79 72.855
##
## Step: AIC=67
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - drat 1 0.9672 139.99 65.227
## - cyl 2 10.4247 149.45 65.319
## - disp 1 1.5483 140.57 65.359
## - vs 1 2.1829 141.21 65.503
## - qsec 1 3.6324 142.66 65.830
## <none> 139.02 67.005
## - am 1 16.5665 155.59 68.608
## - hp 1 18.1768 157.20 68.937
## - wt 1 31.1896 170.21 71.482
##
## Step: AIC=65.23
## mpg ~ cyl + disp + hp + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - disp 1 1.2474 141.24 63.511
## - vs 1 2.3403 142.33 63.757
## - cyl 2 12.3267 152.32 63.927
## - qsec 1 3.1000 143.09 63.928
## <none> 139.99 65.227
## - hp 1 17.7382 157.73 67.044
## - am 1 19.4660 159.46 67.393
## - wt 1 30.7151 170.71 69.574
##
## Step: AIC=63.51
## mpg ~ cyl + hp + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - qsec 1 2.442 143.68 62.059
## - vs 1 2.744 143.98 62.126
## - cyl 2 18.580 159.82 63.466
## <none> 141.24 63.511
## - hp 1 18.184 159.42 65.386
## - am 1 18.885 160.12 65.527
## - wt 1 39.645 180.88 69.428
##
## Step: AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
##
## Df Sum of Sq RSS AIC
## - vs 1 7.346 151.03 61.655
## <none> 143.68 62.059
## - cyl 2 25.284 168.96 63.246
## - am 1 16.443 160.12 63.527
## - hp 1 36.344 180.02 67.275
## - wt 1 41.088 184.77 68.108
##
## Step: AIC=61.65
## mpg ~ cyl + hp + wt + am
##
## Df Sum of Sq RSS AIC
## <none> 151.03 61.655
## - am 1 9.752 160.78 61.657
## - cyl 2 29.265 180.29 63.323
## - hp 1 31.943 182.97 65.794
## - wt 1 46.173 197.20 68.191
summary(Model1)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## am1 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
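The AIC values printed by step() can be reproduced by hand, which clarifies what is being traded off. For a linear model, step() relies on extractAIC(), which computes n * log(RSS / n) + k * edf, where edf is the number of estimated coefficients. The sketch below refits the selected model on a fresh copy of the data; the names `cars`, `selected`, `aic_by_hand` and `aic_step` are illustrative.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am)

selected <- lm(mpg ~ cyl + hp + wt + am, data = cars)

n   <- nrow(cars)                  # 32 observations
rss <- sum(residuals(selected)^2)  # residual sum of squares
p   <- length(coef(selected))      # 6 coefficients, including the intercept

# Goodness-of-fit term plus a penalty of k = 2 per coefficient.
aic_by_hand <- n * log(rss / n) + 2 * p

# The value step() reports for the same model.
aic_step <- extractAIC(selected, k = 2)[2]

print(c(by_hand = aic_by_hand, step = aic_step))
```

Both numbers should agree with the final "Step: AIC=61.65" line. Note that stats::AIC() returns a different value for the same model because it keeps the constant terms of the log-likelihood; only differences between models matter for selection.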
The trade-off between goodness of fit and model complexity is reached with lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars). Here is a comparison of the full model and the selected model:
sumTab <- data.frame(1:4,1:4)
sumTab[,1] <- rbind("Residual standard error: 2.833 on 15 degrees of freedom","Multiple R-squared: 0.8931","Adjusted R-squared: 0.779","p-value: 0.000124")
sumTab[,2] <- rbind("Residual standard error: 2.41 on 26 degrees of freedom","Multiple R-squared: 0.8659","Adjusted R-squared: 0.8401","p-value: 1.506e-10")
colnames(sumTab) <- c("Full model","Selected model")
sumTab
## Full model
## 1 Residual standard error: 2.833 on 15 degrees of freedom
## 2 Multiple R-squared: 0.8931
## 3 Adjusted R-squared: 0.779
## 4 p-value: 0.000124
## Selected model
## 1 Residual standard error: 2.41 on 26 degrees of freedom
## 2 Multiple R-squared: 0.8659
## 3 Adjusted R-squared: 0.8401
## 4 p-value: 1.506e-10
As the table above shows, although multiple R-squared is slightly higher for the full model (0.8931 vs. 0.8659), adjusted R-squared is considerably higher for the selected model (0.8401 vs. 0.779), and the residual standard error is lower for the selected model. The selected model therefore explains about 86% of the variability in MPG with far fewer predictors than the full model, which was the goal of our analysis.
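The reason adjusted R-squared can favor the smaller model is its penalty for the number of predictors: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where p counts coefficients excluding the intercept. The following sketch recomputes it for both models on a fresh copy of the data; `adj_r2`, `full` and `selected` are illustrative names, not objects from the report.

```r
cars <- datasets::mtcars
for (v in c("cyl", "vs", "am", "gear", "carb")) {
  cars[[v]] <- factor(cars[[v]])
}

full     <- lm(mpg ~ ., data = cars)
selected <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Adjusted R-squared from its definition.
adj_r2 <- function(fit) {
  n  <- nobs(fit)
  p  <- length(coef(fit)) - 1   # coefficients excluding the intercept
  r2 <- summary(fit)$r.squared
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

print(round(c(full = adj_r2(full), selected = adj_r2(selected)), 4))
```

With 16 coefficients besides the intercept and only 32 observations, the full model pays a much larger penalty than the selected model with 5.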
The residual plots (Graph 3.0 in the Appendix) support the following assumptions:
1. The Residuals vs. Fitted plot shows no clear pattern, which supports the independence assumption.
2. The Normal Q-Q plot supports normality of the residuals, as the points closely follow a straight line.
3. The Scale-Location plot shows no clear pattern, which supports constant variance.
4. The Residuals vs. Leverage plot shows no influential outliers.
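The visual checks above can be complemented with a few numeric diagnostics. This is only a sketch under stated assumptions: the Shapiro-Wilk test, leverage values and Cook's distance are not used in the original report, and the names `cars`, `selected`, `sw`, `lev` and `cook` are introduced here for illustration.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am)

selected <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Shapiro-Wilk test on the residuals (null hypothesis: normality).
sw <- shapiro.test(residuals(selected))
print(sw$p.value)

# Leverage (hat values) and Cook's distance for each car.
lev  <- hatvalues(selected)
cook <- cooks.distance(selected)

# The three cars with the largest influence on the fit.
print(round(head(sort(cook, decreasing = TRUE), 3), 3))
```

Large Cook's distances would flag cars worth inspecting individually; how large counts as influential is a judgment call rather than a fixed rule.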
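Our findings show that, when number of cylinders, horsepower and weight are held constant, cars with manual transmission get an estimated 1.8 miles per gallon (MPG) more than cars with automatic transmission. Note that this coefficient is not statistically significant at the 5% level in the selected model (p = 0.206).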
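The uncertainty around the adjusted transmission effect can be quantified with a confidence interval for the am1 coefficient. The sketch below refits the selected model on a fresh copy of the data; `cars`, `selected` and `ci_am` are illustrative names.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am)   # level "1" is manual, so the coefficient is am1

selected <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Point estimate of the manual-vs-automatic difference, other terms held fixed.
print(coef(selected)["am1"])

# 95% confidence interval for that difference.
ci_am <- confint(selected, "am1", level = 0.95)
print(ci_am)
```

An interval that includes zero is consistent with the p-value reported for am1 in the model summary, so the 1.8 MPG figure is best read as an estimate rather than an established effect.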
Graph 1.0 Scatterplots of the 1974 Motor Trend US magazine data
pairs(mtcars,panel=panel.smooth)
Graph 2.0 Miles per gallon and weight relationship
ggplot(data=mtcars,aes(x=wt,y=mpg,col=hp)) + geom_point(stat="identity") + geom_smooth() + facet_wrap(~am, scales = "free") + labs(x="Weight (lb/1000)",y="Miles/(US) gallon",title="Miles per gallon and weight relationship based on transmission type")
Graph 3.0 Residuals plot
par(mfrow = c(2, 2))
plot(Model1)
Graph 4.0 Miles per gallon by Transmission type
boxplot(mtcars$mpg ~ mtcars$am, xlab="Transmission type", ylab="Miles per gallon",main="Miles per gallon by Transmission type")