Coursera’s Regression Model Project: Motor Trend Analysis

Motor Trend, an automobile industry magazine, is interested in exploring the relationship between a set of car variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions.

The “mtcars” dataset, which ships with R, is used. It contains 32 observations of 11 variables.

Load and visualize dataset

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
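The preview above shows output only. A minimal sketch of commands that would produce it, and confirm the stated dimensions, is given here; the exact calls used in the original report are not shown, so this is an assumption.

```r
# Load the built-in dataset (part of the 'datasets' package that ships with R).
data(mtcars)

dim(mtcars)    # number of rows (cars) and columns (variables)
head(mtcars)   # first six rows, as in the preview above
str(mtcars)    # column types; every column is numeric before any conversion
```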

Some of the variables include:

  1. mpg: miles/US gallon
  2. cyl: number of cylinders
  3. wt: weight (1000 lbs)
  4. am: transmission (1=manual, 0=automatic)
  5. disp: displacement

Data analysis

Question 1: Is an automatic or manual transmission better for MPG?

A. Check which transmission is better.

First, convert the am variable to a factor with descriptive labels:

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")

The boxplot in Appendix 2 suggests that manual transmission is better than automatic, as it has a higher mpg on average (if other variables are ignored).
We therefore test the following hypotheses:
  i. Null hypothesis: there is no difference in mean mpg between manual and automatic transmissions.
  ii. Alternative hypothesis: manual transmission is better than automatic (higher mean mpg).

ttest<- t.test(mpg~am, data=mtcars)
ttest
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group Automatic and group Manual is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231
pvalue<- ttest$p.value

# Fit a simple linear model to get the R-squared value
fit2 <- lm(mpg ~ am, data = mtcars)
r_squared <- summary(fit2)$r.squared

As the p-value of 0.0013736 is less than 0.05, we reject the null hypothesis that there is no difference in average mpg between cars with manual and automatic transmissions. Ignoring other variables, manual transmission therefore appears better for MPG. The main caveat is the low R-squared value of 0.3597989: transmission type alone explains only about 36% of the variance in mpg.
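The alternative hypothesis stated earlier is directional (manual is better), while the Welch test output reports a two-sided alternative. A hedged sketch of the matching one-sided test follows; the names `cars`, `one_sided`, and `two_sided` are illustrative and do not appear in the original analysis.

```r
data(mtcars)
cars <- mtcars   # local copy so the original data frame is left untouched
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# With a formula, t.test() takes the difference as first level minus
# second level (Automatic - Manual), so "manual is better" maps to "less".
one_sided <- t.test(mpg ~ am, data = cars, alternative = "less")
two_sided <- t.test(mpg ~ am, data = cars)

one_sided$p.value
two_sided$p.value
```

Because the observed t statistic is negative, the one-sided p-value is half the two-sided one, so the conclusion of the test does not change.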

The results in Appendix 1 show that there are other variables with stronger correlations with mpg (positive or negative). These variables may have contributed to the significant difference found by the t-test above, so this is explored further in the next section.

Question 2: Quantify the MPG difference between automatic and manual transmissions

  1. Simple summary of MPG distribution
library(dplyr)
mpg_ave<- mtcars %>%
  group_by(am) %>%
  summarise(average_mpg = mean(mpg, na.rm=TRUE),sd_mpg = sd(mpg, na.rm = TRUE))
mpg_ave
## # A tibble: 2 x 3
##   am        average_mpg sd_mpg
##   <fct>           <dbl>  <dbl>
## 1 Automatic        17.1   3.83
## 2 Manual           24.4   6.17
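The unadjusted difference can be read off the two group means. The sketch below computes it in base R and checks that it equals the slope of the simple regression of mpg on am; the variable names are illustrative assumptions.

```r
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Mean mpg per transmission type, then Manual minus Automatic.
group_means <- tapply(cars$mpg, cars$am, mean)
mpg_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# In a regression on a two-level factor, the slope is the same difference.
slope <- unname(coef(lm(mpg ~ am, data = cars))["amManual"])

mpg_diff
slope
```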

Next, we check whether other variables contribute to the statistically significant difference in mpg between manual and automatic transmissions. The remaining categorical variables are first converted to factors:

mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)

Perform multiple linear regression

First, fit a multiple linear regression of mpg on all other variables, then use stepwise selection to choose the best model:

fit <- lm(mpg~., data = mtcars)
best_fit<- step(fit, direction = "both", trace = FALSE)
summary(best_fit)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
best_rsquared <- summary(best_fit)$r.squared

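Based on the stepwise algorithm, the best prediction model of mpg contains the following variables:

  1. cyl: number of cylinders
  2. hp: gross horsepower
  3. wt: weight (1000 lbs)
  4. am: transmission (automatic or manual)

This model explains about 87% of the variance in mpg (R-squared = 0.8659). Holding cylinders, horsepower, and weight fixed, manual transmission is associated with an estimated 1.81 mpg increase over automatic, but this coefficient is not statistically significant (p = 0.206).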
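To quantify the transmission effect after adjustment, the coefficient for manual transmission and its confidence interval can be extracted from the selected model. This is a sketch that refits the formula reported by `step()` directly; `cars`, `adj_fit`, `am_estimate`, and `am_ci` are hypothetical names.

```r
data(mtcars)
cars <- mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

# Same formula as the model selected by the stepwise search.
adj_fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Adjusted manual-vs-automatic difference and its 95% confidence interval.
am_estimate <- coef(adj_fit)["amManual"]
am_ci <- confint(adj_fit, "amManual", level = 0.95)

am_estimate
am_ci
```

If the interval includes zero, the adjusted difference cannot be distinguished from no difference at the 5% level.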

Executive Summary

Without considering other variables, manual transmission gives a higher mpg than automatic: 24.4 versus 17.1 mpg on average, a difference of about 7.2 mpg (95% confidence interval 3.2 to 11.3). Despite the significant p-value, it would be a stretch to conclude that transmission type alone produces the higher mpg. The multiple linear regression supports this caution: lower gross horsepower, lower weight, and having 4 cylinders rather than 6 or 8 are all associated with higher mpg, and once these are accounted for, the estimated advantage of manual transmission shrinks to about 1.8 mpg and is no longer statistically significant (p = 0.206).

Appendices

Appendix 1: First, let's check the relationship between MPG and the other variables.

library(corrplot)
# Reload the original data so that every column is numeric again
data("mtcars")
# Correlation of mpg with each of the other ten variables
mpgcor <- cor(mtcars$mpg, mtcars[, -1])
corrplot(mpgcor, method = "number")

Based on the correlation plot in Appendix 1, the variables cyl, disp, wt, and hp have the strongest correlations with mpg, all of them negative.
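The same ranking can be checked without a plot. The sketch below computes each correlation with mpg and sorts them; `mpg_cors` is an illustrative name not used in the original.

```r
data(mtcars)  # reload so that every column is numeric

# Correlation of mpg with each other column, as a named numeric vector.
mpg_cors <- sapply(mtcars[, -1], function(x) cor(mtcars$mpg, x))

# Most negative correlations first, most positive last.
sort(mpg_cors)
```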

Appendix 2: Visualizing the MPG distribution of the two transmission types

boxplot(mpg~am, data = mtcars,
        xlab = "Transmission Type (am)",
        ylab = "Miles/US gallon (mpg)",
        main = "MPG by Transmission Type")

Boxplot of the mpg distribution of all observed cars, grouped by transmission type.

Appendix 3: Residual check and plots

plot(best_fit)
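Calling `plot()` on a fitted `lm` object draws four diagnostic plots one after another. A possible way to view them together on one page is sketched below; it refits the selected formula so that it runs on its own, and `cars`, `final_fit`, and `old_par` are hypothetical names.

```r
data(mtcars)
cars <- mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)
final_fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid for the four default plots
plot(final_fit)                  # residuals vs fitted, Q-Q, scale-location, leverage
par(old_par)                     # restore the previous graphics settings
```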