Coursera Regression Models Project

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

“Is an automatic or manual transmission better for MPG”

“Quantify the MPG difference between automatic and manual transmissions”

Our first step is to load the mtcars dataset and take a look at the data.

data(mtcars)
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

We see that the variable we are interested in, am, is numeric. For our statistical analysis we need to convert it to a categorical variable. R calls categorical variables “factors”.

Second step: we convert am from numeric to factor. The categories of a factor are called levels. From the help page for mtcars, we see that am is the transmission type (0 = automatic, 1 = manual). To relabel the levels we use the mapvalues function from the plyr package, then inspect mtcars again to confirm that the change was saved. A nice property of mapvalues is that the call spells out exactly which value is mapped to which label.

library(plyr)

## Warning: package 'plyr' was built under R version 4.0.4

mtcars$am <- factor(mtcars$am)
mtcars$am <- mapvalues(mtcars$am, from = c("0", "1"), to = c("automatic", "manual"))
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

levels(mtcars$am)

## [1] "automatic" "manual"
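The same conversion can also be done without plyr. The following is a minimal sketch in base R, assuming the 0/1 coding from the mtcars help page; the name `mt` is only an illustrative working copy, not something used elsewhere in this report.

```r
# Work on a fresh copy so the built-in dataset is left untouched.
mt <- datasets::mtcars

# Base R: map the codes 0/1 to labelled factor levels in a single call.
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

levels(mt$am)  # the two labels, in the order given above
table(mt$am)   # number of cars per transmission type
```

Giving `levels` and `labels` together keeps the mapping explicit, much like the `from` and `to` arguments of mapvalues.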

Exploratory Data Analysis. We use box plots (also called Tukey plots), which John Tukey popularized in 1977. They let us quickly compare automatic and manual transmissions graphically. We will use the ggplot2 library.

# Basic box plot
library(ggplot2)
ggplot(mtcars, aes(x=mpg, y=am)) + 
  geom_boxplot(fill="gray")+
  labs(title="Miles Per Gallon Versus Transmission",x="Miles Per Gallon", y = "Transmission")+
  theme_classic()

From the box plot we can see that mpg is more spread out for manual cars than for automatic cars, which suggests that the two groups have different variances. The t-test also assumes that mpg is approximately normally distributed within each group. We therefore draw a density plot for each group to check this assumption, because otherwise the results of the t-test would not be valid.

library(ggplot2)
ggplot(mtcars, aes(mpg, fill = am, color = am)) + geom_density(alpha = 0.2) + facet_wrap(.~am, ncol=1) + xlim(c(0,60))

From the plots, it seems reasonable to assume that mpg is approximately normally distributed for both automatic and manual cars. Using that, together with the assumption that the variances are different, we apply the Welch t-test to test whether the mean mpg of automatic and manual cars is equal.
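Both assumptions can also be checked numerically rather than only by eye. The sketch below is one possible way to do it and is not part of the original analysis: it compares the standard deviation of mpg in each group and runs a Shapiro-Wilk test per group (the null hypothesis of that test is that the data are normal). The names `mt`, `group_sd` and `normality_p` are illustrative.

```r
# Fresh, self-contained copy of the data with am as a labelled factor.
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Spread of mpg within each group (standard deviations).
group_sd <- tapply(mt$mpg, mt$am, sd)
group_sd

# Shapiro-Wilk normality test per group; a large p-value means
# there is no evidence against normality.
normality_p <- tapply(mt$mpg, mt$am, function(x) shapiro.test(x)$p.value)
normality_p
```

With only 32 cars these checks have little power, so they support the visual impression rather than prove it. With the assumptions checked, the Welch test itself follows.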

automatic <- mtcars[ which(mtcars$am== "automatic"), ] 
manual <- mtcars[ which(mtcars$am== "manual"), ]
#OPTION 1
t.test(automatic$mpg, manual$mpg, var.equal=FALSE, paired=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  automatic$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

#OPTION 2
t.test(automatic$mpg, manual$mpg, alternative="two.sided", var.equal=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  automatic$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

If the p-value is less than 0.05, we reject the null hypothesis that there is no difference between the means and conclude that a significant difference exists. Here the p-value is 0.0014, so we reject the null hypothesis.
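The decision rule can be applied in code instead of reading the numbers off the printed output. This is a minimal sketch using the formula interface of t.test, which runs the Welch test by default (var.equal = FALSE); `mt`, `res` and `alpha` are illustrative names.

```r
# Fresh, self-contained copy of the data with am as a labelled factor.
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Formula interface: mpg split by the two levels of am.
res <- t.test(mpg ~ am, data = mt)

alpha <- 0.05        # significance level used in the text
res$p.value          # p-value of the Welch test
res$conf.int         # 95% CI for mean(automatic) - mean(manual)
res$p.value < alpha  # TRUE means the null hypothesis is rejected
```

The formula interface avoids building the automatic and manual subsets by hand and should reproduce the test shown above.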

#"Quantify the MPG difference between automatic and manual transmissions" 
mean(automatic$mpg) - mean(manual$mpg)

## [1] -7.244939

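The best estimate of the difference between automatic and manual is the difference in mean mpg: automatic cars average about 7.24 mpg less than manual cars. The same value can be obtained with the lm function, where the ammanual coefficient is the manual mean minus the automatic mean and therefore has the opposite sign. For example: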

#"Quantify the MPG difference between automatic and manual transmissions" 
lm(mpg ~ am, mtcars)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Coefficients:
## (Intercept)     ammanual  
##      17.147        7.245
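To see why the regression reproduces the difference in means, the coefficients can be compared with the group means directly. The sketch below is illustrative (`mt`, `fit` and `group_means` are assumed names): with a two-level factor as the only predictor, the intercept is the mean of the reference level and the slope is the difference between the two group means.

```r
# Fresh, self-contained copy of the data with am as a labelled factor.
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit <- lm(mpg ~ am, data = mt)

# Mean mpg per transmission type.
group_means <- tapply(mt$mpg, mt$am, mean)

# Intercept = mean of the reference level (automatic);
# slope (ammanual) = mean(manual) - mean(automatic).
coef(fit)
group_means["manual"] - group_means["automatic"]

# 95% confidence intervals for both coefficients.
confint(fit)
```

Note that the interval from confint rests on the equal-variance assumption of ordinary least squares, so it need not match the Welch interval reported earlier.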

Executive Summary:

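We can conclude that, in this data set, vehicles with a manual transmission have better MPG than vehicles with an automatic transmission: about 7.2 MPG more on average (24.4 versus 17.1), with a 95 percent confidence interval of 3.2 to 11.3 MPG for the difference (Welch t-test, p-value = 0.0014).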

Illya Bjazevic

5/6/2021