This is the Regression Models Course Project of the Coursera Data Science Specialization.

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  • Is an automatic or manual transmission better for MPG?
  • How large is the MPG difference between automatic and manual transmissions?

Load the data and show its structure

df <- mtcars
dim(df)
## [1] 32 11
head(df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The dataset includes 32 observations of 11 variables. The variables are:

  • mpg: Miles per US gallon
  • cyl: Number of cylinders
  • disp: Displacement (cubic inches)
  • hp: Gross horsepower
  • drat: Rear axle ratio
  • wt: Weight (lb / 1000)
  • qsec: 1/4 mile time (seconds)
  • vs: Engine shape (0 = V-shaped, 1 = straight)
  • am: Transmission (0 = automatic, 1 = manual)
  • gear: Number of forward gears
  • carb: Number of carburetors
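
Because am is coded as 0/1, the two transmission groups are easy to mix up. A minimal sketch, assuming a working copy named mt (an illustrative name that is not part of the original analysis), recodes am as a labelled factor and reports the group sizes and mean mileage:

```r
# Work on a copy so the original data frame stays untouched.
# 'mt' is an illustrative name, not used elsewhere in this report.
mt <- mtcars

# Attach readable labels to the 0/1 coding documented above.
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Number of cars and mean mpg in each transmission group.
group_n <- table(mt$am)
group_mean <- tapply(mt$mpg, mt$am, mean)

print(group_n)
print(round(group_mean, 2))
```

With the labels in place, plots and model summaries show "automatic" and "manual" instead of 0 and 1.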

Exploratory data analysis


The boxplot of mileage vs. transmission type in the appendix suggests that cars with a manual transmission get better fuel economy than cars with an automatic one. This impression has to be confirmed statistically, though. Otherwise it is impossible to say whether the transmission type really affects fuel economy or whether the difference is an artifact of this particular sample.


The boxplot of mileage vs. number of cylinders in the appendix shows a clear dependency between mileage and cylinder count. This dependency is much clearer than the one between mileage and transmission type (automatic or manual).
As the scatterplot matrix shows, mileage correlates well with the variables cyl, hp, and wt. These predictors are chosen for the model because of their strong correlation with mpg. This model is also compared with one that includes all the predictors (see Appendix).
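
The visual impression from the scatterplot matrix can be cross-checked numerically. The sketch below ranks the predictors by the absolute value of their Pearson correlation with mpg; the object names cors and ranked are illustrative and not part of the original analysis.

```r
# Pearson correlation of every other variable with mpg
# (the first column, mpg itself, is dropped).
cors <- cor(mtcars)["mpg", -1]

# Sort by strength of association, strongest first.
ranked <- cors[order(abs(cors), decreasing = TRUE)]

print(round(ranked, 3))
```

A ranking like this only captures pairwise linear association, so it is a screening aid rather than a model selection procedure.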

The analysis of the dependency between the mileage and transmission type.

The linear regression estimation.

fitLr <- lm(mpg ~ cyl + hp + wt + am, df)
summary(fitLr)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4765 -1.8471 -0.5544  1.2758  5.6608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.14654    3.10478  11.642 4.94e-12 ***
## cyl         -0.74516    0.58279  -1.279   0.2119    
## hp          -0.02495    0.01365  -1.828   0.0786 .  
## wt          -2.60648    0.91984  -2.834   0.0086 ** 
## am           1.47805    1.44115   1.026   0.3142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8267 
## F-statistic: 37.96 on 4 and 27 DF,  p-value: 1.025e-10


The p-value for the am coefficient is above 0.3 (0.3142), so the hypothesis that this coefficient is zero cannot be rejected. This model therefore cannot answer the question of interest. The linear regression with all the regressors included can be found in the Appendix, as can the residual plots for the chosen linear model.
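
Another way to read the am estimate is through its confidence interval. The sketch below refits the same four-predictor model on mtcars (under the illustrative name fit) and checks whether the 95% interval for am contains zero, which conveys the same message as the large p-value.

```r
# Refit the four-predictor model so the example is self-contained.
fit <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

# 95% confidence interval for the am coefficient (a 1 x 2 matrix).
ci_am <- confint(fit, parm = "am", level = 0.95)
print(ci_am)

# TRUE when the interval spans zero, i.e. the sign of the effect is uncertain.
contains_zero <- ci_am[1, 1] < 0 && ci_am[1, 2] > 0
print(contains_zero)
```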

The ridge regression estimation

library(ridge)
ridge <- linearRidge(mpg ~ cyl + hp + wt + am, df)
summary(ridge)
## 
## Call:
## linearRidge(formula = mpg ~ cyl + hp + wt + am, data = df)
## 
## 
## Coefficients:
##              Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept)  35.76620              NA                  NA               NA
## cyl          -0.78228        -7.77867             4.44349            1.751
## hp           -0.02459        -9.38833             4.01090            2.341
## wt           -2.44628       -13.32689             4.11122            3.242
## am            1.58182         4.39470             3.32648            1.321
##             Pr(>|t|)   
## (Intercept)       NA   
## cyl          0.08002 . 
## hp           0.01925 * 
## wt           0.00119 **
## am           0.18646   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Ridge parameter: 0.03558268, chosen automatically, computed using 2 PCs
## 
## Degrees of freedom: model 3.537 , variance 3.163 , residual 3.912

As the summary of the ridge regression estimates shows, the p-value for the am coefficient (0.19) is too large to conclude that this coefficient differs from zero.
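
To illustrate what the ridge penalty does, here is a minimal base-R sketch of the textbook closed-form ridge estimator on standardized predictors. It is not the algorithm used by linearRidge, which chooses its penalty automatically and uses its own scaling, so the numbers are not comparable with the summary above. The helper ridge_coef and the penalty grid are assumptions made for illustration.

```r
# Standardize the predictors and center the response, so no intercept is needed.
X <- scale(as.matrix(mtcars[, c("cyl", "hp", "wt", "am")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
# (hypothetical helper, written for this example only).
ridge_coef <- function(lambda) {
  b <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
  setNames(drop(b), colnames(X))
}

# An arbitrary grid of penalties; lambda = 0 reproduces ordinary least squares.
lambdas <- c(0, 1, 10, 100)
coefs <- sapply(lambdas, ridge_coef)
colnames(coefs) <- paste0("lambda=", lambdas)

print(round(coefs, 3))
```

Each column holds the standardized coefficients for one penalty value. As lambda grows, the coefficient vector shrinks toward zero, which stabilizes the estimates when predictors are strongly correlated.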

The executive summary

# am == 1 is manual, am == 0 is automatic
manual <- mtcars[mtcars$am == 1, ]$mpg
auto <- mtcars[mtcars$am == 0, ]$mpg
print(paste0('The difference in mean MPG between manual and automatic transmissions = ', mean(manual) - mean(auto)))
## [1] "The difference in mean MPG between manual and automatic transmissions = 7.24493927125506"

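But, as shown above, the other relevant predictors cannot be ignored: once cyl, hp, and wt are included, the transmission effect is no longer statistically significant. So this raw difference cannot be attributed to the transmission type with statistical confidence.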
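
The gap between the raw difference and the adjusted estimate can be made explicit. In the sketch below (variable names are illustrative), the am coefficient of a one-predictor regression equals the raw difference in group means, while the coefficient from the four-predictor model is the difference after accounting for cyl, hp, and wt.

```r
# Unadjusted effect: regression of mpg on am alone.
unadjusted <- coef(lm(mpg ~ am, data = mtcars))[["am"]]

# Adjusted effect: same coefficient after controlling for cyl, hp and wt.
adjusted <- coef(lm(mpg ~ cyl + hp + wt + am, data = mtcars))[["am"]]

print(c(unadjusted = unadjusted, adjusted = adjusted))
```

The unadjusted value reproduces the 7.24 MPG difference in means, and the adjusted value matches the am estimate of about 1.48 from the linear regression summary.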

Appendix.

Linear regression with all the regressors included.

fitLr_1 <- lm(mpg ~ ., df)
summary(fitLr_1)
## 
## Call:
## lm(formula = mpg ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07


None of the coefficients in this model, including am (p = 0.234), is significant at the 5% level, so the hypothesis that any individual coefficient is zero cannot be rejected. This model therefore cannot answer the question of interest either.
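
Since the four-predictor model is nested in the full model, the two can be compared with a partial F-test. The sketch below is one possible way to do this, assuming both models are refitted on mtcars under the illustrative names fit_small and fit_full.

```r
# Reduced model (chosen predictors) and full model (all regressors).
fit_small <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)
fit_full <- lm(mpg ~ ., data = mtcars)

# Partial F-test: do the six extra regressors improve the fit?
cmp <- anova(fit_small, fit_full)
print(cmp)

# p-value of the test; a large value favours the smaller model.
p_extra <- cmp[["Pr(>F)"]][2]
print(p_extra)
```

A p-value above 0.05 here indicates that the six additional regressors do not significantly improve the fit, which agrees with the higher adjusted R-squared of the smaller model (0.8267 vs. 0.8066).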

Let’s take a look at the boxplot of the am data

boxplot(mpg ~ am, data=df, col=c('blue', 'green'),
        xlab='Transmission (0 = Automatic, 1 = Manual)',
        ylab='Miles per Gallon', main='Boxplot of MPG vs. Transmission')

Boxplot of mileage by cylinder number.

boxplot(mpg ~ cyl, data=df, col=c('red', 'green', 'blue'),
        xlab='Number of cylinders', ylab='Miles per gallon',
        main='Mileage by Cylinder number')

Scatterplot matrix

pairs(mpg ~ ., data = df)

Welch t-test of mpg for automatic vs. manual transmission:

t.test(mpg ~ am, data = df)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
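
For readers who want to see where these numbers come from, the following sketch recomputes the Welch statistic and its Welch-Satterthwaite degrees of freedom by hand. The variable names are illustrative.

```r
# Split mpg by transmission type (0 = automatic, 1 = manual).
auto <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Per-group variance of the sample mean.
v_auto <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its standard error.
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

print(c(t = t_stat, df = df_welch))
```

The sign of t is negative because group 0 (automatic) is subtracted first, matching the order used by t.test.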

Residual plots for the chosen linear regression model

par(mfrow = c(2,2))
plot(fitLr)