This is the Regression Models Course Project of the Coursera Data Science specialization.

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

Is an automatic or manual transmission better for MPG?

Quantify the MPG difference between automatic and manual transmissions

Load the data and show its structure

df = mtcars
dim(df)

## [1] 32 11

head(df)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The dataset includes 32 observations of 11 variables. The variables are:

mpg: Miles per US gallon
cyl: Number of cylinders
disp: Displacement (cubic inches)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (lb / 1000)
qsec: 1/4 mile time (seconds)
vs: Engine shape (0 = V-shaped, 1 = straight)
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors
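Because am is stored as 0/1, it is easy to mix up the two groups. A minimal sketch of one way to guard against that is to attach readable labels, following the coding listed above (0 = automatic, 1 = manual). The column name am_label is a hypothetical helper and is not used elsewhere in this report.

```r
# Work on a copy of the built-in data set
df <- mtcars

# am_label is a hypothetical helper column: 0 -> automatic, 1 -> manual
df$am_label <- factor(df$am, levels = c(0, 1),
                      labels = c("automatic", "manual"))

# Number of cars in each transmission group
counts <- table(df$am_label)
print(counts)
```

With labels in place, group summaries and plots show "automatic" and "manual" instead of 0 and 1.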

Exploratory data analysis

The boxplot of mileage vs. transmission type in the Appendix suggests that the manual transmission (am = 1) could be better for fuel economy than the automatic one (am = 0). This has to be confirmed statistically, however. Otherwise it is impossible to say whether manual transmission really is better for fuel economy or whether the effect is an artifact of this particular data sample.

The boxplot of mileage vs. number of cylinders in the Appendix shows a clear dependency between mileage and cylinder count. This dependency is much clearer than the one between mileage and gearbox type (automatic or manual).
As can be seen from the scatterplot matrix, mileage correlates well with the variables cyl, hp and wt. These predictors were chosen for the model because they are well correlated with mpg. This model is also compared with one that includes all the predictors (see Appendix).
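The visual impression from the scatterplot matrix can be cross-checked numerically. The sketch below ranks the predictors by the absolute value of their Pearson correlation with mpg. It is only an illustration and not the selection procedure used in this report; the names cors and ranked are introduced here.

```r
df <- mtcars

# Pearson correlation of every variable with mpg, dropping mpg itself
cors <- cor(df)[, "mpg"]
cors <- cors[names(cors) != "mpg"]

# Strongest correlations first, regardless of sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

A ranking like this measures only pairwise linear association, and predictors that are strongly correlated with mpg are often also strongly correlated with each other.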

The analysis of the dependency between the mileage and transmission type.

The linear regression estimation.

fitLr <- lm(mpg ~ cyl + hp + wt + am, df)
summary(fitLr)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4765 -1.8471 -0.5544  1.2758  5.6608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.14654    3.10478  11.642 4.94e-12 ***
## cyl         -0.74516    0.58279  -1.279   0.2119    
## hp          -0.02495    0.01365  -1.828   0.0786 .  
## wt          -2.60648    0.91984  -2.834   0.0086 ** 
## am           1.47805    1.44115   1.026   0.3142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8267 
## F-statistic: 37.96 on 4 and 27 DF,  p-value: 1.025e-10

The p-value for the am coefficient is above 0.3 (0.3142), so the hypothesis that this coefficient equals zero cannot be rejected. Therefore this model cannot answer the question of interest. The linear regression with all the regressors included can be found in the Appendix, as can the residual plots for the chosen linear model.
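The same conclusion can be read from a confidence interval instead of a p-value. The sketch below refits the model so that it runs on its own and checks whether the 95% interval for the am coefficient contains zero. The names ci_am and spans_zero are introduced here for illustration.

```r
df <- mtcars
fitLr <- lm(mpg ~ cyl + hp + wt + am, data = df)

# 95% confidence interval for the am coefficient only
ci_am <- confint(fitLr, "am", level = 0.95)
print(ci_am)

# TRUE when the interval contains zero, i.e. the effect is not
# distinguishable from zero at the 5% level
spans_zero <- ci_am[1, 1] < 0 && ci_am[1, 2] > 0
print(spans_zero)
```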

The ridge regression estimation

library(ridge)
ridge <- linearRidge(mpg ~ cyl + hp + wt + am, df)
summary(ridge)

## 
## Call:
## linearRidge(formula = mpg ~ cyl + hp + wt + am, data = df)
## 
## 
## Coefficients:
##              Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept)  35.76620              NA                  NA               NA
## cyl          -0.78228        -7.77867             4.44349            1.751
## hp           -0.02459        -9.38833             4.01090            2.341
## wt           -2.44628       -13.32689             4.11122            3.242
## am            1.58182         4.39470             3.32648            1.321
##             Pr(>|t|)   
## (Intercept)       NA   
## cyl          0.08002 . 
## hp           0.01925 * 
## wt           0.00119 **
## am           0.18646   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Ridge parameter: 0.03558268, chosen automatically, computed using 2 PCs
## 
## Degrees of freedom: model 3.537 , variance 3.163 , residual 3.912

As can be seen from the summary of the ridge regression estimates, the p-value for the am coefficient (0.19) is too large to conclude that the coefficient is different from zero.

The executive summary

Is an automatic or manual transmission better for MPG? This question cannot be answered with this data sample: the p-value for the am coefficient is too large, both in the linear regression and in the ridge regression, to give a definite answer.
Quantify the MPG difference between automatic and manual transmissions. The estimate below assumes that the mpg values in mtcars are normally distributed and that the difference in MPG depends only on transmission type, setting aside all the other relevant predictors. The corresponding t-test for this estimate can be found in the Appendix.

# In mtcars, am == 0 is automatic and am == 1 is manual
auto <- mtcars[mtcars$am == 0, ]$mpg
manual <- mtcars[mtcars$am == 1, ]$mpg
print(paste0('The difference in MPG between manual and automatic transmissions = ', mean(manual) - mean(auto)))

## [1] "The difference in MPG between manual and automatic transmissions = 7.24493927125506"

But, as we know, the other relevant predictors cannot be set aside, so this estimate is not confirmed once they are taken into account.
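To make the gap between the raw and the adjusted estimate concrete, the sketch below fits mpg on am alone and then on the four predictors used earlier, and prints the am coefficient from each fit. The names fit_unadj, fit_adj and am_effect are introduced here for illustration.

```r
df <- mtcars

# am alone: the coefficient is the raw difference in group means
fit_unadj <- lm(mpg ~ am, data = df)

# am together with cyl, hp and wt, as in the linear model above
fit_adj <- lm(mpg ~ cyl + hp + wt + am, data = df)

am_effect <- c(unadjusted = unname(coef(fit_unadj)["am"]),
               adjusted   = unname(coef(fit_adj)["am"]))
print(round(am_effect, 3))
```

The unadjusted coefficient reproduces the difference in means, while the adjusted one matches the am estimate in the regression summary.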

Appendix.

Linear regression with all the regressors included.

fitLr_1 <- lm(mpg ~ ., df)
summary(fitLr_1)

## 
## Call:
## lm(formula = mpg ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

None of the coefficients in this model is significant at the 0.05 level, and the p-value for am is above 0.2 (0.2340). The hypothesis that the am coefficient equals zero cannot be rejected, so this model cannot answer the question of interest either.
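One possible reason why a model with an R-squared of 0.869 has no significant coefficients is that the predictors are strongly correlated with one another. The sketch below computes variance inflation factors by hand, regressing each predictor on all the others and taking 1 / (1 - R^2). This check is an addition and not part of the original analysis; the names predictors and vif are introduced here.

```r
df <- mtcars
predictors <- setdiff(names(df), "mpg")

# VIF of a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the remaining predictors
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  fit <- lm(reformulate(others, response = p), data = df)
  1 / (1 - summary(fit)$r.squared)
})

print(round(sort(vif, decreasing = TRUE), 2))
```

Large values mean that a coefficient's standard error is inflated by collinearity, which is the situation ridge regression is designed to mitigate.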

Let’s take a look at the boxplot of the am data

boxplot(mpg ~ am, data=df, col=c('blue', 'green'),
        xlab='Transmission (0 = Automatic, 1 = Manual)',
        ylab='Miles per Gallon', main='Boxplot of MPG vs. Transmission')

Boxplot of mileage by cylinder number.

boxplot(mpg ~ cyl, data=df, col=c('red', 'green', 'blue'),
        xlab='number of cylinders', ylab='miles per gallon',
        main='Mileage by Cylinder number')

Scatterplot matrix

pairs(mpg ~ ., data = df)

t-test for mpg by transmission type (automatic vs. manual):

t.test(mpg ~ am, data = df)

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
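The sign of the t statistic depends on the order of the groups: group 0 (automatic) comes first, so the reported difference is automatic minus manual. The sketch below pulls the pieces out of the test object to show this. The names tt and manual_minus_auto are introduced here for illustration.

```r
df <- mtcars

# Welch two-sample t-test, the default for t.test()
tt <- t.test(mpg ~ am, data = df)

# Group means: first element is am = 0 (automatic), second is am = 1 (manual)
print(tt$estimate)

# Difference expressed as manual minus automatic
manual_minus_auto <- unname(diff(tt$estimate))
print(manual_minus_auto)

# Confidence interval for automatic minus manual, as reported by t.test()
print(tt$conf.int)
```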

Residual plots for the chosen linear regression model

par(mfrow = c(2,2))
plot(fitLr)

Coursera - Regression Models Course Project

Andrei Keino

July 9, 2018