You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, the magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions.”
# loading the data
data(mtcars)
# Renaming the am variable to Transmission (0 = automatic, 1 = manual)
names(mtcars)[names(mtcars) == 'am'] <- 'Transmission'
mtcars$Transmission <- ifelse(mtcars$Transmission == 0, 'Automatic', 'Manual')
mtcars$Transmission <- as.factor(mtcars$Transmission)
knitr::kable(mtcars,align='c')
|  | mpg | cyl | disp | hp | drat | wt | qsec | vs | Transmission | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | Manual | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | Manual | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | Manual | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | Automatic | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | Automatic | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | Automatic | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | Automatic | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | Automatic | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | Automatic | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | Automatic | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | Automatic | 4 | 4 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | Automatic | 3 | 3 |
| Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | Automatic | 3 | 3 |
| Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | Automatic | 3 | 3 |
| Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | Automatic | 3 | 4 |
| Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | Automatic | 3 | 4 |
| Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | Automatic | 3 | 4 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | Manual | 4 | 1 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | Manual | 4 | 2 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | Manual | 4 | 1 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | Automatic | 3 | 1 |
| Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | Automatic | 3 | 2 |
| AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | Automatic | 3 | 2 |
| Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | Automatic | 3 | 4 |
| Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | Automatic | 3 | 2 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | Manual | 4 | 1 |
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | Manual | 5 | 2 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | Manual | 5 | 2 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | Manual | 5 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | Manual | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | Manual | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | Manual | 4 | 2 |
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## Transmission gear carb
## Automatic:19 Min. :3.000 Min. :1.000
## Manual :13 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :2.000
## Mean :3.688 Mean :2.812
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :8.000
We run a two-sample t-test to determine whether the two population means are equal; the one-sided alternative ('less') tests whether the mean MPG of automatic cars is lower than that of manual cars.
t.test( mpg ~ Transmission, paired=F, var.equal=F, data=mtcars, alternative='less')
##
## Welch Two Sample t-test
##
## data: mpg by Transmission
## t = -3.7671, df = 18.332, p-value = 0.0006868
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -3.913256
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
The small p-value indicates that the test is significant: we reject the null hypothesis that the two means are equal and accept the alternative hypothesis that the mean MPG of the Automatic group is less than the mean MPG of the Manual group.
Since our outcome variable is continuous and not binary, we are going to use linear regression models.
fit <- lm(mpg~Transmission,data = mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ Transmission, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## TransmissionManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Our initial model, which uses just one predictor, explains only 36% of the variability in mpg, which is quite low, as seen from the R-squared value. This indicates that we should include more predictors in our model, since they may be related to the variability observed in mpg.
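Note that with a single two-level factor, the regression simply reproduces the group means from the t-test: the intercept is the Automatic mean and the intercept plus the slope is the Manual mean. A quick sketch to confirm this, reusing the fit object from above:
# the coefficients recover the per-group means
coef(fit)[1]                                  # mean mpg, Automatic (17.15)
coef(fit)[1] + coef(fit)[2]                   # mean mpg, Manual (24.39)
tapply(mtcars$mpg, mtcars$Transmission, mean) # same means, computed directly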
Since one variable is not sufficient, we are going to examine all the variables available in the dataset and look for correlations between them and the outcome variable, mpg.
require(GGally)
# g <- ggpairs(mtcars, lower = list(continuous = wrap('smooth')))
# pairs plot: scatterplots with a smoother in the upper panels,
# correlations in the lower panels, densities on the diagonal
ggpairs(mtcars,
        upper = list(continuous = wrap('smooth')),
        lower = list(continuous = 'cor'),
        diag = list(continuous = 'density'),
        axisLabels = 'none')
## alternative code similar results (much faster)
# pairs(~., data = mtcars)
Looking at our pairs plot we can visually observe many connections and correlations between our variables. Unfortunately we can also see high correlation values between the predictors, which may bias our analysis if we omit them. These high correlations also suggest that including additional regressors would inflate the variance of the coefficient estimates (a high variance inflation factor). Despite this consequence we might still want to include certain variables, even if they dramatically inflate our variance.
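As a numeric companion to the plot, the correlation of each numeric variable with mpg can be listed directly (a quick sketch; Transmission is excluded because it is now a factor):
# correlation of every numeric variable with mpg, strongest first
num <- sapply(mtcars, is.numeric)
sort(cor(mtcars[, num])[, 'mpg'], decreasing = TRUE)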
# fitting a model with all the predictors
all <- lm(mpg~., data=mtcars)
#summary(all)
require(car)
## Loading required package: car
vif(all)
## cyl disp hp drat wt
## 15.373833 21.620241 9.832037 3.374620 15.164887
## qsec vs Transmission gear carb
## 7.527958 4.965873 4.648487 5.357452 7.908747
The results above confirm what the pairs plot suggested: high correlation between the variables (also known as multicollinearity), which causes variance inflation, as seen from the high VIF values (values greater than 10). Multicollinearity increases the standard errors of the coefficients, which in turn means that coefficients for some independent variables may be found not to be significantly different from 0. In other words, by inflating the standard errors, multicollinearity can make variables appear statistically insignificant when they should be significant. Note: the VIF of the ith regressor is the factor by which its variance increases compared to the ideal setting where it is orthogonal to the other regressors.
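As a sanity check, a VIF can be computed by hand: regress one predictor on all the others and take 1/(1 - R^2). A minimal sketch for the wt predictor, which should reproduce vif(all)['wt'] above (about 15.16):
# manual VIF for wt: regress wt on the remaining predictors (mpg excluded)
r2.wt <- summary(lm(wt ~ . - mpg, data = mtcars))$r.squared
1 / (1 - r2.wt)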
One remedy is to remove highly correlated predictors from the model: if two or more predictors have a high VIF, remove one of them. Because they supply redundant information, removing one of the correlated predictors usually does not drastically reduce the R-squared. Another option is stepwise regression, which is the method we use below to arrive at our final model.
Moving on to linear models with more regressors. First I define a model with all the regressors and then find the most influential ones (predictors that make a significant contribution to the variability explained). We use the step function with direction “backward” to do this; see ?step in R for details.
all <- lm(mpg~., data=mtcars)
# stepping backwards through the model, removing regressors that don't contribute to the variability explained
step.model <- step(all, direction="backward")
#summary(step.model)
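The formula chosen by the backward search can be inspected directly; on this data it keeps just weight, quarter-mile time and transmission type:
# predictors retained by backward elimination
formula(step.model)   # mpg ~ wt + qsec + Transmission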
For our analysis we use the adjusted R-squared, a modified version of R-squared that accounts for the number of predictors in the model. This is mainly because every time you add a predictor to a model the R-squared increases, even if only by chance; it never decreases. Consequently, a model with more terms may appear to have a better fit simply because it has more terms.
The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.
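For reference, the adjusted R-squared is 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors; a quick sketch verifying it against summary() for the stepwise model:
# recompute the adjusted R-squared by hand
r2 <- summary(step.model)$r.squared
n <- nrow(mtcars)
p <- length(coef(step.model)) - 1      # number of predictors, intercept excluded
1 - (1 - r2) * (n - 1) / (n - p - 1)   # matches summary(step.model)$adj.r.squared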
data.frame(all.model = summary(all)$adj.r.squared, step.model = summary(step.model)$adj.r.squared, row.names = 'R squared')
## all.model step.model
## R squared 0.8066423 0.8335561
The all.model fit uses every available regressor (10 predictors) but still performs worse than our stepwise model, which uses just 3 predictors. This is easily seen from the adjusted R-squared values: the stepwise model explains 83% of the variability observed in the miles per gallon variable, compared to 81% for the model that uses all the predictors. Since fewer predictors are generally preferable in linear models, we choose the stepwise model for the rest of our analysis.
# for plotting all diagnostic plots
# for (i in 1:4){
# plot(step.model, which=i)
# }
plot(step.model, which = 1)
This plot shows if residuals have non-linear patterns. There could be a non-linear relationship between predictor variables and an outcome variable and the pattern could show up in this plot if the model doesn’t capture the non-linear relationship.
In our model the residuals are equally spread around a horizontal line without distinct patterns, which is a good indication that we don’t have non-linear relationships.
# Q-Q plot
plot(step.model, which=2)
This plot shows whether the residuals are normally distributed. Our residuals follow the straight line closely and do not deviate severely.
plot(step.model, which=3)
This plot, also called a Spread-Location plot, shows whether the residuals are spread equally along the range of the predictors; this is how you check the assumption of equal variance (homoscedasticity). In our model we see a horizontal line with equally (randomly) spread points, with the exception of a few outliers.
plot(step.model, which = 4)
This plot helps us find influential cases, if any. Not all outliers are influential in linear regression analysis: even though an observation has an extreme value, it might not affect the fitted regression line, meaning the results wouldn’t change much whether we include or exclude it. Such cases follow the trend of the majority of the data and don’t really matter. On the other hand, some cases can be very influential even if their values look reasonable: they go against the trend of the majority of the cases and can alter the results if excluded from the analysis.
In our case we can see that Chrysler Imperial, Merc 230 and Fiat 128 are influential, with Chrysler Imperial being the most extreme. The same cars stand out in the Scale-Location plot above and in the Residuals vs Leverage plot drawn below.
plot(step.model, which = 5)
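The influential cars flagged in these plots can also be extracted numerically from Cook’s distance (a quick sketch; the largest values correspond to the labelled points):
# three most influential observations by Cook's distance
head(sort(cooks.distance(step.model), decreasing = TRUE), 3)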
Using our stepwise model we reach the following conclusions. For our analysis we used 95% confidence intervals.
## [1] "We are 95% confident that a 1000 lb increase in weight will decrease the miles achieved with a gallon of gas by somewhere between 2.46 and 5.37, with an average decrease of 3.92 miles per gallon."
## [1] "We are 95% confident that each additional second of quarter-mile time (qsec, roughly 400 meters) will increase the miles achieved with a gallon of gas by somewhere between 0.63 and 1.82, with an average increase of 1.23 miles per gallon."
## [1] "We are 95% confident that using a manual transmission vehicle will increase the miles achieved with a gallon of gas by somewhere between 0.05 and 5.83, with an average increase of 2.94 miles per gallon."