This project explores the relationship between miles per gallon (MPG) and the other variables in the mtcars data set. The analysis attempts to determine whether an automatic or a manual transmission is better for MPG and quantifies the difference. First, I focus on inference with a simple linear regression model and then with a multiple regression model. The simple model shows that cars with manual transmissions have, on average, significantly higher MPG than cars with automatic transmissions: the mean difference is 7.245 MPG, with automatics averaging 17.147 MPG and manuals averaging 24.392 MPG. The multiple regression model then adjusts this estimate for other variables such as weight and horsepower.
library(ggplot2)
library(gridExtra)
library(GGally)
library(reshape2)
library(ggcorrplot)
First, we take a look at the data.
my_data <- mtcars
head(my_data, 5)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
str(my_data)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
table(is.na(my_data))
##
## FALSE
## 352
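If we want a per-column breakdown of missingness, a quick sketch would be:
# Count NAs in each column (all counts should be zero here)
sapply(my_data, function(x) sum(is.na(x)))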
There are no missing values, but several variables coded as numeric (am, cyl, vs, gear, carb) are really categorical and should be converted to factors. The scatter-plot matrix below also suggests that disp correlates strongly with hp and wt.
my_data$am <- as.factor(my_data$am)
my_data$cyl <- as.factor(my_data$cyl)
my_data$vs <- as.factor(my_data$vs)
my_data$gear <- as.factor(my_data$gear)
my_data$carb <- as.factor(my_data$carb)
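Equivalently, the conversion can be written more compactly:
# Convert all categorical columns to factors in one step
factor_cols <- c("am", "cyl", "vs", "gear", "carb")
my_data[factor_cols] <- lapply(my_data[factor_cols], as.factor)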
my_fn <- function(data, mapping, pts=list(), smt=list(), ...){
ggplot(data = data, mapping = mapping, ...) +
do.call(geom_point, pts) +
do.call(geom_smooth, smt) +
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))
}
ggpairs(my_data[,sapply(my_data,is.numeric)],
lower = list(continuous =
wrap(my_fn,
pts=list(colour="black"),
smt=list(method="loess", se=F, colour="blue"))),
upper = list(continuous =
wrap(my_fn,
pts=list(colour="black"),
smt=list(method="loess", se=F, colour="blue"))))
The correlation heat map shows these relationships more clearly. Because disp is strongly correlated with hp and wt, we will not use it in the regression, to avoid multicollinearity.
ggcorrplot(cor(my_data[,sapply(my_data,is.numeric)], method = c("pearson")), hc.order = TRUE, type = "lower", ggtheme = ggplot2::theme_gray, lab = TRUE) +
ggtitle("Pearson Linear Correlation")+
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))
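As a complementary check on multicollinearity, variance inflation factors can be inspected; a minimal sketch, assuming the car package is installed:
library(car)
# Fit on the raw mtcars so that all predictors are numeric;
# disp is expected to show a strongly inflated value
vif(lm(mpg ~ disp + hp + drat + wt + qsec, data = mtcars))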
Next we take a closer look at the distribution of the numeric variables. The red line is the median and the blue line is the mean.
temp <- my_data[,sapply(my_data,is.numeric)]
longtemp <- melt(temp, measure.vars = names(temp))
vline.dat.mean <- aggregate(longtemp[,2], list(longtemp$variable), mean)
vline.dat.median <- aggregate(longtemp[,2], list(longtemp$variable), median)
names(vline.dat.mean)[1] <- "variable"
names(vline.dat.median)[1] <- "variable"
ggplot(longtemp,aes(x=value))+
geom_histogram(aes(y = ..density..), colour = "blue", fill = "darkgrey")+
geom_density()+
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))+
geom_vline(aes(xintercept = x), data = vline.dat.mean, linetype = "longdash", color = "blue")+
geom_vline(aes(xintercept = x), data = vline.dat.median, linetype = "longdash", color = "red")+
xlab("")+
facet_wrap(~ variable, ncol = 3, scales = "free")
Since the mean and median of mpg are close to each other, the mean is a reasonable split point, so we create a new variable indicating whether each car's mpg is above or below average. We can see that disp, hp, wt and drat separate the above- and below-average groups quite well.
my_temp <- ifelse(my_data$mpg > mean(my_data$mpg), "Above", "Below")
longtemp <- cbind(longtemp, my_temp)
names(longtemp) <- c("variable", "value", "avg")
vline.dat.mean <- aggregate(longtemp[,2], list(longtemp$variable), mean)
vline.dat.median <- aggregate(longtemp[,2], list(longtemp$variable), median)
names(vline.dat.mean)[1] <- "variable"
names(vline.dat.median)[1] <- "variable"
ggplot(longtemp,aes(x=value))+
geom_histogram(aes(y = ..density.., fill = avg), colour = "black")+
geom_density() +
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))+
geom_vline(aes(xintercept = x), data = vline.dat.mean, linetype = "longdash", color = "blue")+
geom_vline(aes(xintercept = x), data = vline.dat.median, linetype = "longdash", color = "red")+
xlab("")+
facet_wrap(~ variable, ncol = 3, scales = "free")
Next we take a look at the factor variables. Each of them appears to split the cars into groups with clearly different mpg distributions.
temp <- my_data[,!sapply(my_data,is.numeric)]
temp <- cbind(temp, mpg = my_data[,"mpg"])
longtemp <- melt(temp, measure.vars = 1:5)
ggplot(longtemp, aes(x=value, y=mpg))+
xlab("")+
ylab("")+
stat_boxplot(geom ='errorbar')+
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(colour=mpg), position = position_jitter(width = 0.2)) +
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))+
facet_wrap(~ variable, ncol = 2, scales = "free")
What about separating mpg into above and below average? We can see the same pattern emerge.
temp <- my_data[,!sapply(my_data,is.numeric)]
temp <- cbind(temp, avg = my_temp)
longtemp <- melt(temp, measure.vars = 1:5)
ggplot(longtemp, aes(x=value))+
geom_bar(aes(fill = avg), colour="black") +
facet_wrap(~ variable, ncol = 2, scales = "free")+
theme(axis.line = element_line(), axis.text=element_text(color='black'), axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text(), legend.key = element_rect(colour = "black"))
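The same split can also be summarised numerically; for example, for transmission type:
# Counts of above/below-average mpg within each transmission group
table(transmission = my_data$am, mpg_vs_mean = my_temp)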
Next we quantify the MPG difference between automatic and manual transmissions.
aggregate(mpg~am, data = my_data, mean)
## am mpg
## 1 0 17.14737
## 2 1 24.39231
The group means suggest that automatic cars average about 7.25 MPG less than manual cars.
To determine whether this difference is significant, we use a Welch two-sample t-test. The p-value is 0.001374, so the difference is statistically significant at the 0.05 level.
t.test(my_data$mpg[my_data$am == "0"], my_data$mpg[my_data$am == "1"])
##
## Welch Two Sample t-test
##
## data: my_data$mpg[my_data$am == "0"] and my_data$mpg[my_data$am == "1"]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
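To express this difference on a standardized scale, Cohen's d can be computed from the two groups; a minimal sketch:
# Pooled-SD effect size for the automatic vs. manual comparison
auto_mpg <- my_data$mpg[my_data$am == "0"]
manual_mpg <- my_data$mpg[my_data$am == "1"]
pooled_sd <- sqrt(((length(auto_mpg) - 1) * var(auto_mpg) +
                   (length(manual_mpg) - 1) * var(manual_mpg)) /
                  (length(auto_mpg) + length(manual_mpg) - 2))
(mean(manual_mpg) - mean(auto_mpg)) / pooled_sd  # Cohen's d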
Out of interest we will test the other factor variables as well.
t.test(my_data$mpg[my_data$vs == "0"], my_data$mpg[my_data$vs == "1"])
##
## Welch Two Sample t-test
##
## data: my_data$mpg[my_data$vs == "0"] and my_data$mpg[my_data$vs == "1"]
## t = -4.6671, df = 22.716, p-value = 0.0001098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.462508 -4.418445
## sample estimates:
## mean of x mean of y
## 16.61667 24.55714
pairwise.t.test(my_data$mpg, my_data$cyl)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: my_data$mpg and my_data$cyl
##
## 4 6
## 6 0.00024 -
## 8 2.6e-09 0.00415
##
## P value adjustment method: holm
pairwise.t.test(my_data$mpg, my_data$gear)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: my_data$mpg and my_data$gear
##
## 3 4
## 4 0.00022 -
## 5 0.07684 0.21832
##
## P value adjustment method: holm
For carb, levels 6 and 8 contain only one car each, so we restrict the pairwise comparisons to levels 1 through 4.
pairwise.t.test(my_data$mpg[my_data$carb %in% c("1","2","3","4")], my_data$carb[my_data$carb %in% c("1","2","3","4")])
##
## Pairwise comparisons using t tests with pooled SD
##
## data: my_data$mpg[my_data$carb %in% c("1", "2", "3", "4")] and my_data$carb[my_data$carb %in% c("1", "2", "3", "4")]
##
## 1 2 3
## 2 0.4687 - -
## 3 0.0514 0.2102 -
## 4 0.0032 0.0285 0.8757
##
## P value adjustment method: holm
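As a cross-check on the cyl comparison, the same question can also be framed as a one-way ANOVA; a short sketch:
# One-way ANOVA of mpg across the cylinder groups
summary(aov(mpg ~ cyl, data = my_data))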
Next we fit a simple univariate model, regressing mpg on transmission type alone.
model_1 <- lm(mpg ~ am, data = my_data)
summary(model_1)
##
## Call:
## lm(formula = mpg ~ am, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
We can see that the average MPG for automatic cars is 17.1, while manual cars average 7.2 MPG more. The multiple R-squared is 0.36 (adjusted 0.34), so this model explains only about a third of the variance in mpg.
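A 95% confidence interval for the transmission coefficient can be read off directly from the fitted model, for example:
# Confidence intervals for the intercept and the am1 coefficient
confint(model_1)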
Because the univariate model explains so little of the variance, we fit a multiple regression model. Guided by the exploratory analysis above, it uses the variables hp, wt, drat, cyl, vs and am. This model explains about 84% of the variance (adjusted R-squared = 0.84).
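For comparison, an automated alternative would be backward selection by AIC from the full set of predictors; a sketch (not the method used to choose the variables above):
# Start from the full model and drop terms while AIC improves
full_model <- lm(mpg ~ ., data = my_data)
step(full_model, direction = "backward", trace = 0)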
model_2 <- lm(mpg ~ hp+wt+drat+cyl+vs+am, data = my_data)
summary(model_2)
##
## Call:
## lm(formula = mpg ~ hp + wt + drat + cyl + vs + am, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2298 -1.1331 -0.1287 0.9751 4.6213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.86178 6.61784 4.512 0.000144 ***
## hp -0.03532 0.01429 -2.471 0.020959 *
## wt -2.34132 0.91509 -2.559 0.017242 *
## drat 0.33645 1.43066 0.235 0.816069
## cyl6 -1.96167 1.74787 -1.122 0.272831
## cyl8 0.53513 3.36783 0.159 0.875081
## vs1 2.00638 1.79577 1.117 0.274933
## am1 2.58271 1.70905 1.511 0.143794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.444 on 24 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.8356
## F-statistic: 23.5 on 7 and 24 DF, p-value: 2.788e-09
par(mfrow = c(2,2)) # arrange all four diagnostic plots in one panel
plot(model_2)
par(mfrow = c(1,1))
We can see from the Q-Q plot that the normality assumption of our residuals is questionable.
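A formal check of this impression is possible with a Shapiro-Wilk test on the residuals; a minimal sketch:
# Test the null hypothesis that the residuals are normally distributed
shapiro.test(residuals(model_2))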
anova(model_1, model_2)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ hp + wt + drat + cyl + vs + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 24 143.35 6 577.55 16.116 2.319e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although the residual diagnostics are not perfect, the ANOVA p-value is well below our threshold of 0.05, so we conclude that model_2 fits significantly better than the simpler model_1. Note, however, that after adjusting for the other variables the estimated manual-transmission advantage shrinks to about 2.6 MPG and is no longer statistically significant (p = 0.144).
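As a final cross-check, the two models can also be compared with an information criterion:
# Lower AIC indicates the better trade-off between fit and complexity
AIC(model_1, model_2)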