Coursera Regression Models - Project

Executive Summary

In the data cleaning section, I transform the am attribute into a factor. In the Exploratory Data Analysis section, I use a boxplot to show that there might be a significant difference in MPG between automatic and manual cars.

In the Linear Regression Models section, I fit the data with a simple linear regression and a multivariate regression, and find that the multivariate regression works better. The result is verified in the Residual Analysis section.

Data Loading and Cleaning

data("mtcars")
mtcars$am <- as.factor(mtcars$am)

Exploratory Data Analysis and Inference

Load the ggplot2 library and draw a boxplot of MPG by transmission type.
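A minimal sketch of that boxplot (the original plotting chunk is not shown, so the exact aesthetics are an assumption):

library(ggplot2)

# Boxplot of MPG by transmission type; am is already a factor
# with levels 0 (automatic) and 1 (manual)
ggplot(mtcars, aes(x = am, y = mpg, fill = am)) +
  geom_boxplot() +
  labs(x = "Transmission (0 = automatic, 1 = manual)",
       y = "Miles per gallon")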

Notice that the 25th percentile of manual cars' MPG is greater than the 75th percentile of automatic cars' MPG. So we might hypothesize that manual cars get better mileage than automatic cars, and we can use a t-test to test this hypothesis.

tTest <- t.test(mpg ~ factor(am), data = mtcars)
tTest
## 
##  Welch Two Sample t-test
## 
## data:  mpg by factor(am)
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

The p-value (0.001374) is well below 0.05, so we reject the null hypothesis in favor of the alternative: the true difference in means is not equal to 0.

Linear Regression Models

Simple Linear Regression

simpleFit <- lm(formula = mpg ~ factor(am), data = mtcars)
summary(simpleFit)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

We can see that the multiple R-squared is 0.3598 and the adjusted R-squared is 0.3385: transmission type alone explains only about a third of the variance in mpg, so the simple linear model is not enough to explain the underlying relationship between the outcome and the predictors.

Multivariate Linear Models

As discussed in the previous section, we need a model with more predictors.

multiVFit <- lm(mpg ~., data = mtcars)
summary(multiVFit)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am1          2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

The adjusted R-squared is now 0.8066, which means this model explains more than 80 percent of the variance in mpg. The F-statistic's p-value (3.793e-07) shows that the model as a whole is significant.
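As an extra check (not part of the original analysis), the two models are nested, so a nested-model ANOVA can confirm that the additional predictors significantly improve the fit:

# F-test comparing the transmission-only model to the full model
anova(simpleFit, multiVFit)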

Residual Analysis

I select the multivariate fit as the best model.

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(multiVFit)       # standard lm diagnostics: Residuals vs Fitted, Normal Q-Q,
                      # Scale-Location, Residuals vs Leverage

Figure 1 (Residuals vs Fitted) supports the independence condition: the residuals show no obvious pattern against the fitted values. Figure 2 (Normal Q-Q) shows that the residuals are approximately normally distributed. Figure 3 (Scale-Location) shows that the variance is roughly constant. Figure 4 (Residuals vs Leverage) shows that there may be some outliers we might be interested in.
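To follow up on those potential outliers, a quick sketch (not in the original analysis) using base R's influence measures:

# Cars with the highest leverage and overall influence on the fit
head(sort(hatvalues(multiVFit), decreasing = TRUE), 3)
head(sort(abs(dffits(multiVFit)), decreasing = TRUE), 3)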