EXECUTIVE SUMMARY
This report analyses the ‘mtcars’ dataset to explore the relationship between MPG (miles per gallon) and the other variables. The dataset covers 32 car models and a set of their design and performance parameters. Regression analysis and exploratory data analysis are used mainly to study the effect of transmission type (automatic or manual) on MPG. A t-test shows that the mean MPG of manual transmission cars is about 7.25 higher than that of automatic transmission cars. We then fit 4 linear regression models and select the one with the highest adjusted R-squared, the lowest residual standard error and the smallest RSS (residual sum of squares). The conditioning plots also suggest that, among lighter cars, those with a manual transmission tend to have higher MPG, while among heavier cars, those with an automatic transmission tend to have higher MPG.
Loading the Dataset & Necessary Libraries
data("mtcars")
library(ggplot2)
library(corrplot)
str(mtcars)
Converting the required numeric variables into factor variables for regression & exploratory data analysis
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
Exploratory Data Analysis
Boxplot of MPG Vs Transmission Types
boxplot(mpg~am,data = mtcars,xlab="Transmission Type ( Automatic=0, Manual=1 )",ylab="MPG",main="Effect of Transmission on MPG", col="Yellow")

Correlation Plot to Explain Relationship Between Variables
col1 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","white",
"cyan", "#007FFF", "blue","#00007F"))
col2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
"#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))
col3 <- colorRampPalette(c("red", "white", "blue"))
col4 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","#7FFF7F",
"cyan", "#007FFF", "blue","#00007F"))
wb <- c("white","black")
## using these color spectrums
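## cor() needs all-numeric columns, so use the original dataset rather than the factor-converted copy
corrplot(cor(datasets::mtcars), order="hclust", addrect=2, col=col1(100),method = "number")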
Conditioning Plot for MPG Vs Weight conditioned under Transmission Types
coplot(mpg~wt|am,data = mtcars,panel = panel.smooth, xlab = "Weight Graph 1-Automatic, Graph 2-All Transmissions, Graph 3-Manual",ylab = "MPG",columns = 2)

According to the box plot, manual transmissions yield higher MPG than automatic transmissions on average. The correlation plot shows strong correlations among the variables wt, hp, cyl, disp and carb.
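The boxplot comparison can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it works on a fresh copy of the dataset (the name `cars_raw` is an assumption made here) and summarises MPG by transmission type.

```r
# Fresh copy, so the factor-converted mtcars used above is left untouched
cars_raw <- datasets::mtcars

# Mean and median MPG per transmission type (0 = automatic, 1 = manual)
mean_mpg <- tapply(cars_raw$mpg, cars_raw$am, mean)
median_mpg <- tapply(cars_raw$mpg, cars_raw$am, median)
print(rbind(mean = mean_mpg, median = median_mpg))

# Group sizes, useful when judging how much each box in the boxplot represents
group_sizes <- table(cars_raw$am)
print(group_sizes)
```

The group means printed here should agree with the sample estimates reported by the t-test in the next section.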
Statistical Inference
Null Hypothesis: The MPG values of automatic and manual transmission cars come from the same population, i.e. their mean MPG is equal.
We carry out a t-test to test the null hypothesis.
t.test(mpg~am,data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
As the p-value (p = 0.001374) is below 0.05, we reject the null hypothesis. Therefore, the MPG values of automatic and manual transmission cars are not from the same population. Further, the mean MPG of manual transmission cars is about 7.25 higher than that of automatic transmission cars.
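The difference of about 7.25 MPG can be read directly from the test object instead of from the printed output. This is a small sketch under the assumption that a fresh copy of the data is used (`cars_raw` and the other variable names are illustrative); note that `t.test` reports the interval for group 0 minus group 1, so the sign is flipped to express it as manual minus automatic.

```r
cars_raw <- datasets::mtcars

# Welch two-sample t-test, as in the report
tt <- t.test(mpg ~ am, data = cars_raw)

# estimate holds the two group means in order: group 0 (automatic), group 1 (manual)
diff_manual_minus_auto <- unname(tt$estimate[2] - tt$estimate[1])

# conf.int refers to mean(group 0) - mean(group 1); negate and reverse it
ci_manual_minus_auto <- rev(-as.numeric(tt$conf.int))

print(diff_manual_minus_auto)
print(ci_manual_minus_auto)
print(tt$p.value)
```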
Regression Analysis
We first regress MPG on all the other variables in the mtcars dataset.
MODEL 1
fit <- lm(mpg ~ ., data=mtcars)
summary(fit)
This model has a residual standard error of 2.833 on 15 degrees of freedom (matching the 15 residual degrees of freedom and RSS of 120.40 listed for Model 1 in the ANOVA table) and an adjusted R-squared of 0.779, which means that the model explains about 78% of the variance of MPG. However, most of the coefficients are not significant at the 0.05 significance level.
Therefore, we now try another model that keeps only the most influential variables, chosen by stepwise selection with a BIC penalty (k = log(n)).
MODEL 2
fit1 <- step(fit,k=log(nrow(mtcars)))
summary(fit1)
The above model identifies weight, quarter mile time and transmission as the influential predictors. It has a residual standard error of 2.459 on 28 degrees of freedom and an improved adjusted R-squared of 0.8336, i.e. the model explains about 83.36% of the variance of MPG. All of the predictor coefficients are significant at the 0.05 significance level.
The conditioning and correlation plots above suggest a close interaction between the weight and transmission variables. Hence, we account for it in the model as below.
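Setting `k = log(n)` in `step()` replaces the default AIC penalty with the BIC penalty, which favours smaller models. The sketch below illustrates this on a fresh, factor-converted copy of the data; the names `cars`, `full_model` and `reduced_model` are assumptions made for this example, and `trace = 0` only suppresses the step-by-step log.

```r
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)

full_model <- lm(mpg ~ ., data = cars)

# k = log(n) applies the BIC penalty per estimated parameter
n <- nrow(cars)
reduced_model <- step(full_model, k = log(n), trace = 0)

# Selected formula and the BIC of both models (lower is better)
print(formula(reduced_model))
print(c(full = BIC(full_model), reduced = BIC(reduced_model)))
```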
MODEL 3
fit2 <- lm(mpg ~ wt + qsec + am + wt:am, data=mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am + wt:am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5076 -1.3801 -0.5588 1.0630 4.3684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.723 5.899 1.648 0.110893
## wt -2.937 0.666 -4.409 0.000149 ***
## qsec 1.017 0.252 4.035 0.000403 ***
## am1 14.079 3.435 4.099 0.000341 ***
## wt:am1 -4.141 1.197 -3.460 0.001809 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.084 on 27 degrees of freedom
## Multiple R-squared: 0.8959, Adjusted R-squared: 0.8804
## F-statistic: 58.06 on 4 and 27 DF, p-value: 7.168e-13
par(mfrow=c(2,2))
plot(fit2)

This model has a residual standard error of 2.084 on 27 degrees of freedom and an adjusted R-squared of 0.8804, which means that the model explains about 88% of the variance of MPG. All of the predictor coefficients are significant at the 0.05 significance level (only the intercept is not). This is a good fit.
As the transmission terms are strongly significant, we will try a simpler model that keeps only weight and its interaction with transmission.
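Because of the interaction term, the effect of a manual transmission depends on weight: it is the `am1` coefficient plus the `wt:am1` coefficient times the weight (in units of 1000 lbs). The sketch below, which refits the same formula on a fresh copy of the data (the names `cars`, `fit_int` and `manual_effect` are illustrative assumptions), evaluates that difference at a few weights and solves for the weight at which it changes sign.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)

fit_int <- lm(mpg ~ wt + qsec + am + wt:am, data = cars)
b <- coef(fit_int)

# Expected MPG of a manual car minus an automatic car of the same weight
# and quarter mile time, for weight w in 1000 lbs
manual_effect <- function(w) b[["am1"]] + b[["wt:am1"]] * w

# Weight at which the two transmission types are predicted to be equal
crossover_wt <- -b[["am1"]] / b[["wt:am1"]]

print(manual_effect(c(2, 3, 4)))
print(crossover_wt)
```

Below the crossover weight the model predicts higher MPG for manual cars, and above it higher MPG for automatic cars, which is the pattern noted in the conditioning plot.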
MODEL 4
fit3 <- lm(mpg~wt + wt:am, data = mtcars)
summary(fit3)
This model has a residual standard error of 3.05 on 29 degrees of freedom and an adjusted R-squared of 0.7439, which means that the model explains about 74% of the variance of MPG.
Selection of Model
anova(fit,fit1,fit2,fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ wt + qsec + am
## Model 3: mpg ~ wt + qsec + am + wt:am
## Model 4: mpg ~ wt + wt:am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 15 120.40
## 2 28 169.29 -13 -48.883 0.4685 0.911413
## 3 27 117.28 1 52.010 6.4795 0.022398 *
## 4 29 269.76 -2 -152.481 9.4982 0.002162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) -2.3807791 21.826884
## wt -4.3031019 -1.569960
## qsec 0.4998811 1.534066
## am1 7.0308746 21.127981
## wt:am1 -6.5970316 -1.685721
Conclusion of Regression Analysis
We select Model 3, which has the highest adjusted R-squared and the lowest residual standard error and residual sum of squares (RSS) of the four models.
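The three selection criteria can be collected in one table for a side-by-side comparison. This is a sketch rather than the original workflow: Model 2 is written out with the formula shown in the ANOVA table instead of being rerun through `step()`, and the names `cars`, `models` and `comparison` are assumptions made here.

```r
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)

models <- list(
  model1 = lm(mpg ~ ., data = cars),
  model2 = lm(mpg ~ wt + qsec + am, data = cars),
  model3 = lm(mpg ~ wt + qsec + am + wt:am, data = cars),
  model4 = lm(mpg ~ wt + wt:am, data = cars)
)

# Adjusted R-squared, residual standard error and residual sum of squares
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  rse    = sapply(models, sigma),
  rss    = sapply(models, deviance),
  row.names = names(models)
)
print(round(comparison, 4))
```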
Residual Analysis and Diagnostics
Please refer to the residual plots created for Model 3, which support the following assumptions:
1. The Residuals vs. Fitted plot shows no consistent pattern, supporting the independence assumption.
2. The Normal Q-Q plot indicates that the residuals are approximately normally distributed, because the points lie close to the line.
3. The Scale-Location plot supports the constant variance assumption, as the points are randomly distributed.
4. The Residuals vs. Leverage plot indicates that no influential outliers are present, as all values fall well within the 0.5 Cook's distance contours.
For dfbetas, we count the entries whose absolute value exceeds 1:
sum((abs(dfbetas(fit2)))>1)
## [1] 0
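The same check can be extended to show which cars come closest to being influential. The sketch below refits Model 3 on a fresh copy of the data (the names `cars`, `fit_int`, `top_leverage` and `top_cooks` are illustrative assumptions) and lists the observations with the largest leverage and Cook's distance alongside the dfbetas count.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)
fit_int <- lm(mpg ~ wt + qsec + am + wt:am, data = cars)

# One row per car, one column per coefficient
db <- dfbetas(fit_int)
n_flagged <- sum(abs(db) > 1)
print(n_flagged)

# Five cars with the largest leverage and the largest Cook's distance
top_leverage <- head(sort(hatvalues(fit_int), decreasing = TRUE), 5)
top_cooks <- head(sort(cooks.distance(fit_int), decreasing = TRUE), 5)
print(round(top_leverage, 3))
print(round(top_cooks, 3))
```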
Therefore, the above analyses meet the basic assumptions of linear regression and answer the questions as follows:
1) Manual transmission is better for MPG than automatic transmission.
2) The mean MPG of manual transmission cars is about 7.25 higher than that of automatic transmission cars.