library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
library(stats)
library(statsr)
We are looking at a data set of cars described by specifications such as MPG, number of cylinders, and so on. We need to answer two questions:
. Is an automatic or manual transmission better for MPG?
. Quantify the MPG difference between automatic and manual transmissions.
To figure this out, I will follow these steps:
1. Process and prepare the data.
2. Explore the data through visualization to get a sense of what I'm working with.
3. Model selection, to find which model predicts MPG best.
4. Model examination, to check whether my model holds up against the standard regression conditions.
5. Draw a conclusion based on my answers to the questions.
data(mtcars)
mtcarsdata<- mtcars
remove(mtcars)
mtcarsdata$am<- as.factor(mtcarsdata$am)
levels(mtcarsdata$am)<-c("Automatic", "Manual")
mtcarsdata$cyl<- as.factor(mtcarsdata$cyl)
mtcarsdata$gear<- as.factor(mtcarsdata$gear)
mtcarsdata$vs<- as.factor(mtcarsdata$vs)
levels(mtcarsdata$vs)<-c("V", "S")
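As a sanity check on the recoding above, here is a minimal sketch of an equivalent preparation step. It assumes the coding documented for `mtcars` (`am`: 0 = automatic, 1 = manual; `vs`: 0 = V-shaped, 1 = straight). The name `mt` is only an illustrative copy and is not used elsewhere in this report.

```r
# Illustrative copy of the built-in data set (the name `mt` is an assumption)
mt <- datasets::mtcars

# factor() with explicit levels and labels states the 0/1 mapping directly,
# instead of relying on the order in which levels() happens to return them
mt$am   <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
mt$vs   <- factor(mt$vs, levels = c(0, 1), labels = c("V", "S"))
mt$cyl  <- factor(mt$cyl)
mt$gear <- factor(mt$gear)

# Quick checks: column types and group sizes
str(mt[, c("am", "vs", "cyl", "gear")])
am_counts <- table(mt$am)
print(am_counts)
```

The group sizes printed here should agree with the `n_Automatic` and `n_Manual` values reported by the hypothesis test later on.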
. Exploring the first rows and the dimensions of the data
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
dim(mtcars)
## [1] 32 11
Visualising the relationship between the columns that I’m interested in:
ggplot(data=mtcarsdata, aes(am, mpg)) +
geom_boxplot(aes(fill= am))
We can see a difference between the two types: manual transmission cars have a higher MPG (that is, lower fuel consumption) than automatic ones. We don't yet know whether the difference is statistically significant, so I run a two-sided t-test comparing the two group means at the 5% significance level.
#Hypothesis Test
inference(mpg, am, data = mtcarsdata, statistic = "mean", alternative = "twosided", type = "ht", method = "theoretical")
## Warning: Missing null value, set to 0
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Automatic = 19, y_bar_Automatic = 17.1474, s_Automatic = 3.834
## n_Manual = 13, y_bar_Manual = 24.3923, s_Manual = 6.1665
## H0: mu_Automatic = mu_Manual
## HA: mu_Automatic != mu_Manual
## t = -3.7671, df = 12
## p_value = 0.0027
This t-test shows a statistically significant difference between the two types of transmission (p-value = 0.0027), but the picture is not yet crystal clear, as the data might be biased or the sample size might not be large enough.
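The test above can be cross-checked, and the difference quantified, with base R. This is a rough sketch rather than a replacement for the `inference()` call. `t.test()` uses the Welch approximation for the degrees of freedom, whereas the output above reports df = 12 (which matches the smaller group size minus one), so the p-value will not be identical even though the t statistic is. The names `mt`, `group_means`, `mean_diff` and `welch` are illustrative.

```r
# Illustrative copy of the data with a labelled transmission factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Group means and their difference (Manual minus Automatic)
group_means <- tapply(mt$mpg, mt$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# Welch two-sample t-test (var.equal = FALSE is the default)
welch <- t.test(mpg ~ am, data = mt)

print(round(group_means, 4))
print(round(mean_diff, 4))
print(welch$statistic)
# 95% confidence interval for mean(Automatic) - mean(Manual)
print(welch$conf.int)
```

The confidence interval gives a range for the unadjusted MPG gap, which speaks directly to the second question.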
To choose the best regression model for this case, I have to include all the predictors that might influence MPG besides the transmission type. I use backward elimination based on the p-value: start from the full model and remove the least significant predictor one at a time.
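A single backward step can be sketched with `drop1()`, which reports an F-test for removing each term in turn. Unlike reading the coefficient table, this tests a factor such as `cyl` as a whole rather than level by level. This is only one possible way to automate the step, and the names `mt`, `full`, `tab` and `weakest` are illustrative.

```r
# Illustrative copy of the data with the factors used in the models
mt <- datasets::mtcars
mt$am  <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
mt$cyl <- factor(mt$cyl)

# Full model with all candidate predictors
full <- lm(mpg ~ am + wt + cyl + hp + drat + disp, data = mt)

# F-test for dropping each term; one row per term plus a "<none>" row
tab <- drop1(full, test = "F")
print(tab)

# The term with the largest p-value is the candidate for removal
# (the "<none>" row has an NA p-value, which which.max() skips)
weakest <- rownames(tab)[which.max(tab[["Pr(>F)"]])]
print(weakest)
```

Refitting without the weakest term and repeating the call gives the next step of the elimination.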
Fitting the full model with all the candidate predictors:
model1<- lm(mpg ~ am + wt + cyl + hp + drat + disp , data= mtcarsdata)
summary(model1)
##
## Call:
## lm(formula = mpg ~ am + wt + cyl + hp + drat + disp, data = mtcarsdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8267 -1.4366 -0.4153 1.1649 5.0671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.611986 6.274227 5.198 2.52e-05 ***
## amManual 1.681130 1.554386 1.082 0.2902
## wt -2.726729 1.200207 -2.272 0.0323 *
## cyl6 -3.026760 1.576680 -1.920 0.0669 .
## cyl8 -2.541967 3.059145 -0.831 0.4142
## hp -0.033038 0.014476 -2.282 0.0316 *
## drat 0.326616 1.471086 0.222 0.8262
## disp 0.004395 0.013090 0.336 0.7400
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.501 on 24 degrees of freedom
## Multiple R-squared: 0.8667, Adjusted R-squared: 0.8278
## F-statistic: 22.29 on 7 and 24 DF, p-value: 4.768e-09
I can see that the adjusted R-squared is 82.8%, which is already good, as it indicates roughly the percentage of variability in MPG that can be explained by the predictors. But I can improve my model by removing the predictor with the highest p-value, which is drat.
model2<- lm(mpg ~ am + wt + cyl + hp + disp , data= mtcarsdata)
summary(model2)
##
## Call:
## lm(formula = mpg ~ am + wt + cyl + hp + disp, data = mtcarsdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9374 -1.3347 -0.3903 1.1910 5.0757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.864276 2.695416 12.564 2.67e-12 ***
## amManual 1.806099 1.421079 1.271 0.2155
## wt -2.738695 1.175978 -2.329 0.0282 *
## cyl6 -3.136067 1.469090 -2.135 0.0428 *
## cyl8 -2.717781 2.898149 -0.938 0.3573
## hp -0.032480 0.013983 -2.323 0.0286 *
## disp 0.004088 0.012767 0.320 0.7515
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.453 on 25 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8344
## F-statistic: 27.03 on 6 and 25 DF, p-value: 8.861e-10
You can see that the adjusted R-squared has risen from 82.8% to 83.4%, which is an improvement. Next I try removing the predictor that now has the highest p-value, disp (piston displacement).
model3<-lm(formula = mpg ~ am + wt + cyl + hp , data = mtcarsdata)
summary(model3)
##
## Call:
## lm(formula = mpg ~ am + wt + cyl + hp, data = mtcarsdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## amManual 1.80921 1.39630 1.296 0.20646
## wt -2.49683 0.88559 -2.819 0.00908 **
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
The adjusted R-squared rises again, to 84.0%, so I take model3 as the best model for predicting MPG.
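The three fits can be compared side by side instead of reading each summary separately. The sketch below refits them on an illustrative copy of the data (`mt`), collects the adjusted R-squared values, and adds a nested F-test and a confidence interval for the transmission coefficient. The F-test and the interval are additions for illustration and were not part of the selection procedure above.

```r
# Illustrative copy of the data with the factors used in the models
mt <- datasets::mtcars
mt$am  <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
mt$cyl <- factor(mt$cyl)

# The three models from the backward elimination
models <- list(
  model1 = lm(mpg ~ am + wt + cyl + hp + drat + disp, data = mt),
  model2 = lm(mpg ~ am + wt + cyl + hp + disp, data = mt),
  model3 = lm(mpg ~ am + wt + cyl + hp, data = mt)
)

# Adjusted R-squared of each model
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))

# Nested F-test: does dropping drat and disp lose explanatory power?
print(anova(models$model3, models$model1))

# Transmission effect in the final model with a 95% confidence interval
am_coef <- coef(models$model3)[["amManual"]]
print(am_coef)
print(confint(models$model3, "amManual"))
```

A large p-value in the nested F-test would support keeping the smaller model.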
To examine the model, I check the residuals of model3:
residuals<-residuals(model3)
plot(residuals~mtcarsdata$hp)
plot(residuals ~ model3$fitted.values, xlab = "Fitted values", ylab = "Residuals")
abline(h=0)
mean(residuals) # near to ZERO
## [1] 8.326673e-17
hist(residuals ,col=2)
qqnorm(residuals, col= 22)
qqline(residuals)
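The visual checks above can be complemented with a formal normality test and R's built-in diagnostic plots for `lm` objects. This is a hedged sketch: the Shapiro-Wilk test is an addition that the analysis above does not use, and with only 32 observations it has limited power. The names `mt`, `res` and `sw` are illustrative.

```r
# Illustrative copy of the data and a refit of the final model
mt <- datasets::mtcars
mt$am  <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
mt$cyl <- factor(mt$cyl)
model3 <- lm(mpg ~ am + wt + cyl + hp, data = mt)

# Use a name that does not shadow the residuals() function
res <- residuals(model3)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(res)
print(sw)

# Built-in lm diagnostics: residuals vs fitted (1) and normal Q-Q (2)
op <- par(mfrow = c(1, 2))
plot(model3, which = 1:2)
par(op)
```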
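For the first and second questions, I can tell there is a difference between the two types of transmission, and this difference appears in the hypothesis test I did above: manual cars average 24.39 MPG against 17.15 MPG for automatic cars, a gap of about 7.2 MPG in favour of manual transmission. In the regression model above, once weight, number of cylinders and horsepower are accounted for, the estimated manual advantage shrinks to about 1.8 MPG and is not statistically significant (p-value = 0.206).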
The model could still be improved, for example by increasing the sample size so that the residuals are closer to normal and less skewed.