In this report, I’ll examine several binary factors from the dataset to determine whether any of them has a significant effect on various numeric variables.
In each test, we will use a Levene Test to establish whether or not the samples have equal variance.
We will be using the t.test() function, since the degrees of freedom can be really difficult to calculate manually. This approach will provide us with cleaner and more readable code.
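For reference, the quantity that t.test() computes for us when var.equal = FALSE is the Welch statistic with the Welch-Satterthwaite approximation to the degrees of freedom. The notation below is assumed for illustration and does not appear elsewhere in this report: group means \bar{x}_1 and \bar{x}_2, sample standard deviations s_1 and s_2, and sample sizes n_1 and n_2.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
{\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
```

Because \nu is generally not a whole number, the Welch output can report fractional degrees of freedom.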
The original data source for this report can be found on Kaggle.
This dataset consists of data from the 1985 Ward’s Automotive Yearbook.
library(ggplot2)
library(car)
## Warning: package 'car' was built under R version 3.3.3
auto <- read.csv("Automobile_data.csv")
auto <- auto[,c("fuel.type", "engine.location", "curb.weight", "num.of.cylinders", "highway.mpg", "price")]
mpg.front <- auto[auto$engine.location %in% "front",]$highway.mpg
mpg.rear <- auto[auto$engine.location %in% "rear",]$highway.mpg
length(mpg.front)
## [1] 202
length(mpg.rear)
## [1] 3
boxplot(highway.mpg~engine.location,data=auto, main="Engine Location vs Fuel Efficiency",
xlab="Engine Location", ylab="HWY MPG")
# Refer to the column by name inside aes(); ggplot looks it up in `auto`
ggplot(auto, aes(x = highway.mpg, fill = engine.location)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Location of Engine"))
mean(mpg.rear)
## [1] 25
sd(mpg.rear)
## [1] 0
mean(mpg.front)
## [1] 30.83663
sd(mpg.front)
## [1] 6.901442
testLevene <- leveneTest(highway.mpg~engine.location, data=auto)
testLevene
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 4.7547 0.03037 *
## 203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The means for front and rear engines are 30.8 and 25.0, respectively.
Since Levene's test gives a small P value of 0.0304, we should not assume that the two variances are equal.
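To make the Levene step less of a black box, here is a minimal base-R sketch of the median-centred variant that the output above reports (center = median): a one-way ANOVA on each observation's absolute deviation from its group median. The helper name levene_median and the toy data are made up for illustration and are not part of the original analysis.

```r
# Median-centred Levene test written with base R only (illustrative helper)
levene_median <- function(y, g) {
  g <- factor(g)
  # Absolute deviation of each observation from its own group's median
  dev <- abs(y - ave(y, g, FUN = median))
  # One-way ANOVA on those deviations; its F test is the Levene statistic
  tab <- anova(lm(dev ~ g))
  c(F = tab[["F value"]][1], p = tab[["Pr(>F)"]][1])
}

# Toy data (made up): group "b" is far more spread out than group "a"
y <- c(10, 11, 12, 13, 14, 2, 8, 12, 18, 25)
g <- rep(c("a", "b"), each = 5)
res <- levene_median(y, g)
res
```

A small P value from this test is evidence that the groups differ in spread, which is why it is used here to choose the var.equal argument.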
testTSTAT <- t.test(mpg.front, mpg.rear, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)
testTSTAT
##
## Welch Two Sample t-test
##
## data: mpg.front and mpg.rear
## t = 12.02, df = 201, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.879142 6.794125
## sample estimates:
## mean of x mean of y
## 30.83663 25.00000
Our analysis gave us a t statistic of 12.0198, with 201 degrees of freedom. The final P value was 1.9785798 × 10^{-25}. Because this is far below our 5% threshold, we reject the null hypothesis that front- and rear-engine cars have the same mean highway MPG.
However, it’s important to keep in mind that the “rear” sample contains only three cars, all with the same highway MPG (standard deviation 0), so these results may be inaccurate.
wt.front <- auto[auto$engine.location %in% "front",]$curb.weight
wt.rear <- auto[auto$engine.location %in% "rear",]$curb.weight
length(wt.front)
## [1] 202
length(wt.rear)
## [1] 3
boxplot(curb.weight~engine.location,data=auto, main="Engine Location vs Curb Weight", xlab="Engine Location", ylab="Curb Weight")
# Refer to the column by name inside aes(); ggplot looks it up in `auto`
ggplot(auto, aes(x = curb.weight, fill = engine.location)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Location of Engine"))
mean(wt.rear)
## [1] 2770.667
sd(wt.rear)
## [1] 25.40341
mean(wt.front)
## [1] 2552.371
sd(wt.front)
## [1] 523.8769
testLevene2 <- leveneTest(curb.weight~engine.location, data=auto)
testLevene2
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 4.0878 0.04451 *
## 203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The means for front and rear engines are 2552.4 and 2770.7, respectively.
Since Levene's test gives a small P value of 0.0445, we should not assume that the two variances are equal.
# Compare the curb-weight vectors (not the MPG vectors from the previous test)
testTSTAT2 <- t.test(wt.front, wt.rear, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)
testTSTAT2
Computed from the means, standard deviations, and sample sizes printed above, the Welch t statistic for curb weight is approximately -5.5, with roughly 77 degrees of freedom. The P value is far below our 5% threshold, so we reject the null hypothesis that front- and rear-engine cars have the same mean curb weight.
However, it’s important to keep in mind that the small size of the “rear” sample might give us inaccurate results.
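A Welch test depends only on each group's mean, standard deviation, and size, so the comparisons in this report can be cross-checked from the summary statistics already printed. The sketch below assumes those printed values as inputs; the helper name welch_from_summary is hypothetical.

```r
# Welch two-sample t-test rebuilt from summary statistics (illustrative helper)
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                     # squared standard error, group 1
  v2 <- s2^2 / n2                     # squared standard error, group 2
  t_stat <- (m1 - m2) / sqrt(v1 + v2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  c(t = t_stat, df = df, p = 2 * pt(-abs(t_stat), df))
}

# Sanity check: highway MPG, front vs rear (values copied from the output above)
mpg_check <- welch_from_summary(30.83663, 6.901442, 202, 25, 0, 3)
mpg_check

# Curb weight, front vs rear (values copied from the output above)
wt_check <- welch_from_summary(2552.371, 523.8769, 202, 2770.667, 25.40341, 3)
wt_check
```

The MPG check should reproduce the t = 12.02 and df = 201 reported by t.test() earlier. The degrees of freedom equal 201 exactly because the rear group's standard deviation is 0, so only the front group contributes to the variance estimate.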
wt.gas <- auto[auto$fuel.type %in% "gas",]$curb.weight
wt.diesel <- auto[auto$fuel.type %in% "diesel",]$curb.weight
length(wt.gas)
## [1] 185
length(wt.diesel)
## [1] 20
# Labels describe fuel type, the grouping variable actually plotted
boxplot(curb.weight~fuel.type,data=auto, main="Fuel Type vs Curb Weight", xlab="Fuel Type", ylab="Curb Weight")
ggplot(auto, aes(x = curb.weight, fill = fuel.type)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Type of Fuel"))
mean(wt.gas)
## [1] 2518.459
sd(wt.gas)
## [1] 501.0002
mean(wt.diesel)
## [1] 2898.8
sd(wt.diesel)
## [1] 585.386
testLevene3 <- leveneTest(curb.weight~fuel.type, data=auto)
testLevene3
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.559 0.06065 .
## 203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
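The means for gas and diesel engines are 2518.5 and 2898.8, respectively.
Levene's test gives a P value of 0.0607, slightly above our 5% threshold, so equal variances cannot be rejected. Because the result is borderline, we take the more conservative route and do not assume that the two variances are equal.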
testTSTAT3 <- t.test(wt.gas, wt.diesel, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)
testTSTAT3
##
## Welch Two Sample t-test
##
## data: wt.gas and wt.diesel
## t = -2.797, df = 22.114, p-value = 0.01047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -662.26194 -98.41914
## sample estimates:
## mean of x mean of y
## 2518.459 2898.800
Our analysis gave us a t statistic of -2.797, with 22.1139 degrees of freedom. The final P value was 0.0104745. Because this is below our 5% threshold, we reject the null hypothesis that gas and diesel cars have the same mean curb weight.
However, it’s important to keep in mind that the small size of the “diesel” sample might give us inaccurate results. It’s also worth noting that the two standard deviations are similar in size (501.0 for gas, 585.4 for diesel) despite the very different sample sizes; a larger sample narrows the standard error of the mean, not the standard deviation itself.
Just out of curiosity, I’ve applied another t-test assuming equal variance. Notice that it also rejects the null hypothesis, with a noticeably smaller P value (0.0018 versus 0.0105).
t.test(wt.gas, wt.diesel, var.equal = TRUE, alternative = 'two.sided', paired=FALSE)
##
## Two Sample t-test
##
## data: wt.gas and wt.diesel
## t = -3.1715, df = 203, p-value = 0.001752
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -616.8008 -143.8803
## sample estimates:
## mean of x mean of y
## 2518.459 2898.800
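The gap between the two P values comes from how each test estimates the standard error and the degrees of freedom. As a rough sketch using only the summary statistics printed above, the equal-variance statistic can be rebuilt by hand; the helper name pooled_from_summary is hypothetical.

```r
# Equal-variance (pooled) two-sample t statistic from summary statistics
pooled_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  df <- n1 + n2 - 2
  # Pooled variance: each group's variance weighted by its degrees of freedom
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df
  se <- sqrt(sp2 * (1 / n1 + 1 / n2))
  t_stat <- (m1 - m2) / se
  c(t = t_stat, df = df, p = 2 * pt(-abs(t_stat), df))
}

# Gas vs diesel curb weight (values copied from the output above)
pooled <- pooled_from_summary(2518.459, 501.0002, 185, 2898.8, 585.386, 20)
pooled
```

Pooling lets the 185 gas cars dominate the variance estimate and keeps all 203 degrees of freedom, whereas the Welch version gives the more variable 20-car diesel group its own term and drops to about 22 degrees of freedom. Both effects make the pooled test look more decisive.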
# Caution: if price was read in as a factor, as.numeric() returns level codes, not prices
auto$price <- as.numeric(auto$price)
price.gas <- auto[auto$fuel.type %in% "gas",]$price
price.diesel <- auto[auto$fuel.type %in% "diesel",]$price
length(price.gas)
## [1] 185
length(price.diesel)
## [1] 20
# Labels describe fuel type and price, the variables actually plotted
boxplot(price~fuel.type,data=auto, main="Fuel Type vs Price", xlab="Fuel Type", ylab="Price")
ggplot(auto, aes(x = price, fill = fuel.type)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Type of Fuel"))
mean(price.gas)
## [1] 96.04865
sd(price.gas)
## [1] 55.29007
mean(price.diesel)
## [1] 85.3
sd(price.diesel)
## [1] 50.64541
testLevene4 <- leveneTest(price~fuel.type, data=auto)
testLevene4
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.4596 0.2284
## 203
The means for gas and diesel engines are 96.0 and 85.3, respectively.
Since Levene's test gives a larger P value of 0.2284, we have no evidence against equal variances and will assume that the two variances are equal.
testTSTAT4 <- t.test(price.gas, price.diesel, var.equal = TRUE, alternative = 'two.sided', paired=FALSE)
testTSTAT4
##
## Two Sample t-test
##
## data: price.gas and price.diesel
## t = 0.8322, df = 203, p-value = 0.4063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.71802 36.21531
## sample estimates:
## mean of x mean of y
## 96.04865 85.30000
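Our analysis gave us a t statistic of 0.8322, with 203 degrees of freedom. The final P value was 0.406275. Because this is well above our 5% threshold, we fail to reject the null hypothesis: this test shows no evidence of a difference in mean price between gas and diesel cars. Note also that group means of 96.0 and 85.3 are implausibly low for car prices, which suggests that as.numeric() returned factor level codes rather than the prices themselves; the comparison should be repeated on the actual prices before drawing conclusions.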
It’s important to keep in mind that the small size of the “diesel” sample might give us inaccurate results.
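The as.numeric(auto$price) conversion used for this test deserves a second look. The sketch below illustrates the pitfall under one assumption about the raw file: that price was imported as a factor because missing entries are written as "?". The toy values are made up and are not taken from the dataset.

```r
# Toy stand-in for a price column imported as a factor (values are made up)
raw_price <- factor(c("13495", "?", "5118", "16500"))

# as.numeric() on a factor returns the internal level codes, not the prices
codes <- as.numeric(raw_price)

# Converting through character first recovers the values; "?" becomes NA
prices <- suppressWarnings(as.numeric(as.character(raw_price)))

codes
prices
mean(prices, na.rm = TRUE)
```

An alternative, under the same assumption, is to pass na.strings = "?" to read.csv() so that the column is read as numeric from the start. Either way, the missing prices then need to be dropped or otherwise handled before running the t-test.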