In this report, I examine several binary variables from the dataset to determine whether any of them has a significant effect on various numeric variables.

The original data source for this report can be found on Kaggle.

This dataset consists of data from the 1985 Ward’s Automotive Yearbook.

Questions we aim to answer

Load The Data and Functions We Will Use

  library(ggplot2)
  library(car)
  auto <- read.csv("Automobile_data.csv")
  auto <- auto[,c("fuel.type", "engine.location", "curb.weight", "num.of.cylinders", "highway.mpg", "price")]

Test 1: Does the location of the engine influence Highway MPG?

Step 1: State the hypothesis

  • Null Hypothesis -> The difference between the means of both samples is zero.

Step 2: Set the criterion

  • Alpha level = 5%
  • Degrees of freedom = ?
  • Critical value for the t-test = ?
  • This is a 2-tailed test.
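
The criterion above leaves the degrees of freedom and critical value open. As a rough sketch, they could be obtained with qt(), assuming the pooled (equal-variance) degrees of freedom n1 + n2 - 2 and the group sizes reported in Step 3. If Welch's test is used instead, the degrees of freedom depend on the sample variances and are only known after the test is run. The variable names here are illustrative, not part of the original analysis.

```r
# Significance level and group sizes (sizes taken from Step 3 of this test)
alpha   <- 0.05
n_front <- 202
n_rear  <- 3

# Assumption: pooled (equal-variance) degrees of freedom, n1 + n2 - 2
df_pooled <- n_front + n_rear - 2

# Two-tailed test: put alpha / 2 in each tail
t_crit <- qt(1 - alpha / 2, df = df_pooled)

# Reject the null hypothesis when |t| exceeds t_crit
print(c(df = df_pooled, critical.value = t_crit))
```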

Step 3: Explore the data

  mpg.front <- auto[auto$engine.location %in% "front",]$highway.mpg
  mpg.rear <- auto[auto$engine.location %in% "rear",]$highway.mpg

  length(mpg.front)
## [1] 202
  length(mpg.rear)
## [1] 3
  boxplot(highway.mpg~engine.location,data=auto, main="Engine Location vs Fuel Efficiency", 
    xlab="Engine Location", ylab="HWY MPG")

  ggplot(auto, aes(x = highway.mpg, fill = engine.location)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Location of Engine"))

  mean(mpg.rear)
## [1] 25
  sd(mpg.rear)
## [1] 0
  mean(mpg.front)
## [1] 30.83663
  sd(mpg.front)
## [1] 6.901442
  testLevene <- leveneTest(highway.mpg~engine.location, data=auto)
  testLevene
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  4.7547 0.03037 *
##       203                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4: Calculate test statistic

The means for Front and Rear engines, respectively, are 30.8 and 25

Since we have a small Levene Test P value of 0.0304, we should assume that the 2 variances are not equal.
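
To make the Levene step less of a black box, here is a minimal sketch of what a median-centered Levene test computes: a one-way ANOVA on each observation's absolute deviation from its group median. It uses the built-in mtcars data as a stand-in (an assumption for illustration, not the automobile data), and the 5% cutoff for choosing var.equal mirrors the rule used in this report.

```r
# Stand-in data: mtcars ships with R (not the automobile data from this report)
y     <- mtcars$mpg
group <- factor(mtcars$am, labels = c("automatic", "manual"))

# Absolute deviation of each observation from its own group's median
abs.dev <- abs(y - ave(y, group, FUN = median))

# One-way ANOVA on those deviations; its F test is the Levene statistic
levene.by.hand <- anova(lm(abs.dev ~ group))
levene.F <- levene.by.hand[1, "F value"]
levene.p <- levene.by.hand[1, "Pr(>F)"]

# A small P value suggests unequal variances, pointing to var.equal = FALSE
use.equal.var <- levene.p >= 0.05
print(c(F = levene.F, p = levene.p))
```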

  testTSTAT <- t.test(mpg.front, mpg.rear, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)

  testTSTAT
## 
##  Welch Two Sample t-test
## 
## data:  mpg.front and mpg.rear
## t = 12.02, df = 201, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.879142 6.794125
## sample estimates:
## mean of x mean of y 
##  30.83663  25.00000

Step 5: Make a decision about the hypotheses

Our analysis gave us a t statistic of 12.0198, with 201 degrees of freedom. The final P value was 1.9785798 × 10^-25. Because this is far below our 5% alpha level, we reject the null hypothesis.

However, it’s important to keep in mind that the “rear” sample contains only 3 cars, all with the same highway MPG (standard deviation 0), so this result may be unreliable.
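
The Welch statistic and its degrees of freedom can be recomputed by hand from the summary statistics in Step 3. The sketch below uses the standard Welch and Welch-Satterthwaite formulas; the variable names are illustrative only.

```r
# Summary statistics copied from Step 3
m.front <- 30.83663; s.front <- 6.901442; n.front <- 202
m.rear  <- 25;       s.rear  <- 0;        n.rear  <- 3

# Squared standard error contributed by each group
v.front <- s.front^2 / n.front
v.rear  <- s.rear^2  / n.rear

# Welch t statistic
t.welch <- (m.front - m.rear) / sqrt(v.front + v.rear)

# Welch-Satterthwaite degrees of freedom
df.welch <- (v.front + v.rear)^2 /
  (v.front^2 / (n.front - 1) + v.rear^2 / (n.rear - 1))

# Two-tailed P value
p.welch <- 2 * pt(-abs(t.welch), df = df.welch)
print(c(t = t.welch, df = df.welch, p = p.welch))
```

Because the rear group's variance is 0, it contributes nothing to the standard error, and the degrees of freedom reduce to n.front - 1 = 201, which is why the t.test output reports df = 201.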

Test 2: Does the location of the engine influence weight?

Step 1: State the hypothesis

  • Null Hypothesis -> The difference between the means of both samples is zero.

Step 2: Set the criterion

  • Alpha level = 5%
  • Degrees of freedom = ?
  • Critical value for the t-test = ?
  • This is a 2-tailed test

Step 3: Explore the data

  wt.front <- auto[auto$engine.location %in% "front",]$curb.weight
  wt.rear <- auto[auto$engine.location %in% "rear",]$curb.weight

  length(wt.front)
## [1] 202
  length(wt.rear)
## [1] 3
  boxplot(curb.weight~engine.location,data=auto, main="Engine Location vs Curb Weight", xlab="Engine Location", ylab="Curb Weight")

  ggplot(auto, aes(x = curb.weight, fill = engine.location)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Location of Engine"))

  mean(wt.rear)
## [1] 2770.667
  sd(wt.rear)
## [1] 25.40341
  mean(wt.front)
## [1] 2552.371
  sd(wt.front)
## [1] 523.8769
  testLevene2 <- leveneTest(curb.weight~engine.location, data=auto)
  testLevene2
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  4.0878 0.04451 *
##       203                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4: Calculate test statistic

The means for Front and Rear engines, respectively, are 2552.4 and 2770.7

Since we have a small Levene Test P value of 0.0445, we should assume that the 2 variances are not equal.

  testTSTAT2 <- t.test(wt.front, wt.rear, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)

  testTSTAT2

Step 5: Make a decision about the hypotheses

Applying Welch's formula to the Step 3 summary statistics (means 2552.371 and 2770.667, standard deviations 523.8769 and 25.40341, sample sizes 202 and 3) gives a t statistic of about -5.50 with roughly 76.6 degrees of freedom. The corresponding P value is far below our 5% alpha level, so we reject the null hypothesis.

However, it’s important to keep in mind that the small size of the “rear” sample might give us inaccurate results.

Test 3: Does the fuel type of the engine influence curb weight?

Step 1: State the hypothesis

  • Null Hypothesis -> The difference between the means of both samples is zero.

Step 2: Set the criterion

  • Alpha level = 5%
  • Degrees of freedom = ?
  • Critical value for the t-test = ?
  • This is a 2-tailed test

Step 3: Explore the data

  wt.gas <- auto[auto$fuel.type %in% "gas",]$curb.weight
  wt.diesel <- auto[auto$fuel.type %in% "diesel",]$curb.weight

  length(wt.gas)
## [1] 185
  length(wt.diesel)
## [1] 20
  boxplot(curb.weight~fuel.type,data=auto, main="Fuel Type vs Curb Weight", xlab="Fuel Type", ylab="Curb Weight")

  ggplot(auto, aes(x = curb.weight, fill = fuel.type)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Type of Fuel"))

  mean(wt.gas)
## [1] 2518.459
  sd(wt.gas)
## [1] 501.0002
  mean(wt.diesel)
## [1] 2898.8
  sd(wt.diesel)
## [1] 585.386
  testLevene3 <- leveneTest(curb.weight~fuel.type, data=auto)
  testLevene3
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1   3.559 0.06065 .
##       203                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4: Calculate test statistic

The means for Gas and Diesel engines, respectively, are 2518.5 and 2898.8

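The Levene Test P value of 0.0607 is slightly above our 5% level, so we cannot formally reject equal variances. Because the result is borderline, we take the more conservative route and use Welch's t-test, which does not assume that the 2 variances are equal.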

  testTSTAT3 <- t.test(wt.gas, wt.diesel, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)

  testTSTAT3
## 
##  Welch Two Sample t-test
## 
## data:  wt.gas and wt.diesel
## t = -2.797, df = 22.114, p-value = 0.01047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -662.26194  -98.41914
## sample estimates:
## mean of x mean of y 
##  2518.459  2898.800

Step 5: Make a decision about the hypotheses

Our analysis gave us a t statistic of -2.797, with 22.1139 degrees of freedom. The final P value was 0.0104745. Because this is below our 5% alpha level, we reject the null hypothesis.

However, it’s important to keep in mind that the small size of the “diesel” sample might give us inaccurate results. It’s also worth noting that the two groups have similar standard deviations (501.0 for “Gas” and 585.4 for “Diesel”) despite very different sample sizes. A larger sample shrinks the standard error of the mean, not the standard deviation of the data itself.

Just out of curiosity, I’ve applied another t-test assuming equal variance. Notice that it also rejects the null hypothesis, with a noticeably smaller P value (0.001752 versus 0.01047).

  t.test(wt.gas, wt.diesel, var.equal = TRUE, alternative = 'two.sided', paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  wt.gas and wt.diesel
## t = -3.1715, df = 203, p-value = 0.001752
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -616.8008 -143.8803
## sample estimates:
## mean of x mean of y 
##  2518.459  2898.800
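
The gap between the two results comes entirely from how the standard error and degrees of freedom are computed. The sketch below reproduces both tests from the Step 3 summary statistics using the standard pooled and Welch formulas. The helper tstat_from_summary is hypothetical, written only for this illustration.

```r
# Hypothetical helper: two-sample t-test from summary statistics only
tstat_from_summary <- function(m1, s1, n1, m2, s2, n2, var.equal = FALSE) {
  if (var.equal) {
    # Pooled variance with df = n1 + n2 - 2
    sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
    se  <- sqrt(sp2 * (1 / n1 + 1 / n2))
    df  <- n1 + n2 - 2
  } else {
    # Welch: separate variances with Welch-Satterthwaite df
    v1 <- s1^2 / n1
    v2 <- s2^2 / n2
    se <- sqrt(v1 + v2)
    df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  }
  t_val <- (m1 - m2) / se
  c(t = t_val, df = df, p = 2 * pt(-abs(t_val), df))
}

# Gas vs diesel curb weight, values copied from Step 3
welch  <- tstat_from_summary(2518.459, 501.0002, 185, 2898.8, 585.386, 20)
pooled <- tstat_from_summary(2518.459, 501.0002, 185, 2898.8, 585.386, 20,
                             var.equal = TRUE)
print(rbind(welch, pooled))
```

The pooled test is dominated by the 185 gas cars, so it reports a smaller standard error and 203 degrees of freedom. Welch's version lets the 20 diesel cars set most of the uncertainty, which gives about 22 degrees of freedom and a larger P value.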

Test 4: Does the fuel type of the engine influence price?

Step 1: State the hypothesis

  • Null Hypothesis -> The difference between the means of both samples is zero.

Step 2: Set the criterion

  • Alpha level = 5%
  • Degrees of freedom = ?
  • Critical value for the t-test = ?
  • This is a 2-tailed test

Step 3: Explore the data

  auto$price <- as.numeric(auto$price)
  price.gas <- auto[auto$fuel.type %in% "gas",]$price
  price.diesel <- auto[auto$fuel.type %in% "diesel",]$price

  length(price.gas)
## [1] 185
  length(price.diesel)
## [1] 20
  boxplot(price~fuel.type,data=auto, main="Fuel Type vs Price", xlab="Fuel Type", ylab="Price")

  ggplot(auto, aes(x = price, fill = fuel.type)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Type of Fuel"))

  mean(price.gas)
## [1] 96.04865
  sd(price.gas)
## [1] 55.29007
  mean(price.diesel)
## [1] 85.3
  sd(price.diesel)
## [1] 50.64541
  testLevene4 <- leveneTest(price~fuel.type, data=auto)
  testLevene4
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  1.4596 0.2284
##       203

Step 4: Calculate test statistic

The means for Gas and Diesel engines, respectively, are 96 and 85.3

Since we have a larger Levene Test P value of 0.2284, we have no evidence that the 2 variances differ, so we use the pooled (equal-variance) t-test.

  testTSTAT4 <- t.test(price.gas, price.diesel, var.equal = TRUE, alternative = 'two.sided', paired=FALSE)

  testTSTAT4
## 
##  Two Sample t-test
## 
## data:  price.gas and price.diesel
## t = 0.8322, df = 203, p-value = 0.4063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.71802  36.21531
## sample estimates:
## mean of x mean of y 
##  96.04865  85.30000

Step 5: Make a decision about the hypotheses

Our analysis gave us a t statistic of 0.8322, with 203 degrees of freedom. The final P value was 0.406275. Because this is well above our 5% alpha level, we fail to reject the null hypothesis: these data show no significant difference in mean price between gas and diesel cars.

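It’s important to keep in mind that the small size of the “diesel” sample might give us inaccurate results. The group means of 96 and 85.3 are also implausible as car prices. If price was read in as a factor, as.numeric() returns the factor's integer level codes rather than the prices themselves, in which case this test should be rerun after converting with as.numeric(as.character(auto$price)).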
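
The following sketch shows how as.numeric() behaves on a factor column that contains a non-numeric placeholder. The values are made up for illustration, and the "?" placeholder is an assumption about how missing prices might be encoded.

```r
# Toy column (made-up values) with "?" standing in for a missing price
raw <- factor(c("13495", "16500", "?", "5118"))

# Pitfall: as.numeric() on a factor returns integer level codes
codes <- as.numeric(raw)

# Safer: convert to character first; "?" becomes NA (with a warning)
prices <- suppressWarnings(as.numeric(as.character(raw)))

print(codes)
print(prices)
print(mean(prices, na.rm = TRUE))
```

The level codes are small integers determined by the sort order of the labels, so any mean or t-test computed on them says nothing about the actual prices.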