In this report, I examine several binary factors in the dataset to determine whether they have a significant effect on various numeric variables.

In each test, we use a Levene Test to check whether the two samples have equal variances, and then choose the matching form of the t-test.
We use the t.test() function because the degrees of freedom are tedious to calculate by hand, particularly when the variances are unequal. This also keeps the code cleaner and more readable.
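To illustrate why the degrees of freedom are tedious to work out by hand, here is a minimal sketch of the Welch-Satterthwaite approximation that t.test() applies when var.equal = FALSE. The vectors x and y are made-up illustration data, and welch_df is a hypothetical helper name; neither is part of the analysis in this report.

```r
# Made-up samples for illustration only (not from the automobile data)
x <- c(30, 32, 28, 35, 31, 29, 33)
y <- c(25, 27, 24, 26)

# Hypothetical helper: Welch-Satterthwaite degrees of freedom
welch_df <- function(a, b) {
  va <- var(a) / length(a)   # squared standard error of mean(a)
  vb <- var(b) / length(b)   # squared standard error of mean(b)
  (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
}

manual_df <- welch_df(x, y)
auto_df <- unname(t.test(x, y, var.equal = FALSE)$parameter)

# Both routes should agree up to floating-point error
all.equal(manual_df, auto_df)
```

The result is usually not a whole number, which is why the Welch output can show fractional degrees of freedom.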

The original data source for this report can be found on Kaggle.

This dataset consists of data from the 1985 Ward’s Automotive Yearbook.

Questions we aim to answer

Test 1: Does the location of the engine influence Highway MPG?
Test 2: Does the location of the engine influence weight?
Test 3: Does the fuel type of the engine influence curb weight?
Test 4: Does the fuel type of the engine influence price?

Load The Data and Functions We Will Use

  library(ggplot2)
  library(car)

## Warning: package 'car' was built under R version 3.3.3

  auto <- read.csv("Automobile_data.csv")
  auto <- auto[,c("fuel.type", "engine.location", "curb.weight", "num.of.cylinders", "highway.mpg", "price")]
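One thing worth checking at load time is how missing values are encoded. The sketch below assumes, as an illustration only, that missing prices are written as "?" in the CSV; the two-column text in csv_text is made up and stands in for the real file. It shows how the na.strings argument of read.csv() keeps such a column numeric.

```r
# Made-up rows for illustration; "?" marks a missing price (an assumption)
csv_text <- "fuel.type,price\ngas,13495\ndiesel,?\ngas,16500"

# Without na.strings the "?" forces the whole column to be non-numeric
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(raw$price)

# With na.strings = "?" the column is parsed as numbers, with NA for the gap
clean <- read.csv(text = csv_text, na.strings = "?")
class(clean$price)
mean(clean$price, na.rm = TRUE)
```

If a column does arrive as a factor, as.numeric(as.character(x)) recovers the printed values, whereas as.numeric(x) alone returns the internal level codes.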

Test 1: Does the location of the engine influence Highway MPG?

Step 1: State the hypothesis

Null Hypothesis -> The difference between the means of both samples is zero.

Step 2: Set the criterion

Alpha level = 5%
Degrees of freedom = ?
Critical value for the t-test = ?
This is a 2-tailed test.
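The critical value left open above follows directly from the alpha level and the degrees of freedom. As a minimal sketch, the code below uses qt() with 201 degrees of freedom, which is the value t.test() reports for this test in Step 4; the variable names are illustrative.

```r
alpha <- 0.05
deg_free <- 201   # taken from the t.test() output for this test

# Two-tailed test: split alpha evenly between both tails
t_crit <- qt(1 - alpha / 2, deg_free)
t_crit

# We reject the null hypothesis when abs(t) exceeds t_crit
```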

Step 3: Explore the data

  mpg.front <- auto[auto$engine.location %in% "front",]$highway.mpg
  mpg.rear <- auto[auto$engine.location %in% "rear",]$highway.mpg

  length(mpg.front)

## [1] 202

  length(mpg.rear)

## [1] 3

  boxplot(highway.mpg~engine.location,data=auto, main="Engine Location vs Fuel Efficiency", 
    xlab="Engine Location", ylab="HWY MPG")

  ggplot(auto, aes(x = highway.mpg, fill = engine.location)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Location of Engine"))

  mean(mpg.rear)

## [1] 25

  sd(mpg.rear)

## [1] 0

  mean(mpg.front)

## [1] 30.83663

  sd(mpg.front)

## [1] 6.901442

  testLevene <- leveneTest(highway.mpg~engine.location, data=auto)
  testLevene

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  4.7547 0.03037 *
##       203                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4: Calculate test statistic

The means for Front and Rear engines, respectively, are 30.8 and 25

Since we have a small Levene Test P value of 0.0304, we should assume that the 2 variances are not equal.
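For readers curious about what the Levene Test computes: with center = median, it amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R on made-up data (y and g are illustrative, not the automobile data) and applies the decision rule used in this report.

```r
# Made-up data for illustration only
y <- c(30, 32, 28, 35, 31, 29, 33, 25, 26, 25, 24, 27)
g <- factor(rep(c("front", "rear"), times = c(7, 5)))

# Absolute deviation of each observation from its own group median
z <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations; the first row holds the Levene F and P value
levene_tab <- anova(lm(z ~ g))
p_levene <- levene_tab[["Pr(>F)"]][1]

# Decision rule: a P value below 0.05 means we do not assume equal variances
equal_var <- p_levene >= 0.05
equal_var
```

The resulting logical can be passed straight to the var.equal argument of t.test().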

  testTSTAT <- t.test(mpg.front, mpg.rear, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)

  testTSTAT

## 
##  Welch Two Sample t-test
## 
## data:  mpg.front and mpg.rear
## t = 12.02, df = 201, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.879142 6.794125
## sample estimates:
## mean of x mean of y 
##  30.83663  25.00000

Step 5: Make a decision about the hypotheses

Our analysis gave us a t statistic of 12.0198, with 201 degrees of freedom. The final P value was 1.9785798 × 10^-25. Because this is far below our 5% alpha level, we reject the null hypothesis.

However, it’s important to keep in mind that the “rear” sample contains only 3 cars, all with the same highway MPG (a standard deviation of 0), so this result may not be reliable.
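The decision rule can be sketched in a few lines. The code below recomputes the two-sided P value from the rounded t statistic and degrees of freedom reported in Step 4 and compares it with alpha; because the inputs are rounded, the result only approximates the P value quoted in this step.

```r
t_stat <- 12.0198   # t statistic reported by t.test() for this test
deg_free <- 201     # degrees of freedom reported by t.test()
alpha <- 0.05

# Two-sided P value: probability of an abs(t) at least this large under the null
p_value <- 2 * pt(-abs(t_stat), deg_free)
p_value

# TRUE means we reject the null hypothesis at the chosen alpha level
p_value < alpha
```

With a stored test object, the same comparison would be testTSTAT$p.value < alpha.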

Test 2: Does the location of the engine influence weight?

Step 1: State the hypothesis

Null Hypothesis -> The difference between the means of both samples is zero.

Step 2: Set the criterion

Alpha level = 5%
Degrees of freedom = ?
Critical value for the t-test = ?
This is a 2-tailed test

Step 3: Explore the data

  wt.front <- auto[auto$engine.location %in% "front",]$curb.weight
  wt.rear <- auto[auto$engine.location %in% "rear",]$curb.weight

  length(wt.front)

## [1] 202

  length(wt.rear)

## [1] 3

  boxplot(curb.weight~engine.location,data=auto, main="Engine Location vs Curb Weight", xlab="Engine Location", ylab="Curb Weight")

  ggplot(auto, aes(x = curb.weight, fill = engine.location)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Location of Engine"))

  mean(wt.rear)

## [1] 2770.667

  sd(wt.rear)

## [1] 25.40341

  mean(wt.front)

## [1] 2552.371

  sd(wt.front)

## [1] 523.8769

  testLevene2 <- leveneTest(curb.weight~engine.location, data=auto)
  testLevene2

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  4.0878 0.04451 *
##       203                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4: Calculate test statistic

The means for Front and Rear engines, respectively, are 2552.4 and 2770.7

Since we have a small Levene Test P value of 0.0445, we should assume that the 2 variances are not equal.

  testTSTAT2 <- t.test(wt.front, wt.rear, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)

  testTSTAT2

Step 5: Make a decision about the hypotheses

As in Test 1, we reject the null hypothesis if the P value reported by testTSTAT2 is below our 5% alpha level, and fail to reject it otherwise.

However, it’s important to keep in mind that the small size of the “rear” sample (only 3 cars) might give us inaccurate results.
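As a rough cross-check for the curb weight comparison, the Welch statistic can be recomputed from the summary statistics printed in Step 3. This is a sketch based on those rounded values, not output from the original analysis, and the variable names are illustrative. By this arithmetic the t statistic comes out at roughly -5.5 on about 77 degrees of freedom.

```r
# Summary statistics copied from Step 3 of this test
m_front <- 2552.371; s_front <- 523.8769; n_front <- 202
m_rear  <- 2770.667; s_rear  <- 25.40341; n_rear  <- 3

# Squared standard error of each group mean
se2_front <- s_front^2 / n_front
se2_rear  <- s_rear^2 / n_rear

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch <- (m_front - m_rear) / sqrt(se2_front + se2_rear)
df_welch <- (se2_front + se2_rear)^2 /
  (se2_front^2 / (n_front - 1) + se2_rear^2 / (n_rear - 1))

# Two-sided P value
p_welch <- 2 * pt(-abs(t_welch), df_welch)

c(t = t_welch, df = df_welch, p = p_welch)
```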

Test 3: Does the fuel type of the engine influence curb weight?

Step 1: State the hypothesis

Null Hypothesis -> The difference between the means of both samples is zero.

Step 2: Set the criterion

Alpha level = 5%
Degrees of freedom = ?
Critical value for the t-test = ?
This is a 2-tailed test

Step 3: Explore the data

  wt.gas <- auto[auto$fuel.type %in% "gas",]$curb.weight
  wt.diesel <- auto[auto$fuel.type %in% "diesel",]$curb.weight

  length(wt.gas)

## [1] 185

  length(wt.diesel)

## [1] 20

  boxplot(curb.weight~fuel.type,data=auto, main="Fuel Type vs Curb Weight", xlab="Fuel Type", ylab="Curb Weight")

  ggplot(auto, aes(x = curb.weight, fill = fuel.type)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Type of Fuel"))

  mean(wt.gas)

## [1] 2518.459

  sd(wt.gas)

## [1] 501.0002

  mean(wt.diesel)

## [1] 2898.8

  sd(wt.diesel)

## [1] 585.386

  testLevene3 <- leveneTest(curb.weight~fuel.type, data=auto)
  testLevene3

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1   3.559 0.06065 .
##       203                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4: Calculate test statistic

The means for Gas and Diesel engines, respectively, are 2518.5 and 2898.8

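The Levene Test P value of 0.0607 is slightly above our 5% threshold, so we cannot formally reject equal variances. Because the result is borderline, we take the cautious route and do not assume that the 2 variances are equal.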

  testTSTAT3 <- t.test(wt.gas, wt.diesel, var.equal = FALSE, alternative = 'two.sided', paired=FALSE)

  testTSTAT3

## 
##  Welch Two Sample t-test
## 
## data:  wt.gas and wt.diesel
## t = -2.797, df = 22.114, p-value = 0.01047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -662.26194  -98.41914
## sample estimates:
## mean of x mean of y 
##  2518.459  2898.800

Step 5: Make a decision about the hypotheses

Our analysis gave us a t statistic of -2.797, with 22.1139 degrees of freedom. The final P value was 0.0104745. Because this is below our 5% alpha level, we reject the null hypothesis.

However, it’s important to keep in mind that the small size of the “diesel” sample might give us inaccurate results. It’s also interesting to note that the two groups have similar standard deviations (501.0 for “Gas” and 585.4 for “Diesel”) despite very different sample sizes. This is expected: a larger sample shrinks the standard error of the mean, not the standard deviation of the data.

Just out of curiosity, I’ve applied another t-test assuming equal variance. Notice that it also rejects the null hypothesis, with an even smaller P value (0.0018 versus 0.0105).

  t.test(wt.gas, wt.diesel, var.equal = TRUE, alternative = 'two.sided', paired=FALSE)

## 
##  Two Sample t-test
## 
## data:  wt.gas and wt.diesel
## t = -3.1715, df = 203, p-value = 0.001752
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -616.8008 -143.8803
## sample estimates:
## mean of x mean of y 
##  2518.459  2898.800
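The gap between the two results comes from how each test estimates the standard error. As a sketch, the code below rebuilds both t statistics from the summary statistics printed in Step 3; the variable names are illustrative, and small differences from the output above are due to rounding of the inputs.

```r
# Summary statistics copied from Step 3 of this test
m_gas <- 2518.459; s_gas <- 501.0002; n_gas <- 185
m_diesel <- 2898.8; s_diesel <- 585.386; n_diesel <- 20
diff_means <- m_gas - m_diesel

# Welch: each group keeps its own variance
se_welch <- sqrt(s_gas^2 / n_gas + s_diesel^2 / n_diesel)
t_welch <- diff_means / se_welch

# Pooled: one variance estimate, weighted by each group's degrees of freedom
sp2 <- ((n_gas - 1) * s_gas^2 + (n_diesel - 1) * s_diesel^2) /
  (n_gas + n_diesel - 2)
se_pooled <- sqrt(sp2 * (1 / n_gas + 1 / n_diesel))
t_pooled <- diff_means / se_pooled

round(c(welch = t_welch, pooled = t_pooled), 3)
```

The pooled estimate is dominated by the 185 gas cars, whose variance is smaller, so its standard error is lower and its t statistic larger in magnitude.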

Test 4: Does the fuel type of the engine influence price?

Step 1: State the hypothesis

Null Hypothesis -> The difference between the means of both samples is zero.

Step 2: Set the criterion

Alpha level = 5%
Degrees of freedom = ?
Critical value for the t-test = ?
This is a 2-tailed test

Step 3: Explore the data

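  # Caution: if price was read in as a factor, as.numeric() returns the factor
  # level codes, not dollar amounts; as.numeric(as.character(auto$price)) avoids this
  auto$price <- as.numeric(auto$price)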
  price.gas <- auto[auto$fuel.type %in% "gas",]$price
  price.diesel <- auto[auto$fuel.type %in% "diesel",]$price

  length(price.gas)

## [1] 185

  length(price.diesel)

## [1] 20

  boxplot(price~fuel.type,data=auto, main="Fuel Type vs Price", xlab="Fuel Type", ylab="Price")

  ggplot(auto, aes(x = price, fill = fuel.type)) + geom_density(alpha = 0.5) + guides(fill=guide_legend(title="Type of Fuel"))

  mean(price.gas)

## [1] 96.04865

  sd(price.gas)

## [1] 55.29007

  mean(price.diesel)

## [1] 85.3

  sd(price.diesel)

## [1] 50.64541

  testLevene4 <- leveneTest(price~fuel.type, data=auto)
  testLevene4

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  1.4596 0.2284
##       203

Step 4: Calculate test statistic

The means for Gas and Diesel engines, respectively, are 96 and 85.3

Since we have a larger Levene Test P value of 0.2284, we should assume that the 2 variances are equal.

  testTSTAT4 <- t.test(price.gas, price.diesel, var.equal = TRUE, alternative = 'two.sided', paired=FALSE)

  testTSTAT4

## 
##  Two Sample t-test
## 
## data:  price.gas and price.diesel
## t = 0.8322, df = 203, p-value = 0.4063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.71802  36.21531
## sample estimates:
## mean of x mean of y 
##  96.04865  85.30000

Step 5: Make a decision about the hypotheses

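Our analysis gave us a t statistic of 0.8322, with 203 degrees of freedom. The final P value was 0.406275. Because this is well above our 5% alpha level, we fail to reject the null hypothesis: the data do not show a significant difference in mean price between fuel types.

It’s important to keep in mind that the small size of the “diesel” sample might give us inaccurate results. The group means of about 96 and 85 are also implausibly low for car prices, which suggests that as.numeric() returned factor level codes rather than dollar amounts, so this test should be rerun on the actual prices before drawing conclusions.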
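The 95 percent confidence interval in the t.test() output tells the same story as the P value. As a sketch, the code below rebuilds that interval from the summary statistics printed in Step 3 (so it is on the same scale as the output above) and checks whether it contains zero; the variable names are illustrative.

```r
# Summary statistics copied from Step 3 of this test
m_gas <- 96.04865; s_gas <- 55.29007; n_gas <- 185
m_diesel <- 85.3;  s_diesel <- 50.64541; n_diesel <- 20
deg_free <- n_gas + n_diesel - 2

# Pooled variance and standard error of the difference in means
sp2 <- ((n_gas - 1) * s_gas^2 + (n_diesel - 1) * s_diesel^2) / deg_free
se <- sqrt(sp2 * (1 / n_gas + 1 / n_diesel))

# 95 percent confidence interval for the difference in means
ci <- (m_gas - m_diesel) + c(-1, 1) * qt(0.975, deg_free) * se
ci

# The interval contains zero exactly when the two-sided P value exceeds 0.05
contains_zero <- ci[1] < 0 && ci[2] > 0
contains_zero
```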

Two Sample T Tests

Patrick

May 23, 2017
