Summary of Data

We first read in the data from a GitHub repository and display it.

dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
knitr::kable(dat)
USCars JapaneseCars
18 24
15 27
18 27
16 25
17 31
15 35
14 24
14 19
14 28
15 23
15 27
14 20
15 22
14 18
22 20
18 31
21 32
21 31
10 32
10 24
11 26
9 29
28 24
25 24
19 33
16 33
17 32
19 28
18 NA
14 NA
14 NA
14 NA
14 NA
12 NA
13 NA
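The two columns hold different numbers of observations, and the trailing NA values simply pad the shorter JapaneseCars column. A minimal sketch of how the group sizes could be checked, assuming the vectors `us` and `jp` (names chosen here for illustration) are transcribed from the table above:

```r
# Values transcribed from the table above (assumption: no transcription errors)
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
jp <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32, 31,
        32, 24, 26, 29, 24, 24, 33, 33, 32, 28, rep(NA, 7))

# Count the non-missing observations in each group
n_us <- sum(!is.na(us))
n_jp <- sum(!is.na(jp))

# Count the padding NA values in the Japanese column
n_missing_jp <- sum(is.na(jp))

print(c(US = n_us, Japanese = n_jp, Japanese_NA = n_missing_jp))
```

The sample sizes matter later because they enter the standard error of the difference in means.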

We would like to test the null hypothesis that \(\mu_1=\mu_2\) against the alternative that \(\mu_1\neq\mu_2\) at the \(\alpha=0.05\) level of significance, where \(\mu_1\) and \(\mu_2\) denote the mean mpg of US and Japanese cars, respectively. This is important because

  1. Fuel efficiency affects carbon emissions

  2. Fuel efficiency affects what society spends on fuel

Further, we are using this analysis to determine whether the EPA should impose penalties on carmakers.

Boxplot

A boxplot of the data is as follows

boxplot(dat$USCars,dat$JapaneseCars,
        main="Boxplot of MPG for US and Japanese Cars",
        col=c("red","green"),
        names=c("USCars","JapaneseCars"))

Transformation

The boxplot suggests that the means differ, but the spread of the two groups may also differ, which would be a problem for the constant-variance assumption of a two-sample t-test. Let's transform the data by taking the natural log:

dat$USCars<-log(dat$USCars)
dat$JapaneseCars<-log(dat$JapaneseCars)

The sample standard deviation, \(s\), of US cars after the transformation is 0.2466874, and for Japanese cars it is 0.1820182. These are close enough that we deem \(\sigma_1\approx\sigma_2\), and hence a two-sample t-test is performed.
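The standard deviations quoted above can be computed rather than typed in. The following is a minimal sketch, assuming the vectors `us` and `jp` (illustrative names, not from the original code) are transcribed from the data table; `na.rm = TRUE` is needed for the Japanese column because `sd()` returns NA when missing values are present.

```r
# Values transcribed from the data table (assumption: no transcription errors)
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
jp <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32, 31,
        32, 24, 26, 29, 24, 24, 33, 33, 32, 28, rep(NA, 7))

# Apply the same natural-log transformation as the analysis
log_us <- log(us)
log_jp <- log(jp)

# Sample standard deviations on the log scale
sd_us <- sd(log_us)
sd_jp <- sd(log_jp, na.rm = TRUE)  # drop the padding NA values

print(c(US = sd_us, Japanese = sd_jp))
```

In the R Markdown source, an inline expression such as `r sd(dat$JapaneseCars, na.rm = TRUE)` could insert the value directly into the sentence.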

Two-Sample t-test

The t-statistic for this test may be computed as

\[ t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{(s_1^2/n_1)+(s_2^2/n_2)}} \]
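As a check on the formula, the statistic can be computed by hand from the log-transformed data. This is a sketch under the assumption that the vectors `us` and `jp` (illustrative names) match the data table; the degrees-of-freedom step uses the Welch-Satterthwaite approximation, which is not written out in the text and is added here only for illustration.

```r
# Values transcribed from the data table (assumption: no transcription errors)
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
jp <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32, 31,
        32, 24, 26, 29, 24, 24, 33, 33, 32, 28)

# Work on the log scale, as in the analysis
x1 <- log(us)
x2 <- log(jp)
n1 <- length(x1)
n2 <- length(x2)

# Per-group contribution to the squared standard error: s^2 / n
v1 <- var(x1) / n1
v2 <- var(x2) / n2

# t-statistic from the formula above
t_stat <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximate degrees of freedom (illustrative addition)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

print(c(t = t_stat, df = df_welch))
```

Both values can be compared against the `t` and `df` entries in the `t.test()` output.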

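The test performed in R is as follows. Note that t.test() defaults to var.equal = FALSE, so it runs the Welch test, which uses the statistic above and does not rely on the equal-variance assumption.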

t.test(dat$USCars, dat$JapaneseCars,
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  dat$USCars and dat$JapaneseCars
## t = -9.804, df = 60.651, p-value = 4.015e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6380580 -0.4218536
## sample estimates:
## mean of x mean of y 
##  2.741001  3.270957

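Note that the p-value is far below \(\alpha=0.05\), so the hypothesis of equal means is rejected. Because the sample mean for US cars (2.741001) is below that for Japanese cars (3.270957) on the log scale, there is statistical evidence that the mean log mpg of US cars is less than that of Japanese cars.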
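Because the test was run on log-transformed data, the confidence interval is for a difference in mean log mpg. One way to read it on the original scale, offered here as an illustrative sketch rather than part of the original analysis, is to exponentiate the endpoints, which gives an interval for the ratio of geometric mean mpg (US relative to Japanese). The endpoints below are copied from the t.test() output above.

```r
# 95% confidence interval for the difference in mean log mpg (US minus Japanese),
# copied from the t.test() output
log_ci <- c(lower = -0.6380580, upper = -0.4218536)

# exp() of a difference in log means is a ratio of geometric means
ratio_ci <- exp(log_ci)

print(round(ratio_ci, 3))
```

Both endpoints of the ratio interval lie below 1, which agrees with the conclusion that US cars have lower mpg than Japanese cars.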

Complete Code

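# Read the data from GitHub
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
# Preview the first rows
head(dat)
# Side-by-side boxplots of mpg for the two groups
boxplot(dat$USCars,dat$JapaneseCars,
        main="Boxplot of MPG for US and Japanese Cars",
        col=c("red","green"),
        names=c("USCars","JapaneseCars"))
# Natural-log transform of both columns
dat$USCars<-log(dat$USCars)
dat$JapaneseCars<-log(dat$JapaneseCars)
# Two-sided two-sample t-test (Welch test by default)
t.test(dat$USCars, dat$JapaneseCars,
       alternative = "two.sided")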