We first read in the data from a GitHub repository and display it.
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
knitr::kable(dat)
USCars | JapaneseCars |
---|---|
18 | 24 |
15 | 27 |
18 | 27 |
16 | 25 |
17 | 31 |
15 | 35 |
14 | 24 |
14 | 19 |
14 | 28 |
15 | 23 |
15 | 27 |
14 | 20 |
15 | 22 |
14 | 18 |
22 | 20 |
18 | 31 |
21 | 32 |
21 | 31 |
10 | 32 |
10 | 24 |
11 | 26 |
9 | 29 |
28 | 24 |
25 | 24 |
19 | 33 |
16 | 33 |
17 | 32 |
19 | 28 |
18 | NA |
14 | NA |
14 | NA |
14 | NA |
14 | NA |
12 | NA |
13 | NA |
We would like to test the hypothesis that \(\mu_1=\mu_2\) against the alternative that \(\mu_1\neq\mu_2\) at the \(\alpha=0.05\) level of significance, where \(\mu_1\) and \(\mu_2\) denote the mean mpg of US and Japanese cars, respectively. This is important because
- Fuel efficiency affects carbon emissions
- Fuel efficiency affects fuel costs for society
Further, we are using this analysis to determine whether the EPA should impose penalties on carmakers.
A boxplot of the data is as follows
boxplot(dat$USCars,dat$JapaneseCars,
main="Boxplot of MPG for US and Japanese Cars",
col=c("red","green"),
names=c("USCars","JapaneseCars"))
It appears that the means do differ, yet the spreads of the two groups look different, which may violate the constant-variance assumption of a two-sample t-test. Let's transform the data by taking the natural log:
dat$USCars<-log(dat$USCars)
dat$JapaneseCars<-log(dat$JapaneseCars)
The sample standard deviation, \(s\), of US cars after the transformation is 0.2466874, and for Japanese cars it is 0.1820182 (note: use inline R code to calculate the standard deviations rather than typing the values in). From these it is deemed that \(\sigma_1\approx\sigma_2\), and hence a two-sample t-test is performed.
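The note above asks for computed rather than typed values. A minimal sketch of that computation follows. The vectors `us` and `jp` are hypothetical names holding the mpg values transcribed from the table, so that the block runs without downloading the file. It reports the standard deviations on both the raw and the log scale, so the effect of the transformation on the relative spread can be judged directly.

```r
# mpg values transcribed from the table above (us and jp are assumed names)
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
jp <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32, 31,
        32, 24, 26, 29, 24, 24, 33, 33, 32, 28)

# Sample standard deviations before and after the natural-log transform
sd_raw <- c(USCars = sd(us), JapaneseCars = sd(jp))
sd_log <- c(USCars = sd(log(us)), JapaneseCars = sd(log(jp)))

# Ratio of the larger to the smaller standard deviation on each scale
ratio_raw <- max(sd_raw) / min(sd_raw)
ratio_log <- max(sd_log) / min(sd_log)

print(round(rbind(raw = sd_raw, log = sd_log), 4))
print(round(c(raw = ratio_raw, log = ratio_log), 3))
```

When working from the data frame instead, `JapaneseCars` is padded with `NA`, so `sd()` needs `na.rm = TRUE` for that column. In an R Markdown source, the values can then be embedded in the sentence with inline code such as `` `r sd(dat$USCars)` `` and `` `r sd(dat$JapaneseCars, na.rm = TRUE)` ``, evaluated after the log transformation.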
The t-statistic for this test may be computed as
\[ t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{(s_1^2/n_1)+(s_2^2/n_2)}} \]
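As a worked check of this formula, the statistic can be computed step by step. This is a sketch using hypothetical vectors `us` and `jp` transcribed from the table. The degrees of freedom use the Welch-Satterthwaite approximation, which is what `t.test()` applies when `var.equal` is left at its default of `FALSE`.

```r
# mpg values transcribed from the table above (us and jp are assumed names)
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
jp <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32, 31,
        32, 24, 26, 29, 24, 24, 33, 33, 32, 28)

# Work on the log scale, as in the analysis
x1 <- log(us)
x2 <- log(jp)
n1 <- length(x1)
n2 <- length(x2)

# Per-group contributions s_i^2 / n_i to the squared standard error
v1 <- var(x1) / n1
v2 <- var(x2) / n2

# t statistic from the formula above
t_stat <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom and two-sided p-value
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p.value = p_value))
```

These values should agree with the `t`, `df`, and `p-value` fields that `t.test()` reports for the same data.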
The test performed in R is as follows
t.test(dat$USCars, dat$JapaneseCars,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: dat$USCars and dat$JapaneseCars
## t = -9.804, df = 60.651, p-value = 4.015e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6380580 -0.4218536
## sample estimates:
## mean of x mean of y
## 2.741001 3.270957
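The output above is Welch's test, because `t.test()` defaults to `var.equal = FALSE`. If the equal-variance assumption stated earlier is to be used, the pooled version can be requested explicitly. The following is a sketch for comparison, with hypothetical vectors `us` and `jp` transcribed from the table; it is not a replacement for the analysis above.

```r
# mpg values transcribed from the table above (us and jp are assumed names)
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
jp <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32, 31,
        32, 24, 26, 29, 24, 24, 33, 33, 32, 28)

# Pooled-variance two-sample t-test (assumes sigma_1 = sigma_2)
pooled <- t.test(log(us), log(jp), alternative = "two.sided", var.equal = TRUE)

# Welch test (no equal-variance assumption), the default used above
welch <- t.test(log(us), log(jp), alternative = "two.sided")

# Side-by-side summary of statistic, degrees of freedom, and p-value
summary_tab <- rbind(
  pooled = c(t = unname(pooled$statistic), df = unname(pooled$parameter),
             p.value = pooled$p.value),
  welch  = c(t = unname(welch$statistic), df = unname(welch$parameter),
             p.value = welch$p.value)
)
print(summary_tab)
```

The pooled test uses \(n_1+n_2-2\) degrees of freedom, while Welch's test estimates them from the two sample variances.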
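Note that the p-value (4.015e-14) is far below \(\alpha=0.05\), and hence the hypothesis of equal means is rejected. The sample estimates on the log scale (2.741 for US cars versus 3.271 for Japanese cars) indicate that the mean mpg of Japanese cars is greater than that of US cars.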
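Because the test was run on log-transformed values, the confidence interval is for a difference in mean log mpg. Exponentiating it gives an interval for the ratio of geometric mean mpg (US relative to Japanese), which may be easier to interpret. A minimal sketch, using the interval endpoints copied from the output above (`ci_log` and `ci_ratio` are assumed names):

```r
# 95% CI endpoints for the difference in mean log(mpg), copied from the t.test output
ci_log <- c(lower = -0.6380580, upper = -0.4218536)

# exp() of a difference in mean logs is a ratio of geometric means (US / Japanese)
ci_ratio <- exp(ci_log)

print(round(ci_ratio, 3))
```

A ratio interval lying entirely below 1 corresponds to a lower typical mpg for US cars than for Japanese cars in this sample.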
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
head(dat)
boxplot(dat$USCars,dat$JapaneseCars,
main="Boxplot of MPG for US and Japanese Cars",
col=c("red","green"),
names=c("USCars","JapaneseCars"))
dat$USCars<-log(dat$USCars)
dat$JapaneseCars<-log(dat$JapaneseCars)
t.test(dat$USCars, dat$JapaneseCars,
alternative = "two.sided")