An environmental group would like to test the hypothesis that the mean mpg of cars manufactured in the US is less than that of cars manufactured in Japan. To this end, they sampled n1=35 US and n2=28 Japanese cars and tested each for mpg fuel efficiency. (As a caveat, assume that this is a random sample from a large population of US and Japanese cars, not a complete census.) The data are reported in the following csv file:
#Import US_Japanese_Car from url
cars <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
names(cars)[names(cars) == "ï..USCars"] <- "USCars"
cars
## USCars JapaneseCars
## 1 18 24
## 2 15 27
## 3 18 27
## 4 16 25
## 5 17 31
## 6 15 35
## 7 14 24
## 8 14 19
## 9 14 28
## 10 15 23
## 11 15 27
## 12 14 20
## 13 15 22
## 14 14 18
## 15 22 20
## 16 18 31
## 17 21 32
## 18 21 31
## 19 10 32
## 20 10 24
## 21 11 26
## 22 9 29
## 23 28 24
## 24 25 24
## 25 19 33
## 26 16 33
## 27 17 32
## 28 19 28
## 29 18 NA
## 30 14 NA
## 31 14 NA
## 32 14 NA
## 33 14 NA
## 34 12 NA
## 35 13 NA
# MPG for cars from the US and Japan on normal probability plots
qqnorm(cars$USCars, ylab="MPG, Sample Quantiles", xlab="Standard Normal Quantiles", main="US Cars Normal Probability Plot", col="steelblue")
qqline(cars$USCars, col="blue")
qqnorm(cars$JapaneseCars, ylab="MPG, Sample Quantiles", xlab="Standard Normal Quantiles", main="Japanese Cars Normal Probability Plot", col="firebrick2")
qqline(cars$JapaneseCars, col="red")
It is a fairly safe assumption that the mpg of both US and Japanese cars is approximately normally distributed. In both plots the points fall close to a straight line, and most of the observations in each sample lie along it. The deviations from the line are minimal, with outliers more apparent for US cars than for Japanese cars. Both samples are small, which limits how much can be read from the plots, so further analysis is warranted. Overall, data near the mean occur more frequently than data far from the mean.
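As a numerical complement to the normal probability plots, a Shapiro-Wilk test can be run on each sample. This is only a sketch and was not part of the original analysis. The vectors `us` and `japan` are illustrative names that hold the values printed in the data frame above, so the block runs without downloading the csv file.

```r
# MPG values copied from the printed data frame (NA rows dropped for Japan).
# `us` and `japan` are illustrative names, not objects from the original script.
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
japan <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32,
           31, 32, 24, 26, 29, 24, 24, 33, 33, 32, 28)

# Shapiro-Wilk test: the null hypothesis is that the sample is drawn
# from a normal distribution.
sw_us <- shapiro.test(us)
sw_japan <- shapiro.test(japan)

# A small p-value is evidence against normality; a large p-value does not
# prove normality, especially with samples this small.
print(sw_us)
print(sw_japan)
```

The test result should be read alongside the plots rather than in place of them, since with n1=35 and n2=28 the test has limited power to detect modest departures from normality.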
# boxplot comparison for US and Japanese cars' MPG
var(cars$USCars)
## [1] 16.44034
var(cars$JapaneseCars,na.rm = TRUE)
## [1] 22.12037
boxplot(cars$USCars,cars$JapaneseCars, main="US vs. Japanese MPG", names=c("US","Japan"),col=c("steelblue","firebrick2"), ylab="MPG")
The variability differs between the two groups, as shown by the interquartile range (IQR) of each. The boxplot for US cars has a narrower IQR than the boxplot for Japanese cars, and the US distribution is skewed to the right, since its median is lower than its mean.
For Japanese cars the MPG is more evenly distributed about the center, but with a larger IQR. The median (27) is close to the mean (26.75), so there is little of the skew seen for US cars.
Japanese car MPG appears to have greater variance than US car MPG: the sample variance is 22.12 versus 16.44, and the larger IQR points the same way.
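The spread measures discussed above can be put side by side in one small table. The following is a minimal sketch, not the author's code. The names `us`, `japan`, `spread`, and `var_ratio` are introduced here for illustration, and the data are the values printed in the data frame earlier.

```r
# MPG values copied from the printed data frame (NA rows dropped for Japan).
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
japan <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32,
           31, 32, 24, 26, 29, 24, 24, 33, 33, 32, 28)

# One row per group: sample size, center, and two measures of spread.
spread <- data.frame(
  group    = c("US", "Japan"),
  n        = c(length(us), length(japan)),
  mean     = c(mean(us), mean(japan)),
  median   = c(median(us), median(japan)),
  variance = c(var(us), var(japan)),
  iqr      = c(IQR(us), IQR(japan))
)
print(spread)

# Ratio of sample variances (US / Japan); a value below 1 means the
# Japanese sample is the more variable of the two.
var_ratio <- var(us) / var(japan)
print(var_ratio)
```

Unequal sample variances are one reason to prefer the Welch form of the two-sample t-test, which does not pool the variances.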
# log(MPG) for cars from the US and Japan on normal probability plots
cars$logUSCars <- log(cars$USCars)
cars$logJapCars <- log(cars$JapaneseCars)
qqnorm(cars$logUSCars, ylab="Log MPG, Sample Quantiles", xlab="Standard Normal Quantiles", main="US Cars log(MPG) Normal Probability Plot", col="steelblue")
qqline(cars$logUSCars, col="blue")
qqnorm(cars$logJapCars, ylab="Log MPG, Sample Quantiles", xlab="Standard Normal Quantiles", main="Japanese Cars log(MPG) Normal Probability Plot", col="firebrick2")
qqline(cars$logJapCars, col="firebrick2")
# boxplot comparison for US and Japanese cars' log(MPG)
var(cars$logUSCars)
## [1] 0.06085468
var(cars$logJapCars,na.rm = TRUE)
## [1] 0.03313062
boxplot(cars$logUSCars, cars$logJapCars, main="US vs. Japanese Log MPG", names=c("US","Japan"),col=c("steelblue","firebrick2"), ylab="Log MPG")
In comparing
The null hypothesis is that the means of the two log MPG populations, as estimated from the samples, are the same. Because the question is whether US cars have a lower mean mpg than Japanese cars, the alternative hypothesis is one-sided: the mean log MPG of US cars is less than that of Japanese cars.
Null hypothesis H0: µUS - µJapanese = 0
Alternative hypothesis, in which the mean log MPG of US cars is less than that of Japanese cars:
H1: µUS - µJapanese < 0
# perform one-sided Welch two-sample t-test (H1: US mean < Japanese mean) at the default 95% confidence level
t.test(cars$logUSCars,cars$logJapCars,alternative = "less")
##
## Welch Two Sample t-test
##
## data: cars$logUSCars and cars$logJapCars
## t = -9.804, df = 60.651, p-value = 2.008e-14
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.4396641
## sample estimates:
## mean of x mean of y
## 2.741001 3.270957
The p-value (2.008e-14) is far below α = 0.05, so we reject the null hypothesis. At the 5% significance level, the data support the conclusion that the mean log MPG of US cars is less than that of Japanese cars.
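To make the t.test output less of a black box, the Welch statistic, its degrees of freedom, the lower-tail p-value, and the one-sided confidence bound can be recomputed by hand. This is a sketch under the assumption that the values printed in the data frame above are the full data set. All object names (`us`, `japan`, `t_stat`, `df_welch`, and so on) are illustrative and do not appear in the original script.

```r
# MPG values copied from the printed data frame (NA rows dropped for Japan).
us <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 22, 18, 21, 21,
        10, 10, 11, 9, 28, 25, 19, 16, 17, 19, 18, 14, 14, 14, 14, 12, 13)
japan <- c(24, 27, 27, 25, 31, 35, 24, 19, 28, 23, 27, 20, 22, 18, 20, 31, 32,
           31, 32, 24, 26, 29, 24, 24, 33, 33, 32, 28)

log_us <- log(us)
log_japan <- log(japan)
n1 <- length(log_us)
n2 <- length(log_japan)

# Squared standard error contributed by each sample (variances are not pooled).
se2_us <- var(log_us) / n1
se2_japan <- var(log_japan) / n2
se_diff <- sqrt(se2_us + se2_japan)

# Welch t statistic for the difference in mean log MPG (US minus Japan).
diff_means <- mean(log_us) - mean(log_japan)
t_stat <- diff_means / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (se2_us + se2_japan)^2 /
  (se2_us^2 / (n1 - 1) + se2_japan^2 / (n2 - 1))

# Lower-tail p-value, matching alternative = "less".
p_value <- pt(t_stat, df_welch)

# Upper bound of the one-sided 95% confidence interval (lower bound is -Inf).
upper_bound <- diff_means + qt(0.95, df_welch) * se_diff

print(c(t = t_stat, df = df_welch, p = p_value, upper = upper_bound))
```

The printed values can be compared directly with the t, df, p-value, and confidence interval lines of the t.test output.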
Returning to the question posed by the environmental group, and assuming that the samples were truly random draws from a large population of US and Japanese cars, the data support the hypothesis that the mean mpg of cars manufactured in the US is less than that of cars manufactured in Japan. The one-sided 95% confidence interval for the difference in mean log MPG, µUS - µJapanese, has no lower bound (-Inf) and an upper bound of -0.4396641, so it lies entirely below 0.
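Because the test was run on log(MPG), the confidence bound is on the log scale. Exponentiating a difference in mean logs gives a ratio of geometric means, so the bound can be restated on the original mpg scale. The sketch below is an interpretive aid that was not part of the original analysis; `upper_log` and `ratio_upper` are illustrative names.

```r
# Upper bound of the one-sided 95% interval for the difference in mean
# log MPG (US minus Japan), copied from the t.test output.
upper_log <- -0.4396641

# exp() of a difference in mean logs is a ratio of geometric means,
# here geometric mean mpg of US cars divided by that of Japanese cars.
ratio_upper <- exp(upper_log)
print(ratio_upper)
```

Read this way, the interval says that the geometric mean mpg of US cars is estimated to be at most about 0.64 times that of Japanese cars. Note that this is a statement about geometric means, not arithmetic means.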
Complete R code used in this analysis.
#Import US_Japanese_Car from url
cars <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
names(cars)[names(cars) == "ï..USCars"] <- "USCars"
cars
# MPG for cars from the US and Japan on normal probability plots
qqnorm(cars$USCars, ylab="MPG, Sample Quantiles", xlab="Standard Normal Quantiles", main="US Cars Normal Probability Plot", col="steelblue")
qqline(cars$USCars, col="blue")
qqnorm(cars$JapaneseCars, ylab="MPG, Sample Quantiles", xlab="Standard Normal Quantiles", main="Japanese Cars Normal Probability Plot", col="firebrick2")
qqline(cars$JapaneseCars, col="red")
# boxplot comparison for US and Japanese cars' MPG
var(cars$USCars)
var(cars$JapaneseCars,na.rm = TRUE)
boxplot(cars$USCars,cars$JapaneseCars, main="US vs. Japanese MPG", names=c("US","Japan"),col=c("steelblue","firebrick2"), ylab="MPG")
# log(MPG) for cars from the US and Japan on normal probability plots
cars$logUSCars <- log(cars$USCars)
cars$logJapCars <- log(cars$JapaneseCars)
qqnorm(cars$logUSCars, ylab="Log MPG, Sample Quantiles", xlab="Standard Normal Quantiles", main="US Cars log(MPG) Normal Probability Plot", col="steelblue")
qqline(cars$logUSCars, col="blue")
qqnorm(cars$logJapCars, ylab="Log MPG, Sample Quantiles", xlab="Standard Normal Quantiles", main="Japanese Cars log(MPG) Normal Probability Plot", col="firebrick2")
qqline(cars$logJapCars, col="firebrick2")
# boxplot comparison for US and Japanese cars' log(MPG)
var(cars$logUSCars)
var(cars$logJapCars,na.rm = TRUE)
boxplot(cars$logUSCars, cars$logJapCars, main="US vs. Japanese Log MPG", names=c("US","Japan"),col=c("steelblue","firebrick2"), ylab="Log MPG")
# perform one-sided Welch two-sample t-test (H1: US mean < Japanese mean) at the default 95% confidence level
t.test(cars$logUSCars,cars$logJapCars,alternative = "less")