data <- read.csv('https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv')
We first check how many observations each sample has, to judge whether we can rely on the Central Limit Theorem.
sum(!is.na(data$JapaneseCars))
## [1] 28
length(data$ï..USCars)
## [1] 35
I would not rely on the Central Limit Theorem here. The U.S. sample has 35 observations, but the Japanese sample has only 28, which falls below the common rule of thumb of roughly 30 to 40 observations per sample.
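The counting step can also be done for every column at once. This is a minimal sketch, assuming the CSV pads the shorter column with NA; the data frame `cars` and its values are hypothetical stand-ins, not the real data.

```r
# Hypothetical stand-in for the CSV: 35 rows, the shorter column padded with NA.
cars <- data.frame(
  USCars       = seq(10, 27, length.out = 35),
  JapaneseCars = c(seq(20, 35, length.out = 28), rep(NA, 7))
)

# Count the non-missing observations in each column.
n_obs <- colSums(!is.na(cars))
n_obs

# Flag samples that fall below the rough n >= 30 rule of thumb.
n_obs < 30
```

Counting non-missing values directly avoids subtracting the NA count from the column length by hand.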
qqnorm(data$JapaneseCars,main='Japanese Cars Normal Q-Q Plot')
qqline(data$JapaneseCars)
qqnorm(data$ï..USCars,main='U.S Cars Normal Q-Q Plot')
qqline(data$ï..USCars)
The U.S. cars may not be normally distributed: the points drift upward, away from the reference line, at the top of the plot. The Japanese cars follow the reference line more closely.
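Reading a Q-Q plot is a judgment call, so a numerical check can complement it. The sketch below uses the Shapiro-Wilk test, which is not part of the original analysis. The two samples are simulated, hypothetical stand-ins (one roughly normal, one right-skewed), not the real mpg data.

```r
set.seed(1)  # simulated, hypothetical samples; not the real mpg data
roughly_normal <- rnorm(28, mean = 27, sd = 5)
right_skewed   <- rlnorm(35, meanlog = log(15), sdlog = 0.4)

sw_normal <- shapiro.test(roughly_normal)
sw_skewed <- shapiro.test(right_skewed)

# A small p-value is evidence against normality; a large one is not proof of it.
c(normal = sw_normal$p.value, skewed = sw_skewed$p.value)
```

With samples of about 30 observations the test has limited power, so it is best read alongside the plots rather than instead of them.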
boxplot(data$JapaneseCars,data$ï..USCars,names=c("Japanese","U.S"),main='Boxplot of Japanese and U.S Cars mpg')
The boxplots show a clear difference in center between the two samples. The U.S. sample is centered around 15 mpg, with a much narrower interquartile range than the Japanese sample, which is centered around 27 mpg. That difference in spread suggests the variances are probably not equal, and the uneven halves of the U.S. box suggest the U.S. data may be skewed to the right.
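The visual impression of unequal spread can be backed with numbers. This is a rough sketch on simulated, hypothetical samples (not the real mpg data) that compares the standard deviation and interquartile range and then runs an F test for equal variances with `var.test`. The F test is an addition here and was not part of the original analysis.

```r
set.seed(2)  # simulated, hypothetical samples; not the real mpg data
us    <- rlnorm(35, meanlog = log(15), sdlog = 0.25)
japan <- rnorm(28, mean = 27, sd = 5)

# Side-by-side spread summaries: one column per sample.
spread <- sapply(list(US = us, Japan = japan),
                 function(x) c(sd = sd(x), IQR = IQR(x)))
spread

# F test for equal variances; it assumes normality, so read it with caution.
vt <- var.test(us, japan)
vt$estimate  # ratio of the two sample variances
vt$p.value
```

Because the F test is sensitive to non-normality, a skewed sample can distort it; the descriptive summaries are the safer guide.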
data$logUS <- log(data$ï..USCars)
data$logJapan <- log(data$JapaneseCars)
Here are the new normal probability plots for the two samples after the log transformation:
qqnorm(data$logJapan,main='Log Transform Japanese Cars Normal Q-Q Plot')
qqline(data$logJapan)
qqnorm(data$logUS,main='Log Transform U.S Cars Normal Q-Q Plot')
qqline(data$logUS)
The U.S. sample now appears approximately normally distributed. The Japanese sample is mostly unchanged by the transformation.
boxplot(data$logJapan,data$logUS,names=c("Japanese","U.S"),main='Boxplot of log transformed data')
The new boxplot looks broadly similar. The interquartile range for Japan is about the same, but its overall range appears smaller. The U.S. data now has a wider range, but the two halves of its box are still not symmetric.
The null hypothesis is that U.S. and Japanese cars have the same mean mpg (tested here on the log scale).
H0: meanUS = meanJapan
The alternative hypothesis is that the mean mpg of U.S. cars is lower than that of Japanese cars.
Ha: meanUS < meanJapan
t.test(data$logUS,data$logJapan,var.equal = TRUE,alternative = 'less')
##
## Two Sample t-test
##
## data: data$logUS and data$logJapan
## t = -9.4828, df = 61, p-value = 6.528e-14
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.4366143
## sample estimates:
## mean of x mean of y
## 2.741001 3.270957
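To make the pooled test less of a black box, the sketch below recomputes the t statistic, the degrees of freedom, and the one-sided p-value by hand and compares them with `t.test`. The vectors `x` and `y` are simulated, hypothetical stand-ins for the log-transformed samples (35 and 28 observations), so the numbers will not match the output above.

```r
set.seed(42)  # simulated, hypothetical stand-ins for the log mpg samples
x <- rnorm(35, mean = 2.74, sd = 0.2)  # plays the role of data$logUS
y <- rnorm(28, mean = 3.27, sd = 0.2)  # plays the role of data$logJapan

n_x <- length(x)
n_y <- length(y)
df  <- n_x + n_y - 2  # 35 + 28 - 2 = 61, as in the output above

# Pooled variance: the two sample variances weighted by their degrees of freedom.
sp2 <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / df

# Pooled two-sample t statistic and lower-tail p-value (alternative = 'less').
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n_x + 1 / n_y))
p_manual <- pt(t_manual, df = df)

fit <- t.test(x, y, var.equal = TRUE, alternative = 'less')
all.equal(unname(fit$statistic), t_manual)
all.equal(fit$p.value, p_manual)
```

The degrees of freedom depend only on the sample sizes, which is why the simulated data reproduce df = 61 even though the statistic itself differs.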
The sample mean of log mpg is 2.741 for U.S. cars and 3.271 for Japanese cars, so the U.S. average is clearly lower.
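The estimates in the output are on the log scale, so they are easier to interpret after exponentiating. Exponentiating a mean of logs gives a geometric mean, and exponentiating a difference of log means gives a ratio of geometric means. The sketch below uses only the numbers printed in the t-test output; the variable names are my own.

```r
# Sample means of log mpg, copied from the t.test output.
log_means <- c(US = 2.741001, Japan = 3.270957)

# Back-transform to geometric means on the original mpg scale.
geo_means <- exp(log_means)
round(geo_means, 1)

# Upper bound of the one-sided 95% interval for the difference in log means,
# back-transformed to an upper bound on the ratio US / Japan.
ratio_upper <- exp(-0.4366143)
round(ratio_upper, 2)
```

This gives geometric means of roughly 15.5 mpg for U.S. cars and 26.3 mpg for Japanese cars, and the interval suggests the U.S. geometric mean is at most about 65% of the Japanese one.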
We conclude that U.S. cars do have a lower mean mpg than Japanese cars, based on this sample. With a p-value of 6.528e-14, we reject the null hypothesis that the two means are equal.
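Since the boxplots raised doubts about equal variances, one possible robustness check is to rerun the test without pooling (Welch's t-test, `var.equal = FALSE`) and see whether the conclusion changes. This check is not part of the original analysis, and the samples below are simulated, hypothetical stand-ins for the log mpg data.

```r
set.seed(3)  # simulated, hypothetical stand-ins for the log mpg samples
log_us    <- rnorm(35, mean = 2.74, sd = 0.15)
log_japan <- rnorm(28, mean = 3.27, sd = 0.25)

pooled <- t.test(log_us, log_japan, var.equal = TRUE,  alternative = 'less')
welch  <- t.test(log_us, log_japan, var.equal = FALSE, alternative = 'less')

# Welch's approximation never gives more degrees of freedom than the pooled test.
c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter))
c(pooled_p = pooled$p.value, welch_p = welch$p.value)
```

If both versions lead to the same decision, the equal-variance assumption is not driving the result.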