An environmental group would like to test the hypothesis that the mean mpg of cars manufactured in the US is less than that of cars manufactured in Japan. To this end, they sampled n1=35 US and n2=28 Japanese cars and tested each for mpg fuel efficiency. (As a caveat, assume that these are random samples from large populations of US and Japanese cars, not a complete census.) The data are reported in the following CSV file:
https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv
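# Load the data and extract the two samples
dat <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
us_cars <- dat$USCars
japan_cars <- dat$JapaneseCars  # includes 7 NA values (28 cars in 35 rows)
# Q-Q plot for the US sample
qqnorm(us_cars, xlab = "US Cars", main = "Normal Probability Dist. of US Cars")
qqline(us_cars, col = "blue", lwd = 2)
# Q-Q plot for the Japanese sample (qqnorm drops the NA values)
qqnorm(japan_cars, xlab = "Japanese Cars", main = "Normal Probability of Japanese Cars")
qqline(japan_cars, col = "red", lwd = 2)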
In both Q-Q plots the points fall approximately along the reference line, so it is reasonable to treat each sample as approximately normally distributed.
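A formal test can complement the visual check. The sketch below is not part of the original analysis: it applies the Shapiro-Wilk test (`shapiro.test` from base R) to simulated data, and the helper name `check_normality` is hypothetical. With the real data one would pass `us_cars` or `japan_cars` instead of the simulated vector.

```r
# Hypothetical helper: Shapiro-Wilk normality check that drops NA values first.
check_normality <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  res <- shapiro.test(x)            # base R (stats); requires 3 <= n <= 5000
  list(W = unname(res$statistic),
       p_value = res$p.value,
       reject_normality = res$p.value < alpha)
}

# Illustrative data only (simulated, NOT the car data)
set.seed(1)
demo <- c(rnorm(30, mean = 20, sd = 3), NA)
normality_result <- check_normality(demo)
print(normality_result)
```

A small p-value would argue against normality; a large one only means the test found no evidence against it, so the Q-Q plot remains the primary check.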
Next we inspect the spread of each data set and compare the two. Side-by-side box plots show visually whether the data sets overlap.
# Create side-by-side box plots
boxplot(us_cars, japan_cars,
names = c("US Cars", "Japanese Cars"), # Labels under each box
main = "Comparison of US and Japanese Car MPG Fuel Efficiency", # Main title
col = c("darkblue", "red"), # Optional: color for each box
ylab = "MPG") # Label for y-axis
It is evident from the box plots that there is very little, if any, overlap between the two samples, and the medians are far apart. The US data has two outliers, which are discussed later in this document. The US values are clustered toward the low end and thin out toward the high end, which is why the US median line sits in the lower part of its box. The Japanese values are more evenly distributed, and their median sits roughly in the middle of the box.
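The visual impression of spread can be backed by numbers. This is a minimal sketch, assuming a hypothetical helper `spread_summary` and made-up illustrative vectors (not the car data); with the real data one would pass `us_cars` and `japan_cars`.

```r
# Hypothetical helper: sample size, median, standard deviation and IQR,
# computed after dropping NA values.
spread_summary <- function(x) {
  x <- x[!is.na(x)]
  c(n = length(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# Illustrative values only (NOT the car data)
demo_a <- c(14, 15, 16, 16, 17, 18, 25)
demo_b <- c(24, 27, 29, 30, 31, 33, NA)

spread_table <- rbind(A = spread_summary(demo_a), B = spread_summary(demo_b))
print(round(spread_table, 2))
```

Comparing the `sd` and `iqr` columns gives a quick sense of whether the two samples have similar variability.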
We now transform the data by taking the natural logarithm of the values to see whether that affects our initial conclusions. The same graphs are repeated for the transformed data.
us_cars_log<-log(dat$USCars)
japan_cars_log<-log(dat$JapaneseCars)
# Create the Q-Q plot
qqnorm(us_cars_log, xlab = "US Cars in LOG ", main = "Normal Probability Dist. of US Cars-LOG transform")
qqline(us_cars_log, col = "blue", lwd = 2)
## Create the Q-Q plot
qqnorm(japan_cars_log, xlab = "Japanese Cars in LOG", main = "Normal Probability of Japanese Cars-LOG transform")
qqline(japan_cars_log, col = "red", lwd = 2)
After the log transform the points still fall approximately along the reference line, so the normality assumption remains reasonable.
# Create side-by-side box plots
boxplot(us_cars_log, japan_cars_log,
names = c("US Cars_log", "Japanese Cars_log"), # Labels under each box
main = "Comparison of US and Japanese Car MPG Fuel Efficiency in LOG", # Main title
col = c("darkblue", "red"), # Optional: color for each box
ylab = "MPG(log)") # Label for y-axis
However, the conclusion about spread and overlap remains the same: there is little or no overlap between the values of the two data sets. One interesting note is that the outliers in the US data do change under the log transform.
Our null and alternative hypotheses are:
\[\begin{array}{l} {H_0}:{\mu _a} \ge {\mu _j}\\ {H_a}:{\mu _a} < {\mu _j} \end{array}\]
Here \({\mu _a}\) is the population mean for the American cars and \({\mu _j}\) is the population mean for the Japanese cars. Because the test is carried out on the log-transformed data, both are means of log mpg. The sample mean of log mpg is 2.7410014 for the American cars and 3.2709572 for the Japanese cars.
The summary statistics for both samples, using the log-transformed data, are as follows:
summary(us_cars_log)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.197 2.639 2.708 2.741 2.890 3.332
summary(japan_cars_log)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.890 3.178 3.296 3.271 3.434 3.555 7
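We will test the hypothesis with a one-sided two-sample t-test on the log-transformed data. By default, t.test does not assume equal variances (Welch's test).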
t.test(us_cars_log, japan_cars_log,alternative = "less",mu=0,conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: us_cars_log and japan_cars_log
## t = -9.804, df = 60.651, p-value = 2.008e-14
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.4396641
## sample estimates:
## mean of x mean of y
## 2.741001 3.270957
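For reference, the Welch statistic and its approximate degrees of freedom reported in the output are defined as follows, where \(\bar x_a, \bar x_j\) are the sample means of log mpg, \(s_a^2, s_j^2\) the sample variances, and \(n_a, n_j\) the numbers of non-missing observations (35 and 28 here):

\[\begin{array}{l} t = \dfrac{{\bar x}_a - {\bar x}_j}{\sqrt {\dfrac{s_a^2}{n_a} + \dfrac{s_j^2}{n_j}} }\\[3ex] \nu \approx \dfrac{\left( \dfrac{s_a^2}{n_a} + \dfrac{s_j^2}{n_j} \right)^2}{\dfrac{(s_a^2/n_a)^2}{n_a - 1} + \dfrac{(s_j^2/n_j)^2}{n_j - 1}} \end{array}\]

The non-integer degrees of freedom in the output (60.651) come from this approximation.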
The p-value (2.008e-14) is far below the 0.05 significance level, so we reject the null hypothesis. The data provide strong evidence that the mean (log) mpg of US cars is lower than that of Japanese cars, which supports the environmental group's hypothesis.
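Because the test was run on natural logs, the difference in log means can be back-transformed into a ratio of geometric means on the original mpg scale. The sketch below uses only the sample means and the upper confidence bound printed in the t-test output; the variable names are illustrative.

```r
# Values copied from the t.test output
mean_log_us    <- 2.741001     # mean of x
mean_log_japan <- 3.270957     # mean of y
upper_bound    <- -0.4396641   # upper end of the one-sided 95% interval

# exp(difference of log means) = ratio of geometric means (US / Japan)
ratio_estimate <- exp(mean_log_us - mean_log_japan)
ratio_upper    <- exp(upper_bound)

print(round(c(estimate = ratio_estimate, upper_95 = ratio_upper), 3))
```

Under these numbers, the geometric mean mpg of the US cars is roughly 59% of the Japanese figure, with a one-sided 95% upper bound of about 64%.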
The box plots also show that the US data contains some outliers. The following is an analysis of those outliers, using the 1.5 × IQR rule.
#Outliers
us_Q1 <- quantile(us_cars, 0.25)
us_Q3 <- quantile(us_cars, 0.75)
us_IQR <- IQR(us_cars)
# Define bounds
us_lower_bound <- us_Q1 - 1.5 * us_IQR
us_upper_bound <- us_Q3 + 1.5 * us_IQR
# Identify outliers
us_outliers <- us_cars[us_cars < us_lower_bound | us_cars > us_upper_bound]
print(us_outliers)
## [1] 28 25
The outliers in the US data are 28 and 25 MPG. It is interesting to note that these values fall within the range of the Japanese data (roughly 18 to 35 MPG, back-transforming the log summary above). This may be evidence of a manufacturer trying to be more competitive with the Japanese on MPG for a particular class of car.
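The same 1.5 × IQR rule can be wrapped in a small function so that it applies to either sample, including the Japanese column with its missing values. This is a sketch: the name `iqr_outliers` is hypothetical and the demo vector is made up.

```r
# Hypothetical helper: values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NA.
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- unname(quantile(x, c(0.25, 0.75)))   # first and third quartiles
  fence <- k * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Illustrative values only (NOT the car data)
demo <- c(1:10, 100, NA)
demo_outliers <- iqr_outliers(demo)
print(demo_outliers)   # 100
```

With the real data, `iqr_outliers(us_cars)` and `iqr_outliers(japan_cars)` would apply the rule to each sample. Note that `boxplot()` computes its whiskers from hinges (`fivenum()`), which can differ slightly from the `quantile()` defaults, so the flagged points may not always match the plot exactly.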
Here is the source code for this assignment:
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
us_cars<-dat$USCars
japan_cars<-dat$JapaneseCars
print(japan_cars)
# Create the Q-Q plot
qqnorm(us_cars, xlab = "US Cars ", main = "Normal Probability Dist. of US Cars")
qqline(us_cars, col = "blue", lwd = 2)
## Create the Q-Q plot
qqnorm(japan_cars, xlab = "Japanese Cars", main = "Normal Probability of Japanese Cars")
qqline(japan_cars, col = "red", lwd = 2)
# Create side-by-side box plots
boxplot(us_cars, japan_cars,
names = c("US Cars", "Japanese Cars"), # Labels under each box
main = "Comparison of US and Japanese Car MPG Fuel Efficiency", # Main title
col = c("darkblue", "red"), # Optional: color for each box
ylab = "MPG") # Label for y-axis
# Log transform of the data
us_cars_log<-log(dat$USCars)
japan_cars_log<-log(dat$JapaneseCars)
print(japan_cars)
# Create the Q-Q plot
qqnorm(us_cars_log, xlab = "US Cars in LOG ", main = "Normal Probability Dist. of US Cars-LOG transform")
qqline(us_cars_log, col = "blue", lwd = 2)
## Create the Q-Q plot
qqnorm(japan_cars_log, xlab = "Japanese Cars in LOG", main = "Normal Probability of Japanese Cars-LOG transform")
qqline(japan_cars_log, col = "red", lwd = 2)
# Create side-by-side box plots
boxplot(us_cars_log, japan_cars_log,
names = c("US Cars_log", "Japanese Cars_log"), # Labels under each box
main = "Comparison of US and Japanese Car MPG Fuel Efficiency in LOG", # Main title
col = c("darkblue", "red"), # Optional: color for each box
ylab = "MPG(log)") # Label for y-axis
summary(us_cars)
summary(us_cars_log)
summary(japan_cars)
summary(japan_cars_log)
mean(us_cars)
mean(us_cars_log)
mean(japan_cars,na.rm = TRUE)
mean(japan_cars_log,na.rm = TRUE)
t.test(us_cars_log, japan_cars_log,alternative = "less",mu=0,conf.level = 0.95)
#Outliers
us_Q1 <- quantile(us_cars, 0.25)
us_Q3 <- quantile(us_cars, 0.75)
us_IQR <- IQR(us_cars)
# Define bounds
us_lower_bound <- us_Q1 - 1.5 * us_IQR
us_upper_bound <- us_Q3 + 1.5 * us_IQR
# Identify outliers
us_outliers <- us_cars[us_cars < us_lower_bound | us_cars > us_upper_bound]
print(us_outliers)