1 NPP Graphs

An environmental group would like to test the hypothesis that the mean mpg of cars manufactured in the US is less than that of cars manufactured in Japan. To this end, they sampled n1=35 US and n2=28 Japanese cars and tested each for mpg fuel efficiency. (As a caveat, assume that this is a random sample from a large population of US and Japanese cars, not a complete census.) The data are reported in the following CSV file:

https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv

The mpg values of both the US and Japanese cars appear to be approximately normally distributed, based on the following normal probability (Q-Q) plots. A straight reference line is included in each plot for illustration purposes.
# # Create the Q-Q plot
qqnorm(us_cars, xlab = "US Cars ", main = "Normal Probability Dist. of US Cars")
qqline(us_cars, col = "blue", lwd = 2)

## Create the Q-Q plot
qqnorm(japan_cars, xlab = "Japanese Cars", main = "Normal Probability of Japanese Cars")
qqline(japan_cars, col = "red", lwd = 2)

In both plots the points fall close to the reference line, so it is reasonable to treat each sample as approximately normally distributed.
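As a numerical complement to the visual check, a Shapiro-Wilk test can be run on each sample. The following is a minimal sketch and not part of the original analysis; the helper name `check_normality` is hypothetical, and the usage lines assume `us_cars` and `japan_cars` have been loaded as in the source code section.

```r
# Hypothetical helper (not in the original report): Shapiro-Wilk normality
# test on a numeric vector, ignoring missing values.
check_normality <- function(x) {
  x <- x[!is.na(x)]   # drop NA entries before testing
  shapiro.test(x)     # provided by the base 'stats' package
}

# Usage, assuming us_cars and japan_cars are already defined:
# check_normality(us_cars)
# check_normality(japan_cars)
```

A large p-value is consistent with normality but does not prove it, so the test is best read alongside the Q-Q plots.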

2 Data Set Variance

Next we compare the spread of the two data sets. Side-by-side boxplots show visually how the variability differs and whether the two distributions overlap.

# # Create side-by-side box plots
boxplot(us_cars, japan_cars,
        names = c("US Cars", "Japanese Cars"),  # Labels under each box
        main = "Comparison of US and Japanese Car MPG Fuel Efficiency", # Main title
        col = c("darkblue", "red"), # Optional: color for each box
        ylab = "MPG")                    # Label for y-axis

The box plots show very little, if any, overlap between the two data sets, and the medians are far apart. The US data has two outliers, which we will discuss later in this document. Looking at the individual data sets, the US values are clustered toward the low end and become sparser toward the high end, whereas the Japanese values are more evenly distributed. Accordingly, the US median line sits toward the lower end of its box, while the Japanese median sits roughly in the middle of its box.
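The visual comparison of spread can be backed up with the sample variances themselves. The sketch below is an addition to the report, not the author's method: the helper name `compare_spread` is hypothetical, and it uses base R's `var()` and `var.test()` (an F test, which is known to be sensitive to departures from normality).

```r
# Hypothetical helper (not in the original report): sample variances of two
# groups, their ratio, and an F test for equality of variances.
compare_spread <- function(x, y) {
  x <- x[!is.na(x)]   # drop missing values in each group
  y <- y[!is.na(y)]
  list(
    var_x  = var(x),
    var_y  = var(y),
    ratio  = var(x) / var(y),
    f_test = var.test(x, y)   # F test from the base 'stats' package
  )
}

# Usage, assuming us_cars and japan_cars are already defined:
# compare_spread(us_cars, japan_cars)
```

A variance ratio far from 1 would support using a test that does not assume equal variances.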

3 LOG Transform of the data set

We now transform our data set by taking the natural logarithm of the values to see whether that affects our initial conclusions. The initial graphs are repeated on the transformed data.

us_cars_log<-log(dat$USCars)
japan_cars_log<-log(dat$JapaneseCars)
# # Create the Q-Q plot
qqnorm(us_cars_log, xlab = "US Cars in LOG ", main = "Normal Probability Dist. of US Cars-LOG transform")
qqline(us_cars_log, col = "blue", lwd = 2)

## Create the Q-Q plot
qqnorm(japan_cars_log, xlab = "Japanese Cars in LOG", main = "Normal Probability of Japanese Cars-LOG transform")
qqline(japan_cars_log, col = "red", lwd = 2)

After the log transform the points still fall close to the reference line, so the normality assumption remains reasonable.

# # Create side-by-side box plots
boxplot(us_cars_log, japan_cars_log,
        names = c("US Cars_log", "Japanese Cars_log"),  # Labels under each box
        main = "Comparison of US and Japanese Car MPG Fuel Efficiency in LOG", # Main title
        col = c("darkblue", "red"), # Optional: color for each box
        ylab = "MPG(log)")                    # Label for y-axis

However, the conclusion about spread remains the same: there is little or no overlap between the data values of the two data sets. One interesting note is that the outliers for the US data change under the log transform.

4 Hypothesis Test and Conclusion

Our null and alternative hypotheses are:

\[\begin{array}{l} {H_0}:{\mu _a} \ge {\mu _j}\\ {H_a}:{\mu _a} < {\mu _j} \end{array}\]

Here \({\mu _a}\) is the mean for the American cars and \({\mu _j}\) is the mean for the Japanese cars. Because the test is run on the log-transformed data, the sample means are on the natural-log scale: 2.7410014 for the American cars and 3.2709572 for the Japanese cars.
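Since these two means are computed on log-transformed data, exponentiating them gives geometric means in mpg, which are easier to interpret. The following sketch is an addition to the report; the helper name `geo_mean` is hypothetical.

```r
# Hypothetical helper (not in the original report): geometric mean of a
# positive numeric vector, ignoring missing values.
geo_mean <- function(x) {
  exp(mean(log(x), na.rm = TRUE))
}

# Back-transform the two log-scale means quoted in the report.
us_geo    <- exp(2.7410014)   # roughly 15.5 mpg
japan_geo <- exp(3.2709572)   # roughly 26.3 mpg
print(c(US = us_geo, Japan = japan_geo))
```

Note that a geometric mean is generally not equal to the arithmetic mean of the raw mpg values, so these figures describe the log-scale analysis only.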

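The summary statistics for both samples, using the log-transformed data, are as follows. The 7 NA's reported for the Japanese cars are consistent with that column holding 28 values against 35 for the US cars.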

summary(us_cars_log)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.197   2.639   2.708   2.741   2.890   3.332
summary(japan_cars_log)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.890   3.178   3.296   3.271   3.434   3.555       7

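We will test the hypothesis with a one-sided two-sample t test. By default, R's t.test does not assume equal variances, so it performs the Welch version of the test.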

t.test(us_cars_log, japan_cars_log,alternative = "less",mu=0,conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  us_cars_log and japan_cars_log
## t = -9.804, df = 60.651, p-value = 2.008e-14
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.4396641
## sample estimates:
## mean of x mean of y 
##  2.741001  3.270957

The p-value (2.008e-14) is far smaller than 0.05, so we reject the null hypothesis and conclude that the mean mpg of US cars, on the log scale, is lower than that of Japanese cars. There is clear evidence from our data to support this conclusion.
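The statistic and degrees of freedom reported by t.test can be reproduced by hand from the standard Welch formulas, which makes clear where the fractional df comes from. This is a sketch added for illustration; the helper name `welch_by_hand` is hypothetical and the one-sided p-value matches `alternative = "less"`.

```r
# Hypothetical helper (not in the original report): Welch two-sample t test
# computed from first principles, for the alternative mean(x) < mean(y).
welch_by_hand <- function(x, y) {
  x <- x[!is.na(x)]   # drop missing values in each group
  y <- y[!is.na(y)]
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(t = t_stat, df = df, p.value = pt(t_stat, df))   # lower-tail p-value
}

# Usage, assuming the log-transformed vectors are already defined:
# welch_by_hand(us_cars_log, japan_cars_log)
```

Applied to the log-transformed vectors, the result should agree with the t, df, and p-value in the t.test output above.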

5 Outliers

The box and whisker plots also show that there are some outlying data values in the US data. The following analysis identifies them using the 1.5 × IQR rule.

#Outliers

us_Q1 <- quantile(us_cars, 0.25)
us_Q3 <- quantile(us_cars, 0.75)
us_IQR <- IQR(us_cars)

# Define bounds
us_lower_bound <- us_Q1 - 1.5 * us_IQR
us_upper_bound <- us_Q3 + 1.5 * us_IQR

# Identify outliers
us_outliers <- us_cars[us_cars < us_lower_bound | us_cars > us_upper_bound]
print(us_outliers)
## [1] 28 25

The outliers present in the US data are 28 and 25 MPG. Interestingly, these values fall close to the range of the Japanese data. This may be evidence of a manufacturer trying to be more competitive with the Japanese in terms of MPG on a particular class of car, although the data alone cannot confirm this.
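The same 1.5 × IQR rule can be wrapped in a small function so it applies to either sample, including the Japanese column with its missing values (`quantile()` stops with an error on NA input unless `na.rm = TRUE`). This is a sketch; the helper name `iqr_outliers` is hypothetical and not part of the original code.

```r
# Hypothetical helper (not in the original report): values lying more than
# k * IQR below the first quartile or above the third quartile.
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]                 # drop missing values first
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  lower <- q1 - k * (q3 - q1)       # lower fence
  upper <- q3 + k * (q3 - q1)       # upper fence
  unname(x[x < lower | x > upper])
}

# Usage, assuming the data vectors are already defined:
# iqr_outliers(us_cars)
# iqr_outliers(japan_cars)
```

Base R's `boxplot.stats(x)$out` returns the points drawn as outliers in a boxplot; it computes the box edges from hinges rather than from `quantile()`, so the two approaches can differ slightly on small samples.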

6 R Source code

Here is the source code for this assignment:

dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/US_Japanese_Cars.csv")
us_cars<-dat$USCars
japan_cars<-dat$JapaneseCars
print(japan_cars)

# # Create the Q-Q plot
 qqnorm(us_cars, xlab = "US Cars ", main = "Normal Probability Dist. of US Cars")
 qqline(us_cars, col = "blue", lwd = 2)
 
## Create the Q-Q plot
 qqnorm(japan_cars, xlab = "Japanese Cars", main = "Normal Probability of Japanese Cars")
 qqline(japan_cars, col = "red", lwd = 2)
 

# 
# # Create side-by-side box plots
 boxplot(us_cars, japan_cars,
         names = c("US Cars", "Japanese Cars"),  # Labels under each box
         main = "Comparison of US and Japanese Car MPG Fuel Efficiency", # Main title
         col = c("darkblue", "red"), # Optional: color for each box
         ylab = "MPG")                    # Label for y-axis
 
# LOG transform of data
 us_cars_log<-log(dat$USCars)
 japan_cars_log<-log(dat$JapaneseCars)
 print(japan_cars)
 
 # # Create the Q-Q plot
 qqnorm(us_cars_log, xlab = "US Cars in LOG ", main = "Normal Probability Dist. of US Cars-LOG transform")
 qqline(us_cars_log, col = "blue", lwd = 2)
 
 ## Create the Q-Q plot
 qqnorm(japan_cars_log, xlab = "Japanese Cars in LOG", main = "Normal Probability of Japanese Cars-LOG transform")
 qqline(japan_cars_log, col = "red", lwd = 2)
 
 
 
 # 
 # # Create side-by-side box plots
 boxplot(us_cars_log, japan_cars_log,
         names = c("US Cars_log", "Japanese Cars_log"),  # Labels under each box
         main = "Comparison of US and Japanese Car MPG Fuel Efficiency in LOG", # Main title
         col = c("darkblue", "red"), # Optional: color for each box
         ylab = "MPG(log)")                    # Label for y-axis
 summary(us_cars)
 summary(us_cars_log)
 summary(japan_cars)
 summary(japan_cars_log)
 mean(us_cars)
 mean(us_cars_log)
 mean(japan_cars,na.rm = TRUE)
 mean(japan_cars_log,na.rm = TRUE)
 
 t.test(us_cars_log, japan_cars_log,alternative = "less",mu=0,conf.level = 0.95)
 
 
 #Outliers
 
 us_Q1 <- quantile(us_cars, 0.25)
 us_Q3 <- quantile(us_cars, 0.75)
 us_IQR <- IQR(us_cars)
 
 # Define bounds
 us_lower_bound <- us_Q1 - 1.5 * us_IQR
 us_upper_bound <- us_Q3 + 1.5 * us_IQR
 
 # Identify outliers
 us_outliers <- us_cars[us_cars < us_lower_bound | us_cars > us_upper_bound]
 print(us_outliers)