Part 1: Introduction

According to the EPA, fuel economy data come from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by the EPA. The data were obtained from the EPA site listed in the references. They form the basis for this analysis of the highway mileage of cars with different cylinder counts in the United States.

Part 2: Data Overview

This dataset contains the number of cylinders and gas mileage information for car models from 1984 to 2023. The data were obtained from the EPA (https://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip). There are 44787 observations and 83 variables in the data.

Libraries:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.6     v purrr   0.3.4
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Read data into R:

I uploaded the data to my GitHub repository and read it into R from there. GitHub file: https://raw.githubusercontent.com/deepasharma06/Data-606/main/vehicles%5B1%5D.csv

df <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-606/main/vehicles%5B1%5D.csv")
# df1 <- na.omit(df)   # Not run: I wanted to omit rows with missing values, but doing so significantly decreased my sample size.
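One likely reason na.omit(df) shrinks the sample so much is that it drops a row when any of the 83 columns is missing, not just the two used here. A minimal sketch of the alternative, using a small made-up data frame (toy and other_col are hypothetical stand-ins, not the EPA data):

```r
# Toy stand-in for the EPA table (hypothetical values, not real data).
toy <- data.frame(
  cylinders = c(4, 6, NA, 4, 8),
  highway08 = c(30, 24, 100, 28, 18),
  other_col = c(NA, "a", "b", NA, "c")   # an unrelated column with gaps
)

# na.omit() on the whole table drops a row if ANY column is missing.
all_cols <- na.omit(toy)

# Restricting to the analysis columns first keeps more rows.
analysis_cols <- na.omit(toy[, c("cylinders", "highway08")])

nrow(all_cols)
nrow(analysis_cols)
```

On the real data, the same idea would be na.omit(df[, c("cylinders", "highway08")]).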

Part 3: Data analysis

The data give the highway mileage for cars with different cylinder counts. The analysis tests whether there is a significant difference between the mean highway mileages of cars with different cylinder counts. I focus on the two most common cylinder counts in the data (4 and 6). Of the 44787 records in the dataset, 32785 are for 4-cylinder or 6-cylinder cars.

Independent Variable: Number of cylinders (cylinders)

The independent variable in this analysis will be the number of cylinders (cylinders) that the car has.

df %>%
  summarise(mean   = mean(cylinders, na.rm=TRUE), 
            median = median(cylinders, na.rm=TRUE), 
            n         = n(),
            sd     = sd(cylinders, na.rm=TRUE),
            var    = var(cylinders, na.rm=TRUE),
            iqr    = IQR(cylinders, na.rm=TRUE),
            min    = min(cylinders, na.rm=TRUE),
            max    = max(cylinders, na.rm=TRUE))
##       mean median     n       sd      var iqr min max
## 1 5.710999      6 44787 1.768306 3.126905   2   2  16

Dependent Variable: Highway Mileage (highway08)

The dependent variable in this analysis will be the highway mileage of the car (highway08). It is treated as the response to the independent variable (cylinders).

df %>%
  summarise(mean   = mean(highway08, na.rm=TRUE), 
            median = median(highway08, na.rm=TRUE), 
            n         = n(),
            sd     = sd(highway08, na.rm=TRUE),
            var    = var(highway08, na.rm=TRUE),
            iqr    = IQR(highway08, na.rm=TRUE),
            min    = min(highway08, na.rm=TRUE),
            max    = max(highway08, na.rm=TRUE))
##       mean median     n       sd      var iqr min max
## 1 24.95816     24 44787 8.818173 77.76018   8   9 133
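Note that n = n() counts every row, including rows where the variable is missing, while the other statistics use na.rm = TRUE. The sketch below wraps the summary in a helper that reports missing values separately. The name summarise_var and the toy vector are hypothetical, and the helper uses base R rather than dplyr.

```r
# Hypothetical helper: one-row summary of a numeric vector, with the
# missing values counted separately instead of being hidden by na.rm.
summarise_var <- function(x) {
  data.frame(
    mean      = mean(x, na.rm = TRUE),
    median    = median(x, na.rm = TRUE),
    n_total   = length(x),        # all rows, like n()
    n_missing = sum(is.na(x)),    # rows excluded from the other statistics
    sd        = sd(x, na.rm = TRUE),
    iqr       = IQR(x, na.rm = TRUE),
    min       = min(x, na.rm = TRUE),
    max       = max(x, na.rm = TRUE)
  )
}

toy_cyl <- c(4, 4, 6, NA, 8)   # made-up values
toy_summary <- summarise_var(toy_cyl)
toy_summary
```

On the real data this would be called as summarise_var(df$cylinders) and summarise_var(df$highway08).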

Part 4: Inference

Statistical Input:

The first step was to identify the most common cylinder counts to use for the comparison. I plotted the data in a histogram, which shows that cars with 4 and 6 cylinders are the most common in the data set. Of the 44787 records, 32785 are for cars with either 4 or 6 cylinders. Hence, I use only the records where the cylinder count is 4 or 6 for further analysis.

hist(df$cylinders)
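The histogram shows the pattern visually. A frequency table is one way to confirm the counts numerically. This is a sketch on a made-up vector (toy_cyl is hypothetical), not the actual tabulation from the EPA data:

```r
toy_cyl <- c(4, 4, 6, 8, 4, 6, NA, 12)   # made-up values

# Frequency of each cylinder count, most common first (NA is not tabulated).
counts <- sort(table(toy_cyl), decreasing = TRUE)
counts

# Number of rows where the cylinder count is 4 or 6.
# %in% returns FALSE for NA, so missing values are not counted.
n_4_or_6 <- sum(toy_cyl %in% c(4, 6))
n_4_or_6
```

On the real data, table(df$cylinders) and sum(df$cylinders %in% c(4, 6)) would give the corresponding figures.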

Test selection:

From the box plot of the data for cars with 4 and 6 cylinders, we can see that the medians of the two groups are quite far apart. The interquartile ranges (IQRs) are also well separated, with only a little overlap. The minimum and maximum differ between the two groups, but both groups show a similar pattern. For the analysis, we want to test whether the difference in means is significant.

### To perform this analysis, I created a separate data frame (df1) containing only the rows where cylinders is 4 or 6.
### I then attached df1 so that its columns can be referenced by name. The code shows the median highway mileage for this subset, which is 26 miles/gallon (the median for all cars is 24).

df1 <- df %>%
  select(cylinders, highway08) %>%
  filter(cylinders %in% c(4, 6))

attach(df1)
median(highway08)
## [1] 26
boxplot(highway08 ~ cylinders, data = df1)
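The quantities read off the box plot (medians and quartiles) can also be computed directly. A small sketch with made-up mileage values (the toy data frame is hypothetical):

```r
toy <- data.frame(
  cylinders = c(4, 4, 4, 4, 4, 6, 6, 6, 6, 6),
  highway08 = c(26, 28, 29, 31, 34, 20, 22, 23, 25, 27)
)

# Quartiles of highway mileage within each cylinder group.
# tapply() returns a list here, one named vector per group.
quartiles <- tapply(toy$highway08, toy$cylinders, quantile,
                    probs = c(0.25, 0.5, 0.75))
quartiles

# TRUE when the two interquartile ranges do not overlap at all.
iqr_separated <- unname(quartiles[["6"]]["75%"] < quartiles[["4"]]["25%"])
iqr_separated
```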

Hypothesis Testing:

  • Null Hypothesis: The difference between the mean highway mileages of cars with 4 cylinders and those with 6 cylinders is 0
  • Alternative Hypothesis: The difference between the mean highway mileages of cars with 4 cylinders and those with 6 cylinders is not 0
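
In symbols, the two hypotheses can be written as follows, where \mu_4 and \mu_6 are notation introduced here (not in the original) for the population mean highway mileage of 4-cylinder and 6-cylinder cars:

```latex
H_0:\ \mu_4 - \mu_6 = 0
\qquad \text{versus} \qquad
H_A:\ \mu_4 - \mu_6 \neq 0
```

This is a two-sided alternative, which matches the "not equal to 0" wording in the t-test output.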

Box plot of highway mileage against number of cylinders.

boxplot(highway08~cylinders)

I used dplyr to summarize the data. Below, I filter for rows where cylinders is 4 or 6, group the data by cylinders, and compute the average highway mileage for each group.

df1 %>%
  select(cylinders, highway08) %>%
  filter(cylinders == "4" |
           cylinders == "6") %>%
  group_by(cylinders) %>%
  summarise(Average_Highway_Milage = mean(highway08))
## # A tibble: 2 x 2
##   cylinders Average_Highway_Milage
##       <int>                  <dbl>
## 1         4                   28.7
## 2         6                   23.0

The above analysis shows that the average highway mileage is 28.69928 for 4-cylinder cars and 22.99159 for 6-cylinder cars.

I calculated the difference between the two averages below.

28.69928-22.99159
## [1] 5.70769

From the above result, I can see that there is a difference of 5.71 miles per gallon between the average highway mileage of a 4-cylinder car and that of a 6-cylinder car.
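
Typing the two means by hand works, but the difference can also be computed from the data so it stays in sync if the data change. A minimal base R sketch on made-up values (the toy data frame is hypothetical):

```r
toy <- data.frame(
  cylinders = c(4, 4, 4, 6, 6, 6),
  highway08 = c(28, 29, 30, 22, 23, 24)
)

# Mean highway mileage per cylinder group, indexed by group label.
group_means <- tapply(toy$highway08, toy$cylinders, mean)

# Difference in means: 4-cylinder minus 6-cylinder.
mean_diff <- unname(group_means["4"] - group_means["6"])
mean_diff
```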

To determine whether that difference is statistically significant, I performed a t-test comparing the highway mileages of 4-cylinder and 6-cylinder cars.

t.test(data = df1, highway08~cylinders)
## 
##  Welch Two Sample t-test
## 
## data:  highway08 by cylinders
## t = 113.45, df = 31364, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 4 and group 6 is not equal to 0
## 95 percent confidence interval:
##  5.609092 5.806306
## sample estimates:
## mean in group 4 mean in group 6 
##        28.69928        22.99159

The average highway mileage is 28.69928 for 4-cylinder cars and 22.99159 for 6-cylinder cars, an observed difference of 5.70769 miles per gallon. The null hypothesis is that the difference between the mean highway mileages of 4-cylinder and 6-cylinder cars is 0. The p-value is the probability of observing a difference at least this large if the null hypothesis were true. The t-test above reports a p-value below 2.2e-16, which is effectively zero. The null hypothesis is therefore rejected, and I conclude that there is a statistically significant difference between the mean highway mileages of 4-cylinder and 6-cylinder cars.
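
To make the numbers in the Welch t-test output less of a black box, here is a sketch that computes the t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then runs t.test() on the same values for comparison. The function name welch_by_hand and the two short vectors are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: Welch two-sample t-test computed step by step.
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  dof <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df = dof)   # two-sided p-value
  c(t = t_stat, df = dof, p = p_value)
}

four <- c(27, 29, 30, 31, 28, 26)   # made-up 4-cylinder mileages
six  <- c(22, 24, 23, 21, 25)       # made-up 6-cylinder mileages

by_hand  <- welch_by_hand(four, six)
built_in <- t.test(four, six)       # Welch test is the default (var.equal = FALSE)

by_hand
built_in$statistic
built_in$parameter
built_in$p.value
```

The hand computation and t.test() should agree up to floating-point rounding, which is a quick check that the formula matches what R does by default.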

df %>%
  ggplot(aes(x =  cylinders,y = log(highway08), col = highway08, size = highway08))+
  geom_point(alpha = .005)+
  geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 372 rows containing non-finite values (stat_smooth).
## Warning: Removed 372 rows containing missing values (geom_point).

The graphic did not come out as I expected because cylinders takes only discrete values (2 to 16), so the points stack in vertical bands at each cylinder count. However, the geom_smooth line shows the trend: highway mileage (plotted on a log scale) decreases as the cylinder count increases, which is consistent with the analysis above.
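
Because the cylinder counts are discrete, the trend can also be summarized without a scatter plot: a mean per cylinder count, plus the slope of the same linear fit that geom_smooth(method = lm) draws. This is a base R sketch on made-up values (the toy data frame is hypothetical), not a result from the EPA data.

```r
toy <- data.frame(
  cylinders = c(4, 4, 6, 6, 8, 8, NA),
  highway08 = c(30, 28, 24, 22, 18, 17, 100)
)

# Mean highway mileage at each cylinder count.
# Rows with a missing cylinder count fall out of the grouping.
means_by_cyl <- tapply(toy$highway08, toy$cylinders, mean)
means_by_cyl

# Linear fit on the log scale, as in the plot.
# lm() drops rows with missing values by default.
fit <- lm(log(highway08) ~ cylinders, data = toy)
slope <- unname(coef(fit)["cylinders"])
slope   # a negative slope means mileage falls as cylinders increase
```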

Conclusion:

As seen in the t-test above, we can be 95% confident that 4-cylinder cars average between 5.609092 and 5.806306 more highway miles per gallon than 6-cylinder cars. This analysis is relevant because 4 and 6 cylinder engines are the most common in this data set of car models in the United States. Car buyers can use the results when deciding what type of car to buy. Those who favor fuel efficiency should consider 4-cylinder vehicles, which are more fuel efficient on the highway on average.

One thing this analysis does not consider is overall mileage, which includes both highway and local driving conditions. It also does not account for fuel type (regular vs. super). A 4-cylinder car may cost more to fuel than a 6-cylinder car if the 4-cylinder car uses the more expensive super gas and the 6-cylinder car uses the cheaper regular gas.
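
The fuel-type caveat can be made concrete with a small worked example. Fuel cost per mile is the price per gallon divided by miles per gallon. The prices below are hypothetical numbers chosen for illustration, and the mileages are the rounded group means from the analysis (28.7 and 23.0).

```r
# Fuel cost per mile = price per gallon / miles per gallon.
cost_per_mile <- function(price_per_gallon, mpg) price_per_gallon / mpg

# Hypothetical prices in USD per gallon (assumptions, not data).
four_cyl_super  <- cost_per_mile(price_per_gallon = 4.60, mpg = 28.7)
six_cyl_regular <- cost_per_mile(price_per_gallon = 3.60, mpg = 23.0)

round(c(four_cyl_super = four_cyl_super,
        six_cyl_regular = six_cyl_regular), 4)
```

With these assumed prices, the 4-cylinder car on super gas costs slightly more per mile than the 6-cylinder car on regular gas, even though it gets better mileage.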

References:

Citation: Download Fuel Economy Data. www.fueleconomy.gov - the official government source for fuel economy information. (n.d.). Retrieved April 3, 2022, from https://www.fueleconomy.gov/feg/download.shtml