In the data cleaning process, I converted the variable names to lower case while otherwise keeping their syntax. I then ensured that the object saved to “cars” was a data frame and converted “test_fuel_type_description” and “vehicle_type” to factors. I left the rest of the variables in their original form, as they appeared accurate to work with. I then selected the variables that I believed would be most important for analyzing fuel economy. I removed electric vehicles using the fuel type column, as they don’t emit compounds like gas vehicles do. Lastly, I removed any remaining rows with “NA” to focus on the sample of the data with complete information for every row.
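The cleaning code itself is not shown in this report, so the following is only a minimal base-R sketch of the steps described above, run on a made-up `raw` data frame. The toy values and the name-cleaning regular expressions are assumptions for illustration; the original analysis may well have used a helper package instead.

```r
# Toy stand-in for the raw data (hypothetical values, not the real file)
raw <- data.frame(
  "Vehicle Type" = c("Car", "Truck", "Both", "Car"),
  "Test Fuel Type Description" = c("Tier 2 Cert Gasoline", "Electricity",
                                   "Tier 2 Cert Gasoline", "Tier 2 Cert Gasoline"),
  "CO2 (g/mi)" = c(250.1, 0, 410.5, NA),
  "RND_ADJ_FE" = c(35.2, 120.0, 21.4, 30.1),
  check.names = FALSE
)

# 1. Lower-case the names and replace runs of other characters with "_"
nm <- tolower(names(raw))
nm <- gsub("[^a-z0-9]+", "_", nm)
nm <- gsub("^_+|_+$", "", nm)

# 2. Make sure the object is a data frame, then apply the cleaned names
cars <- as.data.frame(raw)
names(cars) <- nm

# 3. Convert the two categorical columns to factors
cars$test_fuel_type_description <- factor(cars$test_fuel_type_description)
cars$vehicle_type <- factor(cars$vehicle_type)

# 4. Keep only the variables of interest (all four in this toy example)
keep <- c("vehicle_type", "test_fuel_type_description", "co2_g_mi", "rnd_adj_fe")
cars <- cars[keep]

# 5. Drop electric vehicles, then any rows with missing values
cars <- cars[cars$test_fuel_type_description != "Electricity", ]
cars <- na.omit(cars)
print(cars)
```

Note that filtering rows this way leaves “Electricity” as an empty factor level, which is consistent with the zero counts that appear in the later summary output.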
# Descriptive stats
myvars <- c("rated_horsepower", "x_of_cylinders_and_rotors", "equivalent_test_weight_lbs", "n_v_ratio", "thc_g_mi", "co_g_mi", "co2_g_mi", "n_ox_g_mi", "pm_g_mi", "ch4_g_mi", "n2o_g_mi", "rnd_adj_fe")
summary(cars[myvars])
## rated_horsepower x_of_cylinders_and_rotors equivalent_test_weight_lbs
## Min. : 137.0 Min. : 4 Min. :3500
## 1st Qu.: 255.0 1st Qu.: 4 1st Qu.:4250
## Median : 347.0 Median : 6 Median :4750
## Mean : 390.1 Mean : 6 Mean :4774
## 3rd Qu.: 473.0 3rd Qu.: 8 3rd Qu.:5250
## Max. :1500.0 Max. :16 Max. :6500
## n_v_ratio thc_g_mi co_g_mi co2_g_mi
## Min. : 9.40 Min. :0.000000 Min. :0.0000 Min. :167.0
## 1st Qu.:22.90 1st Qu.:0.003281 1st Qu.:0.1293 1st Qu.:264.5
## Median :24.80 Median :0.010541 Median :0.2611 Median :317.6
## Mean :25.39 Mean :0.016980 Mean :0.3364 Mean :342.2
## 3rd Qu.:26.90 3rd Qu.:0.021773 3rd Qu.:0.4511 3rd Qu.:401.4
## Max. :40.10 Max. :0.210638 Max. :2.2279 Max. :839.0
## n_ox_g_mi pm_g_mi ch4_g_mi n2o_g_mi
## Min. :0.000000 Min. :0.0000000 Min. :0.000000 Min. :0.000e+00
## 1st Qu.:0.003100 1st Qu.:0.0002347 1st Qu.:0.001483 1st Qu.:5.578e-05
## Median :0.006486 Median :0.0006195 Median :0.002885 Median :5.133e-04
## Mean :0.009877 Mean :0.0007874 Mean :0.004787 Mean :1.262e-03
## 3rd Qu.:0.013050 3rd Qu.:0.0010000 3rd Qu.:0.006052 3rd Qu.:1.038e-03
## Max. :0.066365 Max. :0.0052780 Max. :0.035236 Max. :3.679e-02
## rnd_adj_fe
## Min. :10.60
## 1st Qu.:22.05
## Median :27.85
## Mean :28.44
## 3rd Qu.:33.62
## Max. :52.80
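mystats <- function(x, na.omit=FALSE){
  # Optionally drop missing values before computing anything
  if (na.omit)
    x <- x[!is.na(x)]
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  # Skewness and excess kurtosis from the standardized 3rd and 4th moments
  skew <- sum((x-m)^3/s^3)/n
  kurt <- sum((x-m)^4/s^4)/n - 3
  return(c(n=n, mean=m, stdev=s,
           skew=skew, kurtosis=kurt))
}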
sapply(cars[myvars], mystats)
## rated_horsepower x_of_cylinders_and_rotors equivalent_test_weight_lbs
## n 190.000000 190.000000 190.0000000
## mean 390.147368 6.000000 4774.3421053
## stdev 189.019132 2.263116 710.3609365
## skew 2.451725 1.569278 0.3414813
## kurtosis 10.659782 3.478291 -0.5282861
## n_v_ratio thc_g_mi co_g_mi co2_g_mi n_ox_g_mi
## n 190.0000000 190.00000000 190.0000000 190.000000 1.900000e+02
## mean 25.3863158 0.01697974 0.3364225 342.244266 9.876814e-03
## stdev 4.0816562 0.02756161 0.3156472 111.176238 1.036329e-02
## skew 0.7916713 5.17028808 2.3156956 1.080636 2.324657e+00
## kurtosis 2.3938194 32.25017599 8.4376680 1.449117 7.046641e+00
## pm_g_mi ch4_g_mi n2o_g_mi rnd_adj_fe
## n 1.900000e+02 1.900000e+02 1.900000e+02 190.0000000
## mean 7.873842e-04 4.786592e-03 1.261592e-03 28.4389474
## stdev 8.151935e-04 5.683705e-03 4.097011e-03 8.3976098
## skew 2.256878e+00 2.379861e+00 7.241609e+00 0.3700589
## kurtosis 6.959706e+00 7.033626e+00 5.452070e+01 -0.2816429
library(Hmisc)  # provides describe()
describe(cars[myvars])
## cars[myvars]
##
## 12 Variables 190 Observations
## --------------------------------------------------------------------------------
## rated_horsepower
## n missing distinct Info Mean Gmd .05 .10
## 190 0 86 1 390.1 186.7 201.0 221.0
## .25 .50 .75 .90 .95
## 255.0 347.0 473.0 617.0 653.8
##
## lowest : 137 158 177 181 182, highest: 711 785 814 823 1500
## --------------------------------------------------------------------------------
## x_of_cylinders_and_rotors
## n missing distinct Info Mean Gmd
## 190 0 5 0.89 6 2.263
##
## Value 4 6 8 12 16
## Frequency 77 60 43 8 2
## Proportion 0.405 0.316 0.226 0.042 0.011
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## equivalent_test_weight_lbs
## n missing distinct Info Mean Gmd .05 .10
## 190 0 13 0.988 4774 804.9 3681 3875
## .25 .50 .75 .90 .95
## 4250 4750 5250 6000 6000
##
## Value 3500 3625 3750 3875 4000 4250 4500 4750 5000 5250 5500
## Frequency 6 4 4 7 14 26 25 27 21 13 22
## Proportion 0.032 0.021 0.021 0.037 0.074 0.137 0.132 0.142 0.111 0.068 0.116
##
## Value 6000 6500
## Frequency 17 4
## Proportion 0.089 0.021
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## n_v_ratio
## n missing distinct Info Mean Gmd .05 .10
## 190 0 85 1 25.39 4.292 20.50 21.39
## .25 .50 .75 .90 .95
## 22.90 24.80 26.90 29.72 34.78
##
## lowest : 9.4 17.4 19.1 19.2 19.5, highest: 36 36.2 36.3 37.4 40.1
## --------------------------------------------------------------------------------
## thc_g_mi
## n missing distinct Info Mean Gmd .05 .10
## 190 0 169 1 0.01698 0.02003 0.0000000 0.0004983
## .25 .50 .75 .90 .95
## 0.0032810 0.0105412 0.0217730 0.0356702 0.0488536
##
## lowest : 0 1e-04 0.0001026 0.00016 0.0001609
## highest: 0.055681 0.058823 0.196682 0.210208 0.210638
## --------------------------------------------------------------------------------
## co_g_mi
## n missing distinct Info Mean Gmd .05 .10
## 190 0 187 1 0.3364 0.3102 0.01655 0.03823
## .25 .50 .75 .90 .95
## 0.12928 0.26111 0.45107 0.67951 0.90677
##
## lowest : 0 0.0001777 0.00182 0.00194 0.00448
## highest: 1.21496 1.30475 1.43762 1.75749 2.22792
## --------------------------------------------------------------------------------
## co2_g_mi
## n missing distinct Info Mean Gmd .05 .10
## 190 0 187 1 342.2 121.1 204.7 217.8
## .25 .50 .75 .90 .95
## 264.5 317.6 401.4 501.8 553.9
##
## lowest : 167.002 169.89 188.146 191.136 191.19
## highest: 597 603 610.698 660 839
## --------------------------------------------------------------------------------
## n_ox_g_mi
## n missing distinct Info Mean Gmd .05 .10
## 190 0 164 1 0.009877 0.009903 0.001000 0.001208
## .25 .50 .75 .90 .95
## 0.003100 0.006486 0.013050 0.020652 0.031683
##
## lowest : 0 1e-04 6e-04 0.0007576 0.0008047
## highest: 0.0402585 0.044 0.0458 0.057348 0.066365
## --------------------------------------------------------------------------------
## pm_g_mi
## n missing distinct Info Mean Gmd .05 .10
## 190 0 135 0.999 0.0007874 0.0007907 0.0000000 0.0000983
## .25 .50 .75 .90 .95
## 0.0002347 0.0006195 0.0010000 0.0016596 0.0024033
##
## lowest : 0 9e-06 1.2e-05 1.72e-05 6.7e-05
## highest: 0.003 0.003075 0.003874 0.0044 0.005278
## --------------------------------------------------------------------------------
## ch4_g_mi
## n missing distinct Info Mean Gmd .05 .10
## 190 0 141 0.999 0.004787 0.005317 0.000000 0.000100
## .25 .50 .75 .90 .95
## 0.001483 0.002885 0.006052 0.011064 0.017661
##
## lowest : 0 5e-05 1e-04 2e-04 0.0002978
## highest: 0.0207584 0.021926 0.023265 0.031893 0.0352363
## --------------------------------------------------------------------------------
## n2o_g_mi
## n missing distinct Info Mean Gmd .05 .10
## 190 0 119 0.987 0.001262 0.001865 0.000e+00 0.000e+00
## .25 .50 .75 .90 .95
## 5.578e-05 5.133e-04 1.038e-03 1.703e-03 3.337e-03
##
## lowest : 0 1.32e-05 1.4e-05 5e-05 7.31e-05
## highest: 0.005925 0.0064 0.026489 0.0334921 0.03679
## --------------------------------------------------------------------------------
## rnd_adj_fe
## n missing distinct Info Mean Gmd .05 .10
## 190 0 142 1 28.44 9.529 16.04 17.67
## .25 .50 .75 .90 .95
## 22.05 27.85 33.62 40.51 43.06
##
## lowest : 10.6 13.4 14.4 14.7 14.9, highest: 45.6 46.3 47.4 52.1 52.8
## --------------------------------------------------------------------------------
I generated a variety of descriptive statistics to better understand the data. I selected all numeric or integer variables I would potentially be analyzing and generated their mean, length, standard deviation, skew, and kurtosis. I also generated each variable’s five-number summary and used the describe function to produce more information on the given variables.
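As a quick sanity check on the skew and kurtosis formulas, the sketch below applies the same formulas used in `mystats` to a small symmetric vector. The helper name `moment_stats` and the toy vector are hypothetical and exist only for this illustration.

```r
# Same skew / excess-kurtosis formulas as mystats, applied to a toy vector
moment_stats <- function(x) {
  m <- mean(x)
  s <- sd(x)          # sample standard deviation (n - 1 denominator)
  n <- length(x)
  c(skew     = sum((x - m)^3 / s^3) / n,
    kurtosis = sum((x - m)^4 / s^4) / n - 3)
}

x <- c(1, 2, 3, 4, 5)   # symmetric around 3
res <- moment_stats(x)
print(round(res, 3))    # skew is 0; kurtosis is -1.912
```

A symmetric sample gives a skew of zero, and a flat sample like this one gives a negative excess kurtosis, meaning lighter tails than a normal distribution.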
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 28.43895
## [1] 8.39761
I created several histograms to continue observing the distribution of the data. The primary variables I focused on were “co2_g_mi”, which is CO2 in grams per mile, and “rnd_adj_fe”, which represents an overall fuel efficiency score for the vehicle. The CO2 histogram is right skewed: most of the count is concentrated on the left, in the 200-400 grams/mile range, with the tail trailing off to the right. The mean for CO2 in this subset of the data is 342.2 and the median is 317.6, which makes sense since in a right-skewed distribution the mean is typically greater than the median. It also had a skew of 1.08 and a kurtosis of 1.45. Because the kurtosis is positive, this sample has heavier tails than a normal distribution, which means the CO2 variable contains more extreme values (outliers) than a normal distribution would produce.
I then looked at a quick histogram of the fuel efficiency score to get a read on its distribution. It had a skew of 0.37 and a kurtosis of -0.28. Both values are close to zero, which tells us it is much closer to a normal distribution.
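To illustrate the mean-versus-median reasoning, the sketch below rebuilds an approximate CO2 sample from the row totals of the binned CO2 table in this report, placing every vehicle at the midpoint of its 50 g/mi bin. The midpoint placement is an assumption made only for illustration, so the results approximate rather than reproduce the statistics reported for the real data.

```r
# Row totals per 50 g/mi bin, (150,200] through (800,850]
counts <- c(9, 25, 49, 33, 26, 14, 14, 10, 6, 2, 1, 0, 0, 1)
breaks <- seq(150, 850, by = 50)
mids   <- head(breaks, -1) + 25      # bin midpoints: 175, 225, ..., 825

# Approximate sample: each vehicle placed at its bin midpoint (assumption)
co2_approx <- rep(mids, times = counts)

# Base-R histogram with explicit breaks; plot = FALSE returns the counts only
h <- hist(co2_approx, breaks = breaks, plot = FALSE)
print(h$counts)

# In a right-skewed sample the long right tail pulls the mean above the median
print(mean(co2_approx))
print(median(co2_approx))
```

Passing explicit breaks (or, in ggplot2, a `binwidth`) also avoids relying on the default of 30 bins mentioned in the `stat_bin()` message.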
##            Both  Car  Truck  Total
## (150,200]     7    1      1      9
## (200,250]    14    6      5     25
## (250,300]    16   20     13     49
## (300,350]    14    8     11     33
## (350,400]     8    4     14     26
## (400,450]     5    5      4     14
## (450,500]     4    4      6     14
## (500,550]     1    3      6     10
## (550,600]     1    3      2      6
## (600,650]     0    1      1      2
## (650,700]     1    0      0      1
## (800,850]     0    1      0      1
## Total        71   56     63    190
Here I created two pivot tables to further explore CO2 and fuel efficiency among the different vehicle types. Since there were too many data points to capture in a condensed pivot table, I grouped the CO2 (g/mi) values into bins of 50 from 150 to 850. The pivot table shows a trend: trucks are more likely to fall in the higher CO2 g/mi bins than “Car” or “Both”.
Similar to the first pivot table, I created groupings for the variable “rnd_adj_fe”, the fuel efficiency score, so that it could be captured in a pivot table. Here I compared mean vehicle weight among the different vehicle types for each FE score group. The table shows that the lighter the vehicle, the better its FE score.
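The binning step described above can be sketched in base R with `cut()` and `table()`. The `toy` data frame and its values are hypothetical stand-ins for the real data; only the column names follow the report.

```r
# Toy data (hypothetical values) standing in for the cars data frame
toy <- data.frame(
  co2_g_mi = c(167, 240, 275, 290, 360, 395, 520, 839),
  vehicle_type = factor(c("Both", "Car", "Car", "Both",
                          "Truck", "Truck", "Truck", "Car"))
)

# Bin CO2 into intervals of width 50 covering 150 to 850
toy$co2_g_mi_groups <- cut(toy$co2_g_mi, breaks = seq(150, 850, by = 50))

# Cross-tabulate bins by vehicle type and add row/column totals ("Sum")
tab <- table(toy$co2_g_mi_groups, toy$vehicle_type)
pivot <- addmargins(tab)
print(pivot)
```

By default `cut()` builds intervals that are open on the left and closed on the right, which is why the labels read `(150,200]`, `(200,250]`, and so on.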
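One way to compute mean weight by FE score group and vehicle type is `tapply()`, sketched here on hypothetical toy values. The grouping width of 5 and `include.lowest = TRUE` are assumptions chosen to match the `[0,5]`, `(5,10]`, ... labels that appear in the output of this report.

```r
# Toy data (hypothetical values); column names follow the report
toy <- data.frame(
  rnd_adj_fe = c(18.5, 19.0, 27.3, 28.9, 36.1, 41.0),
  equivalent_test_weight_lbs = c(6000, 5500, 4750, 4500, 4000, 3500),
  vehicle_type = factor(c("Truck", "Truck", "Both", "Car", "Car", "Car"))
)

# Group the FE score into bins of width 5
toy$rnd_adj_fe_groups <- cut(toy$rnd_adj_fe,
                             breaks = seq(0, 55, by = 5),
                             include.lowest = TRUE)

# Mean weight for each FE group / vehicle type cell (NA where a cell is empty)
mean_weight <- tapply(toy$equivalent_test_weight_lbs,
                      list(toy$rnd_adj_fe_groups, toy$vehicle_type),
                      mean)
print(mean_weight)
```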
## Both Car Truck
##
## 3500 1 5 0
## 3625 0 4 0
## 3750 0 4 0
## 3875 2 5 0
## 4000 5 7 2
## 4250 18 5 3
## 4500 14 3 8
## 4750 10 8 9
## 5000 11 4 6
## 5250 4 4 5
## 5500 3 3 16
## 6000 3 4 10
## 6500 0 0 4
## represented_test_veh_make represented_test_veh_model vehicle_type
## Length:155 Length:155 Both :63
## Class :character Class :character Car :31
## Mode :character Mode :character Truck:61
##
##
##
##
## rated_horsepower x_of_cylinders_and_rotors equivalent_test_weight_lbs
## Min. : 177.0 Min. : 4.000 Min. :4250
## 1st Qu.: 261.0 1st Qu.: 4.000 1st Qu.:4500
## Median : 355.0 Median : 6.000 Median :4750
## Mean : 397.8 Mean : 6.168 Mean :4990
## 3rd Qu.: 473.0 3rd Qu.: 8.000 3rd Qu.:5500
## Max. :1500.0 Max. :16.000 Max. :6500
##
## n_v_ratio test_fuel_type_description
## Min. :17.40 Cold CO Premium (Tier 2) : 5
## 1st Qu.:22.55 Cold CO Regular (Tier 2) : 0
## Median :24.70 Electricity : 0
## Mean :24.96 Federal Cert Diesel 7-15 PPM Sulfur : 3
## 3rd Qu.:26.50 Hydrogen 5 : 0
## Max. :37.40 Tier 2 Cert Gasoline :147
## Tier 3 E10 Regular Gasoline (9 RVP @Low Alt.): 0
## thc_g_mi co_g_mi co2_g_mi n_ox_g_mi
## Min. :0.000000 Min. :0.0001777 Min. :167.0 Min. :0.000000
## 1st Qu.:0.003359 1st Qu.:0.1413746 1st Qu.:268.6 1st Qu.:0.003065
## Median :0.011300 Median :0.2624000 Median :324.2 Median :0.006200
## Mean :0.017507 Mean :0.3488690 Mean :349.3 Mean :0.009989
## 3rd Qu.:0.021524 3rd Qu.:0.4468126 3rd Qu.:406.5 3rd Qu.:0.013298
## Max. :0.210638 Max. :2.2279231 Max. :839.0 Max. :0.066365
##
## pm_g_mi ch4_g_mi n2o_g_mi rnd_adj_fe
## Min. :0.0000000 Min. :0.000000 Min. :0.000e+00 Min. :10.60
## 1st Qu.:0.0002453 1st Qu.:0.001419 1st Qu.:4.355e-05 1st Qu.:21.75
## Median :0.0006460 Median :0.002700 Median :5.164e-04 Median :27.40
## Mean :0.0008327 Mean :0.004831 Mean :1.335e-03 Mean :27.84
## 3rd Qu.:0.0010213 3rd Qu.:0.006572 3rd Qu.:1.040e-03 3rd Qu.:32.95
## Max. :0.0052780 Max. :0.035236 Max. :3.679e-02 Max. :52.80
##
## co2_g_mi_groups rnd_adj_fe_groups
## (250,300]:32 (25,30]:37
## (300,350]:29 (20,25]:34
## (350,400]:25 (30,35]:25
## (200,250]:22 (15,20]:23
## (400,450]:13 (40,45]:14
## (450,500]:13 (35,40]:13
## (Other) :21 (Other): 9
## vehicle_type
## rnd_adj_fe_groups Both Car Truck
## [0,5] 0 0 0
## (5,10] 0 0 0
## (10,15] 1 2 3
## (15,20] 6 5 12
## (20,25] 12 7 15
## (25,30] 17 6 14
## (30,35] 9 7 9
## (35,40] 6 1 6
## (40,45] 10 3 1
## (45,50] 2 0 0
## (50,55] 0 0 1
## vehicle_type
## Both Car Truck
## 63 31 61
## vehicle_type
## rnd_adj_fe_groups Both Car Truck
## [0,5] 0.00000000 0.00000000 0.00000000
## (5,10] 0.00000000 0.00000000 0.00000000
## (10,15] 0.01587302 0.06451613 0.04918033
## (15,20] 0.09523810 0.16129032 0.19672131
## (20,25] 0.19047619 0.22580645 0.24590164
## (25,30] 0.26984127 0.19354839 0.22950820
## (30,35] 0.14285714 0.22580645 0.14754098
## (35,40] 0.09523810 0.03225806 0.09836066
## (40,45] 0.15873016 0.09677419 0.01639344
## (45,50] 0.03174603 0.00000000 0.00000000
## (50,55] 0.00000000 0.00000000 0.01639344
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 190
##
##
## | cars$vehicle_type
## cars$co2_g_mi_groups | Both | Car | Truck | Row Total |
## ---------------------|-----------|-----------|-----------|-----------|
## (150,200] | 7 | 1 | 1 | 9 |
## | 0.778 | 0.111 | 0.111 | 0.047 |
## | 0.099 | 0.018 | 0.016 | |
## | 0.037 | 0.005 | 0.005 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (200,250] | 14 | 6 | 5 | 25 |
## | 0.560 | 0.240 | 0.200 | 0.132 |
## | 0.197 | 0.107 | 0.079 | |
## | 0.074 | 0.032 | 0.026 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (250,300] | 16 | 20 | 13 | 49 |
## | 0.327 | 0.408 | 0.265 | 0.258 |
## | 0.225 | 0.357 | 0.206 | |
## | 0.084 | 0.105 | 0.068 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (300,350] | 14 | 8 | 11 | 33 |
## | 0.424 | 0.242 | 0.333 | 0.174 |
## | 0.197 | 0.143 | 0.175 | |
## | 0.074 | 0.042 | 0.058 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (350,400] | 8 | 4 | 14 | 26 |
## | 0.308 | 0.154 | 0.538 | 0.137 |
## | 0.113 | 0.071 | 0.222 | |
## | 0.042 | 0.021 | 0.074 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (400,450] | 5 | 5 | 4 | 14 |
## | 0.357 | 0.357 | 0.286 | 0.074 |
## | 0.070 | 0.089 | 0.063 | |
## | 0.026 | 0.026 | 0.021 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (450,500] | 4 | 4 | 6 | 14 |
## | 0.286 | 0.286 | 0.429 | 0.074 |
## | 0.056 | 0.071 | 0.095 | |
## | 0.021 | 0.021 | 0.032 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (500,550] | 1 | 3 | 6 | 10 |
## | 0.100 | 0.300 | 0.600 | 0.053 |
## | 0.014 | 0.054 | 0.095 | |
## | 0.005 | 0.016 | 0.032 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (550,600] | 1 | 3 | 2 | 6 |
## | 0.167 | 0.500 | 0.333 | 0.032 |
## | 0.014 | 0.054 | 0.032 | |
## | 0.005 | 0.016 | 0.011 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (600,650] | 0 | 1 | 1 | 2 |
## | 0.000 | 0.500 | 0.500 | 0.011 |
## | 0.000 | 0.018 | 0.016 | |
## | 0.000 | 0.005 | 0.005 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (650,700] | 1 | 0 | 0 | 1 |
## | 1.000 | 0.000 | 0.000 | 0.005 |
## | 0.014 | 0.000 | 0.000 | |
## | 0.005 | 0.000 | 0.000 | |
## ---------------------|-----------|-----------|-----------|-----------|
## (800,850] | 0 | 1 | 0 | 1 |
## | 0.000 | 1.000 | 0.000 | 0.005 |
## | 0.000 | 0.018 | 0.000 | |
## | 0.000 | 0.005 | 0.000 | |
## ---------------------|-----------|-----------|-----------|-----------|
## Column Total | 71 | 56 | 63 | 190 |
## | 0.374 | 0.295 | 0.332 | |
## ---------------------|-----------|-----------|-----------|-----------|
##
##
Continuing from the previous pivot table, I created a flat contingency table giving the count of each vehicle type by weight. It illustrates the trend of trucks being heavier vehicles. It also shows that most of the vehicles were 4250 lbs or greater, which fits with the smaller number of vehicles in the highest FE score groups, as expected if the FE score is roughly normally distributed. To compare these two tables more closely, I generated a proportions table. I first wanted to focus on where the majority of the data was, so I kept the vehicles heavier than 4000 lbs (155 of the 190, all 4250 lbs or more). In the proportions table we can again see a trend of “Car” and “Both” being slightly higher than “Truck” for fuel efficiency score.
Lastly, I generated a cross table to compare the ratios of vehicle type and CO2 emissions in grams/mile. Focusing on the bins with the greatest number of observations, “Car” and “Both” have a larger share of the table total at 200-250 and 250-300. For 300-350 the shares are fairly even, but “Both” has the highest at 0.074, followed by trucks at 0.058. For 350-400, trucks have more than three times the share of cars (0.074 versus 0.021).
Comparing results across these tables suggests that heavier vehicles tend to produce more CO2 per mile, which goes along with a lower FE score. Trucks tend to be heavier than cars, so it makes sense that they also trend toward lower FE scores and higher CO2 emissions.
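The proportions table can be rebuilt from the vehicle-type counts by FE group printed earlier in this report. The sketch below enters those counts by hand (empty `[0,5]` and `(5,10]` rows omitted) and applies `prop.table()` with `margin = 2`, which is an assumption that matches the printed output, since each vehicle-type column there sums to 1.

```r
fe_groups <- c("(10,15]", "(15,20]", "(20,25]", "(25,30]", "(30,35]",
               "(35,40]", "(40,45]", "(45,50]", "(50,55]")

# Counts copied from the FE group by vehicle type table; filled column by column
counts <- matrix(
  c(1,  6, 12, 17, 9, 6, 10, 2, 0,    # Both  (63 vehicles)
    2,  5,  7,  6, 7, 1,  3, 0, 0,    # Car   (31 vehicles)
    3, 12, 15, 14, 9, 6,  1, 0, 1),   # Truck (61 vehicles)
  ncol = 3,
  dimnames = list(rnd_adj_fe_groups = fe_groups,
                  vehicle_type = c("Both", "Car", "Truck"))
)

# Column proportions: each cell divided by its vehicle-type total
col_props <- prop.table(counts, margin = 2)
print(round(col_props, 3))
```

Reading down a column then shows how each vehicle type is distributed across FE score groups, independent of how many vehicles of that type are in the sample.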
References:
When prompted with “How to create bins in increments of 50 for a variable in R”, the ChatGPT-generated text indicated: “This code will create a new column named ‘co2_bins’ in your ‘cars’ data frame, with values representing the bins. The seq(0, max(cars$co2_g_mi) + 50, by = 50) generates breaks starting from 0 up to the maximum value of ‘co2_g_mi’ plus 50, with increments of 50” (OpenAI, 2024).
OpenAI. (2024). ChatGPT (Feb 25 version) [Large language model]. https://chat.openai.com/
Kabacoff, R. I. (2015). R in Action (2nd ed.). Manning Publications.