Cleaning the data

In the data cleaning process I cleaned the column names so they are all lower case with a consistent syntax, and made sure the result was saved to “cars” as a data frame. I then converted “test_fuel_type_description” and “vehicle_type” to factors and left the rest of the variables in their original form, since they already appeared suitable to work with. Next, I selected the variables that I believe are most important for analyzing fuel economy. I removed electric vehicles from the fuel type column because they don’t emit compounds the way gas vehicles do. Lastly, I removed any remaining rows containing “NA” so that the analysis focuses on the 190 observations with complete information in every row.
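The cleaning steps described above could look roughly like the following. This is a minimal base R sketch on a tiny made-up data frame, not the code actually used for this report: the raw column names, the toy values, and the `raw` and `cars_demo` objects are assumptions for illustration only. A helper such as janitor::clean_names() would be a common alternative for the renaming step.

```r
# Hypothetical raw data standing in for the real file (values are made up)
raw <- data.frame(
  "Vehicle Type" = c("Car", "Truck", "Car", "Both"),
  "Test Fuel Type Description" = c("Tier 2 Cert Gasoline", "Electricity",
                                   "Tier 2 Cert Gasoline", "Tier 2 Cert Gasoline"),
  "CO2 (g/mi)" = c(300, 0, NA, 410),
  "RND_ADJ_FE" = c(30.1, 100.2, 28.4, 21.7),
  check.names = FALSE
)

# 1. Lower-case the names and replace runs of non-alphanumerics with "_"
clean <- tolower(names(raw))
clean <- gsub("[^a-z0-9]+", "_", clean)
clean <- gsub("^_|_$", "", clean)
cars_demo <- as.data.frame(raw)
names(cars_demo) <- clean

# 2. Convert the two categorical columns to factors
cars_demo$vehicle_type <- factor(cars_demo$vehicle_type)
cars_demo$test_fuel_type_description <- factor(cars_demo$test_fuel_type_description)

# 3. Keep only the variables of interest
keep <- c("vehicle_type", "test_fuel_type_description", "co2_g_mi", "rnd_adj_fe")
cars_demo <- cars_demo[keep]

# 4. Drop electric vehicles, then any rows that still contain NA
cars_demo <- cars_demo[cars_demo$test_fuel_type_description != "Electricity", ]
cars_demo <- na.omit(cars_demo)
```

Removing the electric rows before calling na.omit() keeps the two filters separate, so it is easy to see how many rows each step drops.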

# Descriptive stats

myvars <- c("rated_horsepower", "x_of_cylinders_and_rotors", "equivalent_test_weight_lbs", "n_v_ratio", "thc_g_mi", "co_g_mi", "co2_g_mi", "n_ox_g_mi", "pm_g_mi", "ch4_g_mi", "n2o_g_mi", "rnd_adj_fe")
summary(cars[myvars])
##  rated_horsepower x_of_cylinders_and_rotors equivalent_test_weight_lbs
##  Min.   : 137.0   Min.   : 4                Min.   :3500              
##  1st Qu.: 255.0   1st Qu.: 4                1st Qu.:4250              
##  Median : 347.0   Median : 6                Median :4750              
##  Mean   : 390.1   Mean   : 6                Mean   :4774              
##  3rd Qu.: 473.0   3rd Qu.: 8                3rd Qu.:5250              
##  Max.   :1500.0   Max.   :16                Max.   :6500              
##    n_v_ratio        thc_g_mi           co_g_mi          co2_g_mi    
##  Min.   : 9.40   Min.   :0.000000   Min.   :0.0000   Min.   :167.0  
##  1st Qu.:22.90   1st Qu.:0.003281   1st Qu.:0.1293   1st Qu.:264.5  
##  Median :24.80   Median :0.010541   Median :0.2611   Median :317.6  
##  Mean   :25.39   Mean   :0.016980   Mean   :0.3364   Mean   :342.2  
##  3rd Qu.:26.90   3rd Qu.:0.021773   3rd Qu.:0.4511   3rd Qu.:401.4  
##  Max.   :40.10   Max.   :0.210638   Max.   :2.2279   Max.   :839.0  
##    n_ox_g_mi           pm_g_mi             ch4_g_mi           n2o_g_mi        
##  Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000   Min.   :0.000e+00  
##  1st Qu.:0.003100   1st Qu.:0.0002347   1st Qu.:0.001483   1st Qu.:5.578e-05  
##  Median :0.006486   Median :0.0006195   Median :0.002885   Median :5.133e-04  
##  Mean   :0.009877   Mean   :0.0007874   Mean   :0.004787   Mean   :1.262e-03  
##  3rd Qu.:0.013050   3rd Qu.:0.0010000   3rd Qu.:0.006052   3rd Qu.:1.038e-03  
##  Max.   :0.066365   Max.   :0.0052780   Max.   :0.035236   Max.   :3.679e-02  
##    rnd_adj_fe   
##  Min.   :10.60  
##  1st Qu.:22.05  
##  Median :27.85  
##  Mean   :28.44  
##  3rd Qu.:33.62  
##  Max.   :52.80
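mystats <- function(x, na.omit=FALSE){
  # Optionally drop missing values before computing the statistics
  if (na.omit)
    x <- x[!is.na(x)]
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  # Skewness and excess kurtosis (both are 0 for a normal distribution)
  skew <- sum((x-m)^3/s^3)/n
  kurt <- sum((x-m)^4/s^4)/n - 3
  return(c(n=n, mean=m, stdev=s, 
           skew=skew, kurtosis=kurt))
}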
sapply(cars[myvars], mystats)
##          rated_horsepower x_of_cylinders_and_rotors equivalent_test_weight_lbs
## n              190.000000                190.000000                190.0000000
## mean           390.147368                  6.000000               4774.3421053
## stdev          189.019132                  2.263116                710.3609365
## skew             2.451725                  1.569278                  0.3414813
## kurtosis        10.659782                  3.478291                 -0.5282861
##            n_v_ratio     thc_g_mi     co_g_mi   co2_g_mi    n_ox_g_mi
## n        190.0000000 190.00000000 190.0000000 190.000000 1.900000e+02
## mean      25.3863158   0.01697974   0.3364225 342.244266 9.876814e-03
## stdev      4.0816562   0.02756161   0.3156472 111.176238 1.036329e-02
## skew       0.7916713   5.17028808   2.3156956   1.080636 2.324657e+00
## kurtosis   2.3938194  32.25017599   8.4376680   1.449117 7.046641e+00
##               pm_g_mi     ch4_g_mi     n2o_g_mi  rnd_adj_fe
## n        1.900000e+02 1.900000e+02 1.900000e+02 190.0000000
## mean     7.873842e-04 4.786592e-03 1.261592e-03  28.4389474
## stdev    8.151935e-04 5.683705e-03 4.097011e-03   8.3976098
## skew     2.256878e+00 2.379861e+00 7.241609e+00   0.3700589
## kurtosis 6.959706e+00 7.033626e+00 5.452070e+01  -0.2816429
describe(cars[myvars])
## cars[myvars] 
## 
##  12  Variables      190  Observations
## --------------------------------------------------------------------------------
## rated_horsepower 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      190        0       86        1    390.1    186.7    201.0    221.0 
##      .25      .50      .75      .90      .95 
##    255.0    347.0    473.0    617.0    653.8 
## 
## lowest :  137  158  177  181  182, highest:  711  785  814  823 1500
## --------------------------------------------------------------------------------
## x_of_cylinders_and_rotors 
##        n  missing distinct     Info     Mean      Gmd 
##      190        0        5     0.89        6    2.263 
##                                         
## Value          4     6     8    12    16
## Frequency     77    60    43     8     2
## Proportion 0.405 0.316 0.226 0.042 0.011
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## equivalent_test_weight_lbs 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      190        0       13    0.988     4774    804.9     3681     3875 
##      .25      .50      .75      .90      .95 
##     4250     4750     5250     6000     6000 
##                                                                             
## Value       3500  3625  3750  3875  4000  4250  4500  4750  5000  5250  5500
## Frequency      6     4     4     7    14    26    25    27    21    13    22
## Proportion 0.032 0.021 0.021 0.037 0.074 0.137 0.132 0.142 0.111 0.068 0.116
##                       
## Value       6000  6500
## Frequency     17     4
## Proportion 0.089 0.021
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## n_v_ratio 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      190        0       85        1    25.39    4.292    20.50    21.39 
##      .25      .50      .75      .90      .95 
##    22.90    24.80    26.90    29.72    34.78 
## 
## lowest : 9.4  17.4 19.1 19.2 19.5, highest: 36   36.2 36.3 37.4 40.1
## --------------------------------------------------------------------------------
## thc_g_mi 
##         n   missing  distinct      Info      Mean       Gmd       .05       .10 
##       190         0       169         1   0.01698   0.02003 0.0000000 0.0004983 
##       .25       .50       .75       .90       .95 
## 0.0032810 0.0105412 0.0217730 0.0356702 0.0488536 
## 
## lowest : 0         1e-04     0.0001026 0.00016   0.0001609
## highest: 0.055681  0.058823  0.196682  0.210208  0.210638 
## --------------------------------------------------------------------------------
## co_g_mi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      190        0      187        1   0.3364   0.3102  0.01655  0.03823 
##      .25      .50      .75      .90      .95 
##  0.12928  0.26111  0.45107  0.67951  0.90677 
## 
## lowest : 0         0.0001777 0.00182   0.00194   0.00448  
## highest: 1.21496   1.30475   1.43762   1.75749   2.22792  
## --------------------------------------------------------------------------------
## co2_g_mi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      190        0      187        1    342.2    121.1    204.7    217.8 
##      .25      .50      .75      .90      .95 
##    264.5    317.6    401.4    501.8    553.9 
## 
## lowest : 167.002 169.89  188.146 191.136 191.19 
## highest: 597     603     610.698 660     839    
## --------------------------------------------------------------------------------
## n_ox_g_mi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      190        0      164        1 0.009877 0.009903 0.001000 0.001208 
##      .25      .50      .75      .90      .95 
## 0.003100 0.006486 0.013050 0.020652 0.031683 
## 
## lowest : 0         1e-04     6e-04     0.0007576 0.0008047
## highest: 0.0402585 0.044     0.0458    0.057348  0.066365 
## --------------------------------------------------------------------------------
## pm_g_mi 
##         n   missing  distinct      Info      Mean       Gmd       .05       .10 
##       190         0       135     0.999 0.0007874 0.0007907 0.0000000 0.0000983 
##       .25       .50       .75       .90       .95 
## 0.0002347 0.0006195 0.0010000 0.0016596 0.0024033 
## 
## lowest : 0        9e-06    1.2e-05  1.72e-05 6.7e-05 
## highest: 0.003    0.003075 0.003874 0.0044   0.005278
## --------------------------------------------------------------------------------
## ch4_g_mi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      190        0      141    0.999 0.004787 0.005317 0.000000 0.000100 
##      .25      .50      .75      .90      .95 
## 0.001483 0.002885 0.006052 0.011064 0.017661 
## 
## lowest : 0         5e-05     1e-04     2e-04     0.0002978
## highest: 0.0207584 0.021926  0.023265  0.031893  0.0352363
## --------------------------------------------------------------------------------
## n2o_g_mi 
##         n   missing  distinct      Info      Mean       Gmd       .05       .10 
##       190         0       119     0.987  0.001262  0.001865 0.000e+00 0.000e+00 
##       .25       .50       .75       .90       .95 
## 5.578e-05 5.133e-04 1.038e-03 1.703e-03 3.337e-03 
## 
## lowest : 0         1.32e-05  1.4e-05   5e-05     7.31e-05 
## highest: 0.005925  0.0064    0.026489  0.0334921 0.03679  
## --------------------------------------------------------------------------------
## rnd_adj_fe 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      190        0      142        1    28.44    9.529    16.04    17.67 
##      .25      .50      .75      .90      .95 
##    22.05    27.85    33.62    40.51    43.06 
## 
## lowest : 10.6 13.4 14.4 14.7 14.9, highest: 45.6 46.3 47.4 52.1 52.8
## --------------------------------------------------------------------------------

Descriptive Statistics

I generated a variety of descriptive statistics to better understand the data. I selected all of the numeric and integer variables I would potentially be analyzing and computed the number of observations, mean, standard deviation, skew, and kurtosis for each. I also generated each variable’s five-number summary and used the describe function to produce more detail on each variable.
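To make the skew and kurtosis figures easier to interpret, here is a small sketch that applies the same moment formulas as mystats to two made-up vectors. The `skew_kurt` helper name and the toy data are assumptions for illustration. The kurtosis here is excess kurtosis, so values near 0 indicate tails similar to a normal distribution.

```r
# Hypothetical helper mirroring the skew/kurtosis lines of mystats
skew_kurt <- function(x) {
  m <- mean(x)
  s <- sd(x)          # sample standard deviation (n - 1 denominator)
  n <- length(x)
  c(skew = sum((x - m)^3 / s^3) / n,
    kurtosis = sum((x - m)^4 / s^4) / n - 3)
}

symmetric <- c(1, 2, 3, 4, 5)        # made-up, symmetric around 3
right_tail <- c(1, 1, 2, 2, 3, 20)   # made-up, one large value on the right

skew_kurt(symmetric)    # skew is essentially 0 for symmetric data
skew_kurt(right_tail)   # skew is positive when the long tail is on the right

# summary() gives the five-number summary plus the mean
summary(right_tail)
```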

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## [1] 28.43895
## [1] 8.39761

Histograms

I created several histograms to continue examining the distribution of the data. The primary variables I focused on were “co2_g_mi”, which is CO2 in grams per mile, and “rnd_adj_fe”, which represents an overall fuel efficiency score for the vehicle. The CO2 histogram is right skewed: most of the count is concentrated on the left in the 200-400 grams/mile range, with the tail trailing off to the right. The mean for CO2 in this subset of the data is 342.2 and the median is 317.6, which makes sense since in a right-skewed distribution the mean is greater than the median. It also has a skew of 1.08 and a kurtosis of 1.45. Because the kurtosis is measured relative to a normal distribution, a positive value tells us this sample has heavier tails and a higher peak than a normal distribution, which means there are more outliers in the CO2 variable.

I then looked at a quick histogram of the fuel efficiency score to get a read on its distribution. It has a skew of 0.37 and a kurtosis of -0.28. Because both values are close to zero, this variable is much closer to a normal distribution.
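The mean-versus-median check and an explicit bin width can be sketched as follows. The vector `co2_demo` is made-up data, not the report's values, and base R `hist()` is used only to keep the example self-contained. With ggplot2, passing `binwidth = 50` to `geom_histogram()` plays the same role and avoids the default of 30 bins.

```r
# Made-up right-skewed values standing in for co2_g_mi
co2_demo <- c(210, 230, 255, 260, 270, 285, 290, 310, 320, 340, 380, 450, 620)

# In a right-skewed sample the mean is pulled above the median by the long tail
mean(co2_demo)
median(co2_demo)

# Histogram with bins of width 50 instead of the default bin choice
breaks <- seq(200, 650, by = 50)
h <- hist(co2_demo, breaks = breaks, plot = FALSE)  # set plot = TRUE to draw it
data.frame(bin_start = head(h$breaks, -1), count = h$counts)
```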

##            Both  Car  Truck  Total  
## (150,200]     7    1      1      9  
## (200,250]    14    6      5     25  
## (250,300]    16   20     13     49  
## (300,350]    14    8     11     33  
## (350,400]     8    4     14     26  
## (400,450]     5    5      4     14  
## (450,500]     4    4      6     14  
## (500,550]     1    3      6     10  
## (550,600]     1    3      2      6  
## (600,650]          1      1      2  
## (650,700]     1                  1  
## (800,850]          1             1  
## Total        71   56     63    190

Pivot Tables

Here I created two pivot tables to further explore CO2 and fuel efficiency among the different vehicle types. Since there were too many distinct values to capture in a condensed pivot table, I grouped the CO2 (g/mi) values into bins of width 50 spanning 150-850. The pivot table shows a trend: trucks are more likely than “Car” or “Both” to fall into the higher CO2 g/mi bins.

As with the first pivot table, I created groupings for the variable “rnd_adj_fe”, the fuel efficiency (FE) score, so that it could be summarized in a pivot table. Here I compared the mean vehicle weight of the different vehicle types within each FE score group. The table shows that lighter vehicles tend to receive better FE scores.
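The binning and the two pivot tables can be sketched in base R as below. The `demo` data frame is made up for illustration, and the report's own tables may have been produced with a different function. Only `cut()`, `table()`, `addmargins()`, and `tapply()` from base R are used here.

```r
# Made-up rows standing in for the cleaned cars data
demo <- data.frame(
  vehicle_type = factor(c("Car", "Car", "Truck", "Truck", "Both", "Both")),
  co2_g_mi = c(240, 280, 390, 520, 260, 310),
  rnd_adj_fe = c(38.2, 33.5, 22.1, 16.4, 34.9, 28.7),
  equivalent_test_weight_lbs = c(3500, 4000, 5500, 6000, 4250, 4500)
)

# Group CO2 into bins of width 50 from 150 to 850: (150,200], (200,250], ...
demo$co2_g_mi_groups <- cut(demo$co2_g_mi, breaks = seq(150, 850, by = 50))

# Group the FE score into bins of width 5: [0,5], (5,10], ...
demo$rnd_adj_fe_groups <- cut(demo$rnd_adj_fe, breaks = seq(0, 55, by = 5),
                              include.lowest = TRUE)

# Pivot table 1: counts of vehicles per CO2 bin and vehicle type, with totals
co2_counts <- addmargins(table(demo$co2_g_mi_groups, demo$vehicle_type))
co2_counts[rowSums(co2_counts) > 0, ]   # hide bins with no vehicles

# Pivot table 2: mean test weight per FE bin and vehicle type
tapply(demo$equivalent_test_weight_lbs,
       list(demo$rnd_adj_fe_groups, demo$vehicle_type), mean)
```

Fixing the breaks with `seq()` rather than letting `cut()` pick them keeps the bin edges identical across tables, which makes the groups directly comparable.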

##       Both Car Truck
##                     
## 3500     1   5     0
## 3625     0   4     0
## 3750     0   4     0
## 3875     2   5     0
## 4000     5   7     2
## 4250    18   5     3
## 4500    14   3     8
## 4750    10   8     9
## 5000    11   4     6
## 5250     4   4     5
## 5500     3   3    16
## 6000     3   4    10
## 6500     0   0     4
##  represented_test_veh_make represented_test_veh_model vehicle_type
##  Length:155                Length:155                 Both :63    
##  Class :character          Class :character           Car  :31    
##  Mode  :character          Mode  :character           Truck:61    
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  rated_horsepower x_of_cylinders_and_rotors equivalent_test_weight_lbs
##  Min.   : 177.0   Min.   : 4.000            Min.   :4250              
##  1st Qu.: 261.0   1st Qu.: 4.000            1st Qu.:4500              
##  Median : 355.0   Median : 6.000            Median :4750              
##  Mean   : 397.8   Mean   : 6.168            Mean   :4990              
##  3rd Qu.: 473.0   3rd Qu.: 8.000            3rd Qu.:5500              
##  Max.   :1500.0   Max.   :16.000            Max.   :6500              
##                                                                       
##    n_v_ratio                                     test_fuel_type_description
##  Min.   :17.40   Cold CO Premium (Tier 2)                     :  5         
##  1st Qu.:22.55   Cold CO Regular (Tier 2)                     :  0         
##  Median :24.70   Electricity                                  :  0         
##  Mean   :24.96   Federal Cert Diesel 7-15 PPM Sulfur          :  3         
##  3rd Qu.:26.50   Hydrogen 5                                   :  0         
##  Max.   :37.40   Tier 2 Cert Gasoline                         :147         
##                  Tier 3 E10 Regular Gasoline (9 RVP @Low Alt.):  0         
##     thc_g_mi           co_g_mi             co2_g_mi       n_ox_g_mi       
##  Min.   :0.000000   Min.   :0.0001777   Min.   :167.0   Min.   :0.000000  
##  1st Qu.:0.003359   1st Qu.:0.1413746   1st Qu.:268.6   1st Qu.:0.003065  
##  Median :0.011300   Median :0.2624000   Median :324.2   Median :0.006200  
##  Mean   :0.017507   Mean   :0.3488690   Mean   :349.3   Mean   :0.009989  
##  3rd Qu.:0.021524   3rd Qu.:0.4468126   3rd Qu.:406.5   3rd Qu.:0.013298  
##  Max.   :0.210638   Max.   :2.2279231   Max.   :839.0   Max.   :0.066365  
##                                                                           
##     pm_g_mi             ch4_g_mi           n2o_g_mi           rnd_adj_fe   
##  Min.   :0.0000000   Min.   :0.000000   Min.   :0.000e+00   Min.   :10.60  
##  1st Qu.:0.0002453   1st Qu.:0.001419   1st Qu.:4.355e-05   1st Qu.:21.75  
##  Median :0.0006460   Median :0.002700   Median :5.164e-04   Median :27.40  
##  Mean   :0.0008327   Mean   :0.004831   Mean   :1.335e-03   Mean   :27.84  
##  3rd Qu.:0.0010213   3rd Qu.:0.006572   3rd Qu.:1.040e-03   3rd Qu.:32.95  
##  Max.   :0.0052780   Max.   :0.035236   Max.   :3.679e-02   Max.   :52.80  
##                                                                            
##   co2_g_mi_groups rnd_adj_fe_groups
##  (250,300]:32     (25,30]:37       
##  (300,350]:29     (20,25]:34       
##  (350,400]:25     (30,35]:25       
##  (200,250]:22     (15,20]:23       
##  (400,450]:13     (40,45]:14       
##  (450,500]:13     (35,40]:13       
##  (Other)  :21     (Other): 9
##                  vehicle_type
## rnd_adj_fe_groups Both Car Truck
##           [0,5]      0   0     0
##           (5,10]     0   0     0
##           (10,15]    1   2     3
##           (15,20]    6   5    12
##           (20,25]   12   7    15
##           (25,30]   17   6    14
##           (30,35]    9   7     9
##           (35,40]    6   1     6
##           (40,45]   10   3     1
##           (45,50]    2   0     0
##           (50,55]    0   0     1
## vehicle_type
##  Both   Car Truck 
##    63    31    61
##                  vehicle_type
## rnd_adj_fe_groups       Both        Car      Truck
##           [0,5]   0.00000000 0.00000000 0.00000000
##           (5,10]  0.00000000 0.00000000 0.00000000
##           (10,15] 0.01587302 0.06451613 0.04918033
##           (15,20] 0.09523810 0.16129032 0.19672131
##           (20,25] 0.19047619 0.22580645 0.24590164
##           (25,30] 0.26984127 0.19354839 0.22950820
##           (30,35] 0.14285714 0.22580645 0.14754098
##           (35,40] 0.09523810 0.03225806 0.09836066
##           (40,45] 0.15873016 0.09677419 0.01639344
##           (45,50] 0.03174603 0.00000000 0.00000000
##           (50,55] 0.00000000 0.00000000 0.01639344
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  190 
## 
##  
##                      | cars$vehicle_type 
## cars$co2_g_mi_groups |      Both |       Car |     Truck | Row Total | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (150,200] |         7 |         1 |         1 |         9 | 
##                      |     0.778 |     0.111 |     0.111 |     0.047 | 
##                      |     0.099 |     0.018 |     0.016 |           | 
##                      |     0.037 |     0.005 |     0.005 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (200,250] |        14 |         6 |         5 |        25 | 
##                      |     0.560 |     0.240 |     0.200 |     0.132 | 
##                      |     0.197 |     0.107 |     0.079 |           | 
##                      |     0.074 |     0.032 |     0.026 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (250,300] |        16 |        20 |        13 |        49 | 
##                      |     0.327 |     0.408 |     0.265 |     0.258 | 
##                      |     0.225 |     0.357 |     0.206 |           | 
##                      |     0.084 |     0.105 |     0.068 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (300,350] |        14 |         8 |        11 |        33 | 
##                      |     0.424 |     0.242 |     0.333 |     0.174 | 
##                      |     0.197 |     0.143 |     0.175 |           | 
##                      |     0.074 |     0.042 |     0.058 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (350,400] |         8 |         4 |        14 |        26 | 
##                      |     0.308 |     0.154 |     0.538 |     0.137 | 
##                      |     0.113 |     0.071 |     0.222 |           | 
##                      |     0.042 |     0.021 |     0.074 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (400,450] |         5 |         5 |         4 |        14 | 
##                      |     0.357 |     0.357 |     0.286 |     0.074 | 
##                      |     0.070 |     0.089 |     0.063 |           | 
##                      |     0.026 |     0.026 |     0.021 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (450,500] |         4 |         4 |         6 |        14 | 
##                      |     0.286 |     0.286 |     0.429 |     0.074 | 
##                      |     0.056 |     0.071 |     0.095 |           | 
##                      |     0.021 |     0.021 |     0.032 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (500,550] |         1 |         3 |         6 |        10 | 
##                      |     0.100 |     0.300 |     0.600 |     0.053 | 
##                      |     0.014 |     0.054 |     0.095 |           | 
##                      |     0.005 |     0.016 |     0.032 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (550,600] |         1 |         3 |         2 |         6 | 
##                      |     0.167 |     0.500 |     0.333 |     0.032 | 
##                      |     0.014 |     0.054 |     0.032 |           | 
##                      |     0.005 |     0.016 |     0.011 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (600,650] |         0 |         1 |         1 |         2 | 
##                      |     0.000 |     0.500 |     0.500 |     0.011 | 
##                      |     0.000 |     0.018 |     0.016 |           | 
##                      |     0.000 |     0.005 |     0.005 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (650,700] |         1 |         0 |         0 |         1 | 
##                      |     1.000 |     0.000 |     0.000 |     0.005 | 
##                      |     0.014 |     0.000 |     0.000 |           | 
##                      |     0.005 |     0.000 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##            (800,850] |         0 |         1 |         0 |         1 | 
##                      |     0.000 |     1.000 |     0.000 |     0.005 | 
##                      |     0.000 |     0.018 |     0.000 |           | 
##                      |     0.000 |     0.005 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
##         Column Total |        71 |        56 |        63 |       190 | 
##                      |     0.374 |     0.295 |     0.332 |           | 
## ---------------------|-----------|-----------|-----------|-----------|
## 
## 

Further Analysis and Various Tables

Continuing from the previous pivot table, I created a flat contingency table giving the count of each vehicle type by weight. This helps illustrate the trend of trucks being heavier vehicles. It also shows that most of the vehicles were 4250 lbs or greater, so the sample gets smaller at higher FE scores, which is consistent with the FE score following a roughly normal distribution.

To take a closer look at comparing these two tables I generated a proportions table. First, I wanted to focus on where the majority of the data was, so I kept the vehicles above 4000 lbs, which leaves 155 observations with test weights of 4250 lbs or greater. In the proportions table we can again see a trend of “Car” and “Both” having slightly higher fuel efficiency scores than “Truck”.

Lastly, I generated a cross table to compare vehicle type against CO2 emissions in grams/mile. Focusing on the bins with the greatest number of observations, “Car” and “Both” each account for a larger share of all observations than “Truck” at 200-250 and 250-300. For 300-350 the shares are fairly even, but “Both” has the highest at .074, followed by trucks at .058. For 350-400, trucks account for more than three times the share that cars do (.074 versus .021).
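The count and proportion tables discussed above can be sketched like this. The `demo` data frame and the cutoff applied to it are illustrative assumptions. `prop.table()` with `margin = 2` divides each count by its column total, which is why each vehicle type's column sums to 1.

```r
# Made-up rows standing in for the cleaned cars data
demo <- data.frame(
  vehicle_type = factor(c("Car", "Car", "Car", "Truck", "Truck", "Both")),
  equivalent_test_weight_lbs = c(3500, 4250, 4500, 5500, 6000, 4250),
  rnd_adj_fe = c(38.2, 33.5, 31.0, 22.1, 16.4, 34.9)
)

# Counts of vehicle types by test weight
table(demo$equivalent_test_weight_lbs, demo$vehicle_type)

# Keep the heavier vehicles, then bin the FE score
heavy <- subset(demo, equivalent_test_weight_lbs >= 4250)
heavy$rnd_adj_fe_groups <- cut(heavy$rnd_adj_fe, breaks = seq(0, 55, by = 5),
                               include.lowest = TRUE)

# Counts, then column proportions (share of each vehicle type in each FE bin)
fe_counts <- table(heavy$rnd_adj_fe_groups, heavy$vehicle_type)
fe_props <- prop.table(fe_counts, margin = 2)
colSums(fe_props)   # each vehicle type's column sums to 1
```

Column proportions are the useful view here because the vehicle types have different sample sizes, so raw counts alone would overstate the larger groups.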

Conclusion

Comparing results across multiple tables and variables, we can conclude that the heavier a vehicle is, the more CO2 it is likely to emit, and this contributes to a lower FE score. Trucks tend to be heavier than cars, so it makes sense that they also trend toward lower FE scores and higher CO2 emissions.

Works Cited

When prompted with “How to create bins in increments of 50 for a variable in R”, the ChatGPT-generated text indicated: “This code will create a new column named “co2_bins” in your “cars” data frame, with values representing the bins. The seq(0, max(cars$co2_g_mi) + 50, by = 50) generates breaks starting from 0 up to the maximum value of “co2_g_mi” plus 50, with increments of 50” (OpenAI, 2024).

OpenAI. (2024). ChatGPT (Feb 25 version) [Large language model]. https://chat.openai.com/

Kabacoff, R. I. (2015). R in Action (2nd ed.). Manning Publications.