In 2009 the US government ran the Car Allowance Rebate System (CARS, better known as “Cash for Clunkers”), during which consumers turned in gas guzzlers and bought nearly 700,000 more fuel-efficient vehicles in fewer than 30 days. The transaction data is available through the NHTSA website.

The following analysis uses the Final Paid Transaction Database text file (available via FTP). You can find all the R code I used on my Gist.

I. Measure the Success of the Program

I am going to describe the success of the program from the following three aspects.

A. Number of Transactions

1. Total Number of Transactions by State

## 
##    VI    MP    DC    GU    PR    WY    AK    MT    HI    ND    VT    SD 
##     6     7    17   153   428   591  1143  1469  1508  1996  2319  2437 
##    RI    DE    ID    MS    WV    NM    NV    ME    NE    NH    AR    UT 
##  2513  2661  2745  2927  3150  3224  3350  3885  5135  5373  5467  5606 
##    AL    KS    LA    CO    OK    KY    SC    OR    IA    AZ    CT    TN 
##  7376  7402  7892  8520  8730  8771  8806  8820  8868  9129  9166 11930 
##    WA    MO    MA    IN    WI    GA    MN    MD    NC    VA    NJ    MI 
## 13072 14339 15144 15490 16452 16590 17025 17655 18407 23555 24536 31180 
##    OH    PA    IL    FL    NY    TX    CA 
## 32115 32873 33939 34129 36751 42907 76403
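A table like the one above can be produced with base R's `table()` and `sort()`. The following is only a minimal sketch, not the original code: it assumes the transactions sit in a data frame with a `state` column, and the toy rows are made up for illustration.

```r
# Toy stand-in for the transaction data (made-up rows);
# the real file has one row per paid transaction.
toy <- data.frame(state = c("CA", "TX", "CA", "NY", "CA", "TX"))

# Count transactions per state and sort ascending,
# which is the layout of the output above.
counts <- sort(table(toy$state))
print(counts)

# The states with the most transactions sit at the end of the sorted table.
top2 <- rev(names(counts))[1:2]
print(top2)
```

Because the table is sorted in ascending order, the bottom 10 and top 10 can be read directly from the first and last ten entries.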

From the table above, the 10 states with the most transactions are: VA, NJ, MI, OH, PA, IL, FL, NY, TX, CA. The 10 with the fewest are: VI, MP, DC, GU, PR, WY, AK, MT, HI, ND. Note that the bottom group includes DC and four territories (VI, MP, GU, PR).

2. Difference by Category

I also calculated, for each state, the difference in the number of vehicles by category before and after the program (trade-ins versus new purchases). I plot them together with the result from 1. From the following two graphs, we can see that for most states the number of passenger automobiles increases, whereas the number of trucks decreases.

B. Approved Voucher Dollar Amount

1. Total Voucher Dollar Amount by State

From the following graphs, we can see that the top 10 states with the largest approved voucher dollar amount are: VA, NJ, MI, OH, PA, IL, FL, NY, TX, CA. And the bottom 10 states are: VI, MP, DC, GU, PR, WY, AK, MT, HI, ND. The results are very similar to what we got in A.1.

2. Average Voucher Dollar per Transaction by State

From the following graphs, we can see that the top 10 states with the highest mean voucher per transaction are: NH, AZ, CO, UT, ME, OR, GU, MT, PR, MP. And the bottom 10 states are: DC, VI, NJ, MI, WY, SC, AL, LA, IL, MS. We didn’t see any huge difference in the mean values.

C. Average MPG Ratio

1. Average MPG Ratio by State

Average MPG Ratio = Avg New Vehicle MPG/ Avg Trade-in Vehicle MPG
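To make the definition concrete, here is a small sketch of the ratio computed per state. The column names `new_vehicle_car_mileage`, `trade_in_mileage` and `state` follow the names that appear elsewhere in this post; the toy values are made up.

```r
# Made-up rows standing in for the transaction data.
toy <- data.frame(
  state = c("CA", "CA", "MI", "MI"),
  new_vehicle_car_mileage = c(30, 26, 24, 20),
  trade_in_mileage = c(16, 12, 16, 16)
)

# Average each MPG column within a state ...
avg <- aggregate(cbind(new_vehicle_car_mileage, trade_in_mileage) ~ state,
                 data = toy, FUN = mean)

# ... then take the ratio of the two averages, as in the formula above.
avg$mpg_ratio <- avg$new_vehicle_car_mileage / avg$trade_in_mileage

# Highest ratio first.
print(avg[order(-avg$mpg_ratio), ])
```

Note that this is a ratio of averages, which is not the same as averaging the per-transaction ratios; the formula above describes the former.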

The top 10 states with the highest average MPG ratio are: NM, ID, NV, WA, AZ, CO, MT, OR, CA, UT. And the bottom 10 states are: VI, LA, MI, PR, MS, GU, AL, AK, SC, HI.

2. Average MPG Ratio by Category

I also calculated the average MPG ratio by category for each state. I plot them together with the result from 1. From the following two graphs, we can see that for most states the mean MPG ratios of passenger automobiles, truck 1 and truck 2 are larger than 1, whereas the mean MPG ratio of truck 3 is less than 1.

II. Did West Coast Consumers Buy More Fuel Efficient Cars?

Suppose “west coast” refers to the states of California, Oregon and Washington, and “fuel efficient cars” refers to passenger automobiles. From the following table, we can see that west coast consumers bought fewer fuel-efficient cars than consumers in the rest of the country. But this is the absolute number; what about the proportion? The two-sample test of proportions (a chi-squared test) shows that 63% of west coast consumers bought fuel-efficient cars (prop 2), compared with 58% in other regions (prop 1). The p-value is less than 0.05, indicating that the two proportions are significantly different.

             Passenger Automobile   Other    Total
Other                      332952  242368   575320
West Coast                  62021   36245    98266
Total                      394973  278613   673586
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  tbl4test[-3, 1] out of tbl4test[-3, 3]
## X-squared = 951.01, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.05571074 -0.04914793
## sample estimates:
##    prop 1    prop 2 
## 0.5787249 0.6311542
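The test above can be re-run from the counts in the table alone. In this sketch the matrix is typed in by hand from the table; the name `tbl4test` is taken from the output above, but how the original object was built is an assumption.

```r
# Contingency table typed in from the counts above (rows: region, columns: vehicle type).
tbl4test <- matrix(
  c(332952, 242368, 575320,
     62021,  36245,  98266,
    394973, 278613, 673586),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Other", "West Coast", "Total"),
                  c("Passenger Automobile", "Other", "Total"))
)

# Drop the "Total" row, then test passenger-automobile counts out of row totals.
res <- prop.test(tbl4test[-3, 1], tbl4test[-3, 3])

print(res$estimate)  # prop 1 = other regions, prop 2 = west coast
print(res$p.value)
```

Since prop 1 is the other regions and prop 2 is the west coast, a confidence interval that lies entirely below zero means the west coast share is the larger one.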

Let’s look at this problem in another way.

First, let’s look at the new vehicle MPG. The following violin plot shows that the average new vehicle MPG on the west coast is higher, although the modes and variance are similar. The t-test shows that the average new vehicle MPG on the west coast (group 1) is significantly higher, by about 1.6 MPG.

## 
##  Welch Two Sample t-test
## 
## data:  paidtrans$new_vehicle_car_mileage by paidtrans$west_coast
## t = -67.309, df = 122190, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.603444 -1.512704
## sample estimates:
## mean in group 0 mean in group 1 
##        24.64484        26.20292
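For readers who want to try this kind of comparison, here is a minimal sketch of a Welch two-sample t-test on made-up numbers. The column names mirror those in the output above, with `west_coast` assumed to be a 0/1 indicator.

```r
# Made-up MPG values: five purchases outside the west coast (0) and five on it (1).
toy <- data.frame(
  new_vehicle_car_mileage = c(22, 25, 24, 27, 23, 26, 28, 25, 29, 27),
  west_coast = rep(c(0, 1), each = 5)
)

# t.test() uses the Welch (unequal variance) version by default.
res <- t.test(new_vehicle_car_mileage ~ west_coast, data = toy)

print(res$estimate)   # mean in group 0, mean in group 1
print(res$statistic)  # negative when group 1 has the higher mean
print(res$conf.int)   # interval for (mean of group 0) - (mean of group 1)
```

R reports the difference as group 0 minus group 1, so a negative t statistic and a confidence interval below zero both say that group 1 has the higher mean.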

Second, let’s look at the trade-in vehicle MPG. The following violin plot shows that the average trade-in vehicle MPG on the west coast is about the same as in other regions, while its variance seems smaller. The t-test shows that the average trade-in vehicle MPG on the west coast (group 1) is significantly higher, although the difference is very small (about 0.23 MPG).

## 
##  Welch Two Sample t-test
## 
## data:  paidtrans$trade_in_mileage by paidtrans$west_coast
## t = -29.917, df = 132550, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2489629 -0.2183473
## sample estimates:
## mean in group 0 mean in group 1 
##        15.66442        15.89807

III. Behavioral Patterns about How Consumers Buy New Vehicles

In this part, I want to dig into this problem: What factors influence consumers’ choice in the type of vehicle?

First, I chose New_vehicle_category as the dependent variable, and State, Sales_type, Trade_in_vehicle_category, Invoice_amount, Trade_in_mileage, Trade_in_odometer_reading, New_vehicle_car_mileage, New_vehicle_MSRP and Tradein_age as predictors. Here, Tradein_age is defined as the year of the sale date minus the model year of the trade-in car.
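The derived predictor can be sketched as follows. The column names `sale_date` and `trade_in_model_year` are hypothetical placeholders, since the actual field names in the NHTSA file are not shown here, and the rows are made up.

```r
# Made-up rows; both column names below are hypothetical placeholders.
toy <- data.frame(
  sale_date = as.Date(c("2009-07-27", "2009-08-15")),
  trade_in_model_year = c(1995, 2001)
)

# Age of the trade-in: year of the sale date minus the model year.
sale_year <- as.integer(format(toy$sale_date, "%Y"))
toy$tradein_age <- sale_year - toy$trade_in_model_year

print(toy)
```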

Next, I randomly selected 80% of the data as the training set and the remaining 20% as the test set. I trained the model using the random forest algorithm. The number of trees is 500, and the number of variables tried at each split is 3. Missing values are imputed by the function na.roughfix() from the R package randomForest.
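The split and the model fit might look roughly like the sketch below. It runs on simulated data with only two predictors; the seed, the toy data and the name `data0.test` are my own assumptions, while `data0.train` and the `randomForest()` arguments follow the call shown in the output below. The fit is skipped if the randomForest package is not installed.

```r
set.seed(1)  # arbitrary seed so the split is repeatable

# Simulated stand-in for the transaction data (not the real file).
toy <- data.frame(
  new_vehicle_category = factor(sample(c("1", "2", "3", "P"), 200, replace = TRUE)),
  new_vehicle_car_mileage = runif(200, 15, 40),
  invoice_amount = runif(200, 10000, 40000)
)

# Random 80/20 split by row index.
n <- nrow(toy)
train_idx <- sample(n, size = floor(0.8 * n))
data0.train <- toy[train_idx, ]
data0.test <- toy[-train_idx, ]

if (requireNamespace("randomForest", quietly = TRUE)) {
  # ntree defaults to 500; na.roughfix imputes medians (numeric) or modes (factor).
  fit <- randomForest::randomForest(
    new_vehicle_category ~ ., data = data0.train,
    importance = TRUE, na.action = randomForest::na.roughfix
  )
  pred <- predict(fit, newdata = randomForest::na.roughfix(data0.test))
  print(table(observed = data0.test$new_vehicle_category, predicted = pred))
}
```

With nine predictors, the package default for classification (the floor of the square root of the number of predictors) gives 3 variables per split, which matches the setting reported above.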

## 
## Call:
##  randomForest(formula = new_vehicle_category ~ ., data = data0.train,      importance = TRUE, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.43%
## Confusion matrix:
##        1     2    3      P class.error
## 1 171621   267    3  11485  0.06410326
## 2    840 36722   25    129  0.02635486
## 3     15   282 1530      5  0.16484716
## P   5520    11    0 313426  0.01734090

From the result above, the out-of-bag (OOB) estimate of the error rate is 3.43%, which serves as an approximately unbiased estimate of the classification error. The confusion matrix shows that the class error is small for every type except truck 3 (about 16%), with truck 1 next at about 6%.

Then I tested the model on the test set. The predictions are very good, with an error rate of 3.48%. The class errors are very close to those from the training set.

## [1] "Confusion matrix of test set:"
##         predicted
## observed     1     2     3     P
##        1 42775    66     0  2882
##        2   246  9260     6    20
##        3     4    68   362     0
##        P  1418     1     0 78249
## [1] "Class error of test set:"
## [1] 0.06447521 0.02853546 0.16589862 0.01781142
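The overall and per-class error rates follow directly from the confusion matrix. This sketch retypes the test-set matrix above by hand and recomputes them; it is an illustration of the arithmetic, not the original code.

```r
# Test-set confusion matrix typed in from the output above
# (rows: observed class, columns: predicted class).
cm <- matrix(
  c(42775,   66,   0,  2882,
      246, 9260,   6,    20,
        4,   68, 362,     0,
     1418,    1,   0, 78249),
  nrow = 4, byrow = TRUE,
  dimnames = list(observed = c("1", "2", "3", "P"),
                  predicted = c("1", "2", "3", "P"))
)

# Overall error: share of observations off the diagonal.
overall_error <- 1 - sum(diag(cm)) / sum(cm)

# Class error: share of each observed class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

print(round(overall_error, 4))
print(round(class_error, 4))
```

Most of the errors come from confusion between truck 1 and passenger automobiles, while truck 3 has the highest class error because it has very few observations.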

Lastly, let’s take a look at the importance of the predictors. From the following graph, we can see that the MPG of the new vehicle is by far the most important predictor of the dependent variable, followed by the invoice amount, trade-in vehicle type, new vehicle MSRP and the MPG of the trade-in vehicle (ranked by mean decrease in accuracy). From the importance table, we can see that these five factors are of high importance for each category, although their rank varies. State is not very important compared to the others, which suggests that the result might generalize beyond any particular state.

                                   1            2           3           P  MeanDecreaseAccuracy  MeanDecreaseGini
state                       60.32969    10.362512    6.685064   41.797310              66.75210         4670.7000
sales_type                  17.65172    15.873666    2.033106    7.400115              21.13402          102.8518
trade_in_vehicle_category  314.61978   568.206697   82.268503   97.019728             460.92393        17520.2458
invoice_amount             322.89980    35.158924   37.351015  425.903611             476.52372        17986.1479
trade_in_mileage           152.34976    24.217173   17.075393  228.639558             189.63263        12951.7135
trade_in_odometer_reading   14.37350    -3.507858    4.456435   32.946703              34.32455         8494.3512
new_vehicle_car_mileage    955.84255  1426.214790  277.145451  937.880501            1602.97954       172299.8856
new_vehicle_MSRP           172.69092    22.343383   25.581381  226.767311             227.96911        42394.7912
tradein_age                 97.10152    25.748456   14.541912   40.258201              62.05324         4036.4175
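To check the ranking, the two summary columns of the importance table can be sorted directly. The values below are copied by hand from the table; with a fitted model, the same numbers would normally come from the randomForest package's `importance()` function.

```r
# Mean decrease in accuracy, copied from the importance table.
mda <- c(
  state = 66.75210, sales_type = 21.13402,
  trade_in_vehicle_category = 460.92393, invoice_amount = 476.52372,
  trade_in_mileage = 189.63263, trade_in_odometer_reading = 34.32455,
  new_vehicle_car_mileage = 1602.97954, new_vehicle_MSRP = 227.96911,
  tradein_age = 62.05324
)

# Mean decrease in Gini, copied from the same table.
gini <- c(
  state = 4670.7000, sales_type = 102.8518,
  trade_in_vehicle_category = 17520.2458, invoice_amount = 17986.1479,
  trade_in_mileage = 12951.7135, trade_in_odometer_reading = 8494.3512,
  new_vehicle_car_mileage = 172299.8856, new_vehicle_MSRP = 42394.7912,
  tradein_age = 4036.4175
)

# Five most important predictors under each measure.
top5_mda <- names(sort(mda, decreasing = TRUE))[1:5]
top5_gini <- names(sort(gini, decreasing = TRUE))[1:5]
print(top5_mda)
print(top5_gini)
```

Both measures pick the same five predictors, with the new vehicle MPG first in each; only the order of the remaining four differs.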

IV. Is this Program Wildly Successful?

I don’t think we can conclude that this program was “wildly successful” solely from the data provided by NHTSA, because the data covers only the program period and offers no baseline. Using the metrics I defined in part I to measure success: 1) Number of Transactions. We should compare the number of transactions during the program with the number in a period of the same length before and after it. 2) Average MPG Ratio. We should compare the average MPG ratio during the program with that in a period of the same length before and after it. It is possible that people tend to buy more fuel-efficient cars regardless of the incentive.

Moreover, we should also consider the price of oil. Is the value of the oil saved by the fuel-efficient cars higher than the cost of the incentive? And does manufacturing new cars and disposing of old ones consume more energy than we save by switching to fuel-efficient cars?