In 2009, the US government ran the Car Allowance Rebate System (CARS, better known as "cash for clunkers"), during which consumers turned in gas guzzlers and bought nearly 700,000 more fuel-efficient vehicles in fewer than 30 days. The transaction data is available through the NHTSA website.
The following analysis was done using the Final Paid Transaction Database text file (via FTP). You can find all the R code I used on my Gist.
I am going to describe the success of the program from the following three aspects.
##
## VI MP DC GU PR WY AK MT HI ND VT SD
## 6 7 17 153 428 591 1143 1469 1508 1996 2319 2437
## RI DE ID MS WV NM NV ME NE NH AR UT
## 2513 2661 2745 2927 3150 3224 3350 3885 5135 5373 5467 5606
## AL KS LA CO OK KY SC OR IA AZ CT TN
## 7376 7402 7892 8520 8730 8771 8806 8820 8868 9129 9166 11930
## WA MO MA IN WI GA MN MD NC VA NJ MI
## 13072 14339 15144 15490 16452 16590 17025 17655 18407 23555 24536 31180
## OH PA IL FL NY TX CA
## 32115 32873 33939 34129 36751 42907 76403
From the above table, we can see that the 10 states with the largest number of transactions are VA, NJ, MI, OH, PA, IL, FL, NY, TX and CA. The 10 states or territories with the smallest number of transactions are VI, MP, DC, GU, PR, WY, AK, MT, HI and ND.
I also calculated the change in the number of vehicles by category for each state, before and after this program, and plotted it together with the result from 1. From the following two graphs, we can see that for most states the number of passenger automobiles increases, whereas the number of trucks decreases.
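A table like the one above can be produced with `table()` and `sort()`. The following is a minimal sketch on made-up rows; the data frame name `paidtrans` and the column name `state` are assumptions borrowed from the output shown elsewhere in this post, not a copy of the original code.

```r
# Toy stand-in for the transaction data (the real file has one row per transaction).
paidtrans <- data.frame(
  state = c("CA", "TX", "CA", "NY", "VI", "CA", "TX"),
  stringsAsFactors = FALSE
)

# Count transactions per state and sort ascending, as in the table above.
state_counts <- sort(table(paidtrans$state))

# With the full data, n = 10 gives the top and bottom 10; n = 2 suits the toy data.
bottom_states <- names(head(state_counts, 2))
top_states <- names(tail(state_counts, 2))

print(state_counts)
print(top_states)
```

Because the counts are sorted in ascending order, `tail()` returns the largest states with the biggest one last.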
From the following graphs, we can see that the 10 states with the largest approved voucher dollar amount are VA, NJ, MI, OH, PA, IL, FL, NY, TX and CA. The bottom 10 are VI, MP, DC, GU, PR, WY, AK, MT, HI and ND. The results are very similar to what we got in A.1.
From the following graphs, we can see that the 10 states with the highest mean voucher per transaction are NH, AZ, CO, UT, ME, OR, GU, MT, PR and MP. The bottom 10 are DC, VI, NJ, MI, WY, SC, AL, LA, IL and MS. The mean values do not differ much across states.
Average MPG Ratio = Avg New Vehicle MPG/ Avg Trade-in Vehicle MPG
The top 10 states with the highest average MPG ratio are: NM, ID, NV, WA, AZ, CO, MT, OR, CA, UT. And the bottom 10 states are: VI, LA, MI, PR, MS, GU, AL, AK, SC, HI.
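The ratio defined above is a ratio of two group means, which can be sketched as follows. The column names `new_vehicle_car_mileage` and `trade_in_mileage` appear in the t-test output later in this post; the toy values and the exact grouping code are assumptions for illustration.

```r
# Toy data: two states with two transactions each.
paidtrans <- data.frame(
  state = c("CA", "CA", "TX", "TX"),
  new_vehicle_car_mileage = c(30, 26, 22, 24),
  trade_in_mileage = c(16, 12, 15, 17)
)

# Average MPG of new and trade-in vehicles per state.
avg_new <- tapply(paidtrans$new_vehicle_car_mileage, paidtrans$state, mean)
avg_old <- tapply(paidtrans$trade_in_mileage, paidtrans$state, mean)

# Average MPG ratio = avg new vehicle MPG / avg trade-in vehicle MPG.
ratio <- avg_new / avg_old
mpg_ratio <- sort(setNames(as.numeric(ratio), names(ratio)), decreasing = TRUE)

print(mpg_ratio)
```

Note that this is the ratio of the averages, not the average of per-transaction ratios; the two generally differ.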
I also calculated the average MPG ratio by category for each state and plotted it together with the result from 1. From the following two graphs, we can see that for most states the mean MPG ratio of passenger automobiles and of trucks 1 and 2 is larger than 1, whereas the mean MPG ratio of truck 3 is less than 1.
Suppose the term “west coast” refers to the states of California, Oregon and Washington, and the term “fuel efficient cars” refers to passenger automobiles. From the following table, we can see that west coast consumers bought fewer fuel-efficient cars than consumers in all other regions combined. But this is the absolute number; what about the proportion? The test for equality of proportions (a chi-squared test) shows that 63% of west coast consumers bought fuel-efficient cars, compared with 58% in the other regions. The p-value is less than 0.05, indicating that the two proportions are significantly different.
| | Passenger Automobile | Other | Total |
|---|---|---|---|
| Other | 332952 | 242368 | 575320 |
| West Coast | 62021 | 36245 | 98266 |
| Total | 394973 | 278613 | 673586 |
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: tbl4test[-3, 1] out of tbl4test[-3, 3]
## X-squared = 951.01, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.05571074 -0.04914793
## sample estimates:
## prop 1 prop 2
## 0.5787249 0.6311542
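The test above can be reproduced from the counts in the table alone. This sketch passes the passenger-automobile counts and the row totals directly to `prop.test()` instead of indexing `tbl4test`, so the variable names here are illustrative.

```r
# Passenger automobile purchases and total purchases per region (from the table).
passenger <- c(332952, 62021)   # Other, West Coast
total <- c(575320, 98266)       # Other, West Coast

# Two-sample test for equality of proportions (continuity correction is the default).
result <- prop.test(x = passenger, n = total)

print(result$estimate)  # proportion buying passenger automobiles in each group
print(result$conf.int)  # 95% CI for (Other - West Coast)
print(result$p.value)
```

The confidence interval is for the first proportion minus the second, which is why it is negative when the west coast share is the larger one.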
Let’s look at this problem in another way.
First, let’s look at the new vehicle MPG. The following violin plot shows that the average new vehicle MPG on the west coast is higher, although the modes and variances are similar. The t-test result shows that the average new vehicle MPG on the west coast (group 1) is significantly higher.
##
## Welch Two Sample t-test
##
## data: paidtrans$new_vehicle_car_mileage by paidtrans$west_coast
## t = -67.309, df = 122190, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.603444 -1.512704
## sample estimates:
## mean in group 0 mean in group 1
## 24.64484 26.20292
Second, let’s look at the trade-in vehicle MPG. The following violin plot shows that the average trade-in vehicle MPG on the west coast is about the same as in the other regions, while its variance seems smaller. The t-test result shows that the average trade-in vehicle MPG on the west coast (group 1) is significantly higher, although the difference is very small (about 0.23 MPG).
##
## Welch Two Sample t-test
##
## data: paidtrans$trade_in_mileage by paidtrans$west_coast
## t = -29.917, df = 132550, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2489629 -0.2183473
## sample estimates:
## mean in group 0 mean in group 1
## 15.66442 15.89807
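Both comparisons above are Welch two-sample t-tests, which is what `t.test()` runs by default when given a two-level grouping variable. A minimal sketch on simulated data follows; the group means and standard deviations are made up for illustration, and only the column names follow the output above.

```r
set.seed(1)

# Simulated stand-in: group 0 = other regions, group 1 = west coast.
toy <- data.frame(
  west_coast = rep(c(0, 1), times = c(500, 200)),
  new_vehicle_car_mileage = c(rnorm(500, mean = 24.6, sd = 5),
                              rnorm(200, mean = 26.2, sd = 5))
)

# var.equal = FALSE is the default, which gives the Welch test.
fit <- t.test(new_vehicle_car_mileage ~ west_coast, data = toy)

print(fit$method)
print(fit$estimate)  # mean in group 0, mean in group 1
print(fit$conf.int)  # CI for (group 0 mean - group 1 mean)
```

The t statistic and confidence interval are for group 0 minus group 1, so negative values mean the west coast average is the higher one.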
In this part, I want to dig into this problem: What factors influence consumers’ choice in the type of vehicle?
First, I chose New_vehicle_category as the dependent variable, and State, Sales_type, Trade_in_vehicle_category, Invoice_amount, Trade_in_mileage, Trade_in_odometer_reading, New_vehicle_car_mileage, New_vehicle_MSRP and Tradein_age as predictors. Here, Tradein_age is defined as the year of the sale date minus the model year of the trade-in car.
Next, I randomly selected 80% of the data as the training set and the remaining 20% as the test set. I trained the model using the random forest algorithm, with 500 trees and 3 variables tried at each split. Missing values are imputed by the function na.roughfix() from the R package randomForest.
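The derived Tradein_age predictor could be computed along these lines. The column names `sale_date` and `trade_in_model_year` are hypothetical, since the post does not show the raw column names for these two fields.

```r
# Toy rows; sale_date and trade_in_model_year are assumed column names.
toy <- data.frame(
  sale_date = as.Date(c("2009-07-27", "2009-08-15")),
  trade_in_model_year = c(1995, 2001)
)

# Age of the trade-in = year of the sale date - model year of the trade-in car.
sale_year <- as.integer(format(toy$sale_date, "%Y"))
toy$tradein_age <- sale_year - toy$trade_in_model_year

print(toy)
```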
##
## Call:
## randomForest(formula = new_vehicle_category ~ ., data = data0.train, importance = TRUE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.43%
## Confusion matrix:
## 1 2 3 P class.error
## 1 171621 267 3 11485 0.06410326
## 2 840 36722 25 129 0.02635486
## 3 15 282 1530 5 0.16484716
## P 5520 11 0 313426 0.01734090
From the above result, we can see that the out-of-bag (OOB) estimate of the error rate is 3.43%, which is a nearly unbiased estimate of the classification error on unseen data. The confusion matrix shows that the class error is small for every type except truck 3, where it is about 16%.
Then I tested the model on the test set. The predictions are very good, with an error rate of 3.48%. The class errors are very close to those from the training set.
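The split-and-train procedure can be sketched as follows. This is a minimal illustration on R's built-in `iris` data rather than the transaction data, and it sets `ntree` and `mtry` explicitly, whereas the call shown above relies on the package defaults; it assumes the randomForest package is installed.

```r
library(randomForest)

set.seed(42)

# Random 80/20 split into training and test sets.
n <- nrow(iris)
train_idx <- sample(n, size = floor(0.8 * n))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# 500 trees, 3 variables tried at each split, missing values filled by na.roughfix.
rf <- randomForest(Species ~ ., data = train,
                   ntree = 500, mtry = 3,
                   importance = TRUE, na.action = na.roughfix)

# Printing the model shows the OOB error estimate and the OOB confusion matrix.
print(rf)

# Evaluate on the held-out 20%.
pred <- predict(rf, newdata = test)
conf_test <- table(observed = test$Species, predicted = pred)
test_error <- 1 - sum(diag(conf_test)) / sum(conf_test)
print(test_error)
```

The OOB error is computed from trees that did not see a given row during training, which is why it can stand in for a held-out error estimate.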
## [1] "Confusion matrix of test set:"
## predicted
## observed 1 2 3 P
## 1 42775 66 0 2882
## 2 246 9260 6 20
## 3 4 68 362 0
## P 1418 1 0 78249
## [1] "Class error of test set:"
## [1] 0.06447521 0.02853546 0.16589862 0.01781142
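The error rates quoted above follow directly from the test-set confusion matrix. As a worked check, this sketch re-enters the matrix by hand and recomputes the per-class and overall error; it is not the original code.

```r
# Test-set confusion matrix from the output above (rows = observed, columns = predicted).
labels <- c("1", "2", "3", "P")
conf <- matrix(
  c(42775,   66,   0,  2882,
      246, 9260,   6,    20,
        4,   68, 362,     0,
     1418,    1,   0, 78249),
  nrow = 4, byrow = TRUE,
  dimnames = list(observed = labels, predicted = labels)
)

# Class error = share of each observed class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall error = share of all test rows that were misclassified.
overall_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 8))
print(round(100 * overall_error, 2))
```

Most of the overall error comes from confusion between truck 1 and passenger automobiles, the two largest classes.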
Lastly, let’s take a look at the importance of the predictors. From the following graph, we can see that the MPG of the new vehicle is by far the most important predictor of the dependent variable, followed by the invoice amount, the trade-in vehicle type, the new vehicle MSRP and the MPG of the trade-in vehicle. From the importance table, we can see that these five factors are of high importance for each category, although their rank varies. State is not very important compared to the others, which suggests that this result might generalize across states.
| | 1 | 2 | 3 | P | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|---|---|
| state | 60.32969 | 10.362512 | 6.685064 | 41.797310 | 66.75210 | 4670.7000 |
| sales_type | 17.65172 | 15.873666 | 2.033106 | 7.400115 | 21.13402 | 102.8518 |
| trade_in_vehicle_category | 314.61978 | 568.206697 | 82.268503 | 97.019728 | 460.92393 | 17520.2458 |
| invoice_amount | 322.89980 | 35.158924 | 37.351015 | 425.903611 | 476.52372 | 17986.1479 |
| trade_in_mileage | 152.34976 | 24.217173 | 17.075393 | 228.639558 | 189.63263 | 12951.7135 |
| trade_in_odometer_reading | 14.37350 | -3.507858 | 4.456435 | 32.946703 | 34.32455 | 8494.3512 |
| new_vehicle_car_mileage | 955.84255 | 1426.214790 | 277.145451 | 937.880501 | 1602.97954 | 172299.8856 |
| new_vehicle_MSRP | 172.69092 | 22.343383 | 25.581381 | 226.767311 | 227.96911 | 42394.7912 |
| tradein_age | 97.10152 | 25.748456 | 14.541912 | 40.258201 | 62.05324 | 4036.4175 |
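The ranking described above can be checked by sorting the MeanDecreaseAccuracy column. This sketch re-enters the values from the table by hand; with a fitted model, the same table is available from `importance()` and the plot from `varImpPlot()` in the randomForest package.

```r
# MeanDecreaseAccuracy values copied from the importance table above.
mean_decrease_accuracy <- c(
  state = 66.75210,
  sales_type = 21.13402,
  trade_in_vehicle_category = 460.92393,
  invoice_amount = 476.52372,
  trade_in_mileage = 189.63263,
  trade_in_odometer_reading = 34.32455,
  new_vehicle_car_mileage = 1602.97954,
  new_vehicle_MSRP = 227.96911,
  tradein_age = 62.05324
)

# Rank predictors from most to least important.
ranked <- sort(mean_decrease_accuracy, decreasing = TRUE)
top5 <- names(ranked)[1:5]

print(ranked)
```

Ranking by MeanDecreaseGini instead would give a somewhat different order, so the choice of measure matters when reporting importance.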
I don’t think we can conclude that this program was “wildly successful” solely from the data provided by NHTSA. Using the metrics I defined in part I to measure success: 1) Number of transactions. We should compare the number of transactions during the program with the number in periods of the same length before and after the program. 2) Average MPG ratio. We should compare the average MPG ratio during the program with that in periods of the same length before and after the program. It is possible that people would have bought more fuel-efficient cars regardless of the incentive.
Moreover, we should also consider the price of oil. Is the value of the oil saved by the fuel-efficient cars higher than the cost of the incentive? And does manufacturing new cars and disposing of old ones consume more energy than we save by switching to fuel-efficient cars?