xgboost.
model.matrix() which is an R internal function.lm.MMR - Mixed Model Regression?Bivariate Pearson correlations of
We briefly describe the results for each of the research hypotheses generated for Task 2 but not studied in this work. The hypothesis number refers to their order in Task 2.
We test these hypotheses, and provide the reader with an immersive interactive view of the data via the interactive dashboard.
The research project was ordered as follows:
Research Question 0. Investigation of all ten original research hypothesis, choosing those which have business implication and are testable.
Research Question 1. We start from investigation of the make and origin effects on lemon rate. This puts to the test a Null Hypothesis coming from common knowledge/sense (“some origins are more prone to lemons than others”).
Research Question 2. We seek predictors that are strongly associated with the lemon outcome.
Research Question 3. We use the predictors found in Question 2, as well as domain knowledge to train, tune, fit and assess two advanced machine learning models on the data..
Additional questions are also answered.
| hypothesis | studied | remark |
|---|---|---|
| 1 | TRUE | Primary hypothesis of this project. |
| 2 | TRUE | Primary hypothesis of this project. |
| 3 | TRUE | Advised by supervisor as a preliminary check |
| 4 | TRUE | Same as 3 |
| 5 | FALSE | We present a module for regression modeling of this variable |
| 6 | FALSE | The geographic data are unbalanced, and no trends were identified. |
| 7 | TRUE | We already presented this in Task 2 |
| 8 | TRUE | We fitted machine learning models to the data and assessed them |
| 9 | FALSE | We found no such evidence |
| 10 | TRUE | There is no difference. |
The question as originally formulated:
How do the lemon rates vary across different makes and models? Are there any significant differences in lemon rates between American, Japanese, and other manufacturers?
We took two approaches - a descriptive (graphical) one, and one that is rooted in statistics.
Table with lemon counts and rates.
Fisher’s exact test tests whether the distribution of lemons is similar across different origins (H0) or otherwise (H1).
According to a simulated1 Fisher’s exact test, the lemon rate depends on the origin country.
| Characteristic | germany, N = 1581 | japan, N = 7,3311 | other, N = 1301 | south korea, N = 4,2951 | sweden, N = 371 | usa, N = 61,0321 | p-value2 |
|---|---|---|---|---|---|---|---|
| lemon | 27 (17%) | 1,026 (14%) | 11 (8.5%) | 525 (12%) | 0 (0%) | 7,387 (12%) | <0.001 |
| 1 n (%) | |||||||
| 2 Fisher’s Exact Test for Count Data with simulated p-value (based on 1e+05 replicates) | |||||||
| Country of origin | Lemons/100 vehicles | Cars in database | CI(95%) |
|---|---|---|---|
| germany | 17.09 | 158 | 0.118-0.241 |
| japan | 13.9 | 7461 | 0.131-0.147 |
| south korea | 12.22 | 4295 | 0.113-0.132 |
| usa | 12.1 | 61032 | 0.118-0.124 |
| sweden | 0 | 37 | 0-0.117 |
| No. of makes |
|---|
| 2 |
| 13 |
| 2 |
| 15 |
| 1 |
In conclusion, we see that it is the make, and not the country of origin which is responsible for the lemon rate.
Actionable business implication: use the findings to promote domestic cars, as their lemon rate is no different than of foreign makes.
Use the estimated lemon rate in the framework quantitative risk management, i.e., pass this insight to the money people in the firm.
What are the key factors, including make, model, year, and other vehicle characteristics, that are most strongly associated with a car being a lemon?
We originally speculated: that vehicle age, make, and model might be predictors of a lemon.
Since the dataset is very large (n=72983), statistical significance may be achieved for factors that have nothing to do with lemon status of a car.
We employ the following variable screening procedure:
For each variable, we fit a univariate logistic regression model to predict the outcome.
The user of the interactive report can see these models and assess their fit.
The single strongest predictor of a lemon care is its age.
Somewhat surprisingly, the transmission is a weak single predictor.
Cost-associated columns (MNR) and the odomoter are all strongly correlated with vehicle_age.
| variable | auc |
|---|---|
| vehicle_age | 0.6318 |
| mmr_current_auction_average_price | 0.5983 |
| mmr_acquisition_auction_average_price | 0.5977 |
| mmr_current_auction_clean_price | 0.5947 |
| mmr_acquisition_auction_clean_price | 0.5925 |
| mmr_current_retail_average_price | 0.5895 |
| veh_b_cost | 0.589 |
| mmr_current_retail_clean_price | 0.5875 |
| mmr_acquisition_retail_average_price | 0.5758 |
| mmr_acquisiton_retail_clean_price | 0.5723 |
| veh_odo | 0.5712 |
| transmission | 0.5025 |
Variables selected due to domain knowledge: vehicle_age, origin, transmission, veh_odo
Variables not used in the final analyses: auction, veh_year, make, model, trim, sub_model, color, nationality, size, top_three_american_name, vnzip, vnst, is_online_sale, warranty_cost, purch_date, byrno
Variables considered: mmr_acquisition_auction_average_price, mmr_acquisition_auction_clean_price, mmr_acquisition_retail_average_price, mmr_acquisiton_retail_clean_price, mmr_current_auction_average_price, mmr_current_auction_clean_price, mmr_current_retail_average_price, mmr_current_retail_clean_price, veh_b_cost
The cost (veh_b_cost) and the transformed MMR variables are also included in the model.
We originally asked in Task 2:
Can we develop a predictive model to estimate the probability of a car being a lemon based on its features?
In this section we fit machine learning models to classify lemon/normal cars, and assess the performance of these models. We discuss potential business implications of these models.
We used tree-based ensemble methods which are considered as well-performing (Boehmke and Greenwell 2019).
Random forest - using the R ranger package (Wright and Ziegler 2017).
XGboost - using xgboost (Chen et al. 2023).
In Random Forest, you just need enough trees.
In XGboost, the number of trees needs tuning, too many trees may result in overfitting.
Models are trained to optimize some metric of their peformance.
We focused on sensitivity (aka true positive rate): we want to find as many lemon cars as we can.
\[\text{Sens.} = \frac{\text{True positive}}{\text{All positive}}\]
A sensitivity of 50% implies that we correctly identify 50% of the lemons out there.
This makes sense because “one rotten apple spoils the pile.”
However, this is not free of cost.
caret packgage (Kuhn and Max 2008).| Model | Sensitivity |
|---|---|
| XGB | 13.97% [12.19-15.06] |
| RF | 2.4% [1.54-3.27] |
| Hyperparameter | values | best |
|---|---|---|
| XGBoost | ||
| nrounds | 100,200,300 | 300 |
| max_depth | 3,6,9 | 6 |
| eta | 1,0.1,0.01,0.001 | 1 |
| gamma | 0,1,5 | 0 |
| colsample_bytree | 0.6,0.8,1 | 1 |
| min_child_weight | 1,5,10 | 1 |
| subsample | 0.6,0.8,1 | 0.6 |
| Random Forest | ||
| mtry | 2,4,6 | 6 |
| min.node.size | 3,5 | 3 |
| splitrule | gini | gini |
Variable importances from the two predictive models.
ROC curve for the model that contains the make (XGBoost)
| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| car_age | |||
| new | — | — | |
| old | 1.69 | 1.28, 2.23 | <0.001 |
| odo | 1.01 | 1.01, 1.01 | <0.001 |
| car_age * odo | |||
| old * odo | 1.00 | 1.00, 1.01 | 0.063 |
| 1 OR = Odds Ratio, CI = Confidence Interval | |||
From this interaction plot we see that:
Below 3 years of age, the odometer reading has little effect on the probability of lemon.
After 5 years it is apparent that higher odometer = more lemons.
The ratio of the predicted probabilities given age is constant \(\exp(\hat{\beta}_{\text{odo}})\).
No: the lemon rate is exactly the same online and in-store.
The contingency table below reveals a steady 12% lemon rate in both online and retail sales.
The conclusion from the \(\chi^2\) test is that there is no association between online sale and lemon.
| lemon | Total | p-value1 | ||
|---|---|---|---|---|
| 0 | 1 | |||
| is_online_sale | 0.3 | |||
| 0 | 62,375 (88%) | 8,763 (12%) | 71,138 (100%) | |
| 1 | 1,632 (88%) | 213 (12%) | 1,845 (100%) | |
| Total | 64,007 (88%) | 8,976 (12%) | 72,983 (100%) | |
| 1 Pearson’s Chi-squared test | ||||
Lemon status is hard to predict.
The chance of lemon increases with car age.
More kilometers on the odometer - higher chance of lemon!
The final classifier probably does not pass reasonable threshold, a higher AUC is desired.
Domain knowledge is lacking and this notably impacts the research.
The Sensitivity of the model can be chosen by the deployer. For example:
All cars are lemons.
Focus on studying foreign cars, German and other European makes in particular.
Gather more lemon observations - cooperate with other dealers?
Build a more advanced shiny app to include more models (neural networks for example).