Data preparation

Transformations

One-hot encoding was needed for fitting xgboost.
- We used a trick with the function model.matrix() which is an R internal function.
- It is called “under the hood” of functions such as lm.
There are variables in the data which may have come from another model.
- They are predictions of various price definitions for the given car.
- Their correlations are very high (almost 1).
- They have prefix MMR - Mixed Model Regression?
- We decided to combine them into a single predictor.
  - We used Principal Components Analysis, a common dimensionality reduction method.
  - A single component explains more than 60% of their common variance.
- These variables are summarized into one predictor, called MMR.
The right figure below demonstrates how high the correlations between MMR variables were.
The left figure is called a “screeplot,” it helps to select the appropriate number of principal components.
We decided to use only the first component, which explained more than 60% of the variability in the data.

Bivariate Pearson correlations of

Imputation

We cannot assume a reasonable model to fill in NA values.
We used the commonly recommended median and mode imputations.
- For missing numeric variables - we set NA’s to the median of the respective column.
- For missing categorical variables - likewise, but with the most frequent value (the mode).
This could be improved in various ways, for instance, creating indicator variables for the missingness.

Outliers

We found no considerable outliers in the subset of the data that we finally used for this project.

Research

Selected Research Questions

We briefly describe the results for each of the research hypotheses generated for Task 2 but not studied in this work. The hypothesis number refers to their order in Task 2.

We test these hypotheses, and provide the reader with an immersive interactive view of the data via the interactive dashboard.

The research project was ordered as follows:

Research Question 0. Investigation of all ten original research hypothesis, choosing those which have business implication and are testable.

Research Question 1. We start from investigation of the make and origin effects on lemon rate. This puts to the test a Null Hypothesis coming from common knowledge/sense (“some origins are more prone to lemons than others”).

Research Question 2. We seek predictors that are strongly associated with the lemon outcome.

Research Question 3. We use the predictors found in Question 2, as well as domain knowledge to train, tune, fit and assess two advanced machine learning models on the data..

Additional questions are also answered.

hypothesis	studied	remark
1	TRUE	Primary hypothesis of this project.
2	TRUE	Primary hypothesis of this project.
3	TRUE	Advised by supervisor as a preliminary check
4	TRUE	Same as 3
5	FALSE	We present a module for regression modeling of this variable
6	FALSE	The geographic data are unbalanced, and no trends were identified.
7	TRUE	We already presented this in Task 2
8	TRUE	We fitted machine learning models to the data and assessed them
9	FALSE	We found no such evidence
10	TRUE	There is no difference.

Research Question I:

The question as originally formulated:

How do the lemon rates vary across different makes and models? Are there any significant differences in lemon rates between American, Japanese, and other manufacturers?

We took two approaches - a descriptive (graphical) one, and one that is rooted in statistics.

Approach 1 - Visual inspection of the data

According to the barplot, there is a clear difference in the lemon rate between makes.
However, the origin country is not very informative.
- Some Japanese makes are reliable, others aren’t, and likewise for American and German cars.

Approach 2 - Statistical hypothesis testing

Table with lemon counts and rates.
Fisher’s exact test tests whether the distribution of lemons is similar across different origins (H0) or otherwise (H1).
According to a simulated¹ Fisher’s exact test, the lemon rate depends on the origin country.

Characteristic	germany, N = 158¹	japan, N = 7,331¹	other, N = 130¹	south korea, N = 4,295¹	sweden, N = 37¹	usa, N = 61,032¹	p-value²
lemon	27 (17%)	1,026 (14%)	11 (8.5%)	525 (12%)	0 (0%)	7,387 (12%)	<0.001
¹ n (%)
² Fisher’s Exact Test for Count Data with simulated p-value (based on 1e+05 replicates)

Table continues below
Country of origin	Lemons/100 vehicles	Cars in database	CI(95%)
germany	17.09	158	0.118-0.241
japan	13.9	7461	0.131-0.147
south korea	12.22	4295	0.113-0.132
usa	12.1	61032	0.118-0.124
sweden	0	37	0-0.117

No. of makes
2
13
2
15
1

In conclusion, we see that it is the make, and not the country of origin which is responsible for the lemon rate.
Actionable business implication: use the findings to promote domestic cars, as their lemon rate is no different than of foreign makes.
Use the estimated lemon rate in the framework quantitative risk management, i.e., pass this insight to the money people in the firm.
- If this is actually deployed, the model needs to be monitored over time to validate its performance in a changing world.

Research Question II: (originally question 1)

What are the key factors, including make, model, year, and other vehicle characteristics, that are most strongly associated with a car being a lemon?

We originally speculated: that vehicle age, make, and model might be predictors of a lemon.
Since the dataset is very large (n=72983), statistical significance may be achieved for factors that have nothing to do with lemon status of a car.
We employ the following variable screening procedure:
For each variable, we fit a univariate logistic regression model to predict the outcome.
- We use 70% of the available data, while reserving 30% for the test.
- We estimate the predictor’s Area Under the ROC Curve (AUC) from the test data.
- Variables with AUC close to 0.5 will be neglected and discarded from further analyses².
The user of the interactive report can see these models and assess their fit.

The key factors

The single strongest predictor of a lemon care is its age.
Somewhat surprisingly, the transmission is a weak single predictor.
Cost-associated columns (MNR) and the odomoter are all strongly correlated with vehicle_age.

variable	auc
vehicle_age	0.6318
mmr_current_auction_average_price	0.5983
mmr_acquisition_auction_average_price	0.5977
mmr_current_auction_clean_price	0.5947
mmr_acquisition_auction_clean_price	0.5925
mmr_current_retail_average_price	0.5895
veh_b_cost	0.589
mmr_current_retail_clean_price	0.5875
mmr_acquisition_retail_average_price	0.5758
mmr_acquisiton_retail_clean_price	0.5723
veh_odo	0.5712
transmission	0.5025

Selected variables

Variables selected due to domain knowledge: vehicle_age, origin, transmission, veh_odo
Variables not used in the final analyses: auction, veh_year, make, model, trim, sub_model, color, nationality, size, top_three_american_name, vnzip, vnst, is_online_sale, warranty_cost, purch_date, byrno
Variables considered: mmr_acquisition_auction_average_price, mmr_acquisition_auction_clean_price, mmr_acquisition_retail_average_price, mmr_acquisiton_retail_clean_price, mmr_current_auction_average_price, mmr_current_auction_clean_price, mmr_current_retail_average_price, mmr_current_retail_clean_price, veh_b_cost
The cost (veh_b_cost) and the transformed MMR variables are also included in the model.

Question III: Machine learning model

We originally asked in Task 2:

Can we develop a predictive model to estimate the probability of a car being a lemon based on its features?

In this section we fit machine learning models to classify lemon/normal cars, and assess the performance of these models. We discuss potential business implications of these models.

Models used

We used tree-based ensemble methods which are considered as well-performing (Boehmke and Greenwell 2019).
Random forest - using the R ranger package (Wright and Ziegler 2017).
- This is a “parallel” ensemble: each tree gets its own data and makes predictions.
- The predictions are aggregated by majority vote.
XGboost - using xgboost (Chen et al. 2023).
- This is a sequential ensemble of trees: each tree is fitted on the prediction errors of the previous tree.
In Random Forest, you just need enough trees.
In XGboost, the number of trees needs tuning, too many trees may result in overfitting.

Model performance metric choice: Sensitivity.

Models are trained to optimize some metric of their peformance.
We focused on sensitivity (aka true positive rate): we want to find as many lemon cars as we can.

\[\text{Sens.} = \frac{\text{True positive}}{\text{All positive}}\]

A sensitivity of 50% implies that we correctly identify 50% of the lemons out there.
This makes sense because “one rotten apple spoils the pile.”
- A dissatisfied customer who got a lemon not only costs much, but may also reduce future sales.
However, this is not free of cost.
- In exchange for the high sensitivity we get a low Specificity.
- This means a lot of false accusations of “lemonness,” thus missing out on good cars.
- The ROC curve is a plot of all possible (Sens., Spec.) pairs for the model.

Model training

We trained the models via 5-Fold crossvalidation, tuning the hyperparameters to optimize sensitivity.
- The models were trained on the same datasets, via caret packgage (Kuhn and Max 2008).

Model	Sensitivity
XGB	13.97% [12.19-15.06]
RF	2.4% [1.54-3.27]

Hyperparameter	values	best
XGBoost
nrounds	100,200,300	300
max_depth	3,6,9	6
eta	1,0.1,0.01,0.001	1
gamma	0,1,5	0
colsample_bytree	0.6,0.8,1	1
min_child_weight	1,5,10	1
subsample	0.6,0.8,1	0.6
Random Forest
mtry	2,4,6	6
min.node.size	3,5	3
splitrule	gini	gini

Models’ performance

Variable importances from the two predictive models.

ROC curve for the model that contains the make (XGBoost)

Implications

Assuming 15% sensitivity is unacceptable for a loss of about 8.8% of the trades (the specificity of the best model was 91.2%).
If the volume of offered cars (the supply) is much greater than what the company can process anyway, the sensitivity can be set higher.
- For example, if we allow falsely rejecting 50% of potential cars, we get 70% sensitivity.
A drawback is that overall the classifier is of medium predictive power, and many good deals will be missed if it is deployed.

Other Hyptoheses

H7: Lemon over time?

The notion of “time-trend” depends on what time we measure. It can refer to:
- The car’s production year.
- Current lifetime of the car (age).
- Odometer reading - some call it “the true car age.”
We fit a logistic regression model to the data using the vehicle’s age and the odometer reading.
- The odometer reading has been divided by 1000.
- The age is transformed to binary - new or old, less/more than 5 years.
The results show that the car’s age effect on the chance of lemon is statistically significant.
- An old car is 1.69 more likely to be a lemon.
- Every 1000 miles add 1% to the odds of being a lemon.
- The interaction term is not significant at 5%.
We saw this is Task 2 - the lemon rate given age is extremely similar across major American makes.

Characteristic	OR¹	95% CI¹	p-value
car_age
new	—	—
old	1.69	1.28, 2.23	<0.001
odo	1.01	1.01, 1.01	<0.001
car_age * odo
old * odo	1.00	1.00, 1.01	0.063
¹ OR = Odds Ratio, CI = Confidence Interval

We also fitted a non transformed model for the age.
We demonstrate its results in another way.
The following plot shows the predicted probability by age for three levels of odometer reading.
- The levels are the mean, and the mean plus/minus one standard deviation (Grace-Martin 2014).

From this interaction plot we see that:
- Below 3 years of age, the odometer reading has little effect on the probability of lemon.
- After 5 years it is apparent that higher odometer = more lemons.
- The ratio of the predicted probabilities given age is constant \(\exp(\hat{\beta}_{\text{odo}})\).

H10: Online lemons?

No: the lemon rate is exactly the same online and in-store.
The contingency table below reveals a steady 12% lemon rate in both online and retail sales.
The conclusion from the \(\chi^2\) test is that there is no association between online sale and lemon.

	lemon		Total	p-value¹
	0	1	Total	p-value¹
is_online_sale				0.3
0	62,375 (88%)	8,763 (12%)	71,138 (100%)
1	1,632 (88%)	213 (12%)	1,845 (100%)
Total	64,007 (88%)	8,976 (12%)	72,983 (100%)
¹ Pearson’s Chi-squared test

Conclusion

Lemon status is hard to predict.
The chance of lemon increases with car age.
More kilometers on the odometer - higher chance of lemon!
The final classifier probably does not pass reasonable threshold, a higher AUC is desired.
Domain knowledge is lacking and this notably impacts the research.
- A domain expert could aggregate informative predictors that we did not include.
- They could also suggest more advanced transformations of the predictors, such as nonlinear or multivariate.
The Sensitivity of the model can be chosen by the deployer. For example:

All cars are lemons.

This “classifier” has a perfect sensitivity but fails in specificity.
The more lemons we detect - the more good cars we omit.

Suggestions for future works

Focus on studying foreign cars, German and other European makes in particular.
Gather more lemon observations - cooperate with other dealers?
Build a more advanced shiny app to include more models (neural networks for example).

Bibliography

Boehmke, Brad, and Brandon M Greenwell. 2019. Hands-on Machine Learning with r. CRC press.

Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2023. Xgboost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.

Grace-Martin, Karen. 2014. “3 Tips to Make Interpreting Moderation Effects Easier.” The Analysis Factor. https://www.theanalysisfactor.com/3-tips-interpreting-moderation/.

Kuhn, and Max. 2008. “Building Predictive Models in r Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.

Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.

Generating the entire distribution of the origin\(\times\)lemon contingency table is too computationally burdensome, so we used a version that resampled 100,000 of those possible tables to compute the p-value↩︎
We will retain some variables in the model following the advice of domain-experts.↩︎

Business Analytics - Task 3

Grisha Rozenshtein 317302461

Netta Amar 300232444

Hellena Ilanit Elimelech 027788215