Business Analytics - Task 3

Authors

Netta Amar 300232444

Grisha Rozenshtein 317302461

Hellena Ilanit Elimelech 027788215

1 Task 3

1.1 Data preparation

1.1.1 Transformations

  • One-hot encoding was needed for fitting xgboost.
    • We used a trick with the function model.matrix().
  • There are variables in the data which may have come from another model.
    • They are predictions of various price definitions for the given car.
    • Their correlations are very high (almost 1).
    • They share the prefix MMR, which most likely stands for Manheim Market Report (an auction price guide) rather than a regression model.
    • We decided to combine them into a single predictor.
      • We used Principal Components Analysis, a common dimensionality reduction method.
      • A single component explains more than 60% of their common variance.
    • These variables are summarized into one predictor, called MMR.
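The reduction of several highly correlated price columns to one component can be sketched as follows. This is a minimal illustration on made-up data, not the report's own code (which was written in R): the toy `prices` columns, the helper names, and the use of power iteration on the correlation matrix are all assumptions made for the example.

```python
import math
import random

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(columns, means)]
    z = [[(x - m) / s for x in c] for c, m, s in zip(columns, means, sds)]
    p = len(columns)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(corr, iterations=500):
    """Power iteration: leading eigenvalue and eigenvector of a symmetric matrix."""
    p = len(corr)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v

# Hypothetical data: four price columns that are noisy rescalings of one base price.
rng = random.Random(0)
base = [rng.gauss(6000, 2000) for _ in range(500)]
prices = [[b * scale + rng.gauss(0, 300) for b in base]
          for scale in (1.0, 1.1, 1.3, 1.4)]

corr = correlation_matrix(prices)
eigenvalue, loadings = first_component(corr)
# The trace of a correlation matrix equals the number of variables,
# so the share of variance explained by the first component is eigenvalue / p.
explained_share = eigenvalue / len(prices)
print(f"first component explains {explained_share:.1%} of the variance")
```

The single MMR predictor would then be the weighted sum of the standardized columns, using `loadings` as weights.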

Figure 1: Bivariate Pearson correlations of

1.1.2 Imputation

  • We cannot assume a reasonable model to fill in NA values.
  • We used the commonly recommended median and mode imputations.
    • For missing numeric variables - we set NA’s to the median of the respective column.
    • For missing categorical variables - likewise, but with the most frequent value (the mode).
  • This could be improved in various ways, for instance, creating indicator variables for the missingness.
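A minimal sketch of median and mode imputation with an added missingness indicator is given below. The column names and values are invented for illustration, and `impute_column` is a hypothetical helper, not a function from the report.

```python
from statistics import median, mode

def impute_column(values):
    """Fill None with the median (numeric) or the mode (categorical).

    Also returns a 0/1 indicator of which entries were originally missing.
    """
    observed = [v for v in values if v is not None]
    is_numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = median(observed) if is_numeric else mode(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return filled, was_missing

# Hypothetical toy columns
odometer = [71000, None, 65000, 89000, None]
color = ["SILVER", "WHITE", None, "SILVER", "BLUE"]

odo_filled, odo_missing = impute_column(odometer)
color_filled, color_missing = impute_column(color)
print(odo_filled, odo_missing)
print(color_filled, color_missing)
```

Keeping the indicator column lets a downstream model learn whether missingness itself carries information.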

1.1.3 Outliers

  • We found no considerable outliers in the subset of the data that we finally used for this project.

1.2 Selected Research Questions

We briefly describe the outcome for each of the research hypotheses generated for Task 2, including those not studied in this work.

| Hypothesis | Studied | Remark |
|---|---|---|
| 1 | TRUE | |
| 2 | TRUE | |
| 3 | TRUE | Advised by supervisor as preliminary |
| 4 | TRUE | Same as 3 |
| 5 | FALSE | We only present a small module for EDA of the price |
| 6 | FALSE | The geographic data are unbalanced, and no trends were identified. |
| 7 | TRUE | We already presented this in Task 2 |
| 8 | TRUE | We fitted machine learning models to the data and assessed them |
| 9 | FALSE | We found no such evidence |
| 10 | TRUE | There is no difference. |

1.3 Research Question I:

The question as originally formulated:

How do the lemon rates vary across different makes and models? Are there any significant differences in lemon rates between American, Japanese, and other manufacturers?

We took two approaches - a descriptive (graphical) one, and one that is rooted in statistics.

Approach 1 - Visual inspection of the data

  • According to the barplot, there is a clear difference in the lemon rate between makes.
  • However, the origin country is not very informative.
    • Some Japanese makes are reliable, others aren’t, and likewise for American and German cars.

Approach 2 - Statistical hypothesis testing

  • Table with lemon counts and rates.
  • According to a simulated Fisher’s exact test (see footnote 1), the lemon rate depends on the origin country.
| Characteristic | germany, N = 158 | japan, N = 7,331 | other, N = 130 | south korea, N = 4,295 | sweden, N = 37 | usa, N = 61,032 | p-value |
|---|---|---|---|---|---|---|---|
| lemon | 27 (17%) | 1,026 (14%) | 11 (8.5%) | 525 (12%) | 0 (0%) | 7,387 (12%) | <0.001 |

Cell values are n (%).
p-value: Fisher’s Exact Test for Count Data with simulated p-value (based on 1e+05 replicates).
Table 1: Lemon rate by country of origin. Note the variable uncertainty which results from variable sample sizes - some confidence intervals (CI) are much wider.

| Country of origin | Lemons/100 vehicles | Cars in database | 95% CI (as a proportion) |
|---|---|---|---|
| germany | 17.09 | 158 | 0.118-0.241 |
| japan | 14 | 7331 | 0.132-0.148 |
| south korea | 12.22 | 4295 | 0.113-0.132 |
| usa | 12.1 | 61032 | 0.118-0.124 |
| other | 8.462 | 130 | 0.045-0.15 |
| sweden | 0 | 37 | 0-0.117 |
  • In conclusion, although the lemon rate differs significantly between countries of origin, it is the make, rather than the country of origin, that best explains the lemon rate.
  • Actionable: use the findings to promote domestic cars.
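The report does not state how the confidence intervals in Table 1 were computed. The sketch below assumes a Wilson score interval with continuity correction. Under that assumption it reproduces the Germany, Japan and Sweden rows to three decimals. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def wilson_cc_interval(successes, n, confidence=0.95):
    """Wilson score interval for a proportion, with continuity correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    z22n = z * z / (2 * n)

    def bound(p_c, sign):
        half = z * math.sqrt(p_c * (1 - p_c) / n + z22n / (2 * n))
        return (p_c + z22n + sign * half) / (1 + 2 * z22n)

    # Continuity correction: shift the estimate by half an observation.
    p_low = p_hat - 0.5 / n
    p_high = p_hat + 0.5 / n
    lower = 0.0 if p_low <= 0 else max(0.0, bound(p_low, -1))
    upper = 1.0 if p_high >= 1 else min(1.0, bound(p_high, +1))
    return lower, upper

# Lemon counts and group sizes taken from the tables above
for country, lemons, cars in [("germany", 27, 158), ("japan", 1026, 7331), ("sweden", 0, 37)]:
    low, high = wilson_cc_interval(lemons, cars)
    print(f"{country}: {lemons / cars:.3f} [{low:.3f}, {high:.3f}]")
```

The small groups (Germany, Sweden) get wide intervals, while the interval for Japan is narrow because of its much larger sample.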

1.4 Question II: (originally question 1)

What are the key factors, including make, model, year, and other vehicle characteristics, that are most strongly associated with a car being a lemon?

  • We originally speculated that vehicle age, make, and model might be predictors of a lemon.

  • Since the dataset is very large (n=72983), statistical significance may be achieved for factors that have nothing to do with lemon status of a car.

  • We employ the following variable screening procedure:

  • For each variable, we fit a univariate logistic regression model to predict the outcome.

    • We use 70% of the available data, while reserving 30% for the test.
    • We estimate the predictor’s Area Under the ROC Curve (AUC) from the test data.
    • Variables with an AUC close to 0.5 are discarded from further analyses (see footnote 2).
  • The user of the interactive report can see these models and assess their fit.
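The screening idea can be sketched for a single numeric predictor as follows. For a univariate logistic regression the fitted probability is a monotone function of the predictor, so the test-set AUC equals the rank-based AUC of the predictor itself (or one minus it, if the coefficient is negative). The sketch uses that shortcut instead of fitting a model: the direction is taken from the training split, which is a simplification, and the toy data and function names are hypothetical.

```python
import random

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank within a block of ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def screen(x, y, train_fraction=0.7, seed=1):
    """Held-out AUC of one predictor; the direction is chosen on the training split."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * len(idx))
    train, test = idx[:cut], idx[cut:]
    train_auc = auc([x[i] for i in train], [y[i] for i in train])
    direction = 1 if train_auc >= 0.5 else -1
    return auc([direction * x[i] for i in test], [y[i] for i in test])

# Hypothetical toy data: lemon probability rises with age; "noise" is unrelated.
rng = random.Random(42)
age = [rng.randint(1, 9) for _ in range(2000)]
lemon = [int(rng.random() < 0.05 + 0.08 * a) for a in age]
noise = [rng.random() for _ in range(2000)]

age_auc = screen(age, lemon)
noise_auc = screen(noise, lemon)
print(f"age AUC = {age_auc:.3f}, noise AUC = {noise_auc:.3f}")
```

A predictor such as `noise` ends up near 0.5 and would be screened out, while an informative one stays clearly above it.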

1.4.1 The key factors

  • The single strongest predictor of a lemon car is its age.

  • Somewhat surprisingly, the transmission is a weak single predictor.

  • Cost-associated columns (MMR) and the odometer are all strongly correlated with vehicle_age.

| Variable | AUC |
|---|---|
| vehicle_age | 0.6318 |
| mmr_current_auction_average_price | 0.5983 |
| mmr_acquisition_auction_average_price | 0.5977 |
| mmr_current_auction_clean_price | 0.5947 |
| mmr_acquisition_auction_clean_price | 0.5925 |
| mmr_current_retail_average_price | 0.5895 |
| veh_b_cost | 0.589 |
| mmr_current_retail_clean_price | 0.5875 |
| mmr_acquisition_retail_average_price | 0.5758 |
| mmr_acquisiton_retail_clean_price | 0.5723 |
| veh_odo | 0.5712 |
| transmission | 0.5025 |

  • Variables selected due to domain knowledge: vehicle_age, origin, transmission, veh_odo

  • Variables not used in the final analyses: auction, veh_year, make, model, trim, sub_model, color, nationality, size, top_three_american_name, vnzip, vnst, is_online_sale, warranty_cost, purch_date, byrno

  • Variables considered: mmr_acquisition_auction_average_price, mmr_acquisition_auction_clean_price, mmr_acquisition_retail_average_price, mmr_acquisiton_retail_clean_price, mmr_current_auction_average_price, mmr_current_auction_clean_price, mmr_current_retail_average_price, mmr_current_retail_clean_price, veh_b_cost

1.5 Question III: Machine learning model

1.5.1 Models used

  • We used tree-based ensemble methods, which are generally considered to perform well (Boehmke and Greenwell 2019).

  • Random forest - using the R ranger package (Wright and Ziegler 2017).

    • This is a “parallel” ensemble: each tree is fitted independently on its own bootstrap sample of the data and makes its own predictions.
    • The predictions are aggregated by majority vote.
  • XGBoost - using xgboost (Chen et al. 2023).

    • This is a sequential ensemble of trees: each tree is fitted on the prediction errors of the trees before it.
  • In Random Forest, the number of trees only needs to be large enough.

  • In XGBoost, the number of trees needs tuning.

1.5.2 Model performance metric choice: Sensitivity.

  • Models are trained to optimize some metric of their performance.

  • We focused on sensitivity (aka true positive rate): we want to find as many lemon cars as we can.

\[\text{Sens.} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}\]

  • A sensitivity of 50% implies that we correctly identify 50% of the lemons out there.

  • This makes sense because “one rotten apple spoils the pile”.

    • A dissatisfied customer who got a lemon not only costs much, but may also reduce future sales.
  • However, this is not free of cost.

    • In exchange for the high sensitivity we get a low Specificity.
    • This means a lot of false accusations of “lemonness”, thus missing out on good cars.
    • The ROC curve is a plot of all possible (Sens., Spec.) pairs for the model.
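The trade-off between sensitivity and specificity can be illustrated with a small sketch. The labels and scores below are invented, and `sensitivity_specificity` is a hypothetical helper: lowering the classification threshold catches more lemons but rejects more good cars.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for 0/1 labels, where 1 means lemon."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical toy data: 4 lemons, 6 good cars, with model scores
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.7, 0.3, 0.3, 0.1, 0.1, 0.05]

results = {}
for threshold in (0.5, 0.25):
    predicted = [int(s >= threshold) for s in scores]
    results[threshold] = sensitivity_specificity(actual, predicted)
    sens, spec = results[threshold]
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Each threshold gives one (sensitivity, specificity) pair. Sweeping the threshold over all values traces out the ROC curve.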

1.5.3 Model training

  • We trained the models via 5-fold cross-validation, tuning the hyperparameters to optimize sensitivity.
    • The models were trained on the same datasets, via the caret package (Kuhn 2008).
Table 2: Best sensitivity attained by each model on the test set.

| Model | Sensitivity |
|---|---|
| XGB | 13.97% [12.19-15.06] |
| RF | 2.4% [1.54-3.27] |

Figure 2: Variable importances from the two predictive models.

With make

Figure 3: ROC curve for the model that contains the make (XGBoost)

  • We consider a sensitivity of about 15% unacceptable when it comes at the cost of losing about 8.8% of the trades (the specificity of the best model was 91.2%).

  • Explain drawbacks.

1.6 Other Hypotheses

1.6.1 H7: Lemon over time?

  • The notion of

  • The answer depends on how time is defined:

    • If time is the year in which the car was made: no.
    • If time is the current age of the car: yes.
    • The effective time is the odometer reading, which some call “the true car age”.

| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| age | 1.31 | 1.26, 1.38 | <0.001 |
| year | 1.01 | 0.96, 1.05 | 0.8 |
| odo | 1.00 | 1.00, 1.00 | <0.001 |

OR = Odds Ratio, CI = Confidence Interval
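
As a worked reading of the odds ratio for age, assume that age is measured in years and that the model is a logistic regression, so that the log-odds are linear in age. Each additional year then multiplies the odds of a lemon by the OR, and a difference of \(k\) years multiplies them by \(\text{OR}^{k}\). For a car three years older:

\[\text{OR}_{3\ \text{years}} = 1.31^{3} \approx 2.25\]

By the same logic, the odometer's OR of 1.00 is the effect of a single odometer unit. It rounds to 1.00 because one unit is a very small change, not because the odometer has no effect (its p-value is below 0.001).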

1.6.2 H10: Online lemons?

  • No: the lemon rate is essentially the same online and in-store (about 12% in both groups), and the difference is not statistically significant (p = 0.3).

| is_online_sale | lemon = 0 | lemon = 1 | Total | p-value |
|---|---|---|---|---|
| 0 | 62,375 (88%) | 8,763 (12%) | 71,138 (100%) | 0.3 |
| 1 | 1,632 (88%) | 213 (12%) | 1,845 (100%) | |
| Total | 64,007 (88%) | 8,976 (12%) | 72,983 (100%) | |

p-value: Pearson’s Chi-squared test
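
The test can be retraced from the four counts in the table. The sketch below computes Pearson's chi-squared statistic for the 2×2 table without a continuity correction (an assumption, since the report does not say whether one was applied) and gets the p-value from the chi-squared distribution with one degree of freedom. The function name is hypothetical.

```python
import math

def pearson_chi2_2x2(table):
    """Pearson chi-squared statistic and p-value (1 df) for a 2x2 count table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: is_online_sale = 0, 1; columns: lemon = 0, 1 (counts from the table above)
counts = [[62375, 8763], [1632, 213]]
chi2, p_value = pearson_chi2_2x2(counts)

in_store_rate = 8763 / 71138
online_rate = 213 / 1845
print(f"in-store {in_store_rate:.1%}, online {online_rate:.1%}")
print(f"chi2 = {chi2:.3f}, p = {p_value:.2f}")
```

The two rates differ by less than one percentage point, and the resulting p-value of about 0.3 matches the table.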

1.7 Conclusion

  • Lemon status is hard to predict.

1.7.1 Suggestions for t

Boehmke, Brad, and Brandon M. Greenwell. 2019. Hands-On Machine Learning with R. CRC Press.
Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2023. Xgboost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.
Kuhn, Max. 2008. “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.

Footnotes

  1. Generating the entire distribution of the origin\(\times\)lemon contingency table is too computationally burdensome, so we used a version that resampled 100,000 of those possible tables to compute the p-value.

  2. We will retain some variables in the model following the advice of domain experts.