An automotive manufacturer has a data set of customers who have already purchased at least one vehicle. This task is to help this company to target these existing customers for a re-purchase campaign. It is to send a communication to customers who are highly likely to purchase a new vehicle and make a prediction for their re-purchase.
CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven methodology to provide a structured approach to plan a data mining project. As its powerful practicality, flexibility and its usefulness to solve business issues, we take this methodology into our report. 6 steps will be conducted:
For modeling, we approach two types of modeling, linear classification and tree based classification model to predict which customers are most likely to repurchase. Then a best model is selected to output both probabilities and class predictions for a ‘repurchase_validation.csv’ data. Finally, it is validated by the real result to review its performance.
A dataset of customer customer demographics, previous car type bought, the age of the vehicle, and servicing details. All numeric variables were transformed into deciles (integers 1 to 10, each has a similar number of customers). The data dictionary could be briefed below:
## [1] 131337 17
## ID Target age_band gender
## Min. : 1 Min. :0.00000 Length:131337 Length:131337
## 1st Qu.: 38563 1st Qu.:0.00000 Class :character Class :character
## Median : 77132 Median :0.00000 Mode :character Mode :character
## Mean : 77097 Mean :0.02681
## 3rd Qu.:115668 3rd Qu.:0.00000
## Max. :154139 Max. :1.00000
## car_model car_segment age_of_vehicle_years
## Length:131337 Length:131337 Min. : 1.000
## Class :character Class :character 1st Qu.: 3.000
## Mode :character Mode :character Median : 5.000
## Mean : 5.493
## 3rd Qu.: 8.000
## Max. :10.000
## sched_serv_warr non_sched_serv_warr sched_serv_paid non_sched_serv_paid
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 5.000 Median : 5.000 Median : 5.000 Median : 5.000
## Mean : 5.452 Mean : 5.473 Mean : 5.452 Mean : 5.497
## 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## total_paid_services total_services mth_since_last_serv
## Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.00
## Median : 5.000 Median : 5.000 Median : 5.00
## Mean : 5.482 Mean : 5.455 Mean : 5.47
## 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.00
## Max. :10.000 Max. :10.000 Max. :10.00
## annualised_mileage num_dealers_visited num_serv_dealer_purchased
## Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 5.000 Median : 5.000 Median : 5.000
## Mean : 5.503 Mean : 5.485 Mean : 5.481
## 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :10.000 Max. :10.000 Max. :10.000
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 131337 obs. of 17 variables:
## $ ID : num 1 2 3 5 6 7 8 9 10 11 ...
## $ Target : num 0 0 0 0 0 0 0 0 0 0 ...
## $ age_band : chr "3. 35 to 44" "NULL" "NULL" "NULL" ...
## $ gender : chr "Male" "NULL" "Male" "NULL" ...
## $ car_model : chr "model_1" "model_2" "model_3" "model_3" ...
## $ car_segment : chr "LCV" "Small/Medium" "Large/SUV" "Large/SUV" ...
## $ age_of_vehicle_years : num 9 6 9 5 8 7 8 7 1 3 ...
## $ sched_serv_warr : num 2 10 10 8 9 4 2 4 2 1 ...
## $ non_sched_serv_warr : num 10 3 9 5 4 10 8 9 1 1 ...
## $ sched_serv_paid : num 3 10 10 8 10 5 2 6 1 2 ...
## $ non_sched_serv_paid : num 7 4 9 4 7 7 9 9 3 1 ...
## $ total_paid_services : num 5 9 10 5 9 6 9 8 1 2 ...
## $ total_services : num 6 10 10 6 8 8 4 6 2 1 ...
## $ mth_since_last_serv : num 9 6 7 4 5 8 7 9 1 1 ...
## $ annualised_mileage : num 8 10 10 10 4 5 6 5 1 1 ...
## $ num_dealers_visited : num 10 7 6 9 4 10 10 5 2 1 ...
## $ num_serv_dealer_purchased: num 4 10 10 7 9 4 4 8 3 1 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. Target = col_double(),
## .. age_band = col_character(),
## .. gender = col_character(),
## .. car_model = col_character(),
## .. car_segment = col_character(),
## .. age_of_vehicle_years = col_double(),
## .. sched_serv_warr = col_double(),
## .. non_sched_serv_warr = col_double(),
## .. sched_serv_paid = col_double(),
## .. non_sched_serv_paid = col_double(),
## .. total_paid_services = col_double(),
## .. total_services = col_double(),
## .. mth_since_last_serv = col_double(),
## .. annualised_mileage = col_double(),
## .. num_dealers_visited = col_double(),
## .. num_serv_dealer_purchased = col_double()
## .. )
There are 131337 rows and 17 variables in the dataset. The summary for each variable can be seen above.To highlight, variables such as age_band, gender, car_model and car_segment are factor variables while most other variables are number but in deciles.
Plotting the missing variables, gender accounts fro more than 52 percent of missing variables while it is more than 85% for age_band variable.
The number of re-purchased customers makes only 2% of total customers.
The target percent on y axis shows the percent of re-purchased customers over total customers by each category. In most variables, the target percent distributed in all of categories. For example, target percent of car segment divided equally for each category Large/SUV, LCV, Other, Small/Medium. However, few variables do not cover all categories such as the of amount paid for scheduled services, number of scheduled services used under warranty, and total number of services. Five out of ten categories are re-purchased by customers.
Variables, which distributed on few categories, are put into the interaction between each other. The bright blue color presents the high target percent of the cell or category intersection of two variables. Given an example of car segment and total services, it can be seen that there is high percent of re-purchased customers of the combination of Other and category 3.
## [1] 98502
## [1] 32835
## [1] 131337
##
## Call:
## glm(formula = Target ~ car_model + car_segment + age_of_vehicle_years +
## sched_serv_warr + non_sched_serv_warr + sched_serv_paid +
## non_sched_serv_paid + total_paid_services + total_services +
## mth_since_last_serv + annualised_mileage + num_dealers_visited +
## num_serv_dealer_purchased, family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1410 -0.0943 -0.0032 0.0000 5.7972
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.356e+00 1.756e-01 -36.193 < 2e-16 ***
## car_model2 6.070e-01 1.024e-01 5.927 3.09e-09 ***
## car_model3 7.999e-01 1.069e-01 7.485 7.13e-14 ***
## car_model4 4.240e-01 1.273e-01 3.331 0.000866 ***
## car_model5 1.972e-01 1.075e-01 1.834 0.066664 .
## car_model6 2.633e-01 2.271e-01 1.160 0.246240
## car_model7 8.959e-01 1.296e-01 6.912 4.77e-12 ***
## car_model8 8.525e-01 1.483e-01 5.746 9.11e-09 ***
## car_model9 2.772e-01 3.067e-01 0.904 0.366116
## car_model10 -1.111e+00 2.371e-01 -4.685 2.80e-06 ***
## car_model11 -3.411e+00 4.334e-01 -7.872 3.48e-15 ***
## car_model12 -1.293e+00 5.775e-01 -2.239 0.025145 *
## car_model13 1.854e+00 2.799e-01 6.624 3.50e-11 ***
## car_model14 -1.994e+01 2.995e+03 -0.007 0.994687
## car_model15 2.296e+00 7.383e-01 3.110 0.001868 **
## car_model16 1.901e+00 9.152e-01 2.077 0.037785 *
## car_model17 -2.073e+00 1.055e+00 -1.965 0.049383 *
## car_model18 -2.563e+00 6.030e-01 -4.251 2.12e-05 ***
## car_model19 7.878e-01 2.099e+04 0.000 0.999970
## car_segmentLCV NA NA NA NA
## car_segmentOther -1.551e+01 3.961e+03 -0.004 0.996876
## car_segmentSmall/Medium NA NA NA NA
## age_of_vehicle_years2 2.373e-01 1.033e-01 2.298 0.021579 *
## age_of_vehicle_years3 7.539e-01 1.031e-01 7.316 2.56e-13 ***
## age_of_vehicle_years4 2.468e+00 1.181e-01 20.899 < 2e-16 ***
## age_of_vehicle_years5 2.436e+00 1.286e-01 18.937 < 2e-16 ***
## age_of_vehicle_years6 2.341e+00 1.491e-01 15.700 < 2e-16 ***
## age_of_vehicle_years7 1.023e+00 2.049e-01 4.992 5.97e-07 ***
## age_of_vehicle_years8 3.870e-01 2.603e-01 1.486 0.137160
## age_of_vehicle_years9 -5.936e-01 4.444e-01 -1.336 0.181625
## age_of_vehicle_years10 -2.418e+00 1.021e+00 -2.369 0.017840 *
## sched_serv_warr2 3.344e-01 8.221e-02 4.068 4.75e-05 ***
## sched_serv_warr3 -2.500e-01 1.002e-01 -2.496 0.012562 *
## sched_serv_warr4 -1.075e+00 1.278e-01 -8.413 < 2e-16 ***
## sched_serv_warr5 -2.144e+00 1.784e-01 -12.016 < 2e-16 ***
## sched_serv_warr6 -3.538e+00 2.900e-01 -12.203 < 2e-16 ***
## sched_serv_warr7 -1.823e+01 2.208e+02 -0.083 0.934189
## sched_serv_warr8 -5.519e+00 7.634e-01 -7.229 4.85e-13 ***
## sched_serv_warr9 -1.783e+01 2.150e+02 -0.083 0.933890
## sched_serv_warr10 -5.585e+00 1.190e+00 -4.691 2.71e-06 ***
## non_sched_serv_warr2 7.045e-01 1.137e-01 6.198 5.70e-10 ***
## non_sched_serv_warr3 7.658e-01 1.219e-01 6.280 3.38e-10 ***
## non_sched_serv_warr4 9.805e-01 1.674e-01 5.858 4.69e-09 ***
## non_sched_serv_warr5 6.784e-01 1.694e-01 4.005 6.20e-05 ***
## non_sched_serv_warr6 -3.149e-01 2.013e-01 -1.564 0.117705
## non_sched_serv_warr7 -6.150e-01 2.160e-01 -2.846 0.004422 **
## non_sched_serv_warr8 -1.201e+00 2.606e-01 -4.611 4.01e-06 ***
## non_sched_serv_warr9 -1.570e+00 3.552e-01 -4.420 9.88e-06 ***
## non_sched_serv_warr10 -1.554e+00 6.206e-01 -2.504 0.012278 *
## sched_serv_paid2 -1.105e-01 7.484e-02 -1.476 0.139893
## sched_serv_paid3 -3.313e-01 9.009e-02 -3.678 0.000235 ***
## sched_serv_paid4 -9.795e-01 1.251e-01 -7.827 4.98e-15 ***
## sched_serv_paid5 -1.074e+00 1.534e-01 -7.000 2.56e-12 ***
## sched_serv_paid6 -2.717e+00 3.324e-01 -8.174 2.99e-16 ***
## sched_serv_paid7 -2.685e+00 4.652e-01 -5.771 7.90e-09 ***
## sched_serv_paid8 -2.376e+00 6.219e-01 -3.821 0.000133 ***
## sched_serv_paid9 -1.730e+00 7.741e-01 -2.235 0.025442 *
## sched_serv_paid10 -4.973e-01 5.939e-01 -0.837 0.402416
## non_sched_serv_paid2 -1.464e-01 9.793e-02 -1.495 0.134792
## non_sched_serv_paid3 2.397e-02 1.024e-01 0.234 0.814982
## non_sched_serv_paid4 -2.391e-01 1.812e-01 -1.320 0.186986
## non_sched_serv_paid5 1.664e-01 2.061e-01 0.807 0.419437
## non_sched_serv_paid6 4.730e-01 2.836e-01 1.668 0.095374 .
## non_sched_serv_paid7 1.250e+00 3.867e-01 3.232 0.001228 **
## non_sched_serv_paid8 1.413e+00 4.510e-01 3.132 0.001734 **
## non_sched_serv_paid9 2.005e+00 5.121e-01 3.915 9.05e-05 ***
## non_sched_serv_paid10 2.101e+00 6.387e-01 3.290 0.001004 **
## total_paid_services2 2.495e-01 8.698e-02 2.869 0.004117 **
## total_paid_services3 7.622e-01 1.588e-01 4.799 1.59e-06 ***
## total_paid_services4 3.823e-01 2.547e-01 1.501 0.133344
## total_paid_services5 -3.233e-01 3.705e-01 -0.873 0.382858
## total_paid_services6 -3.612e-01 4.336e-01 -0.833 0.404884
## total_paid_services7 -3.211e-01 4.785e-01 -0.671 0.502212
## total_paid_services8 -7.466e-01 5.371e-01 -1.390 0.164526
## total_paid_services9 -3.464e-01 6.256e-01 -0.554 0.579796
## total_paid_services10 -2.421e-01 6.654e-01 -0.364 0.716009
## total_services2 6.133e-01 9.544e-02 6.427 1.31e-10 ***
## total_services3 8.386e-02 1.405e-01 0.597 0.550654
## total_services4 -9.269e-01 1.789e-01 -5.182 2.19e-07 ***
## total_services5 -1.700e+00 2.386e-01 -7.125 1.04e-12 ***
## total_services6 -3.822e+00 3.544e-01 -10.782 < 2e-16 ***
## total_services7 -5.729e+00 4.929e-01 -11.621 < 2e-16 ***
## total_services8 -6.202e+00 6.548e-01 -9.471 < 2e-16 ***
## total_services9 -8.457e+00 9.220e-01 -9.172 < 2e-16 ***
## total_services10 -1.021e+01 1.092e+00 -9.352 < 2e-16 ***
## mth_since_last_serv2 5.520e-01 8.773e-02 6.292 3.13e-10 ***
## mth_since_last_serv3 7.032e-01 1.146e-01 6.138 8.34e-10 ***
## mth_since_last_serv4 5.867e-01 1.178e-01 4.983 6.27e-07 ***
## mth_since_last_serv5 -2.559e-01 1.300e-01 -1.969 0.048984 *
## mth_since_last_serv6 -1.459e+00 1.659e-01 -8.797 < 2e-16 ***
## mth_since_last_serv7 -3.258e+00 2.746e-01 -11.864 < 2e-16 ***
## mth_since_last_serv8 -6.670e+00 1.011e+00 -6.595 4.25e-11 ***
## mth_since_last_serv9 -5.224e+00 3.496e-01 -14.942 < 2e-16 ***
## mth_since_last_serv10 -1.826e+01 2.330e+02 -0.078 0.937532
## annualised_mileage2 3.657e-01 1.003e-01 3.644 0.000268 ***
## annualised_mileage3 3.243e+00 1.317e-01 24.632 < 2e-16 ***
## annualised_mileage4 2.944e+00 1.332e-01 22.109 < 2e-16 ***
## annualised_mileage5 3.414e+00 1.360e-01 25.094 < 2e-16 ***
## annualised_mileage6 4.315e+00 1.456e-01 29.630 < 2e-16 ***
## annualised_mileage7 3.722e+00 1.605e-01 23.181 < 2e-16 ***
## annualised_mileage8 3.026e+00 1.931e-01 15.670 < 2e-16 ***
## annualised_mileage9 2.865e+00 2.191e-01 13.074 < 2e-16 ***
## annualised_mileage10 -2.042e-01 6.279e-01 -0.325 0.745053
## num_dealers_visited2 3.977e-01 9.413e-02 4.224 2.40e-05 ***
## num_dealers_visited3 -2.027e-01 1.316e-01 -1.540 0.123584
## num_dealers_visited4 -8.583e-01 1.757e-01 -4.885 1.04e-06 ***
## num_dealers_visited5 -6.213e-01 1.687e-01 -3.684 0.000230 ***
## num_dealers_visited6 -6.330e-01 1.641e-01 -3.857 0.000115 ***
## num_dealers_visited7 -5.319e-01 1.628e-01 -3.268 0.001082 **
## num_dealers_visited8 -6.442e-01 1.686e-01 -3.820 0.000133 ***
## num_dealers_visited9 9.391e-01 1.919e-01 4.893 9.94e-07 ***
## num_dealers_visited10 1.499e+00 1.997e-01 7.507 6.05e-14 ***
## num_serv_dealer_purchased2 5.410e-01 1.095e-01 4.940 7.82e-07 ***
## num_serv_dealer_purchased3 6.212e-01 1.094e-01 5.677 1.37e-08 ***
## num_serv_dealer_purchased4 -8.271e-01 1.606e-01 -5.149 2.62e-07 ***
## num_serv_dealer_purchased5 -7.672e-02 1.576e-01 -0.487 0.626457
## num_serv_dealer_purchased6 1.030e+00 1.663e-01 6.194 5.86e-10 ***
## num_serv_dealer_purchased7 2.610e+00 1.888e-01 13.828 < 2e-16 ***
## num_serv_dealer_purchased8 4.856e+00 2.633e-01 18.440 < 2e-16 ***
## num_serv_dealer_purchased9 5.707e+00 3.358e-01 16.994 < 2e-16 ***
## num_serv_dealer_purchased10 8.317e+00 5.458e-01 15.238 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24138.6 on 98501 degrees of freedom
## Residual deviance: 9729.1 on 98383 degrees of freedom
## AIC: 9967.1
##
## Number of Fisher Scoring iterations: 20
A logistic regression model is built with all variables except ID, age banded, gender. ID is not related to the prediction while age banded and gender are missing high proportion of values so these are taken out of the predicted variables.
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Target
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 98501 24138.6
## car_model 18 195.9 98483 23942.8 < 2.2e-16 ***
## car_segment 1 0.9 98482 23941.8 0.3383
## age_of_vehicle_years 9 2567.8 98473 21374.1 < 2.2e-16 ***
## sched_serv_warr 9 4018.9 98464 17355.2 < 2.2e-16 ***
## non_sched_serv_warr 9 712.5 98455 16642.7 < 2.2e-16 ***
## sched_serv_paid 9 182.0 98446 16460.7 < 2.2e-16 ***
## non_sched_serv_paid 9 170.0 98437 16290.8 < 2.2e-16 ***
## total_paid_services 9 37.0 98428 16253.7 2.612e-05 ***
## total_services 9 177.5 98419 16076.3 < 2.2e-16 ***
## mth_since_last_serv 9 2450.9 98410 13625.4 < 2.2e-16 ***
## annualised_mileage 9 2082.7 98401 11542.7 < 2.2e-16 ***
## num_dealers_visited 9 215.1 98392 11327.6 < 2.2e-16 ***
## num_serv_dealer_purchased 9 1598.5 98383 9729.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Performing ANOVA Chi-square test could help check the overall effect of variables on the dependent variable. From the summary, car_segment variable is not significant. We then build another model without car_segment to see if we can achieve lower AIC value.
##
## Call:
## glm(formula = Target ~ car_model + age_of_vehicle_years + sched_serv_warr +
## non_sched_serv_warr + sched_serv_paid + non_sched_serv_paid +
## total_paid_services + total_services + mth_since_last_serv +
## annualised_mileage + num_dealers_visited + num_serv_dealer_purchased,
## family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1411 -0.0943 -0.0032 0.0000 5.7973
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.357e+00 1.756e-01 -36.194 < 2e-16 ***
## car_model2 6.070e-01 1.024e-01 5.926 3.10e-09 ***
## car_model3 7.999e-01 1.069e-01 7.485 7.14e-14 ***
## car_model4 4.240e-01 1.273e-01 3.331 0.000866 ***
## car_model5 1.972e-01 1.075e-01 1.834 0.066676 .
## car_model6 2.633e-01 2.271e-01 1.159 0.246279
## car_model7 8.959e-01 1.296e-01 6.912 4.78e-12 ***
## car_model8 8.524e-01 1.483e-01 5.746 9.13e-09 ***
## car_model9 2.771e-01 3.067e-01 0.904 0.366151
## car_model10 -1.111e+00 2.371e-01 -4.685 2.80e-06 ***
## car_model11 -3.412e+00 4.334e-01 -7.873 3.45e-15 ***
## car_model12 -1.293e+00 5.775e-01 -2.239 0.025129 *
## car_model13 1.854e+00 2.799e-01 6.624 3.51e-11 ***
## car_model14 -1.994e+01 2.995e+03 -0.007 0.994687
## car_model15 2.296e+00 7.383e-01 3.110 0.001868 **
## car_model16 1.901e+00 9.152e-01 2.077 0.037821 *
## car_model17 -2.199e+00 1.046e+00 -2.102 0.035554 *
## car_model18 -2.564e+00 6.030e-01 -4.252 2.12e-05 ***
## car_model19 -1.472e+01 2.061e+04 -0.001 0.999430
## age_of_vehicle_years2 2.379e-01 1.033e-01 2.304 0.021239 *
## age_of_vehicle_years3 7.545e-01 1.031e-01 7.322 2.45e-13 ***
## age_of_vehicle_years4 2.468e+00 1.181e-01 20.905 < 2e-16 ***
## age_of_vehicle_years5 2.437e+00 1.286e-01 18.945 < 2e-16 ***
## age_of_vehicle_years6 2.342e+00 1.491e-01 15.706 < 2e-16 ***
## age_of_vehicle_years7 1.023e+00 2.049e-01 4.994 5.91e-07 ***
## age_of_vehicle_years8 3.874e-01 2.603e-01 1.488 0.136706
## age_of_vehicle_years9 -5.935e-01 4.444e-01 -1.336 0.181706
## age_of_vehicle_years10 -2.418e+00 1.021e+00 -2.369 0.017847 *
## sched_serv_warr2 3.344e-01 8.221e-02 4.067 4.76e-05 ***
## sched_serv_warr3 -2.500e-01 1.002e-01 -2.496 0.012576 *
## sched_serv_warr4 -1.075e+00 1.278e-01 -8.413 < 2e-16 ***
## sched_serv_warr5 -2.144e+00 1.784e-01 -12.016 < 2e-16 ***
## sched_serv_warr6 -3.539e+00 2.900e-01 -12.203 < 2e-16 ***
## sched_serv_warr7 -1.823e+01 2.208e+02 -0.083 0.934188
## sched_serv_warr8 -5.519e+00 7.634e-01 -7.229 4.85e-13 ***
## sched_serv_warr9 -1.783e+01 2.150e+02 -0.083 0.933890
## sched_serv_warr10 -5.585e+00 1.191e+00 -4.691 2.72e-06 ***
## non_sched_serv_warr2 7.047e-01 1.137e-01 6.200 5.64e-10 ***
## non_sched_serv_warr3 7.660e-01 1.219e-01 6.282 3.34e-10 ***
## non_sched_serv_warr4 9.806e-01 1.674e-01 5.859 4.67e-09 ***
## non_sched_serv_warr5 6.782e-01 1.694e-01 4.004 6.23e-05 ***
## non_sched_serv_warr6 -3.144e-01 2.013e-01 -1.562 0.118316
## non_sched_serv_warr7 -6.148e-01 2.161e-01 -2.845 0.004435 **
## non_sched_serv_warr8 -1.201e+00 2.606e-01 -4.610 4.03e-06 ***
## non_sched_serv_warr9 -1.569e+00 3.552e-01 -4.417 1.00e-05 ***
## non_sched_serv_warr10 -1.553e+00 6.206e-01 -2.503 0.012324 *
## sched_serv_paid2 -1.106e-01 7.485e-02 -1.477 0.139568
## sched_serv_paid3 -3.313e-01 9.009e-02 -3.677 0.000236 ***
## sched_serv_paid4 -9.797e-01 1.251e-01 -7.829 4.91e-15 ***
## sched_serv_paid5 -1.074e+00 1.534e-01 -7.001 2.53e-12 ***
## sched_serv_paid6 -2.717e+00 3.324e-01 -8.174 2.97e-16 ***
## sched_serv_paid7 -2.685e+00 4.653e-01 -5.771 7.89e-09 ***
## sched_serv_paid8 -2.376e+00 6.219e-01 -3.821 0.000133 ***
## sched_serv_paid9 -1.730e+00 7.741e-01 -2.235 0.025444 *
## sched_serv_paid10 -4.972e-01 5.939e-01 -0.837 0.402452
## non_sched_serv_paid2 -1.465e-01 9.793e-02 -1.496 0.134590
## non_sched_serv_paid3 2.389e-02 1.024e-01 0.233 0.815631
## non_sched_serv_paid4 -2.401e-01 1.812e-01 -1.325 0.185057
## non_sched_serv_paid5 1.658e-01 2.061e-01 0.804 0.421326
## non_sched_serv_paid6 4.726e-01 2.836e-01 1.666 0.095698 .
## non_sched_serv_paid7 1.250e+00 3.867e-01 3.232 0.001231 **
## non_sched_serv_paid8 1.412e+00 4.510e-01 3.132 0.001738 **
## non_sched_serv_paid9 2.004e+00 5.121e-01 3.914 9.06e-05 ***
## non_sched_serv_paid10 2.101e+00 6.387e-01 3.289 0.001005 **
## total_paid_services2 2.496e-01 8.698e-02 2.869 0.004115 **
## total_paid_services3 7.630e-01 1.588e-01 4.805 1.55e-06 ***
## total_paid_services4 3.832e-01 2.547e-01 1.505 0.132435
## total_paid_services5 -3.226e-01 3.705e-01 -0.871 0.383915
## total_paid_services6 -3.607e-01 4.336e-01 -0.832 0.405479
## total_paid_services7 -3.207e-01 4.785e-01 -0.670 0.502735
## total_paid_services8 -7.463e-01 5.371e-01 -1.389 0.164708
## total_paid_services9 -3.461e-01 6.256e-01 -0.553 0.580116
## total_paid_services10 -2.418e-01 6.654e-01 -0.363 0.716332
## total_services2 6.133e-01 9.544e-02 6.426 1.31e-10 ***
## total_services3 8.344e-02 1.405e-01 0.594 0.552611
## total_services4 -9.272e-01 1.789e-01 -5.183 2.18e-07 ***
## total_services5 -1.701e+00 2.386e-01 -7.127 1.03e-12 ***
## total_services6 -3.823e+00 3.545e-01 -10.784 < 2e-16 ***
## total_services7 -5.729e+00 4.929e-01 -11.623 < 2e-16 ***
## total_services8 -6.203e+00 6.549e-01 -9.473 < 2e-16 ***
## total_services9 -8.457e+00 9.220e-01 -9.173 < 2e-16 ***
## total_services10 -1.021e+01 1.092e+00 -9.353 < 2e-16 ***
## mth_since_last_serv2 5.519e-01 8.773e-02 6.291 3.16e-10 ***
## mth_since_last_serv3 7.035e-01 1.146e-01 6.141 8.22e-10 ***
## mth_since_last_serv4 5.868e-01 1.178e-01 4.984 6.24e-07 ***
## mth_since_last_serv5 -2.557e-01 1.300e-01 -1.967 0.049126 *
## mth_since_last_serv6 -1.459e+00 1.659e-01 -8.797 < 2e-16 ***
## mth_since_last_serv7 -3.258e+00 2.746e-01 -11.863 < 2e-16 ***
## mth_since_last_serv8 -6.670e+00 1.011e+00 -6.595 4.26e-11 ***
## mth_since_last_serv9 -5.224e+00 3.496e-01 -14.942 < 2e-16 ***
## mth_since_last_serv10 -1.826e+01 2.330e+02 -0.078 0.937537
## annualised_mileage2 3.657e-01 1.003e-01 3.645 0.000268 ***
## annualised_mileage3 3.243e+00 1.316e-01 24.631 < 2e-16 ***
## annualised_mileage4 2.944e+00 1.332e-01 22.109 < 2e-16 ***
## annualised_mileage5 3.414e+00 1.360e-01 25.095 < 2e-16 ***
## annualised_mileage6 4.315e+00 1.456e-01 29.630 < 2e-16 ***
## annualised_mileage7 3.721e+00 1.605e-01 23.180 < 2e-16 ***
## annualised_mileage8 3.026e+00 1.931e-01 15.669 < 2e-16 ***
## annualised_mileage9 2.865e+00 2.192e-01 13.073 < 2e-16 ***
## annualised_mileage10 -2.046e-01 6.279e-01 -0.326 0.744527
## num_dealers_visited2 3.975e-01 9.414e-02 4.222 2.42e-05 ***
## num_dealers_visited3 -2.024e-01 1.316e-01 -1.538 0.124151
## num_dealers_visited4 -8.580e-01 1.757e-01 -4.883 1.04e-06 ***
## num_dealers_visited5 -6.210e-01 1.687e-01 -3.682 0.000232 ***
## num_dealers_visited6 -6.324e-01 1.641e-01 -3.854 0.000116 ***
## num_dealers_visited7 -5.315e-01 1.628e-01 -3.266 0.001092 **
## num_dealers_visited8 -6.438e-01 1.686e-01 -3.818 0.000135 ***
## num_dealers_visited9 9.398e-01 1.920e-01 4.896 9.79e-07 ***
## num_dealers_visited10 1.500e+00 1.997e-01 7.509 5.94e-14 ***
## num_serv_dealer_purchased2 5.410e-01 1.095e-01 4.940 7.82e-07 ***
## num_serv_dealer_purchased3 6.211e-01 1.094e-01 5.676 1.38e-08 ***
## num_serv_dealer_purchased4 -8.275e-01 1.606e-01 -5.151 2.59e-07 ***
## num_serv_dealer_purchased5 -7.694e-02 1.576e-01 -0.488 0.625468
## num_serv_dealer_purchased6 1.030e+00 1.663e-01 6.191 5.98e-10 ***
## num_serv_dealer_purchased7 2.610e+00 1.888e-01 13.825 < 2e-16 ***
## num_serv_dealer_purchased8 4.855e+00 2.633e-01 18.437 < 2e-16 ***
## num_serv_dealer_purchased9 5.707e+00 3.358e-01 16.993 < 2e-16 ***
## num_serv_dealer_purchased10 8.317e+00 5.458e-01 15.237 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24138.6 on 98501 degrees of freedom
## Residual deviance: 9729.3 on 98384 degrees of freedom
## AIC: 9965.3
##
## Number of Fisher Scoring iterations: 20
The AIC is lower than the previous model. Again, we can use ANOVA test to compare both models.
## Analysis of Deviance Table
##
## Model 1: Target ~ car_model + age_of_vehicle_years + sched_serv_warr +
## non_sched_serv_warr + sched_serv_paid + non_sched_serv_paid +
## total_paid_services + total_services + mth_since_last_serv +
## annualised_mileage + num_dealers_visited + num_serv_dealer_purchased
## Model 2: Target ~ car_model + car_segment + age_of_vehicle_years + sched_serv_warr +
## non_sched_serv_warr + sched_serv_paid + non_sched_serv_paid +
## total_paid_services + total_services + mth_since_last_serv +
## annualised_mileage + num_dealers_visited + num_serv_dealer_purchased
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 98384 9729.3
## 2 98383 9729.1 1 0.23465 0.6281
The greater than 0.05 of p presents its evidence that the second model is better than the first model. Now it is time to evaluate the model’s performance running the model on the test data and consider few metrics as confusion matrix, accuracy.
## true
## predicted 0 1
## 0 31850 426
## 1 79 480
## [1] 0.9846201
The accuracy is greater than 98%, which is quite high. The wrong prediction of re-purchased customers while they actually do not buy new ones is lower than the reverse case. So this prediciton is good if commpany wants to manufacture less cars than the customers’ needs and maximise their profit.
Next, we want to see if we build the tree based classification model to compare with our above logistic regression model in terms of its accuracy of prediction.
##
## Call:
## randomForest(formula = Target ~ ., data = df_train_rf, xtest = df_test_rf[setdiff(names(df_test_rf), "Target")], ytest = df_test_rf$Target)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.81%
## Confusion matrix:
## 0 1 class.error
## 0 95778 109 0.001136755
## 1 692 1923 0.264627151
## Test set error rate: 0.83%
## Confusion matrix:
## 0 1 class.error
## 0 31900 29 0.0009082652
## 1 245 661 0.2704194260
## [1] 0.9916471
## [1] 0.9918655
Again, we build the random forest model which excluded variables age banded, gender and ID. The accuracy on is 0.99 which is higher than 0.98 of logistic regression.
Here, random forests provide a handful of tuning parameters to tune the performance of the model. The key parameter is to adjust the nummber of candidate variables at each split. However there are a few additional hyperparameters as below:
## [1] 80
## mtry node_size sampe_size Accuracy OOB_RMSE
## 1 6 3 0.800 0.9931677 0.006832349
## 2 4 3 0.800 0.9931575 0.006842501
## 3 4 5 0.632 0.9931473 0.006852653
## 4 6 3 0.700 0.9931372 0.006862805
## 5 6 5 0.700 0.9931169 0.006883109
## 6 6 5 0.800 0.9931169 0.006883109
## 7 6 7 0.800 0.9931169 0.006883109
## 8 6 9 0.700 0.9931067 0.006893261
## 9 4 7 0.800 0.9931067 0.006893261
## 10 4 3 0.632 0.9930966 0.006903413
We built a grid of hyperparameter combination and then run it through ranger, a C++ implementation of Brieman’s random forest algorith. Then we can compare the performance for each combination.
From above table, the combination (mtry, node_size, sample_size) as (6, 3, 0.8) give the highest accuracy and lowest prediction error.
With a hyperparameter combination of (mtry, node_size, sample_size) as (6, 3, 0.8), we run it 100 times and plot its changing values on histogram, we can see the high concentration of accuracy greater than 0.993. Its stability gives us confidence that this model is little bit better than previous random forest.
## [1] 0.9846201
## [1] 0.9930966
Evaluate the model on the test data, we found the random forest performs better than logistic regression model, which has accuracy of 0.993 than 0.984.
## car_model car_segment
## 108.13080 40.80992
## age_of_vehicle_years sched_serv_warr
## 375.29651 248.79636
## non_sched_serv_warr sched_serv_paid
## 251.74077 177.64189
## non_sched_serv_paid total_paid_services
## 145.70770 245.76113
## total_services mth_since_last_serv
## 456.63991 731.32864
## annualised_mileage num_dealers_visited
## 469.65603 256.87515
## num_serv_dealer_purchased
## 463.90761
| Df | Deviance | Resid. | Df Resid. | Dev Pr(>Chi) | |
|---|---|---|---|---|---|
| NULL | 98501 | 24138.6 | |||
| car_model | 18 | 195.9 | 98483 | 23942.8 | < 2.2e-16 *** |
| car_segment | 1 | 0.9 | 98482 | 23941.8 | 0.3383 |
| age_of_vehicle_years | 9 | 2567.8 | 98473 | 21374.1 | < 2.2e-16 *** |
| sched_serv_warr | 9 | 4018.9 | 98464 | 17355.2 | < 2.2e-16 *** |
| non_sched_serv_warr | 9 | 712.5 | 98455 | 16642.7 | < 2.2e-16 *** |
| sched_serv_paid | 9 | 182.0 | 98446 | 16460.7 | < 2.2e-16 *** |
| non_sched_serv_paid | 9 | 170.0 | 98437 | 16290.8 | < 2.2e-16 *** |
| total_paid_services | 9 | 37.0 | 98428 | 16253.7 | 2.612e-05 *** |
| total_services | 9 | 177.5 | 98419 | 16076.3 | < 2.2e-16 *** |
| mth_since_last_serv | 9 | 2450.9 | 98410 | 13625.4 | < 2.2e-16 *** |
| annualised_mileage | 9 | 2082.7 | 98401 | 11542.7 | < 2.2e-16 *** |
| num_dealers_visited | 9 | 215.1 | 98392 | 11327.6 | < 2.2e-16 *** |
| num_serv_dealer_purchased | 9 | 1598.5 | 98383 | 9729.1 | < 2.2e-16 *** |
With logistic regression model, the ANOVA table presents the deviance changes on the introduction of new variables. We found top variables generate high deviance reduction as sched_serv_warr, age_of_vehicle_years, mth_since_last_serv, annualised_mileage, num_serv_dealer_purchased. Meanwhile, random forest generates same set of top variables mth_since_last_serv, annualised_mileage, num_serv_dealer_purchased, age_of_vehicle_years. However, the priorities of significance between two models is different.
In business sense, high importance variables drive the significant impact on the outcome values. Given our business case, it is intuitive to understand that high importance fall into variables as age of their last vehicle or annualised vehicle mileage or number of services had at the same dealer where the vehicle was purchased. Customers are likely to buy new car when their existing one is aged or broken which needs high number of services.
In this report, we apply both logistic regression and random forest model to predict whether customers make a re-purchase a car. The data set to train into models are existing customers including information about customer demographics, previous car type bought, the age of the vehicle, and servicing details.
Before doing the model training, variables as ID, age banded, gender are excluded because its high proportion of missing or unrelated data. With logistic regression model, two versions of models were built to train the data, one with full variables and other with reduced variables. The reduced variable is slightly better than the other one, which gives lower AIC. Random forest model proves to be best when it gives higher accuracy of 0.993 higher than 0.984 of logistic regression model. With random forest, we made a few tuning by running model through a few hyper parameter combination to find out best performed one.
Both logistic regression and random forest model give same set of top high important variables but the priorities of these variables are not the same. It is intuitive to explain its importance as customers are likely to buy new car when their existing one is aged or broken.
https://stackoverflow.com/questions/25342646/merge-two-matrices-by-column-names https://stackoverflow.com/questions/60625837/heatmap-showing-two-variables-per-cell https://dplyr.tidyverse.org/articles/programming.html https://leaherb.com/pass-column-name-as-parameter-to-a-function-using-dplyr/ http://www.sthda.com/english/wiki/colors-in-r https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/logistic-regression-analysis-r/tutorial/ https://stats.stackexchange.com/questions/274151/anova-to-compare-models https://stats.stackexchange.com/questions/271853/null-hypothesis-of-an-anova-when-comparing-regression-models https://stats.stackexchange.com/questions/172782/how-to-use-r-anova-results-to-select-best-model https://cran.rstudio.com/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html https://www.r-bloggers.com/explaining-predictions-random-forest-post-hoc-analysis-randomforestexplainer-package/ https://stackoverflow.com/questions/14996619/random-forest-output-interpretation https://stats.stackexchange.com/questions/401437/the-importance-in-randomforest-returns-different-results-how-to-interpret-thi https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5 https://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/ https://stats.stackexchange.com/questions/172842/best-practices-with-data-wrangling-before-running-random-forest-predictions https://www.edureka.co/blog/random-forest-classifier/#Practical%20Implementation%20of%20Random%20Forest%20In%20R https://datascienceplus.com/random-forests-in-r/ https://uc-r.github.io/random_forests https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6 https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/