Assignment 2 for DAM

0.1 Introduction
0.2 Business understanding
0.3 Ẽxploration Data Analysis
0.4 Logistic regression model
0.5 Random forest model
0.6 Important variables
0.7 Conclusion
0.8 References

0.1 Introduction

An automotive manufacturer has a data set of customers who have already purchased at least one vehicle. This task is to help this company to target these existing customers for a re-purchase campaign. It is to send a communication to customers who are highly likely to purchase a new vehicle and make a prediction for their re-purchase.

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven methodology to provide a structured approach to plan a data mining project. As its powerful practicality, flexibility and its usefulness to solve business issues, we take this methodology into our report. 6 steps will be conducted:

Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment

For modeling, we approach two types of modeling, linear classification and tree based classification model to predict which customers are most likely to repurchase. Then a best model is selected to output both probabilities and class predictions for a ‘repurchase_validation.csv’ data. Finally, it is validated by the real result to review its performance.

0.2 Business understanding

A dataset of customer customer demographics, previous car type bought, the age of the vehicle, and servicing details. All numeric variables were transformed into deciles (integers 1 to 10, each has a similar number of customers). The data dictionary could be briefed below:

ID: Unique ID of customer
Target: Model target. 1 if the customer has purchased more than 1 vehicle, 0 if they have only purchased 1.
Age: Age banded into categories
Gender: Male, Female or Missing
Car model: The model of vehicle, 18 models in total
Car segment: The type of vehicle
Age of vehicles
Number of scheduled services (e.g. regular check-ups) used under warranty, in deciles
Number of non-scheduled services (e.g. something broke out of the service cycle) used under warranty, in deciles
Amount paid for scheduled services, in deciles
Amount paid for non scheduled services, in deciles
Amount paid in total for services, in deciles
Total number of services, in deciles
The number of months since the last service, in deciles
Annualised vehicle mileage, in deciles
Number of different dealers visited for servicing, in deciles
Number of services had at the same dealer where the vehicle was purchased, in deciles

0.3 Ẽxploration Data Analysis

## [1] 131337     17

##        ID             Target          age_band            gender         
##  Min.   :     1   Min.   :0.00000   Length:131337      Length:131337     
##  1st Qu.: 38563   1st Qu.:0.00000   Class :character   Class :character  
##  Median : 77132   Median :0.00000   Mode  :character   Mode  :character  
##  Mean   : 77097   Mean   :0.02681                                        
##  3rd Qu.:115668   3rd Qu.:0.00000                                        
##  Max.   :154139   Max.   :1.00000                                        
##   car_model         car_segment        age_of_vehicle_years
##  Length:131337      Length:131337      Min.   : 1.000      
##  Class :character   Class :character   1st Qu.: 3.000      
##  Mode  :character   Mode  :character   Median : 5.000      
##                                        Mean   : 5.493      
##                                        3rd Qu.: 8.000      
##                                        Max.   :10.000      
##  sched_serv_warr  non_sched_serv_warr sched_serv_paid  non_sched_serv_paid
##  Min.   : 1.000   Min.   : 1.000      Min.   : 1.000   Min.   : 1.000     
##  1st Qu.: 3.000   1st Qu.: 3.000      1st Qu.: 3.000   1st Qu.: 3.000     
##  Median : 5.000   Median : 5.000      Median : 5.000   Median : 5.000     
##  Mean   : 5.452   Mean   : 5.473      Mean   : 5.452   Mean   : 5.497     
##  3rd Qu.: 8.000   3rd Qu.: 8.000      3rd Qu.: 8.000   3rd Qu.: 8.000     
##  Max.   :10.000   Max.   :10.000      Max.   :10.000   Max.   :10.000     
##  total_paid_services total_services   mth_since_last_serv
##  Min.   : 1.000      Min.   : 1.000   Min.   : 1.00      
##  1st Qu.: 3.000      1st Qu.: 3.000   1st Qu.: 3.00      
##  Median : 5.000      Median : 5.000   Median : 5.00      
##  Mean   : 5.482      Mean   : 5.455   Mean   : 5.47      
##  3rd Qu.: 8.000      3rd Qu.: 8.000   3rd Qu.: 8.00      
##  Max.   :10.000      Max.   :10.000   Max.   :10.00      
##  annualised_mileage num_dealers_visited num_serv_dealer_purchased
##  Min.   : 1.000     Min.   : 1.000      Min.   : 1.000           
##  1st Qu.: 3.000     1st Qu.: 3.000      1st Qu.: 3.000           
##  Median : 5.000     Median : 5.000      Median : 5.000           
##  Mean   : 5.503     Mean   : 5.485      Mean   : 5.481           
##  3rd Qu.: 8.000     3rd Qu.: 8.000      3rd Qu.: 8.000           
##  Max.   :10.000     Max.   :10.000      Max.   :10.000

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 131337 obs. of  17 variables:
##  $ ID                       : num  1 2 3 5 6 7 8 9 10 11 ...
##  $ Target                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ age_band                 : chr  "3. 35 to 44" "NULL" "NULL" "NULL" ...
##  $ gender                   : chr  "Male" "NULL" "Male" "NULL" ...
##  $ car_model                : chr  "model_1" "model_2" "model_3" "model_3" ...
##  $ car_segment              : chr  "LCV" "Small/Medium" "Large/SUV" "Large/SUV" ...
##  $ age_of_vehicle_years     : num  9 6 9 5 8 7 8 7 1 3 ...
##  $ sched_serv_warr          : num  2 10 10 8 9 4 2 4 2 1 ...
##  $ non_sched_serv_warr      : num  10 3 9 5 4 10 8 9 1 1 ...
##  $ sched_serv_paid          : num  3 10 10 8 10 5 2 6 1 2 ...
##  $ non_sched_serv_paid      : num  7 4 9 4 7 7 9 9 3 1 ...
##  $ total_paid_services      : num  5 9 10 5 9 6 9 8 1 2 ...
##  $ total_services           : num  6 10 10 6 8 8 4 6 2 1 ...
##  $ mth_since_last_serv      : num  9 6 7 4 5 8 7 9 1 1 ...
##  $ annualised_mileage       : num  8 10 10 10 4 5 6 5 1 1 ...
##  $ num_dealers_visited      : num  10 7 6 9 4 10 10 5 2 1 ...
##  $ num_serv_dealer_purchased: num  4 10 10 7 9 4 4 8 3 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ID = col_double(),
##   ..   Target = col_double(),
##   ..   age_band = col_character(),
##   ..   gender = col_character(),
##   ..   car_model = col_character(),
##   ..   car_segment = col_character(),
##   ..   age_of_vehicle_years = col_double(),
##   ..   sched_serv_warr = col_double(),
##   ..   non_sched_serv_warr = col_double(),
##   ..   sched_serv_paid = col_double(),
##   ..   non_sched_serv_paid = col_double(),
##   ..   total_paid_services = col_double(),
##   ..   total_services = col_double(),
##   ..   mth_since_last_serv = col_double(),
##   ..   annualised_mileage = col_double(),
##   ..   num_dealers_visited = col_double(),
##   ..   num_serv_dealer_purchased = col_double()
##   .. )

There are 131337 rows and 17 variables in the dataset. The summary for each variable can be seen above.To highlight, variables such as age_band, gender, car_model and car_segment are factor variables while most other variables are number but in deciles.

Plotting the missing variables, gender accounts fro more than 52 percent of missing variables while it is more than 85% for age_band variable.

The number of re-purchased customers makes only 2% of total customers.

The target percent on y axis shows the percent of re-purchased customers over total customers by each category. In most variables, the target percent distributed in all of categories. For example, target percent of car segment divided equally for each category Large/SUV, LCV, Other, Small/Medium. However, few variables do not cover all categories such as the of amount paid for scheduled services, number of scheduled services used under warranty, and total number of services. Five out of ten categories are re-purchased by customers.

Variables, which distributed on few categories, are put into the interaction between each other. The bright blue color presents the high target percent of the cell or category intersection of two variables. Given an example of car segment and total services, it can be seen that there is high percent of re-purchased customers of the combination of Other and category 3.

0.4 Logistic regression model

## [1] 98502

## [1] 32835

## [1] 131337

## 
## Call:
## glm(formula = Target ~ car_model + car_segment + age_of_vehicle_years + 
##     sched_serv_warr + non_sched_serv_warr + sched_serv_paid + 
##     non_sched_serv_paid + total_paid_services + total_services + 
##     mth_since_last_serv + annualised_mileage + num_dealers_visited + 
##     num_serv_dealer_purchased, family = "binomial", data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1410  -0.0943  -0.0032   0.0000   5.7972  
## 
## Coefficients: (2 not defined because of singularities)
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -6.356e+00  1.756e-01 -36.193  < 2e-16 ***
## car_model2                   6.070e-01  1.024e-01   5.927 3.09e-09 ***
## car_model3                   7.999e-01  1.069e-01   7.485 7.13e-14 ***
## car_model4                   4.240e-01  1.273e-01   3.331 0.000866 ***
## car_model5                   1.972e-01  1.075e-01   1.834 0.066664 .  
## car_model6                   2.633e-01  2.271e-01   1.160 0.246240    
## car_model7                   8.959e-01  1.296e-01   6.912 4.77e-12 ***
## car_model8                   8.525e-01  1.483e-01   5.746 9.11e-09 ***
## car_model9                   2.772e-01  3.067e-01   0.904 0.366116    
## car_model10                 -1.111e+00  2.371e-01  -4.685 2.80e-06 ***
## car_model11                 -3.411e+00  4.334e-01  -7.872 3.48e-15 ***
## car_model12                 -1.293e+00  5.775e-01  -2.239 0.025145 *  
## car_model13                  1.854e+00  2.799e-01   6.624 3.50e-11 ***
## car_model14                 -1.994e+01  2.995e+03  -0.007 0.994687    
## car_model15                  2.296e+00  7.383e-01   3.110 0.001868 ** 
## car_model16                  1.901e+00  9.152e-01   2.077 0.037785 *  
## car_model17                 -2.073e+00  1.055e+00  -1.965 0.049383 *  
## car_model18                 -2.563e+00  6.030e-01  -4.251 2.12e-05 ***
## car_model19                  7.878e-01  2.099e+04   0.000 0.999970    
## car_segmentLCV                      NA         NA      NA       NA    
## car_segmentOther            -1.551e+01  3.961e+03  -0.004 0.996876    
## car_segmentSmall/Medium             NA         NA      NA       NA    
## age_of_vehicle_years2        2.373e-01  1.033e-01   2.298 0.021579 *  
## age_of_vehicle_years3        7.539e-01  1.031e-01   7.316 2.56e-13 ***
## age_of_vehicle_years4        2.468e+00  1.181e-01  20.899  < 2e-16 ***
## age_of_vehicle_years5        2.436e+00  1.286e-01  18.937  < 2e-16 ***
## age_of_vehicle_years6        2.341e+00  1.491e-01  15.700  < 2e-16 ***
## age_of_vehicle_years7        1.023e+00  2.049e-01   4.992 5.97e-07 ***
## age_of_vehicle_years8        3.870e-01  2.603e-01   1.486 0.137160    
## age_of_vehicle_years9       -5.936e-01  4.444e-01  -1.336 0.181625    
## age_of_vehicle_years10      -2.418e+00  1.021e+00  -2.369 0.017840 *  
## sched_serv_warr2             3.344e-01  8.221e-02   4.068 4.75e-05 ***
## sched_serv_warr3            -2.500e-01  1.002e-01  -2.496 0.012562 *  
## sched_serv_warr4            -1.075e+00  1.278e-01  -8.413  < 2e-16 ***
## sched_serv_warr5            -2.144e+00  1.784e-01 -12.016  < 2e-16 ***
## sched_serv_warr6            -3.538e+00  2.900e-01 -12.203  < 2e-16 ***
## sched_serv_warr7            -1.823e+01  2.208e+02  -0.083 0.934189    
## sched_serv_warr8            -5.519e+00  7.634e-01  -7.229 4.85e-13 ***
## sched_serv_warr9            -1.783e+01  2.150e+02  -0.083 0.933890    
## sched_serv_warr10           -5.585e+00  1.190e+00  -4.691 2.71e-06 ***
## non_sched_serv_warr2         7.045e-01  1.137e-01   6.198 5.70e-10 ***
## non_sched_serv_warr3         7.658e-01  1.219e-01   6.280 3.38e-10 ***
## non_sched_serv_warr4         9.805e-01  1.674e-01   5.858 4.69e-09 ***
## non_sched_serv_warr5         6.784e-01  1.694e-01   4.005 6.20e-05 ***
## non_sched_serv_warr6        -3.149e-01  2.013e-01  -1.564 0.117705    
## non_sched_serv_warr7        -6.150e-01  2.160e-01  -2.846 0.004422 ** 
## non_sched_serv_warr8        -1.201e+00  2.606e-01  -4.611 4.01e-06 ***
## non_sched_serv_warr9        -1.570e+00  3.552e-01  -4.420 9.88e-06 ***
## non_sched_serv_warr10       -1.554e+00  6.206e-01  -2.504 0.012278 *  
## sched_serv_paid2            -1.105e-01  7.484e-02  -1.476 0.139893    
## sched_serv_paid3            -3.313e-01  9.009e-02  -3.678 0.000235 ***
## sched_serv_paid4            -9.795e-01  1.251e-01  -7.827 4.98e-15 ***
## sched_serv_paid5            -1.074e+00  1.534e-01  -7.000 2.56e-12 ***
## sched_serv_paid6            -2.717e+00  3.324e-01  -8.174 2.99e-16 ***
## sched_serv_paid7            -2.685e+00  4.652e-01  -5.771 7.90e-09 ***
## sched_serv_paid8            -2.376e+00  6.219e-01  -3.821 0.000133 ***
## sched_serv_paid9            -1.730e+00  7.741e-01  -2.235 0.025442 *  
## sched_serv_paid10           -4.973e-01  5.939e-01  -0.837 0.402416    
## non_sched_serv_paid2        -1.464e-01  9.793e-02  -1.495 0.134792    
## non_sched_serv_paid3         2.397e-02  1.024e-01   0.234 0.814982    
## non_sched_serv_paid4        -2.391e-01  1.812e-01  -1.320 0.186986    
## non_sched_serv_paid5         1.664e-01  2.061e-01   0.807 0.419437    
## non_sched_serv_paid6         4.730e-01  2.836e-01   1.668 0.095374 .  
## non_sched_serv_paid7         1.250e+00  3.867e-01   3.232 0.001228 ** 
## non_sched_serv_paid8         1.413e+00  4.510e-01   3.132 0.001734 ** 
## non_sched_serv_paid9         2.005e+00  5.121e-01   3.915 9.05e-05 ***
## non_sched_serv_paid10        2.101e+00  6.387e-01   3.290 0.001004 ** 
## total_paid_services2         2.495e-01  8.698e-02   2.869 0.004117 ** 
## total_paid_services3         7.622e-01  1.588e-01   4.799 1.59e-06 ***
## total_paid_services4         3.823e-01  2.547e-01   1.501 0.133344    
## total_paid_services5        -3.233e-01  3.705e-01  -0.873 0.382858    
## total_paid_services6        -3.612e-01  4.336e-01  -0.833 0.404884    
## total_paid_services7        -3.211e-01  4.785e-01  -0.671 0.502212    
## total_paid_services8        -7.466e-01  5.371e-01  -1.390 0.164526    
## total_paid_services9        -3.464e-01  6.256e-01  -0.554 0.579796    
## total_paid_services10       -2.421e-01  6.654e-01  -0.364 0.716009    
## total_services2              6.133e-01  9.544e-02   6.427 1.31e-10 ***
## total_services3              8.386e-02  1.405e-01   0.597 0.550654    
## total_services4             -9.269e-01  1.789e-01  -5.182 2.19e-07 ***
## total_services5             -1.700e+00  2.386e-01  -7.125 1.04e-12 ***
## total_services6             -3.822e+00  3.544e-01 -10.782  < 2e-16 ***
## total_services7             -5.729e+00  4.929e-01 -11.621  < 2e-16 ***
## total_services8             -6.202e+00  6.548e-01  -9.471  < 2e-16 ***
## total_services9             -8.457e+00  9.220e-01  -9.172  < 2e-16 ***
## total_services10            -1.021e+01  1.092e+00  -9.352  < 2e-16 ***
## mth_since_last_serv2         5.520e-01  8.773e-02   6.292 3.13e-10 ***
## mth_since_last_serv3         7.032e-01  1.146e-01   6.138 8.34e-10 ***
## mth_since_last_serv4         5.867e-01  1.178e-01   4.983 6.27e-07 ***
## mth_since_last_serv5        -2.559e-01  1.300e-01  -1.969 0.048984 *  
## mth_since_last_serv6        -1.459e+00  1.659e-01  -8.797  < 2e-16 ***
## mth_since_last_serv7        -3.258e+00  2.746e-01 -11.864  < 2e-16 ***
## mth_since_last_serv8        -6.670e+00  1.011e+00  -6.595 4.25e-11 ***
## mth_since_last_serv9        -5.224e+00  3.496e-01 -14.942  < 2e-16 ***
## mth_since_last_serv10       -1.826e+01  2.330e+02  -0.078 0.937532    
## annualised_mileage2          3.657e-01  1.003e-01   3.644 0.000268 ***
## annualised_mileage3          3.243e+00  1.317e-01  24.632  < 2e-16 ***
## annualised_mileage4          2.944e+00  1.332e-01  22.109  < 2e-16 ***
## annualised_mileage5          3.414e+00  1.360e-01  25.094  < 2e-16 ***
## annualised_mileage6          4.315e+00  1.456e-01  29.630  < 2e-16 ***
## annualised_mileage7          3.722e+00  1.605e-01  23.181  < 2e-16 ***
## annualised_mileage8          3.026e+00  1.931e-01  15.670  < 2e-16 ***
## annualised_mileage9          2.865e+00  2.191e-01  13.074  < 2e-16 ***
## annualised_mileage10        -2.042e-01  6.279e-01  -0.325 0.745053    
## num_dealers_visited2         3.977e-01  9.413e-02   4.224 2.40e-05 ***
## num_dealers_visited3        -2.027e-01  1.316e-01  -1.540 0.123584    
## num_dealers_visited4        -8.583e-01  1.757e-01  -4.885 1.04e-06 ***
## num_dealers_visited5        -6.213e-01  1.687e-01  -3.684 0.000230 ***
## num_dealers_visited6        -6.330e-01  1.641e-01  -3.857 0.000115 ***
## num_dealers_visited7        -5.319e-01  1.628e-01  -3.268 0.001082 ** 
## num_dealers_visited8        -6.442e-01  1.686e-01  -3.820 0.000133 ***
## num_dealers_visited9         9.391e-01  1.919e-01   4.893 9.94e-07 ***
## num_dealers_visited10        1.499e+00  1.997e-01   7.507 6.05e-14 ***
## num_serv_dealer_purchased2   5.410e-01  1.095e-01   4.940 7.82e-07 ***
## num_serv_dealer_purchased3   6.212e-01  1.094e-01   5.677 1.37e-08 ***
## num_serv_dealer_purchased4  -8.271e-01  1.606e-01  -5.149 2.62e-07 ***
## num_serv_dealer_purchased5  -7.672e-02  1.576e-01  -0.487 0.626457    
## num_serv_dealer_purchased6   1.030e+00  1.663e-01   6.194 5.86e-10 ***
## num_serv_dealer_purchased7   2.610e+00  1.888e-01  13.828  < 2e-16 ***
## num_serv_dealer_purchased8   4.856e+00  2.633e-01  18.440  < 2e-16 ***
## num_serv_dealer_purchased9   5.707e+00  3.358e-01  16.994  < 2e-16 ***
## num_serv_dealer_purchased10  8.317e+00  5.458e-01  15.238  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24138.6  on 98501  degrees of freedom
## Residual deviance:  9729.1  on 98383  degrees of freedom
## AIC: 9967.1
## 
## Number of Fisher Scoring iterations: 20

A logistic regression model is built with all variables except ID, age banded, gender. ID is not related to the prediction while age banded and gender are missing high proportion of values so these are taken out of the predicted variables.

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Target
## 
## Terms added sequentially (first to last)
## 
## 
##                           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                      98501    24138.6              
## car_model                 18    195.9     98483    23942.8 < 2.2e-16 ***
## car_segment                1      0.9     98482    23941.8    0.3383    
## age_of_vehicle_years       9   2567.8     98473    21374.1 < 2.2e-16 ***
## sched_serv_warr            9   4018.9     98464    17355.2 < 2.2e-16 ***
## non_sched_serv_warr        9    712.5     98455    16642.7 < 2.2e-16 ***
## sched_serv_paid            9    182.0     98446    16460.7 < 2.2e-16 ***
## non_sched_serv_paid        9    170.0     98437    16290.8 < 2.2e-16 ***
## total_paid_services        9     37.0     98428    16253.7 2.612e-05 ***
## total_services             9    177.5     98419    16076.3 < 2.2e-16 ***
## mth_since_last_serv        9   2450.9     98410    13625.4 < 2.2e-16 ***
## annualised_mileage         9   2082.7     98401    11542.7 < 2.2e-16 ***
## num_dealers_visited        9    215.1     98392    11327.6 < 2.2e-16 ***
## num_serv_dealer_purchased  9   1598.5     98383     9729.1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Performing ANOVA Chi-square test could help check the overall effect of variables on the dependent variable. From the summary, car_segment variable is not significant. We then build another model without car_segment to see if we can achieve lower AIC value.

## 
## Call:
## glm(formula = Target ~ car_model + age_of_vehicle_years + sched_serv_warr + 
##     non_sched_serv_warr + sched_serv_paid + non_sched_serv_paid + 
##     total_paid_services + total_services + mth_since_last_serv + 
##     annualised_mileage + num_dealers_visited + num_serv_dealer_purchased, 
##     family = "binomial", data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1411  -0.0943  -0.0032   0.0000   5.7973  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -6.357e+00  1.756e-01 -36.194  < 2e-16 ***
## car_model2                   6.070e-01  1.024e-01   5.926 3.10e-09 ***
## car_model3                   7.999e-01  1.069e-01   7.485 7.14e-14 ***
## car_model4                   4.240e-01  1.273e-01   3.331 0.000866 ***
## car_model5                   1.972e-01  1.075e-01   1.834 0.066676 .  
## car_model6                   2.633e-01  2.271e-01   1.159 0.246279    
## car_model7                   8.959e-01  1.296e-01   6.912 4.78e-12 ***
## car_model8                   8.524e-01  1.483e-01   5.746 9.13e-09 ***
## car_model9                   2.771e-01  3.067e-01   0.904 0.366151    
## car_model10                 -1.111e+00  2.371e-01  -4.685 2.80e-06 ***
## car_model11                 -3.412e+00  4.334e-01  -7.873 3.45e-15 ***
## car_model12                 -1.293e+00  5.775e-01  -2.239 0.025129 *  
## car_model13                  1.854e+00  2.799e-01   6.624 3.51e-11 ***
## car_model14                 -1.994e+01  2.995e+03  -0.007 0.994687    
## car_model15                  2.296e+00  7.383e-01   3.110 0.001868 ** 
## car_model16                  1.901e+00  9.152e-01   2.077 0.037821 *  
## car_model17                 -2.199e+00  1.046e+00  -2.102 0.035554 *  
## car_model18                 -2.564e+00  6.030e-01  -4.252 2.12e-05 ***
## car_model19                 -1.472e+01  2.061e+04  -0.001 0.999430    
## age_of_vehicle_years2        2.379e-01  1.033e-01   2.304 0.021239 *  
## age_of_vehicle_years3        7.545e-01  1.031e-01   7.322 2.45e-13 ***
## age_of_vehicle_years4        2.468e+00  1.181e-01  20.905  < 2e-16 ***
## age_of_vehicle_years5        2.437e+00  1.286e-01  18.945  < 2e-16 ***
## age_of_vehicle_years6        2.342e+00  1.491e-01  15.706  < 2e-16 ***
## age_of_vehicle_years7        1.023e+00  2.049e-01   4.994 5.91e-07 ***
## age_of_vehicle_years8        3.874e-01  2.603e-01   1.488 0.136706    
## age_of_vehicle_years9       -5.935e-01  4.444e-01  -1.336 0.181706    
## age_of_vehicle_years10      -2.418e+00  1.021e+00  -2.369 0.017847 *  
## sched_serv_warr2             3.344e-01  8.221e-02   4.067 4.76e-05 ***
## sched_serv_warr3            -2.500e-01  1.002e-01  -2.496 0.012576 *  
## sched_serv_warr4            -1.075e+00  1.278e-01  -8.413  < 2e-16 ***
## sched_serv_warr5            -2.144e+00  1.784e-01 -12.016  < 2e-16 ***
## sched_serv_warr6            -3.539e+00  2.900e-01 -12.203  < 2e-16 ***
## sched_serv_warr7            -1.823e+01  2.208e+02  -0.083 0.934188    
## sched_serv_warr8            -5.519e+00  7.634e-01  -7.229 4.85e-13 ***
## sched_serv_warr9            -1.783e+01  2.150e+02  -0.083 0.933890    
## sched_serv_warr10           -5.585e+00  1.191e+00  -4.691 2.72e-06 ***
## non_sched_serv_warr2         7.047e-01  1.137e-01   6.200 5.64e-10 ***
## non_sched_serv_warr3         7.660e-01  1.219e-01   6.282 3.34e-10 ***
## non_sched_serv_warr4         9.806e-01  1.674e-01   5.859 4.67e-09 ***
## non_sched_serv_warr5         6.782e-01  1.694e-01   4.004 6.23e-05 ***
## non_sched_serv_warr6        -3.144e-01  2.013e-01  -1.562 0.118316    
## non_sched_serv_warr7        -6.148e-01  2.161e-01  -2.845 0.004435 ** 
## non_sched_serv_warr8        -1.201e+00  2.606e-01  -4.610 4.03e-06 ***
## non_sched_serv_warr9        -1.569e+00  3.552e-01  -4.417 1.00e-05 ***
## non_sched_serv_warr10       -1.553e+00  6.206e-01  -2.503 0.012324 *  
## sched_serv_paid2            -1.106e-01  7.485e-02  -1.477 0.139568    
## sched_serv_paid3            -3.313e-01  9.009e-02  -3.677 0.000236 ***
## sched_serv_paid4            -9.797e-01  1.251e-01  -7.829 4.91e-15 ***
## sched_serv_paid5            -1.074e+00  1.534e-01  -7.001 2.53e-12 ***
## sched_serv_paid6            -2.717e+00  3.324e-01  -8.174 2.97e-16 ***
## sched_serv_paid7            -2.685e+00  4.653e-01  -5.771 7.89e-09 ***
## sched_serv_paid8            -2.376e+00  6.219e-01  -3.821 0.000133 ***
## sched_serv_paid9            -1.730e+00  7.741e-01  -2.235 0.025444 *  
## sched_serv_paid10           -4.972e-01  5.939e-01  -0.837 0.402452    
## non_sched_serv_paid2        -1.465e-01  9.793e-02  -1.496 0.134590    
## non_sched_serv_paid3         2.389e-02  1.024e-01   0.233 0.815631    
## non_sched_serv_paid4        -2.401e-01  1.812e-01  -1.325 0.185057    
## non_sched_serv_paid5         1.658e-01  2.061e-01   0.804 0.421326    
## non_sched_serv_paid6         4.726e-01  2.836e-01   1.666 0.095698 .  
## non_sched_serv_paid7         1.250e+00  3.867e-01   3.232 0.001231 ** 
## non_sched_serv_paid8         1.412e+00  4.510e-01   3.132 0.001738 ** 
## non_sched_serv_paid9         2.004e+00  5.121e-01   3.914 9.06e-05 ***
## non_sched_serv_paid10        2.101e+00  6.387e-01   3.289 0.001005 ** 
## total_paid_services2         2.496e-01  8.698e-02   2.869 0.004115 ** 
## total_paid_services3         7.630e-01  1.588e-01   4.805 1.55e-06 ***
## total_paid_services4         3.832e-01  2.547e-01   1.505 0.132435    
## total_paid_services5        -3.226e-01  3.705e-01  -0.871 0.383915    
## total_paid_services6        -3.607e-01  4.336e-01  -0.832 0.405479    
## total_paid_services7        -3.207e-01  4.785e-01  -0.670 0.502735    
## total_paid_services8        -7.463e-01  5.371e-01  -1.389 0.164708    
## total_paid_services9        -3.461e-01  6.256e-01  -0.553 0.580116    
## total_paid_services10       -2.418e-01  6.654e-01  -0.363 0.716332    
## total_services2              6.133e-01  9.544e-02   6.426 1.31e-10 ***
## total_services3              8.344e-02  1.405e-01   0.594 0.552611    
## total_services4             -9.272e-01  1.789e-01  -5.183 2.18e-07 ***
## total_services5             -1.701e+00  2.386e-01  -7.127 1.03e-12 ***
## total_services6             -3.823e+00  3.545e-01 -10.784  < 2e-16 ***
## total_services7             -5.729e+00  4.929e-01 -11.623  < 2e-16 ***
## total_services8             -6.203e+00  6.549e-01  -9.473  < 2e-16 ***
## total_services9             -8.457e+00  9.220e-01  -9.173  < 2e-16 ***
## total_services10            -1.021e+01  1.092e+00  -9.353  < 2e-16 ***
## mth_since_last_serv2         5.519e-01  8.773e-02   6.291 3.16e-10 ***
## mth_since_last_serv3         7.035e-01  1.146e-01   6.141 8.22e-10 ***
## mth_since_last_serv4         5.868e-01  1.178e-01   4.984 6.24e-07 ***
## mth_since_last_serv5        -2.557e-01  1.300e-01  -1.967 0.049126 *  
## mth_since_last_serv6        -1.459e+00  1.659e-01  -8.797  < 2e-16 ***
## mth_since_last_serv7        -3.258e+00  2.746e-01 -11.863  < 2e-16 ***
## mth_since_last_serv8        -6.670e+00  1.011e+00  -6.595 4.26e-11 ***
## mth_since_last_serv9        -5.224e+00  3.496e-01 -14.942  < 2e-16 ***
## mth_since_last_serv10       -1.826e+01  2.330e+02  -0.078 0.937537    
## annualised_mileage2          3.657e-01  1.003e-01   3.645 0.000268 ***
## annualised_mileage3          3.243e+00  1.316e-01  24.631  < 2e-16 ***
## annualised_mileage4          2.944e+00  1.332e-01  22.109  < 2e-16 ***
## annualised_mileage5          3.414e+00  1.360e-01  25.095  < 2e-16 ***
## annualised_mileage6          4.315e+00  1.456e-01  29.630  < 2e-16 ***
## annualised_mileage7          3.721e+00  1.605e-01  23.180  < 2e-16 ***
## annualised_mileage8          3.026e+00  1.931e-01  15.669  < 2e-16 ***
## annualised_mileage9          2.865e+00  2.192e-01  13.073  < 2e-16 ***
## annualised_mileage10        -2.046e-01  6.279e-01  -0.326 0.744527    
## num_dealers_visited2         3.975e-01  9.414e-02   4.222 2.42e-05 ***
## num_dealers_visited3        -2.024e-01  1.316e-01  -1.538 0.124151    
## num_dealers_visited4        -8.580e-01  1.757e-01  -4.883 1.04e-06 ***
## num_dealers_visited5        -6.210e-01  1.687e-01  -3.682 0.000232 ***
## num_dealers_visited6        -6.324e-01  1.641e-01  -3.854 0.000116 ***
## num_dealers_visited7        -5.315e-01  1.628e-01  -3.266 0.001092 ** 
## num_dealers_visited8        -6.438e-01  1.686e-01  -3.818 0.000135 ***
## num_dealers_visited9         9.398e-01  1.920e-01   4.896 9.79e-07 ***
## num_dealers_visited10        1.500e+00  1.997e-01   7.509 5.94e-14 ***
## num_serv_dealer_purchased2   5.410e-01  1.095e-01   4.940 7.82e-07 ***
## num_serv_dealer_purchased3   6.211e-01  1.094e-01   5.676 1.38e-08 ***
## num_serv_dealer_purchased4  -8.275e-01  1.606e-01  -5.151 2.59e-07 ***
## num_serv_dealer_purchased5  -7.694e-02  1.576e-01  -0.488 0.625468    
## num_serv_dealer_purchased6   1.030e+00  1.663e-01   6.191 5.98e-10 ***
## num_serv_dealer_purchased7   2.610e+00  1.888e-01  13.825  < 2e-16 ***
## num_serv_dealer_purchased8   4.855e+00  2.633e-01  18.437  < 2e-16 ***
## num_serv_dealer_purchased9   5.707e+00  3.358e-01  16.993  < 2e-16 ***
## num_serv_dealer_purchased10  8.317e+00  5.458e-01  15.237  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24138.6  on 98501  degrees of freedom
## Residual deviance:  9729.3  on 98384  degrees of freedom
## AIC: 9965.3
## 
## Number of Fisher Scoring iterations: 20

The AIC is lower than the previous model. Again, we can use ANOVA test to compare both models.

## Analysis of Deviance Table
## 
## Model 1: Target ~ car_model + age_of_vehicle_years + sched_serv_warr + 
##     non_sched_serv_warr + sched_serv_paid + non_sched_serv_paid + 
##     total_paid_services + total_services + mth_since_last_serv + 
##     annualised_mileage + num_dealers_visited + num_serv_dealer_purchased
## Model 2: Target ~ car_model + car_segment + age_of_vehicle_years + sched_serv_warr + 
##     non_sched_serv_warr + sched_serv_paid + non_sched_serv_paid + 
##     total_paid_services + total_services + mth_since_last_serv + 
##     annualised_mileage + num_dealers_visited + num_serv_dealer_purchased
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     98384     9729.3                     
## 2     98383     9729.1  1  0.23465   0.6281

The greater than 0.05 of p presents its evidence that the second model is better than the first model. Now it is time to evaluate the model’s performance running the model on the test data and consider few metrics as confusion matrix, accuracy.

##          true
## predicted     0     1
##         0 31850   426
##         1    79   480

## [1] 0.9846201

The accuracy is greater than 98%, which is quite high. The wrong prediction of re-purchased customers while they actually do not buy new ones is lower than the reverse case. So this prediciton is good if commpany wants to manufacture less cars than the customers’ needs and maximise their profit.

0.5 Random forest model

Next, we want to see if we build the tree based classification model to compare with our above logistic regression model in terms of its accuracy of prediction.

## 
## Call:
##  randomForest(formula = Target ~ ., data = df_train_rf, xtest = df_test_rf[setdiff(names(df_test_rf),      "Target")], ytest = df_test_rf$Target) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.81%
## Confusion matrix:
##       0    1 class.error
## 0 95778  109 0.001136755
## 1   692 1923 0.264627151
##                 Test set error rate: 0.83%
## Confusion matrix:
##       0   1  class.error
## 0 31900  29 0.0009082652
## 1   245 661 0.2704194260

## [1] 0.9916471

## [1] 0.9918655

Again, we build the random forest model which excluded variables age banded, gender and ID. The accuracy on is 0.99 which is higher than 0.98 of logistic regression.

0.5.1 Tuning

Here, random forests provide a handful of tuning parameters to tune the performance of the model. The key parameter is to adjust the nummber of candidate variables at each split. However there are a few additional hyperparameters as below:

Number of trees to generate
Number of variables to randomly sample as candidates at each split
Number of samples to train on
Minimum number of samples within the terminal nodes
Maximum number of terminal nodes

## [1] 80

##    mtry node_size sampe_size  Accuracy    OOB_RMSE
## 1     6         3      0.800 0.9931677 0.006832349
## 2     4         3      0.800 0.9931575 0.006842501
## 3     4         5      0.632 0.9931473 0.006852653
## 4     6         3      0.700 0.9931372 0.006862805
## 5     6         5      0.700 0.9931169 0.006883109
## 6     6         5      0.800 0.9931169 0.006883109
## 7     6         7      0.800 0.9931169 0.006883109
## 8     6         9      0.700 0.9931067 0.006893261
## 9     4         7      0.800 0.9931067 0.006893261
## 10    4         3      0.632 0.9930966 0.006903413

We built a grid of hyperparameter combination and then run it through ranger, a C++ implementation of Brieman’s random forest algorith. Then we can compare the performance for each combination.

From above table, the combination (mtry, node_size, sample_size) as (6, 3, 0.8) give the highest accuracy and lowest prediction error.

With a hyperparameter combination of (mtry, node_size, sample_size) as (6, 3, 0.8), we run it 100 times and plot its changing values on histogram, we can see the high concentration of accuracy greater than 0.993. Its stability gives us confidence that this model is little bit better than previous random forest.

## [1] 0.9846201

## [1] 0.9930966

Evaluate the model on the test data, we found the random forest performs better than logistic regression model, which has accuracy of 0.993 than 0.984.

0.6 Important variables

##                 car_model               car_segment 
##                 108.13080                  40.80992 
##      age_of_vehicle_years           sched_serv_warr 
##                 375.29651                 248.79636 
##       non_sched_serv_warr           sched_serv_paid 
##                 251.74077                 177.64189 
##       non_sched_serv_paid       total_paid_services 
##                 145.70770                 245.76113 
##            total_services       mth_since_last_serv 
##                 456.63991                 731.32864 
##        annualised_mileage       num_dealers_visited 
##                 469.65603                 256.87515 
## num_serv_dealer_purchased 
##                 463.90761

	Df	Deviance	Resid.	Df Resid.	Dev Pr(>Chi)
NULL			98501	24138.6
car_model	18	195.9	98483	23942.8	< 2.2e-16 ***
car_segment	1	0.9	98482	23941.8	0.3383
age_of_vehicle_years	9	2567.8	98473	21374.1	< 2.2e-16 ***
sched_serv_warr	9	4018.9	98464	17355.2	< 2.2e-16 ***
non_sched_serv_warr	9	712.5	98455	16642.7	< 2.2e-16 ***
sched_serv_paid	9	182.0	98446	16460.7	< 2.2e-16 ***
non_sched_serv_paid	9	170.0	98437	16290.8	< 2.2e-16 ***
total_paid_services	9	37.0	98428	16253.7	2.612e-05 ***
total_services	9	177.5	98419	16076.3	< 2.2e-16 ***
mth_since_last_serv	9	2450.9	98410	13625.4	< 2.2e-16 ***
annualised_mileage	9	2082.7	98401	11542.7	< 2.2e-16 ***
num_dealers_visited	9	215.1	98392	11327.6	< 2.2e-16 ***
num_serv_dealer_purchased	9	1598.5	98383	9729.1	< 2.2e-16 ***

With logistic regression model, the ANOVA table presents the deviance changes on the introduction of new variables. We found top variables generate high deviance reduction as sched_serv_warr, age_of_vehicle_years, mth_since_last_serv, annualised_mileage, num_serv_dealer_purchased. Meanwhile, random forest generates same set of top variables mth_since_last_serv, annualised_mileage, num_serv_dealer_purchased, age_of_vehicle_years. However, the priorities of significance between two models is different.

In business sense, high importance variables drive the significant impact on the outcome values. Given our business case, it is intuitive to understand that high importance fall into variables as age of their last vehicle or annualised vehicle mileage or number of services had at the same dealer where the vehicle was purchased. Customers are likely to buy new car when their existing one is aged or broken which needs high number of services.

0.7 Conclusion

In this report, we apply both logistic regression and random forest model to predict whether customers make a re-purchase a car. The data set to train into models are existing customers including information about customer demographics, previous car type bought, the age of the vehicle, and servicing details.

Before doing the model training, variables as ID, age banded, gender are excluded because its high proportion of missing or unrelated data. With logistic regression model, two versions of models were built to train the data, one with full variables and other with reduced variables. The reduced variable is slightly better than the other one, which gives lower AIC. Random forest model proves to be best when it gives higher accuracy of 0.993 higher than 0.984 of logistic regression model. With random forest, we made a few tuning by running model through a few hyper parameter combination to find out best performed one.

Both logistic regression and random forest model give same set of top high important variables but the priorities of these variables are not the same. It is intuitive to explain its importance as customers are likely to buy new car when their existing one is aged or broken.

0.8 References

https://stackoverflow.com/questions/25342646/merge-two-matrices-by-column-names https://stackoverflow.com/questions/60625837/heatmap-showing-two-variables-per-cell https://dplyr.tidyverse.org/articles/programming.html https://leaherb.com/pass-column-name-as-parameter-to-a-function-using-dplyr/ http://www.sthda.com/english/wiki/colors-in-r https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/logistic-regression-analysis-r/tutorial/ https://stats.stackexchange.com/questions/274151/anova-to-compare-models https://stats.stackexchange.com/questions/271853/null-hypothesis-of-an-anova-when-comparing-regression-models https://stats.stackexchange.com/questions/172782/how-to-use-r-anova-results-to-select-best-model https://cran.rstudio.com/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html https://www.r-bloggers.com/explaining-predictions-random-forest-post-hoc-analysis-randomforestexplainer-package/ https://stackoverflow.com/questions/14996619/random-forest-output-interpretation https://stats.stackexchange.com/questions/401437/the-importance-in-randomforest-returns-different-results-how-to-interpret-thi https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5 https://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/ https://stats.stackexchange.com/questions/172842/best-practices-with-data-wrangling-before-running-random-forest-predictions https://www.edureka.co/blog/random-forest-classifier/#Practical%20Implementation%20of%20Random%20Forest%20In%20R https://datascienceplus.com/random-forests-in-r/ https://uc-r.github.io/random_forests https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6 https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/