Case

After good experiences with RapidMiner and a first venture into data mining with R, Blackwell Electronics has a deeper analytics project to tackle. The sales team conducted a survey to learn more about the brand preferences of existing customers. The results should again contribute to the strategic decision on which brand to deepen the relationship with. Due to survey-related issues the dataset was not captured properly: about one third of the responses lack the brand preference (incomplete). The task is to investigate which survey questions could enable a prediction of these missing customer preferences and, if possible, to predict them within the analysis. For this prediction, C5.0 and Random Forest are to be trained, compared, and applied.

Executive summary

Procedure

The analysis followed the common data mining approach comprising data exploration, pre-processing, modelling and optimization, prediction, and evaluation.

For the analysis the ‘caret’ package for the statistical programming language R is used, applying C5.0 and Random Forest.

Once a model is trained and its performance evaluated in terms of accuracy, kappa, and confusion matrix, it is used to predict the missing customer brand preferences in the incomplete survey data.

Results

Both customer survey datasets - complete and incomplete - seem to stem from the same survey, or at least from the same sampling, as both show almost identical distributions within the data.

Brand preference in the complete survey data splits roughly 1/3 to 2/3 independently of the education level, car, zipcode, or credit of the surveyed customer. This is probably the result of a stratified sampling approach, which could lead to under-/overrepresentation as the population of each cluster is not taken into account.

Only salary and age show a pattern: ‘Sony’ is most favoured by customers aged 20 - 40 with high income, 40 - 60 with low income, and 60 - 80 with mid to high income. In contrast, ‘Acer’ is preferred in the mid or mid to high income range for the age groups 20 - 40 and 40 - 60 respectively, and in the low income range within the age range of 60 - 80.

As a consequence, both algorithms prioritize salary and age and largely ignore the other features, as these carry no information supporting the classification problem.

Both algorithms achieve an Accuracy of around 90 % (C5.0: 0.887, Random Forest: 0.907) and a Kappa of around 0.8. In addition, both distribute their misclassifications (roughly 10 % of the test cases) fairly evenly across the two classes. Hence, both can be used to predict the missing customer preference values in the incomplete survey data.

The partition among the predicted customer preferences turns out to be 1/3 ‘Acer’ and 2/3 ‘Sony’, which matches the partition in the complete survey data stated above.

A deep dive into the distribution of customer preferences across all features paints the same picture as for the complete customer survey data.

Technical documentation

1. Data exploration & pre-processing

Overview of dataset:

Complete survey data:

## 'data.frame':    9898 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : int  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : int  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : int  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: int  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : Factor w/ 2 levels "Acer","Sony": 1 2 1 2 1 2 2 2 1 2 ...
##      salary            age            elevel           car       
##  Min.   : 20000   Min.   :20.00   Min.   :0.000   Min.   : 1.00  
##  1st Qu.: 52082   1st Qu.:35.00   1st Qu.:1.000   1st Qu.: 6.00  
##  Median : 84950   Median :50.00   Median :2.000   Median :11.00  
##  Mean   : 84871   Mean   :49.78   Mean   :1.983   Mean   :10.52  
##  3rd Qu.:117162   3rd Qu.:65.00   3rd Qu.:3.000   3rd Qu.:15.75  
##  Max.   :150000   Max.   :80.00   Max.   :4.000   Max.   :20.00  
##     zipcode          credit        brand     
##  Min.   :0.000   Min.   :     0   Acer:3744  
##  1st Qu.:2.000   1st Qu.:120807   Sony:6154  
##  Median :4.000   Median :250607              
##  Mean   :4.041   Mean   :249176              
##  3rd Qu.:6.000   3rd Qu.:374640              
##  Max.   :8.000   Max.   :500000

Incomplete survey data:

## 'data.frame':    5000 obs. of  7 variables:
##  $ salary : num  110500 140894 119160 20000 93956 ...
##  $ age    : int  54 44 49 56 59 71 32 33 32 58 ...
##  $ elevel : int  3 4 2 0 1 2 1 4 1 2 ...
##  $ car    : int  15 20 1 9 15 7 17 17 19 8 ...
##  $ zipcode: int  4 7 3 1 1 2 1 0 2 4 ...
##  $ credit : num  354724 395015 122025 99630 458680 ...
##  $ brand  : int  0 0 0 0 0 0 0 0 0 0 ...
##      salary            age            elevel           car       
##  Min.   : 20000   Min.   :20.00   Min.   :0.000   Min.   : 1.00  
##  1st Qu.: 52242   1st Qu.:35.00   1st Qu.:1.000   1st Qu.: 6.00  
##  Median : 85969   Median :50.00   Median :2.000   Median :11.00  
##  Mean   : 85560   Mean   :49.87   Mean   :2.011   Mean   :10.58  
##  3rd Qu.:118380   3rd Qu.:65.00   3rd Qu.:3.000   3rd Qu.:16.00  
##  Max.   :150000   Max.   :80.00   Max.   :4.000   Max.   :20.00  
##     zipcode          credit           brand  
##  Min.   :0.000   Min.   :     0   Min.   :0  
##  1st Qu.:2.000   1st Qu.:121879   1st Qu.:0  
##  Median :4.000   Median :250871   Median :0  
##  Mean   :4.043   Mean   :249510   Mean   :0  
##  3rd Qu.:6.000   3rd Qu.:375425   3rd Qu.:0  
##  Max.   :8.000   Max.   :500000   Max.   :0

The complete dataset contains 0 missing values and 0 duplicated rows / columns.
The incomplete dataset contains 0 missing values and 0 duplicated rows / columns.
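
Such a check can be done along the following lines (a sketch; the data frame names cData for the complete and iData for the incomplete survey data are assumptions):

sum(is.na(cData)); sum(duplicated(cData))                   #missing values and duplicated rows, complete data
sum(is.na(iData)); sum(duplicated(iData))                   #same check for the incomplete data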

Scatterplots of selected features:

Plotting salary and age against brand reveals patterns within the data, whereas plotting any other feature together with salary/age against brand shows (almost) no pattern. This suggests that only salary and age carry meaningful information for the classification problem of predicting customer brand preference within the incomplete survey data.
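
Such a scatterplot can be produced, for example, with ggplot2 (a sketch; cData again denotes the complete survey data):

library(ggplot2)
ggplot(cData, aes(x = age, y = salary, colour = brand)) +
  geom_point(alpha = 0.4) +                                 #one point per survey respondent
  labs(title = "Salary vs. age by brand preference")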

Histograms of all features:

This overview of histograms for all features confirms the hypothesis stated above: only salary and age show a pattern in the distribution of brand.

Boxplots of features:

Neither the complete survey data (plots A - C) nor the incomplete survey data (plots D - F) contain any outliers, as shown by the following boxplots.

Data type and normalization:

  • Features ‘elevel’, ‘car’, ‘zipcode’, and ‘brand’ are converted from data type integer to factor.
  • Features ‘salary’, ‘age’, and ‘credit’ are normalized (see the sketch below).
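
A minimal sketch of these two steps (applied to both data frames; cData/iData as above; scale() performs z-score normalization):

facs <- c("elevel", "car", "zipcode", "brand")
cData[facs] <- lapply(cData[facs], as.factor)               #integer -> factor
nums <- c("salary", "age", "credit")
cData[nums] <- scale(cData[nums])                           #normalize numeric features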

The structure of the data received after factorization and normalization is as follows (first = complete, second = incomplete):

## 'data.frame':    9898 obs. of  7 variables:
##  $ salary : num  0.926 0.584 -0.182 -0.562 -0.901 ...
##  $ age    : num  -0.2716 0.7514 -1.5218 0.0694 -1.6923 ...
##  $ elevel : Factor w/ 5 levels "0","1","2","3",..: 1 2 1 4 4 4 5 4 5 2 ...
##  $ car    : Factor w/ 20 levels "1","2","3","4",..: 14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 5 7 3 6 5 4 6 1 1 5 ...
##  $ credit : num  1.328 -1.406 -1.38 -1.434 0.715 ...
##  $ brand  : Factor w/ 2 levels "Acer","Sony": 1 2 1 2 1 2 2 2 1 2 ...
## 'data.frame':    5000 obs. of  7 variables:
##  $ salary : num  0.659 1.462 0.888 -1.732 0.222 ...
##  $ age    : num  0.2339 -0.3323 -0.0492 0.3471 0.5169 ...
##  $ elevel : Factor w/ 5 levels "0","1","2","3",..: 4 5 3 1 2 3 2 5 2 3 ...
##  $ car    : Factor w/ 20 levels "1","2","3","4",..: 15 20 1 9 15 7 17 17 19 8 ...
##  $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 5 8 4 2 2 3 2 1 3 5 ...
##  $ credit : num  0.721 0.997 -0.874 -1.027 1.434 ...
##  $ brand  : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...

2. Feature Engineering - Transformation

No feature transformation/engineering was applied in the first iteration. Later on, a clustering of certain features was conducted but did not unveil any patterns in the remaining data (‘elevel’, ‘car’, ‘zipcode’, ‘credit’) beneficial for the classification. However, the procedure executed is included in the following.

3. Data split

Train and test set have been compiled as follows:

  • Set seed to 123
  • Applying a split ratio of 75/25 results in a training set of 7424 and a test set of 2474 objects (see the sketch below).
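
A minimal sketch with caret’s createDataPartition() (the index name inTrain is hypothetical; trainS/testS follow the naming used in the modelling code below):

library(caret)
set.seed(123)                                               #seed for reproducibility
inTrain <- createDataPartition(cData$brand, p = 0.75, list = FALSE)
trainS <- cData[inTrain, ]                                  #7424 observations
testS  <- cData[-inTrain, ]                                 #2474 observations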

4. Modelling

I. C5.0

The model has been built by applying the decision tree algorithm C5.0 to the training set, using repeated 10-fold cross validation and an Automatic Tuning Grid with tuneLength = 2.

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2881             nan     0.1000    0.0178
##      2        1.2567             nan     0.1000    0.0151
##      3        1.2280             nan     0.1000    0.0139
##      4        1.2051             nan     0.1000    0.0110
##      5        1.1846             nan     0.1000    0.0100
##      6        1.1677             nan     0.1000    0.0085
##      7        1.1522             nan     0.1000    0.0074
##      8        1.1383             nan     0.1000    0.0069
##      9        1.1260             nan     0.1000    0.0053
##     10        1.1161             nan     0.1000    0.0048
##     20        1.0412             nan     0.1000    0.0022
##     40        0.8895             nan     0.1000    0.0040
##     60        0.7457             nan     0.1000    0.0001
##     80        0.6512             nan     0.1000    0.0029
##    100        0.6088             nan     0.1000    0.0000
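
The training call itself is not included in the report. A minimal sketch of the setup described above - caret with repeated 10-fold cross-validation and an automatic tuning grid (tuneLength = 2) - assuming the training set trainS and the model name mC5 used further below:

ctrl_c5 <- trainControl(method = "repeatedcv",              #repeated cross-validation
                        repeats = 3,                        #repeated 3 times (cf. model summary below)
                        number = 10)                        #number of folds K = 10

mC5 <- train(brand ~ .,
             data = trainS,
             method = "C5.0",                               #C5.0 decision tree
             metric = "Accuracy",
             tuneLength = 2,                                #Automatic Tuning Grid
             trControl = ctrl_c5)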

II. RF

A Random Forest model has been trained with repeated 10-fold cross validation and six different values (2, 5, 7, 9, sqrt(number of features) ≈ 2.65, and 15) for mtry.

tg_mtry <- expand.grid(.mtry = c(2, 5, 7, 9, sqrt(ncol(trainS)), 15))  #define tuneGrid w/ candidate values for mtry

ctrl_rf <- trainControl(method = "repeatedcv",              #type of resampling: repeated cross-validation
                        repeats = 3,                        #repeated 3 times
                        number = 10,                        #number of folds K = 10 (by default)
                        #summaryFunction = twoClassSummary, #estimates perf. using observed and predicted values;
                                                            #twoClassSummary especially for 2-class problems
                        classProbs = FALSE                  #no predicted class probabilities (only needed for ROC/AUC)
)

mRF <- train(brand ~ .,
             data = trainS[ , c(1, 2, 7)],                  #use only salary, age, and the target brand
             method = "rf",
            #preProc = c("center", "scale"),                #center and scale the predictors for the training
                                                            #set and all future samples
             metric = "Accuracy",
             #tuneLength = 2,                               #alternative: Automatic Tuning Grid with tuneLength = 2
             tuneGrid = tg_mtry,
             trControl = ctrl_rf
)
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid
## mtry: reset to within valid range

This warning is expected here: since only salary and age serve as predictors, mtry values larger than the number of predictors are invalid and are reset by randomForest to the valid range.

5. Model performance

To evaluate model performance, Accuracy and Kappa scores are recorded for each parameter value the model used during training. In addition, accuracy is plotted against the parameter configuration, and the function varImp() is used to estimate how the model prioritized each feature during training. One can see that salary and age are the most important features for the model, whereas all others have no impact as they do not carry any information/differences in brand distribution (cf. data exploration - histograms).
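
In caret this corresponds to, for example (a sketch; mC5 and mRF denote the fitted models):

plot(mC5)                                                   #accuracy plotted against parameter configuration
varImp(mC5)                                                 #scaled variable importance per feature
varImp(mRF)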

I. C5.0

## Stochastic Gradient Boosting 
## 
## 7424 samples
##    6 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 6682, 6681, 6682, 6681, 6682, 6681, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7310528  0.4338832
##   1                  100      0.7305142  0.4314887
##   2                   50      0.8187469  0.6207274
##   2                  100      0.8831279  0.7560998
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

II. RF

As for C5.0, Accuracy and Kappa scores are recorded for each parameter value used during training, and varImp() is used to estimate how the model prioritized each feature.

## Random Forest 
## 
## 7424 samples
##    2 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 6681, 6681, 6683, 6682, 6683, 6681, ... 
## Resampling results across tuning parameters:
## 
##   mtry       Accuracy   Kappa    
##    2.000000  0.9105190  0.8100066
##    2.645751  0.9105641  0.8101578
##    5.000000  0.9109222  0.8108954
##    7.000000  0.9106081  0.8101809
##    9.000000  0.9109681  0.8109326
##   15.000000  0.9108775  0.8107631
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 9.

## rf variable importance
## 
##        Overall
## salary     100
## age          0

6. Prediction

For model evaluation, the performance of the best model iteration and the confusion matrix are calculated and included in the following for each model (C5.0, RF).
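
The prediction step preceding the confusion matrices is not shown in the report; a minimal sketch of how it presumably looks, using the fitted models and the test set:

p_mC5 <- predict(mC5, newdata = testS)                      #predicted classes on the test set
p_mRF <- predict(mRF, newdata = testS)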

I. C5.0

confusionMatrix(p_mC5, testS$brand)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  840  183
##       Sony   96 1355
##                                           
##                Accuracy : 0.8872          
##                  95% CI : (0.8741, 0.8994)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7645          
##                                           
##  Mcnemar's Test P-Value : 2.623e-07       
##                                           
##             Sensitivity : 0.8974          
##             Specificity : 0.8810          
##          Pos Pred Value : 0.8211          
##          Neg Pred Value : 0.9338          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3395          
##    Detection Prevalence : 0.4135          
##       Balanced Accuracy : 0.8892          
##                                           
##        'Positive' Class : Acer            
## 

II. RF

confusionMatrix(p_mRF, testS$brand)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  829  123
##       Sony  107 1415
##                                           
##                Accuracy : 0.907           
##                  95% CI : (0.8949, 0.9182)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.803           
##                                           
##  Mcnemar's Test P-Value : 0.3226          
##                                           
##             Sensitivity : 0.8857          
##             Specificity : 0.9200          
##          Pos Pred Value : 0.8708          
##          Neg Pred Value : 0.9297          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3351          
##    Detection Prevalence : 0.3848          
##       Balanced Accuracy : 0.9029          
##                                           
##        'Positive' Class : Acer            
## 

The measures calculated show that both models applied - C5.0 and Random Forest - perform similarly, as indicated by almost equal values for Accuracy and Kappa. The first model achieves an Accuracy of 0.887 and a Kappa of 0.765, whereas the second model comes with an Accuracy of 0.907 and a Kappa of 0.803. Thus, both models can be used to predict the customer brand preference for the incomplete survey data.

7. Prediction of customer brand preference

As the complete and incomplete survey datasets have equal distributions, the predicted brand preferences are expected to show the same proportions (about 1/3 Acer and 2/3 Sony) as the complete survey data. This holds true, as indicated by the following pie charts comparing the actual brand values of the complete survey data with the predicted brand preferences for the incomplete survey data.
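
A minimal sketch of this prediction step, assuming the pre-processed incomplete survey data is named iData (hypothetical):

p_brand <- predict(mRF, newdata = iData)                    #predict the missing brand preferences
round(prop.table(table(p_brand)), 2)                        #proportions of 'Acer' vs. 'Sony'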

8. Comparison of complete and incomplete survey data

Looking into more detail and comparing the distribution of ‘brand’ for each feature, one can see that there are no differences between the complete data and the predicted distribution for the incomplete data. This is shown in the following plots A and B.