After good experiences with RapidMiner and a first venture into data mining with R, Blackwell Electronics has a deeper analytics project to tackle. The sales team conducted a survey to learn more about the brand preferences of existing customers, which in turn should inform the strategic decision on which brand to deepen the relationship with. Due to survey-related issues the dataset was not captured properly: about one third of the responses lack the brand preference field (the incomplete set). The investigation therefore examines which survey questions could enable a prediction of these missing customer preferences and, if such predictors exist, delivers the predictions themselves. For this task, C5.0 and Random Forest models are to be trained, compared, and used for prediction.
Procedure
The analysis follows the common data mining workflow: data exploration, pre-processing, modelling and optimization, prediction, and evaluation.
For the analysis, the ‘caret’ package for the statistical programming language R is used, applying the C5.0 and Random Forest algorithms.
Once a model is trained and its performance evaluated in terms of accuracy, kappa, and the confusion matrix, it is used to predict the missing customer brand preferences in the incomplete survey data.
Results
Both customer survey datasets - complete and incomplete - seem to stem from the same survey, or at least from the same sampling, as the feature distributions in both are almost identical.
Brand preference in the complete survey data splits roughly 1/3 to 2/3, independently of the customer's education level, car, zipcode, or credit. This is probably the result of a stratified sampling approach, which could lead to under- or overrepresentation since the population size of each stratum is not taken into account.
Only salary and age show a pattern: ‘Sony’ is most favoured by customers aged 20 - 40 with high income, aged 40 - 60 with low income, and aged 60 - 80 with mid to high income. In contrast, ‘Acer’ is preferred in the mid and mid-to-high income ranges for the 20 - 40 and 40 - 60 age groups respectively, and in the low income range for ages 60 - 80.
As a consequence, both algorithms prioritize salary and age and largely ignore the other features, as those carry no information supporting the classification problem.
Both algorithms achieve an accuracy of around 90 % (0.89 for C5.0 and 0.91 for Random Forest on the test set) and a kappa of around 0.8 (0.76 and 0.80 respectively). Besides this, they misclassify a similar share of each class (roughly 8 - 12 %). Hence both can be used to predict the missing customer preference values in the incomplete survey data.
The split among the predicted customer preferences turns out to be 1/3 ‘Acer’ and 2/3 ‘Sony’, which equals the split in the complete survey data stated above.
A deeper dive into the distribution of the predicted customer preferences across all features paints the same picture as the complete customer survey data.
Overview of the datasets:
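These overviews presumably stem from str() and summary(); a minimal sketch, assuming file and object names, since the loading step is not shown in this extract:

complete   <- read.csv("CompleteResponses.csv")   # hypothetical file name
incomplete <- read.csv("SurveyIncomplete.csv")    # hypothetical file name
# brand is assumed to arrive as 0/1 in the complete data and is labelled here
complete$brand <- factor(complete$brand, levels = c(0, 1), labels = c("Acer", "Sony"))
str(complete);   summary(complete)    # printed below
str(incomplete); summary(incomplete)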
Complete survey data:
## 'data.frame': 9898 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : int 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : int 0 1 0 3 3 3 4 3 4 1 ...
## $ car : int 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: int 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : Factor w/ 2 levels "Acer","Sony": 1 2 1 2 1 2 2 2 1 2 ...
## salary age elevel car
## Min. : 20000 Min. :20.00 Min. :0.000 Min. : 1.00
## 1st Qu.: 52082 1st Qu.:35.00 1st Qu.:1.000 1st Qu.: 6.00
## Median : 84950 Median :50.00 Median :2.000 Median :11.00
## Mean : 84871 Mean :49.78 Mean :1.983 Mean :10.52
## 3rd Qu.:117162 3rd Qu.:65.00 3rd Qu.:3.000 3rd Qu.:15.75
## Max. :150000 Max. :80.00 Max. :4.000 Max. :20.00
## zipcode credit brand
## Min. :0.000 Min. : 0 Acer:3744
## 1st Qu.:2.000 1st Qu.:120807 Sony:6154
## Median :4.000 Median :250607
## Mean :4.041 Mean :249176
## 3rd Qu.:6.000 3rd Qu.:374640
## Max. :8.000 Max. :500000
Incomplete survey data:
## 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 110500 140894 119160 20000 93956 ...
## $ age : int 54 44 49 56 59 71 32 33 32 58 ...
## $ elevel : int 3 4 2 0 1 2 1 4 1 2 ...
## $ car : int 15 20 1 9 15 7 17 17 19 8 ...
## $ zipcode: int 4 7 3 1 1 2 1 0 2 4 ...
## $ credit : num 354724 395015 122025 99630 458680 ...
## $ brand : int 0 0 0 0 0 0 0 0 0 0 ...
## salary age elevel car
## Min. : 20000 Min. :20.00 Min. :0.000 Min. : 1.00
## 1st Qu.: 52242 1st Qu.:35.00 1st Qu.:1.000 1st Qu.: 6.00
## Median : 85969 Median :50.00 Median :2.000 Median :11.00
## Mean : 85560 Mean :49.87 Mean :2.011 Mean :10.58
## 3rd Qu.:118380 3rd Qu.:65.00 3rd Qu.:3.000 3rd Qu.:16.00
## Max. :150000 Max. :80.00 Max. :4.000 Max. :20.00
## zipcode credit brand
## Min. :0.000 Min. : 0 Min. :0
## 1st Qu.:2.000 1st Qu.:121879 1st Qu.:0
## Median :4.000 Median :250871 Median :0
## Mean :4.043 Mean :249510 Mean :0
## 3rd Qu.:6.000 3rd Qu.:375425 3rd Qu.:0
## Max. :8.000 Max. :500000 Max. :0
The complete dataset contains 0 missing values, 0 duplicated rows, and 0 duplicated columns.
The incomplete dataset contains 0 missing values, 0 duplicated rows, and 0 duplicated columns.
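One way to obtain these figures, as a minimal sketch:

sum(is.na(complete))                 # missing values
sum(duplicated(complete))            # fully duplicated rows
sum(duplicated(as.list(complete)))   # duplicated columns
sum(is.na(incomplete))               # same checks for the incomplete data
sum(duplicated(incomplete))
sum(duplicated(as.list(incomplete)))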
Scatterplots of selected features:
Plotting salary and age against brand reveals some patterns in the data, whereas all other features plotted with salary/age against brand show (almost) no pattern. This suggests that only salary and age are of meaningful value and carry information for the classification problem of predicting customer brand preference in the incomplete survey data.
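A sketch of the salary/age scatterplot described above, assuming ggplot2 (the plotting code itself is not shown in this extract):

library(ggplot2)
# colour the salary/age point cloud by brand preference
ggplot(complete, aes(x = age, y = salary, colour = brand)) +
  geom_point(alpha = 0.4) +
  labs(title = "Brand preference by age and salary", colour = "Brand")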
Histograms of all features:
This overview of histograms for all features confirms the hypothesis stated before: only salary and age show some kind of pattern in the distribution of brand.
Boxplot of features:
Neither the complete survey data (plots A - C) nor the incomplete data (plots D - F) contain any outliers, as the following boxplots show.
Data types and normalization:
The structure of the data after factorization and normalization is as follows (first block = complete, second = incomplete):
## 'data.frame': 9898 obs. of 7 variables:
## $ salary : num 0.926 0.584 -0.182 -0.562 -0.901 ...
## $ age : num -0.2716 0.7514 -1.5218 0.0694 -1.6923 ...
## $ elevel : Factor w/ 5 levels "0","1","2","3",..: 1 2 1 4 4 4 5 4 5 2 ...
## $ car : Factor w/ 20 levels "1","2","3","4",..: 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 5 7 3 6 5 4 6 1 1 5 ...
## $ credit : num 1.328 -1.406 -1.38 -1.434 0.715 ...
## $ brand : Factor w/ 2 levels "Acer","Sony": 1 2 1 2 1 2 2 2 1 2 ...
## 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 0.659 1.462 0.888 -1.732 0.222 ...
## $ age : num 0.2339 -0.3323 -0.0492 0.3471 0.5169 ...
## $ elevel : Factor w/ 5 levels "0","1","2","3",..: 4 5 3 1 2 3 2 5 2 3 ...
## $ car : Factor w/ 20 levels "1","2","3","4",..: 15 20 1 9 15 7 17 17 19 8 ...
## $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 5 8 4 2 2 3 2 1 3 5 ...
## $ credit : num 0.721 0.997 -0.874 -1.027 1.434 ...
## $ brand : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
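A minimal sketch of the factorization and z-score normalization that would produce the structures above (the N suffix for the normalized copies is an assumption):

completeN   <- complete
incompleteN <- incomplete
for (col in c("elevel", "car", "zipcode")) {   # categorical survey answers become factors
  completeN[[col]]   <- as.factor(completeN[[col]])
  incompleteN[[col]] <- as.factor(incompleteN[[col]])
}
incompleteN$brand <- as.factor(incompleteN$brand)   # single level "0", as all values are missing
for (col in c("salary", "age", "credit")) {    # z-score normalization of the numeric features
  completeN[[col]]   <- as.numeric(scale(completeN[[col]]))
  incompleteN[[col]] <- as.numeric(scale(incompleteN[[col]]))
}

For prediction, scaling the incomplete data with the complete data's means and standard deviations would be the cleaner choice; since both distributions are nearly identical here, the difference is negligible.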
No feature transformation/engineering was applied in the first iteration. Later on, a clustering of certain features was conducted but did not unveil any patterns in the remaining data (‘elevel’, ‘car’, ‘zipcode’, ‘credit’) beneficial for the classification. A sketch of the procedure executed is included in the following.
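A sketch of such a clustering attempt (k-means over one-hot encoded features; the choice of kmeans and k = 3 are assumptions, as the exact procedure is not echoed here):

# one-hot encode the factor features together with credit
clusterIn <- model.matrix(~ elevel + car + zipcode + credit - 1, data = completeN)
set.seed(123)
km <- kmeans(clusterIn, centers = 3, nstart = 25)
table(km$cluster, completeN$brand)   # check whether clusters separate brand preference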
Training and test sets have been compiled as follows:
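The split itself is not echoed in this extract; a minimal sketch using caret's createDataPartition, with the 75/25 ratio inferred from the 7424 training and 2474 test observations reported below:

library(caret)
set.seed(123)                                    # seed value is an assumption
inTrain <- createDataPartition(completeN$brand, p = 0.75, list = FALSE)
trainS  <- completeN[inTrain, ]                  # 7424 observations
testS   <- completeN[-inTrain, ]                 # 2474 observations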
I. C5.0
The model has been built by applying the C5.0 decision tree to the training set, using 10-fold cross-validation (repeated 3 times) and an automatic tuning grid with tuneLength = 2.
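A sketch of a caret call consistent with the iteration trace below and the model summary in the results section (which caret prints as Stochastic Gradient Boosting, i.e. method = "gbm"; the object names and control settings are assumptions):

ctrl_cv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
mC5 <- train(brand ~ .,
             data = trainS,        # all six predictors, per the model summary below
             method = "gbm",       # the summary reports Stochastic Gradient Boosting
             metric = "Accuracy",
             tuneLength = 2,       # automatic tuning grid with tuneLength = 2
             trControl = ctrl_cv)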
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2881 nan 0.1000 0.0178
## 2 1.2567 nan 0.1000 0.0151
## 3 1.2280 nan 0.1000 0.0139
## 4 1.2051 nan 0.1000 0.0110
## 5 1.1846 nan 0.1000 0.0100
## 6 1.1677 nan 0.1000 0.0085
## 7 1.1522 nan 0.1000 0.0074
## 8 1.1383 nan 0.1000 0.0069
## 9 1.1260 nan 0.1000 0.0053
## 10 1.1161 nan 0.1000 0.0048
## 20 1.0412 nan 0.1000 0.0022
## 40 0.8895 nan 0.1000 0.0040
## 60 0.7457 nan 0.1000 0.0001
## 80 0.6512 nan 0.1000 0.0029
## 100 0.6088 nan 0.1000 0.0000
II. RF
A Random Forest model has been trained with repeated 10-fold cross-validation and six different values for mtry (2, sqrt of the number of columns ≈ 2.65, 5, 7, 9, and 15).
tg_mtry <- expand.grid(.mtry = c(2, 5, 7, 9, sqrt(ncol(trainS)), 15)) # tuneGrid with candidate values for mtry
ctrl_rf <- trainControl(method = "repeatedcv",  # type of resampling: repeated cross-validation
                        repeats = 3,
                        number = 10,            # number of folds K = 10 (the default)
                        #summaryFunction = twoClassSummary, # estimates performance from observed and
                                                            # predicted values; meant for 2-class problems
                        classProbs = FALSE      # no predicted class probabilities (only needed for AUC/ROC)
                        )
mRF <- train(brand ~ .,
             data = trainS[ , c(1, 2, 7)],     # use only salary, age, and brand
             method = "rf",
             #preProc = c("center", "scale"),  # center and scale the predictors for the training
                                               # set and all future samples
             metric = "Accuracy",
             #tuneLength = 2,                  # automatic tuning grid with a tuneLength of 2
             tuneGrid = tg_mtry,
             trControl = ctrl_rf
             )
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid
## mtry: reset to within valid range
(The warning above stems from most candidate mtry values exceeding the two predictors actually used, salary and age; randomForest resets them internally, which helps explain the nearly identical accuracies across mtry below.) To evaluate model performance, Accuracy and Kappa scores are recorded for each parameter value used during training. In addition, accuracy is plotted against the parameter configuration and the function varImp() is used to estimate how the model prioritized each feature during training. One can conclude that salary and age are the most important features for the model, whereas all others have no impact, as they do not carry any information/differences in the brand distribution (cf. data exploration - histograms).
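A sketch of these evaluation steps, assuming the model objects mC5 and mRF from above:

plot(mC5)     # accuracy against the tuning parameters
varImp(mC5)   # scaled variable importance
plot(mRF)
varImp(mRF)   # printed in the RF results below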
I. C5.0
## Stochastic Gradient Boosting
##
## 7424 samples
## 6 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6682, 6681, 6682, 6681, 6682, 6681, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7310528 0.4338832
## 1 100 0.7305142 0.4314887
## 2 50 0.8187469 0.6207274
## 2 100 0.8831279 0.7560998
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
II. RF
To evaluate model performance, Accuracy and Kappa scores are again recorded for each parameter value used during training, and varImp() is used to estimate how the model prioritized each feature.
## Random Forest
##
## 7424 samples
## 2 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6681, 6681, 6683, 6682, 6683, 6681, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2.000000 0.9105190 0.8100066
## 2.645751 0.9105641 0.8101578
## 5.000000 0.9109222 0.8108954
## 7.000000 0.9106081 0.8101809
## 9.000000 0.9109681 0.8109326
## 15.000000 0.9108775 0.8107631
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 9.
## rf variable importance
##
## Overall
## salary 100
## age 0
For model evaluation, the performance of the best model iteration and the confusion matrix on the test set are calculated and included below for each model (C5.0, RF).
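The test-set predictions fed into the confusion matrices are obtained with predict(); a minimal sketch using the object names that appear in the calls below:

p_mC5 <- predict(mC5, newdata = testS)   # class predictions on the hold-out set
p_mRF <- predict(mRF, newdata = testS)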
I. C5.0
confusionMatrix(p_mC5, testS$brand)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 840 183
## Sony 96 1355
##
## Accuracy : 0.8872
## 95% CI : (0.8741, 0.8994)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7645
##
## Mcnemar's Test P-Value : 2.623e-07
##
## Sensitivity : 0.8974
## Specificity : 0.8810
## Pos Pred Value : 0.8211
## Neg Pred Value : 0.9338
## Prevalence : 0.3783
## Detection Rate : 0.3395
## Detection Prevalence : 0.4135
## Balanced Accuracy : 0.8892
##
## 'Positive' Class : Acer
##
II. RF
confusionMatrix(p_mRF, testS$brand)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 829 123
## Sony 107 1415
##
## Accuracy : 0.907
## 95% CI : (0.8949, 0.9182)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.803
##
## Mcnemar's Test P-Value : 0.3226
##
## Sensitivity : 0.8857
## Specificity : 0.9200
## Pos Pred Value : 0.8708
## Neg Pred Value : 0.9297
## Prevalence : 0.3783
## Detection Rate : 0.3351
## Detection Prevalence : 0.3848
## Balanced Accuracy : 0.9029
##
## 'Positive' Class : Acer
##
The calculated measures show that both applied models - C5.0 and Random Forest - perform similarly, as indicated by nearly equal values for Accuracy and Kappa. The first model achieves an Accuracy of 0.8872 and a Kappa of 0.7645, whereas the second comes with an Accuracy of 0.9070 and a Kappa of 0.8030. Thus, both models can be used to predict customer brand preference for the incomplete survey data.
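A sketch of the final prediction step on the incomplete survey data (using the normalized copy incompleteN from above; the Random Forest model is taken here, though either model would do):

p_inc <- predict(mRF, newdata = incompleteN)   # predicted brand preferences for the 5000 incomplete responses
round(prop.table(table(p_inc)), 2)             # expected split: about 1/3 Acer, 2/3 Sony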
As the complete and incomplete survey datasets have equal feature distributions, the predicted brand preferences are expected to show the same proportions (about 1/3 Acer and 2/3 Sony) as the complete survey data. This holds true, as indicated by the following pie charts comparing the actual brand values of the complete survey data with the predicted brand preferences for the incomplete survey data.
Looking into more detail and comparing the distribution of ‘brand’ for each feature, one can see that there are no differences between the actual distribution in the complete data and the predicted distribution in the incomplete data. This is shown in the following plots A and B.