1 Introduction

In this project we have been asked to use data mining tools to solve a problem for the electronic devices company Blackwell Electronics. The company has run several customer surveys in order to get to know its customers better, but some surveys contain an unanswered question, specifically which computer brand the respondent prefers. We can use the answers to the other survey questions (e.g. income, age, etc.) to predict the answer to the brand preference question.

What are we trying to predict?

We need to predict the brand preferences for the incomplete survey responses.

What type of problem is it? Classification or Regression? Binary or Multi-class? Uni-variate or Multi-variate?

It is a binary classification problem with multiple features.

What type of data do we have?

We have three files: one to build and train the predictive models (CompleteResponses), another that explains the survey questions (SurveyKey), and a last one containing the data set for which we have to predict the brand preferences (Survey_incomplete).

2 Load libraries

First, we need to load the packages that will be used during the project.
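The library chunk itself is not echoed in the report; a minimal sketch, inferred from the functions used later (glimpse(), ggplot(), createDataPartition(), train(), varImp()), could be:

library(dplyr)        # data manipulation, glimpse()
library(ggplot2)      # plots for the EDA section
library(caret)        # createDataPartition(), trainControl(), train(), confusionMatrix()
library(C50)          # backend for caret's "C5.0" method
library(randomForest) # backend for caret's "rf" method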

3 Import Data

The next step is to load the data we are going to work with.
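As a hedged example (the actual paths and file formats are not shown in the report; the names below are assumptions):

complete_responses <- read.csv("CompleteResponses.csv")  # assumed file name
survey_key         <- read.csv("SurveyKey.csv")          # assumed file name/format; used only as a reference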

4 Initial exploration of data

With the function glimpse() we can check the dimensions and the structure of our data set.
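Assuming the data frame name introduced above, the call that produces the output below is simply:

glimpse(complete_responses)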

## Observations: 9,898
## Variables: 7
## $ salary  <dbl> 119806.54, 106880.48, 78020.75, 63689.94, 50873.62, 130812....
## $ age     <dbl> 45, 63, 23, 51, 20, 56, 24, 62, 29, 41, 48, 52, 52, 33, 62,...
## $ elevel  <dbl> 0, 1, 0, 3, 3, 3, 4, 3, 4, 1, 4, 1, 3, 4, 2, 1, 2, 1, 2, 0,...
## $ car     <dbl> 14, 11, 15, 6, 14, 14, 8, 3, 17, 5, 16, 6, 20, 13, 6, 11, 7...
## $ zipcode <dbl> 4, 6, 2, 5, 4, 3, 5, 0, 0, 4, 5, 0, 4, 3, 3, 4, 7, 2, 8, 2,...
## $ credit  <dbl> 442037.71, 45007.18, 48795.32, 40888.88, 352951.50, 135943....
## $ brand   <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1,...

To finish the initial exploration we will check for missing values and, in case we find any, handle them.
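A common one-liner for this check, consistent with the output below, is:

sum(is.na(complete_responses))   # total number of missing cells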

## [1] 0

There are no missing values, so we can proceed with the data exploration.

5 Pre-processing

In this part we are going to prepare the data for our analysis.

First, we are going to rename the attributes.
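A sketch of the renaming step; the new names match the names() output below:

names(complete_responses) <- c("salary", "age", "elevel", "car", "zipcode", "brand")
names(complete_responses)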

## [1] "salary"  "age"     "elevel"  "car"     "zipcode" "credit"  "brand"

What about the data types? Does the stored type correspond to each attribute's meaning? With the previous glimpse() call we verified that every column is stored as a double, but from the SurveyKey data set we know that the education level classifies people into 5 categories, the car attribute corresponds to the main car brand, and the ZIP code indicates which region each household lives in. Finally, our dependent variable indicates which brand is preferred. Therefore, these four variables must be converted to categorical (factor) variables.

At the same time, the labels of these variables will be changed according to the SurveyKey data set.
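A minimal sketch of the conversion, assuming the 0/1 coding of brand maps to Acer/Sony as in the SurveyKey (the full label sets for elevel, car and zipcode are omitted for brevity):

complete_responses <- complete_responses %>%
  mutate(
    elevel  = factor(elevel),   # 5 education levels
    car     = factor(car),      # main car brand
    zipcode = factor(zipcode),  # region of residence
    brand   = factor(brand, levels = c(0, 1), labels = c("Acer", "Sony"))
  )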

6 Exploratory Data Analysis (EDA)

In this part we will check how the data is distributed. We will create some plots using the "ggplot2" package.

6.1 Histograms; numerical variables

By modifying the aesthetics of geom_histogram() we can observe the relation of each numerical variable with the dependent variable.
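A hedged example of one such histogram, filling each salary bin by brand so the per-bin proportions are visible:

ggplot(complete_responses, aes(x = salary, fill = brand)) +
  geom_histogram(bins = 30, position = "fill") +
  labs(y = "proportion")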

There is a relationship between salary and brand. In fact, customers with the lowest salaries (around 10,000) and the highest salaries (above 130,000) prefer Sony almost 100% of the time. People with salaries between 30,000 and 50,000 or between 110,000 and 130,000 still prefer Sony, but less strongly, and the remaining people prefer Acer.

6.2 Scatter plots; numerical data

A scatter plot can relate two variables, coloured by the dependent variable, to check these relations.
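A sketch of the age vs. salary plot (the exact aesthetics are assumptions):

ggplot(complete_responses, aes(x = age, y = salary, colour = brand)) +
  geom_point(alpha = 0.3)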

From this plot the following can be observed:

  • Age < 40 and 50,000 < Salary < 100,000 -> Acer
  • 40 < Age < 60 and 80,000 < Salary < 120,000 -> Acer
  • 60 < Age and Salary < 70,000 -> Acer

Now it is time to check the salary vs. credit scatter plot.

We can conclude the following:

  • Salary > 120,000, across the whole credit range -> Sony

Finally, we check the age vs. credit scatter plot; nothing conclusive can be observed there.

6.3 Boxplots; numerical data

The last plot type used to check the numerical data is the boxplot. In this type of plot, outlier values of each variable can be detected; in this case no outliers were found.

The relation of the numerical variables with the dependent variable can also be observed in boxplots.
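For example, a hedged sketch of the salary boxplot split by brand:

ggplot(complete_responses, aes(x = brand, y = salary)) +
  geom_boxplot()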

It can be observed that people who prefer Sony have higher salaries than people who prefer Acer.

6.4 Bar chart; categorical data

In order to see how the categorical data is distributed, the most commonly used plot is the bar chart.
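A minimal sketch for the dependent variable:

ggplot(complete_responses, aes(x = brand)) +
  geom_bar()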

This plot shows the differences in brand preferences.

7 Training and testing sets

Once the data is preprocessed, it's time to create the training and testing sets for the predictive models.

First, the seed must be set. The seed is a number chosen as the starting point for generating a sequence of random numbers; it also helps others reproduce the same results.

The next step is to split the data into two sets, a training set and a testing set. A common split is 75/25, which means that 75% of the data goes into the training set and 25% into the test set. For that, the createDataPartition() function is used, which performs a stratified random split of the data.
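A sketch of both steps (the seed value and object names are assumptions):

set.seed(123)   # assumed seed; the report does not show the actual value
in_train <- createDataPartition(complete_responses$brand, p = 0.75, list = FALSE)
training <- complete_responses[in_train, ]
testing  <- complete_responses[-in_train, ]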

Now it is time to run the different algorithms.

8 Modelling

We will run 3 different methods and then compare them. The best one will be used to predict the brand preferences in the incomplete survey.

These algorithms can only be compared if they all use the same resampling method and the same number of repetitions. The resampling method can be set with the trainControl() function.
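Consistent with the resampling summary printed later (10-fold cross-validation, repeated 1 time), the control object could look like:

fit_control <- trainControl(method = "repeatedcv", number = 10, repeats = 1)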

8.1 CART decision tree

The first method to be trained is the Classification and Regression Tree (CART) algorithm. Any algorithm is run through the train() function; in this case the method is called "rpart".

In the first iteration we perform the feature selection. With the varImp() function we can find out which variables are important and which are not.
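A hedged sketch of this first fit (object names are assumptions; the tuning details may differ from the printed output):

dt_fit <- train(brand ~ ., data = training,
                method = "rpart",
                trControl = fit_control)
varImp(dt_fit)
dt_fit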

## CART 
## 
## 7424 samples
##    6 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6682, 6682, 6682, 6681, 6682, 6682, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6858819  0.2594072
## 
## Tuning parameter 'cp' was held constant at a value of 0.09425451

It can be observed that only the age and salary variables are significant enough to be taken into account. So we modify the training and testing sets to reduce noise and improve accuracy, and re-run the algorithm.

It is also possible to normalize the numeric variables. In this case salary and age are numeric, so a preprocessing step is applied to them.
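A sketch of the re-run on the reduced feature set, with centring and scaling as one plausible normalization:

dt_fit2 <- train(brand ~ salary + age, data = training,
                 method = "rpart",
                 preProcess = c("center", "scale"),
                 trControl = fit_control)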

Once the training is done, the model is used to predict on the testing set.

Finally, the confusion matrix is generated and its statistics are checked.
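As a sketch:

dt_pred <- predict(dt_fit2, newdata = testing)
confusionMatrix(dt_pred, testing$brand)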

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  899  435
##       Sony   37 1103
##                                           
##                Accuracy : 0.8092          
##                  95% CI : (0.7932, 0.8245)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6256          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9605          
##             Specificity : 0.7172          
##          Pos Pred Value : 0.6739          
##          Neg Pred Value : 0.9675          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3634          
##    Detection Prevalence : 0.5392          
##       Balanced Accuracy : 0.8388          
##                                           
##        'Positive' Class : Acer            
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Sony Acer
##       Sony 1103   37
##       Acer  435  899
##                                           
##                Accuracy : 0.8092          
##                  95% CI : (0.7932, 0.8245)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6256          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7172          
##             Specificity : 0.9605          
##          Pos Pred Value : 0.9675          
##          Neg Pred Value : 0.6739          
##              Prevalence : 0.6217          
##          Detection Rate : 0.4458          
##    Detection Prevalence : 0.4608          
##       Balanced Accuracy : 0.8388          
##                                           
##        'Positive' Class : Sony            
## 
##   Prediction Reference Freq
## 1       Sony      Sony 1103
## 2       Acer      Sony  435
## 3       Sony      Acer   37
## 4       Acer      Acer  899

8.2 C50

The next method to be trained is the C5.0 decision tree algorithm. In this case, the method passed to the train() function is called "C5.0".

As with the CART decision tree, in the first iteration the variable importance has to be checked with the varImp() function.

It can be observed that only the age and salary variables are significant enough to be taken into account, just as with the CART decision tree. So we reuse the previously modified training and testing sets.

As before, it is also possible to normalize the numeric variables; salary and age are numeric, so the same preprocessing step is applied.

This C5.0 decision tree is trained with an automatic tuning grid, using a tuneLength of 2.
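A hedged sketch, mirroring the CART call:

c50_fit <- train(brand ~ salary + age, data = training,
                 method = "C5.0",
                 preProcess = c("center", "scale"),
                 tuneLength = 2,
                 trControl = fit_control)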

Once the model is trained, it is used to predict on the testing set, exactly as in the CART sketch above.

Finally, the confusion matrix is generated and its statistics are checked.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  830   83
##       Sony  106 1455
##                                           
##                Accuracy : 0.9236          
##                  95% CI : (0.9124, 0.9338)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8368          
##                                           
##  Mcnemar's Test P-Value : 0.1095          
##                                           
##             Sensitivity : 0.8868          
##             Specificity : 0.9460          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.9321          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3355          
##    Detection Prevalence : 0.3690          
##       Balanced Accuracy : 0.9164          
##                                           
##        'Positive' Class : Acer            
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Sony Acer
##       Sony 1455  106
##       Acer   83  830
##                                           
##                Accuracy : 0.9236          
##                  95% CI : (0.9124, 0.9338)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8368          
##                                           
##  Mcnemar's Test P-Value : 0.1095          
##                                           
##             Sensitivity : 0.9460          
##             Specificity : 0.8868          
##          Pos Pred Value : 0.9321          
##          Neg Pred Value : 0.9091          
##              Prevalence : 0.6217          
##          Detection Rate : 0.5881          
##    Detection Prevalence : 0.6310          
##       Balanced Accuracy : 0.9164          
##                                           
##        'Positive' Class : Sony            
## 

8.3 Random forest

The last method to be trained is the Random Forest algorithm. In this case, the method passed to the train() function is called "rf".

As with the other models, the first iteration checks the importance of the variables.

It can be observed that only the age and salary variables are significant enough to be taken into account, just as with the CART and C5.0 decision trees. So we reuse the previously modified training and testing sets.

As before, the numeric variables salary and age are normalized in a preprocessing step.
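Analogously, a sketch of the random forest fit (tuning details are not shown in the report):

rf_fit <- train(brand ~ salary + age, data = training,
                method = "rf",
                preProcess = c("center", "scale"),
                trControl = fit_control)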

Once the model is trained, it is used to predict on the testing set.

Finally, the confusion matrix is generated and its statistics are checked.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  823  124
##       Sony  113 1414
##                                           
##                Accuracy : 0.9042          
##                  95% CI : (0.8919, 0.9155)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7968          
##                                           
##  Mcnemar's Test P-Value : 0.516           
##                                           
##             Sensitivity : 0.8793          
##             Specificity : 0.9194          
##          Pos Pred Value : 0.8691          
##          Neg Pred Value : 0.9260          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3327          
##    Detection Prevalence : 0.3828          
##       Balanced Accuracy : 0.8993          
##                                           
##        'Positive' Class : Acer            
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Sony Acer
##       Sony 1414  113
##       Acer  124  823
##                                           
##                Accuracy : 0.9042          
##                  95% CI : (0.8919, 0.9155)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7968          
##                                           
##  Mcnemar's Test P-Value : 0.516           
##                                           
##             Sensitivity : 0.9194          
##             Specificity : 0.8793          
##          Pos Pred Value : 0.9260          
##          Neg Pred Value : 0.8691          
##              Prevalence : 0.6217          
##          Detection Rate : 0.5715          
##    Detection Prevalence : 0.6172          
##       Balanced Accuracy : 0.8993          
##                                           
##        'Positive' Class : Sony            
## 

9 Resamples

After making the predictions on the test set, the postResample() function is used to assess the metrics of the new predictions against the ground truth, while the resamples() function lets us compare the cross-validation results of the three models directly.
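A sketch of the comparison, using the model objects assumed above:

resamps <- resamples(list(rpart = dt_fit2, c50 = c50_fit, rf = rf_fit))
summary(resamps)   # cross-validation accuracy and kappa per model
diffs <- diff(resamps)
summary(diffs)     # pairwise differences with Bonferroni-adjusted p-values
postResample(predict(c50_fit, testing), testing$brand)   # test-set metrics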

## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: rpart, c50, rf 
## Number of resamples: 10 
## 
## Accuracy 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart 0.7913863 0.8965633 0.9002675 0.8849689 0.9077137 0.9219381    0
## c50   0.9057873 0.9164420 0.9218855 0.9214719 0.9255645 0.9407008    0
## rf    0.8894879 0.9056604 0.9111102 0.9084044 0.9147864 0.9164420    0
## 
## Kappa 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart 0.5925252 0.7843747 0.7921431 0.7649501 0.8080546 0.8351756    0
## c50   0.7979975 0.8218746 0.8331427 0.8329157 0.8421463 0.8745157    0
## rf    0.7661303 0.7988226 0.8119569 0.8055248 0.8183492 0.8236597    0
## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##       rpart   c50      rf      
## rpart         -0.03650 -0.02344
## c50   0.11281           0.01307
## rf    0.44853 0.01685          
## 
## Kappa 
##       rpart   c50      rf      
## rpart         -0.06797 -0.04057
## c50   0.12067           0.02739
## rf    0.53639 0.02006
##                Model  Accuracy     Kappa
## 1 CART decision tree 0.8849689 0.7649501
## 2 c5.0 decision tree 0.9214719 0.8329157
## 3      Random Forest 0.9084044 0.8055248

##           dt_param c50_param  rf_param
## Accuracy 0.8092158 0.9236055 0.9042037
## Kappa    0.6255760 0.8368102 0.7968166

Among the different methods used, the C5.0 method gives the best results. Therefore, the predictive model that will be applied to predict the brand preferences in the incomplete survey is the C5.0 model.

Finally, we retrain our predictive model on the whole complete-responses set, not just on the training set.
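A sketch of retraining on the full data set and evaluating it in-sample:

final_fit <- train(brand ~ salary + age, data = complete_responses,
                   method = "C5.0",
                   preProcess = c("center", "scale"),
                   tuneLength = 2,
                   trControl = fit_control)
confusionMatrix(predict(final_fit, complete_responses), complete_responses$brand)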

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer 3391  350
##       Sony  353 5804
##                                          
##                Accuracy : 0.929          
##                  95% CI : (0.9237, 0.934)
##     No Information Rate : 0.6217         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.849          
##                                          
##  Mcnemar's Test P-Value : 0.9399         
##                                          
##             Sensitivity : 0.9057         
##             Specificity : 0.9431         
##          Pos Pred Value : 0.9064         
##          Neg Pred Value : 0.9427         
##              Prevalence : 0.3783         
##          Detection Rate : 0.3426         
##    Detection Prevalence : 0.3780         
##       Balanced Accuracy : 0.9244         
##                                          
##        'Positive' Class : Acer           
## 
##                      Model Accuracy Kappa
## 1       c5.0 decision tree      NaN   NaN
## 2 Final c5.o decision tree      NaN   NaN

10 Survey incomplete

The final step is to apply the trained predictive model to the incomplete survey in order to predict the missing brand preferences.

To apply the predictive model, this data set must undergo the same transformations.

Load data.

Rename the attributes.

Change the data types and relabel the attributes.
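A compact sketch of these steps, mirroring the preprocessing of the complete responses (brand is left untouched, since it is the value we want to predict; the file name is an assumption):

survey_incomplete <- read.csv("SurveyIncomplete.csv")   # assumed file name
names(survey_incomplete) <- c("salary", "age", "elevel", "car", "zipcode", "brand")
survey_incomplete <- survey_incomplete %>%
  mutate(elevel = factor(elevel), car = factor(car), zipcode = factor(zipcode))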

Now it is time to apply the trained predictive model.
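As a sketch; type = "prob" returns the class probabilities shown below:

brand_pred <- predict(final_fit, newdata = survey_incomplete)
summary(brand_pred)   # predicted class counts
head(predict(final_fit, newdata = survey_incomplete, type = "prob"))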

## Acer Sony 
## 1885 3115
##       Acer       Sony
## 1 1.000000 0.00000000
## 2 0.000000 1.00000000
## 3 0.908265 0.09173495
## 4 0.000000 1.00000000
## 5 0.625029 0.37497097
## 6 1.000000 0.00000000