1 Introduction

The aim of this project is to determine the preferred computer brand of our customers. For this purpose, Blackwell Electronics’ marketing team conducted a survey that collected information about several characteristics of our customers. This information will help Blackwell Electronics decide which manufacturer it should pursue a deeper strategic relationship with. However, due to a data-recording error, the answer about the preferred brand (between two possibilities) was lost on part of the surveys. The available data is contained in two separate datasets: the first contains the complete surveys, while the second contains all the information except the preferred brand.

The information registered on both datasets comprises the customers’ salary, age, education level (elevel), car, zipcode and credit (see the annex for further information about the codification of the variables).

The preferred brand of the customers is only recorded on the complete survey, and its codification is 0 for Acer and 1 for Sony.

The information from the complete surveys will be used to predict the missing information on the incomplete surveys through two different machine learning algorithms: k-NN and Random Forest.

In order to perform the analysis, the libraries required for this project are:

library(ggplot2)
library(party)
library(caret)
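
Note that caret’s "rf" method relies on the randomForest package behind the scenes; caret loads it on demand, but the dependency can be made explicit:

library(randomForest)  # backend used by caret's method = "rf"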

2 Preprocessing

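Assuming the surveys are distributed as CSV files (the file names below are assumptions), a minimal import could be:

# Hypothetical file names; adjust to the actual locations of the surveys
complete   <- read.csv("CompleteResponses.csv")
incomplete <- read.csv("SurveyIncomplete.csv")
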
After importing the datasets into “complete” and “incomplete” data frames, the first step is to check the structure of the variables on the complete data:

str(complete)
## 'data.frame':    10000 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : int  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : int  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : int  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: int  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : int  0 1 0 1 0 1 1 1 0 1 ...

The variables elevel, car, zipcode and brand were imported as integers. They must be converted into factors to be treated as categorical variables:

# Convert elevel, car and zipcode (columns 3 to 5) into factors
complete[, 3:5] <- lapply(complete[, 3:5], factor)
# Convert brand into a labeled factor: 0 = Acer, 1 = Sony
complete$brand <- factor(complete$brand, levels = c(0, 1), labels = c("Acer", "Sony"))

3 Exploratory data analysis

The next step is to check the distribution of the data, both with summary and through visualization. This will guide the mining approach used to extract knowledge from the data:

summary(complete)
##      salary            age        elevel        car          zipcode    
##  Min.   : 20000   Min.   :20.00   0:2075   15     : 544   6      :1167  
##  1st Qu.: 52109   1st Qu.:35.00   1:1965   18     : 530   8      :1144  
##  Median : 84969   Median :50.00   2:2003   17     : 517   2      :1125  
##  Mean   : 84897   Mean   :49.81   3:1967   8      : 516   5      :1113  
##  3rd Qu.:117168   3rd Qu.:65.00   4:1990   2      : 513   4      :1103  
##  Max.   :150000   Max.   :80.00            5      : 511   0      :1097  
##                                            (Other):6869   (Other):3251  
##      credit        brand     
##  Min.   :     0   Acer:3783  
##  1st Qu.:121155   Sony:6217  
##  Median :250607              
##  Mean   :249245              
##  3rd Qu.:374872              
##  Max.   :500000              
## 
myplots <- list()
for (i in 1:7) local({
  i <- i  # freeze the current loop index inside this local environment
  if (class(complete[,i]) == "numeric" | class(complete[,i]) == "integer"){
    # Numeric variables: histogram with 10 bins, colored by brand preference
    plots <- ggplot(complete, aes(x = complete[,i], fill = brand)) + 
      geom_histogram(binwidth = (max(complete[,i]) - min(complete[,i]))/10) +
      xlab(colnames(complete)[i])
  } else {
    # Factor variables: bar chart, colored by brand preference
    plots <- ggplot(complete, aes(x = complete[,i], fill = brand)) + geom_bar() +
      xlab(colnames(complete)[i])
  }
  print(plots)
  myplots[[i]] <<- plots  # <<- assigns into the global list from within local()
})

The distribution of the variables shows that salary seems to be the only aspect that influences the brand preference. Nevertheless, a decision tree will be used to perform a deeper analysis (the depth of the tree is restricted to three levels to keep the plot readable):

tree <- ctree(brand ~ ., complete, controls = ctree_control(maxdepth = 3))
plot(tree)

It can be seen that not only salary influences the brand preference of Blackwell Electronics customers: age also plays an important role.

Moreover, if the variables salary and age are plotted together the following pattern can be seen:

ggplot(complete, aes(x = age, y = salary, color = brand)) + geom_point()

The interaction of these two variables forms a clearly defined pattern, summarized in the following table:

Salary      Age 20-40   Age 40-60               Age 60-80
20k-60k     Sony        Sony                    Acer
60k-100k    Acer        ~50% Sony / ~50% Acer   ~50% Sony / ~50% Acer
100k-150k   Sony        ~50% Sony / ~50% Acer   Sony
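
A table like this can be reproduced directly from the data by discretizing both variables and cross-tabulating them against brand; a minimal sketch using the same bands as above:

# Discretize salary and age into the bands of the table above
salary_band <- cut(complete$salary,
                   breaks = c(20000, 60000, 100000, 150000),
                   labels = c("20k-60k", "60k-100k", "100k-150k"),
                   include.lowest = TRUE)
age_band <- cut(complete$age,
                breaks = c(20, 40, 60, 80),
                labels = c("20-40", "40-60", "60-80"),
                include.lowest = TRUE)
# Proportion of customers preferring Sony in each salary/age cell
round(tapply(complete$brand == "Sony", list(salary_band, age_band), mean), 2)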

4 Modeling

It is time to build the models that allow the prediction of the customer brand preference. Before that, it is necessary to split the data into train and test sets in order to detect and prevent overfitting:

set.seed(123)
indexes <- createDataPartition(complete$brand, p = 0.75, list = F)

training <- complete[indexes,]
test     <- complete[-indexes,]
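
As a sanity check (not part of the original analysis), createDataPartition stratifies on the outcome, so both subsets should keep the original brand proportions (roughly 38% Acer / 62% Sony):

prop.table(table(training$brand))
prop.table(table(test$brand))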

4.1 k-NN models

Let’s build and compare different k-NN models with different preprocessing options.

4.1.1 k-NN with all the variables

knn.model1 <- train(brand ~., data = training, method = "knn")
knn.model1
## k-Nearest Neighbors 
## 
## 7501 samples
##    6 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6580912  0.2710797
##   7  0.6680733  0.2906814
##   9  0.6739988  0.3024616
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.
knn.pred1 <- predict(knn.model1, newdata = test)

confusionMatrix(test$brand, knn.pred1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  548  397
##       Sony  370 1184
##                                           
##                Accuracy : 0.6931          
##                  95% CI : (0.6746, 0.7111)
##     No Information Rate : 0.6327          
##     P-Value [Acc > NIR] : 1.263e-10       
##                                           
##                   Kappa : 0.3437          
##  Mcnemar's Test P-Value : 0.3478          
##                                           
##             Sensitivity : 0.5969          
##             Specificity : 0.7489          
##          Pos Pred Value : 0.5799          
##          Neg Pred Value : 0.7619          
##              Prevalence : 0.3673          
##          Detection Rate : 0.2193          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.6729          
##                                           
##        'Positive' Class : Acer            
## 

4.1.2 k-NN with all the variables normalized

knn.model2 <- train(brand ~., data = training, method = "knn", preProcess = c("center", "scale"))
knn.model2
## k-Nearest Neighbors 
## 
## 7501 samples
##    6 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa     
##   5  0.5493034  0.03925497
##   7  0.5580448  0.04682471
##   9  0.5671442  0.05610268
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.
knn.pred2 <- predict(knn.model2, newdata = test)

confusionMatrix(test$brand, knn.pred2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  291  654
##       Sony  403 1151
##                                           
##                Accuracy : 0.577           
##                  95% CI : (0.5574, 0.5965)
##     No Information Rate : 0.7223          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0513          
##  Mcnemar's Test P-Value : 1.476e-14       
##                                           
##             Sensitivity : 0.4193          
##             Specificity : 0.6377          
##          Pos Pred Value : 0.3079          
##          Neg Pred Value : 0.7407          
##              Prevalence : 0.2777          
##          Detection Rate : 0.1164          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.5285          
##                                           
##        'Positive' Class : Acer            
## 

4.1.3 k-NN with salary and age

knn.model3 <- train(brand ~ salary + age, data = training, method = "knn")
knn.model3
## k-Nearest Neighbors 
## 
## 7501 samples
##    2 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6781369  0.3152600
##   7  0.6817040  0.3226715
##   9  0.6854230  0.3294435
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.
knn.pred3 <- predict(knn.model3, newdata = test)

confusionMatrix(test$brand, knn.pred3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  548  397
##       Sony  337 1217
##                                          
##                Accuracy : 0.7063         
##                  95% CI : (0.688, 0.7241)
##     No Information Rate : 0.6459         
##     P-Value [Acc > NIR] : 8.565e-11      
##                                          
##                   Kappa : 0.3676         
##  Mcnemar's Test P-Value : 0.02943        
##                                          
##             Sensitivity : 0.6192         
##             Specificity : 0.7540         
##          Pos Pred Value : 0.5799         
##          Neg Pred Value : 0.7831         
##              Prevalence : 0.3541         
##          Detection Rate : 0.2193         
##    Detection Prevalence : 0.3782         
##       Balanced Accuracy : 0.6866         
##                                          
##        'Positive' Class : Acer           
## 

4.1.4 k-NN with salary and age normalized

knn.model4 <- train(brand ~ salary + age, data = training, method = "knn", preProcess = c("center", "scale"))
knn.model4
## k-Nearest Neighbors 
## 
## 7501 samples
##    2 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9062063  0.8003114
##   7  0.9105253  0.8096999
##   9  0.9127478  0.8145525
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.
knn.pred4 <- predict(knn.model4, newdata = test)

confusionMatrix(test$brand, knn.pred4)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  837  108
##       Sony   99 1455
##                                           
##                Accuracy : 0.9172          
##                  95% CI : (0.9057, 0.9277)
##     No Information Rate : 0.6255          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8235          
##  Mcnemar's Test P-Value : 0.5782          
##                                           
##             Sensitivity : 0.8942          
##             Specificity : 0.9309          
##          Pos Pred Value : 0.8857          
##          Neg Pred Value : 0.9363          
##              Prevalence : 0.3745          
##          Detection Rate : 0.3349          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.9126          
##                                           
##        'Positive' Class : Acer            
## 

As a summary, the next table shows the accuracy and the kappa of all the k-NN models in the test set:

           Model 1   Model 2   Model 3   Model 4
Accuracy   0.6931    0.5770    0.7063    0.9172
Kappa      0.3437    0.0513    0.3676    0.8235

As can be seen, the models that contain all the variables, especially the model with normalized variables, have the lowest accuracy. This is because including irrelevant variables in the model adds noise to the fit. Furthermore, if the irrelevant variables elevel, car and zipcode, whose ranges are far narrower than that of salary, are normalized, they take on the same importance as salary when k-NN computes the Euclidean distance, which misleads the algorithm. The third model, which contains only salary and age without normalization, still has a low accuracy. This happens because age, despite its important role in customer preference, carries little weight in the Euclidean distance: age ranges from 20 to 80, while salary ranges from 20k to 150k. In the fourth model, however, once salary and age are normalized and thus weigh equally in the Euclidean distance, the accuracy and the kappa of the model rise sharply.
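
The effect of the scale difference can be illustrated with a toy example (hypothetical customers, not rows from the survey); min-max scaling is used here as a simple stand-in for caret’s center/scale preprocessing:

# Two hypothetical customers: a large age gap but a small salary gap
a <- c(age = 25, salary = 60000)
b <- c(age = 65, salary = 61000)

# Raw Euclidean distance: dominated almost entirely by the salary difference
sqrt(sum((a - b)^2))  # ~1000.8, although the ages differ by 40 years

# Scaling by the survey ranges (age: 20-80, salary: 20k-150k)
a_s <- c((25 - 20) / 60, (60000 - 20000) / 130000)
b_s <- c((65 - 20) / 60, (61000 - 20000) / 130000)
sqrt(sum((a_s - b_s)^2))  # ~0.667: now the age gap dominates the distance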

4.2 Random Forest models

Let’s now check the results obtained by repeating the same process with Random Forest:

4.2.1 Random Forest with all the variables

rf.model1 <- train(brand ~., data = training, method = "rf")
rf.model1
## Random Forest 
## 
## 7501 samples
##    6 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      
##    2    0.6234077  0.002066527
##   18    0.9186010  0.826835888
##   34    0.9129570  0.814730837
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 18.
rf.pred1 <- predict(rf.model1, newdata = test)

confusionMatrix(test$brand, rf.pred1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  829  116
##       Sony   81 1473
##                                           
##                Accuracy : 0.9212          
##                  95% CI : (0.9099, 0.9314)
##     No Information Rate : 0.6359          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8312          
##  Mcnemar's Test P-Value : 0.01542         
##                                           
##             Sensitivity : 0.9110          
##             Specificity : 0.9270          
##          Pos Pred Value : 0.8772          
##          Neg Pred Value : 0.9479          
##              Prevalence : 0.3641          
##          Detection Rate : 0.3317          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.9190          
##                                           
##        'Positive' Class : Acer            
## 

4.2.2 Random Forest with all the variables normalized

rf.model2 <- train(brand ~., data = training, method = "rf", preProcess = c("center", "scale"))
rf.model2
## Random Forest 
## 
## 7501 samples
##    6 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      
##    2    0.6223611  0.001821305
##   18    0.9181596  0.825937371
##   34    0.9121071  0.812998733
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 18.
rf.pred2 <- predict(rf.model2, newdata = test)

confusionMatrix(test$brand, rf.pred2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  825  120
##       Sony   78 1476
##                                           
##                Accuracy : 0.9208          
##                  95% CI : (0.9095, 0.9311)
##     No Information Rate : 0.6387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8301          
##  Mcnemar's Test P-Value : 0.003571        
##                                           
##             Sensitivity : 0.9136          
##             Specificity : 0.9248          
##          Pos Pred Value : 0.8730          
##          Neg Pred Value : 0.9498          
##              Prevalence : 0.3613          
##          Detection Rate : 0.3301          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.9192          
##                                           
##        'Positive' Class : Acer            
## 

4.2.3 Random Forest with salary and age

rf.model3 <- train(brand ~ salary + age, data = training, method = "rf")
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
rf.model3
## Random Forest 
## 
## 7501 samples
##    2 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9048754  0.7977815
## 
## Tuning parameter 'mtry' was held constant at a value of 2
rf.pred3 <- predict(rf.model3, newdata = test)

confusionMatrix(test$brand, rf.pred3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  805  140
##       Sony  110 1444
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8875, 0.9115)
##     No Information Rate : 0.6339          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.786           
##  Mcnemar's Test P-Value : 0.06664         
##                                           
##             Sensitivity : 0.8798          
##             Specificity : 0.9116          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.9292          
##              Prevalence : 0.3661          
##          Detection Rate : 0.3221          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.8957          
##                                           
##        'Positive' Class : Acer            
## 

4.2.4 Random Forest with salary and age normalized

rf.model4 <- train(brand ~ salary + age, data = training, method = "rf", preProcess = c("center", "scale"))
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
rf.model4
## Random Forest 
## 
## 7501 samples
##    2 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9052102  0.7986786
## 
## Tuning parameter 'mtry' was held constant at a value of 2
rf.pred4 <- predict(rf.model4, newdata = test)

confusionMatrix(test$brand, rf.pred4)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  806  139
##       Sony  113 1441
##                                           
##                Accuracy : 0.8992          
##                  95% CI : (0.8867, 0.9107)
##     No Information Rate : 0.6323          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7844          
##  Mcnemar's Test P-Value : 0.1153          
##                                           
##             Sensitivity : 0.8770          
##             Specificity : 0.9120          
##          Pos Pred Value : 0.8529          
##          Neg Pred Value : 0.9273          
##              Prevalence : 0.3677          
##          Detection Rate : 0.3225          
##    Detection Prevalence : 0.3782          
##       Balanced Accuracy : 0.8945          
##                                           
##        'Positive' Class : Acer            
## 

The next table shows the summary of the Random Forest models:

           Model 1   Model 2   Model 3   Model 4
Accuracy   0.9212    0.9208    0.9000    0.8992
Kappa      0.8312    0.8301    0.7860    0.7844

The different models show about the same level of accuracy. This is due to the way Random Forest works: since it is an ensemble of decision trees, it does not matter whether the variables are normalized or not. Regarding the number of variables taken by the model, there is a slight improvement in accuracy over the models that only include salary and age. This happens because Random Forest is able to exploit subtle patterns in the distribution of the variables, the same variables that only add noise to the k-NN algorithm.
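
This invariance can be checked directly: apart from the randomness inherent to each forest, the raw and the normalized models should make nearly identical predictions on the test set:

# Fraction of test predictions on which the two models agree
mean(rf.pred1 == rf.pred2)  # expected to be very close to 1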

4.3 Model Selection

Although the RF model with all the variables has the highest accuracy, the model chosen to predict the brand preferred by the customers of the incomplete survey is the k-NN model with salary and age normalized. The reason for this choice is that the k-NN model is more explainable than the RF model, whose behavior is harder to interpret. Besides, building the k-NN model is more efficient in terms of computational time.
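
The efficiency claim can be verified with system.time; a quick sketch (timings depend on hardware and are not part of the original analysis):

# Rough comparison of the training times of the two finalist models
system.time(train(brand ~ salary + age, data = training, method = "knn",
                  preProcess = c("center", "scale")))
system.time(train(brand ~ ., data = training, method = "rf"))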

5 Incomplete Survey Brand Preference Prediction

Before applying the model built in the previous section, the distribution of the variables of the incomplete survey must be checked. To do that, the same preprocessing executed on the complete survey must be applied to the incomplete one:

str(incomplete)
## 'data.frame':    5000 obs. of  7 variables:
##  $ salary : num  110500 140894 119160 20000 93956 ...
##  $ age    : int  54 44 49 56 59 71 32 33 32 58 ...
##  $ elevel : int  3 4 2 0 1 2 1 4 1 2 ...
##  $ car    : int  15 20 1 9 15 7 17 17 19 8 ...
##  $ zipcode: int  4 7 3 1 1 2 1 0 2 4 ...
##  $ credit : num  354724 395015 122025 99630 458680 ...
##  $ brand  : Factor w/ 1 level "?": 1 1 1 1 1 1 1 1 1 1 ...
incomplete[, 3:5] <- lapply(incomplete[, 3:5], factor)
# brand was imported with the placeholder level "?", which matches neither 0 nor 1,
# so the conversion below yields NA for every row (as the summary confirms)
incomplete$brand <- factor(incomplete$brand, levels = c(0, 1), labels = c("Acer", "Sony"))

summary(incomplete)
##      salary            age        elevel        car          zipcode    
##  Min.   : 20000   Min.   :20.00   0: 981   19     : 276   4      : 585  
##  1st Qu.: 52242   1st Qu.:35.00   1: 993   8      : 263   7      : 571  
##  Median : 85969   Median :50.00   2:1020   18     : 262   5      : 568  
##  Mean   : 85560   Mean   :49.87   3:1001   11     : 259   8      : 563  
##  3rd Qu.:118380   3rd Qu.:65.00   4:1005   16     : 259   1      : 549  
##  Max.   :150000   Max.   :80.00            5      : 255   3      : 548  
##                                            (Other):3426   (Other):1616  
##      credit        brand     
##  Min.   :     0   Acer:   0  
##  1st Qu.:121879   Sony:   0  
##  Median :250871   NA's:5000  
##  Mean   :249510              
##  3rd Qu.:375425              
##  Max.   :500000              
## 
# Same visualization loop as for the complete survey, without fill = brand
# (the brand is precisely the information missing here)
myplots1 <- list()
for (i in 1:6) local({
  i <- i  # freeze the current loop index inside this local environment
  if (class(incomplete[,i]) == "numeric" | class(incomplete[,i]) == "integer"){
    plots <- ggplot(incomplete, aes(x = incomplete[,i])) + 
      geom_histogram(binwidth = (max(incomplete[,i]) - min(incomplete[,i]))/10) +
      xlab(colnames(incomplete)[i])
  } else {
    plots <- ggplot(incomplete, aes(x = incomplete[,i])) + geom_bar() +
      xlab(colnames(incomplete)[i])
  }
  print(plots)
  myplots1[[i]] <<- plots
})

The data distribution is very similar to that of the complete survey, so the performance of the model on the test set can be extrapolated to the incomplete survey.

prediction <- predict(knn.model4, newdata = incomplete)
summary(prediction)
## Acer Sony 
## 1864 3136
count <- data.frame(Brand = c("Acer", "Sony"), 
                    Count = c(sum(prediction == "Acer"), sum(prediction == "Sony")))

pie <- ggplot(count, aes(x = "", y = Count, fill = Brand)) + geom_bar(width = 1, stat = "identity")

pie <- pie + coord_polar("y", start = 0) + geom_label(label = count$Count, show.legend = F)
pie

The results show that the preferred brand of the customers from the incomplete survey is Sony, with an estimated 3136 customers preferring Sony and 1864 preferring Acer.
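
Finally, the predictions can be attached to the incomplete survey and exported for the marketing team (a sketch; the output file name is an assumption):

# Fill the missing brand column with the k-NN predictions and save the result
incomplete$brand <- prediction
write.csv(incomplete, "SurveyIncomplete_predicted.csv", row.names = FALSE)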

6 Conclusions and Recommendations

Answering the business question from Blackwell Electronics: the brand preferred by the customers of the survey is Sony (6217 customers on the complete survey plus an estimated 3136 on the incomplete one, for a total of 9353). Nevertheless, the number of customers that prefer Acer is 3783 on the complete survey plus an estimated 1864 on the incomplete one, a total of 5647, which represents more than 37% of the responses. The analysis also shows that there are different clusters of customers that prefer one brand or the other depending on their characteristics. Therefore, Blackwell can not only deepen its strategic relationship with Sony but also target its marketing campaigns, especially through e-commerce, based on the age and salary range of each customer. It is also important to take into account that the results come from a survey that, given the distributions found in the variables, does not look like a random sample but rather an incidental extraction of data. To go further with the analysis, a good next step would be to compare these results with the distribution of the demographic data of Blackwell Electronics’ customer base.

7 Annex

Codification of elevel:

Codification of car variable:

Codification of zipcode: