The aim of this project is to know which is the preferred computer brand of our customers. With this purpose, Blackwell Electronics’ marketing team elaborated a survey where the the information about distinct characteristics of our customers were collected. This information will help Blackwell Electronics decide which manufacturer should pursue a deeper strategic relationship. However, due to an error of recording the data, the answer about the preferred brand of computers between two possibilities were lost on a part of the surveys. The data avialable is contained in two separated datasets. The first one contains the complete survey. The second dataset contains all the information except the preferred brand.
The information registered on both data sets is (see annex for further information about the codification of the variables):
The preferred brand of the customers is only recorded on the Complete Survey and its codification is:
The information of the complete surveys will be used to predict the missing information on the incomplete surveys through two different Machine Learning algorithms: k-NN and Random Forest.
In order to perform the analysis, the libraries required for this project are:
library(ggplot2)
library(party)
library(caret)
After importing the datasets into “complete” and “incomplete” data frames, the first step is to check the structure of the variables on the complete data:
str(complete)
## 'data.frame': 10000 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : int 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : int 0 1 0 3 3 3 4 3 4 1 ...
## $ car : int 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: int 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : int 0 1 0 1 0 1 1 1 0 1 ...
The variables elevel,car, zipcodeand brand were imported as integer. They must be converted into factor to become operative:
complete[, 3:5] <- lapply(complete[, 3:5], factor)
complete$brand <- factor(complete$brand, levels = c(0,1) , labels = c("Acer", "Sony"))
The next step is to check the distribution of the data both with summary and through visualization. This will facilitate the approach of the mining in order to extract knowledge of the data:
summary(complete)
## salary age elevel car zipcode
## Min. : 20000 Min. :20.00 0:2075 15 : 544 6 :1167
## 1st Qu.: 52109 1st Qu.:35.00 1:1965 18 : 530 8 :1144
## Median : 84969 Median :50.00 2:2003 17 : 517 2 :1125
## Mean : 84897 Mean :49.81 3:1967 8 : 516 5 :1113
## 3rd Qu.:117168 3rd Qu.:65.00 4:1990 2 : 513 4 :1103
## Max. :150000 Max. :80.00 5 : 511 0 :1097
## (Other):6869 (Other):3251
## credit brand
## Min. : 0 Acer:3783
## 1st Qu.:121155 Sony:6217
## Median :250607
## Mean :249245
## 3rd Qu.:374872
## Max. :500000
##
myplots <- list()
for (i in 1:7) local({
i <- i
if (class(complete[,i]) == "numeric" | class(complete[,i]) == "integer"){
plots <- ggplot(complete, aes(x = complete[,i], fill = brand)) +
geom_histogram(binwidth = (max(complete[,i]) - min(complete[,i]))/10) +
xlab(colnames(complete)[i])
} else {
plots <- ggplot(complete, aes(x = complete[,i], fill = brand)) + geom_bar() +
xlab(colnames(complete)[i])
}
print(plots)
myplots[[i]] <<- plots
})
The distribution of the variables shows that salary seems to be the only aspect that influences the brand preference. Nevertheless, a decision tree will be used to perform a deeper analysis (the depth of the tree is restricted to have only three levels in order to be readable):
tree <- ctree(brand ~ ., complete, controls = ctree_control(maxdepth = 3))
plot(tree)
It can be seen that not only salary influences the brand preference of Blackwell Electronics customers but also the age has an important role.
Moreover, if the variables salary and age are plotted together the following pattern can be seen:
ggplot(complete, aes(x = age, y = salary, color = brand)) + geom_point()
There is a clear definition of the pattern formed by the interaction of these two variables that is summed up on the next table:
| Salary | Age | ||
|---|---|---|---|
| 20-40 | 40-60 | 60-80 | |
| 20k-60k | Sony | Sony | Acer |
| 60k-100k | Acer | ~50% Sony/ ~50 %Acer | ~50% Sony/ ~50% Acer |
| 100k-150k | Sony | ~50%Sony/ ~50%Acer | Sony |
It is time to build the models that allow the prediction of the customer brand preference. Before that, it is necessary to split the data into train and test sets in order to check and prevent overfitting:
set.seed(123)
indexes <- createDataPartition(complete$brand, p = 0.75, list = F)
training <- complete[indexes,]
test <- complete[-indexes,]
Let’s build and compare different k-NN models with differents preprocessings.
knn.model1 <- train(brand ~., data = training, method = "knn")
knn.model1
## k-Nearest Neighbors
##
## 7501 samples
## 6 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6580912 0.2710797
## 7 0.6680733 0.2906814
## 9 0.6739988 0.3024616
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
knn.pred1 <- predict(knn.model1, newdata = test)
confusionMatrix(test$brand, knn.pred1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 548 397
## Sony 370 1184
##
## Accuracy : 0.6931
## 95% CI : (0.6746, 0.7111)
## No Information Rate : 0.6327
## P-Value [Acc > NIR] : 1.263e-10
##
## Kappa : 0.3437
## Mcnemar's Test P-Value : 0.3478
##
## Sensitivity : 0.5969
## Specificity : 0.7489
## Pos Pred Value : 0.5799
## Neg Pred Value : 0.7619
## Prevalence : 0.3673
## Detection Rate : 0.2193
## Detection Prevalence : 0.3782
## Balanced Accuracy : 0.6729
##
## 'Positive' Class : Acer
##
knn.model2 <- train(brand ~., data = training, method = "knn", preProcess = c("center", "scale"))
knn.model2
## k-Nearest Neighbors
##
## 7501 samples
## 6 predictor
## 2 classes: 'Acer', 'Sony'
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.5493034 0.03925497
## 7 0.5580448 0.04682471
## 9 0.5671442 0.05610268
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
knn.pred2 <- predict(knn.model2, newdata = test)
confusionMatrix(test$brand, knn.pred2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 291 654
## Sony 403 1151
##
## Accuracy : 0.577
## 95% CI : (0.5574, 0.5965)
## No Information Rate : 0.7223
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0513
## Mcnemar's Test P-Value : 1.476e-14
##
## Sensitivity : 0.4193
## Specificity : 0.6377
## Pos Pred Value : 0.3079
## Neg Pred Value : 0.7407
## Prevalence : 0.2777
## Detection Rate : 0.1164
## Detection Prevalence : 0.3782
## Balanced Accuracy : 0.5285
##
## 'Positive' Class : Acer
##
knn.model3 <- train(brand ~ salary + age, data = training, method = "knn")
knn.model3
## k-Nearest Neighbors
##
## 7501 samples
## 2 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6781369 0.3152600
## 7 0.6817040 0.3226715
## 9 0.6854230 0.3294435
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
knn.pred3 <- predict(knn.model3, newdata = test)
confusionMatrix(test$brand, knn.pred3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 548 397
## Sony 337 1217
##
## Accuracy : 0.7063
## 95% CI : (0.688, 0.7241)
## No Information Rate : 0.6459
## P-Value [Acc > NIR] : 8.565e-11
##
## Kappa : 0.3676
## Mcnemar's Test P-Value : 0.02943
##
## Sensitivity : 0.6192
## Specificity : 0.7540
## Pos Pred Value : 0.5799
## Neg Pred Value : 0.7831
## Prevalence : 0.3541
## Detection Rate : 0.2193
## Detection Prevalence : 0.3782
## Balanced Accuracy : 0.6866
##
## 'Positive' Class : Acer
##
knn.model4 <- train(brand ~ salary + age, data = training, method = "knn", preProcess = c("center", "scale"))
knn.model4
## k-Nearest Neighbors
##
## 7501 samples
## 2 predictor
## 2 classes: 'Acer', 'Sony'
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9062063 0.8003114
## 7 0.9105253 0.8096999
## 9 0.9127478 0.8145525
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
knn.pred4 <- predict(knn.model4, newdata = test)
confusionMatrix(test$brand, knn.pred4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 837 108
## Sony 99 1455
##
## Accuracy : 0.9172
## 95% CI : (0.9057, 0.9277)
## No Information Rate : 0.6255
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8235
## Mcnemar's Test P-Value : 0.5782
##
## Sensitivity : 0.8942
## Specificity : 0.9309
## Pos Pred Value : 0.8857
## Neg Pred Value : 0.9363
## Prevalence : 0.3745
## Detection Rate : 0.3349
## Detection Prevalence : 0.3782
## Balanced Accuracy : 0.9126
##
## 'Positive' Class : Acer
##
As a summary, the next table shows the accuracy and the kappa of all the k-NN models in the test set:
| Model 1 | Model 2 | Model 3 | Model 4 | |
|---|---|---|---|---|
| Accuracy | 0.6931 | 0.5766 | 0.7067 | 0.9172 |
| Kappa | 0.3437 | 0.0506 | 0.3686 | 0.8235 |
As it can be seen, the models that contains all the variables, especially the model with the normalized variables, have the lowest accuracy. The reason of this is because including non relevant variables to the model produce noise on the fit. Furthermore, if the non relevant variables elevel,car and zipcode, whose ranges are far narrower than salary, are normalized, they will take the same importance as salary when k-NN calculates the euclidean distance and this fact will lead tha algorithm to error. The third model, that contains only salary and age without normalization, still has a low accuracy. This happens because the variable age, which has an important role in the customer preference, takes little importance when calculating the euclidean distance. The range of age goes from 20 to 80 while the range of the salary goes from 20k to 150k. However, in the fourth model, when both salary and age are normalized and have the same importance when calculating the euclidean distance, the accuracy and the kappa of the model rises up.
Let’s check now the results of the models repeating the same process with Random Forest:
rf.model1 <- train(brand ~., data = training, method = "rf")
rf.model1
## Random Forest
##
## 7501 samples
## 6 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6234077 0.002066527
## 18 0.9186010 0.826835888
## 34 0.9129570 0.814730837
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
rf.pred1 <- predict(rf.model1, newdata = test)
confusionMatrix(test$brand, rf.pred1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 829 116
## Sony 81 1473
##
## Accuracy : 0.9212
## 95% CI : (0.9099, 0.9314)
## No Information Rate : 0.6359
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8312
## Mcnemar's Test P-Value : 0.01542
##
## Sensitivity : 0.9110
## Specificity : 0.9270
## Pos Pred Value : 0.8772
## Neg Pred Value : 0.9479
## Prevalence : 0.3641
## Detection Rate : 0.3317
## Detection Prevalence : 0.3782
## Balanced Accuracy : 0.9190
##
## 'Positive' Class : Acer
##
rf.model2 <- train(brand ~., data = training, method = "rf", preProcess = c("center", "scale"))
rf.model2
## Random Forest
##
## 7501 samples
## 6 predictor
## 2 classes: 'Acer', 'Sony'
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6223611 0.001821305
## 18 0.9181596 0.825937371
## 34 0.9121071 0.812998733
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
rf.pred2 <- predict(rf.model2, newdata = test)
confusionMatrix(test$brand, rf.pred2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 825 120
## Sony 78 1476
##
## Accuracy : 0.9208
## 95% CI : (0.9095, 0.9311)
## No Information Rate : 0.6387
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8301
## Mcnemar's Test P-Value : 0.003571
##
## Sensitivity : 0.9136
## Specificity : 0.9248
## Pos Pred Value : 0.8730
## Neg Pred Value : 0.9498
## Prevalence : 0.3613
## Detection Rate : 0.3301
## Detection Prevalence : 0.3782
## Balanced Accuracy : 0.9192
##
## 'Positive' Class : Acer
##
rf.model3 <- train(brand ~ salary + age, data = training, method = "rf")
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
rf.model3
## Random Forest
##
## 7501 samples
## 2 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9048754 0.7977815
##
## Tuning parameter 'mtry' was held constant at a value of 2
rf.pred3 <- predict(rf.model3, newdata = test)
confusionMatrix(test$brand, rf.pred3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 805 140
## Sony 110 1444
##
## Accuracy : 0.9
## 95% CI : (0.8875, 0.9115)
## No Information Rate : 0.6339
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.786
## Mcnemar's Test P-Value : 0.06664
##
## Sensitivity : 0.8798
## Specificity : 0.9116
## Pos Pred Value : 0.8519
## Neg Pred Value : 0.9292
## Prevalence : 0.3661
## Detection Rate : 0.3221
## Detection Prevalence : 0.3782
## Balanced Accuracy : 0.8957
##
## 'Positive' Class : Acer
##
rf.model4 <- train(brand ~ salary + age, data = training, method = "rf", preProcess = c("center", "scale"))
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
rf.model4
## Random Forest
##
## 7501 samples
## 2 predictor
## 2 classes: 'Acer', 'Sony'
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7501, 7501, 7501, 7501, 7501, 7501, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9052102 0.7986786
##
## Tuning parameter 'mtry' was held constant at a value of 2
rf.pred4 <- predict(rf.model4, newdata = test)
confusionMatrix(test$brand, rf.pred4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 806 139
## Sony 113 1441
##
## Accuracy : 0.8992
## 95% CI : (0.8867, 0.9107)
## No Information Rate : 0.6323
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7844
## Mcnemar's Test P-Value : 0.1153
##
## Sensitivity : 0.8770
## Specificity : 0.9120
## Pos Pred Value : 0.8529
## Neg Pred Value : 0.9273
## Prevalence : 0.3677
## Detection Rate : 0.3225
## Detection Prevalence : 0.3782
## Balanced Accuracy : 0.8945
##
## 'Positive' Class : Acer
##
The next table show the summary of the Random Forest models:
| Model 1 | Model 2 | Model 3 | Model 4 | |
|---|---|---|---|---|
| Accuracy | 0.9212 | 0.9220 | 0.9004 | 0.8992 |
| Kappa | 0.8309 | 0.8327 | 0.7868 | 0.7844 |
The different models show about the same level of accuracy. This is due to the way Random Forests works. As it’s a bunch of decision trees, it doesn’t matter if the variables are normalized or not. In the case of the number of variables took by the model, there is a slight improve on the accuracy in regard to the models that only includes salary and age. Why it happens is because RF is able to find hidden patterns on the distribution of the variables that make noise on the k-NN algorithms.
Although the RF model with all the variables normalized has the highest accuracy, the model chosen to predict the brand preferred by the customers of the incomplete survey is the k-NN model with salary and age normalized. The reason of the choice is because the k-NN model is more explainable than the RF model, whose performance is less interpretable. Besides, building k-NN model is more efficient in terms of computational time.
Before applying the model built on the previous section, the distribution of the variables of the incomplete survey must be checked. In order to do that, the same preprocess executed with the complete survey must be done with the incomplete survey:
str(incomplete)
## 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 110500 140894 119160 20000 93956 ...
## $ age : int 54 44 49 56 59 71 32 33 32 58 ...
## $ elevel : int 3 4 2 0 1 2 1 4 1 2 ...
## $ car : int 15 20 1 9 15 7 17 17 19 8 ...
## $ zipcode: int 4 7 3 1 1 2 1 0 2 4 ...
## $ credit : num 354724 395015 122025 99630 458680 ...
## $ brand : Factor w/ 1 level "?": 1 1 1 1 1 1 1 1 1 1 ...
incomplete[, 3:5] <- lapply(incomplete[, 3:5], factor)
incomplete$brand <- factor(incomplete$brand, levels = c(0,1) , labels = c("Acer", "Sony"))
summary(incomplete)
## salary age elevel car zipcode
## Min. : 20000 Min. :20.00 0: 981 19 : 276 4 : 585
## 1st Qu.: 52242 1st Qu.:35.00 1: 993 8 : 263 7 : 571
## Median : 85969 Median :50.00 2:1020 18 : 262 5 : 568
## Mean : 85560 Mean :49.87 3:1001 11 : 259 8 : 563
## 3rd Qu.:118380 3rd Qu.:65.00 4:1005 16 : 259 1 : 549
## Max. :150000 Max. :80.00 5 : 255 3 : 548
## (Other):3426 (Other):1616
## credit brand
## Min. : 0 Acer: 0
## 1st Qu.:121879 Sony: 0
## Median :250871 NA's:5000
## Mean :249510
## 3rd Qu.:375425
## Max. :500000
##
myplots1 <- list()
for (i in 1:6) local({
i <- i
if (class(incomplete[,i]) == "numeric" | class(incomplete[,i]) == "integer"){
plots <- ggplot(incomplete, aes(x = incomplete[,i])) +
geom_histogram(binwidth = (max(incomplete[,i]) - min(incomplete[,i]))/10) +
xlab(colnames(incomplete)[i])
} else {
plots <- ggplot(incomplete, aes(x = incomplete[,i])) + geom_bar() +
xlab(colnames(incomplete)[i])
}
print(plots)
myplots1[[i]] <<- plots
})
The data distribution is very similar to the data of the complete survey, so the performance of the model on the test set can be extrapolable to the incomplete survey.
prediction <- predict(knn.model4, newdata = incomplete)
summary(prediction)
## Acer Sony
## 1864 3136
count <- data.frame(Brand = c("Acer", "Sony"),
Count = c(sum(prediction == "Acer"), sum(prediction == "Sony")))
pie <- ggplot(count, aes(x = "", y = Count, fill = Brand)) + geom_bar(width = 1, stat = "identity")
pie <- pie + coord_polar("y", start = 0) + geom_label(label = count$Count, show.legend = F)
pie
The results show that the preferred brand of the customers from the incomplete survey is Sony, with an estimation of 3136 customers which prefer Sony and 1864 customers which prefer Acer.
Answering the business question from Blackwell Electronics, the brand preferred by the customers of the survey is Sony (6217 customers on the complete survey and an estimation of 3136 on the incomplete, that is a total of 9353). Nevertheless, the amount of customers that prefer Acer is 3783 on the complete survey and an estimation of 1864 on the incomplete survey, with a total of 5647, meaning more than 37% of the responses. The analysis also shows that there are different clusters of customers which prefer one brand or another depending their characteristics. So, Blackwell not only can deeper the strategic relationships with Sony but also can focus the target of the marketing campaigns, especially through the e-commerce, to their customers based on the age and the range of salary of the customer. It is also important to take into account that the results of this analysis are from a survey that, due to the distribution found on the variables, is not from a random sample, but from an incidental extraction of data. In order to go further with the analysis would be a good point to compare this results with the distribution of the demographic data of Blackwell Electronics customers.
Codification of elevel:
Codification of car variable:
Codification of zipcode: