In this project we have been asked to use data mining tools to solve a problem for the electronics company Blackwell Electronics. The company ran surveys to get to know its customers better, but some surveys have unanswered questions, specifically the question about which computer brand the respondent prefers. We can use the customers' responses to the other survey questions (e.g. income, age, etc.) to predict the answer to the brand preference question.
What are we trying to predict?
We need to predict the brand preferences for the incomplete survey responses.
What type of problem is it? Classification or Regression? Binary or Multi-class? Uni-variate or Multi-variate?
It is a binary classification problem with multiple features.
What type of data do we have?
We have three files: one used to train and evaluate the predictive models (CompleteResponses), another that explains the survey questions (SurveyKey), and a last one containing the data for which we have to predict brand preferences (SurveyIncomplete).
First, we need to load the packages that will be used during the project.
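The exact package list is not shown in the original; a minimal set consistent with the functions used throughout this report would be:
library(readr)      # read_csv()
library(dplyr)      # glimpse(), %>%, mutate(), group_by()
library(ggplot2)    # plots
library(caret)      # train(), createDataPartition(), confusionMatrix(), resamples()
library(rpart.plot) # rpart.plot()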
The next step is to load the data we are going to work with.
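The loading call is not shown either; a sketch, assuming the file sits in the working directory under the name given above:
CompleteResponses <- read_csv("CompleteResponses.csv") # Survey answers that include the brand question.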
With the function glimpse() we can check the dimensions and the structure of our data set.
## Observations: 9,898
## Variables: 7
## $ salary <dbl> 119806.54, 106880.48, 78020.75, 63689.94, 50873.62, 130812....
## $ age <dbl> 45, 63, 23, 51, 20, 56, 24, 62, 29, 41, 48, 52, 52, 33, 62,...
## $ elevel <dbl> 0, 1, 0, 3, 3, 3, 4, 3, 4, 1, 4, 1, 3, 4, 2, 1, 2, 1, 2, 0,...
## $ car <dbl> 14, 11, 15, 6, 14, 14, 8, 3, 17, 5, 16, 6, 20, 13, 6, 11, 7...
## $ zipcode <dbl> 4, 6, 2, 5, 4, 3, 5, 0, 0, 4, 5, 0, 4, 3, 3, 4, 7, 2, 8, 2,...
## $ credit <dbl> 442037.71, 45007.18, 48795.32, 40888.88, 352951.50, 135943....
## $ brand <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1,...
To finish the initial exploration we will check for missing values and, if any are found, handle them.
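A minimal check, assuming the data frame is named CompleteResponses (this is the kind of call that produces the output below):
sum(is.na(CompleteResponses)) # Total count of missing values across all columns.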
## [1] 0
There are no missing values, so we can proceed with the data exploration.
In this part we are going to prepare the data for the analysis.
First, we are going to rename the attributes.
## [1] "salary" "age" "elevel" "car" "zipcode" "credit" "brand"
What happens with the data types? Does each attribute's type correspond to its values? In the previous glimpse() output we verified that all variables are stored as doubles, but from the SurveyKey data set we know that the educational level classifies people into 5 categories, the Car attribute corresponds to the respondent's main car brand, and the ZIPcode indicates the region each family lives in. Finally, our dependent variable corresponds to the preferred brand. These four variables must therefore be converted to categorical (factor) variables.
At the same time, the labels of these variables will be changed according to the SurveyKey data set.
CompleteResponses$Educational_level <- factor(CompleteResponses$Educational_level, levels = c(0,1,2,3,4), labels = c("Less than High School Degree",
"High School Degree",
"Some College",
"4-Year College Degree",
"Master's, Doctoral or Professional Degree"))
CompleteResponses$Car <- factor(CompleteResponses$Car, levels = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), labels = c("BMW", "Buick", "Cadillac", "Chevrolet", "Chrysler", "Dodge",
"Ford", "Honda", "Hyundai", "Jeep", "Kia", "Lincoln", "Mazda",
"Mercedes Benz", "Mitsubishi", "Nissan", "Ram", "Subaru",
"Toyota", "None of the above"))
CompleteResponses$ZIPcode <- factor(CompleteResponses$ZIPcode, levels = c(0,1,2,3,4,5,6,7,8), labels = c("New England", "Mid-Atlantic", "East North Central",
"West North Central", "South Atlantic", "East South Central",
"West South Central", "Mountain", "Pacific"))
CompleteResponses$Brand <- factor(CompleteResponses$Brand, levels = c(0,1), labels = c("Acer", "Sony"))
In this part we will check how the data is distributed, using some plots from the 'ggplot2' package.
By modifying the aesthetics of geom_histogram() we can observe the relation of each numerical variable with the dependent variable.
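A sketch of one such histogram, relating Salary to the dependent variable (the bin count and position adjustment are choices of mine, not taken from the original):
ggplot(CompleteResponses, aes(x = Salary, fill = Brand)) +
  geom_histogram(bins = 30, position = "fill") + # Proportion of each brand within each salary bin.
  labs(y = "Proportion of respondents")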
There is a relationship between Salary and Brand. In fact, the lowest salaries (~10,000) and the highest salaries (>130,000) go with an almost 100% preference for Sony. People with salaries between 30,000 and 50,000 and between 110,000 and 130,000 still prefer Sony, though less strongly, and the remaining people prefer Acer.
A scatter plot can relate two variables with the dependent variable to check these correlations.
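A sketch of this first scatter plot, assuming it relates Age and Salary (the pairing is inferred from the plots discussed next, which cover Salary vs. Credit and Age vs. Credit):
ggplot(CompleteResponses, aes(x = Age, y = Salary, colour = Brand)) +
  geom_point(alpha = 0.4) # One point per respondent, coloured by preferred brand.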
From this plot the following can be observed:
Now it is time to check the scatter plot relating Salary and Credit.
We can conclude the following:
Finally we check the scatter plot of Age against Credit; nothing clear can be concluded from it.
The last plot used to check the numerical data is the boxplot. This type of plot makes outlier values of each variable easy to spot, and in this case no outliers were found.
The relation of the numerical variables with the dependent variable can also be observed in boxplots.
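A sketch of one of these boxplots, showing Salary by Brand:
ggplot(CompleteResponses, aes(x = Brand, y = Salary, fill = Brand)) +
  geom_boxplot() # Outliers, if any, would appear as isolated points beyond the whiskers.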
It can be observed that people who prefer Sony have higher salaries than people who prefer Acer.
To see how the categorical data is distributed, the most commonly used plot is the bar chart.
In this plot the differences in brand preference can be observed.
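A sketch of this bar chart:
ggplot(CompleteResponses, aes(x = Brand, fill = Brand)) +
  geom_bar() # Count of respondents preferring each brand.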
Once the data is preprocessed, it's time to create the training and testing sets for the predictive models.
First the seed must be set. The seed is a number you choose as a starting point for generating a sequence of random numbers; it also helps others reproduce your results.
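The seed value used in the original run is not shown; any fixed value works, for example:
set.seed(123) # Hypothetical value; only its being fixed matters for reproducibility.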
The next step is to split the data into two sets, a training set and a testing set. A common split is 75/25: 75% of the data goes to the training set and 25% to the test set. For that, the createDataPartition() function is used, which performs a stratified random split of the data.
inTrain <- createDataPartition(y = CompleteResponses$Brand, p=.75, list = FALSE)
# To partition the data:
training <- CompleteResponses[ inTrain,] # Training set built from the full data set.
testing <- CompleteResponses[-inTrain,] # Testing set built from the full data set.
Now it is time to run the different algorithms.
We will run three different methods and then compare them. The best one will be used to predict the brand preferences missing from the incomplete survey.
These algorithms can be compared only if they all use the same resampling method and the same number of repetitions. The resampling method is set with the trainControl() function.
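The fitControl object passed to train() below is not defined in this excerpt; a definition consistent with the resampling summary printed later (10-fold cross-validation, repeated once) would be:
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1) # Same resampling setup for all three models.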
The first method to be trained is the Classification and Regression Tree (CART) algorithm. Any algorithm is run with the train() function; in this case the method is called 'rpart'.
In the first iteration we perform the feature selection: with the varImp() function we can find out which variables are important and which are not.
# CART decision tree
dt <- train(Brand ~., data = training, method = "rpart", trControl=fitControl, tuneLength=1) # rpart --> CART, Classification And Regression Tree.
dt
## CART
##
## 7424 samples
## 6 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6682, 6682, 6682, 6681, 6682, 6682, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6858819 0.2594072
##
## Tuning parameter 'cp' was held constant at a value of 0.09425451
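The importance check itself is not shown in this excerpt; a minimal sketch of the call named above:
varImp(dt) # Ranks the predictors; here only Salary and Age score high.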
It can be observed that only the Age and Salary variables are significant enough to be taken into account. We therefore have to modify the training and testing sets, to reduce noise and improve accuracy, and re-run the algorithm.
CompleteResponses_1 <- data.frame(CompleteResponses$Salary, CompleteResponses$Age, CompleteResponses$Brand)
names(CompleteResponses_1) <- c("Salary", "Age", "Brand")
inTrain_1 <- createDataPartition(y = CompleteResponses_1$Brand, p=.75, list = FALSE)
# To partition the data:
training_1 <- CompleteResponses_1[ inTrain_1,] # Training set for the data frame that only takes into account Salary and Age variables.
testing_1 <- CompleteResponses_1[-inTrain_1,] # Testing set for the data frame that only keeps the Salary and Age variables.
It is also possible to normalize the numeric variables. Salary and Age are numeric, so the model will preprocess them (centering and scaling).
# CART decision tree
dt_1 <- train(Brand ~., data = training_1, method = "rpart", trControl=fitControl, preProc = c("center", "scale"))
rpart.plot(dt_1$finalModel) # Plot the fitted tree of the final model.
Once the model is trained, it is used to predict on the testing set.
Finally, the confusion matrix is generated and its statistics checked.
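The prediction and confusion-matrix calls are not shown in this excerpt; a sketch consistent with the object names used later in the report:
Brand_predict_dt <- predict(dt_1, newdata = testing_1) # Predicted classes for the testing set.
confusionMatrix(Brand_predict_dt, testing_1$Brand)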
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 899 435
## Sony 37 1103
##
## Accuracy : 0.8092
## 95% CI : (0.7932, 0.8245)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6256
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9605
## Specificity : 0.7172
## Pos Pred Value : 0.6739
## Neg Pred Value : 0.9675
## Prevalence : 0.3783
## Detection Rate : 0.3634
## Detection Prevalence : 0.5392
## Balanced Accuracy : 0.8388
##
## 'Positive' Class : Acer
##
lvs <- c("Acer", "Sony")
truth <- factor(rep(lvs,
times = c(936, 1538)),
levels = rev(lvs))
pred <- factor(c(rep(lvs,
times = c(899, 37)),
rep(lvs, times = c(435, 1103))),
levels = rev(lvs))
confusionMatrix(pred, truth)## Confusion Matrix and Statistics
##
## Reference
## Prediction Sony Acer
## Sony 1103 37
## Acer 435 899
##
## Accuracy : 0.8092
## 95% CI : (0.7932, 0.8245)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6256
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7172
## Specificity : 0.9605
## Pos Pred Value : 0.9675
## Neg Pred Value : 0.6739
## Prevalence : 0.6217
## Detection Rate : 0.4458
## Detection Prevalence : 0.4608
## Balanced Accuracy : 0.8388
##
## 'Positive' Class : Sony
##
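The construction of the table_dt object used below is not shown; by analogy with the C5.0 and random-forest sections it would be:
table_dt <- data.frame(confusionMatrix(pred, truth)$table) # Long-format confusion matrix: Prediction, Reference, Freq.
table_dt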
## Prediction Reference Freq
## 1 Sony Sony 1103
## 2 Acer Sony 435
## 3 Sony Acer 37
## 4 Acer Acer 899
plotTable <- table_dt %>%
  mutate(goodbad = ifelse(Prediction == Reference, "good", "bad")) %>%
  group_by(Reference) %>%
  mutate(prop = Freq/sum(Freq))
# fill alpha relative to sensitivity/specificity by proportional outcomes within reference groups (see dplyr code above as well as original confusion matrix for comparison)
ggplot(data = plotTable, mapping = aes(x = Reference, y = Prediction, fill = goodbad, alpha = prop)) +
  geom_tile() +
  geom_text(aes(label = Freq), vjust = .5, fontface = "bold", alpha = 1) +
  scale_fill_manual(values = c(good = "green", bad = "red")) +
  theme_bw() +
  xlim(rev(levels(table_dt$Reference)))
The next method to be trained is the C5.0 decision tree algorithm. In this case, the method passed to the train() function is called 'C5.0'.
As with the CART decision tree, in the first iteration the variable importance has to be checked with the varImp() function.
# c5.0 decision tree
c50 <- train(Brand ~., data = training, method = "C5.0", trControl=fitControl, tuneLength=2)
variable_importance_c50 <- varImp(c50)
It can be observed that only the Age and Salary variables are significant enough to be taken into account, the same as with the CART decision tree, so we reuse the previously modified training and testing sets.
It is also possible, as before, to normalize the numeric variables; Salary and Age are numeric, so the model will preprocess them (centering and scaling).
This C5.0 decision tree is trained with an automatic tuning grid and a tuneLength of 2.
c50_1 <- train(Brand~., data = training_1, method = "C5.0", trControl=fitControl, preProc = c("center", "scale"), tuneLength=2) # The tuneGrid parameter lets us decide which values the main parameters take, while tuneLength only limits the number of default values to try.
plot(c50_1) # winnow --> a logical that enacts a feature selection step prior to model building.
Once the model is trained, it is used to predict on the testing set.
Finally, the confusion matrix is generated and its statistics checked.
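As before, the prediction call is not shown; a sketch consistent with the object names used later:
Brand_predict_c50 <- predict(c50_1, newdata = testing_1) # Predicted classes for the testing set.
confusionMatrix(Brand_predict_c50, testing_1$Brand)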
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 830 83
## Sony 106 1455
##
## Accuracy : 0.9236
## 95% CI : (0.9124, 0.9338)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8368
##
## Mcnemar's Test P-Value : 0.1095
##
## Sensitivity : 0.8868
## Specificity : 0.9460
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.9321
## Prevalence : 0.3783
## Detection Rate : 0.3355
## Detection Prevalence : 0.3690
## Balanced Accuracy : 0.9164
##
## 'Positive' Class : Acer
##
lvs <- c("Acer", "Sony")
truth <- factor(rep(lvs, times = c(936, 1538)),
levels = rev(lvs))
pred_1 <- factor(c(rep(lvs, times = c(830, 106)),
rep(lvs, times = c(83, 1455))),
levels = rev(lvs))
confusionMatrix(pred_1, truth)## Confusion Matrix and Statistics
##
## Reference
## Prediction Sony Acer
## Sony 1455 106
## Acer 83 830
##
## Accuracy : 0.9236
## 95% CI : (0.9124, 0.9338)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8368
##
## Mcnemar's Test P-Value : 0.1095
##
## Sensitivity : 0.9460
## Specificity : 0.8868
## Pos Pred Value : 0.9321
## Neg Pred Value : 0.9091
## Prevalence : 0.6217
## Detection Rate : 0.5881
## Detection Prevalence : 0.6310
## Balanced Accuracy : 0.9164
##
## 'Positive' Class : Sony
##
table_c50 <- data.frame(confusionMatrix(pred_1, truth)$table)
plotTable_c50 <- table_c50 %>%
  mutate(goodbad = ifelse(Prediction == Reference, "good", "bad")) %>%
  group_by(Reference) %>%
  mutate(prop = Freq/sum(Freq))
# fill alpha relative to sensitivity/specificity by proportional outcomes within reference groups (see dplyr code above as well as original confusion matrix for comparison)
ggplot(data = plotTable_c50, mapping = aes(x = Reference, y = Prediction, fill = goodbad, alpha = prop)) +
  geom_tile() +
  geom_text(aes(label = Freq), vjust = .5, fontface = "bold", alpha = 1) +
  scale_fill_manual(values = c(good = "green", bad = "red")) +
  theme_bw() +
  xlim(rev(levels(table_c50$Reference)))
The last method to be trained is the Random Forest algorithm. In this case, the method passed to the train() function is called 'rf'.
As with the other models, the first step is to check the importance of the variables.
# Random Forest
rf <- train(Brand ~., data = training, method = "rf", trControl=fitControl)
variable_importance_rf <- varImp(rf)
It can be observed that only the Age and Salary variables are significant enough to be taken into account, the same as with the CART and C5.0 decision trees, so we reuse the previously modified training and testing sets.
It is also possible, as before, to normalize the numeric variables; Salary and Age are numeric, so the model will preprocess them (centering and scaling).
# Random Forest
rfGrid <- expand.grid(mtry=2) # mtry: number of variables randomly sampled as split candidates at each node.
rf_1 <- train(Brand~., data=training_1, method="rf", trControl=fitControl, preProc = c("center", "scale"), tuneGrid=rfGrid)
Once the model is trained, it is used to predict on the testing set.
Finally, the confusion matrix is generated and its statistics checked.
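As before, the prediction call is not shown; a sketch consistent with the object names used later:
Brand_predict_rf <- predict(rf_1, newdata = testing_1) # Predicted classes for the testing set.
confusionMatrix(Brand_predict_rf, testing_1$Brand)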
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 823 124
## Sony 113 1414
##
## Accuracy : 0.9042
## 95% CI : (0.8919, 0.9155)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7968
##
## Mcnemar's Test P-Value : 0.516
##
## Sensitivity : 0.8793
## Specificity : 0.9194
## Pos Pred Value : 0.8691
## Neg Pred Value : 0.9260
## Prevalence : 0.3783
## Detection Rate : 0.3327
## Detection Prevalence : 0.3828
## Balanced Accuracy : 0.8993
##
## 'Positive' Class : Acer
##
lvs <- c("Acer", "Sony")
truth <- factor(rep(lvs, times = c(936, 1538)),
levels = rev(lvs))
pred_rf <- factor(c(rep(lvs, times = c(823, 113)),
rep(lvs, times = c(124, 1414))),
levels = rev(lvs))
confusionMatrix(pred_rf, truth)## Confusion Matrix and Statistics
##
## Reference
## Prediction Sony Acer
## Sony 1414 113
## Acer 124 823
##
## Accuracy : 0.9042
## 95% CI : (0.8919, 0.9155)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7968
##
## Mcnemar's Test P-Value : 0.516
##
## Sensitivity : 0.9194
## Specificity : 0.8793
## Pos Pred Value : 0.9260
## Neg Pred Value : 0.8691
## Prevalence : 0.6217
## Detection Rate : 0.5715
## Detection Prevalence : 0.6172
## Balanced Accuracy : 0.8993
##
## 'Positive' Class : Sony
##
table_rf <- data.frame(confusionMatrix(pred_rf, truth)$table)
plotTable_rf <- table_rf %>%
  mutate(goodbad = ifelse(Prediction == Reference, "good", "bad")) %>%
  group_by(Reference) %>%
  mutate(prop = Freq/sum(Freq))
# fill alpha relative to sensitivity/specificity by proportional outcomes within reference groups (see dplyr code above as well as original confusion matrix for comparison)
ggplot(data = plotTable_rf, mapping = aes(x = Reference, y = Prediction, fill = goodbad, alpha = prop)) +
  geom_tile() +
  geom_text(aes(label = Freq), vjust = .5, fontface = "bold", alpha = 1) +
  scale_fill_manual(values = c(good = "green", bad = "red")) +
  theme_bw() +
  xlim(rev(levels(table_rf$Reference)))
After making the predictions on the test set, we compare the resampling results of the three models and use the postResample() function to assess the metrics of the new predictions against the ground truth.
resamps <- resamples(list(rpart = dt_1, c50 = c50_1, rf = rf_1))
summary(resamps) # Shows the accuracy and kappa of each model. Remember that the number of resamples was set to 10 in the fitControl object.
##
## Call:
## summary.resamples(object = resamps)
##
## Models: rpart, c50, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.7913863 0.8965633 0.9002675 0.8849689 0.9077137 0.9219381 0
## c50 0.9057873 0.9164420 0.9218855 0.9214719 0.9255645 0.9407008 0
## rf 0.8894879 0.9056604 0.9111102 0.9084044 0.9147864 0.9164420 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.5925252 0.7843747 0.7921431 0.7649501 0.8080546 0.8351756 0
## c50 0.7979975 0.8218746 0.8331427 0.8329157 0.8421463 0.8745157 0
## rf 0.7661303 0.7988226 0.8119569 0.8055248 0.8183492 0.8236597 0
diffs <- diff(resamps)
summary(diffs) # Gives the pairwise differences of the resamps object: another way to check which model fits best.
##
## Call:
## summary.diff.resamples(object = diffs)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## rpart c50 rf
## rpart -0.03650 -0.02344
## c50 0.11281 0.01307
## rf 0.44853 0.01685
##
## Kappa
## rpart c50 rf
## rpart -0.06797 -0.04057
## c50 0.12067 0.02739
## rf 0.53639 0.02006
df <- data.frame(Model="CART decision tree", Accuracy=mean(resamps$values$`rpart~Accuracy`), Kappa=mean(resamps$values$`rpart~Kappa`))
df <- rbind(df,data.frame(Model="c5.0 decision tree",Accuracy=mean(resamps$values$`c50~Accuracy`), Kappa=mean(resamps$values$`c50~Kappa`)))
df <- rbind(df,data.frame(Model="Random Forest",Accuracy=mean(resamps$values$`rf~Accuracy`), Kappa=mean(resamps$values$`rf~Kappa`)))
df
## Model Accuracy Kappa
## 1 CART decision tree 0.8849689 0.7649501
## 2 c5.0 decision tree 0.9214719 0.8329157
## 3 Random Forest 0.9084044 0.8055248
# CART decision tree
dt_param <- postResample(Brand_predict_dt, testing_1$Brand) # Given two factors, the overall agreement rate and Kappa are determined.
dt_Probs <- predict(dt_1, newdata = testing_1, type = "prob") # compute class probabilities from the model.
# c5.0 decision tree
c50_param <- postResample(Brand_predict_c50, testing_1$Brand)
c50_Probs <- predict(c50_1, newdata = testing_1, type = "prob")
# Random forest decision tree
rf_param <- postResample(Brand_predict_rf, testing_1$Brand)
rf_Probs <- predict(rf_1, newdata = testing_1, type = "prob")
comp_model <- data.frame(dt_param, c50_param, rf_param)
comp_model # Compares the accuracy and kappa of each algorithm on the testing set.
## dt_param c50_param rf_param
## Accuracy 0.8092158 0.9236055 0.9042037
## Kappa 0.6255760 0.8368102 0.7968166
Among the different methods used, C5.0 gives the best results, so the predictive model applied to predict the brand preferences in the incomplete survey will be the C5.0 model.
Finally we will train our chosen model on the whole data set, not just on the training set.
c50_allData <- train(Brand~., data = CompleteResponses_1, method = "C5.0", trControl=fitControl, preProc = c("center", "scale"), tuneLength=2)
predictiveModel <- predict(c50_allData, newdata = CompleteResponses_1)
confusionMatrix(predictiveModel, CompleteResponses_1$Brand)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 3391 350
## Sony 353 5804
##
## Accuracy : 0.929
## 95% CI : (0.9237, 0.934)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.849
##
## Mcnemar's Test P-Value : 0.9399
##
## Sensitivity : 0.9057
## Specificity : 0.9431
## Pos Pred Value : 0.9064
## Neg Pred Value : 0.9427
## Prevalence : 0.3783
## Detection Rate : 0.3426
## Detection Prevalence : 0.3780
## Balanced Accuracy : 0.9244
##
## 'Positive' Class : Acer
##
resamps_c50 <- resamples(list(c50 = c50_1, c50 = c50_allData))
# Note: both models were given the same name ("c50") in the list above, and the
# two rows below reference the same `c50~Accuracy` / `c50~Kappa` columns, which
# is most likely why the means come out as NaN. Giving the models distinct
# names (e.g. list(c50 = c50_1, c50_all = c50_allData)) and referencing each
# model's own columns would avoid this.
df_c50 <- data.frame(Model="c5.0 decision tree", Accuracy=mean(resamps_c50$values$`c50~Accuracy`), Kappa=mean(resamps_c50$values$`c50~Kappa`))
df_c50 <- rbind(df_c50, data.frame(Model="Final c5.0 decision tree", Accuracy=mean(resamps_c50$values$`c50~Accuracy`), Kappa=mean(resamps_c50$values$`c50~Kappa`)))
df_c50
## Model Accuracy Kappa
## 1 c5.0 decision tree NaN NaN
## 2 Final c5.0 decision tree NaN NaN
The final step is to apply the trained predictive model to the incomplete survey in order to predict the missing brand preferences.
To apply the predictive model, this data set must undergo the same transformations.
Load data.
SurveyIncomplete <- read_csv("C:\\Users\\user\\Desktop\\Data Analytics II\\M2T2\\data set\\SurveyIncomplete.csv",
col_types = cols(brand = col_skip())) # The empty brand column is skipped on import.
Change the attribute names.
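As with the complete data set, the renaming call is not shown; a sketch consistent with the names used below (six columns, since brand was skipped on import):
names(SurveyIncomplete) <- c("Salary", "Age", "Educational_level", "Car", "ZIPcode", "Credit")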
Change the data types and relabel the attributes.
SurveyIncomplete$Educational_level <- factor(SurveyIncomplete$Educational_level, levels = c(0,1,2,3,4), labels = c("Less than High School Degree",
"High School Degree",
"Some College",
"4-Year College Degree",
"Master's, Doctoral or Professional Degree"))
SurveyIncomplete$Car <- factor(SurveyIncomplete$Car, levels = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), labels = c("BMW", "Buick", "Cadillac", "Chevrolet", "Chrysler", "Dodge",
"Ford", "Honda", "Hyundai", "Jeep", "Kia", "Lincoln", "Mazda",
"Mercedes Benz", "Mitsubishi", "Nissan", "Ram", "Subaru",
"Toyota", "None of the above"))
SurveyIncomplete$ZIPcode <- factor(SurveyIncomplete$ZIPcode, levels = c(0,1,2,3,4,5,6,7,8), labels = c("New England", "Mid-Atlantic", "East North Central",
"West North Central", "South Atlantic", "East South Central",
"West South Central", "Mountain", "Pacific"))Now is time to apply performed predictive model.
prediction <- predict(c50_allData, newdata = SurveyIncomplete)
summary(prediction) # Gives the count of predictions for each of the two possible classes.
## Acer Sony
## 1885 3115
SurveyIncomplete$Brand <- prediction # predict() already returns a factor with the levels "Acer" and "Sony", so no re-coding is needed.
probabilities <- predict(c50_allData, newdata = SurveyIncomplete, type = "prob")
head(probabilities) # Shows the predicted probability of each class for each row.
## Acer Sony
## 1 1.000000 0.00000000
## 2 0.000000 1.00000000
## 3 0.908265 0.09173495
## 4 0.000000 1.00000000
## 5 0.625029 0.37497097
## 6 1.000000 0.00000000