1 Introduction

In this work we have been asked to work with data mining tools to resolve some problems of the electronic devices company Blackwell Electronics. This company has made some surveys in order to know better customers, but some surveys have unresponsed questions, more specifically which computer brand they prefer. We can use customer responses to some survey questions (e.g. income, age, etc.) to predict the answer to the brand preference question.

What are we trying to predict?

We need to predict the brand preferences for the incomplete survey responses.

What type of problem is it? Classification or Regression? Binary or Multi-class? Uni-variate or Multi-variate?

It is a binary classification problem with multiple features.

What type of data do we have?

We have three files, one to build trained and predictive models (CompleteResponses), another to explain the survey questions (SurveyKey) and the last one, is the data set from we have to predict brand preferences (Survey_incomplete).

2 Load libraries

First, we need to activate packages to be used during the project.

library(readr)
library(dplyr)
library(caret)
library(C50)
library(rpart.plot)
library(gam)

3 Import Data

The next step is to upload the data we are going to be work with.

CompleteResponses <- read_csv("C:\\Users\\user\\Desktop\\Data Analytics II\\M2T2\\data set\\CompleteResponses.csv")

4 Initial exploration of data

With the function glimpse() we can check the dimensions and the structure of our data set.

glimpse(CompleteResponses)

## Observations: 9,898
## Variables: 7
## $ salary  <dbl> 119806.54, 106880.48, 78020.75, 63689.94, 50873.62, 130812....
## $ age     <dbl> 45, 63, 23, 51, 20, 56, 24, 62, 29, 41, 48, 52, 52, 33, 62,...
## $ elevel  <dbl> 0, 1, 0, 3, 3, 3, 4, 3, 4, 1, 4, 1, 3, 4, 2, 1, 2, 1, 2, 0,...
## $ car     <dbl> 14, 11, 15, 6, 14, 14, 8, 3, 17, 5, 16, 6, 20, 13, 6, 11, 7...
## $ zipcode <dbl> 4, 6, 2, 5, 4, 3, 5, 0, 0, 4, 5, 0, 4, 3, 3, 4, 7, 2, 8, 2,...
## $ credit  <dbl> 442037.71, 45007.18, 48795.32, 40888.88, 352951.50, 135943....
## $ brand   <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1,...

To finish with the initial exploration we will check for missing values and in case of finding them we will transform them.

missing_values <- is.na(CompleteResponses)
sum(missing_values)

## [1] 0

There are not missing values so we can proceed with data exploration.

5 Pre-processing

In this part we are goint to adequate data to our analysis.

First, we are going to change attributes names.

names(CompleteResponses)

## [1] "salary"  "age"     "elevel"  "car"     "zipcode" "credit"  "brand"

names(CompleteResponses) <- c("Salary", "Age", "Educational_level", "Car", "ZIPcode", "Credit", "Brand")

What happens with the data types. Does the type of data correspond to each value? In the previous glimpse() function, we verify that all data is double, but from the Survey Key data set we know that the educational level classifies people into 5 categories, the Car attribute corresponds to the main car brand and the ZIPcode says which region each family lives in. Finally, our dependent variable corresponds to which brand is preferred. Then, these four variables must be changed to categorical or factor.

At the same time, the labels of these variables will be changed according to that seen in Survey Key data set.

CompleteResponses$Educational_level <- factor(CompleteResponses$Educational_level, levels = c(0,1,2,3,4), labels = c("Less than High School Degree",
    "High School Degree",
    "Some College",
    "4-Year College Degree",
    "Master's, Doctoral or Professional Degree"))

CompleteResponses$Car <- factor(CompleteResponses$Car, levels = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), labels = c("BMW", "Buick", "Cadillac", "Chevrolet", "Chrysler", "Dodge",
    "Ford", "Honda", "Hyundai", "Jeep", "Kia", "Lincoln", "Mazda",
    "Mercedes Benz", "Mitsubishi", "Nissan", "Ram", "Subaru", 
    "Toyota", "None of the above"))

CompleteResponses$ZIPcode <- factor(CompleteResponses$ZIPcode, levels = c(0,1,2,3,4,5,6,7,8), labels =  c("New England", "Mid-Atlantic", "East North Central",
    "West North Central", "South Atlantic", "East South Central",
    "West South Central", "Mountain", "Pacific"))
  
CompleteResponses$Brand <- factor(CompleteResponses$Brand, levels = c(0,1), labels =  c("Acer", "Sony"))

6 Exploratory Data Analysis (EDA)

In this part we will check how data is distributed. We will perform some plots using “ggplot2” package.

6.1 Histogram; numeric variables.

Modifying the aesthetics of geom_histogram() we can observe the relation of each numerical data with dependant variable.

ggplot(CompleteResponses, aes(Salary, color = "Salary", binwidth=5 )) + geom_histogram(aes(fill=Brand)) + labs(title = "Salary")

There is a relationship between Salary and Brand. In fact, lowest salaries ( ~ 10.000) and highest salaries ( > 130.000) prefer Sony in almost 100%. People with salary between 30.000 - 50.000 and 110.000 - 130.000 still prefer Sony but not as much as before and the rest people prefer Acer.

6.2 Scatter plot ; numerical data

Scatter plot can relate 2 variables with the dependant variable to check these correlations.

ggplot(CompleteResponses, aes(Salary, Age, color = Brand)) + geom_point() + labs(title = "Salary vs Age")

From this plot it can be observed the following:

Age < 40 and 50000 < Salary < 100000 –> Acer
40 < Age < 60 and 80000 < Salary < 120000 –> Acer
60 < Age and Salary < 70000 –> Acer

Now its time to check Salary and Credit relation scatter plot.

ggplot(CompleteResponses, aes(Salary, Credit, color = Brand)) + geom_point() + labs(title = "Salary vs Credit")

We can conclude the following:

Salary > 120000 and all credit range –> Sony

Finally we check the relation scatter plot of Age and Credit and nothing clear can be concluded.

6.3 Boxplot ; numerical data

Th last plot to check numerical data that is going to be used is the boxplot. In this type of plots outlier values of each variable can be determined and in this case there have not been found any outlier.

It can also be observed the relation of numerical data with dependant variable in boxplots.

ggplot(CompleteResponses, aes("",Salary, color = Brand)) + geom_boxplot() + labs(title = "Salary")

It can be observed that people that prefer Sony has higher salary than people that prefer Acer.

6.4 Bar chart ; categorical data

In order to know how categorical data is ditributed the most common used plot is the bar chart.

ggplot(CompleteResponses, aes(Brand, color = Brand, fill = Brand)) + geom_bar() + labs(title = "Brand")

In this plot it is observed the differences in brand preferences.

7 Training and testing sets

Once data is preprocessed, it´s time to create training and testing sets for the predictive model that must be performed.

First the seed number must be set. The seed number is a number that you choose for a starting point used to create a sequence of random numbers. It is also helpful for others who want to recreate your same results.

set.seed(123)

The next step is to split data into two sets, training and testing set. A common split is 75/25, which means that 75% of the data will be the training set and 25% of the data will be the test set. For that, createDataPartition() function is used, that does a stratified random split of the data.

inTrain <- createDataPartition(y = CompleteResponses$Brand, p=.75, list = FALSE) 

# To partition the data:
training <- CompleteResponses[ inTrain,] # Training set for normal data set.
testing <- CompleteResponses[-inTrain,] # Testing set for normal data set.

Now is time to run different algorithms.

8 Modelling

We will run 3 different methods and then we will compare. The best is going to be used to predict our incomplete survey´s brand preferences.

This algorithms can be compared if all have the same resampling method and the same number of repetitions. To modify resampling method trainControl() function can be used.

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1) # repeatedcv --> repeated cross-validation (repeats for # repeats)

8.1 CART decision tree

The first method that is going to be performed is Classification and Regression tree algorithm. To run any algorithm the train() function is used. In this case, the method that will be used is called “rpart”.

In the first iteration it is going to perform the feature engineering. With the function varImp() we can know which variable is important and which not.

# CART decision tree
dt <- train(Brand ~., data = training, method = "rpart", trControl=fitControl, tuneLength=1) # rpart --> CART, Classification And Regression Tree. 
dt

## CART 
## 
## 7424 samples
##    6 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6682, 6682, 6682, 6681, 6682, 6682, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6858819  0.2594072
## 
## Tuning parameter 'cp' was held constant at a value of 0.09425451

variable_importance_dt <- varImp(dt) # A generic method for calculating variable importance for objects produced by train and method specific method

It can be observed that only Age and Salary variables have the enough significance to be taken into account. So we have to modified the training and testing sets in order to reduce noise and improve accuracy and re-run the algorithm.

CompleteResponses_1 <- data.frame(CompleteResponses$Salary, CompleteResponses$Age, CompleteResponses$Brand)

names(CompleteResponses_1) <- c("Salary", "Age", "Brand")

inTrain_1 <- createDataPartition(y = CompleteResponses_1$Brand, p=.75, list = FALSE) 

# To partition the data:
training_1 <- CompleteResponses_1[ inTrain_1,] # Training set for the data frame that only takes into account Salary and Age variables.
testing_1 <- CompleteResponses_1[-inTrain_1,] # Testing set for the data frame that only takes into account Salary and Age variables.

It is also possible to normalize the numeric variables. In this case Salary and Age are numeric, so the algorithm will suffer a preprocessing.

# CART decision tree
dt_1 <- train(Brand ~., data = training_1, method = "rpart", trControl=fitControl, preProc = c("center", "scale"))

rpart.plot(dt_1$finalMode)

Once the training algorithm is performed it has to be tried to predict in testing set.

Brand_predict_dt <- predict(dt_1, newdata = testing_1)

Finally it will be generated the confusion matrix and statistics will be checked.

confusionMatrix(Brand_predict_dt, testing_1$Brand)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  899  435
##       Sony   37 1103
##                                           
##                Accuracy : 0.8092          
##                  95% CI : (0.7932, 0.8245)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6256          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9605          
##             Specificity : 0.7172          
##          Pos Pred Value : 0.6739          
##          Neg Pred Value : 0.9675          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3634          
##    Detection Prevalence : 0.5392          
##       Balanced Accuracy : 0.8388          
##                                           
##        'Positive' Class : Acer            
##

lvs <- c("Acer", "Sony")
truth <- factor(rep(lvs, 
                    times = c(936, 1538)), 
                levels = rev(lvs))
pred <- factor(c(rep(lvs, 
                     times = c(899, 37)), 
                 rep(lvs, times = c(435, 1103))), 
               levels = rev(lvs))

confusionMatrix(pred, truth)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Sony Acer
##       Sony 1103   37
##       Acer  435  899
##                                           
##                Accuracy : 0.8092          
##                  95% CI : (0.7932, 0.8245)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6256          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7172          
##             Specificity : 0.9605          
##          Pos Pred Value : 0.9675          
##          Neg Pred Value : 0.6739          
##              Prevalence : 0.6217          
##          Detection Rate : 0.4458          
##    Detection Prevalence : 0.4608          
##       Balanced Accuracy : 0.8388          
##                                           
##        'Positive' Class : Sony            
##

table_dt <- data.frame(confusionMatrix(pred, truth)$table)

table_dt

##   Prediction Reference Freq
## 1       Sony      Sony 1103
## 2       Acer      Sony  435
## 3       Sony      Acer   37
## 4       Acer      Acer  899

plotTable <- table_dt %>% mutate(goodbad = ifelse(table_dt$Prediction == table_dt$Reference, "good", "bad")) %>%  group_by(Reference) %>% mutate(prop = Freq/sum(Freq))

# fill alpha relative to sensitivity/specificity by proportional outcomes within reference groups (see dplyr code above as well as original confusion matrix for comparison)
ggplot(data = plotTable, mapping = aes(x = Reference, y = Prediction, fill = goodbad, alpha = prop)) + geom_tile() + geom_text(aes(label = Freq), vjust = .5, fontface  = "bold", alpha = 1) + scale_fill_manual(values = c(good = "green", bad = "red")) + theme_bw() + xlim(rev(levels(table_dt$Reference)))

8.2 C50

The next method that is going to be performed is C5.0 decision tree algorithm. In this case, the method that will be used in the train() function is called “C5.0”.

Like with CART decision tree, in the first iteration variable importance has to be checked with varImp() function.

# c5.0 decision tree
c50 <- train(Brand ~., data = training, method = "C5.0", trControl=fitControl, tuneLength=2)

variable_importance_c50 <- varImp(c50)

It can be observed that only Age and Salary variables have the enough significance to be taken into account, the same that happened with CART decision tree. So we have to used previously modified training and testing sets.

It is also possible, as before, to normalize the numeric variables. In this case Salary and Age are numeric, so the algorithm will suffer a preprocessing.

This decision tree C5.0 with an Automatic Tuning Grid with an optimized tuneLength of 2.

c50_1 <- train(Brand~., data = training_1, method = "C5.0", trControl=fitControl, preProc = c("center", "scale"), tuneLength=2) # The tuneGrid parameter lets us decide which values the main parameter will take. While tuneLength only limit the number of default parameters to use.

plot(c50_1) # winnow --> a logical that enacts a feature selection step prior to model building..

Once the training algorithm is performed it has to be tried to predict in testing set.

Brand_predict_c50 <- predict(c50_1, newdata = testing_1)

Finally it will be generated the confusion matrix and statistics will be checked.

confusionMatrix(Brand_predict_c50, testing_1$Brand)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  830   83
##       Sony  106 1455
##                                           
##                Accuracy : 0.9236          
##                  95% CI : (0.9124, 0.9338)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8368          
##                                           
##  Mcnemar's Test P-Value : 0.1095          
##                                           
##             Sensitivity : 0.8868          
##             Specificity : 0.9460          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.9321          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3355          
##    Detection Prevalence : 0.3690          
##       Balanced Accuracy : 0.9164          
##                                           
##        'Positive' Class : Acer            
##

lvs <- c("Acer", "Sony")
truth <- factor(rep(lvs, times = c(936, 1538)), 
                levels = rev(lvs))
pred_1 <- factor(c(rep(lvs, times = c(830, 106)),
                   rep(lvs, times = c(83, 1455))), 
                 levels = rev(lvs))

confusionMatrix(pred_1, truth)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Sony Acer
##       Sony 1455  106
##       Acer   83  830
##                                           
##                Accuracy : 0.9236          
##                  95% CI : (0.9124, 0.9338)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8368          
##                                           
##  Mcnemar's Test P-Value : 0.1095          
##                                           
##             Sensitivity : 0.9460          
##             Specificity : 0.8868          
##          Pos Pred Value : 0.9321          
##          Neg Pred Value : 0.9091          
##              Prevalence : 0.6217          
##          Detection Rate : 0.5881          
##    Detection Prevalence : 0.6310          
##       Balanced Accuracy : 0.9164          
##                                           
##        'Positive' Class : Sony            
##

table_c50 <- data.frame(confusionMatrix(pred_1, truth)$table)

plotTable_c50 <- table_c50 %>% mutate(goodbad = ifelse(table_c50$Prediction == table_c50$Reference, "good", "bad")) %>%  group_by(Reference) %>% mutate(prop = Freq/sum(Freq))

# fill alpha relative to sensitivity/specificity by proportional outcomes within reference groups (see dplyr code above as well as original confusion matrix for comparison)
ggplot(data = plotTable_c50, mapping = aes(x = Reference, y = Prediction, fill = goodbad, alpha = prop)) + geom_tile() + geom_text(aes(label = Freq), vjust = .5, fontface  = "bold", alpha = 1) + scale_fill_manual(values = c(good = "green", bad = "red")) + theme_bw() + xlim(rev(levels(table_c50$Reference)))

8.3 Random forest

The last method that is going to be performed is Random Forest algorithm. In this case, the method that will be used in the train() function is called “rf”.

Like with the other models, first iteration is to check the importance of variables.

# Random Forest
rf <- train(Brand ~., data = training, method = "rf", trControl=fitControl)

variable_importance_rf <- varImp(rf)

It can be observed that only Age and Salary variables have the enough significance to be taken into account, the same that happened with CART and c5.0 decision trees. So we have to used previously modified training and testing sets.

It is also possible, as before, to normalize the numeric variables. In this case Salary and Age are numeric, so the algorithm will suffer a preprocessing.

# Random Forest
rfGrid <- expand.grid(mtry=2)
rf_1 <- train(Brand~.,data=training_1, method="rf", trControl=fitControl,  preProc = c("center", "scale"), tuneGrid=rfGrid)

Once the training algorithm is performed it has to be tried to predict in testing set.

Brand_predict_rf <- predict(rf_1, newdata = testing_1)

Finally it will be generated the confusion matrix and statistics will be checked.

confusionMatrix(Brand_predict_rf, testing_1$Brand)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  823  124
##       Sony  113 1414
##                                           
##                Accuracy : 0.9042          
##                  95% CI : (0.8919, 0.9155)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7968          
##                                           
##  Mcnemar's Test P-Value : 0.516           
##                                           
##             Sensitivity : 0.8793          
##             Specificity : 0.9194          
##          Pos Pred Value : 0.8691          
##          Neg Pred Value : 0.9260          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3327          
##    Detection Prevalence : 0.3828          
##       Balanced Accuracy : 0.8993          
##                                           
##        'Positive' Class : Acer            
##

lvs <- c("Acer", "Sony")
truth <- factor(rep(lvs, times = c(936, 1538)), 
                levels = rev(lvs))
pred_rf <- factor(c(rep(lvs, times = c(823, 113)),
                   rep(lvs, times = c(124, 1414))), 
                 levels = rev(lvs))

confusionMatrix(pred_rf, truth)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Sony Acer
##       Sony 1414  113
##       Acer  124  823
##                                           
##                Accuracy : 0.9042          
##                  95% CI : (0.8919, 0.9155)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7968          
##                                           
##  Mcnemar's Test P-Value : 0.516           
##                                           
##             Sensitivity : 0.9194          
##             Specificity : 0.8793          
##          Pos Pred Value : 0.9260          
##          Neg Pred Value : 0.8691          
##              Prevalence : 0.6217          
##          Detection Rate : 0.5715          
##    Detection Prevalence : 0.6172          
##       Balanced Accuracy : 0.8993          
##                                           
##        'Positive' Class : Sony            
##

table_rf <- data.frame(confusionMatrix(pred_rf, truth)$table)

plotTable_rf <- table_rf %>% mutate(goodbad = ifelse(table_rf$Prediction == table_rf$Reference, "good", "bad")) %>%  group_by(Reference) %>% mutate(prop = Freq/sum(Freq))

# fill alpha relative to sensitivity/specificity by proportional outcomes within reference groups (see dplyr code above as well as original confusion matrix for comparison)
ggplot(data = plotTable_rf, mapping = aes(x = Reference, y = Prediction, fill = goodbad, alpha = prop)) + geom_tile() + geom_text(aes(label = Freq), vjust = .5, fontface  = "bold", alpha = 1) + scale_fill_manual(values = c(good = "green", bad = "red")) + theme_bw() + xlim(rev(levels(table_rf$Reference)))

9 Resamples

After making the predictions using the test set use postResample() function to assess the metrics of the new predictions compared to the Ground Truth.

resamps <- resamples(list(rpart = dt_1, c50 = c50_1, rf = rf_1))
summary(resamps) # Show the accuracy and kappa of each model. Remember that the number of resamples was fit to 10 in fitcontrol object.

## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: rpart, c50, rf 
## Number of resamples: 10 
## 
## Accuracy 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart 0.7913863 0.8965633 0.9002675 0.8849689 0.9077137 0.9219381    0
## c50   0.9057873 0.9164420 0.9218855 0.9214719 0.9255645 0.9407008    0
## rf    0.8894879 0.9056604 0.9111102 0.9084044 0.9147864 0.9164420    0
## 
## Kappa 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart 0.5925252 0.7843747 0.7921431 0.7649501 0.8080546 0.8351756    0
## c50   0.7979975 0.8218746 0.8331427 0.8329157 0.8421463 0.8745157    0
## rf    0.7661303 0.7988226 0.8119569 0.8055248 0.8183492 0.8236597    0

diffs <- diff(resamps)
summary(diffs)  # Gives the differences of the resamps object. Another way to check which model fits better.

## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##       rpart   c50      rf      
## rpart         -0.03650 -0.02344
## c50   0.11281           0.01307
## rf    0.44853 0.01685          
## 
## Kappa 
##       rpart   c50      rf      
## rpart         -0.06797 -0.04057
## c50   0.12067           0.02739
## rf    0.53639 0.02006

df <- data.frame(Model="CART decision tree", Accuracy=mean(resamps$values$`rpart~Accuracy`), Kappa=mean(resamps$values$`rpart~Kappa`))
df <- rbind(df,data.frame(Model="c5.0 decision tree",Accuracy=mean(resamps$values$`c50~Accuracy`), Kappa=mean(resamps$values$`c50~Kappa`)))
df <- rbind(df,data.frame(Model="Random Forest",Accuracy=mean(resamps$values$`rf~Accuracy`), Kappa=mean(resamps$values$`rf~Kappa`)))

df

##                Model  Accuracy     Kappa
## 1 CART decision tree 0.8849689 0.7649501
## 2 c5.0 decision tree 0.9214719 0.8329157
## 3      Random Forest 0.9084044 0.8055248

ggplot(df, aes(Accuracy , Model)) + geom_point() + labs(title="Accuracy vs Model")

ggplot(df, aes(Kappa , Model)) + geom_point() + labs(title="Kappa vs Model")

xyplot(resamps, what = "BlandAltman")

# CART decision tree
dt_param <- postResample(Brand_predict_dt, testing_1$Brand) # Given two factors, the overall agreement rate and Kappa are determined.
dt_Probs <- predict(dt_1, newdata = testing_1, type = "prob") # compute class probabilities from the model.

# c5.0 decision tree
c50_param <- postResample(Brand_predict_c50, testing_1$Brand)
c50_Probs <- predict(c50_1, newdata = testing_1, type = "prob") 

# Random forest decision tree
rf_param <- postResample(Brand_predict_rf, testing_1$Brand)
rf_Probs <- predict(rf_1, newdata = testing_1, type = "prob") 

comp_model <- data.frame(dt_param, c50_param, rf_param)
comp_model # Compares the accuracy and kappa of each used algortihm.

##           dt_param c50_param  rf_param
## Accuracy 0.8092158 0.9236055 0.9042037
## Kappa    0.6255760 0.8368102 0.7968166

Among used different methods the c5.0 method gives the best results. So the performed predictive model that is going to apply to predict brand preferences in the incomplete survey is going to be c5.0.

Finally we will train our predictive model in the whole set and not just in our testing set.

c50_allData <- train(Brand~., data = CompleteResponses_1, method = "C5.0", trControl=fitControl, preProc = c("center", "scale"), tuneLength=2)

predictiveModel <- predict(c50_allData, newdata = CompleteResponses_1)

confusionMatrix(predictiveModel, CompleteResponses_1$Brand)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer 3391  350
##       Sony  353 5804
##                                          
##                Accuracy : 0.929          
##                  95% CI : (0.9237, 0.934)
##     No Information Rate : 0.6217         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.849          
##                                          
##  Mcnemar's Test P-Value : 0.9399         
##                                          
##             Sensitivity : 0.9057         
##             Specificity : 0.9431         
##          Pos Pred Value : 0.9064         
##          Neg Pred Value : 0.9427         
##              Prevalence : 0.3783         
##          Detection Rate : 0.3426         
##    Detection Prevalence : 0.3780         
##       Balanced Accuracy : 0.9244         
##                                          
##        'Positive' Class : Acer           
##

resamps_c50 <- resamples(list(c50 = c50_1, c50 = c50_allData))

df_c50 <- data.frame(Model="c5.0 decision tree", Accuracy=mean(resamps_c50$values$`c50~Accuracy`), Kappa=mean(resamps_c50$values$`c50~Kappa`))
df_c50 <- rbind(df_c50,data.frame(Model="Final c5.o decision tree",Accuracy=mean(resamps_c50$values$`c50~Accuracy`), Kappa=mean(resamps_c50$values$`c50~Kappa`)))

df_c50

##                      Model Accuracy Kappa
## 1       c5.0 decision tree      NaN   NaN
## 2 Final c5.o decision tree      NaN   NaN

10 Survey incomplete

The final step is to apply the performed predictive model to the incomplete survey in order to predict missing brand preferences.

To apply the predictive model, the data set must suffer the same modifications.

Load data.

SurveyIncomplete <- read_csv("C:\\Users\\user\\Desktop\\Data Analytics II\\M2T2\\data set\\SurveyIncomplete.csv", 
    col_types = cols(brand = col_skip()))

Change attributes name.

names(SurveyIncomplete) <- c("Salary", "Age", "Educational_level", "Car", "ZIPcode", "Credit")

Change data types and rename atributte´s labels.

SurveyIncomplete$Educational_level <- factor(SurveyIncomplete$Educational_level, levels = c(0,1,2,3,4), labels = c("Less than High School Degree",
    "High School Degree",
    "Some College",
    "4-Year College Degree",
    "Master's, Doctoral or Professional Degree"))

SurveyIncomplete$Car <- factor(SurveyIncomplete$Car, levels = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), labels = c("BMW", "Buick", "Cadillac", "Chevrolet", "Chrysler", "Dodge",
    "Ford", "Honda", "Hyundai", "Jeep", "Kia", "Lincoln", "Mazda",
    "Mercedes Benz", "Mitsubishi", "Nissan", "Ram", "Subaru", 
    "Toyota", "None of the above"))

SurveyIncomplete$ZIPcode <- factor(SurveyIncomplete$ZIPcode, levels = c(0,1,2,3,4,5,6,7,8), labels =  c("New England", "Mid-Atlantic", "East North Central",
    "West North Central", "South Atlantic", "East South Central",
    "West South Central", "Mountain", "Pacific"))

Now is time to apply performed predictive model.

prediction <- predict(c50_allData, newdata = SurveyIncomplete)
summary(prediction) # Give all predicted values for two possible classes.

## Acer Sony 
## 1885 3115

SurveyIncomplete$Brand <- c(prediction)

SurveyIncomplete$Brand <- factor(SurveyIncomplete$Brand, levels = c(1,2), labels =  c("Acer", "Sony"))

probabilities <- predict(c50_allData, newdata = SurveyIncomplete, type = "prob") 
head(probabilities) # Show the probability to predict each class for each row.

##       Acer       Sony
## 1 1.000000 0.00000000
## 2 0.000000 1.00000000
## 3 0.908265 0.09173495
## 4 0.000000 1.00000000
## 5 0.625029 0.37497097
## 6 1.000000 0.00000000

Brand Prediction

Elias Lobato

8/1/2020

1 Introduction

2 Load libraries

3 Import Data

4 Initial exploration of data

5 Pre-processing

6 Exploratory Data Analysis (EDA)

6.1 Histogram; numeric variables.

6.2 Scatter plot ; numerical data

6.3 Boxplot ; numerical data

6.4 Bar chart ; categorical data

7 Training and testing sets

8 Modelling

8.1 CART decision tree

8.2 C50

8.3 Random forest

9 Resamples

10 Survey incomplete