The broad objective of this week's task was to find out which of two brands of computers our customers prefer. This information will help the company decide which manufacturer to pursue a deeper relationship with. To do this, the caret package was used.

Method

To explain how we arrived at our final outcome, this report describes each feature and each step taken, to aid reproducibility.

What is the CARET package?

The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. The package utilizes a number of R packages.

The tools within the caret package can be used for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation.

Steps taken:

  1. The data was imported (the CompleteResponses data).

  2. A working pipeline was then created. This pipeline used the following caret functions: createDataPartition(), trainControl(), train(), predict() and postResample().

Note: caret uses two grid methods for tuning: Automatic Grid (automated tuning) and Manual Grid (you specify the parameter values).

  3. Following importation of the dataset, the survey key was consulted to gain a clear understanding of the dataset. The brand column was converted from numeric to factor because caret treats a numeric outcome as a regression problem and a factor outcome as a classification problem.
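The difference between a numeric and a factor outcome can be seen in a small sketch. The values below are hypothetical 0/1 codes, not rows from the survey.

```r
# Hypothetical 0/1 brand codes, standing in for the survey's brand column
brand_numeric <- c(0, 1, 1, 0, 1, 1)

# As a number, summary() reports quantiles and a mean
summary(brand_numeric)

# As a factor, summary() reports one count per class
brand_factor <- factor(brand_numeric, levels = c(0, 1))
summary(brand_factor)

# table() gives the same per-class counts as a named table
class_counts <- table(brand_factor)
class_counts
```

This matches what the report shows for the real data: summary() of the numeric brand column prints Min., Median and Mean rather than class counts.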

How to load files

Blackwell_Hist_Sample3 <- readRDS(file = "C:/Users/gebruiker/Downloads/Blackwell_Hist_Sample.rds")

# help for loading files
help(read.csv)

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
CompleteResponses_M2T2_1 <- read.csv(file="C:/Users/gebruiker/Desktop/Ubiqum_1/CompleteResponses_M2T2.csv")

CompleteResponses_M2T2_1

summary(CompleteResponses_M2T2_1)

summary(CompleteResponses_M2T2_1$brand)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.6217  1.0000  1.0000
plot(CompleteResponses_M2T2_1$brand)

CompleteResponses_M2T2_1$brand<-as.factor(CompleteResponses_M2T2_1$brand)

  4. Pre-processing was then done. This included a basic check for missing data.

sum(is.na(CompleteResponses_M2T2_1))
## [1] 0
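A single total hides which columns the missing values are in. A minimal sketch on a hypothetical data frame (the column values are made up) shows a per-column count and how complete rows could be kept:

```r
# Hypothetical data frame with two missing values
survey <- data.frame(
  salary = c(50000, NA, 72000),
  age    = c(34, 51, NA),
  brand  = c(0, 1, 1)
)

# Total number of missing cells, as in the report
total_missing <- sum(is.na(survey))

# Missing cells per column
missing_per_column <- colSums(is.na(survey))
missing_per_column

# Keep only the rows without any missing value
complete_rows <- survey[complete.cases(survey), ]
nrow(complete_rows)
```

For the survey data the total is 0, so no rows need to be dropped.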

The random seed is then fixed with the set.seed() function so that the data partition and the resampling can be reproduced.

set.seed(123)
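The effect of the seed can be checked with a short sketch: two random draws made after the same set.seed() call are identical. The draw itself (five numbers from 1 to 100) is only an illustration.

```r
# First draw after fixing the seed
set.seed(123)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces the same draw
set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)
```

Note that the seed is consumed as random numbers are drawn, so a later step is only reproducible if the whole script is rerun from the set.seed() call, or the seed is set again before that step.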

The next step is to partition the data using createDataPartition(), which returns the row indices of the training set.

M2T2_training <- createDataPartition(y = CompleteResponses_M2T2_1$brand, p=0.75, list=FALSE)

The sizes of the training and test sets were then checked:

# trainSize
# testSize

Assign the training and test data to the names training and testing. Note the minus sign in front of the partition index, which selects every row that is not in the training set.

training <- CompleteResponses_M2T2_1[ M2T2_training,]
testing  <- CompleteResponses_M2T2_1[-M2T2_training,]
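One way to check the sizes of the two sets is sketched below on hypothetical data (the toy data frame and its 80/120 class split are assumptions for illustration). It also shows that createDataPartition() keeps the class proportions of the outcome roughly the same in the training set.

```r
library(caret)

set.seed(123)

# Hypothetical two-class data: 80 rows of class 0 and 120 rows of class 1
toy <- data.frame(
  x     = rnorm(200),
  brand = factor(rep(c(0, 1), times = c(80, 120)))
)

# Stratified 75/25 split on the outcome
in_train  <- createDataPartition(y = toy$brand, p = 0.75, list = FALSE)
toy_train <- toy[ in_train, ]
toy_test  <- toy[-in_train, ]

# Sizes of the two sets
nrow(toy_train)
nrow(toy_test)

# Class proportions in the training set stay close to 40/60
prop.table(table(toy_train$brand))
```

In the report, nrow(training) and nrow(testing) would play the same role: the model output further down shows 7424 training rows, and the test-set predictions sum to 2474 rows.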

trainControl() specifies how model performance is estimated during training, using resampling methods such as cross-validation or bootstrapping. Because each resample holds out part of the training data in turn, the whole training set can be used for model building without setting aside a separate validation split. The script below sets up repeated cross-validation with the trainControl() function.

fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  ## repeated five times
  repeats = 5)
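For comparison, a minimal sketch of a few resampling set-ups that trainControl() accepts. The object names are illustrative, and the alternative settings (10-fold cross-validation, 25 bootstrap resamples) are examples rather than choices made in this report.

```r
library(caret)

# Repeated cross-validation, as used in this report: 5 folds x 5 repeats
repeated_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

# Total number of resamples the model will be fitted on
n_resamples <- repeated_cv$number * repeated_cv$repeats
n_resamples

# Plain 10-fold cross-validation (no repeats)
plain_cv <- trainControl(method = "cv", number = 10)

# Bootstrap resampling with 25 resamples
boot_ctrl <- trainControl(method = "boot", number = 25)
```

Five folds repeated five times gives 25 resamples, which is the "Number of resamples: 25" reported by summary(results) later in the report.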

train() fits a model, such as C5.0 or random forest, over a grid of tuning parameters and computes resampling-based performance measures for each candidate.

Tip: train() takes a formula. Put the variable you are trying to predict on the left of the tilde (~) and a full stop on the right to use all the other variables as predictors, as in brand ~ . To use specific predictors only, name them instead, for example brand ~ salary + age.
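How the full stop in a formula expands can be inspected with base R's terms() function. The sketch below uses a hypothetical data frame with three predictors; the values are made up.

```r
# Hypothetical data frame with three predictors and the brand outcome
toy <- data.frame(
  salary = c(50000, 72000),
  age    = c(34, 51),
  credit = c(300, 650),
  brand  = factor(c(0, 1))
)

# "brand ~ ." expands to every column except brand
all_predictors <- attr(terms(brand ~ ., data = toy), "term.labels")
all_predictors

# Naming predictors explicitly restricts the model to those columns
some_predictors <- attr(terms(brand ~ salary + age, data = toy), "term.labels")
some_predictors
```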

C5.0Fit <- train(brand ~ ., data = training, 
                 method = "C5.0", 
                 trControl = fitControl,
                 ## verbose is passed through to the underlying model
                 ## function (the option comes from caret's gbm() example)
                 verbose = FALSE)
## Warning: 'trials' should be <= 5 for this object. Predictions generated
## using 5 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 7 for this object. Predictions generated
## using 7 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated
## using 9 trials

## Warning: 'trials' should be <= 9 for this object. Predictions generated
## using 9 trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated
## using 4 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 6 for this object. Predictions generated
## using 6 trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated
## using 5 trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated
## using 9 trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated
## using 5 trials
C5.0Fit
## C5.0 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 5939, 5940, 5938, 5939, 5940, 5939, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8885281  0.7693818
##   rules  FALSE   10      0.9139828  0.8162104
##   rules  FALSE   20      0.9151954  0.8193929
##   rules   TRUE    1      0.8893367  0.7712589
##   rules   TRUE   10      0.9150073  0.8184354
##   rules   TRUE   20      0.9165429  0.8222534
##   tree   FALSE    1      0.8871002  0.7641830
##   tree   FALSE   10      0.9154646  0.8204014
##   tree   FALSE   20      0.9167041  0.8234825
##   tree    TRUE    1      0.8879897  0.7661570
##   tree    TRUE   10      0.9170273  0.8236130
##   tree    TRUE   20      0.9177005  0.8255726
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree
##  and winnow = TRUE.
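The table above comes from caret's automatic grid, which tried 12 combinations of trials, model and winnow. A manual grid covering the same values could be built as sketched below. The train() call is left as a comment because it needs the report's training data, and the name C5.0Fit_manual is hypothetical.

```r
# Manual tuning grid for C5.0: every combination of the three parameters
manual_grid <- expand.grid(
  trials = c(1, 10, 20),
  model  = c("tree", "rules"),
  winnow = c(FALSE, TRUE)
)

# 3 x 2 x 2 = 12 candidate models, one per row of the results table
nrow(manual_grid)

# Hypothetical call, assuming the training and fitControl objects from above:
# C5.0Fit_manual <- train(brand ~ ., data = training,
#                         method = "C5.0",
#                         trControl = fitControl,
#                         tuneGrid = manual_grid)
```

Changing the vectors passed to expand.grid() is how other values, such as more boosting trials, would be explored.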

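Predictions on the test set. By default predict() returns class labels, not probabilities, despite the object name C50Probs: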

C50Probs <- predict(C5.0Fit, newdata = testing)

To see the first few predictions

head(C50Probs)
## [1] 0 1 0 1 1 1
## Levels: 0 1

Plot the resampled accuracy across the tuning parameters

plot(C5.0Fit)

To determine how the model prioritized each feature

varImp(C5.0Fit)
## C5.0 variable importance
## 
##         Overall
## salary   100.00
## age       85.51
## car       85.51
## credit    30.51
## zipcode    0.00
## elevel     0.00

To plot the variable importance

plot(varImp(C5.0Fit))

To check the performance of the model on the test set

C50Classes <- predict(C5.0Fit, newdata = testing)
str(C50Classes)
##  Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 1 ...

confusionMatrix(data = C50Classes, testing$brand)

The resamples function can be used to collect, summarize and contrast the resampling results once both models have been fitted.

# results <- resamples(list(C50 = C5.0Fit, rf = rfFit))
# summary(results)

The same steps were taken using the random forest model. The code used is detailed below:

# str(CompleteResponses_M2T2)

#Train control

fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  ## repeated five times
  repeats = 5)

#Training

rfFit <- train(brand ~ ., data = training, 
                 method = "rf", 
                 trControl = fitControl,
                 ## verbose is passed through to the underlying model
                 ## function (the option comes from caret's gbm() example)
                 verbose = FALSE)
rfFit
## Random Forest 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 5940, 5939, 5939, 5939, 5939, 5940, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9181560  0.8264293
##   4     0.9169706  0.8235969
##   6     0.9134685  0.8161068
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

#Predictions -

rfProbs <- predict(rfFit, newdata = testing)
summary(rfProbs)
##    0    1 
##  968 1506
plot(rfProbs)

#To see the first few predictions

head(rfProbs)
## [1] 0 1 0 1 1 1
## Levels: 0 1

#Plot resample

plot(rfFit)

#To determine how the model prioritized each feature

varImp(rfFit)
## rf variable importance
## 
##         Overall
## salary  100.000
## age      48.424
## credit   13.110
## car       4.457
## zipcode   2.047
## elevel    0.000

#To Plot the variable Importance

plot(varImp(rfFit))

#To check the performance of the model on the test set

rfClasses <- predict(rfFit, newdata = testing)
str(rfClasses)
##  Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 1 ...

confusionMatrix(data = rfClasses, testing$brand)

#The resamples function can be used to collect, summarize and contrast the resampling results

results <- resamples(list(C5.0Fit, rfFit))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: Model1, Model2 
## Number of resamples: 25 
## 
## Accuracy 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Model1 0.8983849 0.9138047 0.9178451 0.9177005 0.9211590 0.9285714    0
## Model2 0.9022911 0.9137466 0.9184636 0.9181560 0.9232323 0.9333333    0
## 
## Kappa 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Model1 0.7855183 0.8165265 0.8268165 0.8255726 0.8329838 0.8483165    0
## Model2 0.7915625 0.8165265 0.8266703 0.8264293 0.8372772 0.8593202    0

Similar steps were taken for the incomplete data

SurveyIncomplete_M2T2_1 <- read.csv(file="C:/Users/gebruiker/Desktop/Ubiqum_1/SurveyIncomplete_M2T2.csv")
str(SurveyIncomplete_M2T2_1)
## 'data.frame':    5000 obs. of  7 variables:
##  $ salary : num  110500 140894 119160 20000 93956 ...
##  $ age    : int  54 44 49 56 59 71 32 33 32 58 ...
##  $ elevel : int  3 4 2 0 1 2 1 4 1 2 ...
##  $ car    : int  15 20 1 9 15 7 17 17 19 8 ...
##  $ zipcode: int  4 7 3 1 1 2 1 0 2 4 ...
##  $ credit : num  354724 395015 122025 99630 458680 ...
##  $ brand  : int  0 0 0 0 0 0 0 0 0 0 ...

#change brand to factor

SurveyIncomplete_M2T2_1$brand<-as.factor(SurveyIncomplete_M2T2_1$brand)

#Create data partition

SM2T2_training<- createDataPartition(y = SurveyIncomplete_M2T2_1$brand, p=0.75, list=FALSE)

# trainSize
# testSize

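#these assign the training and test data to the names training and testing. Note that the rows below are taken from CompleteResponses_M2T2_1, not SurveyIncomplete_M2T2_1, so the 3750 training rows and 6148 test rows in the output that follows come from the complete responses.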

training <- CompleteResponses_M2T2_1[ SM2T2_training,]
testing  <- CompleteResponses_M2T2_1[-SM2T2_training,]
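An alternative for this step, sketched here under assumptions, is to fit a model on rows whose brand is known and use it to predict rows whose brand is unknown. Everything in the sketch is hypothetical: the simulated data frames stand in for the complete and incomplete surveys, and logistic regression (method = "glm") is swapped in for random forest only to keep the example light.

```r
library(caret)

set.seed(123)

# Hypothetical labelled data, standing in for the complete survey
n <- 200
labelled <- data.frame(
  salary = runif(n, 20000, 150000),
  age    = sample(20:80, n, replace = TRUE)
)
# Simulated outcome: higher salaries lean towards class 1, with noise
labelled$brand <- factor(
  ifelse(labelled$salary + rnorm(n, sd = 20000) > 85000, 1, 0)
)

# Hypothetical unlabelled data, standing in for the incomplete survey
unlabelled <- data.frame(
  salary = runif(50, 20000, 150000),
  age    = sample(20:80, 50, replace = TRUE)
)

# Fit on the labelled rows with 5-fold cross-validation
fit <- train(brand ~ ., data = labelled,
             method = "glm",
             trControl = trainControl(method = "cv", number = 5))

# Predict the missing brand for the unlabelled rows
predicted_brand <- predict(fit, newdata = unlabelled)
table(predicted_brand)
```

With the report's own objects, the equivalent would be a predict() call on a fitted model with the incomplete survey passed as newdata.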

#Train control

fitControl_i <- trainControl(
  method = "repeatedcv",
  number = 10,
  ## repeated ten times
  repeats = 10)

#training

rfFit_i <- train(brand ~ ., data = training, 
                 method = "rf", 
                 trControl = fitControl_i,
                 ## verbose is passed through to the underlying model
                 ## function (the option comes from caret's gbm() example)
                 verbose = FALSE)
rfFit_i
## Random Forest 
## 
## 3750 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 3376, 3375, 3375, 3375, 3375, 3375, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9163470  0.8227362
##   4     0.9168012  0.8232987
##   6     0.9128821  0.8147560
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.

#Predictions -

rfProbs_i <- predict(rfFit_i, newdata = testing)

#To see the first few predictions

head(rfProbs_i)
## [1] 1 1 0 1 0 1
## Levels: 0 1

#Plot resample

plot(rfFit_i)

#To determine how the model prioritized each feature

varImp(rfFit_i)
## rf variable importance
## 
##         Overall
## salary  100.000
## age      75.506
## credit    8.333
## car       3.049
## zipcode   2.035
## elevel    0.000

#to get a plot of the VarImp

plot(varImp(rfFit_i))

rfClasses_i <- predict(rfFit_i, newdata = testing)
str(rfClasses_i)
##  Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
confusionMatrix(data = rfProbs_i, testing$brand)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2099  260
##          1  227 3562
##                                           
##                Accuracy : 0.9208          
##                  95% CI : (0.9138, 0.9274)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8321          
##                                           
##  Mcnemar's Test P-Value : 0.147           
##                                           
##             Sensitivity : 0.9024          
##             Specificity : 0.9320          
##          Pos Pred Value : 0.8898          
##          Neg Pred Value : 0.9401          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3414          
##    Detection Prevalence : 0.3837          
##       Balanced Accuracy : 0.9172          
##                                           
##        'Positive' Class : 0               
## 
postResample(rfProbs_i, testing$brand)
##  Accuracy     Kappa 
## 0.9207872 0.8320700
summary(rfProbs_i)
##    0    1 
## 2359 3789
plot(rfProbs_i)
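The accuracy and kappa reported by postResample() can be recomputed by hand from the four counts in the confusion matrix above. The sketch below uses only base R. The variable names are illustrative, and the formulas are the standard definitions of accuracy, Cohen's kappa, sensitivity and specificity.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(2099, 227,    # reference 0: predicted 0, predicted 1
    260, 3562),   # reference 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

n <- sum(cm)

# Accuracy: share of predictions on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

# With class 0 as the positive class, as in the output above
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy, kappa = cohen_kappa,
        sensitivity = sensitivity, specificity = specificity), 4)
```

The rounded values agree with the Accuracy, Kappa, Sensitivity and Specificity lines in the confusionMatrix() output.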

Findings

Findings from the complete dataset, as shown on line 54 and demonstrated in the plot on line 59, show that 6154 respondents prefer Sony while 3744 respondents prefer Acer. The random forest model was used because it gave a slightly higher mean resampled accuracy than C5.0 (0.9182 versus 0.9177). In the incomplete-survey step, the random forest model predicted that 3789 respondents prefer Sony and 2359 respondents prefer Acer, the counts reported by summary(rfProbs_i) (see lines 335 and 340).