The broad objective of this week's task was to find out which of two brands of computers our customers prefer. This information will help the company decide which manufacturer to pursue a deeper relationship with. To do this, the caret package was used.
Method
To explain how we arrived at our final outcome, this report describes each feature and each step taken in detail, so that the analysis can be reproduced.
What is the caret package?
The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. The package builds on a number of other R packages.
The tools within the caret package can be used for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation.
Steps taken:
1. The data was imported (the complete responses data).
2. The main caret functions were then applied in this order: createDataPartition(), trainControl(), train(), predict(), postResample().
Note: caret uses two grid methods for tuning: Automatic Grid (automated tuning) and Manual Grid (you specify the parameter values).
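To illustrate the manual grid, here is a minimal sketch in base R. The object name c50_grid is hypothetical, and the values are chosen only to mirror the C5.0 combinations that appear in the resampling table later in this report.

```r
# Manual grid: every combination of the three C5.0 tuning parameters.
# c50_grid is a hypothetical name used only for this illustration.
c50_grid <- expand.grid(
  trials = c(1, 10, 20),
  model  = c("tree", "rules"),
  winnow = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Number of candidate models that would be evaluated
nrow(c50_grid)

# A grid like this would be handed to train() through its tuneGrid argument.
# For the automatic grid, the tuneLength argument controls how fine the
# automatically generated grid is.
```

With a manual grid you control exactly which combinations are tried, at the cost of having to know the tuning parameter names of the chosen method.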
Blackwell_Hist_Sample3 <- readRDS(file = "C:/Users/gebruiker/Downloads/Blackwell_Hist_Sample.rds")
# help for loading files
help(read.csv)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
CompleteResponses_M2T2_1 <- read.csv(file="C:/Users/gebruiker/Desktop/Ubiqum_1/CompleteResponses_M2T2.csv")
CompleteResponses_M2T2_1
summary(CompleteResponses_M2T2_1)
summary(CompleteResponses_M2T2_1$brand)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.6217 1.0000 1.0000
plot(CompleteResponses_M2T2_1$brand)
CompleteResponses_M2T2_1$brand<-as.factor(CompleteResponses_M2T2_1$brand)
sum(is.na(CompleteResponses_M2T2_1))
## [1] 0
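As a small illustration of the factor conversion, the sketch below attaches readable labels to the 0/1 codes. The mapping 0 = Acer and 1 = Sony is an assumption inferred from the counts in the Findings, and brand_codes is a hypothetical toy vector rather than the survey data.

```r
# Hypothetical toy codes standing in for the brand column
brand_codes <- c(0, 1, 1, 0, 1)

# Convert to a factor with labels (assumed mapping: 0 = Acer, 1 = Sony)
brand_f <- factor(brand_codes, levels = c(0, 1), labels = c("Acer", "Sony"))

# Count respondents per brand
table(brand_f)
```

The report itself keeps the levels as "0" and "1", which is what the model output below shows.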
The random number generator is then seeded with the set.seed function, so that the random data partition and resampling can be reproduced
set.seed(123)
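A quick sketch of why the seed matters: resetting the same seed before a random draw gives the same draw again. The variable names are hypothetical.

```r
# Same seed, same random draw
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE
```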
The next step is to create the data partition using createDataPartition(), which here places 75% of the rows in the training set
M2T2_training <- createDataPartition(y = CompleteResponses_M2T2_1$brand, p=0.75, list=FALSE)
The sizes of the training and test sets were then checked using
# trainSize
# testSize
Assign the training and test data to the names training and testing
# observe the minus sign in front of the partition's name: it selects the rows that are not in the training partition
training <- CompleteResponses_M2T2_1[ M2T2_training,]
testing <- CompleteResponses_M2T2_1[-M2T2_training,]
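The split sizes can be checked with a small base R sketch of a stratified 75% split. This is an illustration of the idea, not caret's own implementation. The toy outcome uses the class counts reported in the Findings (3744 and 6154), and all object names are hypothetical.

```r
set.seed(123)

# Toy outcome with the class counts reported in the Findings
brand <- factor(c(rep(0, 3744), rep(1, 6154)))
p <- 0.75

# Draw 75% of the row indices within each class,
# so that the class proportions are preserved in the training set
train_idx <- unlist(lapply(split(seq_along(brand), brand), function(idx) {
  sample(idx, size = ceiling(p * length(idx)))
}))

train_size <- length(train_idx)
test_size  <- length(brand) - train_size
c(train = train_size, test = test_size)  # train 7424, test 2474
```

These sizes agree with the 7424 training samples reported by train() and the 2474 test-set predictions summarised later in this report.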
trainControl() specifies how model performance is estimated during training, through resampling methods such as cross-validation or the bootstrap. With resampling, the whole training set can be used for model building without setting aside a further validation split. The script below sets up 5-fold cross-validation, repeated 5 times, using the trainControl() function
fitControl <- trainControl(
method = "repeatedcv",
number = 5,
## repeated five times
repeats = 5)
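To make the resampling scheme concrete, the base R sketch below builds 5 folds 5 times for 7424 rows and records how many rows are left for fitting in each resample. It is a simplified illustration with hypothetical names, and unlike caret it does not stratify the folds by class.

```r
set.seed(123)
n <- 7424      # training rows
k <- 5         # folds (number)
repeats <- 5   # repetitions (repeats)

# For each repetition, assign every row to one of k folds at random,
# then count the rows remaining when each fold is held out in turn
analysis_sizes <- unlist(lapply(seq_len(repeats), function(r) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  sapply(seq_len(k), function(f) sum(fold_id != f))
}))

length(analysis_sizes)  # 25 resamples in total
range(analysis_sizes)   # 5939 5940
```

The 25 resamples match the "Number of resamples: 25" line in the model comparison, and the sizes are close to those listed under "Summary of sample sizes".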
train() fits the chosen modeling function (random forest, C5.0 and others) for each candidate combination of tuning parameters. It sets up a grid of tuning parameters and computes resampling-based performance measures for each candidate
Tip: In the train() formula, put the variable you are trying to predict first, followed by a tilde (~) and a full stop (.) to represent all the other variables. To use specific predictors only, list their names in place of the full stop.
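The effect of the full stop can be seen with base R alone. In this sketch, toy is a hypothetical data frame with a few of the survey's column names.

```r
# Hypothetical toy data frame
toy <- data.frame(
  salary = c(50000, 90000),
  age    = c(30, 55),
  credit = c(1000, 2000),
  brand  = factor(c(0, 1))
)

# "." expands to every column except the outcome
all_predictors <- attr(terms(brand ~ ., data = toy), "term.labels")
all_predictors   # "salary" "age" "credit"

# Naming the predictors restricts the model to those columns
some_predictors <- attr(terms(brand ~ salary + age, data = toy), "term.labels")
some_predictors  # "salary" "age"
```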
C5.0Fit <- train(brand ~ ., data = training,
method = "C5.0",
trControl = fitControl,
## extra arguments such as this one are passed
## through to the underlying model function
verbose = FALSE)
## Warning: 'trials' should be <= 5 for this object. Predictions generated
## using 5 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 7 for this object. Predictions generated
## using 7 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated
## using 9 trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated
## using 9 trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated
## using 4 trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated
## using 8 trials
## Warning: 'trials' should be <= 6 for this object. Predictions generated
## using 6 trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated
## using 5 trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated
## using 9 trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated
## using 5 trials
C5.0Fit
## C5.0
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 5939, 5940, 5938, 5939, 5940, 5939, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8885281 0.7693818
## rules FALSE 10 0.9139828 0.8162104
## rules FALSE 20 0.9151954 0.8193929
## rules TRUE 1 0.8893367 0.7712589
## rules TRUE 10 0.9150073 0.8184354
## rules TRUE 20 0.9165429 0.8222534
## tree FALSE 1 0.8871002 0.7641830
## tree FALSE 10 0.9154646 0.8204014
## tree FALSE 20 0.9167041 0.8234825
## tree TRUE 1 0.8879897 0.7661570
## tree TRUE 10 0.9170273 0.8236130
## tree TRUE 20 0.9177005 0.8255726
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree
## and winnow = TRUE.
Predictions on the test set
C50Probs <- predict(C5.0Fit, newdata = testing)
To see the first few predicted values
head(C50Probs)
## [1] 0 1 0 1 1 1
## Levels: 0 1
Plot the resampling results
plot(C5.0Fit)
To determine how the model prioritized each feature
varImp(C5.0Fit)
## C5.0 variable importance
##
## Overall
## salary 100.00
## age 85.51
## car 85.51
## credit 30.51
## zipcode 0.00
## elevel 0.00
To get a plot of the variable importance
plot(varImp(C5.0Fit))
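The Overall column above runs from 0 to 100, which is consistent with a min-max rescaling of the raw importance scores. The sketch below shows that rescaling on hypothetical raw values. The numbers are invented for illustration and are not taken from the fitted model.

```r
# Hypothetical raw importance scores (not from the fitted model)
raw_importance <- c(salary = 40, age = 25, credit = 10, elevel = 5)

# Min-max rescaling: the least important feature maps to 0, the most important to 100
scaled <- (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance)) * 100

round(scaled, 2)
```

A score of 0 therefore means "least important among these features", not necessarily "no influence at all".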
The test-set predictions are then used to check the performance of each model
C50Classes <- predict(C5.0Fit, newdata = testing)
str(C50Classes)
## Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 1 ...
confusionMatrix(data = C50Classes, testing$brand)
The resamples() function can be used to collect, summarize and contrast the resampling results
# results <- resamples(list(C50 = C5.0Fit, rf = rfFit))
# summary(results)
The same steps were taken using the random forest model. The code used is detailed below:
# str(CompleteResponses_M2T2)
#Train control
fitControl <- trainControl(
method = "repeatedcv",
number = 5,
## repeated five times
repeats = 5)
#Training
rfFit <- train(brand ~ ., data = training,
method = "rf",
trControl = fitControl,
## extra arguments such as this one are passed
## through to the underlying model function
verbose = FALSE)
rfFit
## Random Forest
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 5940, 5939, 5939, 5939, 5939, 5940, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9181560 0.8264293
## 4 0.9169706 0.8235969
## 6 0.9134685 0.8161068
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
#Predictions -
rfProbs <- predict(rfFit, newdata = testing)
summary(rfProbs)
## 0 1
## 968 1506
plot(rfProbs)
#To see the first few predicted values
head(rfProbs)
## [1] 0 1 0 1 1 1
## Levels: 0 1
#Plot resample
plot(rfFit)
#To determine how the model prioritized each feature
varImp(rfFit)
## rf variable importance
##
## Overall
## salary 100.000
## age 48.424
## credit 13.110
## car 4.457
## zipcode 2.047
## elevel 0.000
#To Plot the variable Importance
plot(varImp(rfFit))
#used to check the performance of each model
rfClasses <- predict(rfFit, newdata = testing)
str(rfClasses)
## Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 1 ...
confusionMatrix(data = rfClasses, testing$brand)
#The resamples function can be used to collect, summarize and contrast the resampling results
results <- resamples(list(C5.0Fit, rfFit))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: Model1, Model2
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Model1 0.8983849 0.9138047 0.9178451 0.9177005 0.9211590 0.9285714 0
## Model2 0.9022911 0.9137466 0.9184636 0.9181560 0.9232323 0.9333333 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Model1 0.7855183 0.8165265 0.8268165 0.8255726 0.8329838 0.8483165 0
## Model2 0.7915625 0.8165265 0.8266703 0.8264293 0.8372772 0.8593202 0
Similar steps were taken for the incomplete data
SurveyIncomplete_M2T2_1 <- read.csv(file="C:/Users/gebruiker/Desktop/Ubiqum_1/SurveyIncomplete_M2T2.csv")
str(SurveyIncomplete_M2T2_1)
## 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 110500 140894 119160 20000 93956 ...
## $ age : int 54 44 49 56 59 71 32 33 32 58 ...
## $ elevel : int 3 4 2 0 1 2 1 4 1 2 ...
## $ car : int 15 20 1 9 15 7 17 17 19 8 ...
## $ zipcode: int 4 7 3 1 1 2 1 0 2 4 ...
## $ credit : num 354724 395015 122025 99630 458680 ...
## $ brand : int 0 0 0 0 0 0 0 0 0 0 ...
#change brand to factor
SurveyIncomplete_M2T2_1$brand<-as.factor(SurveyIncomplete_M2T2_1$brand)
#Create data partition
SM2T2_training<- createDataPartition(y = SurveyIncomplete_M2T2_1$brand, p=0.75, list=FALSE)
# trainSize
# testSize
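#these assign the training and test data to the names training and testing
#note: the rows are taken from CompleteResponses_M2T2_1, indexed by the partition built from the incomplete survey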
training <- CompleteResponses_M2T2_1[ SM2T2_training,]
testing <- CompleteResponses_M2T2_1[-SM2T2_training,]
#Train control
fitControl_i <- trainControl(
method = "repeatedcv",
number = 10,
## repeated ten times
repeats = 10)
#training
rfFit_i <- train(brand ~ ., data = training,
method = "rf",
trControl = fitControl_i,
## extra arguments such as this one are passed
## through to the underlying model function
verbose = FALSE)
rfFit_i
## Random Forest
##
## 3750 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 3376, 3375, 3375, 3375, 3375, 3375, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9163470 0.8227362
## 4 0.9168012 0.8232987
## 6 0.9128821 0.8147560
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
#Predictions -
rfProbs_i <- predict(rfFit_i, newdata = testing)
#To see the first few predicted values
head(rfProbs_i)
## [1] 1 1 0 1 0 1
## Levels: 0 1
#Plot resample
plot(rfFit_i)
#To determine how the model prioritized each feature
varImp(rfFit_i)
## rf variable importance
##
## Overall
## salary 100.000
## age 75.506
## credit 8.333
## car 3.049
## zipcode 2.035
## elevel 0.000
#to get a plot of the VarImp
plot(varImp(rfFit_i))
rfClasses_i <- predict(rfFit_i, newdata = testing)
str(rfClasses_i)
## Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
confusionMatrix(data = rfProbs_i, testing$brand)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2099 260
## 1 227 3562
##
## Accuracy : 0.9208
## 95% CI : (0.9138, 0.9274)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8321
##
## Mcnemar's Test P-Value : 0.147
##
## Sensitivity : 0.9024
## Specificity : 0.9320
## Pos Pred Value : 0.8898
## Neg Pred Value : 0.9401
## Prevalence : 0.3783
## Detection Rate : 0.3414
## Detection Prevalence : 0.3837
## Balanced Accuracy : 0.9172
##
## 'Positive' Class : 0
##
postResample(rfProbs_i, testing$brand)
## Accuracy Kappa
## 0.9207872 0.8320700
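The headline statistics can be recomputed by hand from the four counts in the confusion matrix above. The sketch below does this in base R. The object names are hypothetical, and class 0 is treated as the positive class, as in the output.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(2099, 227, 260, 3562), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)

# Accuracy: share of predictions on the diagonal
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: accuracy corrected for the agreement expected by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity and specificity with class 0 as the positive class
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy, kappa = cohen_kappa,
        sensitivity = sensitivity, specificity = specificity), 4)
# accuracy 0.9208, kappa 0.8321, sensitivity 0.9024, specificity 0.9320
```

These values agree with the confusionMatrix() and postResample() output, which is a useful sanity check when reading either summary.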
summary(rfProbs_i)
## 0 1
## 2359 3789
plot(rfProbs_i)
Findings
Findings from the complete dataset, as shown on line 54 and demonstrated in the plot on line 59, show that 6154 respondents prefer Sony while 3744 respondents prefer Acer. The random forest model was used because it gave a slightly higher mean resampled accuracy than C5.0 (0.9182 versus 0.9177). When applied to the incomplete dataset, this model predicted that 3789 respondents preferred Sony, while 2359 respondents preferred Acer (see line 335 and 340).