CompleteResponses<-read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\R_exercises\\CompleteResponses.csv")
Original<-read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\R_exercises\\CompleteResponses.csv")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(gbm)
## Loaded gbm 2.1.8.1
library(ISLR)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Summary of the Data set:
str(CompleteResponses)
## 'data.frame': 9898 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : int 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : int 0 1 0 3 3 3 4 3 4 1 ...
## $ car : int 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: int 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : int 0 1 0 1 0 1 1 1 0 1 ...
summary(CompleteResponses)
## salary age elevel car
## Min. : 20000 Min. :20.00 Min. :0.000 Min. : 1.00
## 1st Qu.: 52082 1st Qu.:35.00 1st Qu.:1.000 1st Qu.: 6.00
## Median : 84950 Median :50.00 Median :2.000 Median :11.00
## Mean : 84871 Mean :49.78 Mean :1.983 Mean :10.52
## 3rd Qu.:117162 3rd Qu.:65.00 3rd Qu.:3.000 3rd Qu.:15.75
## Max. :150000 Max. :80.00 Max. :4.000 Max. :20.00
## zipcode credit brand
## Min. :0.000 Min. : 0 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:120807 1st Qu.:0.0000
## Median :4.000 Median :250607 Median :1.0000
## Mean :4.041 Mean :249176 Mean :0.6217
## 3rd Qu.:6.000 3rd Qu.:374640 3rd Qu.:1.0000
## Max. :8.000 Max. :500000 Max. :1.0000
Some variable types need to be amended: elevel, car, zipcode and brand are categorical codes stored as integers, so we convert them to factors:
CompleteResponses$elevel<- as.factor(CompleteResponses$elevel)
CompleteResponses$car<- as.factor(CompleteResponses$car)
CompleteResponses$zipcode<-as.factor(CompleteResponses$zipcode)
CompleteResponses$brand<- as.factor(CompleteResponses$brand)
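The effect of the conversion can be seen in a minimal sketch. The data frame `toy` and its values are made up for illustration and are not part of the survey; the point is only that a factor is expanded into indicator columns, whereas an integer code would be treated as a quantity.

```r
# Toy data standing in for the survey; the values are made up for illustration
toy <- data.frame(
  salary = c(119807, 106880, 78021, 63690),
  elevel = c(0L, 1L, 0L, 3L),
  brand  = c(0L, 1L, 0L, 1L)
)

# Convert the categorical codes to factors in one step
cat_cols <- c("elevel", "brand")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# As an integer, elevel would enter a model as a single numeric slope;
# as a factor, it is expanded into one indicator column per non-reference level
dummy_names <- colnames(model.matrix(~ elevel, data = toy))
print(dummy_names)
```

This is also why the variable importance tables further down list names such as zipcode2 or car15 rather than zipcode or car.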
The mean salary for Acer (0) is smaller than that for Sony (1), and Acer also has a smaller IQR than Sony. We don’t see any outliers in our plots.
The density chart below shows that age is roughly uniformly distributed in our dataset. The mean and median age are both about 50 years.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph below shows distinct pockets of customers who prefer either Acer or Sony.
Education levels are also roughly uniformly distributed.
## Warning in geom_histogram(stat = "count", color = "darkblue", fill =
## "lightblue"): Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
A similar picture here: an even distribution among car brands.
## Warning in geom_histogram(stat = "count", color = "darkblue", fill =
## "lightblue"): Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
Uniform distribution amongst zip codes:
## Warning in geom_histogram(stat = "count", color = "darkblue", fill =
## "lightblue"): Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
No outliers in credit limit data:
The plot below shows that the mean credit limit is very similar across brand preferences:
We start by fixing the random seed, so that the train/test split is reproducible, and by checking whether any values are missing. We will split the data into 75%/25% sets for training and testing.
#Set the random seed so the split below is reproducible
set.seed(123)
#Check for any missing values (it was detailed that this is a complete survey)
anyNA(CompleteResponses)
## [1] FALSE
#No missing values
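A data frame is never NULL, so `is.null()` returns FALSE whether or not any cells are missing, whereas `anyNA()` and `colSums(is.na())` inspect the cells themselves. A minimal sketch on a made-up data frame (the values are illustrative, not taken from the survey):

```r
# Made-up data frame with two missing cells
toy <- data.frame(salary = c(50000, NA, 72000), age = c(31L, 45L, NA))

object_is_null <- is.null(toy)        # FALSE: the object exists, missing cells or not
has_missing    <- anyNA(toy)          # TRUE: at least one cell is NA
missing_by_col <- colSums(is.na(toy)) # number of missing cells per column

print(object_is_null)
print(has_missing)
print(missing_by_col)
```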
#Split data into train and test sets at 75%/25%
IndexTrain<- createDataPartition(y=CompleteResponses$brand,
p=0.75,
list=FALSE)
Training_set_responses<-CompleteResponses[IndexTrain,]
Testing_set_responses<-CompleteResponses[-IndexTrain,]
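Because the split is made on the brand column, both partitions should keep roughly the same Acer/Sony balance. The base-R sketch below illustrates the idea of such a stratified split on made-up labels; it is an approximation for illustration, and caret's own procedure may differ in its details.

```r
set.seed(123)

# Made-up class labels with a 40/60 split (illustrative, not the survey data)
brand <- factor(rep(c("0", "1"), times = c(40, 60)))

# Draw 75% of the row indices separately within each class
rows_by_class <- split(seq_along(brand), brand)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}), use.names = FALSE)

# Both partitions keep the original 40/60 class balance
train_share <- prop.table(table(brand[train_idx]))
test_share  <- prop.table(table(brand[-train_idx]))
print(train_share)
print(test_share)
```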
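We have three numeric variables (salary, credit and age) whose magnitudes differ widely. The code below standardizes them (centering and scaling), estimating the means and standard deviations on the training set only and applying them to both sets: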
#Preprocess the numeric data
numerics<-c('salary','credit','age')
ProcValues<-preProcess(Training_set_responses[,numerics],method=c('center','scale'))
Training_set_responses_numerics_Scaled<- predict(ProcValues,Training_set_responses[,numerics])
Testing_set_responses_numerics_Scaled<- predict(ProcValues,Testing_set_responses[,numerics])
Training_set_responses_Scaled<-Training_set_responses
Training_set_responses_Scaled[,numerics]<-Training_set_responses_numerics_Scaled
Testing_set_responses_Scaled<- Testing_set_responses
Testing_set_responses_Scaled[,numerics]<-Testing_set_responses_numerics_Scaled
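The same centering and scaling can be written out by hand, which makes it explicit that the test rows are transformed with the training statistics. This is a minimal base-R sketch on made-up numbers; the data frames `train` and `test` are hypothetical stand-ins.

```r
# Made-up training and test rows (illustrative values only)
train <- data.frame(salary = c(20000, 60000, 100000), age = c(20, 50, 80))
test  <- data.frame(salary = c(60000, 150000),        age = c(35, 50))

# Estimate the centering and scaling statistics on the training rows only
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

# Apply the same statistics to both sets
train_scaled <- scale(train, center = mu, scale = sigma)
test_scaled  <- scale(test,  center = mu, scale = sigma)

print(train_scaled[, "salary"])  # -1, 0, 1
print(test_scaled[, "salary"])   # 0, 2.25
```

Scaled test values are not guaranteed to have mean 0 or standard deviation 1, because they are measured against the training distribution.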
The first classification model we will try is Gradient Boosting, a decision tree based classifier.
We are trying to predict the brand of computer the customer would prefer between Acer (0) and Sony (1).
The code below trains a gradient boosting model on our training data. We use 10-fold cross-validation on the training data, which helps to detect overfitting.
# Define the control parameters for our model
responses_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 1)
# Fit Gradient Boosting model (verbose = FALSE suppresses gbm's per-iteration training log)
GBMFit <- train(brand ~ ., data = Training_set_responses_Scaled, method = 'gbm',
                trControl = responses_ctrl, verbose = FALSE)
GBMFitAccuracy<-GBMFit$results
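To make the resampling scheme concrete, here is a minimal base-R sketch of 10-fold cross-validation. It uses simulated data and a logistic regression as a simple stand-in classifier, so the names `toy`, `fold_id` and `acc` and the resulting numbers are illustrative assumptions and say nothing about the survey.

```r
set.seed(123)

# Simulated two-class data (illustrative only)
n <- 200
k <- 10
x <- rnorm(n)
toy <- data.frame(x = x, y = factor(rbinom(n, 1, plogis(1.5 * x))))

# Assign every row to exactly one of k folds
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
acc <- vapply(seq_len(k), function(f) {
  fit  <- glm(y ~ x, data = toy[fold_id != f, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold_id == f, ], type = "response")
  pred <- ifelse(prob > 0.5, "1", "0")
  mean(pred == toy$y[fold_id == f])
}, numeric(1))

# The mean and standard deviation across folds correspond to the
# Accuracy and AccuracySD columns reported by caret
print(c(mean = mean(acc), sd = sd(acc)))
```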
The highest cross-validated accuracy (0.9155) is achieved with 150 trees and an interaction depth of 3:
#Check results
GBMFit
## Stochastic Gradient Boosting
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6682, 6682, 6682, 6681, 6682, 6682, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7264303 0.4227616
## 1 100 0.7281809 0.4250240
## 1 150 0.7279119 0.4245218
## 2 50 0.8306819 0.6431492
## 2 100 0.8809244 0.7514811
## 2 150 0.9067861 0.8038497
## 3 50 0.8728423 0.7374849
## 3 100 0.9028839 0.7967197
## 3 150 0.9155438 0.8215371
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
GBMFitAccuracy
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa
## 1 0.1 1 10 50 0.7264303 0.4227616
## 4 0.1 2 10 50 0.8306819 0.6431492
## 7 0.1 3 10 50 0.8728423 0.7374849
## 2 0.1 1 10 100 0.7281809 0.4250240
## 5 0.1 2 10 100 0.8809244 0.7514811
## 8 0.1 3 10 100 0.9028839 0.7967197
## 3 0.1 1 10 150 0.7279119 0.4245218
## 6 0.1 2 10 150 0.9067861 0.8038497
## 9 0.1 3 10 150 0.9155438 0.8215371
## AccuracySD KappaSD
## 1 0.015308770 0.03159188
## 4 0.022678649 0.04803775
## 7 0.013587121 0.02617844
## 2 0.014526039 0.02925238
## 5 0.010622789 0.02124222
## 8 0.018291331 0.03665301
## 3 0.014971874 0.03060939
## 6 0.014445881 0.02956136
## 9 0.008622595 0.01811902
We also see that Salary and Age are the most important factors in predicting customer brand preferences:
varImp(GBMFit)
## gbm variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## salary 100.00000
## age 74.33530
## credit 1.38598
## zipcode2 0.13944
## car15 0.09672
## zipcode3 0.07518
## car12 0.06845
## zipcode4 0.06718
## car14 0.06286
## car4 0.05841
## elevel1 0.05572
## zipcode5 0.05502
## elevel4 0.05175
## zipcode6 0.05028
## car10 0.04734
## car2 0.04384
## car3 0.04338
## zipcode8 0.04119
## zipcode1 0.03938
## car8 0.03874
We will now use our trained model to predict which computer brand customers prefer in our untouched testing data. The model predicts whether each customer prefers Acer or Sony, and the confusion matrix below compares our predictions against the customers' real choices.
#predictions
BrandPredsGBM<- predict(GBMFit, newdata=Testing_set_responses_Scaled)
#confusion matrix
GBMConfMat<- confusionMatrix(data=BrandPredsGBM,Testing_set_responses$brand)
GBMConfMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 873 109
## 1 63 1429
##
## Accuracy : 0.9305
## 95% CI : (0.9197, 0.9402)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8536
##
## Mcnemar's Test P-Value : 0.0006009
##
## Sensitivity : 0.9327
## Specificity : 0.9291
## Pos Pred Value : 0.8890
## Neg Pred Value : 0.9578
## Prevalence : 0.3783
## Detection Rate : 0.3529
## Detection Prevalence : 0.3969
## Balanced Accuracy : 0.9309
##
## 'Positive' Class : 0
##
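The headline statistics can be recomputed by hand from the four cell counts, which helps when reading the output. The sketch below copies the counts from the confusion matrix above; the variable names are my own.

```r
# Test-set confusion matrix for the gradient boosting model, copied from the output above
# (rows = prediction, columns = reference; class 0 is the 'positive' class)
cm <- matrix(c(873, 63, 109, 1429), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # correctly predicted Acer / all true Acer
specificity <- cm["1", "1"] / sum(cm[, "1"])  # correctly predicted Sony / all true Sony

# Cohen's kappa: observed agreement corrected for the agreement expected by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa), 4))
```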
We will now try a Random Forest classification model. Random Forest is another tree-based classifier: it builds many decision trees, each on a bootstrap sample of the data, and combines their predictions into a final prediction.
We are going to manually tune our Random Forest model, testing mtry values from 6 to 10 to see which provides the highest accuracy (I tested 1-5 separately and they always scored lower). Here, mtry is the number of predictors randomly sampled as candidates at each split of a tree.
#Create grids for manual tuning
RandomForestGrid2<-expand.grid(mtry=c(6,7,8,9,10))
#Fit the model:
RandomForestFit2<-train(brand~.,data = Training_set_responses_Scaled,method='rf', trControl=responses_ctrl,
tuneGrid=RandomForestGrid2)
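Values of mtry above 6 are possible because the formula interface expands each factor into indicator columns, so the forest sees more columns than the six original predictors. The sketch below counts them, assuming every integer code between the minimum and maximum shown in the summary occurs in the data; the result matches the "out of 34" reported by varImp.

```r
# Number of levels per factor, read off the summary (elevel 0-4, car 1-20, zipcode 0-8)
n_levels  <- c(elevel = 5, car = 20, zipcode = 9)
n_numeric <- 3  # salary, age, credit

# Each factor contributes (levels - 1) indicator columns
n_columns <- sum(n_levels - 1) + n_numeric
print(n_columns)  # 34

# With mtry = 10, each split considers a fresh random subset of 10 of these columns
set.seed(1)
candidates <- sample(n_columns, size = 10)
print(sort(candidates))
```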
Accuracy improves as mtry increases, although the gains become small beyond mtry = 8.
mtry = 10 gives the highest cross-validated accuracy (0.9185):
#Check results
RandomForestFit2
## Random Forest
##
## 7424 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6682, 6682, 6682, 6681, 6682, 6682, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 6 0.9108276 0.8112342
## 7 0.9163499 0.8228310
## 8 0.9174281 0.8249692
## 9 0.9182358 0.8264965
## 10 0.9185057 0.8270487
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
#More in-depth results:
RandomForest2Results<- RandomForestFit2$results
RandomForest2Results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 6 0.9108276 0.8112342 0.008704044 0.01828027
## 2 7 0.9163499 0.8228310 0.006854520 0.01444745
## 3 8 0.9174281 0.8249692 0.006441021 0.01358864
## 4 9 0.9182358 0.8264965 0.006509490 0.01364185
## 5 10 0.9185057 0.8270487 0.007074204 0.01515796
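One rough way to judge whether the differences between mtry values matter is to compare each gap to the fold-to-fold standard deviation of the best setting. The sketch below copies the figures from the table above; the one-standard-deviation yardstick is an informal rule of thumb of mine, not a formal test.

```r
# Cross-validated results copied from the table above
rf <- data.frame(
  mtry       = 6:10,
  Accuracy   = c(0.9108276, 0.9163499, 0.9174281, 0.9182358, 0.9185057),
  AccuracySD = c(0.008704044, 0.006854520, 0.006441021, 0.006509490, 0.007074204)
)

best <- rf[which.max(rf$Accuracy), ]

# Gap between each setting and the best one, compared with the best setting's SD
rf$gap_to_best   <- best$Accuracy - rf$Accuracy
rf$within_one_sd <- rf$gap_to_best <= best$AccuracySD
print(rf[, c("mtry", "gap_to_best", "within_one_sd")])
```

By this yardstick only mtry = 6 falls clearly behind; settings 7 to 10 are hard to tell apart.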
Again we see Salary and Age are the most important factors:
varImp(RandomForestFit2)
## rf variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## salary 100.0000
## age 53.0401
## credit 14.6956
## elevel3 0.9307
## elevel4 0.9122
## elevel1 0.8875
## elevel2 0.8423
## zipcode6 0.6122
## zipcode4 0.5912
## zipcode2 0.5295
## zipcode1 0.5126
## zipcode7 0.4938
## zipcode3 0.4743
## zipcode8 0.4620
## zipcode5 0.4439
## car15 0.2582
## car7 0.2183
## car17 0.2130
## car19 0.1953
## car12 0.1674
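On the test set the Random Forest reaches an accuracy of 0.9256, slightly below Gradient Boosting (0.9305), even though its cross-validated accuracy was marginally higher (0.9185 vs 0.9155); the two models perform comparably: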
BrandPredsRF2<- predict(RandomForestFit2, newdata=Testing_set_responses_Scaled)
#Confusion matrix
RFConfMat<- confusionMatrix(data=BrandPredsRF2,Testing_set_responses$brand)
RFConfMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 860 108
## 1 76 1430
##
## Accuracy : 0.9256
## 95% CI : (0.9146, 0.9357)
## No Information Rate : 0.6217
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8429
##
## Mcnemar's Test P-Value : 0.02229
##
## Sensitivity : 0.9188
## Specificity : 0.9298
## Pos Pred Value : 0.8884
## Neg Pred Value : 0.9495
## Prevalence : 0.3783
## Detection Rate : 0.3476
## Detection Prevalence : 0.3913
## Balanced Accuracy : 0.9243
##
## 'Positive' Class : 0
##
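The two test-set confusion matrices can be compared directly. The sketch below copies the cell counts from the outputs above; the helper `accuracy_from` is a hypothetical name introduced only for this illustration.

```r
# Accuracy is the share of observations on the diagonal of the confusion matrix
accuracy_from <- function(cm) sum(diag(cm)) / sum(cm)

# Cell counts copied from the two confusion matrices (filled column by column)
gbm_cm <- matrix(c(873, 63, 109, 1429), nrow = 2)
rf_cm  <- matrix(c(860, 76, 108, 1430), nrow = 2)

acc <- c(gbm = accuracy_from(gbm_cm), rf = accuracy_from(rf_cm))
print(round(acc, 4))

# Difference expressed in correctly classified test customers
extra_correct <- sum(diag(gbm_cm)) - sum(diag(rf_cm))
print(extra_correct)  # 12
```

The gap amounts to 12 customers out of 2474.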
Importing the incomplete survey data:
SurveyIncomplete<-read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\R_exercises\\SurveyIncomplete.csv")
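Changing the variable types and standardizing the numeric data (as we did with the original data set). Note that the centering and scaling statistics are re-estimated here on the incomplete survey rather than reused from the training set (ProcValues). I am also removing the ‘brand’ column, which contains incorrect data.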
SurveyIncomplete$elevel<-as.factor(SurveyIncomplete$elevel)
SurveyIncomplete$car<-as.factor(SurveyIncomplete$car)
SurveyIncomplete$zipcode<-as.factor(SurveyIncomplete$zipcode)
#Removing incorrect brand column, centering and scaling numeric variables
IncompleteTesting<-SurveyIncomplete
IncompleteTesting<- select(IncompleteTesting,-c('brand'))
PreprocessedNumerics<- preProcess(IncompleteTesting[,numerics],method=c('center','scale'))
Incomplete_numerics_Scaled<- predict(PreprocessedNumerics,IncompleteTesting[,numerics])
IncompleteTesting[,numerics]<-Incomplete_numerics_Scaled
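Whether new data is scaled with its own statistics or with the training statistics can matter when the two samples differ. The sketch below uses made-up salaries, deliberately skewed to exaggerate the effect; if both surveys come from the same population, the two approaches should give very similar values.

```r
# Made-up salaries (illustrative only)
train_salary <- c(20000, 60000, 100000)    # mean 60000, sd 40000
new_salary   <- c(100000, 120000, 140000)  # new sample skewed towards high earners

# Option 1: reuse the training statistics, so a given salary always maps to the same value
reuse_train <- (new_salary - mean(train_salary)) / sd(train_salary)

# Option 2: re-estimate the statistics on the new sample
re_estimated <- (new_salary - mean(new_salary)) / sd(new_salary)

print(reuse_train)   # 1, 1.5, 2
print(re_estimated)  # -1, 0, 1
```

Under option 2 a salary of 100000 is presented to the model as a below-average value, although the model learned it as an above-average one.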
Our Random Forest model predicts the following numbers of customers preferring Acer (0) or Sony (1):
IncompleteBrandPredsRF2<- predict(RandomForestFit2, newdata=IncompleteTesting)
summary(IncompleteBrandPredsRF2)
## 0 1
## 1902 3098
#Using the original datasets that have not been normalized
Original$elevel<- as.factor(Original$elevel)
Original$car<- as.factor(Original$car)
Original$zipcode<-as.factor(Original$zipcode)
Original$brand<- as.factor(Original$brand)
New<-read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\R_exercises\\SurveyIncomplete.csv")
New$elevel<- as.factor(New$elevel)
New$car<- as.factor(New$car)
New$zipcode<-as.factor(New$zipcode)
New<-select(New,-c('brand'))
#Assign the predicted factor directly so that it stays a factor and matches the type of Original$brand
New$brand<-IncompleteBrandPredsRF2
combined_data<-full_join(Original,New)
## Joining with `by = join_by(salary, age, elevel, car, zipcode, credit, brand)`
The histogram below shows that Sony (1) is more popular, with approximately 62% of customers preferring Sony to Acer.
## Warning in geom_histogram(stat = "count", color = "darkblue", fill =
## "lightblue"): Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
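The overall share can be checked with a little arithmetic on figures already shown in this report: the 9898 complete responses with a mean brand value of 0.6217, and the 1902/3098 predictions for the incomplete survey. The variable names are my own.

```r
# Complete survey: 9898 responses, of which a share of 0.6217 prefer Sony (brand = 1)
complete_n    <- 9898
complete_sony <- round(complete_n * 0.6217)

# Predictions for the incomplete survey
predicted <- c(acer = 1902, sony = 3098)

# Share of Sony across both surveys combined
overall_sony_share <- (complete_sony + predicted[["sony"]]) /
  (complete_n + sum(predicted))
print(round(overall_sony_share, 3))
```

The predicted share in the incomplete survey alone is 3098 / 5000, about 62%, very close to the share observed in the complete survey.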