An Investigation into Customer Brand Preferences

Importing Dataset and Libraries

CompleteResponses<-read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\R_exercises\\CompleteResponses.csv")
Original<-read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\R_exercises\\CompleteResponses.csv")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(gbm)
## Loaded gbm 2.1.8.1
library(ISLR)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Summary of the Dataset:

str(CompleteResponses)
## 'data.frame':    9898 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : int  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : int  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : int  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: int  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : int  0 1 0 1 0 1 1 1 0 1 ...
summary(CompleteResponses)
##      salary            age            elevel           car       
##  Min.   : 20000   Min.   :20.00   Min.   :0.000   Min.   : 1.00  
##  1st Qu.: 52082   1st Qu.:35.00   1st Qu.:1.000   1st Qu.: 6.00  
##  Median : 84950   Median :50.00   Median :2.000   Median :11.00  
##  Mean   : 84871   Mean   :49.78   Mean   :1.983   Mean   :10.52  
##  3rd Qu.:117162   3rd Qu.:65.00   3rd Qu.:3.000   3rd Qu.:15.75  
##  Max.   :150000   Max.   :80.00   Max.   :4.000   Max.   :20.00  
##     zipcode          credit           brand       
##  Min.   :0.000   Min.   :     0   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:120807   1st Qu.:0.0000  
##  Median :4.000   Median :250607   Median :1.0000  
##  Mean   :4.041   Mean   :249176   Mean   :0.6217  
##  3rd Qu.:6.000   3rd Qu.:374640   3rd Qu.:1.0000  
##  Max.   :8.000   Max.   :500000   Max.   :1.0000

Some variable types need to be amended. elevel, car, zipcode and brand are stored as integers but are categorical codes, so we convert them to factors:

CompleteResponses$elevel<- as.factor(CompleteResponses$elevel)
CompleteResponses$car<- as.factor(CompleteResponses$car)
CompleteResponses$zipcode<-as.factor(CompleteResponses$zipcode)
CompleteResponses$brand<- as.factor(CompleteResponses$brand)
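The same conversion can be written in one step, and the target can be given readable labels. The following is a minimal sketch on a toy data frame (the `toy` object and the `brand_label` column are illustrative names, not part of the original analysis); the 0 = Acer, 1 = Sony coding is the one used in this report.

```r
# Toy data standing in for a few survey rows (values are illustrative only)
toy <- data.frame(
  elevel = c(0L, 1L, 3L, 4L),
  brand  = c(0L, 1L, 0L, 1L)
)

# Convert several integer-coded columns to factors in one step
factor_cols <- c("elevel", "brand")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Optionally attach readable labels to the target (0 = Acer, 1 = Sony)
toy$brand_label <- factor(toy$brand, levels = c("0", "1"),
                          labels = c("Acer", "Sony"))

str(toy)
```

Labelled levels tend to make plot legends and confusion matrices easier to read, at the cost of diverging from the 0/1 coding in the raw file.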

EDA

Salary:

The mean salary for Acer (0) is lower than that for Sony (1), and Acer also has a smaller interquartile range (IQR). We don’t see any outliers in our plots.
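The plotting code is not shown in this report, so the following is only a sketch of how such a comparison could be produced. It uses simulated salaries in a hypothetical `toy` data frame, not the survey data, and the outlier check applies the usual 1.5 x IQR boxplot rule via `boxplot.stats()`.

```r
library(ggplot2)

# Simulated stand-in data (not the survey): uniform salaries, random brand
set.seed(1)
toy <- data.frame(
  salary = runif(200, min = 20000, max = 150000),
  brand  = factor(sample(c("Acer", "Sony"), 200, replace = TRUE))
)

# Mean, median and IQR of salary for each brand
salary_stats <- aggregate(
  salary ~ brand, data = toy,
  FUN = function(x) c(mean = mean(x), median = median(x), iqr = IQR(x))
)
print(salary_stats)

# Values that a boxplot would flag as outliers (beyond 1.5 * IQR from the hinges)
outliers <- tapply(toy$salary, toy$brand, function(x) boxplot.stats(x)$out)
print(outliers)

# Side-by-side boxplots of salary by brand
p <- ggplot(toy, aes(x = brand, y = salary)) +
  geom_boxplot() +
  labs(x = "Brand", y = "Salary")
print(p)
```

Note that a boxplot draws the median rather than the mean, so the numeric summary is the more direct check of a statement about means.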

Age

The chart below shows that age is approximately uniformly distributed in our dataset. The mean (49.78) and median (50) are both about 50 years.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
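The `stat_bin()` message above is ggplot2 noting that it fell back to its default of 30 bins. A sketch of setting the bin width explicitly is given below; the data are simulated, and the 5-year width and the `boundary = 20` alignment are assumptions chosen for illustration, not the settings used in this report.

```r
library(ggplot2)

# Simulated ages in the 20-80 range seen in the summary (not the survey data)
set.seed(1)
toy <- data.frame(age = sample(20:80, 500, replace = TRUE))

# An explicit binwidth silences the message and makes the bins interpretable
p <- ggplot(toy, aes(x = age)) +
  geom_histogram(binwidth = 5, boundary = 20,
                 color = "darkblue", fill = "lightblue") +
  labs(x = "Age", y = "Count")
print(p)
```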

The graph below shows distinct pockets of customers who prefer either Acer or Sony.

Education Level

Education level is also approximately uniformly distributed:

## Warning in geom_histogram(stat = "count", color = "darkblue", fill =
## "lightblue"): Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
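The warning above appears when `geom_histogram()` is given `stat = "count"`: the histogram-specific parameters are then ignored. For a categorical variable, `geom_bar()` counts rows per level by default and avoids the warning. A minimal sketch on a hypothetical toy factor:

```r
library(ggplot2)

# Toy education levels (illustrative values only)
toy <- data.frame(elevel = factor(c(0, 1, 1, 2, 3, 3, 3, 4)))

# geom_bar() uses stat = "count" by default, so no binning parameters are involved
p <- ggplot(toy, aes(x = elevel)) +
  geom_bar(color = "darkblue", fill = "lightblue") +
  labs(x = "Education level", y = "Count")
print(p)

# The same counts as a table, for a numeric check of uniformity
counts <- table(toy$elevel)
print(counts)
```

The same pattern would apply to the car and zip code plots, which emit the identical warning.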

Brand of Car

A similar picture here: customers are evenly distributed among car brands.

## Warning in geom_histogram(stat = "count", color = "darkblue", fill =
## "lightblue"): Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

Zip code

Customers are uniformly distributed amongst zip codes:

## Warning in geom_histogram(stat = "count", color = "darkblue", fill =
## "lightblue"): Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

Credit Limit

There are no outliers in the credit limit data:

The chart below shows that the mean credit limit is very similar for customers of both brand preferences:
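A numeric version of this comparison can be obtained with `aggregate()`. The sketch below uses a small hand-made `toy` data frame with invented credit values, so the printed means say nothing about the survey itself.

```r
# Hand-made stand-in data (values are invented for illustration)
toy <- data.frame(
  credit = c(100000, 300000, 200000, 400000, 250000, 150000),
  brand  = factor(c("Acer", "Acer", "Sony", "Sony", "Acer", "Sony"))
)

# Mean credit limit for each brand preference
credit_means <- aggregate(credit ~ brand, data = toy, FUN = mean)
print(credit_means)
```

Applied to the real data, the same call with `data = CompleteResponses` would return one mean per level of `brand`.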

Modelling

Training and Testing data

We start by fixing the random seed, so that the split is reproducible, and checking whether we have any missing data. We then split the data into a 75% training set and a 25% testing set.

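#Set the random seed so the train/test split is reproducible
set.seed(123)

#Check for any missing values (the survey was described as complete).
#anyNA() inspects the cell values, whereas is.null() only tests whether the object itself is NULL
anyNA(CompleteResponses)
## [1] FALSE
#No missing values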

#Split data into train and test sets at 75%/25%

IndexTrain<- createDataPartition(y=CompleteResponses$brand,
                                 p=0.75,
                                 list=FALSE)
Training_set_responses<-CompleteResponses[IndexTrain,]
Testing_set_responses<-CompleteResponses[-IndexTrain,]
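Two quick sanity checks can accompany this step: a per-column count of missing cells, and a comparison of class proportions between the two sets (`createDataPartition()` samples within each level of `y`, so the proportions should be close). The sketch below runs on simulated data in a hypothetical `toy` data frame; the 62% share of class 1 is only chosen to resemble the mean of `brand` in the summary above.

```r
library(caret)

# Simulated stand-in data (not the survey)
set.seed(123)
toy <- data.frame(
  salary = runif(400, min = 20000, max = 150000),
  brand  = factor(sample(c(0, 1), 400, replace = TRUE, prob = c(0.38, 0.62)))
)

# Number of missing cells in each column (all zeros means nothing is missing)
na_counts <- colSums(is.na(toy))
print(na_counts)

# Stratified 75%/25% split on the target
idx <- createDataPartition(y = toy$brand, p = 0.75, list = FALSE)
train_toy <- toy[idx, ]
test_toy  <- toy[-idx, ]

# Class proportions should be nearly identical in both sets
print(prop.table(table(train_toy$brand)))
print(prop.table(table(test_toy$brand)))
```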

Preprocessing

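We have three numeric variables (salary, credit and age), which differ greatly in magnitude. The code below standardizes them by centering each to mean 0 and scaling to standard deviation 1. The centering and scaling values are estimated on the training set only and then applied to both sets, so no information from the testing set leaks into preprocessing: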

#Preprocess the numeric data
numerics<-c('salary','credit','age')
ProcValues<-preProcess(Training_set_responses[,numerics],method=c('center','scale'))
Training_set_responses_numerics_Scaled<- predict(ProcValues,Training_set_responses[,numerics])
Testing_set_responses_numerics_Scaled<- predict(ProcValues,Testing_set_responses[,numerics])

Training_set_responses_Scaled<-Training_set_responses
Training_set_responses_Scaled[,numerics]<-Training_set_responses_numerics_Scaled

Testing_set_responses_Scaled<- Testing_set_responses
Testing_set_responses_Scaled[,numerics]<-Testing_set_responses_numerics_Scaled
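To make the transformation concrete, the sketch below reproduces centering and scaling by hand on two small invented vectors (`train_x` and `test_x` are illustrative names). Each value is transformed as (x - mean) / sd, with the mean and standard deviation taken from the training vector alone.

```r
# Invented example values (not the survey data)
train_x <- c(20000, 50000, 80000, 110000)
test_x  <- c(65000, 140000)

# Parameters are estimated on the training values only
mu    <- mean(train_x)
sigma <- sd(train_x)

# The same parameters are applied to both sets
train_scaled <- (train_x - mu) / sigma
test_scaled  <- (test_x - mu) / sigma

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction; the scaled test values generally do not, which is expected.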

Gradient Boosting

The first classification model we will try is Gradient Boosting, an ensemble classifier built from decision trees.

We are trying to predict the brand of computer the customer would prefer between Acer (0) and Sony (1).

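The code below trains a gradient boosting model on our training data. We use 10-fold cross-validation: the training data is split into 10 folds, and each fold in turn is held out for evaluation while the model is fitted on the other nine. This gives an estimate of out-of-sample performance and helps to detect overfitting.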
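As an illustration of the fold mechanics only, the sketch below assigns 50 hypothetical rows to 10 folds in base R and loops over them. It is a simplification: caret builds its folds internally and balances them on the outcome, which this sketch does not attempt.

```r
set.seed(123)
n <- 50   # number of training rows in this toy example
k <- 10   # number of folds

# Assign each row to one of k folds, in random order and as evenly as possible
fold_id <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- table(fold_id)
print(fold_sizes)

for (i in seq_len(k)) {
  holdout_rows <- which(fold_id == i)   # rows used to score the model
  fit_rows     <- which(fold_id != i)   # rows used to fit the model
  # A model would be fitted on fit_rows and evaluated on holdout_rows here.
  # Every row is held out exactly once across the k iterations.
  stopifnot(length(intersect(holdout_rows, fit_rows)) == 0)
}
```

Averaging the k held-out scores gives the resampled performance estimate that is used to compare tuning parameter settings.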

# Define the control parameters for our model
responses_ctrl<- trainControl(method='repeatedcv', number=10,repeats=1)

#Fit Gradient Boosting model
GBMFit<- train(brand~.,data = Training_set_responses_Scaled,method='gbm', trControl=responses_ctrl)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2996             nan     0.1000    0.0131
##      2        1.2774             nan     0.1000    0.0103
##      3        1.2596             nan     0.1000    0.0092
##      4        1.2442             nan     0.1000    0.0077
##      5        1.2317             nan     0.1000    0.0059
##      6        1.2186             nan     0.1000    0.0062
##      7        1.2077             nan     0.1000    0.0053
##      8        1.1994             nan     0.1000    0.0041
##      9        1.1921             nan     0.1000    0.0030
##     10        1.1834             nan     0.1000    0.0042
##     20        1.1299             nan     0.1000    0.0022
##     40        1.0730             nan     0.1000    0.0005
##     60        1.0463             nan     0.1000    0.0001
##     80        1.0362             nan     0.1000   -0.0001
##    100        1.0311             nan     0.1000    0.0000
##    120        1.0283             nan     0.1000   -0.0001
##    140        1.0266             nan     0.1000   -0.0001
##    150        1.0255             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2895             nan     0.1000    0.0180
##      2        1.2591             nan     0.1000    0.0147
##      3        1.2329             nan     0.1000    0.0131
##      4        1.2115             nan     0.1000    0.0106
##      5        1.1925             nan     0.1000    0.0090
##      6        1.1728             nan     0.1000    0.0096
##      7        1.1561             nan     0.1000    0.0077
##      8        1.1429             nan     0.1000    0.0063
##      9        1.1320             nan     0.1000    0.0051
##     10        1.1199             nan     0.1000    0.0057
##     20        1.0453             nan     0.1000    0.0036
##     40        0.9301             nan     0.1000    0.0009
##     60        0.7461             nan     0.1000    0.0003
##     80        0.6674             nan     0.1000    0.0025
##    100        0.6291             nan     0.1000   -0.0002
##    120        0.5865             nan     0.1000   -0.0001
##    140        0.5514             nan     0.1000    0.0009
##    150        0.5476             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2765             nan     0.1000    0.0244
##      2        1.2301             nan     0.1000    0.0228
##      3        1.1864             nan     0.1000    0.0215
##      4        1.1488             nan     0.1000    0.0182
##      5        1.1199             nan     0.1000    0.0140
##      6        1.0976             nan     0.1000    0.0105
##      7        1.0694             nan     0.1000    0.0139
##      8        1.0534             nan     0.1000    0.0082
##      9        1.0325             nan     0.1000    0.0101
##     10        1.0110             nan     0.1000    0.0102
##     20        0.8637             nan     0.1000    0.0060
##     40        0.7219             nan     0.1000    0.0003
##     60        0.6483             nan     0.1000    0.0001
##     80        0.5788             nan     0.1000    0.0058
##    100        0.5356             nan     0.1000    0.0000
##    120        0.5312             nan     0.1000   -0.0002
##    140        0.5185             nan     0.1000   -0.0001
##    150        0.4947             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3009             nan     0.1000    0.0124
##      2        1.2807             nan     0.1000    0.0101
##      3        1.2623             nan     0.1000    0.0091
##      4        1.2478             nan     0.1000    0.0071
##      5        1.2333             nan     0.1000    0.0072
##      6        1.2210             nan     0.1000    0.0060
##      7        1.2108             nan     0.1000    0.0046
##      8        1.2008             nan     0.1000    0.0047
##      9        1.1940             nan     0.1000    0.0033
##     10        1.1851             nan     0.1000    0.0043
##     20        1.1308             nan     0.1000    0.0018
##     40        1.0721             nan     0.1000    0.0018
##     60        1.0469             nan     0.1000    0.0002
##     80        1.0351             nan     0.1000    0.0006
##    100        1.0302             nan     0.1000   -0.0001
##    120        1.0265             nan     0.1000   -0.0001
##    140        1.0249             nan     0.1000   -0.0002
##    150        1.0245             nan     0.1000   -0.0000
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2883             nan     0.1000    0.0177
##      2        1.2580             nan     0.1000    0.0147
##      3        1.2305             nan     0.1000    0.0133
##      4        1.2090             nan     0.1000    0.0104
##      5        1.1897             nan     0.1000    0.0096
##      6        1.1732             nan     0.1000    0.0082
##      7        1.1569             nan     0.1000    0.0081
##      8        1.1434             nan     0.1000    0.0067
##      9        1.1314             nan     0.1000    0.0060
##     10        1.1210             nan     0.1000    0.0047
##     20        1.0426             nan     0.1000    0.0026
##     40        0.8838             nan     0.1000    0.0005
##     60        0.7284             nan     0.1000    0.0035
##     80        0.6480             nan     0.1000    0.0001
##    100        0.5925             nan     0.1000   -0.0001
##    120        0.5625             nan     0.1000    0.0018
##    140        0.5472             nan     0.1000   -0.0004
##    150        0.5387             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2677             nan     0.1000    0.0292
##      2        1.2213             nan     0.1000    0.0229
##      3        1.1788             nan     0.1000    0.0199
##      4        1.1447             nan     0.1000    0.0170
##      5        1.1141             nan     0.1000    0.0151
##      6        1.0859             nan     0.1000    0.0141
##      7        1.0615             nan     0.1000    0.0116
##      8        1.0452             nan     0.1000    0.0079
##      9        1.0309             nan     0.1000    0.0067
##     10        1.0095             nan     0.1000    0.0105
##     20        0.8803             nan     0.1000    0.0050
##     40        0.7319             nan     0.1000    0.0008
##     60        0.6377             nan     0.1000    0.0050
##     80        0.5612             nan     0.1000    0.0001
##    100        0.5019             nan     0.1000    0.0000
##    120        0.4676             nan     0.1000   -0.0001
##    140        0.4458             nan     0.1000   -0.0000
##    150        0.4399             nan     0.1000    0.0000
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3005             nan     0.1000    0.0128
##      2        1.2795             nan     0.1000    0.0107
##      3        1.2612             nan     0.1000    0.0088
##      4        1.2467             nan     0.1000    0.0074
##      5        1.2336             nan     0.1000    0.0062
##      6        1.2208             nan     0.1000    0.0062
##      7        1.2115             nan     0.1000    0.0044
##      8        1.2017             nan     0.1000    0.0047
##      9        1.1943             nan     0.1000    0.0032
##     10        1.1874             nan     0.1000    0.0036
##     20        1.1322             nan     0.1000    0.0019
##     40        1.0742             nan     0.1000    0.0006
##     60        1.0491             nan     0.1000    0.0007
##     80        1.0383             nan     0.1000   -0.0001
##    100        1.0331             nan     0.1000   -0.0000
##    120        1.0299             nan     0.1000    0.0000
##    140        1.0276             nan     0.1000   -0.0001
##    150        1.0268             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2903             nan     0.1000    0.0173
##      2        1.2578             nan     0.1000    0.0158
##      3        1.2319             nan     0.1000    0.0129
##      4        1.2101             nan     0.1000    0.0104
##      5        1.1910             nan     0.1000    0.0087
##      6        1.1726             nan     0.1000    0.0089
##      7        1.1581             nan     0.1000    0.0069
##      8        1.1446             nan     0.1000    0.0062
##      9        1.1328             nan     0.1000    0.0059
##     10        1.1229             nan     0.1000    0.0047
##     20        1.0471             nan     0.1000    0.0014
##     40        0.8539             nan     0.1000    0.0054
##     60        0.7147             nan     0.1000    0.0007
##     80        0.6493             nan     0.1000    0.0019
##    100        0.6081             nan     0.1000   -0.0001
##    120        0.5618             nan     0.1000    0.0003
##    140        0.5345             nan     0.1000   -0.0000
##    150        0.5218             nan     0.1000    0.0009
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2724             nan     0.1000    0.0269
##      2        1.2259             nan     0.1000    0.0234
##      3        1.1809             nan     0.1000    0.0219
##      4        1.1463             nan     0.1000    0.0170
##      5        1.1218             nan     0.1000    0.0118
##      6        1.0985             nan     0.1000    0.0113
##      7        1.0731             nan     0.1000    0.0123
##      8        1.0498             nan     0.1000    0.0112
##      9        1.0291             nan     0.1000    0.0097
##     10        1.0157             nan     0.1000    0.0062
##     20        0.8795             nan     0.1000    0.0046
##     40        0.7340             nan     0.1000    0.0091
##     60        0.6798             nan     0.1000   -0.0001
##     80        0.6362             nan     0.1000   -0.0002
##    100        0.5752             nan     0.1000   -0.0001
##    120        0.5110             nan     0.1000    0.0001
##    140        0.4730             nan     0.1000   -0.0001
##    150        0.4612             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3005             nan     0.1000    0.0130
##      2        1.2795             nan     0.1000    0.0109
##      3        1.2613             nan     0.1000    0.0092
##      4        1.2457             nan     0.1000    0.0074
##      5        1.2326             nan     0.1000    0.0061
##      6        1.2204             nan     0.1000    0.0060
##      7        1.2105             nan     0.1000    0.0046
##      8        1.2011             nan     0.1000    0.0045
##      9        1.1940             nan     0.1000    0.0033
##     10        1.1861             nan     0.1000    0.0034
##     20        1.1289             nan     0.1000    0.0019
##     40        1.0721             nan     0.1000    0.0015
##     60        1.0474             nan     0.1000    0.0003
##     80        1.0361             nan     0.1000   -0.0003
##    100        1.0320             nan     0.1000   -0.0000
##    120        1.0297             nan     0.1000   -0.0002
##    140        1.0274             nan     0.1000   -0.0001
##    150        1.0271             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2911             nan     0.1000    0.0170
##      2        1.2608             nan     0.1000    0.0146
##      3        1.2359             nan     0.1000    0.0120
##      4        1.2108             nan     0.1000    0.0124
##      5        1.1892             nan     0.1000    0.0102
##      6        1.1725             nan     0.1000    0.0081
##      7        1.1572             nan     0.1000    0.0072
##      8        1.1441             nan     0.1000    0.0065
##      9        1.1322             nan     0.1000    0.0053
##     10        1.1211             nan     0.1000    0.0049
##     20        1.0460             nan     0.1000    0.0039
##     40        0.9083             nan     0.1000    0.0085
##     60        0.7473             nan     0.1000    0.0032
##     80        0.6659             nan     0.1000    0.0035
##    100        0.6059             nan     0.1000    0.0002
##    120        0.5757             nan     0.1000   -0.0001
##    140        0.5585             nan     0.1000   -0.0001
##    150        0.5318             nan     0.1000    0.0022
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2722             nan     0.1000    0.0267
##      2        1.2282             nan     0.1000    0.0216
##      3        1.1819             nan     0.1000    0.0218
##      4        1.1499             nan     0.1000    0.0152
##      5        1.1207             nan     0.1000    0.0152
##      6        1.0907             nan     0.1000    0.0146
##      7        1.0654             nan     0.1000    0.0113
##      8        1.0423             nan     0.1000    0.0109
##      9        1.0275             nan     0.1000    0.0070
##     10        1.0087             nan     0.1000    0.0093
##     20        0.8890             nan     0.1000    0.0025
##     40        0.7520             nan     0.1000    0.0005
##     60        0.6514             nan     0.1000    0.0038
##     80        0.6141             nan     0.1000    0.0016
##    100        0.5752             nan     0.1000    0.0022
##    120        0.5307             nan     0.1000   -0.0001
##    140        0.4971             nan     0.1000   -0.0001
##    150        0.4841             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3004             nan     0.1000    0.0130
##      2        1.2785             nan     0.1000    0.0107
##      3        1.2612             nan     0.1000    0.0088
##      4        1.2452             nan     0.1000    0.0080
##      5        1.2330             nan     0.1000    0.0062
##      6        1.2209             nan     0.1000    0.0062
##      7        1.2113             nan     0.1000    0.0048
##      8        1.2019             nan     0.1000    0.0047
##      9        1.1949             nan     0.1000    0.0034
##     10        1.1865             nan     0.1000    0.0035
##     20        1.1300             nan     0.1000    0.0028
##     40        1.0734             nan     0.1000    0.0008
##     60        1.0460             nan     0.1000    0.0004
##     80        1.0352             nan     0.1000   -0.0001
##    100        1.0303             nan     0.1000   -0.0003
##    120        1.0269             nan     0.1000   -0.0001
##    140        1.0253             nan     0.1000   -0.0001
##    150        1.0248             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2901             nan     0.1000    0.0184
##      2        1.2600             nan     0.1000    0.0149
##      3        1.2312             nan     0.1000    0.0142
##      4        1.2086             nan     0.1000    0.0109
##      5        1.1887             nan     0.1000    0.0097
##      6        1.1702             nan     0.1000    0.0086
##      7        1.1554             nan     0.1000    0.0072
##      8        1.1428             nan     0.1000    0.0061
##      9        1.1314             nan     0.1000    0.0048
##     10        1.1202             nan     0.1000    0.0050
##     20        1.0457             nan     0.1000    0.0031
##     40        0.9205             nan     0.1000    0.0004
##     60        0.7461             nan     0.1000    0.0023
##     80        0.6603             nan     0.1000    0.0039
##    100        0.6099             nan     0.1000    0.0001
##    120        0.5853             nan     0.1000   -0.0001
##    140        0.5392             nan     0.1000    0.0019
##    150        0.5317             nan     0.1000   -0.0000
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2691             nan     0.1000    0.0276
##      2        1.2190             nan     0.1000    0.0240
##      3        1.1799             nan     0.1000    0.0188
##      4        1.1445             nan     0.1000    0.0173
##      5        1.1163             nan     0.1000    0.0136
##      6        1.0874             nan     0.1000    0.0134
##      7        1.0583             nan     0.1000    0.0137
##      8        1.0350             nan     0.1000    0.0114
##      9        1.0200             nan     0.1000    0.0072
##     10        1.0070             nan     0.1000    0.0060
##     20        0.8631             nan     0.1000    0.0067
##     40        0.7266             nan     0.1000    0.0015
##     60        0.6228             nan     0.1000   -0.0000
##     80        0.5566             nan     0.1000    0.0027
##    100        0.4904             nan     0.1000    0.0015
##    120        0.4655             nan     0.1000   -0.0002
##    140        0.4413             nan     0.1000    0.0000
##    150        0.4339             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3008             nan     0.1000    0.0125
##      2        1.2789             nan     0.1000    0.0106
##      3        1.2615             nan     0.1000    0.0085
##      4        1.2444             nan     0.1000    0.0078
##      5        1.2315             nan     0.1000    0.0063
##      6        1.2197             nan     0.1000    0.0061
##      7        1.2094             nan     0.1000    0.0050
##      8        1.2006             nan     0.1000    0.0039
##      9        1.1919             nan     0.1000    0.0042
##     10        1.1849             nan     0.1000    0.0035
##     20        1.1324             nan     0.1000    0.0014
##     40        1.0747             nan     0.1000    0.0006
##     60        1.0498             nan     0.1000    0.0000
##     80        1.0388             nan     0.1000   -0.0001
##    100        1.0344             nan     0.1000    0.0000
##    120        1.0319             nan     0.1000   -0.0002
##    140        1.0289             nan     0.1000    0.0003
##    150        1.0282             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2899             nan     0.1000    0.0174
##      2        1.2566             nan     0.1000    0.0153
##      3        1.2311             nan     0.1000    0.0127
##      4        1.2089             nan     0.1000    0.0110
##      5        1.1901             nan     0.1000    0.0091
##      6        1.1737             nan     0.1000    0.0078
##      7        1.1579             nan     0.1000    0.0077
##      8        1.1454             nan     0.1000    0.0063
##      9        1.1339             nan     0.1000    0.0053
##     10        1.1212             nan     0.1000    0.0060
##     20        1.0517             nan     0.1000    0.0035
##     40        0.9444             nan     0.1000    0.0016
##     60        0.7670             nan     0.1000    0.0034
##     80        0.6751             nan     0.1000    0.0022
##    100        0.6154             nan     0.1000    0.0020
##    120        0.5853             nan     0.1000    0.0001
##    140        0.5413             nan     0.1000    0.0000
##    150        0.5217             nan     0.1000    0.0018
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2723             nan     0.1000    0.0267
##      2        1.2267             nan     0.1000    0.0220
##      3        1.1830             nan     0.1000    0.0215
##      4        1.1498             nan     0.1000    0.0167
##      5        1.1203             nan     0.1000    0.0141
##      6        1.0918             nan     0.1000    0.0136
##      7        1.0680             nan     0.1000    0.0115
##      8        1.0515             nan     0.1000    0.0080
##      9        1.0277             nan     0.1000    0.0110
##     10        1.0064             nan     0.1000    0.0103
##     20        0.8841             nan     0.1000    0.0052
##     40        0.7674             nan     0.1000    0.0005
##     60        0.6756             nan     0.1000    0.0054
##     80        0.5862             nan     0.1000   -0.0001
##    100        0.5033             nan     0.1000   -0.0003
##    120        0.4801             nan     0.1000    0.0031
##    140        0.4365             nan     0.1000    0.0008
##    150        0.4344             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3016             nan     0.1000    0.0126
##      2        1.2806             nan     0.1000    0.0106
##      3        1.2635             nan     0.1000    0.0086
##      4        1.2481             nan     0.1000    0.0077
##      5        1.2351             nan     0.1000    0.0060
##      6        1.2226             nan     0.1000    0.0062
##      7        1.2121             nan     0.1000    0.0052
##      8        1.2036             nan     0.1000    0.0039
##      9        1.1966             nan     0.1000    0.0033
##     10        1.1907             nan     0.1000    0.0027
##     20        1.1324             nan     0.1000    0.0032
##     40        1.0750             nan     0.1000    0.0007
##     60        1.0479             nan     0.1000    0.0002
##     80        1.0370             nan     0.1000    0.0006
##    100        1.0323             nan     0.1000   -0.0001
##    120        1.0307             nan     0.1000   -0.0004
##    140        1.0288             nan     0.1000   -0.0001
##    150        1.0284             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2901             nan     0.1000    0.0179
##      2        1.2610             nan     0.1000    0.0140
##      3        1.2331             nan     0.1000    0.0131
##      4        1.2086             nan     0.1000    0.0117
##      5        1.1902             nan     0.1000    0.0092
##      6        1.1744             nan     0.1000    0.0080
##      7        1.1605             nan     0.1000    0.0069
##      8        1.1474             nan     0.1000    0.0062
##      9        1.1337             nan     0.1000    0.0069
##     10        1.1214             nan     0.1000    0.0054
##     20        1.0493             nan     0.1000    0.0020
##     40        0.9394             nan     0.1000    0.0022
##     60        0.7328             nan     0.1000    0.0010
##     80        0.6662             nan     0.1000    0.0038
##    100        0.6382             nan     0.1000   -0.0002
##    120        0.5887             nan     0.1000    0.0001
##    140        0.5446             nan     0.1000    0.0004
##    150        0.5286             nan     0.1000    0.0020
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2706             nan     0.1000    0.0283
##      2        1.2224             nan     0.1000    0.0239
##      3        1.1844             nan     0.1000    0.0186
##      4        1.1524             nan     0.1000    0.0154
##      5        1.1232             nan     0.1000    0.0145
##      6        1.0956             nan     0.1000    0.0137
##      7        1.0770             nan     0.1000    0.0089
##      8        1.0512             nan     0.1000    0.0123
##      9        1.0366             nan     0.1000    0.0071
##     10        1.0158             nan     0.1000    0.0098
##     20        0.8883             nan     0.1000    0.0076
##     40        0.7404             nan     0.1000    0.0006
##     60        0.6551             nan     0.1000    0.0062
##     80        0.5677             nan     0.1000    0.0032
##    100        0.5268             nan     0.1000   -0.0001
##    120        0.4949             nan     0.1000   -0.0003
##    140        0.4793             nan     0.1000    0.0015
##    150        0.4680             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3011             nan     0.1000    0.0132
##      2        1.2780             nan     0.1000    0.0112
##      3        1.2597             nan     0.1000    0.0082
##      4        1.2432             nan     0.1000    0.0082
##      5        1.2288             nan     0.1000    0.0068
##      6        1.2165             nan     0.1000    0.0057
##      7        1.2053             nan     0.1000    0.0051
##      8        1.1965             nan     0.1000    0.0041
##      9        1.1896             nan     0.1000    0.0033
##     10        1.1808             nan     0.1000    0.0043
##     20        1.1259             nan     0.1000    0.0016
##     40        1.0680             nan     0.1000    0.0006
##     60        1.0433             nan     0.1000    0.0002
##     80        1.0332             nan     0.1000   -0.0001
##    100        1.0272             nan     0.1000    0.0000
##    120        1.0239             nan     0.1000   -0.0001
##    140        1.0226             nan     0.1000   -0.0002
##    150        1.0223             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2908             nan     0.1000    0.0183
##      2        1.2581             nan     0.1000    0.0157
##      3        1.2321             nan     0.1000    0.0127
##      4        1.2072             nan     0.1000    0.0114
##      5        1.1874             nan     0.1000    0.0098
##      6        1.1697             nan     0.1000    0.0085
##      7        1.1541             nan     0.1000    0.0072
##      8        1.1390             nan     0.1000    0.0063
##      9        1.1274             nan     0.1000    0.0057
##     10        1.1177             nan     0.1000    0.0043
##     20        1.0401             nan     0.1000    0.0017
##     40        0.8877             nan     0.1000    0.0071
##     60        0.7428             nan     0.1000    0.0006
##     80        0.6440             nan     0.1000    0.0030
##    100        0.5966             nan     0.1000    0.0006
##    120        0.5434             nan     0.1000    0.0001
##    140        0.5354             nan     0.1000   -0.0001
##    150        0.5168             nan     0.1000    0.0020
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2690             nan     0.1000    0.0286
##      2        1.2230             nan     0.1000    0.0233
##      3        1.1792             nan     0.1000    0.0206
##      4        1.1446             nan     0.1000    0.0166
##      5        1.1140             nan     0.1000    0.0148
##      6        1.0883             nan     0.1000    0.0126
##      7        1.0696             nan     0.1000    0.0089
##      8        1.0493             nan     0.1000    0.0101
##      9        1.0266             nan     0.1000    0.0109
##     10        1.0145             nan     0.1000    0.0060
##     20        0.8721             nan     0.1000    0.0075
##     40        0.7410             nan     0.1000    0.0037
##     60        0.6229             nan     0.1000    0.0003
##     80        0.5506             nan     0.1000   -0.0001
##    100        0.4821             nan     0.1000    0.0002
##    120        0.4401             nan     0.1000    0.0001
##    140        0.4138             nan     0.1000    0.0016
##    150        0.4072             nan     0.1000   -0.0000
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3003             nan     0.1000    0.0128
##      2        1.2789             nan     0.1000    0.0107
##      3        1.2620             nan     0.1000    0.0084
##      4        1.2458             nan     0.1000    0.0079
##      5        1.2333             nan     0.1000    0.0061
##      6        1.2220             nan     0.1000    0.0051
##      7        1.2147             nan     0.1000    0.0032
##      8        1.2031             nan     0.1000    0.0057
##      9        1.1947             nan     0.1000    0.0040
##     10        1.1881             nan     0.1000    0.0031
##     20        1.1264             nan     0.1000    0.0019
##     40        1.0677             nan     0.1000    0.0006
##     60        1.0409             nan     0.1000    0.0001
##     80        1.0306             nan     0.1000    0.0001
##    100        1.0255             nan     0.1000   -0.0000
##    120        1.0238             nan     0.1000   -0.0001
##    140        1.0217             nan     0.1000   -0.0000
##    150        1.0211             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2893             nan     0.1000    0.0180
##      2        1.2595             nan     0.1000    0.0148
##      3        1.2334             nan     0.1000    0.0123
##      4        1.2089             nan     0.1000    0.0124
##      5        1.1867             nan     0.1000    0.0107
##      6        1.1676             nan     0.1000    0.0086
##      7        1.1524             nan     0.1000    0.0073
##      8        1.1387             nan     0.1000    0.0062
##      9        1.1265             nan     0.1000    0.0060
##     10        1.1139             nan     0.1000    0.0060
##     20        1.0406             nan     0.1000    0.0035
##     40        0.8553             nan     0.1000    0.0104
##     60        0.7284             nan     0.1000    0.0005
##     80        0.6560             nan     0.1000    0.0023
##    100        0.5869             nan     0.1000   -0.0001
##    120        0.5549             nan     0.1000    0.0025
##    140        0.5267             nan     0.1000   -0.0000
##    150        0.5221             nan     0.1000    0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2748             nan     0.1000    0.0252
##      2        1.2275             nan     0.1000    0.0235
##      3        1.1819             nan     0.1000    0.0223
##      4        1.1456             nan     0.1000    0.0176
##      5        1.1176             nan     0.1000    0.0133
##      6        1.0972             nan     0.1000    0.0098
##      7        1.0731             nan     0.1000    0.0120
##      8        1.0483             nan     0.1000    0.0117
##      9        1.0295             nan     0.1000    0.0093
##     10        1.0090             nan     0.1000    0.0089
##     20        0.8705             nan     0.1000    0.0057
##     40        0.7392             nan     0.1000   -0.0003
##     60        0.6217             nan     0.1000   -0.0001
##     80        0.5668             nan     0.1000    0.0030
##    100        0.5051             nan     0.1000    0.0017
##    120        0.4826             nan     0.1000   -0.0000
##    140        0.4515             nan     0.1000   -0.0001
##    150        0.4374             nan     0.1000    0.0014
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.3014             nan     0.1000    0.0130
##      2        1.2795             nan     0.1000    0.0104
##      3        1.2607             nan     0.1000    0.0093
##      4        1.2443             nan     0.1000    0.0078
##      5        1.2315             nan     0.1000    0.0060
##      6        1.2189             nan     0.1000    0.0063
##      7        1.2085             nan     0.1000    0.0047
##      8        1.2002             nan     0.1000    0.0039
##      9        1.1931             nan     0.1000    0.0034
##     10        1.1868             nan     0.1000    0.0026
##     20        1.1276             nan     0.1000    0.0022
##     40        1.0698             nan     0.1000    0.0006
##     60        1.0435             nan     0.1000    0.0008
##     80        1.0323             nan     0.1000    0.0001
##    100        1.0287             nan     0.1000   -0.0001
##    120        1.0244             nan     0.1000   -0.0001
##    140        1.0227             nan     0.1000   -0.0001
##    150        1.0222             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2883             nan     0.1000    0.0184
##      2        1.2577             nan     0.1000    0.0153
##      3        1.2318             nan     0.1000    0.0128
##      4        1.2102             nan     0.1000    0.0106
##      5        1.1884             nan     0.1000    0.0109
##      6        1.1696             nan     0.1000    0.0088
##      7        1.1546             nan     0.1000    0.0073
##      8        1.1410             nan     0.1000    0.0062
##      9        1.1287             nan     0.1000    0.0061
##     10        1.1183             nan     0.1000    0.0047
##     20        1.0391             nan     0.1000    0.0030
##     40        0.8652             nan     0.1000    0.0106
##     60        0.7396             nan     0.1000    0.0032
##     80        0.6490             nan     0.1000    0.0021
##    100        0.5948             nan     0.1000    0.0024
##    120        0.5652             nan     0.1000   -0.0000
##    140        0.5431             nan     0.1000   -0.0001
##    150        0.5328             nan     0.1000    0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2720             nan     0.1000    0.0272
##      2        1.2258             nan     0.1000    0.0226
##      3        1.1858             nan     0.1000    0.0199
##      4        1.1485             nan     0.1000    0.0180
##      5        1.1143             nan     0.1000    0.0163
##      6        1.0911             nan     0.1000    0.0114
##      7        1.0648             nan     0.1000    0.0128
##      8        1.0484             nan     0.1000    0.0079
##      9        1.0267             nan     0.1000    0.0101
##     10        1.0081             nan     0.1000    0.0087
##     20        0.8910             nan     0.1000    0.0106
##     40        0.7252             nan     0.1000    0.0044
##     60        0.6266             nan     0.1000    0.0000
##     80        0.6022             nan     0.1000   -0.0002
##    100        0.5939             nan     0.1000   -0.0002
##    120        0.5686             nan     0.1000    0.0019
##    140        0.5209             nan     0.1000    0.0033
##    150        0.5020             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2721             nan     0.1000    0.0268
##      2        1.2212             nan     0.1000    0.0236
##      3        1.1832             nan     0.1000    0.0188
##      4        1.1485             nan     0.1000    0.0172
##      5        1.1182             nan     0.1000    0.0144
##      6        1.0988             nan     0.1000    0.0099
##      7        1.0742             nan     0.1000    0.0116
##      8        1.0528             nan     0.1000    0.0100
##      9        1.0292             nan     0.1000    0.0116
##     10        1.0169             nan     0.1000    0.0062
##     20        0.8651             nan     0.1000    0.0031
##     40        0.7346             nan     0.1000    0.0066
##     60        0.6549             nan     0.1000   -0.0001
##     80        0.5776             nan     0.1000    0.0001
##    100        0.5191             nan     0.1000   -0.0001
##    120        0.4768             nan     0.1000    0.0018
##    140        0.4480             nan     0.1000    0.0024
##    150        0.4293             nan     0.1000    0.0004
GBMFitAccuracy<-GBMFit$results

Results

The best cross-validated accuracy (about 91.6%) comes from 150 trees with an interaction depth of 3:

#Check results
GBMFit
## Stochastic Gradient Boosting 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6682, 6682, 6682, 6681, 6682, 6682, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7264303  0.4227616
##   1                  100      0.7281809  0.4250240
##   1                  150      0.7279119  0.4245218
##   2                   50      0.8306819  0.6431492
##   2                  100      0.8809244  0.7514811
##   2                  150      0.9067861  0.8038497
##   3                   50      0.8728423  0.7374849
##   3                  100      0.9028839  0.7967197
##   3                  150      0.9155438  0.8215371
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
GBMFitAccuracy
##   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy     Kappa
## 1       0.1                 1             10      50 0.7264303 0.4227616
## 4       0.1                 2             10      50 0.8306819 0.6431492
## 7       0.1                 3             10      50 0.8728423 0.7374849
## 2       0.1                 1             10     100 0.7281809 0.4250240
## 5       0.1                 2             10     100 0.8809244 0.7514811
## 8       0.1                 3             10     100 0.9028839 0.7967197
## 3       0.1                 1             10     150 0.7279119 0.4245218
## 6       0.1                 2             10     150 0.9067861 0.8038497
## 9       0.1                 3             10     150 0.9155438 0.8215371
##    AccuracySD    KappaSD
## 1 0.015308770 0.03159188
## 4 0.022678649 0.04803775
## 7 0.013587121 0.02617844
## 2 0.014526039 0.02925238
## 5 0.010622789 0.02124222
## 8 0.018291331 0.03665301
## 3 0.014971874 0.03060939
## 6 0.014445881 0.02956136
## 9 0.008622595 0.01811902
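
As a small illustration of the selection rule reported above (largest cross-validated accuracy), the sketch below re-enters three rows of the results table by hand and picks the best one in base R. The data frame name `gbm_results_subset` is made up for this example; in the analysis itself, caret stores the chosen parameters in `GBMFit$bestTune`.

```r
# Three rows copied by hand from the results table above (all with n.trees = 150)
gbm_results_subset <- data.frame(
  interaction.depth = c(1, 2, 3),
  n.trees = c(150, 150, 150),
  Accuracy = c(0.7279119, 0.9067861, 0.9155438),
  AccuracySD = c(0.014971874, 0.014445881, 0.008622595)
)

# caret's default rule: keep the row with the largest accuracy
best_row <- gbm_results_subset[which.max(gbm_results_subset$Accuracy), ]

# Gap between the best and the second-best setting, in units of the best row's SD
second_best_accuracy <- sort(gbm_results_subset$Accuracy, decreasing = TRUE)[2]
gap_in_sd <- (best_row$Accuracy - second_best_accuracy) / best_row$AccuracySD

print(best_row)
print(round(gap_in_sd, 2))
```

Expressing the gap in standard deviations gives a rough sense of how clearly the winning setting beats the runner-up across the ten folds.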

We also see that Salary and Age are the most important factors in predicting customer brand preferences:

varImp(GBMFit)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##            Overall
## salary   100.00000
## age       74.33530
## credit     1.38598
## zipcode2   0.13944
## car15      0.09672
## zipcode3   0.07518
## car12      0.06845
## zipcode4   0.06718
## car14      0.06286
## car4       0.05841
## elevel1    0.05572
## zipcode5   0.05502
## elevel4    0.05175
## zipcode6   0.05028
## car10      0.04734
## car2       0.04384
## car3       0.04338
## zipcode8   0.04119
## zipcode1   0.03938
## car8       0.03874

Predictions and Confusion Matrix

We will now use the trained model to predict which computer brand each customer in the untouched testing data prefers. The model predicts whether a customer prefers Acer (0) or Sony (1), and the confusion matrix below compares those predictions against the customers' real choices.

#predictions
BrandPredsGBM<- predict(GBMFit, newdata=Testing_set_responses_Scaled)

#confusion matrix
GBMConfMat<- confusionMatrix(data=BrandPredsGBM,Testing_set_responses$brand)
GBMConfMat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  873  109
##          1   63 1429
##                                           
##                Accuracy : 0.9305          
##                  95% CI : (0.9197, 0.9402)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8536          
##                                           
##  Mcnemar's Test P-Value : 0.0006009       
##                                           
##             Sensitivity : 0.9327          
##             Specificity : 0.9291          
##          Pos Pred Value : 0.8890          
##          Neg Pred Value : 0.9578          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3529          
##    Detection Prevalence : 0.3969          
##       Balanced Accuracy : 0.9309          
##                                           
##        'Positive' Class : 0               
## 
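
To show where the headline statistics come from, the sketch below recomputes accuracy, sensitivity, specificity and Cohen's kappa from the four counts in the confusion matrix above, using base R only. The object names are illustrative and the counts are typed in by hand; class 0 is treated as the positive class, as in the caret output.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
conf_counts <- matrix(
  c(873, 63, 109, 1429),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

n_total <- sum(conf_counts)

# Share of all test customers classified correctly
accuracy <- sum(diag(conf_counts)) / n_total

# Positive class is "0": share of true 0s predicted as 0
sensitivity <- conf_counts["0", "0"] / sum(conf_counts[, "0"])

# Share of true 1s predicted as 1
specificity <- conf_counts["1", "1"] / sum(conf_counts[, "1"])

# Cohen's kappa: observed agreement corrected for chance agreement
expected_agreement <- sum(rowSums(conf_counts) * colSums(conf_counts)) / n_total^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```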

Random Forest

We will now try a Random Forest classification model. Random Forest is another tree-based classifier: it grows many decision trees, each on a bootstrap sample of the training data, and combines their votes into a final prediction.

We will tune the Random Forest manually, testing mtry values from 6 to 10 to see which gives the highest accuracy (I tested 1 to 5 separately and they always scored lower). Here, mtry is the number of predictors randomly sampled as candidates at each split of a tree.

#Create grids for manual tuning
RandomForestGrid2<-expand.grid(mtry=c(6,7,8,9,10))

#Fit the model:

RandomForestFit2<-train(brand~.,data = Training_set_responses_Scaled,method='rf', trControl=responses_ctrl,
                       tuneGrid=RandomForestGrid2)
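
To make the role of mtry more concrete, here is a minimal base R sketch of the random sampling that happens at each split. It is not the randomForest implementation. The predictor names are placeholders, and the count of 34 is an assumption taken from the "out of 34" line in the variable importance output, which reflects the dummy-encoded columns created by the formula interface.

```r
set.seed(123)

# Assumption: 34 dummy-encoded predictor columns, with placeholder names
n_predictors <- 34
predictor_names <- paste0("x", seq_len(n_predictors))

# With mtry = 10, each split only considers 10 randomly chosen predictors
mtry <- 10
candidates_split_1 <- sample(predictor_names, mtry)
candidates_split_2 <- sample(predictor_names, mtry)

# randomForest's default for classification is floor(sqrt(p))
default_mtry <- floor(sqrt(n_predictors))

print(candidates_split_1)
print(candidates_split_2)
print(default_mtry)
```

Larger mtry values make it more likely that the strong predictors are among the candidates at every split, at the cost of making the individual trees more alike.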

Results

Accuracy improves as mtry increases, although the gains beyond mtry = 7 are smaller than one standard deviation of the cross-validated accuracy.

mtry = 10 gives the highest cross-validated accuracy, about 91.9%:

#Check results
RandomForestFit2
## Random Forest 
## 
## 7424 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6682, 6682, 6682, 6681, 6682, 6682, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    6    0.9108276  0.8112342
##    7    0.9163499  0.8228310
##    8    0.9174281  0.8249692
##    9    0.9182358  0.8264965
##   10    0.9185057  0.8270487
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
#More in-depth results:
RandomForest2Results<- RandomForestFit2$results

RandomForest2Results
##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    6 0.9108276 0.8112342 0.008704044 0.01828027
## 2    7 0.9163499 0.8228310 0.006854520 0.01444745
## 3    8 0.9174281 0.8249692 0.006441021 0.01358864
## 4    9 0.9182358 0.8264965 0.006509490 0.01364185
## 5   10 0.9185057 0.8270487 0.007074204 0.01515796

Again we see Salary and Age are the most important factors:

varImp(RandomForestFit2)
## rf variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##           Overall
## salary   100.0000
## age       53.0401
## credit    14.6956
## elevel3    0.9307
## elevel4    0.9122
## elevel1    0.8875
## elevel2    0.8423
## zipcode6   0.6122
## zipcode4   0.5912
## zipcode2   0.5295
## zipcode1   0.5126
## zipcode7   0.4938
## zipcode3   0.4743
## zipcode8   0.4620
## zipcode5   0.4439
## car15      0.2582
## car7       0.2183
## car17      0.2130
## car19      0.1953
## car12      0.1674

Predictions and Confusion Matrix

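On the test set, the Random Forest reaches an accuracy of 92.6%, slightly below the 93.1% of the Gradient Boosting model, even though its cross-validated accuracy was marginally higher (91.9% vs 91.6%). The two 95% confidence intervals overlap, so the models perform comparably: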

BrandPredsRF2<- predict(RandomForestFit2, newdata=Testing_set_responses_Scaled)

#Confusion matrix
RFConfMat<- confusionMatrix(data=BrandPredsRF2,Testing_set_responses$brand)
RFConfMat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  860  108
##          1   76 1430
##                                           
##                Accuracy : 0.9256          
##                  95% CI : (0.9146, 0.9357)
##     No Information Rate : 0.6217          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8429          
##                                           
##  Mcnemar's Test P-Value : 0.02229         
##                                           
##             Sensitivity : 0.9188          
##             Specificity : 0.9298          
##          Pos Pred Value : 0.8884          
##          Neg Pred Value : 0.9495          
##              Prevalence : 0.3783          
##          Detection Rate : 0.3476          
##    Detection Prevalence : 0.3913          
##       Balanced Accuracy : 0.9243          
##                                           
##        'Positive' Class : 0               
## 
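
The two test-set results can be put side by side with a short base R sketch. The counts are typed in from the two confusion matrices. The intervals come from `binom.test`, which gives an exact binomial interval; treating that as the same kind of interval caret reports is an assumption to check against the printed output.

```r
# Correct predictions and test-set size, copied from the two confusion matrices
n_test <- 2474
gbm_correct <- 873 + 1429
rf_correct <- 860 + 1430

gbm_accuracy <- gbm_correct / n_test
rf_accuracy <- rf_correct / n_test

# Exact binomial 95% confidence intervals for each accuracy
gbm_ci <- binom.test(gbm_correct, n_test)$conf.int
rf_ci <- binom.test(rf_correct, n_test)$conf.int

# Two intervals overlap when each lower bound sits below the other's upper bound
intervals_overlap <- gbm_ci[1] <= rf_ci[2] && rf_ci[1] <= gbm_ci[2]

print(round(c(gbm = gbm_accuracy, rf = rf_accuracy), 4))
print(intervals_overlap)
```

Overlapping intervals suggest that the difference between the two models on this test set is too small to separate them with confidence.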

Predicting Brand preferences on the incomplete survey data

Importing and changing variable types

Importing the incomplete survey data:

SurveyIncomplete<-read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\R_exercises\\SurveyIncomplete.csv")

Changing the variable types and normalizing the numeric data (as we did with the original data set). I am also removing the ‘brand’ column, which contains incorrect data in this file.

SurveyIncomplete$elevel<-as.factor(SurveyIncomplete$elevel)
SurveyIncomplete$car<-as.factor(SurveyIncomplete$car)
SurveyIncomplete$zipcode<-as.factor(SurveyIncomplete$zipcode)

#Removing incorrect brand column, centering and scaling numeric variables

IncompleteTesting<-SurveyIncomplete
IncompleteTesting<- select(IncompleteTesting,-c('brand'))

PreprocessedNumerics<- preProcess(IncompleteTesting[,numerics],method=c('center','scale'))
Incomplete_numerics_Scaled<- predict(PreprocessedNumerics,IncompleteTesting[,numerics])
IncompleteTesting[,numerics]<-Incomplete_numerics_Scaled
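
One point that may be worth checking: the code above estimates the centering and scaling parameters from the incomplete survey itself. A common alternative is to reuse the parameters estimated on the training data, so that new observations land on the same scale the model was trained on. The base R sketch below uses made-up salary values to show that the two approaches give different numbers. In caret, the equivalent would be calling `predict()` with the `preProcess` object that was fitted on the training set, whose name is not shown in this section.

```r
# Made-up values for illustration only
train_salary <- c(20000, 50000, 80000, 110000, 150000)
new_salary <- c(30000, 90000, 140000)

# Parameters estimated once, on the training data
train_mean <- mean(train_salary)
train_sd <- sd(train_salary)

# Option 1: scale new data with the training parameters
new_scaled_with_train <- (new_salary - train_mean) / train_sd

# Option 2: scale new data with its own mean and sd (what refitting does)
new_scaled_with_self <- as.numeric(scale(new_salary))

print(round(new_scaled_with_train, 3))
print(round(new_scaled_with_self, 3))
```

When the two data sets have similar distributions the difference is small, which is probably the case here, but reusing the training parameters removes the dependence on that assumption.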

Predicting

Our Random Forest model predicts the following numbers of customers preferring Acer (0) and Sony (1):

IncompleteBrandPredsRF2<- predict(RandomForestFit2, newdata=IncompleteTesting)
summary(IncompleteBrandPredsRF2)
##    0    1 
## 1902 3098
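
The predicted counts can be turned into shares with `prop.table`. The sketch below types the two counts in by hand; the vector name is illustrative.

```r
# Predicted counts copied from the summary above (0 = Acer, 1 = Sony)
predicted_counts <- c("0" = 1902, "1" = 3098)

# Share of each brand among the 5000 predicted responses
predicted_shares <- prop.table(predicted_counts)

print(round(predicted_shares, 4))
```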

Conclusion

Creating the complete survey Dataset

#Using the original datasets that have not been normalized
Original$elevel<- as.factor(Original$elevel)
Original$car<- as.factor(Original$car)
Original$zipcode<-as.factor(Original$zipcode)
Original$brand<- as.factor(Original$brand)

New<-read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\R_exercises\\SurveyIncomplete.csv")
New$elevel<- as.factor(New$elevel)
New$car<- as.factor(New$car)
New$zipcode<-as.factor(New$zipcode)
New<-select(New,-c('brand'))
#Assign the predicted factor directly so the brand labels (0/1) are preserved
New$brand<-IncompleteBrandPredsRF2

combined_data<-full_join(Original,New)
## Joining with `by = join_by(salary, age, elevel, car, zipcode, credit, brand)`
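
Because every column is used as a join key, the `full_join` above behaves essentially like stacking the two tables, except that a row appearing identically in both would be matched rather than duplicated. A minimal base R sketch of plain row stacking, with toy data and made-up object names, looks like this:

```r
# Toy stand-ins for the complete responses and the predicted responses
original_toy <- data.frame(
  salary = c(50000, 90000),
  brand = factor(c("0", "1"), levels = c("0", "1"))
)
predicted_toy <- data.frame(
  salary = c(70000, 120000, 30000),
  brand = factor(c("1", "1", "0"), levels = c("0", "1"))
)

# Stack the rows; column names and factor levels match, so brand stays a factor
combined_toy <- rbind(original_toy, predicted_toy)

# Share of each brand in the stacked data
brand_share <- prop.table(table(combined_toy$brand))

print(combined_toy)
print(brand_share)
```

In dplyr, `bind_rows()` plays the same role and states the intent more directly than a join.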

Final Data

The histogram below shows that Sony (1) is the more popular brand: roughly 60% of customers prefer it to Acer (0).

## Warning in geom_histogram(stat = "count", color = "darkblue", fill =
## "lightblue"): Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
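
The plotting code is not shown in this report, but the warning suggests it calls `geom_histogram(stat = "count", ...)`. For a categorical variable such as brand, `geom_bar()` counts rows per category by default and should avoid the warning. The sketch below uses a small toy data frame in place of `combined_data`, and the axis labels are assumptions.

```r
library(ggplot2)

# Toy stand-in for the combined data; the real plot would use combined_data
toy_brands <- data.frame(brand = factor(c("0", "1", "1", "0", "1")))

# geom_bar() counts the rows in each category, so no stat argument is needed
brand_plot <- ggplot(toy_brands, aes(x = brand)) +
  geom_bar(color = "darkblue", fill = "lightblue") +
  labs(x = "Brand (0 = Acer, 1 = Sony)", y = "Number of customers")

print(brand_plot)
```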