This dataset was exported from the Paamonim association. Paamonim is a nationwide volunteer organization dedicated to helping families and individuals conduct their household finances with balance, responsibility and integrity, thereby strengthening Israel’s economic and social fabric. With the help of its 2,800 volunteers, Paamonim provides a wide range of solutions to thousands of Israeli families and individuals each year through its targeted economic recovery programs, and offers information and tools to the general public interested in learning the best methods for managing their household finances.
Paamonim has a call center whose role is to call past donors and ask if they are willing to donate again.
Today the call center builds a campaign by segmenting the population with rules such as “let’s call those who didn’t donate in the last year”.
The purpose of this ML project is to make every call count, by predicting who will donate if the call center initiates the call.
The data we received covers 5 years of donations; each record represents one donation. A donation can be made via any payment method, such as wire transfer, credit card, cheque or cash.
This is a time-series dataset and time has meaning, but in the end this is a classification question.
After preprocessing the file in a third-party application, we have a dataset ready for analysis.
Before we start the analysis, it’s important to understand the basic challenges of the model we are looking for. The first step is to define precisely at what point in time we want the model to operate, what data we have at that exact moment, and whether there is data that could distort our results if we don’t handle it up front.
The question: can we predict whether an initiated call will be successful when it is made during the largest time window with no special holiday campaign? That timeframe is right after Passover and before Rosh Hashanah. The Passover campaign starts right after Purim, Purim is only two months after Hanukkah, and Hanukkah is about 2.5 months after Rosh Hashanah.
So the most important period in which we want an effective model for outgoing calls is right after Passover.
As of now, the call center gets about 3% positive responses (we can see in the data below that positives are roughly 3% of the population).
In that case we need to keep in mind that the accuracy of the models matters far less than the recall, i.e. how many of the donors who actually responded positively the model manages to flag as positive.
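As a minimal sketch of the two quantities we will track throughout (assuming the dataset, called mydata below as in the rest of this report, is already loaded):
prop.table(table(mydata$TARAM_MOKED))  # base rate of positive responses, about 3%
# from a 2x2 confusion matrix cm (rows = actual, columns = predicted):
accuracy <- function(cm) sum(diag(cm)) / sum(cm)
recall   <- function(cm) cm[2, 2] / sum(cm[2, ])  # TP / (TP + FN)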
Paamonim launches a special campaign on every Jewish holiday, and during those campaigns the call center does not initiate calls but only answers incoming ones. Paamonim also runs special campaigns for exceptional cases, such as a very sick child or infants.
In our analysis we need to be aware of those campaigns: donations made during them cannot be counted as donations triggered by an initiated call, but they may well affect the donors’ future behaviour.
So let’s have a look at the data.
We can clearly see that the **** model successfully predicts who will donate if the call center initiates a call. The recall is 56%, which is not great, but it is better than random, so we can adopt the model as a tool, with the mindset of re-testing it in each period.
Dataset overview
## [1] "Account" "Religious"
## [3] "Amount_giving" "STATE_OF_DONOR"
## [5] "Clali" "Moked"
## [7] "not_cash" "Hefresh_from_last_year"
## [9] "TARAM_MOKED" "has_cellphone"
## [11] "email" "YADANI"
## [13] "HOK" "AMOUNT_TREND"
## [15] "COUNT_TREND"
## 'data.frame': 101491 obs. of 15 variables:
## $ Account : num 40009 40017 40038 40073 40083 ...
## $ Religious : Factor w/ 3 levels "","Mixed","Yes": 3 3 2 3 3 3 3 2 3 3 ...
## $ Amount_giving : Factor w/ 4 levels "Core","Low","Major",..: 2 2 2 2 1 2 1 2 1 4 ...
## $ STATE_OF_DONOR : Factor w/ 4 levels "Current","Dormant",..: 3 1 4 1 3 1 3 4 1 4 ...
## $ Clali : int 1 1 1 1 1 1 1 1 0 1 ...
## $ Moked : int 0 1 0 1 0 1 0 0 0 0 ...
## $ not_cash : int 1 1 1 1 1 1 1 1 0 1 ...
## $ Hefresh_from_last_year: Factor w/ 6 levels "above","near_completed",..: 6 5 4 5 6 5 6 5 3 4 ...
## $ TARAM_MOKED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ has_cellphone : int 1 1 1 0 0 1 1 1 0 0 ...
## $ email : int 1 0 0 0 1 1 1 0 0 1 ...
## $ YADANI : num 5.11 100 100 0 22.16 ...
## $ HOK : num 94.9 0 0 100 77.8 ...
## $ AMOUNT_TREND : Factor w/ 5 levels "DOWN","No Data",..: 3 3 3 3 1 3 3 3 2 3 ...
## $ COUNT_TREND : Factor w/ 5 levels "DOWN","No Data",..: 3 3 3 3 3 3 3 3 2 3 ...
## Account Religious Amount_giving STATE_OF_DONOR Clali Moked not_cash
## 1 40009 Yes Low Inactive 1 0 1
## 2 40017 Yes Low Current 1 1 1
## 3 40038 Mixed Low Lapsed 1 0 1
## 4 40073 Yes Low Current 1 1 1
## 5 40083 Yes Core Inactive 1 0 1
## 6 40085 Yes Low Current 1 1 1
## Hefresh_from_last_year TARAM_MOKED has_cellphone email YADANI HOK
## 1 started 0 1 1 5.11 94.89
## 2 not_active 0 1 0 100.00 0.00
## 3 not started 0 1 0 100.00 0.00
## 4 not_active 0 0 0 0.00 100.00
## 5 started 0 0 1 22.16 77.84
## 6 not_active 0 1 1 100.00 0.00
## AMOUNT_TREND COUNT_TREND
## 1 NOT_STABLE NOT_STABLE
## 2 NOT_STABLE NOT_STABLE
## 3 NOT_STABLE NOT_STABLE
## 4 NOT_STABLE NOT_STABLE
## 5 DOWN NOT_STABLE
## 6 NOT_STABLE NOT_STABLE
So we have 15 features:
Account - the donor id. It’s not relevant for the model, but it will be relevant for cross validation.
Religious - indicates whether the city’s residents are mostly religious (“Yes”) or the city is religiously mixed (“Mixed”); a handful of records are blank.
Amount_giving - has 4 classes, based on the donor’s average donation. “Low” - below 180 shekels (religious donors often give multiples of 18, “chai”, and 180 is ten times chai). “Core” - between 180 and 300. “Major” - 300-500. “Top” - above 500. These thresholds were set by the business side (a sketch of how such binning could be done appears after this list).
“STATE_OF_DONOR” - a feature with 4 classes. “Current” - donated in the last year. “Lapsed” - donated in the last 2 years. “Inactive” - in the last 3 years. “Dormant” - more than 3 years ago.
“Clali” - a flag indicating whether the donor ever donated for a “general” purpose.
“Moked” - a flag indicating whether the donor ever donated through the call center activity.
“not_cash” - a flag indicating whether the donor ever used a non-cash payment method, such as credit card or wire transfer.
“Hefresh_from_last_year” - a feature with 6 classes. “No Data” - never donated for a general purpose before this year. “not_active” - donated for a general purpose before, but gave nothing this year. “not started” - donated 1%-40% compared to last year. “started” - 40%-80%. “near_completed” - 80%-100%. “above” - donated more than last year.
“TARAM_MOKED” - the target feature. A flag indicating whether the donor donated for a “general” purpose through the call center activity.
“has_cellphone” - a flag indicating whether the donor has a cellphone number in the call center database.
“email” - a flag indicating whether the donor has an email address in the call center database.
“YADANI” - a numeric feature: the percentage of one-time donations out of all the donor’s general donations.
“HOK” - a numeric feature: the percentage of recurring donations made through the billing system (“Horat-Keva”) out of all the donor’s general donations.
“AMOUNT_TREND” - a feature with 5 classes. “No Data” - donors with no “general” donation history. “DOWN” - a negative trend over the last 2 years. “STABLE” - the same amount every year. “UP” - a positive trend over the last 2 years. “NOT_STABLE” - no identifiable pattern.
“COUNT_TREND” - the same as AMOUNT_TREND, but based on the number of donations instead of the amount.
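The binning itself was done in the third-party preprocessing, but purely as an illustration, a feature like Amount_giving could be derived from a raw average-donation column like so (avg_donation is a hypothetical column name, not part of this dataset):
# hypothetical sketch; avg_donation is illustrative only
mydata$Amount_giving <- cut(mydata$avg_donation,
                            breaks = c(0, 180, 300, 500, Inf),
                            labels = c("Low", "Core", "Major", "Top"),
                            right = FALSE)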
numIndex <- sapply(mydata, is.numeric)
summary(mydata[, numIndex])
## Account Clali Moked not_cash
## Min. :4.001e+04 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:2.022e+06 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.000
## Median :2.044e+06 Median :1.0000 Median :0.0000 Median :1.000
## Mean :8.409e+09 Mean :0.5316 Mean :0.3223 Mean :0.837
## 3rd Qu.:2.000e+10 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :4.000e+10 Max. :1.0000 Max. :1.0000 Max. :1.000
## TARAM_MOKED has_cellphone email YADANI
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.00
## Median :0.0000 Median :1.0000 Median :0.0000 Median : 0.00
## Mean :0.0284 Mean :0.5801 Mean :0.4923 Mean : 39.94
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:100.00
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :100.00
## HOK
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 12.56
## 3rd Qu.: 0.00
## Max. :100.00
We can see that we have no NAs. Now let’s look at each feature on its own, as a univariate analysis.
summary(mydata$Religious)
## Mixed Yes
## 26 33986 67479
We have 2 classes in the Religious city category: ‘Mixed’, meaning most of the city’s residents are not religious, and ‘Yes’, meaning they are (the 26 blanks are unlabeled records). To see the relationship with the target feature, we can plot it like so:
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Religious)
prop.table(counts, 2)
##
## Mixed Yes
## 0 0.96153846 0.97802036 0.96837535
## 1 0.03846154 0.02197964 0.03162465
barplot(counts, main="Religious and Donations", xlab="Is religious city",col=c("darkblue","red"), legend = rownames(counts))
We can see slightly more donors (in percentage terms) from religious cities who are willing to donate for a “general” purpose.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Amount_giving)
prop.table(counts, 2)
##
## Core Low Major Top
## 0 0.97355513 0.97157443 0.97031336 0.96966904
## 1 0.02644487 0.02842557 0.02968664 0.03033096
barplot(counts, main="Average amount of donation and Donations", xlab="Amount_giving",col=c("darkblue","red"), legend = rownames(counts))
We can see that slightly more Top donors are willing to donate again when the call center asks them. Interestingly, Low donors tend to respond slightly more often than Core donors. Perhaps the reason is that low amounts are easier to give.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$STATE_OF_DONOR)
prop.table(counts, 2)
##
## Current Dormant Inactive Lapsed
## 0 0.98005711 0.98017725 0.96784966 0.95122995
## 1 0.01994289 0.01982275 0.03215034 0.04877005
barplot(counts, main="STATE OF DONOR and Donations", xlab="STATE_OF_DONOR",col=c("darkblue","red"), legend = rownames(counts))
Here we can see a big difference between the Inactive and Lapsed donors on one side and the Current and Dormant donors on the other. The result is easy to explain: Current donors have donated recently, so they are unlikely to donate again so soon, and Dormant donors usually won’t donate again, either because they don’t want to (3+ years of inactivity) or because they are unreachable.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Clali)
prop.table(counts, 2)
##
## 0 1
## 0 0.98028821 0.96395211
## 1 0.01971179 0.03604789
barplot(counts, main="Clali and Donations", xlab="Clali",col=c("darkblue","red"), legend = rownames(counts))
As expected, donors who previously donated for general purposes are more willing to do it again, though it seems not by much.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Moked)
prop.table(counts, 2)
##
## 0 1
## 0 0.98514060 0.94314187
## 1 0.01485940 0.05685813
barplot(counts, main="Moked and Donations", xlab="Moked",col=c("darkblue","red"), legend = rownames(counts))
Here too the result is expected. We can easily see that donors who previously responded positively to a call center call are almost four times as likely to respond positively next time.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$not_cash)
prop.table(counts, 2)
##
## 0 1
## 0 0.98113892 0.96974655
## 1 0.01886108 0.03025345
barplot(counts, main="not cash and Donations", xlab="not_cash",col=c("darkblue","red"), legend = rownames(counts))
We can see quite clearly that the majority of donors use non-cash methods, and that the call center’s donors live mostly in the non-cash group.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Hefresh_from_last_year)
prop.table(counts, 2)
##
## above near_completed No Data not started not_active
## 0 0.827425195 0.939911448 0.994171125 0.990773962 0.990763947
## 1 0.172574805 0.060088552 0.005828875 0.009226038 0.009236053
##
## started
## 0 0.943946188
## 1 0.056053812
barplot(counts, main="Hefresh_from_last_year and Donations", xlab="Hefresh_from_last_year",col=c("darkblue","red"), legend = rownames(counts))
This result is very clear: donors who have already donated more this year than they did last year are very likely to donate again on a call center initiated call; the others are much less likely.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$has_cellphone)
prop.table(counts, 2)
##
## 0 1
## 0 0.98444137 0.96231190
## 1 0.01555863 0.03768810
barplot(counts, main="has cellphone and Donations", xlab="has_cellphone",col=c("darkblue","red"), legend = rownames(counts))
Donors with a cellphone on record are more likely to donate to the call center, but not by much. Perhaps it’s because they are easier to reach, but since the call center operates in the evening, landline versus cellphone may not really matter.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$email)
prop.table(counts, 2)
##
## 0 1
## 0 0.97351006 0.96963692
## 1 0.02648994 0.03036308
barplot(counts, main="has email and Donations", xlab="email",col=c("darkblue","red"), legend = rownames(counts))
Donors with and without an email address behave almost identically toward the call center. This can be explained by email being the least effective channel for reaching a donor.
library(ggplot2)
ggplot(data = mydata, aes(x = YADANI)) +
  geom_histogram(breaks = seq(0, 100, by = 2),
                 col = "red",
                 fill = "green",
                 alpha = .2)
It seems clear that most donors give either only one-time donations or only recurring donations (“Horat-Keva”, i.e. HOK).
A logical step is to use this field as a 1/0 flag only.
ggplot(data = mydata, aes(x = HOK)) +
  geom_histogram(breaks = seq(0, 100, by = 2),
                 col = "red",
                 fill = "green",
                 alpha = .2)
As with the YADANI field, it is logical to use a 1/0 flag only.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$AMOUNT_TREND)
prop.table(counts, 2)
##
## DOWN No Data NOT_STABLE STABLE UP
## 0 0.978122449 0.994171125 0.964733466 0.991550211 0.797049129
## 1 0.021877551 0.005828875 0.035266534 0.008449789 0.202950871
barplot(counts, main="Amount trend and Donations", xlab="Amount trend",col=c("darkblue","red"), legend = rownames(counts))
We can clearly see that when the trend is “UP”, the percentage of positive responses is considerably higher than in the other classes.
mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$COUNT_TREND)
prop.table(counts, 2)
##
## DOWN No Data NOT_STABLE STABLE UP
## 0 0.991008505 0.994171125 0.961928489 0.981547117 0.772833724
## 1 0.008991495 0.005828875 0.038071511 0.018452883 0.227166276
barplot(counts, main="Count trend and Donations", xlab="Count trend",col=c("darkblue","red"), legend = rownames(counts))
We see the same behaviour again with the donation-count trend.
Before we continue to modelling, we should use information gain to identify the highest-ranking variables.
library(FSelector)  # provides information.gain()
weights <- information.gain(TARAM_MOKED ~ ., mydata)
weights <- data.frame(VarName = rownames(weights), weight = weights[, 1])
weights <- weights[order(weights$weight, decreasing = TRUE), ]
print(weights)
## VarName weight
## 8 Hefresh_from_last_year 2.888775e-02
## 13 AMOUNT_TREND 2.280040e-02
## 14 COUNT_TREND 2.224239e-02
## 6 Moked 6.401771e-03
## 12 HOK 5.919448e-03
## 11 YADANI 3.838364e-03
## 4 STATE_OF_DONOR 2.370700e-03
## 9 has_cellphone 2.320096e-03
## 5 Clali 1.236077e-03
## 1 Account 1.151535e-03
## 2 Religious 3.921659e-04
## 7 not_cash 3.551698e-04
## 3 Amount_giving 2.352865e-05
## 10 email 0.000000e+00
We see that almost every variable carries some information, so we can’t omit anything just yet. But maybe we can test for correlation between some of the features for feature reduction. The primary suspects are the two trend features.
cols <- c("AMOUNT_TREND", "COUNT_TREND")
mydata.cor <- mydata[, match(cols, colnames(mydata))]
# treating the factor level codes as numbers is a rough heuristic, but enough for a quick check
mydata.cor$AMOUNT_TREND <- as.numeric(mydata.cor$AMOUNT_TREND)
mydata.cor$COUNT_TREND <- as.numeric(mydata.cor$COUNT_TREND)
mycor <- cor(mydata.cor)
mycor
## AMOUNT_TREND COUNT_TREND
## AMOUNT_TREND 1.0000000 0.8080453
## COUNT_TREND 0.8080453 1.0000000
So we can confidently say that COUNT_TREND and AMOUNT_TREND are highly correlated, and we can omit one of them.
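A side note: coding factor levels as numbers for a Pearson correlation is a rough heuristic. A measure designed for two categorical variables, such as Cramér’s V from a chi-squared test, would be the more principled check; a minimal sketch:
tbl <- table(mydata$AMOUNT_TREND, mydata$COUNT_TREND)
chi2 <- chisq.test(tbl)$statistic
cramers_v <- as.numeric(sqrt(chi2 / (sum(tbl) * (min(dim(tbl)) - 1))))  # 0 = independent, 1 = fully associated
cramers_v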
So, as noted, we can omit one column. We can also try regression-style models; let’s first build a convenient dataset for each type of model, and then run through the models one by one.
# removing count trend feature
cols <- "COUNT_TREND"
mydata <- mydata[,-match(cols,colnames(mydata))]
Now the data frame is ready for classification models, but not for regression. Let’s create a data frame suited for regression models. We can easily replace all the flags with numeric values, and we can make Amount_giving and STATE_OF_DONOR numeric by using their level order. But we have “No Data” customers that we can’t treat the same as a real 0. Since the YADANI and HOK percentages have very low information gain, we reduce them to 0/1 flags, as planned earlier. The AMOUNT_TREND and Hefresh_from_last_year features can be converted to ordered levels; for Hefresh_from_last_year we can safely place “No Data” low in the order, because it means the donor never donated before.
# convert the HOK and YADANI fields to boolean flags, as planned
mydata$HOK <- ifelse(mydata$HOK < 0.1, 0, 1)
mydata$YADANI <- ifelse(mydata$YADANI < 0.1, 0, 1)
mydata_numeric <- mydata
# convert to numeric
mydata_numeric$Religious <- ifelse(mydata_numeric$Religious == "Yes",1,0)
mydata_numeric$Amount_giving.f <- as.numeric(factor(mydata_numeric$Amount_giving , levels=c("Low", "Core", "Major", "Top")))
mydata_numeric$STATE_OF_DONOR.f <- as.numeric(factor(mydata_numeric$STATE_OF_DONOR , levels=c("Dormant", "Inactive", "Lapsed", "Current")))
mydata_numeric$AMOUNT_TREND.f <- as.numeric(factor(mydata_numeric$AMOUNT_TREND , levels=c("DOWN", "No Data", "NOT_STABLE", "STABLE", "UP")))
mydata_numeric$Hefresh_from_last_year.f <- as.numeric(factor(mydata_numeric$Hefresh_from_last_year , levels=c("not_active", "No Data", "not started", "started", "near_completed", "above")))
cols <- c("STATE_OF_DONOR", "Amount_giving", "AMOUNT_TREND", "Hefresh_from_last_year")
mydata_numeric <- mydata_numeric[, -match(cols, colnames(mydata_numeric))]
mydata_numeric <- as.data.frame(lapply(mydata_numeric, function(x) as.numeric(as.character(x))))
Now let’s set the target feature and the model formula.
target_feature <- "TARAM_MOKED"
fmla <- as.formula(paste(target_feature,"~.",sep=""))
target_feature_numeric <- "TARAM_MOKED"
fmla_numeric <- as.formula(paste(target_feature_numeric,"~.",sep=""))
# splitdf returns a list of training and testing sets; we use 80/20 rather than
# 2/3-1/3 to make sure the training set includes enough positive targets
splitdf <- function(df, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(df)
  trainindex <- sample(index, trunc(length(index) * 0.8))
  trainset <- df[trainindex, ]
  testset <- df[-trainindex, ]
  list(trainset = trainset, testset = testset)
}
#apply the function
splits <- splitdf(mydata, seed=808)
# save the training and testing sets as data frames
train_set <- splits$trainset
test_set <- splits$testset
# split the numeric dataframe with the same splitdf function
#apply the function
splits <- splitdf(mydata_numeric, seed=808)
# save the training and testing sets as data frames
train_set_numeric <- splits$trainset
test_set_numeric <- splits$testset
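Note that splitdf draws a simple random 80/20 sample, so the ~3% positive rate is only preserved in expectation. A stratified split would guarantee it in both sets; a minimal sketch, assuming the caret package is available:
library(caret)
set.seed(808)
trainindex <- createDataPartition(mydata$TARAM_MOKED, p = 0.8, list = FALSE)  # stratified on the target
train_set_strat <- mydata[trainindex, ]
test_set_strat  <- mydata[-trainindex, ]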
Since this is a classification question, we will first try classification models.
library(rpart)
library(rattle)   # for fancyRpartPlot
dt_model_train <- rpart(fmla, train_set)
rctrl <- rpart.control(minbucket = 20   # minimum observations in a terminal node
                       , minsplit = 20  # minimum observations in a node before a split is attempted
                       , cp = 0)        # grow the full tree; we will prune it later
dt_model_complete <- rpart(fmla, train_set, control = rctrl, method = "class")
dt_predict <- predict(dt_model_complete, test_set, type = "class")
fancyRpartPlot(dt_model_complete)
Here we can see that the tree is relatively big compared to what we expected, and it may be worth trying pruning.
Let’s first take a look at the tree’s confusion matrix:
# Confusion matrix
confusion_matrix <- table(test_set[,target_feature],dt_predict)
confusion_matrix
## dt_predict
## 0 1
## 0 19639 74
## 1 401 185
accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)
accuracy
## [1] 0.9765998
recall <- confusion_matrix[2,2]/sum(confusion_matrix[2,])
recall
## [1] 0.3156997
OK, so the recall (which we want to be as high as possible, while accuracy matters little as long as it stays above 90%) is about 32%. Not bad. Let’s see if we can do some pruning.
cp <- printcp(dt_model_complete)
##
## Classification tree:
## rpart(formula = fmla, data = train_set, method = "class", control = rctrl)
##
## Variables actually used in tree construction:
## [1] Account Amount_giving Clali
## [4] email has_cellphone Hefresh_from_last_year
## [7] HOK Moked not_cash
## [10] Religious STATE_OF_DONOR YADANI
##
## Root node error: 2296/81192 = 0.028279
##
## n= 81192
##
## CP nsplit rel error xerror xstd
## 1 0.03208479 0 1.00000 1.00000 0.020572
## 2 0.01916376 3 0.90375 0.90375 0.019585
## 3 0.01742160 5 0.86542 0.89547 0.019497
## 4 0.00696864 6 0.84800 0.84800 0.018986
## 5 0.00468206 7 0.84103 0.84321 0.018934
## 6 0.00435540 11 0.82230 0.83319 0.018824
## 7 0.00348432 14 0.80923 0.82883 0.018776
## 8 0.00261324 17 0.79878 0.81490 0.018621
## 9 0.00217770 19 0.79355 0.81577 0.018631
## 10 0.00195993 22 0.78702 0.81533 0.018626
## 11 0.00159698 25 0.78092 0.81141 0.018582
## 12 0.00152439 30 0.77091 0.81054 0.018572
## 13 0.00087108 32 0.76786 0.80793 0.018543
## 14 0.00072590 35 0.76524 0.81620 0.018636
## 15 0.00058072 38 0.76307 0.81838 0.018660
## 16 0.00043554 41 0.76132 0.81838 0.018660
## 17 0.00021777 45 0.75958 0.83275 0.018819
## 18 0.00000000 51 0.75828 0.82883 0.018776
plot(cp[,2],cp[,3],type="l")
lines(cp[,2],cp[,4],col="red")
We can see that the cross-validated error (xerror) reaches its minimum at around 30-32 splits and starts rising afterwards, so we will stop there.
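As a small sketch, the same choice can be made programmatically from the cp table stored above (column 4 of the table is xerror, column 1 is CP):
best_cp <- cp[which.min(cp[, 4]), 1]  # cp value with the lowest cross-validated error
best_cp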
pruned_dt_model <- prune(dt_model_complete,cp = 0.00159698026)
fancyRpartPlot(pruned_dt_model)
redt_predict <- predict(pruned_dt_model, test_set, type = "class")
confusion_matrix <- table(test_set[, target_feature], redt_predict)
confusion_matrix
## redt_predict
## 0 1
## 0 19639 74
## 1 401 185
We can see that the confusion matrix barely changed, so the pruning gave us a simpler tree without hurting performance.
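For completeness, here is the recall of the pruned tree computed from the matrix above, to compare against the roughly 32% of the full tree:
pruned_recall <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
pruned_recall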
rmse_summary<-data.frame(model="DT", RMSE= rmse(test_set$TARAM_MOKED,dt_predict))
recall_summary <- data.frame(model= "DT", recall = recall)
rocrpred <- ROCR::prediction(as.numeric(redt_predict), test_set$TARAM_MOKED)
DT_ROC<-performance(rocrpred, "tpr", "fpr")
mydata_cat <- as.data.frame(sapply(mydata, as.factor))
# all features except Account (the id) and TARAM_MOKED (the target)
cat_features <- names(mydata_cat)[c(2,3,4,5,6,7,8,10,11,12,13,14)]
# split the categorical dataframe with the same splitdf function (again 80/20, to keep enough positive targets)
#apply the function
splits <- splitdf(mydata_cat, seed=808)
# save the training and testing sets as data frames
train_set <- splits$trainset
test_set <- splits$testset
library(e1071)  # naiveBayes() and svm()
NB_model <- naiveBayes(x = train_set[, cat_features], y = train_set$TARAM_MOKED)
NB_predict <- predict(NB_model, test_set[, cat_features], type = "class")
confusion_matrix_NB <- table(test_set$TARAM_MOKED,NB_predict)
confusion_matrix_NB
## NB_predict
## 0 1
## 0 18865 848
## 1 315 271
accuracy <- sum(diag(confusion_matrix_NB))/sum(confusion_matrix_NB)
accuracy
## [1] 0.9427065
recall <- confusion_matrix_NB[2,2]/sum(confusion_matrix_NB[2,])
recall
## [1] 0.4624573
recall_summary <- rbind(recall_summary,data.frame(model= "NB", recall = recall))
rmse_summary<-rbind(rmse_summary, data.frame(model="NB", RMSE= rmse(test_set$TARAM_MOKED,NB_predict)))
rocrpred <- ROCR::prediction(as.numeric(NB_predict), test_set$TARAM_MOKED)
NB_ROC<-performance(rocrpred, "tpr", "fpr")
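Since recall is what we care about, one knob worth noting, as a sketch rather than something done in this project: naiveBayes can return posterior probabilities, and lowering the decision threshold below the default 0.5 trades accuracy for recall (the 0.3 below is an arbitrary illustration, not a tuned value):
NB_prob <- predict(NB_model, test_set[, cat_features], type = "raw")  # posterior probabilities per class
NB_predict_03 <- ifelse(NB_prob[, "1"] >= 0.3, 1, 0)                  # illustrative 0.3 threshold
table(test_set$TARAM_MOKED, NB_predict_03)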
# all columns except the target, selected by name rather than by position
numeric_features <- setdiff(names(mydata_numeric), "TARAM_MOKED")
mydata_numeric$TARAM_MOKED <- as.factor(mydata_numeric$TARAM_MOKED)
SVM_model <- svm(fmla_numeric,train_set_numeric,kernel = "polynomial", type = "C")
SVM_Predict <- predict(SVM_model,test_set_numeric)
confusion_matrix_SVM <- table(test_set_numeric$TARAM_MOKED,SVM_Predict)
confusion_matrix_SVM
## SVM_Predict
## 0 1
## 0 19687 26
## 1 474 112
accuracy <- sum(diag(confusion_matrix_SVM))/sum(confusion_matrix_SVM)
accuracy
## [1] 0.9753682
recall <- confusion_matrix_SVM[2,2]/sum(confusion_matrix_SVM[2,])
recall
## [1] 0.1911263
recall_summary <- rbind(recall_summary,data.frame(model= "SVM", recall = recall))
rmse_summary<-rbind(rmse_summary, data.frame(model="SVM", RMSE= rmse(test_set_numeric$TARAM_MOKED,SVM_Predict)))
rocrpred <- ROCR::prediction(as.numeric(SVM_Predict), test_set_numeric$TARAM_MOKED)
svm_ROC<-performance(rocrpred, "tpr", "fpr")
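A hedged side note on the weak SVM recall: with only ~3% positives, the default SVM objective barely pays attention to the minority class. e1071’s svm exposes class.weights, and upweighting the rare positive class is one standard remedy; a sketch (the 10x weight is illustrative, not tuned):
SVM_model_w <- svm(fmla_numeric, train_set_numeric, kernel = "polynomial",
                   type = "C", class.weights = c("0" = 1, "1" = 10))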
# min-max normalize all predictor columns by name (everything except the target), then re-attach the target
predictor_cols <- setdiff(names(train_set_numeric), "TARAM_MOKED")
train_set_nnet <- as.data.frame(sapply(train_set_numeric[, predictor_cols], function(x) (x - min(x)) / (max(x) - min(x))))
train_set_nnet$TARAM_MOKED <- as.factor(train_set_numeric$TARAM_MOKED)
library(nnet)
nnet_model <- nnet(fmla, data = train_set_nnet, size = 3, maxit = 1000, linout = FALSE)
## # weights: 40
## initial value 56008.963020
## final value 0.000000
## converged
test_set_nnet <- sapply(test_set_numeric[, predictor_cols], function(x) (x - min(x)) / (max(x) - min(x)))
nnet_pred <- predict(nnet_model,test_set_nnet, type = "class")
confusion_matrix_nnet <- table(test_set_numeric$TARAM_MOKED,nnet_pred)
confusion_matrix_nnet
## nnet_pred
## 0 1
## 0 13559 6154
## 1 199 387
accuracy <- sum(diag(confusion_matrix_nnet))/sum(confusion_matrix_nnet)
accuracy
## [1] 0.6870289
recall <- confusion_matrix_nnet[2,2]/sum(confusion_matrix_nnet[2,])
recall
## [1] 0.6604096
recall_summary <- rbind(recall_summary,data.frame(model= "nn", recall = recall))
rmse_summary<-rbind(rmse_summary, data.frame(model="nn", RMSE= rmse(test_set_numeric$TARAM_MOKED,as.numeric(nnet_pred))))
rocrpred <- ROCR::prediction(as.numeric(nnet_pred), test_set_numeric$TARAM_MOKED)
nnet_ROC<-performance(rocrpred, "tpr", "fpr")
It’s important to pay attention here: we got 66% recall, but at what price? Only 69% accuracy.
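To make that price concrete, we can also compute the precision from the matrix above: of everything the net flags as positive, only a small fraction are real donors.
precision_nnet <- confusion_matrix_nnet[2, 2] / sum(confusion_matrix_nnet[, 2])
precision_nnet  # 387 / (387 + 6154), roughly 6% of the flagged donors actually donate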
n<-nrow(test_set)
percentOf1<-nrow(test_set[test_set$TARAM_MOKED==1,])/n
#Create normal distribution of random numbers between 0 and 1 (the random numbers simulate
#probability of "1"), and find X value which represents percent of "1" in test set
set.seed(n)
normDistribution<-rnorm(n, mean = 0, sd=1)
minND<-min(normDistribution)
maxND<-max(normDistribution)
normDistribution<- sapply(normDistribution, function(x) (x-minND)/(maxND-minND))
X<-qnorm(1-percentOf1, mean = mean(normDistribution), sd = sd(normDistribution))
#plot( density(normDistribution))
# Apply "1" to numbers with simulated probability equal or more than X value, "0" for the rest
NoModel<-sapply(normDistribution, function(x) if (x>=X) 1 else 0 )
confusion_matrix <- table(test_set[,target_feature],NoModel)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")
recall.random <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
accuracy.random<-sum(diag(confusion_matrix))/sum(confusion_matrix)
noModel_predict <-ROCR::prediction(as.numeric(normDistribution), test_set$TARAM_MOKED)
noModel_ROC<-performance(noModel_predict, "tpr", "fpr")
confusion_matrix
## NoModel
## predicted 0 predicted 1
## actual 0 19179 534
## actual 1 566 20
colors <- c('black','red','blue','green', 'orange')
legends_list <- c("noModel", "nnet", "SVM", "Decision tree", "Naive Bayes")
plot.new()
plot(noModel_ROC)
plot(nnet_ROC,add =TRUE, col = "red")
plot(svm_ROC,add =TRUE, col = "blue")
plot(DT_ROC,add =TRUE, col = "green")
plot(NB_ROC,add =TRUE, col = "orange")
legend(x = "topright",legend = legends_list, fill = colors)
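As a sketch, the ROC curves can be complemented with a single AUC number per model using ROCR’s “auc” measure, for example for the Naive Bayes predictions:
rocrpred_nb <- ROCR::prediction(as.numeric(NB_predict), test_set$TARAM_MOKED)
auc_nb <- ROCR::performance(rocrpred_nb, "auc")@y.values[[1]]
auc_nb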
print(rmse_summary)
## model RMSE
## 1 DT 0.1529711
## 2 NB 0.2393605
## 3 SVM 0.9901978
## 4 nn 0.5594382
print(recall_summary)
## model recall
## 1 DT 0.3156997
## 2 NB 0.4624573
## 3 SVM 0.1911263
## 4 nn 0.6604096
So, in fact, we have a winner. The Naive Bayes model offers the best trade-off: 46% recall at 94% accuracy. It costs some accuracy compared to the decision tree, but we will reach many more donors that way. The nnet model reached the highest recall, but at a heavy cost in accuracy, and that trade-off is probably not cost-effective. Next summer we will try the Naive Bayes model as a pilot; let’s hope it brings added value to the call center.