Introduction

This dataset was exported from the Paamonim association. Paamonim is a nationwide volunteer organization dedicated to helping families and individuals conduct their household finances with balance, responsibility and integrity, thereby strengthening Israel's economic and social fabric. With the help of its 2,800 volunteers, Paamonim provides a wide range of solutions to thousands of Israeli families and individuals each year through its targeted economic recovery programs, and offers information and tools for the general public interested in learning the best methods for managing their household finances.

Paamonim has a call center whose role is to call past donors and ask whether they are willing to donate more.

Nowadays the call center builds a campaign by slicing the population with rules such as "let's call those who didn't donate in the last year".

The purpose of this ML project is to make every call count by predicting who will donate if the call center initiates the call.

Scope of work

The data we received covers 5 years of donations; each record represents one donation. A donation can be made by any payment method, such as wire transfer, credit card, cheque or cash.

This is a time-series dataset and time has meaning, but in the end this is a classification question.

After preprocessing the file in a 3rd-party application, we have a ready dataset.

Pre-inspection challenges identified

Before we start the analysis, it is important to understand the basic challenges of the model we are looking for. The first step is to define precisely the point in time at which we want the model to operate, what data we have at that exact moment, and whether there is data that could skew our results if we do not handle it up front.

The question: can we predict whether an initiated call will be successful when it is made in the largest time window without any special holiday campaign? This timeframe is right after Passover and before Rosh Hashanah. The Passover campaign starts right after Purim, and Purim is only 2 months after Hanukkah, which is 2.5 months after Rosh Hashanah.

So the time when we most need an effective model for outgoing calls is right after Passover.

As of now, the call center gets about 3% positive responses (we can see in the data below that roughly 3% of the population responds positively).

Given that, we need to keep in mind that the accuracy of the models is far less important than the recall, i.e. the fraction of actual positive responders that the model manages to predict as positive.
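To make this concrete, here is a minimal R sketch with hypothetical counts at a ~3% positive rate, showing how a model can look excellent on accuracy while being useless on recall:

# hypothetical counts for 10,000 calls at a ~3% positive rate,
# for a model that always predicts "won't donate"
cm <- matrix(c(9700, 0,
                300, 0), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0","1"), predicted = c("0","1")))
accuracy <- sum(diag(cm)) / sum(cm)       # 0.97 - looks impressive
recall   <- cm["1","1"] / sum(cm["1",])   # 0.00 - catches no donors at all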

background

Paamonim launches a special campaign on every Jewish holiday, during which the call center does not initiate calls but only answers incoming ones. Paamonim also runs special campaigns for special cases, such as a very sick child or infants.

In our analysis we need to be aware of those campaigns: the donations they generate cannot be considered donations made by call-center initiation, but they may still affect the future behaviour of the donors.

So let's have a look at the data.

The most important conclusions, by stage, are:

EDA

  1. The trend features are the best predictors of future donor behaviour.
  2. We can test the model next summer, but it appears the model can make the call center's outgoing calls far more effective. As of now the call center gets a donation from roughly one donor per 100 contact attempts; with the Naive Bayes model's 40%+ recall, roughly one in four of the donors it flags responds positively (see the precision check at the end of this report).
  3. Even though the recall is low in absolute terms, it suits the call center's needs very well.

Modelling

We can clearly see that the **** model successfully predicts who will donate if the call center initiates a call. The recall is 56%, which is not very good, but it is better than random, so we can accept the model as a tool, with the mindset of re-testing it in each period.

Data Understanding

Dataset overview
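The inspection output below was presumably produced along these lines (the file name is a placeholder, not the actual export path):

# load the preprocessed export (hypothetical file name)
mydata <- read.csv("paamonim_donations.csv", stringsAsFactors = TRUE)

names(mydata)  # feature names
str(mydata)    # structure and types
head(mydata)   # first few records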

##  [1] "Account"                "Religious"             
##  [3] "Amount_giving"          "STATE_OF_DONOR"        
##  [5] "Clali"                  "Moked"                 
##  [7] "not_cash"               "Hefresh_from_last_year"
##  [9] "TARAM_MOKED"            "has_cellphone"         
## [11] "email"                  "YADANI"                
## [13] "HOK"                    "AMOUNT_TREND"          
## [15] "COUNT_TREND"
## 'data.frame':    101491 obs. of  15 variables:
##  $ Account               : num  40009 40017 40038 40073 40083 ...
##  $ Religious             : Factor w/ 3 levels "","Mixed","Yes": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Amount_giving         : Factor w/ 4 levels "Core","Low","Major",..: 2 2 2 2 1 2 1 2 1 4 ...
##  $ STATE_OF_DONOR        : Factor w/ 4 levels "Current","Dormant",..: 3 1 4 1 3 1 3 4 1 4 ...
##  $ Clali                 : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Moked                 : int  0 1 0 1 0 1 0 0 0 0 ...
##  $ not_cash              : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Hefresh_from_last_year: Factor w/ 6 levels "above","near_completed",..: 6 5 4 5 6 5 6 5 3 4 ...
##  $ TARAM_MOKED           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_cellphone         : int  1 1 1 0 0 1 1 1 0 0 ...
##  $ email                 : int  1 0 0 0 1 1 1 0 0 1 ...
##  $ YADANI                : num  5.11 100 100 0 22.16 ...
##  $ HOK                   : num  94.9 0 0 100 77.8 ...
##  $ AMOUNT_TREND          : Factor w/ 5 levels "DOWN","No Data",..: 3 3 3 3 1 3 3 3 2 3 ...
##  $ COUNT_TREND           : Factor w/ 5 levels "DOWN","No Data",..: 3 3 3 3 3 3 3 3 2 3 ...
##   Account Religious Amount_giving STATE_OF_DONOR Clali Moked not_cash
## 1   40009       Yes           Low       Inactive     1     0        1
## 2   40017       Yes           Low        Current     1     1        1
## 3   40038     Mixed           Low         Lapsed     1     0        1
## 4   40073       Yes           Low        Current     1     1        1
## 5   40083       Yes          Core       Inactive     1     0        1
## 6   40085       Yes           Low        Current     1     1        1
##   Hefresh_from_last_year TARAM_MOKED has_cellphone email YADANI    HOK
## 1                started           0             1     1   5.11  94.89
## 2             not_active           0             1     0 100.00   0.00
## 3            not started           0             1     0 100.00   0.00
## 4             not_active           0             0     0   0.00 100.00
## 5                started           0             0     1  22.16  77.84
## 6             not_active           0             1     1 100.00   0.00
##   AMOUNT_TREND COUNT_TREND
## 1   NOT_STABLE  NOT_STABLE
## 2   NOT_STABLE  NOT_STABLE
## 3   NOT_STABLE  NOT_STABLE
## 4   NOT_STABLE  NOT_STABLE
## 5         DOWN  NOT_STABLE
## 6   NOT_STABLE  NOT_STABLE

So we have 15 features:

  1. Account - the donor id. It is not relevant for the model itself, but it is relevant for cross-validation.

  2. Religious - represents whether the donor's city is mostly religious or a mixed-religion city. It has 2 classes, "Mixed" and "Yes".

  3. Amount_giving - a feature with 4 classes. "Low" - donors with a low average donation (below 180 shekels, i.e. 10 times "chai" - 18 - an amount religious donors favor). "Core" - between 180 and 300. "Major" - 300-500. "Top" - above 500. These thresholds were set by the business side.

  4. STATE_OF_DONOR - a feature with 4 classes. "Current" - donated within the last year. "Lapsed" - last donated within the last 2 years. "Inactive" - within the last 3 years. "Dormant" - more than 3 years ago.

  5. Clali - a flag indicating whether the donor ever donated for a "general" purpose.

  6. Moked - a flag indicating whether the donor ever donated through the call center activity.

  7. not_cash - a flag indicating whether the donor ever used a non-cash payment method such as credit card or wire transfer.

  8. Hefresh_from_last_year - a feature with 5 classes plus one blank class. Blank - never donated for a general purpose before this year. "not_active" - donated for a general purpose before, but donated nothing this year. "not started" - donated 1%-40% compared to last year. "started" - 40%-80%. "near_completed" - 80%-100%. "above" - donated more than last year.

  9. TARAM_MOKED - the target feature. A flag indicating whether the donor donated to a "general" purpose through the call center activity.

  10. has_cellphone - a flag indicating whether the donor has a cellphone number in the call center database.

  11. email - a flag indicating whether the donor has an email address in the call center database.

  12. YADANI - a numeric feature representing the percentage of one-time donations out of the donor's "general" donations.

  13. HOK - a numeric feature representing the percentage of recurring donations made through the billing system out of the donor's "general" donations.

  14. AMOUNT_TREND - a feature with 4 classes plus a "No Data" class. "No Data" - donors without any "general" donation history. "DOWN" - a negative trend over the last 2 years. "STABLE" - the same amount every year. "UP" - a positive trend over the last 2 years. "NOT_STABLE" - no identifiable pattern.

  15. COUNT_TREND - the same as AMOUNT_TREND, but based on the number of donations instead of the amount.

# summarize only the numeric features
numIndex <- sapply(mydata, is.numeric)
summary(mydata[, numIndex])
##     Account              Clali            Moked           not_cash    
##  Min.   :4.001e+04   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:2.022e+06   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.000  
##  Median :2.044e+06   Median :1.0000   Median :0.0000   Median :1.000  
##  Mean   :8.409e+09   Mean   :0.5316   Mean   :0.3223   Mean   :0.837  
##  3rd Qu.:2.000e+10   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :4.000e+10   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##   TARAM_MOKED     has_cellphone        email            YADANI      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  0.00  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :  0.00  
##  Mean   :0.0284   Mean   :0.5801   Mean   :0.4923   Mean   : 39.94  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:100.00  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :100.00  
##       HOK        
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median :  0.00  
##  Mean   : 12.56  
##  3rd Qu.:  0.00  
##  Max.   :100.00

We can see we don't have any NAs. Now let's look at each feature by itself as a univariate analysis.
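To verify the no-NA claim directly instead of eyeballing the summary, a quick check:

sum(is.na(mydata))      # total NA count, expected to be 0
colSums(is.na(mydata))  # per-feature NA counts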

Religious

summary(mydata$Religious)
##       Mixed   Yes 
##    26 33986 67479

We have 2 classes of religious-city categories (plus 26 blank records): 'Mixed', which means that most of the city's citizens are not religious, and 'Yes', which means they are. To see the relationship with the target feature we can plot it like so:

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Religious)
prop.table(counts, 2)
##    
##                     Mixed        Yes
##   0 0.96153846 0.97802036 0.96837535
##   1 0.03846154 0.02197964 0.03162465
barplot(counts, main="Religious and Donations", xlab="Is religious city",col=c("darkblue","red"),   legend = rownames(counts))

We can see that, in percentage terms, slightly more donors from religious cities are willing to donate for a "general" purpose.

Amount giving

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Amount_giving)
prop.table(counts, 2)
##    
##           Core        Low      Major        Top
##   0 0.97355513 0.97157443 0.97031336 0.96966904
##   1 0.02644487 0.02842557 0.02968664 0.03033096
barplot(counts, main="Average amount of donation and Donations", xlab="Amount_giving",col=c("darkblue","red"),  legend = rownames(counts))

We can see that slightly more Top donors are willing to give again when the call center asks. Interestingly, Low donors tend to respond slightly better than Core donors; maybe the reason is that low amounts are easier to give.

State of Donor

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$STATE_OF_DONOR)
prop.table(counts, 2)
##    
##        Current    Dormant   Inactive     Lapsed
##   0 0.98005711 0.98017725 0.96784966 0.95122995
##   1 0.01994289 0.01982275 0.03215034 0.04877005
barplot(counts, main="STATE OF DONOR and Donations", xlab="STATE_OF_DONOR",col=c("darkblue","red"), legend = rownames(counts))

Here we can see a big difference between Inactive and Lapsed donors on the one hand and Current and Dormant donors on the other. The result is easy to explain: Current donors donated very recently, so they are unlikely to donate again so soon, while Dormant donors usually won't donate again, either because they don't want to (3 years of inactivity) or because they are unreachable.

Clali

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Clali)
prop.table(counts, 2)
##    
##              0          1
##   0 0.98028821 0.96395211
##   1 0.01971179 0.03604789
barplot(counts, main="Clali and Donations", xlab="Clali",col=c("darkblue","red"),   legend = rownames(counts))

As expected, donors who previously donated for general purposes are more willing to do it again, though it seems not by much.

Moked

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Moked)
prop.table(counts, 2)
##    
##              0          1
##   0 0.98514060 0.94314187
##   1 0.01485940 0.05685813
barplot(counts, main="Moked and Donations", xlab="Moked",col=c("darkblue","red"),   legend = rownames(counts))

Here too the result is expected. We can easily see that donors who previously responded positively to a call-center call are almost 4 times as likely (5.7% vs. 1.5%) to respond positively next time; a quick check follows.
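To put a number on that multiplier, we can take the ratio of the two positive-response rates straight from the proportion table just computed:

# ratio of positive-response rates: prior call-center donors vs. the rest
rates <- prop.table(counts, 2)["1", ]
unname(rates["1"] / rates["0"])  # roughly 3.8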

Not cash

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$not_cash)
prop.table(counts, 2)
##    
##              0          1
##   0 0.98113892 0.96974655
##   1 0.01886108 0.03025345
barplot(counts, main="not cash and Donations", xlab="not_cash",col=c("darkblue","red"), legend = rownames(counts))

We can see straight away that the majority of donors use non-cash methods, and that the call-center donors live mostly in the non-cash group.

Hefresh from last year

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$Hefresh_from_last_year)
prop.table(counts, 2)
##    
##           above near_completed     No Data not started  not_active
##   0 0.827425195    0.939911448 0.994171125 0.990773962 0.990763947
##   1 0.172574805    0.060088552 0.005828875 0.009226038 0.009236053
##    
##         started
##   0 0.943946188
##   1 0.056053812
barplot(counts, main="Hefresh_from_last_year and Donations", xlab="Hefresh_from_last_year",col=c("darkblue","red"), legend = rownames(counts))

This result is very clear: donors who donated more this year than last year are much more likely to donate again on a call-center-initiated call; the others are less likely.

has cellphone

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$has_cellphone)
prop.table(counts, 2)
##    
##              0          1
##   0 0.98444137 0.96231190
##   1 0.01555863 0.03768810
barplot(counts, main="has cellphone and Donations", xlab="has_cellphone",col=c("darkblue","red"),   legend = rownames(counts))

Donors with a cellphone are more likely to donate through the call center, though not by much. Maybe it is because they are easier to reach; but since the call center operates in the evening hours, perhaps landline versus cellphone doesn't really matter.

has email

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$email)
prop.table(counts, 2)
##    
##              0          1
##   0 0.97351006 0.96963692
##   1 0.02648994 0.03036308
barplot(counts, main="has email and Donations", xlab="email",col=c("darkblue","red"),   legend = rownames(counts))

Donors with and without email show almost no difference in their behaviour towards donating through the call center. This can be explained by email being the least effective channel for reaching a donor.

Yadani

ggplot(data = mydata, aes(x = YADANI)) +
  geom_histogram(breaks = seq(0, 100, by = 2),
                 col = "red",
                 fill = "green",
                 alpha = .2)

It seems obvious that most donors either give only one-time donations or only recurring donations ("Horaat-Keva", or "HOK").

A logical step is to use this field as a 0/1 flag only.
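To quantify the all-or-nothing pattern before committing to the conversion (the 2%/98% cut-offs here are arbitrary), one could check:

# share of donors whose one-time-donation percentage is essentially all-or-nothing
mean(mydata$YADANI <= 2 | mydata$YADANI >= 98)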

HOK

ggplot(data = mydata, aes(x = HOK)) +
  geom_histogram(breaks = seq(0, 100, by = 2),
                 col = "red",
                 fill = "green",
                 alpha = .2)

As with the YADANI field, it would be logical to use only 0 or 1.

Amount trend

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$AMOUNT_TREND)
prop.table(counts, 2)
##    
##            DOWN     No Data  NOT_STABLE      STABLE          UP
##   0 0.978122449 0.994171125 0.964733466 0.991550211 0.797049129
##   1 0.021877551 0.005828875 0.035266534 0.008449789 0.202950871
barplot(counts, main="Amount trend and Donations", xlab="Amount trend",col=c("darkblue","red"), legend = rownames(counts))

We can see obviously that when the trend is “up” the percentage of positive respond is bigger than the other classes.

Count trend

mydata$TARAM_MOKED <- as.factor(mydata$TARAM_MOKED)
counts <- table(mydata$TARAM_MOKED,mydata$COUNT_TREND)
prop.table(counts, 2)
##    
##            DOWN     No Data  NOT_STABLE      STABLE          UP
##   0 0.991008505 0.994171125 0.961928489 0.981547117 0.772833724
##   1 0.008991495 0.005828875 0.038071511 0.018452883 0.227166276
barplot(counts, main="Count trend and Donations", xlab="Count trend",col=c("darkblue","red"),   legend = rownames(counts))

We can see the same behaviour again in the number of donation trend also.

information gain

Before we continue to modelling, we should use information gain to identify the highest-ranking variables.

library(FSelector)  # provides information.gain()
weights <- information.gain(TARAM_MOKED~., mydata)
weights<-data.frame(VarName=rownames(weights), weight=weights[,1])
weights<-weights[order(weights$weight, decreasing = TRUE),]
print(weights)
##                   VarName       weight
## 8  Hefresh_from_last_year 2.888775e-02
## 13           AMOUNT_TREND 2.280040e-02
## 14            COUNT_TREND 2.224239e-02
## 6                   Moked 6.401771e-03
## 12                    HOK 5.919448e-03
## 11                 YADANI 3.838364e-03
## 4          STATE_OF_DONOR 2.370700e-03
## 9           has_cellphone 2.320096e-03
## 5                   Clali 1.236077e-03
## 1                 Account 1.151535e-03
## 2               Religious 3.921659e-04
## 7                not_cash 3.551698e-04
## 3           Amount_giving 2.352865e-05
## 10                  email 0.000000e+00

We see that nearly every variable carries some information (only email scores exactly zero), so we can't omit much just yet. But maybe we can test for correlation between some of the features for feature reduction. The primary suspects are the trends.

cols <- c("AMOUNT_TREND", "COUNT_TREND")
mydata.cor <- mydata[,match(cols,colnames(mydata))]
mydata.cor$AMOUNT_TREND <- as.numeric(mydata.cor$AMOUNT_TREND)
mydata.cor$COUNT_TREND <- as.numeric(mydata.cor$COUNT_TREND)

mycor <- cor(mydata.cor)
mycor
##              AMOUNT_TREND COUNT_TREND
## AMOUNT_TREND    1.0000000   0.8080453
## COUNT_TREND     0.8080453   1.0000000

So we can clearly say that COUNT_TREND and AMOUNT_TREND are highly correlated (r ≈ 0.81), and we can omit one of them.
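Pearson correlation on factor level codes is only a rough heuristic; as a sanity check for two categorical features, Cramér's V (computed from scratch in base R below) should tell the same story:

# Cramér's V between the two trend factors, base R only
tab  <- table(mydata$AMOUNT_TREND, mydata$COUNT_TREND)
chi2 <- suppressWarnings(chisq.test(tab)$statistic)
cramers_v <- sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
cramers_v  # values near 1 indicate a strong association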

Modelling

As noted, we can omit one column, and we can also try regression-style models. Let's first build a convenient dataset for each type of model, and then run through the models one by one.

# removing count trend feature
cols <- "COUNT_TREND"
mydata <-  mydata[,-match(cols,colnames(mydata))]

Now the data frame is ready for classification models but not for regression. Let's create a data frame suited for regression models. We can easily convert all the flags to numeric, and we can make Amount_giving and STATE_OF_DONOR numeric by using their levels. But we have "No Data" customers whom we can't treat as 0 the way we treat donors with a real 0. Since the features YADANI and HOK have very low information gain, we could omit them; instead we only flatten them to 0/1 flags. The AMOUNT_TREND and Hefresh_from_last_year features can be converted to ordered levels - and we can certainly place "No Data" in the ordering, because it means those donors never donated before.

# convert the HOK and YADANI percentage fields to boolean flags
mydata$HOK <- ifelse(mydata$HOK < 0.1, 0, 1)
mydata$YADANI <- ifelse(mydata$YADANI < 0.1, 0, 1)


#cols<- c("YADANI")
mydata_numeric <- mydata
#mydata <- mydata[,-match(cols,colnames(mydata))]

# convert to numeric

mydata_numeric$Religious <- ifelse(mydata_numeric$Religious == "Yes",1,0)
mydata_numeric$Amount_giving.f <- as.numeric(factor(mydata_numeric$Amount_giving , levels=c("Low", "Core", "Major", "Top")))
mydata_numeric$STATE_OF_DONOR.f <- as.numeric(factor(mydata_numeric$STATE_OF_DONOR , levels=c("Dormant", "Inactive", "Lapsed", "Current")))



mydata_numeric$AMOUNT_TREND.f <- as.numeric(factor(mydata_numeric$AMOUNT_TREND , levels=c("DOWN", "No Data", "NOT_STABLE", "STABLE", "UP")))

mydata_numeric$Hefresh_from_last_year.f <- as.numeric(factor(mydata_numeric$Hefresh_from_last_year , levels=c("not_active", "No Data", "not started", "started", "near_completed", "above")))

cols <- "STATE_OF_DONOR"
mydata_numeric <-  mydata_numeric[,-match(cols,colnames(mydata_numeric))]

cols <- "Amount_giving"
mydata_numeric <-  mydata_numeric[,-match(cols,colnames(mydata_numeric))]


cols <- "AMOUNT_TREND"
mydata_numeric <-  mydata_numeric[,-match(cols,colnames(mydata_numeric))]

cols <- "Hefresh_from_last_year"
mydata_numeric <-  mydata_numeric[,-match(cols,colnames(mydata_numeric))]

mydata_numeric <- as.data.frame(lapply(mydata_numeric, function(x) as.numeric(as.character(x))))

Now let's set up the model target feature and the formula.

target_feature <- "TARAM_MOKED"
fmla <- as.formula(paste(target_feature,"~.",sep=""))
target_feature_numeric <- "TARAM_MOKED"
fmla_numeric <- as.formula(paste(target_feature_numeric,"~.",sep=""))

splitting

# splitdf returns a list of training and testing sets; we use an 80/20 split
# (rather than 2/3 - 1/3) so the split keeps enough positive targets
splitdf <- function(df, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(df)
  trainindex <- sample(index, trunc(length(index)*0.8))
  trainset <- df[trainindex, ]
  testset <- df[-trainindex, ]
  list(trainset=trainset, testset=testset)
}

#apply the function
splits <- splitdf(mydata, seed=808)

# save the training and testing sets as data frames
train_set <- splits$trainset
test_set <- splits$testset

# split the numeric data frame, reusing the same (now generic) splitdf function

#apply the function
splits <- splitdf(mydata_numeric, seed=808)

# save the training and testing sets as data frames
train_set_numeric <- splits$trainset
test_set_numeric <- splits$testset
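Note that a plain random split only keeps enough positives on average; if we want to guarantee the class balance in both sets, a stratified variant is a small change. A sketch, base R only (stratified_splitdf is a hypothetical helper, not used below):

# stratified 80/20 split: sample within each TARAM_MOKED class so both
# sets keep the same share of positives
stratified_splitdf <- function(df, target = "TARAM_MOKED", seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  trainindex <- unlist(lapply(split(seq_len(nrow(df)), df[[target]]),
                              function(idx) sample(idx, trunc(length(idx) * 0.8))))
  list(trainset = df[trainindex, ], testset = df[-trainindex, ])
}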

classification models

Since this is a classification question, we will first try classification models.

Decision tree

library(rpart)   # decision trees
library(rattle)  # fancyRpartPlot()

dt_model_train <- rpart(fmla, train_set)

rctrl <- rpart.control(minbucket=20 # minimum observations in a terminal node
                       ,minsplit=20 # minimum observations in a node before a split attempt
                       ,cp=0)
dt_model_complete <- rpart(fmla, train_set, control = rctrl, method = "class")
dt_predict <- predict(dt_model_complete, test_set, type="class") # type "class" = classification
fancyRpartPlot(dt_model_complete)

Here we can see that the tree is relatively big compared to what we expected, and it might be worth trying pruning.

Let's first take a look at the tree's confusion matrix:

# Confusion matrix
confusion_matrix <- table(test_set[,target_feature],dt_predict)
confusion_matrix
##    dt_predict
##         0     1
##   0 19639    74
##   1   401   185
accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)
accuracy
## [1] 0.9765998
recall  <- confusion_matrix[2,2]/sum(confusion_matrix[2,])
recall
## [1] 0.3156997

OK, so the recall (which we want to be as high as possible, while accuracy matters less as long as it stays above 90%) is roughly 32% - not bad. Let's see if we can do some pruning.

cp <- printcp(dt_model_complete)
## 
## Classification tree:
## rpart(formula = fmla, data = train_set, method = "class", control = rctrl)
## 
## Variables actually used in tree construction:
##  [1] Account                Amount_giving          Clali                 
##  [4] email                  has_cellphone          Hefresh_from_last_year
##  [7] HOK                    Moked                  not_cash              
## [10] Religious              STATE_OF_DONOR         YADANI                
## 
## Root node error: 2296/81192 = 0.028279
## 
## n= 81192 
## 
##            CP nsplit rel error  xerror     xstd
## 1  0.03208479      0   1.00000 1.00000 0.020572
## 2  0.01916376      3   0.90375 0.90375 0.019585
## 3  0.01742160      5   0.86542 0.89547 0.019497
## 4  0.00696864      6   0.84800 0.84800 0.018986
## 5  0.00468206      7   0.84103 0.84321 0.018934
## 6  0.00435540     11   0.82230 0.83319 0.018824
## 7  0.00348432     14   0.80923 0.82883 0.018776
## 8  0.00261324     17   0.79878 0.81490 0.018621
## 9  0.00217770     19   0.79355 0.81577 0.018631
## 10 0.00195993     22   0.78702 0.81533 0.018626
## 11 0.00159698     25   0.78092 0.81141 0.018582
## 12 0.00152439     30   0.77091 0.81054 0.018572
## 13 0.00087108     32   0.76786 0.80793 0.018543
## 14 0.00072590     35   0.76524 0.81620 0.018636
## 15 0.00058072     38   0.76307 0.81838 0.018660
## 16 0.00043554     41   0.76132 0.81838 0.018660
## 17 0.00021777     45   0.75958 0.83275 0.018819
## 18 0.00000000     51   0.75828 0.82883 0.018776
plot(cp[,2],cp[,3],type="l")
lines(cp[,2],cp[,4],col="red")

We can see that the cross-validated error (xerror) bottoms out at around 32 splits and trends upward beyond that, so we will stop there.
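Rather than eyeballing the table, a common approach (a minimal sketch using the cptable that printcp already returned) is to pick the cp with the lowest cross-validated error:

# pick the complexity parameter with the minimum cross-validated error
best_cp <- cp[which.min(cp[, "xerror"]), "CP"]
best_cp

Choosing a slightly larger cp, as done below, trades a little xerror for a smaller tree.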

pruning

pruned_dt_model <- prune(dt_model_complete,cp = 0.00159698026)
fancyRpartPlot(pruned_dt_model)

redt_predict <- predict(pruned_dt_model, test_set, type="class") # predictions from the pruned tree
confusion_matrix <- table(test_set[,target_feature], redt_predict)
confusion_matrix
##    redt_predict
##         0     1
##   0 19639    74
##   1   401   185

We can see that the confusion matrix barely changed, so we made the tree simpler (and less prone to overfitting) without losing performance.

rmse_summary<-data.frame(model="DT", RMSE= rmse(test_set$TARAM_MOKED,dt_predict))

recall_summary <- data.frame(model= "DT", recall = recall)
  
rocrpred <- ROCR::prediction(as.numeric(redt_predict), test_set$TARAM_MOKED)
DT_ROC<-performance(rocrpred, "tpr", "fpr")

Naive Bayes

mydata_cat <-  as.data.frame(sapply(mydata, as.factor))
cat_features <- names(mydata_cat)[c(2,3,4,5,6,7,8,10,11,12,13,14)]


# reuse the generic splitdf function to split the categorical data frame

#apply the function
splits <- splitdf(mydata_cat, seed=808)

# save the training and testing sets as data frames
train_set <- splits$trainset
test_set <- splits$testset



library(e1071)  # naiveBayes() and svm()

NB_model <- naiveBayes(x=train_set[,cat_features], y=train_set$TARAM_MOKED)

NB_predict <- predict(NB_model, test_set[,cat_features], type="class")

# Confusion matrix
confusion_matrix_NB <- table(test_set[,target_feature], NB_predict)
confusion_matrix_NB
##    NB_predict
##         0     1
##   0 18865   848
##   1   315   271
accuracy <- sum(diag(confusion_matrix_NB))/sum(confusion_matrix_NB)
accuracy
## [1] 0.9427065
recall  <- confusion_matrix_NB[2,2]/sum(confusion_matrix_NB[2,])
recall
## [1] 0.4624573
recall_summary <- rbind(recall_summary,data.frame(model= "NB", recall = recall))

rmse_summary<-rbind(rmse_summary, data.frame(model="NB", RMSE= rmse(test_set$TARAM_MOKED,NB_predict)))

rocrpred <- ROCR::prediction(as.numeric(NB_predict), test_set$TARAM_MOKED)
NB_ROC<-performance(rocrpred, "tpr", "fpr")

SVM

numeric_features <- names(mydata_numeric)[c(1,2,3,4,6,7,8,9,10,11,12,13)]
mydata_numeric$TARAM_MOKED <- as.factor(mydata_numeric$TARAM_MOKED )
SVM_model <- svm(fmla_numeric,train_set_numeric,kernel = "polynomial", type = "C")
SVM_Predict <- predict(SVM_model,test_set_numeric)
confusion_matrix_SVM <- table(test_set_numeric$TARAM_MOKED,SVM_Predict)
confusion_matrix_SVM
##    SVM_Predict
##         0     1
##   0 19687    26
##   1   474   112
accuracy <- sum(diag(confusion_matrix_SVM))/sum(confusion_matrix_SVM)
accuracy
## [1] 0.9753682
recall  <- confusion_matrix_SVM[2,2]/sum(confusion_matrix_SVM[2,])
recall
## [1] 0.1911263
recall_summary <- rbind(recall_summary,data.frame(model= "SVM", recall = recall))
rmse_summary<-rbind(rmse_summary, data.frame(model="SVM", RMSE= rmse(test_set_numeric$TARAM_MOKED,SVM_Predict)))

rocrpred <- ROCR::prediction(as.numeric(SVM_Predict), test_set_numeric$TARAM_MOKED)
svm_ROC<-performance(rocrpred, "tpr", "fpr")

nnet

library(nnet)  # single-hidden-layer neural network

train_set_nnet <- as.data.frame(sapply(train_set_numeric[,c(1,2,3,4,6,7,8,9,10,11,12,13)], function(x) (x-min(x))/(max(x)-min(x))))
train_set_nnet$TARAM_MOKED <- as.factor(train_set_numeric[,4])
nnet_model <- nnet(fmla, data = train_set_nnet, size = 3, maxit=1000, linout = F)
## # weights:  40
## initial  value 56008.963020 
## final  value 0.000000 
## converged
test_set_nnet <- sapply(test_set_numeric[,c(1,2,3,4,6,7,8,9,10,11,12,13)], function(x) (x-min(x))/(max(x)-min(x))) 
nnet_pred <- predict(nnet_model,test_set_nnet, type = "class")

confusion_matrix_nnet <- table(test_set_numeric$TARAM_MOKED,nnet_pred)
confusion_matrix_nnet
##    nnet_pred
##         0     1
##   0 13559  6154
##   1   199   387
accuracy <- sum(diag(confusion_matrix_nnet))/sum(confusion_matrix_nnet)
accuracy
## [1] 0.6870289
recall  <- confusion_matrix_nnet[2,2]/sum(confusion_matrix_nnet[2,])
recall
## [1] 0.6604096
recall_summary <- rbind(recall_summary,data.frame(model= "nn", recall = recall))


rmse_summary<-rbind(rmse_summary, data.frame(model="nn", RMSE= rmse(test_set_numeric$TARAM_MOKED,as.numeric(nnet_pred))))

rocrpred <- ROCR::prediction(as.numeric(nnet_pred), test_set_numeric$TARAM_MOKED)
nnet_ROC<-performance(rocrpred, "tpr", "fpr")

It is important to pay attention here: we got 66% recall, but at what price - only 69% accuracy.
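One way to manage this trade-off instead of simply accepting it: nnet can return raw probabilities, so we can move the decision threshold ourselves. A sketch (the 0.7 cutoff is illustrative only):

# get raw probabilities instead of hard classes, then choose our own cutoff
nnet_prob <- predict(nnet_model, test_set_nnet, type = "raw")
threshold <- 0.7  # raising the cutoff trades recall for accuracy
nnet_pred_tuned <- ifelse(nnet_prob >= threshold, 1, 0)
table(test_set_numeric$TARAM_MOKED, nnet_pred_tuned)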

random model

n<-nrow(test_set)
percentOf1<-nrow(test_set[test_set$TARAM_MOKED==1,])/n

#Create normal distribution of random numbers between 0 and 1 (the random numbers simulate
#probability of "1"), and find X value which represents percent of "1" in test set
set.seed(n) 
normDistribution<-rnorm(n, mean = 0, sd=1)
minND<-min(normDistribution)
maxND<-max(normDistribution)
normDistribution<- sapply(normDistribution, function(x) (x-minND)/(maxND-minND))
X<-qnorm(1-percentOf1, mean = mean(normDistribution), sd = sd(normDistribution))
#plot( density(normDistribution))

# Apply "1" to numbers with simulated probability equal or more than X value, "0" for the rest
NoModel<-sapply(normDistribution, function(x) if (x>=X) 1 else 0 )
  
confusion_matrix <- table(test_set[,target_feature],NoModel)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")

recall.random<-confusion_matrix[2,2]/sum(confusion_matrix[2,])
accuracy.random<-sum(diag(confusion_matrix))/sum(confusion_matrix)

noModel_predict <-ROCR::prediction(as.numeric(normDistribution), test_set$TARAM_MOKED)
noModel_ROC<-performance(noModel_predict, "tpr", "fpr")
confusion_matrix
##           NoModel
##            predicted 0 predicted 1
##   actual 0       19179         534
##   actual 1         566          20
colors <- c('black','red','blue','green', 'orange')
legends_list <- c("noModel", "nnet", "SVM", "Decision tree", "Naive Bayes")
plot.new()
plot(noModel_ROC)
plot(nnet_ROC,add =TRUE, col = "red")
plot(svm_ROC,add =TRUE, col = "blue")
plot(DT_ROC,add =TRUE, col = "green")
plot(NB_ROC,add =TRUE, col = "orange")
legend(x = "topright",legend = legends_list, fill   = colors)

print(rmse_summary)
##   model      RMSE
## 1    DT 0.1529711
## 2    NB 0.2393605
## 3   SVM 0.9901978
## 4    nn 0.5594382
print(recall_summary)
##   model    recall
## 1    DT 0.3156997
## 2    NB 0.4624573
## 3   SVM 0.1911263
## 4    nn 0.6604096

So in fact we have a winner. The Naive Bayes model had the best practical recall (it costs some accuracy relative to the decision tree, but we gain many more donors that way). The nnet model reached the highest raw recall, but at a heavy cost in accuracy, and that trade-off is probably not cost-effective. Next summer we will try the Naive Bayes model as a pilot; let's hope it brings added value to the call center.
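As a closing sanity check on the business value, precision is worth reporting alongside recall: from the Naive Bayes confusion matrix above, roughly one in four donors the model flags actually donates, versus a ~3% base rate. A minimal computation using objects already defined:

# precision and lift of the Naive Bayes model, from the confusion matrix above
precision_NB <- confusion_matrix_NB[2,2] / sum(confusion_matrix_NB[,2])  # ~0.24
baseline     <- sum(confusion_matrix_NB[2,]) / sum(confusion_matrix_NB)  # ~0.029
precision_NB / baseline  # each flagged call is roughly 8x more likely to convert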