ML_Binary Predictions

Load Packages and Dataset

Packages used to import and analyse the data include class, dplyr, googlesheet4.

class - various functions for classification, including k-nearest neighbor, learning vector quantification and self-organizing maps.
googlesheet4 - used for importing the next_sign dataset built in googlesheet.
used write csv to save a copy of the next_sign dataframe into local drive to make the data-read faster for re-run

Theories

regression method - one of ML methods
linear regression - y (dependent variable) and x (independent variable)
logistic regression - curve model the points output between zero and one. family = binomial in the glm function

Some Coding into it

Build a logistic regression model

Examine the dataset to identify potential independent variables

str(donors)

## 'data.frame':    93462 obs. of  13 variables:
##  $ donated          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ veteran          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bad_address      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age              : int  60 46 NA 70 78 NA 38 NA NA 65 ...
##  $ has_children     : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ wealth_rating    : int  0 3 1 2 1 0 2 3 1 0 ...
##  $ interest_veterans: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ interest_religion: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pet_owner        : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ catalog_shopper  : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ recency          : Factor w/ 2 levels "CURRENT","LAPSED": 1 1 1 1 1 1 1 1 1 1 ...
##  $ frequency        : Factor w/ 2 levels "FREQUENT","INFREQUENT": 1 1 1 1 1 2 2 1 2 2 ...
##  $ money            : Factor w/ 2 levels "HIGH","MEDIUM": 2 1 2 2 2 2 2 2 2 2 ...

Explore the dependent variable

table(donors$donated)

## 
##     0     1 
## 88751  4711

Build the donation model

donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans, 
                      data = donors, family = "binomial")

Summarize the model results

summary(donation_model)

## 
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5

Make some predictions

Estimate the donation probability

donors$donation_prob <- predict(donation_model, donors, type = "response")

Find the donation probability of the average prospect

mean(donors$donated)

## [1] 0.05040551

Predict a donation if probability of donation is greater than average

donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)

Calculate the model’s accuracy, and as it shows below, it is about 80%

mean(donors$donated == donors$donation_pred)

## [1] 0.794815

But think from another perspective, despite this relatively high accuracy, the result is misleading due to the rarity of outcome being predicted. The accuracy would been around 95% if a model just simply predicted “no donation” for each person.

mean(donors$donated == 0)

## [1] 0.9495945

prop.table(table(donors$donated==0, donors$donated ==1))

##        
##              FALSE       TRUE
##   FALSE 0.00000000 0.05040551
##   TRUE  0.94959449 0.00000000

Model Performance trade-offs

When one outcome is very rare, predicting the opposite can result in very high accuracy. IN this case, it might be necessary to sacrifice a bit of overall accuracy in order to better target the outcome of interest.

ROC curves - model’s ability to distinguish between positive and negative predictions the outcome of interest versus all others.
AUC (area under the curve) is used to quantify the area under the ROC curve.

Plot some curves

First of all, have to mutate one column into the original donors’ dataset.

donors_with_Prob <- donors %>% 
  mutate(donation_prob = predict(donation_model, donors, type = "response"))

Check if it is in the form we would like to have - confirm it is.

head(donors_with_Prob)

##   donated veteran bad_address age has_children wealth_rating interest_veterans
## 1       0       0           0  60            0             0                 0
## 2       0       0           0  46            1             3                 0
## 3       0       0           0  NA            0             1                 0
## 4       0       0           0  70            0             2                 0
## 5       0       0           0  78            1             1                 0
## 6       0       0           0  NA            0             0                 0
##   interest_religion pet_owner catalog_shopper recency  frequency  money
## 1                 0         0               0 CURRENT   FREQUENT MEDIUM
## 2                 0         0               0 CURRENT   FREQUENT   HIGH
## 3                 0         0               0 CURRENT   FREQUENT MEDIUM
## 4                 0         0               0 CURRENT   FREQUENT MEDIUM
## 5                 1         0               1 CURRENT   FREQUENT MEDIUM
## 6                 0         0               0 CURRENT INFREQUENT MEDIUM
##   donation_prob donation_pred
## 1    0.04967101             0
## 2    0.04967101             0
## 3    0.04967101             0
## 4    0.04967101             0
## 5    0.05294280             1
## 6    0.04967101             0

Now we can go with some plotting. and library pROC has already been loaded.

Create a ROC curve with roc() and the columns of actual and predicted donations. Store the result as ROC.

ROC <- donors_with_Prob %>% 
  roc(donated, donation_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

Plot the ROC curve. The plotted ROC curve is very close to the baseline and it shows the model is actually just making predictions at random.

plot(ROC, col = "blue")

AUC - Calculate the area under the ROC curve. And note that when AUC values are very close, it is important to know more about how the model will be used. Simply the better model cannot be decided alone by AUC or ROC.

auc(ROC)

## Area under the curve: 0.5102

Dummy, missing data and interactions

imputing missing data - with a guess value i.e. mean imputation, or use model to predict the missing data
interaction effects - basically as learned in previous chapter, here we replace the plus sign in the linear model with the multiple sign. Their combination may strengthen, weaken, or completely eliminate the impact of the individual predictors. (important to remember that the combination is not necessarily strengthen the impact of prediction but it depends on the situation)
sometimes a dataset contains numeric values that represent a categorical feature. So in this case, we want to revise or coerce them into factors (i.e. not leaving them as they are as numeric)
in binomial regression, often used “relevel” which means - the levels of a factor are re-ordered so that the level specified by ref is first and the others are moved down. The variable used as reference level (i.e. reference category 1) is not shown in the linear model results.

Some coding exercise on dummy number and dealing with missing values

RE-level independent variables

check on wealth rating column in donors dataset.

donors %>% 
  distinct(wealth_rating)

##   wealth_rating
## 1             0
## 2             3
## 3             1
## 4             2

convert the wealth rating to a factor

donors$wealth_levels <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))

use relevel() to change the reference category to Medium.

donors$wealth_levels <- relevel(donors$wealth_levels, ref = "Medium")

see how the factor coding impacts the model results. below we build a logistic regression model using column wealth_levels. as we can see from the summary of model results, medium level is not shown as it is regarded as the reference level.

summary(glm(donated ~ wealth_levels, family = "binomial", data = donors))

## 
## Call:
## glm(formula = donated ~ wealth_levels, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3320  -0.3243  -0.3175  -0.3175   2.4582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
## wealth_levelsUnknown -0.04373    0.04243  -1.031    0.303    
## wealth_levelsLow     -0.05245    0.05332  -0.984    0.325    
## wealth_levelsHigh     0.04804    0.04768   1.008    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37323  on 93458  degrees of freedom
## AIC: 37331
## 
## Number of Fisher Scoring iterations: 5

Dealing with missing data

note that R will exclude any cases with NA values when building a regression model.
workaround solution is to impute the missing values with an estimated value - in this case, we will the average value. (not the only option, not necessarily the best option). It is just one way to handle missing data. Sometimes missing data has to be dealt with using more complicated methods.

Find the average age among non-missing values with summary function - as it automatically does not include missing values in calculation.

summary(donors$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   48.00   62.00   61.65   75.00   98.00   22546

alternatively we can use mean function directly and with additional arguments. also if want to preserve two decimal places, we can use round function together with mean function.

mean_age <- round(mean(donors$age, na.rm = TRUE),2)

Impute missing age values with the mean age. Usage of ifelse is very similar to if function in excel. “is.na” is a useful function to deal with missing values.

donors$imputed_age <- ifelse(is.na(donors$age), mean_age, donors$age)

Create a missing data indicator to model the possibility that cases with missing data are different in some way from those without. This missing value indicator provides a reminder that before imputation there was a missing value present on the record - useful to include this indicator because: 1. missing value may represent a unique category by itself, 2. there may be an important difference between records with and without missing data 3. whatever caused the missing value may also be related to the outcome.

donors$missing_age <- ifelse(is.na(donors$age), 1, 0)

Fun to Include more Independent Variables

There are other categorical values such as recency, frequency, money (as R/F/M in marketing terms). Combination of recency and frequency could have a larger impact than each of the separate effect.

Create a logistic regression model of donated as a function of money plus the interaction of recency and frequency. Used * to add the interaction term.

rfm_model <- glm(donated ~ money + recency * frequency, family = "binomial", data = donors)

examine the summary of the model and confirm interaction term is added

summary(rfm_model)

## 
## Call:
## glm(formula = donated ~ money + recency * frequency, family = "binomial", 
##     data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3696  -0.3696  -0.2895  -0.2895   2.7924  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.01142    0.04279 -70.375   <2e-16 ***
## moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16 ***
## recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364 *  
## frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 36938  on 93457  degrees of freedom
## AIC: 36948
## 
## Number of Fisher Scoring iterations: 6

predict the probabilities. important to note is to include response term in type argument.

rfm_prob <- predict(rfm_model, donors, type = "response")

plot the ROC curves - using the pROC package. recall the first argument for roc function is the predicted (dependent variable), and compute AUC values. Plot shows a “slightly” better fit than the original prediction as shown in this section: Plot some curves

ROC <- roc(donors$donated, rfm_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(ROC, col = "red")

auc(ROC)

## Area under the curve: 0.5785

Automatic feature selection

unlike other ML models, regression asks human to specify the model’s predictors ahead of time. that means we have to identity the variables that might be impacting the result (need reasonable level of knowledge/insight if do not want to randomly through in variables for trial and error practice)

stepwise regression - backward stepwise (starting from all factors and eliminate one by one), forward stepwise (opposite to backward stepwise) - please note possible two methods come to complete different versions of most important predictors to the model
neither of them is guaranteed to find the best possible model - which could over or under estimate the importance of some of the predictors
but stepwise method allows the model to be built in the absence of theory or even common sense

Forward stepwise model demo

add in two more columns in the donors dataset

donors_complete <- donors %>% 
  mutate(imputed_age = ifelse(is.na(donors$age), mean_age, donors$age)) %>% 
  mutate(missing_age = ifelse(is.na(donors$age), 1, 0)) %>% 
  mutate(wealth_rating = factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))) %>% 
  select(-age, -wealth_levels)

specify a null model with no predictors.

null_model <- glm(donated ~ 1, data = donors_complete, family = "binomial")

Specify the full model using all of the potential predictors

full_model <- glm(donated ~ ., data = donors_complete, family = "binomial")

Use a forward stepwise algorithm to build a parsimonious model

step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")

## Start:  AIC=37332.13
## donated ~ 1
## 
##                     Df Deviance   AIC
## + frequency          1    37021 37025
## + money              1    37209 37213
## + donation_prob      1    37317 37321
## + donation_pred      1    37321 37325
## + pet_owner          1    37322 37326
## + interest_veterans  1    37323 37327
## + has_children       1    37323 37327
## + imputed_age        1    37323 37327
## + missing_age        1    37325 37329
## + bad_address        1    37325 37329
## + catalog_shopper    1    37327 37331
## + interest_religion  1    37327 37331
## + wealth_rating      3    37323 37331
## + recency            1    37328 37332
## <none>                    37330 37332
## + veteran            1    37330 37334
## 
## Step:  AIC=37024.77
## donated ~ frequency
## 
##                     Df Deviance   AIC
## + money              1    36944 36950
## + donation_prob      1    37008 37014
## + donation_pred      1    37014 37020
## + pet_owner          1    37014 37020
## + has_children       1    37014 37020
## + bad_address        1    37014 37020
## + interest_veterans  1    37015 37021
## + missing_age        1    37015 37021
## + wealth_rating      3    37012 37022
## + catalog_shopper    1    37018 37024
## + interest_religion  1    37018 37024
## + imputed_age        1    37018 37024
## + recency            1    37018 37024
## <none>                    37021 37025
## + veteran            1    37021 37027
## 
## Step:  AIC=36949.71
## donated ~ frequency + money
## 
##                     Df Deviance   AIC
## + donation_prob      1    36933 36941
## + has_children       1    36936 36944
## + pet_owner          1    36937 36945
## + wealth_rating      3    36933 36945
## + donation_pred      1    36938 36946
## + missing_age        1    36938 36946
## + bad_address        1    36938 36946
## + interest_veterans  1    36938 36946
## + catalog_shopper    1    36941 36949
## + imputed_age        1    36941 36949
## + interest_religion  1    36942 36950
## <none>                    36944 36950
## + recency            1    36942 36950
## + veteran            1    36944 36952
## 
## Step:  AIC=36940.52
## donated ~ frequency + money + donation_prob
## 
##                     Df Deviance   AIC
## + has_children       1    36924 36934
## + wealth_rating      3    36923 36937
## + missing_age        1    36928 36938
## + pet_owner          1    36930 36940
## + imputed_age        1    36930 36940
## <none>                    36933 36941
## + recency            1    36931 36941
## + catalog_shopper    1    36932 36942
## + bad_address        1    36932 36942
## + interest_religion  1    36932 36942
## + interest_veterans  1    36932 36942
## + donation_pred      1    36932 36942
## + veteran            1    36932 36942
## 
## Step:  AIC=36933.52
## donated ~ frequency + money + donation_prob + has_children
## 
##                     Df Deviance   AIC
## + wealth_rating      3    36913 36929
## + missing_age        1    36917 36929
## + pet_owner          1    36919 36931
## <none>                    36924 36934
## + recency            1    36922 36934
## + catalog_shopper    1    36922 36934
## + imputed_age        1    36923 36935
## + bad_address        1    36923 36935
## + interest_veterans  1    36923 36935
## + veteran            1    36923 36935
## + interest_religion  1    36924 36936
## + donation_pred      1    36924 36936
## 
## Step:  AIC=36928.7
## donated ~ frequency + money + donation_prob + has_children + 
##     wealth_rating
## 
##                     Df Deviance   AIC
## + missing_age        1    36908 36926
## + pet_owner          1    36909 36927
## <none>                    36913 36929
## + recency            1    36911 36929
## + catalog_shopper    1    36912 36930
## + imputed_age        1    36912 36930
## + bad_address        1    36912 36930
## + interest_veterans  1    36913 36931
## + veteran            1    36913 36931
## + donation_pred      1    36913 36931
## + interest_religion  1    36913 36931
## 
## Step:  AIC=36926.04
## donated ~ frequency + money + donation_prob + has_children + 
##     wealth_rating + missing_age
## 
##                     Df Deviance   AIC
## + pet_owner          1    36905 36925
## <none>                    36908 36926
## + recency            1    36906 36926
## + catalog_shopper    1    36907 36927
## + imputed_age        1    36907 36927
## + bad_address        1    36908 36928
## + interest_veterans  1    36908 36928
## + donation_pred      1    36908 36928
## + veteran            1    36908 36928
## + interest_religion  1    36908 36928
## 
## Step:  AIC=36924.6
## donated ~ frequency + money + donation_prob + has_children + 
##     wealth_rating + missing_age + pet_owner
## 
##                     Df Deviance   AIC
## <none>                    36905 36925
## + recency            1    36903 36925
## + imputed_age        1    36904 36926
## + bad_address        1    36904 36926
## + donation_pred      1    36904 36926
## + interest_veterans  1    36904 36926
## + catalog_shopper    1    36904 36926
## + interest_religion  1    36904 36926
## + veteran            1    36905 36927

Estimate the stepwise donation probability

step_prob <- predict(step_model, donors_complete, type = "response")

Plot the ROC of the stepwise model

library(pROC)
ROC <- roc(donors_complete$donated, step_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(ROC, col = "red")

auc(ROC)

## Area under the curve: 0.5872