Load Packages and Dataset
Packages used to import and analyse the data include class, dplyr, googlesheet4.
class - various functions for classification, including k-nearest neighbor, learning vector quantification and self-organizing maps.
googlesheet4 - used for importing the next_sign dataset built in googlesheet.
used write csv to save a copy of the next_sign dataframe into local drive to make the data-read faster for re-run
Theories
regression method - one of ML methods
linear regression - y (dependent variable) and x (independent variable)
logistic regression - curve model the points output between zero and one. family = binomial in the glm function
Some Coding into it
- Build a logistic regression model
Examine the dataset to identify potential independent variables
str(donors)## 'data.frame': 93462 obs. of 13 variables:
## $ donated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ veteran : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bad_address : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 60 46 NA 70 78 NA 38 NA NA 65 ...
## $ has_children : int 0 1 0 0 1 0 1 0 0 0 ...
## $ wealth_rating : int 0 3 1 2 1 0 2 3 1 0 ...
## $ interest_veterans: int 0 0 0 0 0 0 0 0 0 0 ...
## $ interest_religion: int 0 0 0 0 1 0 0 0 0 0 ...
## $ pet_owner : int 0 0 0 0 0 0 1 0 0 0 ...
## $ catalog_shopper : int 0 0 0 0 1 0 0 0 0 0 ...
## $ recency : Factor w/ 2 levels "CURRENT","LAPSED": 1 1 1 1 1 1 1 1 1 1 ...
## $ frequency : Factor w/ 2 levels "FREQUENT","INFREQUENT": 1 1 1 1 1 2 2 1 2 2 ...
## $ money : Factor w/ 2 levels "HIGH","MEDIUM": 2 1 2 2 2 2 2 2 2 2 ...
Explore the dependent variable
table(donors$donated)##
## 0 1
## 88751 4711
Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans,
data = donors, family = "binomial")Summarize the model results
summary(donation_model)##
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans,
## family = "binomial", data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3480 -0.3192 -0.3192 -0.3192 2.5678
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.95139 0.01652 -178.664 <2e-16 ***
## bad_address -0.30780 0.14348 -2.145 0.0319 *
## interest_religion 0.06724 0.05069 1.327 0.1847
## interest_veterans 0.11009 0.04676 2.354 0.0186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 37316 on 93458 degrees of freedom
## AIC: 37324
##
## Number of Fisher Scoring iterations: 5
- Make some predictions
Estimate the donation probability
donors$donation_prob <- predict(donation_model, donors, type = "response")Find the donation probability of the average prospect
mean(donors$donated)## [1] 0.05040551
Predict a donation if probability of donation is greater than average
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)Calculate the model’s accuracy, and as it shows below, it is about 80%
mean(donors$donated == donors$donation_pred)## [1] 0.794815
But think from another perspective, despite this relatively high accuracy, the result is misleading due to the rarity of outcome being predicted. The accuracy would been around 95% if a model just simply predicted “no donation” for each person.
mean(donors$donated == 0)## [1] 0.9495945
prop.table(table(donors$donated==0, donors$donated ==1))##
## FALSE TRUE
## FALSE 0.00000000 0.05040551
## TRUE 0.94959449 0.00000000
Model Performance trade-offs
When one outcome is very rare, predicting the opposite can result in very high accuracy. IN this case, it might be necessary to sacrifice a bit of overall accuracy in order to better target the outcome of interest.
- ROC curves - model’s ability to distinguish between positive and negative predictions the outcome of interest versus all others.
- AUC (area under the curve) is used to quantify the area under the ROC curve.
Plot some curves
First of all, have to mutate one column into the original donors’ dataset.
donors_with_Prob <- donors %>%
mutate(donation_prob = predict(donation_model, donors, type = "response"))Check if it is in the form we would like to have - confirm it is.
head(donors_with_Prob)## donated veteran bad_address age has_children wealth_rating interest_veterans
## 1 0 0 0 60 0 0 0
## 2 0 0 0 46 1 3 0
## 3 0 0 0 NA 0 1 0
## 4 0 0 0 70 0 2 0
## 5 0 0 0 78 1 1 0
## 6 0 0 0 NA 0 0 0
## interest_religion pet_owner catalog_shopper recency frequency money
## 1 0 0 0 CURRENT FREQUENT MEDIUM
## 2 0 0 0 CURRENT FREQUENT HIGH
## 3 0 0 0 CURRENT FREQUENT MEDIUM
## 4 0 0 0 CURRENT FREQUENT MEDIUM
## 5 1 0 1 CURRENT FREQUENT MEDIUM
## 6 0 0 0 CURRENT INFREQUENT MEDIUM
## donation_prob donation_pred
## 1 0.04967101 0
## 2 0.04967101 0
## 3 0.04967101 0
## 4 0.04967101 0
## 5 0.05294280 1
## 6 0.04967101 0
Now we can go with some plotting. and library pROC has already been loaded.
- Create a ROC curve with roc() and the columns of actual and predicted donations. Store the result as ROC.
ROC <- donors_with_Prob %>%
roc(donated, donation_prob)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
- Plot the ROC curve. The plotted ROC curve is very close to the baseline and it shows the model is actually just making predictions at random.
plot(ROC, col = "blue")- AUC - Calculate the area under the ROC curve. And note that when AUC values are very close, it is important to know more about how the model will be used. Simply the better model cannot be decided alone by AUC or ROC.
auc(ROC)## Area under the curve: 0.5102
Dummy, missing data and interactions
imputing missing data - with a guess value i.e. mean imputation, or use model to predict the missing data
interaction effects - basically as learned in previous chapter, here we replace the plus sign in the linear model with the multiple sign. Their combination may strengthen, weaken, or completely eliminate the impact of the individual predictors. (important to remember that the combination is not necessarily strengthen the impact of prediction but it depends on the situation)
sometimes a dataset contains numeric values that represent a categorical feature. So in this case, we want to revise or coerce them into factors (i.e. not leaving them as they are as numeric)
in binomial regression, often used “relevel” which means - the levels of a factor are re-ordered so that the level specified by ref is first and the others are moved down. The variable used as reference level (i.e. reference category 1) is not shown in the linear model results.
Some coding exercise on dummy number and dealing with missing values
RE-level independent variables
- check on wealth rating column in donors dataset.
donors %>%
distinct(wealth_rating)## wealth_rating
## 1 0
## 2 3
## 3 1
## 4 2
- convert the wealth rating to a factor
donors$wealth_levels <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))- use relevel() to change the reference category to Medium.
donors$wealth_levels <- relevel(donors$wealth_levels, ref = "Medium")- see how the factor coding impacts the model results. below we build a logistic regression model using column wealth_levels. as we can see from the summary of model results, medium level is not shown as it is regarded as the reference level.
summary(glm(donated ~ wealth_levels, family = "binomial", data = donors))##
## Call:
## glm(formula = donated ~ wealth_levels, family = "binomial", data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3320 -0.3243 -0.3175 -0.3175 2.4582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.91894 0.03614 -80.772 <2e-16 ***
## wealth_levelsUnknown -0.04373 0.04243 -1.031 0.303
## wealth_levelsLow -0.05245 0.05332 -0.984 0.325
## wealth_levelsHigh 0.04804 0.04768 1.008 0.314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 37323 on 93458 degrees of freedom
## AIC: 37331
##
## Number of Fisher Scoring iterations: 5
Dealing with missing data
note that R will exclude any cases with NA values when building a regression model.
workaround solution is to impute the missing values with an estimated value - in this case, we will the average value. (not the only option, not necessarily the best option). It is just one way to handle missing data. Sometimes missing data has to be dealt with using more complicated methods.
Find the average age among non-missing values with summary function - as it automatically does not include missing values in calculation.
summary(donors$age)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 48.00 62.00 61.65 75.00 98.00 22546
alternatively we can use mean function directly and with additional arguments. also if want to preserve two decimal places, we can use round function together with mean function.
mean_age <- round(mean(donors$age, na.rm = TRUE),2)Impute missing age values with the mean age. Usage of ifelse is very similar to if function in excel. “is.na” is a useful function to deal with missing values.
donors$imputed_age <- ifelse(is.na(donors$age), mean_age, donors$age)Create a missing data indicator to model the possibility that cases with missing data are different in some way from those without. This missing value indicator provides a reminder that before imputation there was a missing value present on the record - useful to include this indicator because: 1. missing value may represent a unique category by itself, 2. there may be an important difference between records with and without missing data 3. whatever caused the missing value may also be related to the outcome.
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)Fun to Include more Independent Variables
There are other categorical values such as recency, frequency, money (as R/F/M in marketing terms). Combination of recency and frequency could have a larger impact than each of the separate effect.
- Create a logistic regression model of donated as a function of money plus the interaction of recency and frequency. Used * to add the interaction term.
rfm_model <- glm(donated ~ money + recency * frequency, family = "binomial", data = donors)- examine the summary of the model and confirm interaction term is added
summary(rfm_model)##
## Call:
## glm(formula = donated ~ money + recency * frequency, family = "binomial",
## data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3696 -0.3696 -0.2895 -0.2895 2.7924
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.01142 0.04279 -70.375 <2e-16 ***
## moneyMEDIUM 0.36186 0.04300 8.415 <2e-16 ***
## recencyLAPSED -0.86677 0.41434 -2.092 0.0364 *
## frequencyINFREQUENT -0.50148 0.03107 -16.143 <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT 1.01787 0.51713 1.968 0.0490 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 36938 on 93457 degrees of freedom
## AIC: 36948
##
## Number of Fisher Scoring iterations: 6
- predict the probabilities. important to note is to include response term in type argument.
rfm_prob <- predict(rfm_model, donors, type = "response")- plot the ROC curves - using the pROC package. recall the first argument for roc function is the predicted (dependent variable), and compute AUC values. Plot shows a “slightly” better fit than the original prediction as shown in this section: Plot some curves
ROC <- roc(donors$donated, rfm_prob)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")auc(ROC)## Area under the curve: 0.5785
Automatic feature selection
unlike other ML models, regression asks human to specify the model’s predictors ahead of time. that means we have to identity the variables that might be impacting the result (need reasonable level of knowledge/insight if do not want to randomly through in variables for trial and error practice)
stepwise regression - backward stepwise (starting from all factors and eliminate one by one), forward stepwise (opposite to backward stepwise) - please note possible two methods come to complete different versions of most important predictors to the model
neither of them is guaranteed to find the best possible model - which could over or under estimate the importance of some of the predictors
but stepwise method allows the model to be built in the absence of theory or even common sense
Forward stepwise model demo
- add in two more columns in the donors dataset
donors_complete <- donors %>%
mutate(imputed_age = ifelse(is.na(donors$age), mean_age, donors$age)) %>%
mutate(missing_age = ifelse(is.na(donors$age), 1, 0)) %>%
mutate(wealth_rating = factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))) %>%
select(-age, -wealth_levels)- specify a null model with no predictors.
null_model <- glm(donated ~ 1, data = donors_complete, family = "binomial")- Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors_complete, family = "binomial")- Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")## Start: AIC=37332.13
## donated ~ 1
##
## Df Deviance AIC
## + frequency 1 37021 37025
## + money 1 37209 37213
## + donation_prob 1 37317 37321
## + donation_pred 1 37321 37325
## + pet_owner 1 37322 37326
## + interest_veterans 1 37323 37327
## + has_children 1 37323 37327
## + imputed_age 1 37323 37327
## + missing_age 1 37325 37329
## + bad_address 1 37325 37329
## + catalog_shopper 1 37327 37331
## + interest_religion 1 37327 37331
## + wealth_rating 3 37323 37331
## + recency 1 37328 37332
## <none> 37330 37332
## + veteran 1 37330 37334
##
## Step: AIC=37024.77
## donated ~ frequency
##
## Df Deviance AIC
## + money 1 36944 36950
## + donation_prob 1 37008 37014
## + donation_pred 1 37014 37020
## + pet_owner 1 37014 37020
## + has_children 1 37014 37020
## + bad_address 1 37014 37020
## + interest_veterans 1 37015 37021
## + missing_age 1 37015 37021
## + wealth_rating 3 37012 37022
## + catalog_shopper 1 37018 37024
## + interest_religion 1 37018 37024
## + imputed_age 1 37018 37024
## + recency 1 37018 37024
## <none> 37021 37025
## + veteran 1 37021 37027
##
## Step: AIC=36949.71
## donated ~ frequency + money
##
## Df Deviance AIC
## + donation_prob 1 36933 36941
## + has_children 1 36936 36944
## + pet_owner 1 36937 36945
## + wealth_rating 3 36933 36945
## + donation_pred 1 36938 36946
## + missing_age 1 36938 36946
## + bad_address 1 36938 36946
## + interest_veterans 1 36938 36946
## + catalog_shopper 1 36941 36949
## + imputed_age 1 36941 36949
## + interest_religion 1 36942 36950
## <none> 36944 36950
## + recency 1 36942 36950
## + veteran 1 36944 36952
##
## Step: AIC=36940.52
## donated ~ frequency + money + donation_prob
##
## Df Deviance AIC
## + has_children 1 36924 36934
## + wealth_rating 3 36923 36937
## + missing_age 1 36928 36938
## + pet_owner 1 36930 36940
## + imputed_age 1 36930 36940
## <none> 36933 36941
## + recency 1 36931 36941
## + catalog_shopper 1 36932 36942
## + bad_address 1 36932 36942
## + interest_religion 1 36932 36942
## + interest_veterans 1 36932 36942
## + donation_pred 1 36932 36942
## + veteran 1 36932 36942
##
## Step: AIC=36933.52
## donated ~ frequency + money + donation_prob + has_children
##
## Df Deviance AIC
## + wealth_rating 3 36913 36929
## + missing_age 1 36917 36929
## + pet_owner 1 36919 36931
## <none> 36924 36934
## + recency 1 36922 36934
## + catalog_shopper 1 36922 36934
## + imputed_age 1 36923 36935
## + bad_address 1 36923 36935
## + interest_veterans 1 36923 36935
## + veteran 1 36923 36935
## + interest_religion 1 36924 36936
## + donation_pred 1 36924 36936
##
## Step: AIC=36928.7
## donated ~ frequency + money + donation_prob + has_children +
## wealth_rating
##
## Df Deviance AIC
## + missing_age 1 36908 36926
## + pet_owner 1 36909 36927
## <none> 36913 36929
## + recency 1 36911 36929
## + catalog_shopper 1 36912 36930
## + imputed_age 1 36912 36930
## + bad_address 1 36912 36930
## + interest_veterans 1 36913 36931
## + veteran 1 36913 36931
## + donation_pred 1 36913 36931
## + interest_religion 1 36913 36931
##
## Step: AIC=36926.04
## donated ~ frequency + money + donation_prob + has_children +
## wealth_rating + missing_age
##
## Df Deviance AIC
## + pet_owner 1 36905 36925
## <none> 36908 36926
## + recency 1 36906 36926
## + catalog_shopper 1 36907 36927
## + imputed_age 1 36907 36927
## + bad_address 1 36908 36928
## + interest_veterans 1 36908 36928
## + donation_pred 1 36908 36928
## + veteran 1 36908 36928
## + interest_religion 1 36908 36928
##
## Step: AIC=36924.6
## donated ~ frequency + money + donation_prob + has_children +
## wealth_rating + missing_age + pet_owner
##
## Df Deviance AIC
## <none> 36905 36925
## + recency 1 36903 36925
## + imputed_age 1 36904 36926
## + bad_address 1 36904 36926
## + donation_pred 1 36904 36926
## + interest_veterans 1 36904 36926
## + catalog_shopper 1 36904 36926
## + interest_religion 1 36904 36926
## + veteran 1 36905 36927
- Estimate the stepwise donation probability
step_prob <- predict(step_model, donors_complete, type = "response")- Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors_complete$donated, step_prob)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")auc(ROC)## Area under the curve: 0.5872