1. Project Background

1.0 Introduction

A general challenge for freemium businesses is getting people to pay for premium when they can also use the service for free (at the cost of, say, seeing ads or missing some features). However, these free users (the candidates for conversion) are not as profitable as premium users, because competition for brands' ad dollars is intense. HighNote would like to use a data-driven approach to convert more of these candidates into premium users. They are willing to offer a two-month free trial, but they do not know whom to target with this promotion. They have hired you as a consultant to help them design a targeting strategy.

1.1 Objective

Compare different binary classification models and use the best one to predict the 1,000 candidates most likely to convert to premium.
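
Once a final model is chosen, the deliverable is a ranked list. A minimal sketch of that scoring step, assuming a hypothetical fitted caret model best_fit and a scoring frame highnote_x_scored prepared exactly like the training predictors (normalized and one-hot encoded, as in Section 2):

# hedged sketch: rank users by predicted conversion probability and keep
# the 1,000 most likely converters; best_fit and highnote_x_scored are
# placeholders, not objects created elsewhere in this report
conv_prob <- predict(best_fit, newdata = highnote_x_scored, type = "prob")[, "1"]
top_1000 <- highnote %>%
  mutate(p_convert = conv_prob) %>%
  filter(adopter == "0") %>%   # only target current free users
  arrange(desc(p_convert)) %>%
  slice_head(n = 1000)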

2. Classification

2.1 Libraries & Packages

The following libraries are used:

library("tidyverse")
library("skimr")
library("readxl") # used to read excel files
library("dplyr") # used for data munging 
library("FNN") # used for knn regression (knn.reg function)
library("caret") # used for various predictive models
library("class") # for using confusion matrix function
library("rpart.plot") # used to plot decision tree
library("rpart")  # used for Regression tree
library('NeuralNetTools') # used to plot Neural Networks
library("PRROC") # top plot ROC curve
library("ROCR") # top plot lift curve

2.2 Data Preparation for Modeling:

# load the data set
# read the CSV file into a data frame 'highnote'
highnote <- read_csv("HN_data_PostModule.csv", 
                     col_types = "cnfnnnnnnnnnnnfnnnnnnnnnnff")

# look at all the variables by using the skim function
skim(highnote)
Data summary
Name highnote
Number of rows 107213
Number of columns 27
_______________________
Column type frequency:
character 1
factor 4
numeric 22
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
net_user 0 1 1 19 0 107205 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
male 0 1 FALSE 3 1: 42548, NA: 38950, 0: 25715
good_country 0 1 FALSE 3 0: 43025, NA: 39155, 1: 25033
delta1_good_country 0 1 FALSE 4 0: 67728, NA: 39393, 1: 57, -1: 35
adopter 0 1 FALSE 2 0: 100000, 1: 7213

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 50859 0.53 24.39 6.84 8 20.00 23.00 27.00 79.0 ▇▇▁▁▁
friend_cnt 1 1.00 12.23 48.19 0 1.00 3.00 10.00 5089.0 ▇▁▁▁▁
avg_friend_age 21626 0.80 24.61 5.73 8 20.99 23.33 27.00 79.0 ▆▇▁▁▁
avg_friend_male 16517 0.85 0.63 0.35 0 0.40 0.67 1.00 1.0 ▃▂▃▃▇
friend_country_cnt 1 1.00 2.79 4.98 0 1.00 1.00 3.00 136.0 ▇▁▁▁▁
subscriber_friend_cnt 1 1.00 0.33 2.12 0 0.00 0.00 0.00 309.0 ▇▁▁▁▁
songsListened 0 1.00 12863.89 25193.67 0 439.00 3432.00 14785.00 1000000.0 ▇▁▁▁▁
lovedTracks 0 1.00 77.76 284.19 0 0.00 9.00 58.00 44005.0 ▇▁▁▁▁
posts 0 1.00 3.77 93.96 0 0.00 0.00 0.00 15185.0 ▇▁▁▁▁
playlists 0 1.00 0.53 6.14 0 0.00 0.00 1.00 1943.0 ▇▁▁▁▁
shouts 1927 0.98 20.85 261.08 0 1.00 2.00 6.00 65872.0 ▇▁▁▁▁
tenure 32 1.00 39.55 19.29 0 24.00 38.00 54.00 111.0 ▅▇▆▂▁
delta1_friend_cnt 26 1.00 0.47 5.34 -486 0.00 0.00 0.00 521.0 ▁▁▇▁▁
delta1_avg_friend_age 22057 0.79 0.23 0.71 -50 0.00 0.14 0.38 19.5 ▁▁▁▇▁
delta1_avg_friend_male 16826 0.84 0.00 0.05 -1 0.00 0.00 0.00 1.0 ▁▁▇▁▁
delta1_friend_country_cnt 26 1.00 0.06 0.68 -52 0.00 0.00 0.00 30.0 ▁▁▁▇▁
delta1_subscriber_friend_cnt 26 1.00 -0.01 0.48 -23 0.00 0.00 0.00 43.0 ▁▇▁▁▁
delta1_songsListened 1001 0.99 587.81 2135.08 -93584 0.00 0.00 310.00 129096.0 ▁▁▇▁▁
delta1_lovedTracks 27 1.00 3.26 24.32 -1040 0.00 0.00 0.00 1338.0 ▁▁▇▁▁
delta1_posts 27 1.00 0.15 13.53 -6 0.00 0.00 0.00 3391.0 ▇▁▁▁▁
delta1_playlists 27 1.00 0.00 0.38 -8 0.00 0.00 0.00 116.0 ▇▁▁▁▁
delta1_shouts 1995 0.98 0.78 36.46 -953 0.00 0.00 0.00 10022.0 ▇▁▁▁▁
#perform exploratory analysis of the categorical features
highnote %>%  keep(is.factor) %>%  summary()
##  male       good_country delta1_good_country adopter   
##  0 :25715   1 :25033     0 :67728            0:100000  
##  NA:38950   NA:39155     NA:39393            1:  7213  
##  1 :42548   0 :43025     1 :   57                      
##                          -1:   35
# recode the "NA" level in the factor variables to "UNK"
highnote <- highnote %>%
  mutate(male = recode(male, "NA" = "UNK")) %>%
  mutate(good_country = recode(as.factor(good_country), "NA" = "UNK")) %>%
  mutate(delta1_good_country = recode(delta1_good_country, "NA" = "UNK"))
#perform exploratory analysis of the numeric variables 
highnote %>%  keep(is.numeric) %>%  summary()
##       age          friend_cnt      avg_friend_age  avg_friend_male
##  Min.   : 8.00   Min.   :   0.00   Min.   : 8.00   Min.   :0.000  
##  1st Qu.:20.00   1st Qu.:   1.00   1st Qu.:20.99   1st Qu.:0.400  
##  Median :23.00   Median :   3.00   Median :23.33   Median :0.670  
##  Mean   :24.39   Mean   :  12.23   Mean   :24.61   Mean   :0.632  
##  3rd Qu.:27.00   3rd Qu.:  10.00   3rd Qu.:27.00   3rd Qu.:1.000  
##  Max.   :79.00   Max.   :5089.00   Max.   :79.00   Max.   :1.000  
##  NA's   :50859   NA's   :1         NA's   :21626   NA's   :16517  
##  friend_country_cnt subscriber_friend_cnt songsListened      lovedTracks      
##  Min.   :  0.000    Min.   :  0.0000      Min.   :      0   Min.   :    0.00  
##  1st Qu.:  1.000    1st Qu.:  0.0000      1st Qu.:    439   1st Qu.:    0.00  
##  Median :  1.000    Median :  0.0000      Median :   3432   Median :    9.00  
##  Mean   :  2.788    Mean   :  0.3348      Mean   :  12864   Mean   :   77.76  
##  3rd Qu.:  3.000    3rd Qu.:  0.0000      3rd Qu.:  14785   3rd Qu.:   58.00  
##  Max.   :136.000    Max.   :309.0000      Max.   :1000000   Max.   :44005.00  
##  NA's   :1          NA's   :1                                                 
##      posts             playlists             shouts             tenure      
##  Min.   :    0.000   Min.   :   0.0000   Min.   :    0.00   Min.   :  0.00  
##  1st Qu.:    0.000   1st Qu.:   0.0000   1st Qu.:    1.00   1st Qu.: 24.00  
##  Median :    0.000   Median :   0.0000   Median :    2.00   Median : 38.00  
##  Mean   :    3.773   Mean   :   0.5304   Mean   :   20.85   Mean   : 39.55  
##  3rd Qu.:    0.000   3rd Qu.:   1.0000   3rd Qu.:    6.00   3rd Qu.: 54.00  
##  Max.   :15185.000   Max.   :1943.0000   Max.   :65872.00   Max.   :111.00  
##                                          NA's   :1927       NA's   :32      
##  delta1_friend_cnt   delta1_avg_friend_age delta1_avg_friend_male
##  Min.   :-486.0000   Min.   :-50.000       Min.   :-1            
##  1st Qu.:   0.0000   1st Qu.:  0.000       1st Qu.: 0            
##  Median :   0.0000   Median :  0.143       Median : 0            
##  Mean   :   0.4707   Mean   :  0.229       Mean   : 0            
##  3rd Qu.:   0.0000   3rd Qu.:  0.375       3rd Qu.: 0            
##  Max.   : 521.0000   Max.   : 19.500       Max.   : 1            
##  NA's   :26          NA's   :22057         NA's   :16826         
##  delta1_friend_country_cnt delta1_subscriber_friend_cnt delta1_songsListened
##  Min.   :-52.00000         Min.   :-23.0000             Min.   :-93584.0    
##  1st Qu.:  0.00000         1st Qu.:  0.0000             1st Qu.:     0.0    
##  Median :  0.00000         Median :  0.0000             Median :     0.0    
##  Mean   :  0.06187         Mean   : -0.0135             Mean   :   587.8    
##  3rd Qu.:  0.00000         3rd Qu.:  0.0000             3rd Qu.:   310.0    
##  Max.   : 30.00000         Max.   : 43.0000             Max.   :129096.0    
##  NA's   :26                NA's   :26                   NA's   :1001        
##  delta1_lovedTracks   delta1_posts      delta1_playlists   delta1_shouts      
##  Min.   :-1040.000   Min.   :  -6.000   Min.   : -8.0000   Min.   : -953.000  
##  1st Qu.:    0.000   1st Qu.:   0.000   1st Qu.:  0.0000   1st Qu.:    0.000  
##  Median :    0.000   Median :   0.000   Median :  0.0000   Median :    0.000  
##  Mean   :    3.262   Mean   :   0.146   Mean   :  0.0042   Mean   :    0.779  
##  3rd Qu.:    0.000   3rd Qu.:   0.000   3rd Qu.:  0.0000   3rd Qu.:    0.000  
##  Max.   : 1338.000   Max.   :3391.000   Max.   :116.0000   Max.   :10022.000  
##  NA's   :27          NA's   :27         NA's   :27         NA's   :1995

The completion rate for age is fairly low at 53%, and the completion rates for avg_friend_age (80%), avg_friend_male (85%), delta1_avg_friend_age (79%) and delta1_avg_friend_male (84%) are also a concern.
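
These rates can also be computed directly, without skim; a quick check (a minimal sketch):

# share of missing values per column, highest first (equals 1 - complete_rate)
sort(colMeans(is.na(highnote)), decreasing = TRUE)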

# re-examine the factor variables after recoding
highnote %>%  keep(is.factor) %>%  summary()
##   male       good_country delta1_good_country adopter   
##  0  :25715   1  :25033    0  :67728           0:100000  
##  UNK:38950   UNK:39155    UNK:39393           1:  7213  
##  1  :42548   0  :43025    1  :   57                     
##                           -1 :   35
# replace missing values in the numeric variables with mean or median by gender
highnote <- highnote %>%
  group_by(male) %>%
  mutate(age=ifelse(is.na(age), median(age, na.rm=TRUE),age)) %>%
  mutate(friend_cnt=ifelse(is.na(friend_cnt), median(friend_cnt, na.rm=TRUE), friend_cnt)) %>%
  mutate(avg_friend_age=ifelse(is.na(avg_friend_age), 
                               mean(avg_friend_age, na.rm=TRUE),avg_friend_age)) %>%
  mutate(avg_friend_male=ifelse(is.na(avg_friend_male),
                                mean(avg_friend_male, na.rm=TRUE), avg_friend_male)) %>%
  mutate(friend_country_cnt=ifelse(is.na(friend_country_cnt),
                                   median(friend_country_cnt, na.rm = TRUE), 
                                          friend_country_cnt)) %>%
  mutate(subscriber_friend_cnt=ifelse(is.na(subscriber_friend_cnt),
                                      median(subscriber_friend_cnt,na.rm = TRUE),
                                             subscriber_friend_cnt)) %>%
  mutate(shouts=ifelse(is.na(shouts),median(shouts,na.rm = TRUE),shouts)) %>%
  mutate(tenure=ifelse(is.na(tenure),median(tenure,na.rm=TRUE),tenure)) %>%
  # the delta1 (data recorded in the past period)
  mutate(delta1_friend_cnt=ifelse(is.na(delta1_friend_cnt),
                                  median(delta1_friend_cnt,na.rm=TRUE),
                                         delta1_friend_cnt)) %>%
  mutate(delta1_avg_friend_age=ifelse(is.na(delta1_avg_friend_age),
                                      mean(delta1_avg_friend_age, na.rm=TRUE),
                                           delta1_avg_friend_age)) %>%
  mutate(delta1_avg_friend_male=ifelse(is.na(delta1_avg_friend_male),
                                       mean(delta1_avg_friend_male, na.rm=TRUE),
                                       delta1_avg_friend_male)) %>%
  mutate(delta1_friend_country_cnt=ifelse(is.na(delta1_friend_country_cnt),
                                          median(delta1_friend_country_cnt,na.rm=TRUE),
                                         delta1_friend_country_cnt)) %>%
  mutate(delta1_subscriber_friend_cnt=ifelse(is.na(delta1_subscriber_friend_cnt),
                                             median(delta1_subscriber_friend_cnt,na.rm=TRUE),
                                             delta1_subscriber_friend_cnt)) %>%
  mutate(delta1_songsListened=ifelse(is.na(delta1_songsListened),
                                     median(delta1_songsListened, na.rm=TRUE),
                                     delta1_songsListened)) %>%
  mutate(delta1_lovedTracks=ifelse(is.na(delta1_lovedTracks), 
                                   median(delta1_lovedTracks,na.rm = TRUE),
                                   delta1_lovedTracks)) %>%
  mutate(delta1_posts=ifelse(is.na(delta1_posts), median(delta1_posts,na.rm=TRUE),
                             delta1_posts)) %>%
  mutate(delta1_playlists=ifelse(is.na(delta1_playlists), median(delta1_playlists,na.rm = TRUE),
                                 delta1_playlists)) %>%
  mutate(delta1_shouts=ifelse(is.na(delta1_shouts ),median(delta1_shouts,na.rm = TRUE),
                              delta1_shouts )) %>%
  ungroup()
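
For reference, the same gender-grouped imputation can be written more compactly with dplyr::across(); a sketch assuming the same median/mean column split as the chain above:

# compact equivalent of the chain above using across()
median_cols <- c("age", "friend_cnt", "friend_country_cnt",
                 "subscriber_friend_cnt", "shouts", "tenure",
                 "delta1_friend_cnt", "delta1_friend_country_cnt",
                 "delta1_subscriber_friend_cnt", "delta1_songsListened",
                 "delta1_lovedTracks", "delta1_posts", "delta1_playlists",
                 "delta1_shouts")
mean_cols <- c("avg_friend_age", "avg_friend_male",
               "delta1_avg_friend_age", "delta1_avg_friend_male")

highnote <- highnote %>%
  group_by(male) %>%
  mutate(across(all_of(median_cols), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)),
         across(all_of(mean_cols),   ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()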


# double-check that the NAs were replaced successfully
skim(highnote)
Data summary
Name highnote
Number of rows 107213
Number of columns 27
_______________________
Column type frequency:
character 1
factor 4
numeric 22
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
net_user 0 1 1 19 0 107205 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
male 0 1 FALSE 3 1: 42548, UNK: 38950, 0: 25715
good_country 0 1 FALSE 3 0: 43025, UNK: 39155, 1: 25033
delta1_good_country 0 1 FALSE 4 0: 67728, UNK: 39393, 1: 57, -1: 35
adopter 0 1 FALSE 2 0: 100000, 1: 7213

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 23.76 5.02 8 22.00 23.00 24.00 79.0 ▃▇▁▁▁
friend_cnt 0 1 12.23 48.19 0 1.00 3.00 10.00 5089.0 ▇▁▁▁▁
avg_friend_age 0 1 24.70 5.13 8 21.45 24.51 26.00 79.0 ▃▇▁▁▁
avg_friend_male 0 1 0.63 0.33 0 0.50 0.66 1.00 1.0 ▃▂▃▇▇
friend_country_cnt 0 1 2.79 4.98 0 1.00 1.00 3.00 136.0 ▇▁▁▁▁
subscriber_friend_cnt 0 1 0.33 2.12 0 0.00 0.00 0.00 309.0 ▇▁▁▁▁
songsListened 0 1 12863.89 25193.67 0 439.00 3432.00 14785.00 1000000.0 ▇▁▁▁▁
lovedTracks 0 1 77.76 284.19 0 0.00 9.00 58.00 44005.0 ▇▁▁▁▁
posts 0 1 3.77 93.96 0 0.00 0.00 0.00 15185.0 ▇▁▁▁▁
playlists 0 1 0.53 6.14 0 0.00 0.00 1.00 1943.0 ▇▁▁▁▁
shouts 0 1 20.53 258.74 0 1.00 2.00 6.00 65872.0 ▇▁▁▁▁
tenure 0 1 39.55 19.28 0 24.00 38.00 54.00 111.0 ▅▇▆▂▁
delta1_friend_cnt 0 1 0.47 5.34 -486 0.00 0.00 0.00 521.0 ▁▁▇▁▁
delta1_avg_friend_age 0 1 0.23 0.63 -50 0.00 0.22 0.32 19.5 ▁▁▁▇▁
delta1_avg_friend_male 0 1 0.00 0.05 -1 0.00 0.00 0.00 1.0 ▁▁▇▁▁
delta1_friend_country_cnt 0 1 0.06 0.68 -52 0.00 0.00 0.00 30.0 ▁▁▁▇▁
delta1_subscriber_friend_cnt 0 1 -0.01 0.48 -23 0.00 0.00 0.00 43.0 ▁▇▁▁▁
delta1_songsListened 0 1 582.32 2125.84 -93584 0.00 0.00 298.00 129096.0 ▁▁▇▁▁
delta1_lovedTracks 0 1 3.26 24.32 -1040 0.00 0.00 0.00 1338.0 ▁▁▇▁▁
delta1_posts 0 1 0.15 13.53 -6 0.00 0.00 0.00 3391.0 ▇▁▁▁▁
delta1_playlists 0 1 0.00 0.38 -8 0.00 0.00 0.00 116.0 ▇▁▁▁▁
delta1_shouts 0 1 0.76 36.12 -953 0.00 0.00 0.00 10022.0 ▇▁▁▁▁

2.3 Split data into predictors (X) and outcome (Y)

# create X and Y data frame
highnote_y <- highnote %>% pull("adopter") %>% as.factor()
highnote_prenorm_numeric_x <- highnote %>% select(-c("net_user", "adopter", "male", "delta1_good_country", "good_country")) # x set is the predictors 
highnote_prenorm_factor_x <- highnote %>% select(c("male", "delta1_good_country", "good_country"))

2.4 Normalization

Create a normalization function that rescales each numeric column to the same 0-1 scale.

# function to normalize data (0 to 1)
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
# Normalize the x variables since they are on different scales
highnote_x_normalized_1 <- as.data.frame(lapply(highnote_prenorm_numeric_x, normalize))
highnote_x_normalized <- cbind(highnote_x_normalized_1,highnote_prenorm_factor_x)
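
As a quick sanity check, each normalized column should now span 0 to 1 (note that min-max scaling would return NaN for a constant column); a minimal sketch:

# verify the min-max scaling: every column should range from 0 to 1
sapply(highnote_x_normalized_1, range)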

2.5 Split the data into training and test

We split the data set into training and test sets in a 75% : 25% ratio.

set.seed(1234)

# Split the data into training and test (75%/25%)
smp_size <- floor(0.75 * nrow(highnote_x_normalized))

# Randomly select row numbers for training data set
train_ind <- sample(seq_len(nrow(highnote_x_normalized)), size = smp_size)

# Create training and test sets for x
highnote_x_train <- highnote_x_normalized[train_ind, ]
highnote_x_test <- highnote_x_normalized[-train_ind, ]

# Create training and test sets for y
highnote_y_train <- highnote_y[train_ind]
highnote_y_test <- highnote_y[-train_ind]

# Create an empty data frame to store results from different models
clf_results <- data.frame(matrix(ncol = 5, nrow = 0))
names(clf_results) <- c("Model", "Accuracy", "Precision", "Recall", "F1")

# Create an empty data frame to store TP, TN, FP and FN values
cost_benefit_df <- data.frame(matrix(ncol = 5, nrow = 0))
names(cost_benefit_df) <- c("Model", "TP", "FN", "FP", "TN")

2.6 SMOTE to deal with class imbalance

# Check the class proportions across the full, training and test data sets
round(prop.table(table(highnote$adopter)), 4) * 100
## 
##     0     1 
## 93.27  6.73
round(prop.table(table(highnote_y_train)), 4) * 100
## highnote_y_train
##     0     1 
## 93.23  6.77
round(prop.table(table(highnote_y_test)), 4) * 100
## highnote_y_test
##     0     1 
## 93.41  6.59

The proportion of “0” and “1” is highly imbalanced in each of the three data sets above: over 93% of the adopter values are “0” (non-converted free users). If we train on the data as is, the models will tend to over-classify the majority class. Hence, we balance the class distribution of the training set using SMOTE (Synthetic Minority Oversampling Technique), which synthesizes new minority-class examples by interpolating between existing ones.

# install.packages("abind")
library("abind")

install.packages( "https://cran.r-project.org/src/contrib/Archive/DMwR/DMwR_0.4.1.tar.gz", repos=NULL, type="source" )
library("DMwR")
set.seed(1234)

# create the full training dataset with X and Y variable
highnote_train <-  cbind(highnote_x_train, highnote_y_train)

# SMOTE arguments: 1) a model formula, 2) the data, 3) perc.over = % of new
# minority cases to synthesize, 4) perc.under = % of majority cases to sample
# per synthetic minority case
highnote_train_balanced <- SMOTE(highnote_y_train ~ ., data.frame(highnote_train), perc.over = 100, perc.under = 200)
# Check the class proportions in the balanced training set
round(prop.table(table(highnote_train_balanced$highnote_y_train)), 4) * 100
## 
##  0  1 
## 50 50

After applying SMOTE, the training set is balanced, with equal proportions of “0” and “1”.
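
The arithmetic works out as follows: with perc.over = 100, SMOTE synthesizes one new example per original minority case, doubling the 5,446 adopters in the training split to 10,892; perc.under = 200 then samples two majority cases per synthetic minority case, which is also 10,892. That yields the 50/50 split above and the 21,784 training rows reported in the model summaries below.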

# remove the Y column from the newly balanced training set
highnote_x_train <- highnote_train_balanced %>% select(-highnote_y_train)

# store the Y column
highnote_y_train <- highnote_train_balanced %>% pull(highnote_y_train) %>% as.factor()

2.7 Transforming the data frame with one-hot encoding

We split the factor variables from the numeric variables so that the factors can be one-hot encoded.

#Split the train predictor variables into factors & numeric prior to doing one hot encoding transformation on the factor variables
highnote_x_train_factors <- highnote_x_train %>% select(c("delta1_good_country","good_country","male"))

highnote_x_train_numeric <- highnote_x_train %>% select (-c("delta1_good_country","good_country","male"))

#Perform the same splitting on test dataset
highnote_x_test_factors <- highnote_x_test %>% select(c("delta1_good_country","good_country","male"))

highnote_x_test_numeric <- highnote_x_test %>% select (-c("delta1_good_country","good_country","male"))


# Perform one-hot encoding on both data sets
library(dummies)

highnote_x_train_factors_ohe <- dummy.data.frame(data=highnote_x_train_factors, sep=" ")

highnote_x_test_factors_ohe <- dummy.data.frame(data=highnote_x_test_factors, sep=" ")

# Combine the factor and numeric variables using cbind()

highnote_x_train_ohe <- cbind(highnote_x_train_factors_ohe,highnote_x_train_numeric)

highnote_x_test_ohe <- cbind(highnote_x_test_factors_ohe,highnote_x_test_numeric)
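
The dummies package has since been archived on CRAN; if it is unavailable, an equivalent encoding can be produced with caret's dummyVars (a sketch; the generated column names differ slightly from dummy.data.frame's):

# hedged alternative using caret::dummyVars; fullRank = FALSE keeps one
# indicator column per factor level, like dummy.data.frame does
ohe_spec <- dummyVars(~ ., data = highnote_x_train_factors, fullRank = FALSE)
highnote_x_train_factors_ohe <- as.data.frame(predict(ohe_spec, newdata = highnote_x_train_factors))
highnote_x_test_factors_ohe  <- as.data.frame(predict(ohe_spec, newdata = highnote_x_test_factors))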
3. Classification Models

3.1 Logistic Regression

3.1.1 Logistic Regression for All Data

glm_fit <- train(highnote_x_train_ohe,
                 highnote_y_train, 
                 method = "glm",
                 family = "binomial",
                 preProc = c("center", "scale"))


# predict on test data 
glm_predict <- predict(glm_fit, newdata=highnote_x_test_ohe)

# predict on test data prob
glm_predict_prob <- predict(glm_fit, newdata = highnote_x_test_ohe, type="prob")

Convert the predicted probabilities into class labels using a 0.6 threshold:

y_pred_num <- ifelse(glm_predict_prob[,2] > 0.6, 
                     "1", "0")
# As the dataset is imbalanced, class-specific metrics such as recall,
# precision and F-measure matter more than accuracy. Accuracy keeps rising
# with the threshold because the model conservatively predicts "0" in most
# cases, but the F-measure drops; we therefore chose the threshold with the
# best F-measure.
# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(as.factor(y_pred_num), as.factor(highnote_y_test), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 22360  1047
##          1  2677   720
##                                           
##                Accuracy : 0.8611          
##                  95% CI : (0.8569, 0.8652)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2104          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.40747         
##             Specificity : 0.89308         
##          Pos Pred Value : 0.21195         
##          Neg Pred Value : 0.95527         
##              Prevalence : 0.06592         
##          Detection Rate : 0.02686         
##    Detection Prevalence : 0.12673         
##       Balanced Accuracy : 0.65027         
##                                           
##        'Positive' Class : 1               
## 
# Add results into clf_results dataframe
x1 <- confusionMatrix(as.factor(y_pred_num), as.factor(highnote_y_test), positive = "1")[["overall"]]
y1 <- confusionMatrix(as.factor(y_pred_num), as.factor(highnote_y_test), positive = "1")[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Logistic Regression (All Time)", 
                                             Accuracy = round (x1[["Accuracy"]],3), 
                                            Precision = round (y1[["Precision"]],3), 
                                            Recall = round (y1[["Recall"]],3), 
                                            F1 = round (y1[["F1"]],3))

# Print Accuracy and F1 score
cat("Accuracy is ", round(x1[["Accuracy"]],3), "and F1 is ", round (y1[["F1"]],3)  )
## Accuracy is  0.861 and F1 is  0.279
# Add results into cost_benefit_df dataframe for cost benefit analysis 
a1 <- confusionMatrix(as.factor(y_pred_num), as.factor(highnote_y_test))

# pick out TP, FN, FP and TN (confusionMatrix stores the table column-major:
# [1] = TN, [2] = FP, [3] = FN, [4] = TP)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Logistic Regression (All Time)", 
                                             TP = a1[["table"]][4], 
                                             FN = a1[["table"]][3], 
                                             FP = a1[["table"]][2], 
                                             TN = a1[["table"]][1])

3.1.2 Logistic Regression for Only the Past 3 Months' Change Data

# remove present variables in the training and test data sets

highnote_x_train_delta1 <- highnote_x_train_ohe %>% 
  select(-c("friend_cnt","avg_friend_age","avg_friend_male","friend_country_cnt","subscriber_friend_cnt","songsListened","lovedTracks","posts","playlists","shouts"))

highnote_x_test_delta1 <- highnote_x_test_ohe %>% 
  select(-c("friend_cnt","avg_friend_age","avg_friend_male","friend_country_cnt","subscriber_friend_cnt","songsListened","lovedTracks","posts","playlists","shouts"))

glm_fit_delta1 <- train(highnote_x_train_delta1,
                 highnote_y_train, 
                 method = "glm",
                 family = "binomial",
                 preProc = c("center", "scale"))

# predict on test data 
glm_predict_delta1 <- predict(glm_fit_delta1, newdata=highnote_x_test_delta1)

# predict on test data prob
glm_predict_prob_delta1 <- predict(glm_fit_delta1, newdata = highnote_x_test_delta1, type="prob")
# convert probability outcome into categorical ones.
y_pred_num_delta1 <- ifelse(glm_predict_prob_delta1[,2] > 0.6, 
                     "1", "0")

# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(as.factor(y_pred_num_delta1), as.factor(highnote_y_test), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 22622  1214
##          1  2415   553
##                                           
##                Accuracy : 0.8646          
##                  95% CI : (0.8605, 0.8687)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1645          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.31296         
##             Specificity : 0.90354         
##          Pos Pred Value : 0.18632         
##          Neg Pred Value : 0.94907         
##              Prevalence : 0.06592         
##          Detection Rate : 0.02063         
##    Detection Prevalence : 0.11073         
##       Balanced Accuracy : 0.60825         
##                                           
##        'Positive' Class : 1               
## 
# Add results into clf_results dataframe
x1_d <- confusionMatrix(as.factor(y_pred_num_delta1), as.factor(highnote_y_test), positive = "1")[["overall"]]
y1_d <- confusionMatrix(as.factor(y_pred_num_delta1), as.factor(highnote_y_test), positive = "1")[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Logistic Regression (Past)", 
                                             Accuracy = round (x1_d[["Accuracy"]],3), 
                                            Precision = round (y1_d[["Precision"]],3), 
                                            Recall = round (y1_d[["Recall"]],3), 
                                            F1 = round (y1_d[["F1"]],3))

# Print Accuracy and F1 score
cat("Accuracy is ", round(x1_d[["Accuracy"]],3), "and F1 is ", round (y1_d[["F1"]],3)  )
## Accuracy is  0.865 and F1 is  0.234
# Add results into cost_benefit_df dataframe for cost benefit analysis 
a1_d <- confusionMatrix(as.factor(y_pred_num_delta1), as.factor(highnote_y_test))

# pick out TP, FN, FP and TN
cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Logistic Regression (Past)", 
                                             TP = a1_d[["table"]][4], 
                                             FN = a1_d[["table"]][3], 
                                             FP = a1_d[["table"]][2], 
                                             TN = a1_d[["table"]][1])

3.2 Decision Tree Classification

3.2.1 Decision Tree for All Data

# Cross validation
cross_validation <- trainControl(## 10-fold CV
                                method = "repeatedcv",
                                number = 10,
                                ## repeated three times
                                repeats = 3)
# Hyperparameter tuning
# maxdepth =  the maximum depth of the tree that will be created or
# the length of the longest path from the tree root to a leaf.

Param_Grid <-  expand.grid(maxdepth = 2:10)

dtree_fit <- train(highnote_x_train_ohe,
                   highnote_y_train, 
                   method = "rpart2",
                   # split - criteria to split nodes
                   parms = list(split = "gini"),
                  tuneGrid = Param_Grid,
                   trControl = cross_validation,
                  # preProc -  perform listed pre-processing to predictor dataframe
                   preProc = c("center", "scale"))

# check the accuracy for different models
dtree_fit
## CART 
## 
## 21784 samples
##    32 predictor
##     2 classes: '0', '1' 
## 
## Pre-processing: centered (32), scaled (32) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 19606, 19605, 19605, 19606, 19606, 19605, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    2        0.7440474  0.4880943
##    3        0.7442157  0.4884310
##    4        0.7447054  0.4894108
##    5        0.7447054  0.4894108
##    6        0.7447054  0.4894108
##    7        0.7447054  0.4894108
##    8        0.7447054  0.4894108
##    9        0.7447054  0.4894108
##   10        0.7447054  0.4894108
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 4.
# print the final model
dtree_fit$finalModel
## n= 21784 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 21784 10892 0 (0.5000000 0.5000000)  
##   2) delta1_songsListened< -0.3632505 8896  1974 0 (0.7781025 0.2218975) *
##   3) delta1_songsListened>=-0.3632505 12888  3970 1 (0.3080385 0.6919615)  
##     6) lovedTracks< -0.2982294 1622   614 0 (0.6214550 0.3785450) *
##     7) lovedTracks>=-0.2982294 11266  2962 1 (0.2629150 0.7370850) *
# Plot decision tree
prp(dtree_fit$finalModel, box.palette = "Blue", tweak = 1.2)

# Predict on test data
dtree_predict <- predict(dtree_fit, newdata = highnote_x_test_ohe)

# predict probabilities
dtree_predict_prob <- predict(dtree_fit, newdata = highnote_x_test_ohe, type="prob")

# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(dtree_predict,  highnote_y_test, positive ="1" )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 18156   585
##          1  6881  1182
##                                           
##                Accuracy : 0.7215          
##                  95% CI : (0.7161, 0.7268)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1484          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.66893         
##             Specificity : 0.72517         
##          Pos Pred Value : 0.14660         
##          Neg Pred Value : 0.96879         
##              Prevalence : 0.06592         
##          Detection Rate : 0.04410         
##    Detection Prevalence : 0.30081         
##       Balanced Accuracy : 0.69705         
##                                           
##        'Positive' Class : 1               
## 
# Add results into clf_results dataframe
x2 <- confusionMatrix(dtree_predict,  highnote_y_test, positive ="1" )[["overall"]]
y2 <- confusionMatrix(dtree_predict,  highnote_y_test, positive ="1" )[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Decision Tree (All Time)", 
                                             Accuracy = round (x2[["Accuracy"]],3), 
                                            Precision = round (y2[["Precision"]],3), 
                                            Recall = round (y2[["Recall"]],3), 
                                            F1 = round (y2[["F1"]],3))

# Print Accuracy and F1 score

cat("Accuracy is ", round(x2[["Accuracy"]],3), "and F1 is ", round (y2[["F1"]],3)  )
## Accuracy is  0.721 and F1 is  0.24
# Add results into cost_benefit_df dataframe for cost benefit analysis 
a2 <- confusionMatrix(dtree_predict,  highnote_y_test, positive ="1" )

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Decision Tree (All Time)", 
                                             TP = a2[["table"]][4], 
                                             FN = a2[["table"]][3], 
                                             FP = a2[["table"]][2], 
                                             TN = a2[["table"]][1])

3.2.2 Decision Tree for Only the Past 3 Months' Change Data

# Cross validation
cross_validation <- trainControl(## 10-fold CV
                                method = "repeatedcv",
                                number = 10,
                                ## repeated three times
                                repeats = 3)
# Hyperparameter tuning
# maxdepth =  the maximum depth of the tree that will be created or
# the length of the longest path from the tree root to a leaf.

Param_Grid <-  expand.grid(maxdepth = 2:10)

dtree_fit_delta1 <- train(highnote_x_train_delta1,
                   highnote_y_train, 
                   method = "rpart2",
                   # split - criteria to split nodes
                   parms = list(split = "gini"),
                  tuneGrid = Param_Grid,
                   trControl = cross_validation,
                  # preProc -  perform listed pre-processing to predictor dataframe
                   preProc = c("center", "scale"))

# check the accuracy for different models
dtree_fit_delta1
## CART 
## 
## 21784 samples
##    22 predictor
##     2 classes: '0', '1' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 19605, 19606, 19606, 19606, 19606, 19606, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    2        0.7414611  0.4829227
##    3        0.7414611  0.4829227
##    4        0.7514836  0.5029677
##    5        0.7514836  0.5029677
##    6        0.7514836  0.5029677
##    7        0.7514836  0.5029677
##    8        0.7514836  0.5029677
##    9        0.7514836  0.5029677
##   10        0.7514836  0.5029677
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 4.
# print the final model
dtree_fit_delta1$finalModel
## n= 21784 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 21784 10892 0 (0.5000000 0.5000000)  
##    2) delta1_songsListened< -0.3632505 8896  1974 0 (0.7781025 0.2218975)  
##      4) age< -0.2341147 5787   937 0 (0.8380854 0.1619146) *
##      5) age>=-0.2341147 3109  1037 0 (0.6664522 0.3335478)  
##       10) age>=-0.03440615 2877   805 0 (0.7201946 0.2798054) *
##       11) age< -0.03440615 232     0 1 (0.0000000 1.0000000) *
##    3) delta1_songsListened>=-0.3632505 12888  3970 1 (0.3080385 0.6919615)  
##      6) delta1_lovedTracks< -0.202196 5489  2419 1 (0.4406996 0.5593004)  
##       12) age< -0.2352699 2864  1276 0 (0.5544693 0.4455307) *
##       13) age>=-0.2352699 2625   831 1 (0.3165714 0.6834286) *
##      7) delta1_lovedTracks>=-0.202196 7399  1551 1 (0.2096229 0.7903771) *
# Plot decision tree
prp(dtree_fit_delta1$finalModel, box.palette = "Blue", tweak = 1.2)

# Predict on test data
dtree_predict_delta1 <- predict(dtree_fit_delta1, newdata = highnote_x_test_delta1)

# predict probabilities
dtree_predict_prob_delta1 <- predict(dtree_fit_delta1, newdata = highnote_x_test_delta1, type="prob")

# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(dtree_predict_delta1,  highnote_y_test, positive ="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 19446   696
##          1  5591  1071
##                                           
##                Accuracy : 0.7654          
##                  95% CI : (0.7603, 0.7705)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1674          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.60611         
##             Specificity : 0.77669         
##          Pos Pred Value : 0.16076         
##          Neg Pred Value : 0.96545         
##              Prevalence : 0.06592         
##          Detection Rate : 0.03996         
##    Detection Prevalence : 0.24854         
##       Balanced Accuracy : 0.69140         
##                                           
##        'Positive' Class : 1               
## 
# Add results into clf_results dataframe
x2_d <- confusionMatrix(dtree_predict_delta1,  highnote_y_test, positive ="1" )[["overall"]]
y2_d <- confusionMatrix(dtree_predict_delta1,  highnote_y_test, positive ="1" )[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Decision Tree (Past)", 
                                             Accuracy = round (x2_d[["Accuracy"]],3), 
                                            Precision = round (y2_d[["Precision"]],3), 
                                            Recall = round (y2_d[["Recall"]],3), 
                                            F1 = round (y2_d[["F1"]],3))

# Print Accuracy and F1 score

cat("Accuracy is ", round(x2_d[["Accuracy"]],3), "and F1 is ", round (y2_d[["F1"]],3)  )
## Accuracy is  0.765 and F1 is  0.254

The decision tree trained only on past-period (delta) data actually performs better than the one trained on all data, with a higher F1.

# Add results into cost_benefit_df dataframe for cost benefit analysis 
a2_d <- confusionMatrix(dtree_predict_delta1,  highnote_y_test, positive ="1" )

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Decision Tree (Past)", 
                                             TP = a2_d[["table"]][4], 
                                             FN = a2_d[["table"]][3], 
                                             FP = a2_d[["table"]][2], 
                                             TN = a2_d[["table"]][1])

3.3 K-Nearest Neighbour Classification

3.3.1 KNN for All Data

# Cross validation 
cross_validation <- trainControl(## 10-fold CV
                                method = "repeatedcv",
                                number = 10,
                                ## repeated three times
                                repeats = 3)
# Hyperparameter tuning
# k = number of nearest neighbours
Param_Grid <-  expand.grid(k = 1:10)

# fit the model to training data
knn_clf_fit <- train(highnote_x_train_ohe,
                     highnote_y_train, 
                     method = "knn",
                     tuneGrid = Param_Grid,
                     trControl = cross_validation )

# check the accuracy for different models
knn_clf_fit
## k-Nearest Neighbors 
## 
## 21784 samples
##    32 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 19605, 19605, 19606, 19606, 19605, 19606, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7087620  0.4175238
##    2  0.6836824  0.3673645
##    3  0.7032685  0.4065364
##    4  0.6935520  0.3871033
##    5  0.6971630  0.3943250
##    6  0.6936279  0.3872552
##    7  0.6984481  0.3968961
##    8  0.6950210  0.3900417
##    9  0.6988768  0.3977533
##   10  0.6974843  0.3949688
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
# Plot accuracies for different k values
plot(knn_clf_fit)

# print the best model
print(knn_clf_fit$finalModel)
## 1-nearest neighbor model
## Training set outcome distribution:
## 
##     0     1 
## 10892 10892
# Predict on test data
knnPredict <- predict(knn_clf_fit, newdata = highnote_x_test_ohe)

knnProb <- predict(knn_clf_fit, newdata=highnote_x_test_ohe, type="prob")
# Print Confusion matrix, Accuracy, Sensitivity etc 
# (no positive = "1" is passed here, so caret defaults to "0" as the
# positive class; the class-"1" metrics are recomputed below)
confusionMatrix(knnPredict, highnote_y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16980   957
##          1  8057   810
##                                          
##                Accuracy : 0.6637         
##                  95% CI : (0.658, 0.6694)
##     No Information Rate : 0.9341         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0476         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.67820        
##             Specificity : 0.45840        
##          Pos Pred Value : 0.94665        
##          Neg Pred Value : 0.09135        
##              Prevalence : 0.93408        
##          Detection Rate : 0.63349        
##    Detection Prevalence : 0.66919        
##       Balanced Accuracy : 0.56830        
##                                          
##        'Positive' Class : 0              
## 
# Add results into clf_results dataframe
x3 <- confusionMatrix(knnPredict, highnote_y_test, positive ="1" )[["overall"]]
y3 <- confusionMatrix(knnPredict, highnote_y_test, positive ="1" )[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "KNN (All Time)", 
                                             Accuracy = round (x3[["Accuracy"]],3), 
                                            Precision = round (y3[["Precision"]],3), 
                                            Recall = round (y3[["Recall"]],3), 
                                            F1 = round (y3[["F1"]],3))
# Print Accuracy and F1 score

cat("Accuracy is ", round(x3[["Accuracy"]],3), "and F1 is ", round (y3[["F1"]],3)  )
## Accuracy is  0.664 and F1 is  0.152
# Add results into cost_benefit_df dataframe for cost benefit analysis 
a3 <- confusionMatrix(knnPredict, highnote_y_test, positive ="1")

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "KNN (All Time)", 
                                             TP = a3[["table"]][4], 
                                             FN = a3[["table"]][3], 
                                             FP = a3[["table"]][2], 
                                             TN = a3[["table"]][1])

3.3.2 KNN for the Past 3 Months' Change Data

# Cross validation 
cross_validation <- trainControl(## 10-fold CV
                                method = "repeatedcv",
                                number = 10,
                                ## repeated three times
                                repeats = 3)
# Hyperparameter tuning
# k = number of nearest neighbours
Param_Grid <-  expand.grid(k = 1:10)

# fit the model to training data
knn_clf_fit_delta1 <- train(highnote_x_train_delta1,
                     highnote_y_train, 
                     method = "knn",
                     tuneGrid = Param_Grid,
                     trControl = cross_validation )

# check the accuracy for different models
knn_clf_fit_delta1
## k-Nearest Neighbors 
## 
## 21784 samples
##    22 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 19605, 19605, 19606, 19606, 19606, 19606, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7127851  0.4255693
##    2  0.6916073  0.3832146
##    3  0.7037121  0.4074238
##    4  0.6964892  0.3929780
##    5  0.7011101  0.4022197
##    6  0.6976060  0.3952113
##    7  0.6991977  0.3983951
##    8  0.6957242  0.3914481
##    9  0.6977132  0.3954260
##   10  0.6938576  0.3877149
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
# Plot accuracies for different k values
plot(knn_clf_fit_delta1)

# print the best model
print(knn_clf_fit_delta1$finalModel)
## 1-nearest neighbor model
## Training set outcome distribution:
## 
##     0     1 
## 10892 10892
# Predict on test data
knnPredict_delta1 <- predict(knn_clf_fit_delta1, newdata = highnote_x_test_delta1)

# Predict probabilities
knnProb_delta1 <- predict(knn_clf_fit_delta1, newdata=highnote_x_test_delta1, type="prob")

# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(knnPredict_delta1, highnote_y_test, positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 18240   965
##          1  6797   802
##                                           
##                Accuracy : 0.7104          
##                  95% CI : (0.7049, 0.7158)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.072           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.45388         
##             Specificity : 0.72852         
##          Pos Pred Value : 0.10554         
##          Neg Pred Value : 0.94975         
##              Prevalence : 0.06592         
##          Detection Rate : 0.02992         
##    Detection Prevalence : 0.28350         
##       Balanced Accuracy : 0.59120         
##                                           
##        'Positive' Class : 1               
## 
# Add results into clf_results dataframe
x3_d <- confusionMatrix(knnPredict_delta1, highnote_y_test, positive ="1")[["overall"]]
y3_d <- confusionMatrix(knnPredict_delta1, highnote_y_test, positive ="1")[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "KNN (Past)", 
                                             Accuracy = round (x3_d[["Accuracy"]],3), 
                                            Precision = round (y3_d[["Precision"]],3), 
                                            Recall = round (y3_d[["Recall"]],3), 
                                            F1 = round (y3_d[["F1"]],3))
# Print Accuracy and F1 score

cat("Accuracy is ", round(x3_d[["Accuracy"]],3), "and F1 is ", round (y3_d[["F1"]],3)  )
## Accuracy is  0.71 and F1 is  0.171
# Add results into cost_benefit_df dataframe for cost benefit analysis 
a3_d <- confusionMatrix(knnPredict_delta1, highnote_y_test, positive ="1")

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "KNN (Past)", 
                                             TP = a3_d[["table"]][4], 
                                             FN = a3_d[["table"]][3], 
                                             FP = a3_d[["table"]][2], 
                                             TN = a3_d[["table"]][1])

For both KNN and the decision tree, the predicted probabilities are fairly coarse. Although we tested thresholds from 0.4 to 0.8 for all models, accuracy did not change dramatically and the F-measure was not significantly better; in some cases the confusion matrix did not change at all across threshold levels.
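
One way to verify this is to sweep the threshold and recompute the F-measure; a minimal sketch using the logistic regression probabilities from Section 3.1.1:

# hedged sketch: recompute F1 at each candidate threshold
thresholds <- seq(0.4, 0.8, by = 0.1)
f1_by_threshold <- sapply(thresholds, function(t) {
  pred <- factor(ifelse(glm_predict_prob[, 2] > t, "1", "0"), levels = c("0", "1"))
  confusionMatrix(pred, as.factor(highnote_y_test), positive = "1")[["byClass"]][["F1"]]
})
data.frame(threshold = thresholds, F1 = round(f1_by_threshold, 3))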

3.4 XG Boost Classification

3.4.1 XG Boost for All Data

#Training the Model
XG_clf_fit <- train(highnote_x_train_ohe, 
                    highnote_y_train,
                    verbosity = 0,
                    method = "xgbTree",
                    preProc = c("center", "scale"))

# print the final model
XG_clf_fit$finalModel
## ##### xgb.Booster
## raw: 170.1 Kb 
## call:
##   xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, 
##     gamma = param$gamma, colsample_bytree = param$colsample_bytree, 
##     min_child_weight = param$min_child_weight, subsample = param$subsample), 
##     data = x, nrounds = param$nrounds, objective = "binary:logistic", 
##     verbosity = 0)
## params (as set within xgb.train):
##   eta = "0.3", max_depth = "3", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "0.75", objective = "binary:logistic", verbosity = "0", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## # of features: 32 
## niter: 150
## nfeatures : 32 
## xNames : delta1_good_country 0 delta1_good_country UNK delta1_good_country 1 delta1_good_country -1 good_country 1 good_country UNK good_country 0 male 0 male UNK male 1 age friend_cnt avg_friend_age avg_friend_male friend_country_cnt subscriber_friend_cnt songsListened lovedTracks posts playlists shouts tenure delta1_friend_cnt delta1_avg_friend_age delta1_avg_friend_male delta1_friend_country_cnt delta1_subscriber_friend_cnt delta1_songsListened delta1_lovedTracks delta1_posts delta1_playlists delta1_shouts 
## problemType : Classification 
## tuneValue :
##     nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 42     150         3 0.3     0              0.6                1      0.75
## obsLevels : 0 1 
## param :
##  $verbosity
## [1] 0
# Predict on test data
XG_clf_predict <- predict(XG_clf_fit,newdata=highnote_x_test_ohe)

# predict probabilities
XG_clf_predict_prob <- predict(XG_clf_fit, newdata=highnote_x_test_ohe, type="prob")

# convert probability outcome into categorical ones.
y_pred_num_xg <- ifelse(XG_clf_predict_prob[,2] > 0.6, "1", "0")
# Across all thresholds tested, the F-measure is highest at 0.6, and
# accuracy is also at its best there.

# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(as.factor(y_pred_num_xg), as.factor(highnote_y_test), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 22467   932
##          1  2570   835
##                                           
##                Accuracy : 0.8693          
##                  95% CI : (0.8653, 0.8734)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2585          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.47255         
##             Specificity : 0.89735         
##          Pos Pred Value : 0.24523         
##          Neg Pred Value : 0.96017         
##              Prevalence : 0.06592         
##          Detection Rate : 0.03115         
##    Detection Prevalence : 0.12703         
##       Balanced Accuracy : 0.68495         
##                                           
##        'Positive' Class : 1               
## 
# Add results into clf_results dataframe
x4 <- confusionMatrix(as.factor(y_pred_num_xg), as.factor(highnote_y_test), positive = "1" )[["overall"]]
y4 <- confusionMatrix(as.factor(y_pred_num_xg), as.factor(highnote_y_test), positive = "1" )[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "XG Boost (All Time)", 
                                             Accuracy = round (x4[["Accuracy"]],3), 
                                            Precision = round (y4[["Precision"]],3), 
                                            Recall = round (y4[["Recall"]],3), 
                                            F1 = round (y4[["F1"]],3))

# Print Accuracy and F1 score
cat("Accuracy is ", round(x4[["Accuracy"]],3), "and F1 is ", round (y4[["F1"]],3)  )
## Accuracy is  0.869 and F1 is  0.323
# Add results into cost_benefit_df dataframe for cost benefit analysis 
# use the same 0.6-threshold predictions as the metrics above so the
# cost-benefit counts match the reported confusion matrix
a4 <- confusionMatrix(as.factor(y_pred_num_xg),  as.factor(highnote_y_test), positive="1" )

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "XG Boost (All Time)", 
                                             TP = a4[["table"]][4], 
                                             FN = a4[["table"]][3], 
                                             FP = a4[["table"]][2], 
                                             TN = a4[["table"]][1])

3.4.2 XG Boost for the Past 3 Months' Change/Delta Data

# Train the model

XG_clf_fit_delta1 <- train(highnote_x_train_delta1, 
                    highnote_y_train,
                    verbosity = 0,
                    method = "xgbTree",
                    preProc = c("center", "scale"))

# print the final model
XG_clf_fit_delta1$finalModel
## ##### xgb.Booster
## raw: 167.6 Kb 
## call:
##   xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, 
##     gamma = param$gamma, colsample_bytree = param$colsample_bytree, 
##     min_child_weight = param$min_child_weight, subsample = param$subsample), 
##     data = x, nrounds = param$nrounds, objective = "binary:logistic", 
##     verbosity = 0)
## params (as set within xgb.train):
##   eta = "0.3", max_depth = "3", gamma = "0", colsample_bytree = "0.8", min_child_weight = "1", subsample = "1", objective = "binary:logistic", verbosity = "0", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## # of features: 22 
## niter: 150
## nfeatures : 22 
## xNames : delta1_good_country 0 delta1_good_country UNK delta1_good_country 1 delta1_good_country -1 good_country 1 good_country UNK good_country 0 male 0 male UNK male 1 age tenure delta1_friend_cnt delta1_avg_friend_age delta1_avg_friend_male delta1_friend_country_cnt delta1_subscriber_friend_cnt delta1_songsListened delta1_lovedTracks delta1_posts delta1_playlists delta1_shouts 
## problemType : Classification 
## tuneValue :
##     nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 54     150         3 0.3     0              0.8                1         1
## obsLevels : 0 1 
## param :
##  $verbosity
## [1] 0
# Predict on test data
XG_clf_predict_delta1 <- predict(XG_clf_fit_delta1,newdata=highnote_x_test_delta1)

# predict probabilities
XG_clf_predict_delta1_prob <- predict(XG_clf_fit_delta1, newdata=highnote_x_test_delta1, type="prob")

# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(XG_clf_predict_delta1,  highnote_y_test, positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 21330   835
##          1  3707   932
##                                         
##                Accuracy : 0.8305        
##                  95% CI : (0.826, 0.835)
##     No Information Rate : 0.9341        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.2161        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.52745       
##             Specificity : 0.85194       
##          Pos Pred Value : 0.20091       
##          Neg Pred Value : 0.96233       
##              Prevalence : 0.06592       
##          Detection Rate : 0.03477       
##    Detection Prevalence : 0.17307       
##       Balanced Accuracy : 0.68969       
##                                         
##        'Positive' Class : 1             
## 
x4_d <- confusionMatrix(XG_clf_predict_delta1,  highnote_y_test, positive="1" )[["overall"]]
y4_d <- confusionMatrix(XG_clf_predict_delta1,  highnote_y_test, positive="1" )[["byClass"]]
# the default 0.5 threshold has the best F-measure here, albeit at a lower
# accuracy than a 0.6 threshold; we use F-measure as the deciding metric
# for picking the right model

clf_results[nrow(clf_results) + 1,] <-  list(Model = "XG Boost (Past)", 
                                             Accuracy = round (x4_d[["Accuracy"]],3), 
                                            Precision = round (y4_d[["Precision"]],3), 
                                            Recall = round (y4_d[["Recall"]],3), 
                                            F1 = round (y4_d[["F1"]],3))

# Print Accuracy and F1 score
cat("Accuracy is ", round(x4_d[["Accuracy"]],3), "and F1 is ", round (y4_d[["F1"]],3)  )
## Accuracy is  0.831 and F1 is  0.291
# Add results into cost_benefit_df dataframe for cost benefit analysis 
a4_d <- confusionMatrix(XG_clf_predict_delta1,  highnote_y_test, positive="1" )

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "XG Boost (Past)", 
                                             TP = a4_d[["table"]][4], 
                                             FN = a4_d[["table"]][3], 
                                             FP = a4_d[["table"]][2], 
                                             TN = a4_d[["table"]][1])

3.5 Neural Network Classification

3.5.1 NN for All Data

my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 7))

nn_clf_fit <- train(highnote_x_train_ohe,
                    highnote_y_train,
                      method = "nnet",
                      # threshold and stepmax are neuralnet-style arguments;
                      # nnet appears to ignore them (absorbed by ...)
                      threshold=0.01,
                      stepmax = 100,
                      trace = F, linout = 0,
                      tuneGrid = my.grid)  

print(nn_clf_fit)
## Neural Network 
## 
## 21784 samples
##    32 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 21784, 21784, 21784, 21784, 21784, 21784, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  Accuracy   Kappa    
##   0.1    5     0.7388745  0.4777467
##   0.1    7     0.7419210  0.4838560
##   0.5    5     0.7292708  0.4585945
##   0.5    7     0.7294157  0.4588638
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.1.
# Plotting neural network
plotnet(nn_clf_fit$finalModel, y_names = "adopter")

# prediction on test data
nn_predict <- predict(nn_clf_fit, newdata = highnote_x_test_ohe)

# predict probabilities
nn_predict_prob <- predict(nn_clf_fit, newdata=highnote_x_test_ohe, type="prob")

# convert predicted probabilities into class labels using a 0.6 cutoff
y_pred_num_nn <- ifelse(nn_predict_prob[,2] > 0.6, "1", "0")

# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(as.factor(y_pred_num_nn), as.factor(highnote_y_test), positive= "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 22029  1118
##          1  3008   649
##                                           
##                Accuracy : 0.8461          
##                  95% CI : (0.8417, 0.8504)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1651          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.36729         
##             Specificity : 0.87986         
##          Pos Pred Value : 0.17747         
##          Neg Pred Value : 0.95170         
##              Prevalence : 0.06592         
##          Detection Rate : 0.02421         
##    Detection Prevalence : 0.13643         
##       Balanced Accuracy : 0.62357         
##                                           
##        'Positive' Class : 1               
## 
# Add results into clf_results data frame
x5 <- confusionMatrix(as.factor(y_pred_num_nn), as.factor(highnote_y_test), positive= "1")[["overall"]]
y5 <- confusionMatrix(as.factor(y_pred_num_nn), as.factor(highnote_y_test), positive ="1")[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Neural Network (All Time)", 
                                             Accuracy = round (x5[["Accuracy"]],3), 
                                            Precision = round (y5[["Precision"]],3), 
                                            Recall = round (y5[["Recall"]],3), 
                                            F1 = round (y5[["F1"]],3))
# Print Accuracy and F1 score

cat("Accuracy is ", round(x5[["Accuracy"]],3), "and F1 is ", round (y5[["F1"]],3)  )
## Accuracy is  0.846 and F1 is  0.239
# Add results into cost_benefit_df dataframe for cost benefit analysis.
# Note: a5 uses nn_predict (caret's default 0.5 cutoff), not the 0.6-cutoff labels above.
a5 <- confusionMatrix(nn_predict, highnote_y_test, positive ="1")

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Neural Network (All Time)", 
                                             TP = a5[["table"]][4], 
                                             FN = a5[["table"]][3], 
                                             FP = a5[["table"]][2], 
                                             TN = a5[["table"]][1])

3.5.2 NN for Data Capturing the Change/Delta in the Past 3 Months

my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 7))

nn_clf_fit_delta1 <- train(highnote_x_train_delta1,
                    highnote_y_train,
                      method = "nnet",
                      threshold=0.01,
                      stepmax = 100,
                      trace = F, linout = 0,
                      tuneGrid = my.grid)  

print(nn_clf_fit_delta1)
## Neural Network 
## 
## 21784 samples
##    22 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 21784, 21784, 21784, 21784, 21784, 21784, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  Accuracy   Kappa    
##   0.1    5     0.7061388  0.4121987
##   0.1    7     0.7095989  0.4191031
##   0.5    5     0.7004434  0.4007779
##   0.5    7     0.7006438  0.4011657
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.1.
# Plotting neural network
plotnet(nn_clf_fit_delta1$finalModel, y_names = "adopter")

# prediction on test data
nn_predict_delta1 <- predict(nn_clf_fit_delta1, newdata = highnote_x_test_delta1)

# predict probabilities
nn_predict_prob_delta1 <- predict(nn_clf_fit_delta1, newdata=highnote_x_test_delta1, type="prob")

# Print Confusion matrix, Accuracy, Sensitivity etc 
confusionMatrix(nn_predict_delta1, highnote_y_test, positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 20523  1080
##          1  4514   687
##                                           
##                Accuracy : 0.7913          
##                  95% CI : (0.7864, 0.7962)
##     No Information Rate : 0.9341          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1096          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.38879         
##             Specificity : 0.81971         
##          Pos Pred Value : 0.13209         
##          Neg Pred Value : 0.95001         
##              Prevalence : 0.06592         
##          Detection Rate : 0.02563         
##    Detection Prevalence : 0.19404         
##       Balanced Accuracy : 0.60425         
##                                           
##        'Positive' Class : 1               
## 
# Add results into clf_results dataframe
x5_d <- confusionMatrix(nn_predict_delta1, highnote_y_test, positive ="1")[["overall"]]
y5_d <- confusionMatrix(nn_predict_delta1, highnote_y_test, positive ="1")[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Neural Network (Past)", 
                                             Accuracy = round (x5_d[["Accuracy"]],3), 
                                            Precision = round (y5_d[["Precision"]],3), 
                                            Recall = round (y5_d[["Recall"]],3), 
                                            F1 = round (y5_d[["F1"]],3))
# Print Accuracy and F1 score

cat("Accuracy is ", round(x5_d[["Accuracy"]],3), "and F1 is ", round (y5_d[["F1"]],3)  )
## Accuracy is  0.791 and F1 is  0.197
# Add results into cost_benefit_df dataframe for cost benefit analysis 
a5_d <- confusionMatrix(nn_predict_delta1, highnote_y_test, positive ="1")

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Neural Network (Past)", 
                                             TP = a5_d[["table"]][4], 
                                             FN = a5_d[["table"]][3], 
                                             FP = a5_d[["table"]][2], 
                                             TN = a5_d[["table"]][1])
4. Compare All Classification Models

4.1 Compare Accuracy, Precision, Recall, and F-1 Score

# compare cost benefit
print(cost_benefit_df)
##                             Model   TP   FN   FP    TN
## 1  Logistic Regression (All Time)  720 1047 2677 22360
## 2      Logistic Regression (Past)  553 1214 2415 22622
## 3        Decision Tree (All Time) 1182  585 6881 18156
## 4            Decision Tree (Past) 1071  696 5591 19446
## 5                  KNN (All Time)  810  957 8057 16980
## 6                      KNN (Past)  802  965 6797 18240
## 7             XG Boost (All Time) 1050  717 3960 21077
## 8                 XG Boost (Past)  932  835 3707 21330
## 9       Neural Network (All Time)  920  847 5049 19988
## 10          Neural Network (Past)  687 1080 4514 20523
# compare classification results
print(clf_results)
##                             Model Accuracy Precision Recall    F1
## 1  Logistic Regression (All Time)    0.861     0.212  0.407 0.279
## 2      Logistic Regression (Past)    0.865     0.186  0.313 0.234
## 3        Decision Tree (All Time)    0.721     0.147  0.669 0.240
## 4            Decision Tree (Past)    0.765     0.161  0.606 0.254
## 5                  KNN (All Time)    0.664     0.091  0.458 0.152
## 6                      KNN (Past)    0.710     0.106  0.454 0.171
## 7             XG Boost (All Time)    0.869     0.245  0.473 0.323
## 8                 XG Boost (Past)    0.831     0.201  0.527 0.291
## 9       Neural Network (All Time)    0.846     0.177  0.367 0.239
## 10          Neural Network (Past)    0.791     0.132  0.389 0.197
# Plot accuracy for all the Classification Models
ggplot(clf_results[1:10,] %>% arrange(desc(Accuracy)) %>%
       mutate(Model=factor(Model, levels=Model) ), 
       aes(x = Model, y = Accuracy)) +
  geom_bar(stat = "identity" , width=0.3, fill="steelblue") + 
  coord_cartesian(ylim = c(0.2, 1)) +
  geom_hline(aes(yintercept = mean(Accuracy)),
             colour = "green",linetype="dashed") +
  ggtitle("Compare Accuracy for all Models") +
  theme(plot.title = element_text(color="black", size=10, hjust = 0.5),
        axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

# Plot F-1 scores for all the Classification Models
ggplot(clf_results[1:10,] %>% arrange(desc(F1)) %>%
       mutate(Model=factor(Model, levels=Model) ), 
       aes(x = Model, y = F1)) +
  geom_bar(stat = "identity" , width=0.3, fill="steelblue") + 
  coord_cartesian(ylim = c(0.2, 0.4)) +
  geom_hline(aes(yintercept = mean(F1)),
             colour = "green",linetype="dashed") +
  ggtitle("Compare F-1 scores for all Models") +
  theme(plot.title = element_text(color="black", size=10, hjust = 0.5),
        axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

Based on the above, the XG Boost (All Time) model is the most accurate, at 86.9%. However, because the dataset is highly imbalanced, accuracy alone is misleading, and we need a class-specific metric to choose the right model. For this purpose we pick the model with the best F-measure, which is again the XG Boost (All Time) model. It also has the best precision on positive predictions of all the models.
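
As a quick programmatic check of this reading, the results table can be sorted by F1 directly:

# hedged sketch: rank the models in clf_results by F1
clf_results %>% arrange(desc(F1)) %>% head(3)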

4.2 Compare ROC-AUC and Lift Curves for All Models

# List of predictions
preds_list <- list(glm_predict_prob[,2], glm_predict_prob_delta1[,2],
                   dtree_predict_prob[,2], dtree_predict_prob_delta1[,2],
                   knnProb[,2], knnProb_delta1[,2], 
                   XG_clf_predict_prob[,2], XG_clf_predict_delta1_prob[,2],
                   nn_predict_prob[,2], nn_predict_prob_delta1[,2])

# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(highnote_y_test), m)

# Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")

# calculate AUC for all models in one pass
AUC_models <- performance(pred, "auc")
aucs <- round(unlist(AUC_models@y.values), 3)

# Plot the ROC curves, with each model's AUC shown in the legend
plot(rocs, col = as.list(1:m), main = "ROC-AUC Curves of Different Models (All Time and the Past)")
legend(x = "bottomright",
       legend = paste0(c("Logistic Regression (All)",
                         "Logistic Regression (Past)",
                         "Decision Tree (All)",
                         "Decision Tree (Past)",
                         "KNN (All)",
                         "KNN (Past)",
                         "XG Boost (All)",
                         "XG Boost (Past)",
                         "Neural Net (All)",
                         "Neural Net (Past)"),
                       " - ", aucs),
       fill = 1:m)

The rule of thumb for identifying the best classifier on an ROC curve is its proximity to the top-left corner, which reflects the trade-off between the true-positive and false-positive rates across thresholds. On this basis, the XG Boost (All Time) model outperforms all the other classifiers at every threshold.
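
A numeric proxy for "closest to the top-left corner" is Youden's J, the maximum of TPR minus FPR along each curve; a sketch computing it per model from the rocs object above:

# hedged sketch: max(TPR - FPR) for each model's ROC curve
youden_j <- sapply(seq_len(m), function(i) {
  max(rocs@y.values[[i]] - rocs@x.values[[i]])
})
round(youden_j, 3)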

The AUC, shown next to each model in the legend, is likewise highest for the XG Boost (All Time) model, a further indication that it is the best of the classifiers.
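
The same visual reading can be confirmed from the aucs vector computed above (the short model names here are added just for this sketch):

# hedged sketch: identify the model with the highest AUC
model_names <- c("LR (All)", "LR (Past)", "DT (All)", "DT (Past)", "KNN (All)",
                 "KNN (Past)", "XG Boost (All)", "XG Boost (Past)",
                 "NN (All)", "NN (Past)")
model_names[which.max(aucs)]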

The lift curve tells us how each classifier stacks up against targeting without a predictive model. It is the more relevant criterion when classification is followed by ranking, as it is here: we need to identify the 1,000 free users most likely to convert and give them a two-month trial in a bid to make them subscribe. The better our classifier is at ranking those top 1,000 free users, the more useful it is for our case.
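
Since the decision is ultimately "who are the top 1,000?", a direct check is how many true adopters each model captures in its top 1,000 test-set predictions, i.e., a single point on each model's gains curve. A sketch reusing preds_list and highnote_y_test from above:

# hedged sketch: adopters captured in each model's top 1,000 test predictions
top_k <- 1000
captured_at_k <- sapply(preds_list, function(p) {
  sum(highnote_y_test[order(p, decreasing = TRUE)][1:top_k] == "1")
})
captured_at_k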

lifts <- performance(pred, "lift", "rpp")

# Plot the Lift curves
plot(lifts, col = as.list(1:m), main = "Lift Curves of Different Models (All Time and the Past)")
legend(x = "topright", 
       legend = c( "Logistic Regression (All)", 
                   "Logistic Regression (Past)",
                   "Decision Tree (All)", 
                   "Decision Tree (Past)",
                   "KNN (All)",
                   "KNN (Past)",
                   "XG Boost (All)", 
                   "XG Boost (Past)",
                   "Neural Net (All)",
                   "Neural Net (Past)"), fill = 1:m)

The population we ultimately need to score is the 100,000 free users in the data set. The XG Boost (All) model delivers the highest lift here as well, so it appears to be the best model for our use case.

5. Model Selection

Based on the above metrics and their relevance to our use case, XG Boost (All) is the best-performing classifier: it has the best F-1 score, the best precision, and the highest lift. The last point is especially important because we are combining classification with ranking and are only interested in the top 1,000 customers most likely to convert to premium.

We therefore apply the XG Boost (All) model to identify the top 1,000 free users most likely to respond to the two-month trial and convert to premium.

# split the data set into id/target columns, factor columns, and numeric columns
highnote_id <- highnote %>% select(c("net_user", "adopter"))
highnote_factors <- highnote %>% select(c("delta1_good_country", "good_country", "male"))
highnote_numeric <- highnote %>% select(-c("net_user", "adopter", "male", "delta1_good_country", "good_country"))

# do one-hot encoding for the free users data
highnote_factors_ohe_1 <- data.frame(highnote_factors)
highnote_factors_ohe <- dummy.data.frame(data=highnote_factors_ohe_1, sep=" ")

# Normalize numeric data
highnote_numeric_norm <- as.data.frame(lapply(highnote_numeric, normalize))


#Combine numeric & factor variables in one dataframe
highnote_all <- cbind(highnote_id,highnote_factors_ohe, highnote_numeric_norm)

# create a new data set that contains only free users
highnote_free <- highnote_all %>% filter(adopter==0)
highnote_free_x <- highnote_free %>% select(-c("net_user", "adopter"))

# predict the data
predict_free_prob <- predict(XG_clf_fit, newdata = highnote_free_x, type="prob")
prob_of_adopting <- predict_free_prob[,2]

# put net user id back to the table
highnote_free_id <- highnote_free %>% select(c("net_user"))
adopters <- cbind(highnote_free_id, prob_of_adopting)

# select the top 1,000 free users with the highest predicted probabilities of converting to premium
adopters_top1k <- head(adopters[order(-adopters$prob_of_adopting),], 1000)
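
Equivalently, in the tidyverse style used elsewhere in this report (assuming dplyr >= 1.0 for slice_head):

# same selection, written with dplyr verbs
adopters_top1k <- adopters %>%
  arrange(desc(prob_of_adopting)) %>%
  slice_head(n = 1000)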

# summarize the predicted adoption probabilities for these 1,000 users
summary(adopters_top1k)
##    net_user         prob_of_adopting
##  Length:1000        Min.   :0.8800  
##  Class :character   1st Qu.:0.8991  
##  Mode  :character   Median :0.9164  
##                     Mean   :0.9218  
##                     3rd Qu.:0.9424  
##                     Max.   :0.9990
# export the data into a csv file
write.csv(x=adopters_top1k, file = "adopters_top1k.csv", row.names = TRUE)
successNum <- adopters_top1k %>% filter(prob_of_adopting>0.90) %>% tally()

successNum
##     n
## 1 734
6. Extra Credit Question: How can we make sure that, if we target these people, we can compute the correct ROI from the proposed model? Describe your strategy for actually deploying the targeting strategy from the predictive model.

Answer: In the HighNote case, we are providing a two-month free trial to the free users with the highest likelihood of converting to premium. Since we offer the trial to the top 1,000 users most likely to subscribe, we should use lift to determine the ROI of the proposed model. For the XG Boost (All) model chosen here, the average predicted probability of adoption among the top 1,000 users is 92.2%, meaning we expect roughly 922 of them to adopt the premium service. Comparing this with the historical conversion rate of free users in similar campaigns gives us the lift. We can make the lift metric cost-aware by multiplying the number of incremental customers gained because of the model by the profit from acquiring a new premium customer; this gives us the true ROI of the proposed model.
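
A minimal sketch of this ROI arithmetic, with a placeholder baseline conversion rate and per-customer profit (both hypothetical; they should be replaced with HighNote's actual figures):

# hedged sketch of the ROI calculation described above
expected_adopters <- sum(adopters_top1k$prob_of_adopting)   # ~922 per the model
baseline_rate  <- 0.05   # hypothetical historical conversion rate for such campaigns
profit_per_sub <- 20     # hypothetical profit (in $) from one premium conversion
incremental_customers <- expected_adopters - baseline_rate * nrow(adopters_top1k)
roi_estimate <- incremental_customers * profit_per_sub
roi_estimate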