1.0 Introduction
A general challenge in Freemium communities is getting people to pay premium when they can also use the service for free (at the expense of seeing ads say or not getting some features). However, candidates are not as profitable as premium users because the fight for ad dollars from brands is intense. HighNote would like to use a data-driven approach to try to get more candidates to become premium users. They are willing to offer a two month free trial, but they do not know who to give this promotion to. They have hired you as a consultant to help them design a targeting strategy.
1.1 Objective
Compare different binary classification models, and predict the top 1000 most likely candidates to convert to premium.
2.1 Libraries & Packages
The following libraries are being used:
library("tidyverse")
library("skimr")
library("readxl") # used to read excel files
library("dplyr") # used for data munging
library("FNN") # used for knn regression (knn.reg function)
library("caret") # used for various predictive models
library("class") # for using confusion matrix function
library("rpart.plot") # used to plot decision tree
library("rpart") # used for Regression tree
library('NeuralNetTools') # used to plot Neural Networks
library("PRROC") # top plot ROC curve
library("ROCR") # top plot lift curve
2.2 Data Preparation for Modeling:
# load the data set
# read the CSV file into a data frame 'highnote'
highnote <- read_csv("HN_data_PostModule.csv",
col_types = "cnfnnnnnnnnnnnfnnnnnnnnnnff")
# look at all the variables by using the skim function
skim(highnote)
| Name | highnote |
| Number of rows | 107213 |
| Number of columns | 27 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 4 |
| numeric | 22 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| net_user | 0 | 1 | 1 | 19 | 0 | 107205 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| male | 0 | 1 | FALSE | 3 | 1: 42548, NA: 38950, 0: 25715 |
| good_country | 0 | 1 | FALSE | 3 | 0: 43025, NA: 39155, 1: 25033 |
| delta1_good_country | 0 | 1 | FALSE | 4 | 0: 67728, NA: 39393, 1: 57, -1: 35 |
| adopter | 0 | 1 | FALSE | 2 | 0: 100000, 1: 7213 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 50859 | 0.53 | 24.39 | 6.84 | 8 | 20.00 | 23.00 | 27.00 | 79.0 | ▇▇▁▁▁ |
| friend_cnt | 1 | 1.00 | 12.23 | 48.19 | 0 | 1.00 | 3.00 | 10.00 | 5089.0 | ▇▁▁▁▁ |
| avg_friend_age | 21626 | 0.80 | 24.61 | 5.73 | 8 | 20.99 | 23.33 | 27.00 | 79.0 | ▆▇▁▁▁ |
| avg_friend_male | 16517 | 0.85 | 0.63 | 0.35 | 0 | 0.40 | 0.67 | 1.00 | 1.0 | ▃▂▃▃▇ |
| friend_country_cnt | 1 | 1.00 | 2.79 | 4.98 | 0 | 1.00 | 1.00 | 3.00 | 136.0 | ▇▁▁▁▁ |
| subscriber_friend_cnt | 1 | 1.00 | 0.33 | 2.12 | 0 | 0.00 | 0.00 | 0.00 | 309.0 | ▇▁▁▁▁ |
| songsListened | 0 | 1.00 | 12863.89 | 25193.67 | 0 | 439.00 | 3432.00 | 14785.00 | 1000000.0 | ▇▁▁▁▁ |
| lovedTracks | 0 | 1.00 | 77.76 | 284.19 | 0 | 0.00 | 9.00 | 58.00 | 44005.0 | ▇▁▁▁▁ |
| posts | 0 | 1.00 | 3.77 | 93.96 | 0 | 0.00 | 0.00 | 0.00 | 15185.0 | ▇▁▁▁▁ |
| playlists | 0 | 1.00 | 0.53 | 6.14 | 0 | 0.00 | 0.00 | 1.00 | 1943.0 | ▇▁▁▁▁ |
| shouts | 1927 | 0.98 | 20.85 | 261.08 | 0 | 1.00 | 2.00 | 6.00 | 65872.0 | ▇▁▁▁▁ |
| tenure | 32 | 1.00 | 39.55 | 19.29 | 0 | 24.00 | 38.00 | 54.00 | 111.0 | ▅▇▆▂▁ |
| delta1_friend_cnt | 26 | 1.00 | 0.47 | 5.34 | -486 | 0.00 | 0.00 | 0.00 | 521.0 | ▁▁▇▁▁ |
| delta1_avg_friend_age | 22057 | 0.79 | 0.23 | 0.71 | -50 | 0.00 | 0.14 | 0.38 | 19.5 | ▁▁▁▇▁ |
| delta1_avg_friend_male | 16826 | 0.84 | 0.00 | 0.05 | -1 | 0.00 | 0.00 | 0.00 | 1.0 | ▁▁▇▁▁ |
| delta1_friend_country_cnt | 26 | 1.00 | 0.06 | 0.68 | -52 | 0.00 | 0.00 | 0.00 | 30.0 | ▁▁▁▇▁ |
| delta1_subscriber_friend_cnt | 26 | 1.00 | -0.01 | 0.48 | -23 | 0.00 | 0.00 | 0.00 | 43.0 | ▁▇▁▁▁ |
| delta1_songsListened | 1001 | 0.99 | 587.81 | 2135.08 | -93584 | 0.00 | 0.00 | 310.00 | 129096.0 | ▁▁▇▁▁ |
| delta1_lovedTracks | 27 | 1.00 | 3.26 | 24.32 | -1040 | 0.00 | 0.00 | 0.00 | 1338.0 | ▁▁▇▁▁ |
| delta1_posts | 27 | 1.00 | 0.15 | 13.53 | -6 | 0.00 | 0.00 | 0.00 | 3391.0 | ▇▁▁▁▁ |
| delta1_playlists | 27 | 1.00 | 0.00 | 0.38 | -8 | 0.00 | 0.00 | 0.00 | 116.0 | ▇▁▁▁▁ |
| delta1_shouts | 1995 | 0.98 | 0.78 | 36.46 | -953 | 0.00 | 0.00 | 0.00 | 10022.0 | ▇▁▁▁▁ |
#perform exploratory analysis of the categorical features
highnote %>% keep(is.factor) %>% summary()
## male good_country delta1_good_country adopter
## 0 :25715 1 :25033 0 :67728 0:100000
## NA:38950 NA:39155 NA:39393 1: 7213
## 1 :42548 0 :43025 1 : 57
## -1: 35
# replace missing values in the factor variables from "NA" to "UNK".
highnote <- highnote %>%
mutate(male = recode(male, "NA" = "UNK")) %>%
mutate(good_country = recode(as.factor(good_country), "NA" = "UNK")) %>%
mutate(delta1_good_country = recode(delta1_good_country, "NA" = "UNK"))
#perform exploratory analysis of the numeric variables
highnote %>% keep(is.numeric) %>% summary()
## age friend_cnt avg_friend_age avg_friend_male
## Min. : 8.00 Min. : 0.00 Min. : 8.00 Min. :0.000
## 1st Qu.:20.00 1st Qu.: 1.00 1st Qu.:20.99 1st Qu.:0.400
## Median :23.00 Median : 3.00 Median :23.33 Median :0.670
## Mean :24.39 Mean : 12.23 Mean :24.61 Mean :0.632
## 3rd Qu.:27.00 3rd Qu.: 10.00 3rd Qu.:27.00 3rd Qu.:1.000
## Max. :79.00 Max. :5089.00 Max. :79.00 Max. :1.000
## NA's :50859 NA's :1 NA's :21626 NA's :16517
## friend_country_cnt subscriber_friend_cnt songsListened lovedTracks
## Min. : 0.000 Min. : 0.0000 Min. : 0 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.: 439 1st Qu.: 0.00
## Median : 1.000 Median : 0.0000 Median : 3432 Median : 9.00
## Mean : 2.788 Mean : 0.3348 Mean : 12864 Mean : 77.76
## 3rd Qu.: 3.000 3rd Qu.: 0.0000 3rd Qu.: 14785 3rd Qu.: 58.00
## Max. :136.000 Max. :309.0000 Max. :1000000 Max. :44005.00
## NA's :1 NA's :1
## posts playlists shouts tenure
## Min. : 0.000 Min. : 0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 1.00 1st Qu.: 24.00
## Median : 0.000 Median : 0.0000 Median : 2.00 Median : 38.00
## Mean : 3.773 Mean : 0.5304 Mean : 20.85 Mean : 39.55
## 3rd Qu.: 0.000 3rd Qu.: 1.0000 3rd Qu.: 6.00 3rd Qu.: 54.00
## Max. :15185.000 Max. :1943.0000 Max. :65872.00 Max. :111.00
## NA's :1927 NA's :32
## delta1_friend_cnt delta1_avg_friend_age delta1_avg_friend_male
## Min. :-486.0000 Min. :-50.000 Min. :-1
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0
## Median : 0.0000 Median : 0.143 Median : 0
## Mean : 0.4707 Mean : 0.229 Mean : 0
## 3rd Qu.: 0.0000 3rd Qu.: 0.375 3rd Qu.: 0
## Max. : 521.0000 Max. : 19.500 Max. : 1
## NA's :26 NA's :22057 NA's :16826
## delta1_friend_country_cnt delta1_subscriber_friend_cnt delta1_songsListened
## Min. :-52.00000 Min. :-23.0000 Min. :-93584.0
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.0
## Median : 0.00000 Median : 0.0000 Median : 0.0
## Mean : 0.06187 Mean : -0.0135 Mean : 587.8
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 3rd Qu.: 310.0
## Max. : 30.00000 Max. : 43.0000 Max. :129096.0
## NA's :26 NA's :26 NA's :1001
## delta1_lovedTracks delta1_posts delta1_playlists delta1_shouts
## Min. :-1040.000 Min. : -6.000 Min. : -8.0000 Min. : -953.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.0000 Median : 0.000
## Mean : 3.262 Mean : 0.146 Mean : 0.0042 Mean : 0.779
## 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. : 1338.000 Max. :3391.000 Max. :116.0000 Max. :10022.000
## NA's :27 NA's :27 NA's :27 NA's :1995
Completion rate for Age is fairly low at 53%, while the completion rates for avg_friend_age (80%), avg_friend_male (85%), delta1_avg_friend_age (79%) and delta1_avg_friend_male (84%) are also concerning.
#perform exploratory analysis of the numeric variables
highnote %>% keep(is.factor) %>% summary()
## male good_country delta1_good_country adopter
## 0 :25715 1 :25033 0 :67728 0:100000
## UNK:38950 UNK:39155 UNK:39393 1: 7213
## 1 :42548 0 :43025 1 : 57
## -1 : 35
# replace missing values in the numeric variables with mean or median by gender
highnote <- highnote %>%
group_by(male) %>%
mutate(age=ifelse(is.na(age), median(age, na.rm=TRUE),age)) %>%
mutate(friend_cnt=ifelse(is.na(friend_cnt), median(friend_cnt, na.rm=TRUE), friend_cnt)) %>%
mutate(avg_friend_age=ifelse(is.na(avg_friend_age),
mean(avg_friend_age, na.rm=TRUE),avg_friend_age)) %>%
mutate(avg_friend_male=ifelse(is.na(avg_friend_male),
mean(avg_friend_male, na.rm=TRUE), avg_friend_male)) %>%
mutate(friend_country_cnt=ifelse(is.na(friend_country_cnt),
median(friend_country_cnt, na.rm = TRUE),
friend_country_cnt)) %>%
mutate(subscriber_friend_cnt=ifelse(is.na(subscriber_friend_cnt),
median(subscriber_friend_cnt,na.rm = TRUE),
subscriber_friend_cnt)) %>%
mutate(shouts=ifelse(is.na(shouts),median(shouts,na.rm = TRUE),shouts)) %>%
mutate(tenure=ifelse(is.na(tenure),median(tenure,na.rm=TRUE),tenure)) %>%
# the delta1 (data recorded in the past period)
mutate(delta1_friend_cnt=ifelse(is.na(delta1_friend_cnt),
median(delta1_friend_cnt,na.rm=TRUE),
delta1_friend_cnt)) %>%
mutate(delta1_avg_friend_age=ifelse(is.na(delta1_avg_friend_age),
mean(delta1_avg_friend_age, na.rm=TRUE),
delta1_avg_friend_age)) %>%
mutate(delta1_avg_friend_male=ifelse(is.na(delta1_avg_friend_male),
mean(delta1_avg_friend_male, na.rm=TRUE),
delta1_avg_friend_male)) %>%
mutate(delta1_friend_country_cnt=ifelse(is.na(delta1_friend_country_cnt),
median(delta1_friend_country_cnt,na.rm=TRUE),
delta1_friend_country_cnt)) %>%
mutate(delta1_subscriber_friend_cnt=ifelse(is.na(delta1_subscriber_friend_cnt),
median(delta1_subscriber_friend_cnt,na.rm=TRUE),
delta1_subscriber_friend_cnt)) %>%
mutate(delta1_songsListened=ifelse(is.na(delta1_songsListened),
median(delta1_songsListened, na.rm=TRUE),
delta1_songsListened)) %>%
mutate(delta1_lovedTracks=ifelse(is.na(delta1_lovedTracks),
median(delta1_lovedTracks,na.rm = TRUE),
delta1_lovedTracks)) %>%
mutate(delta1_posts=ifelse(is.na(delta1_posts), median(delta1_posts,na.rm=TRUE),
delta1_posts)) %>%
mutate(delta1_playlists=ifelse(is.na(delta1_playlists), median(delta1_playlists,na.rm = TRUE),
delta1_playlists)) %>%
mutate(delta1_shouts=ifelse(is.na(delta1_shouts ),median(delta1_shouts,na.rm = TRUE),
delta1_shouts )) %>%
ungroup()
#double check if the NAs are replaced successfully
skim(highnote)
| Name | highnote |
| Number of rows | 107213 |
| Number of columns | 27 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 4 |
| numeric | 22 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| net_user | 0 | 1 | 1 | 19 | 0 | 107205 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| male | 0 | 1 | FALSE | 3 | 1: 42548, UNK: 38950, 0: 25715 |
| good_country | 0 | 1 | FALSE | 3 | 0: 43025, UNK: 39155, 1: 25033 |
| delta1_good_country | 0 | 1 | FALSE | 4 | 0: 67728, UNK: 39393, 1: 57, -1: 35 |
| adopter | 0 | 1 | FALSE | 2 | 0: 100000, 1: 7213 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 23.76 | 5.02 | 8 | 22.00 | 23.00 | 24.00 | 79.0 | ▃▇▁▁▁ |
| friend_cnt | 0 | 1 | 12.23 | 48.19 | 0 | 1.00 | 3.00 | 10.00 | 5089.0 | ▇▁▁▁▁ |
| avg_friend_age | 0 | 1 | 24.70 | 5.13 | 8 | 21.45 | 24.51 | 26.00 | 79.0 | ▃▇▁▁▁ |
| avg_friend_male | 0 | 1 | 0.63 | 0.33 | 0 | 0.50 | 0.66 | 1.00 | 1.0 | ▃▂▃▇▇ |
| friend_country_cnt | 0 | 1 | 2.79 | 4.98 | 0 | 1.00 | 1.00 | 3.00 | 136.0 | ▇▁▁▁▁ |
| subscriber_friend_cnt | 0 | 1 | 0.33 | 2.12 | 0 | 0.00 | 0.00 | 0.00 | 309.0 | ▇▁▁▁▁ |
| songsListened | 0 | 1 | 12863.89 | 25193.67 | 0 | 439.00 | 3432.00 | 14785.00 | 1000000.0 | ▇▁▁▁▁ |
| lovedTracks | 0 | 1 | 77.76 | 284.19 | 0 | 0.00 | 9.00 | 58.00 | 44005.0 | ▇▁▁▁▁ |
| posts | 0 | 1 | 3.77 | 93.96 | 0 | 0.00 | 0.00 | 0.00 | 15185.0 | ▇▁▁▁▁ |
| playlists | 0 | 1 | 0.53 | 6.14 | 0 | 0.00 | 0.00 | 1.00 | 1943.0 | ▇▁▁▁▁ |
| shouts | 0 | 1 | 20.53 | 258.74 | 0 | 1.00 | 2.00 | 6.00 | 65872.0 | ▇▁▁▁▁ |
| tenure | 0 | 1 | 39.55 | 19.28 | 0 | 24.00 | 38.00 | 54.00 | 111.0 | ▅▇▆▂▁ |
| delta1_friend_cnt | 0 | 1 | 0.47 | 5.34 | -486 | 0.00 | 0.00 | 0.00 | 521.0 | ▁▁▇▁▁ |
| delta1_avg_friend_age | 0 | 1 | 0.23 | 0.63 | -50 | 0.00 | 0.22 | 0.32 | 19.5 | ▁▁▁▇▁ |
| delta1_avg_friend_male | 0 | 1 | 0.00 | 0.05 | -1 | 0.00 | 0.00 | 0.00 | 1.0 | ▁▁▇▁▁ |
| delta1_friend_country_cnt | 0 | 1 | 0.06 | 0.68 | -52 | 0.00 | 0.00 | 0.00 | 30.0 | ▁▁▁▇▁ |
| delta1_subscriber_friend_cnt | 0 | 1 | -0.01 | 0.48 | -23 | 0.00 | 0.00 | 0.00 | 43.0 | ▁▇▁▁▁ |
| delta1_songsListened | 0 | 1 | 582.32 | 2125.84 | -93584 | 0.00 | 0.00 | 298.00 | 129096.0 | ▁▁▇▁▁ |
| delta1_lovedTracks | 0 | 1 | 3.26 | 24.32 | -1040 | 0.00 | 0.00 | 0.00 | 1338.0 | ▁▁▇▁▁ |
| delta1_posts | 0 | 1 | 0.15 | 13.53 | -6 | 0.00 | 0.00 | 0.00 | 3391.0 | ▇▁▁▁▁ |
| delta1_playlists | 0 | 1 | 0.00 | 0.38 | -8 | 0.00 | 0.00 | 0.00 | 116.0 | ▇▁▁▁▁ |
| delta1_shouts | 0 | 1 | 0.76 | 36.12 | -953 | 0.00 | 0.00 | 0.00 | 10022.0 | ▇▁▁▁▁ |
2.3 Split data into predictors(X) and outcome(Y)
# create X and Y data frame
highnote_y <- highnote %>% pull("adopter") %>% as.factor()
highnote_prenorm_numeric_x <- highnote %>% select(-c("net_user", "adopter", "male", "delta1_good_country", "good_country")) # x set is the predictors
highnote_prenorm_factor_x <- highnote %>% select(c("male", "delta1_good_country", "good_country"))
2.4 Normalization Create a normalization function that normalizes columns to ensure that each column is at same scale.
# function to normalize data (0 to 1)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
# Normalize x variables since they are at different scale
highnote_x_normalized_1 <- as.data.frame(lapply(highnote_prenorm_numeric_x, normalize))
highnote_x_normalized <- cbind(highnote_x_normalized_1,highnote_prenorm_factor_x)
2.5 Split the data into training and test
We split up the data set into test and training data in the ratio of 25% : 75%.
set.seed(1234)
# Split the data into training and test (75%/25%)
smp_size <- floor(0.75 * nrow(highnote_x_normalized))
# Randomly select row numbers for training data set
train_ind <- sample(seq_len(nrow(highnote_x_normalized)), size = smp_size)
# Create training and test sets for x
highnote_x_train <- highnote_x_normalized[train_ind, ]
highnote_x_test <- highnote_x_normalized[-train_ind, ]
# Create training and test sets for y
highnote_y_train <- highnote_y[train_ind]
highnote_y_test <- highnote_y[-train_ind]
# Create an empty data frame to store results from different models
clf_results <- data.frame(matrix(ncol = 5, nrow = 0))
names(clf_results) <- c("Model", "Accuracy", "Precision", "Recall", "F1")
# Create an empty data frame to store TP, TN, FP and FN values
cost_benefit_df <- data.frame(matrix(ncol = 5, nrow = 0))
names(cost_benefit_df) <- c("Model", "TP", "FN", "FP", "TN")
2.6 SMOTE to deal with class imbalance
# Check the proportions for the class between all 3 datasets.
round(prop.table(table(highnote$adopter)), 4) * 100
##
## 0 1
## 93.27 6.73
round(prop.table(table(highnote_y_train)), 4) * 100
## highnote_y_train
## 0 1
## 93.23 6.77
round(prop.table(table(highnote_y_test)), 4) * 100
## highnote_y_test
## 0 1
## 93.41 6.59
The proportion of “0” and “1” is highly imbalanced in each of the three above data set. Over 93% of the adopter results represents “0” (non-converted free users). If we end up using these data sets, the models will typically over-classify the majority class. Hence, we are balancing class distribution by randomly increasing minority class examples by replicating them using SMOTE (Synthetic Minority Oversampling Technique).
# install.packages("abind")
library("abind")
install.packages( "https://cran.r-project.org/src/contrib/Archive/DMwR/DMwR_0.4.1.tar.gz", repos=NULL, type="source" )
library("DMwR")
set.seed(1234)
# create the full training dataset with X and Y variable
highnote_train <- cbind(highnote_x_train, highnote_y_train)
# parameters for SMOTE are 1) prediction model, 2) data, 3) percent to oversample, 4) percent to undersample)
highnote_train_balanced <- SMOTE(highnote_y_train ~ ., data.frame(highnote_train), perc.over = 100, perc.under = 200)
# Check the proportions for the class between balanced data sets.
round(prop.table(table(highnote_train_balanced$highnote_y_train)), 4) * 100
##
## 0 1
## 50 50
After we apply SMOTE, the data set is now balanced, having equal proportions for “0” and “1”
# remove the Y column from the newly balanced training set
highnote_x_train <- highnote_train_balanced %>% select(-highnote_y_train)
# store the Y column
highnote_y_train <- highnote_train_balanced %>% pull(highnote_y_train) %>% as.factor()
2.7 Transforming dataframe with One Hot Encoding. The approach is that we split the factor variables from the numeric variables, so that they can be one hot encoded.
#Split the train predictor variables into factors & numeric prior to doing one hot encoding transformation on the factor variables
highnote_x_train_factors <- highnote_x_train %>% select(c("delta1_good_country","good_country","male"))
highnote_x_train_numeric <- highnote_x_train %>% select (-c("delta1_good_country","good_country","male"))
#Perform the same splitting on test dataset
highnote_x_test_factors <- highnote_x_test %>% select(c("delta1_good_country","good_country","male"))
highnote_x_test_numeric <- highnote_x_test %>% select (-c("delta1_good_country","good_country","male"))
#Performing One Hot Encoding on both Datasets
library(dummies)
highnote_x_train_factors_ohe <- dummy.data.frame(data=highnote_x_train_factors, sep=" ")
highnote_x_test_factors_ohe <- dummy.data.frame(data=highnote_x_test_factors, sep=" ")
#Combining the factors and numeric variables together using cbind()
highnote_x_train_ohe <- cbind(highnote_x_train_factors_ohe,highnote_x_train_numeric)
highnote_x_test_ohe <- cbind(highnote_x_test_factors_ohe,highnote_x_test_numeric)
3.1 Logistic Regression
3.1.1 Logistic Regression for All Data
glm_fit <- train(highnote_x_train_ohe,
highnote_y_train,
method = "glm",
family = "binomial",
preProc = c("center", "scale"))
# predict on test data
glm_predict <- predict(glm_fit, newdata=highnote_x_test_ohe)
# predict on test data prob
glm_predict_prob <- predict(glm_fit, newdata = highnote_x_test_ohe, type="prob")
Based on a threshold, converting probability into categorical outcome
y_pred_num <- ifelse(glm_predict_prob[,2] > 0.6,
"1", "0")
#As the dataset is imbalanced it is important to focus on class specific metrics like recall, precision and f-measure instead of accuracy. While accuracy continues to increase as we increase threshold as the model is conservative and classifies "0" as the class in most cases, but f-measure drops. Hence choosing a threshold which has best f-measure.
# Print Confusion matrix, Accuracy, Sensitivity etc
confusionMatrix(as.factor(y_pred_num), as.factor(highnote_y_test), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22360 1047
## 1 2677 720
##
## Accuracy : 0.8611
## 95% CI : (0.8569, 0.8652)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2104
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.40747
## Specificity : 0.89308
## Pos Pred Value : 0.21195
## Neg Pred Value : 0.95527
## Prevalence : 0.06592
## Detection Rate : 0.02686
## Detection Prevalence : 0.12673
## Balanced Accuracy : 0.65027
##
## 'Positive' Class : 1
##
# Add results into clf_results dataframe
x1 <- confusionMatrix(as.factor(y_pred_num), as.factor(highnote_y_test), positive = "1")[["overall"]]
y1 <- confusionMatrix(as.factor(y_pred_num), as.factor(highnote_y_test), positive = "1")[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Logistic Regression (All Time)",
Accuracy = round (x1[["Accuracy"]],3),
Precision = round (y1[["Precision"]],3),
Recall = round (y1[["Recall"]],3),
F1 = round (y1[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x1[["Accuracy"]],3), "and F1 is ", round (y1[["F1"]],3) )
## Accuracy is 0.861 and F1 is 0.279
# Add results into cost_benefit_df dataframe for cost benefit analysis
a1 <- confusionMatrix(as.factor(y_pred_num), as.factor(highnote_y_test))
# pickign up the TP, FN, FP and TN
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Logistic Regression (All Time)",
TP = a1[["table"]][4],
FN = a1[["table"]][3],
FP = a1[["table"]][2],
TN = a1[["table"]][1])
3.1.2 Logistic Regression for Only past 3 months change data
# remove present variables in the training and test data sets
highnote_x_train_delta1 <- highnote_x_train_ohe %>%
select(-c("friend_cnt","avg_friend_age","avg_friend_male","friend_country_cnt","subscriber_friend_cnt","songsListened","lovedTracks","posts","playlists","shouts"))
highnote_x_test_delta1 <- highnote_x_test_ohe %>%
select(-c("friend_cnt","avg_friend_age","avg_friend_male","friend_country_cnt","subscriber_friend_cnt","songsListened","lovedTracks","posts","playlists","shouts"))
glm_fit_delta1 <- train(highnote_x_train_delta1,
highnote_y_train,
method = "glm",
family = "binomial",
preProc = c("center", "scale"))
# predict on test data
glm_predict_delta1 <- predict(glm_fit_delta1, newdata=highnote_x_test_delta1)
# predict on test data prob
glm_predict_prob_delta1 <- predict(glm_fit_delta1, newdata = highnote_x_test_delta1, type="prob")
# convert probability outcome into categorical ones.
y_pred_num_delta1 <- ifelse(glm_predict_prob_delta1[,2] > 0.6,
"1", "0")
# Print Confusion matrix, Accuarcy, Sensitivity etc
confusionMatrix(as.factor(y_pred_num_delta1), as.factor(highnote_y_test), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22622 1214
## 1 2415 553
##
## Accuracy : 0.8646
## 95% CI : (0.8605, 0.8687)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1645
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.31296
## Specificity : 0.90354
## Pos Pred Value : 0.18632
## Neg Pred Value : 0.94907
## Prevalence : 0.06592
## Detection Rate : 0.02063
## Detection Prevalence : 0.11073
## Balanced Accuracy : 0.60825
##
## 'Positive' Class : 1
##
# Add results into clf_results dataframe
x1_d <- confusionMatrix(as.factor(y_pred_num_delta1), as.factor(highnote_y_test), positive = "1")[["overall"]]
y1_d <- confusionMatrix(as.factor(y_pred_num_delta1), as.factor(highnote_y_test), positive = "1")[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Logistic Regression (Past)",
Accuracy = round (x1_d[["Accuracy"]],3),
Precision = round (y1_d[["Precision"]],3),
Recall = round (y1_d[["Recall"]],3),
F1 = round (y1_d[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x1_d[["Accuracy"]],3), "and F1 is ", round (y1_d[["F1"]],3) )
## Accuracy is 0.865 and F1 is 0.234
# Add results into cost_benefit_df dataframe for cost benefit analysis
a1_d <- confusionMatrix(as.factor(y_pred_num_delta1), as.factor(highnote_y_test))
# pickign up the TP, FN, FP and TN
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Logistic Regression (Past)",
TP = a1_d[["table"]][4],
FN = a1_d[["table"]][3],
FP = a1_d[["table"]][2],
TN = a1_d[["table"]][1])
3.2 Decision Tree Classification 3.2.1 Decision Tree for All Data
# Cross validation
cross_validation <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated three times
repeats = 3)
# Hyperparamter tuning
# maxdepth = the maximum depth of the tree that will be created or
# the length of the longest path from the tree root to a leaf.
Param_Grid <- expand.grid(maxdepth = 2:10)
dtree_fit <- train(highnote_x_train_ohe,
highnote_y_train,
method = "rpart2",
# split - criteria to split nodes
parms = list(split = "gini"),
tuneGrid = Param_Grid,
trControl = cross_validation,
# preProc - perform listed pre-processing to predictor dataframe
preProc = c("center", "scale"))
# check the accuracy for different models
dtree_fit
## CART
##
## 21784 samples
## 32 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (32), scaled (32)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 19606, 19605, 19605, 19606, 19606, 19605, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 2 0.7440474 0.4880943
## 3 0.7442157 0.4884310
## 4 0.7447054 0.4894108
## 5 0.7447054 0.4894108
## 6 0.7447054 0.4894108
## 7 0.7447054 0.4894108
## 8 0.7447054 0.4894108
## 9 0.7447054 0.4894108
## 10 0.7447054 0.4894108
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 4.
# print the final model
dtree_fit$finalModel
## n= 21784
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 21784 10892 0 (0.5000000 0.5000000)
## 2) delta1_songsListened< -0.3632505 8896 1974 0 (0.7781025 0.2218975) *
## 3) delta1_songsListened>=-0.3632505 12888 3970 1 (0.3080385 0.6919615)
## 6) lovedTracks< -0.2982294 1622 614 0 (0.6214550 0.3785450) *
## 7) lovedTracks>=-0.2982294 11266 2962 1 (0.2629150 0.7370850) *
# Plot decision tree
prp(dtree_fit$finalModel, box.palette = "Blue", tweak = 1.2)
# Predict on test data
dtree_predict <- predict(dtree_fit, newdata = highnote_x_test_ohe)
# predict probabilities
dtree_predict_prob <- predict(dtree_fit, newdata = highnote_x_test_ohe, type="prob")
# Print Confusion matrix, Accuracy, Sensitivity etc
confusionMatrix(dtree_predict, highnote_y_test, positive ="1" )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18156 585
## 1 6881 1182
##
## Accuracy : 0.7215
## 95% CI : (0.7161, 0.7268)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1484
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.66893
## Specificity : 0.72517
## Pos Pred Value : 0.14660
## Neg Pred Value : 0.96879
## Prevalence : 0.06592
## Detection Rate : 0.04410
## Detection Prevalence : 0.30081
## Balanced Accuracy : 0.69705
##
## 'Positive' Class : 1
##
# Add results into clf_results dataframe
x2 <- confusionMatrix(dtree_predict, highnote_y_test, positive ="1" )[["overall"]]
y2 <- confusionMatrix(dtree_predict, highnote_y_test, positive ="1" )[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Decision Tree (All Time)",
Accuracy = round (x2[["Accuracy"]],3),
Precision = round (y2[["Precision"]],3),
Recall = round (y2[["Recall"]],3),
F1 = round (y2[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x2[["Accuracy"]],3), "and F1 is ", round (y2[["F1"]],3) )
## Accuracy is 0.721 and F1 is 0.24
# Add results into cost_benefit_df dataframe for cost benefit analysis
a2 <- confusionMatrix(dtree_predict, highnote_y_test, positive ="1" )
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Decision Tree (All Time)",
TP = a2[["table"]][4],
FN = a2[["table"]][3],
FP = a2[["table"]][2],
TN = a2[["table"]][1])
3.2.2 Decision Tree for Only Past 3 months change data
# Cross validation
cross_validation <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated three times
repeats = 3)
# Hyperparamter tuning
# maxdepth = the maximum depth of the tree that will be created or
# the length of the longest path from the tree root to a leaf.
Param_Grid <- expand.grid(maxdepth = 2:10)
dtree_fit_delta1 <- train(highnote_x_train_delta1,
highnote_y_train,
method = "rpart2",
# split - criteria to split nodes
parms = list(split = "gini"),
tuneGrid = Param_Grid,
trControl = cross_validation,
# preProc - perform listed pre-processing to predictor dataframe
preProc = c("center", "scale"))
# check the accuracy for different models
dtree_fit_delta1
## CART
##
## 21784 samples
## 22 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 19605, 19606, 19606, 19606, 19606, 19606, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 2 0.7414611 0.4829227
## 3 0.7414611 0.4829227
## 4 0.7514836 0.5029677
## 5 0.7514836 0.5029677
## 6 0.7514836 0.5029677
## 7 0.7514836 0.5029677
## 8 0.7514836 0.5029677
## 9 0.7514836 0.5029677
## 10 0.7514836 0.5029677
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 4.
# print the final model
dtree_fit_delta1$finalModel
## n= 21784
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 21784 10892 0 (0.5000000 0.5000000)
## 2) delta1_songsListened< -0.3632505 8896 1974 0 (0.7781025 0.2218975)
## 4) age< -0.2341147 5787 937 0 (0.8380854 0.1619146) *
## 5) age>=-0.2341147 3109 1037 0 (0.6664522 0.3335478)
## 10) age>=-0.03440615 2877 805 0 (0.7201946 0.2798054) *
## 11) age< -0.03440615 232 0 1 (0.0000000 1.0000000) *
## 3) delta1_songsListened>=-0.3632505 12888 3970 1 (0.3080385 0.6919615)
## 6) delta1_lovedTracks< -0.202196 5489 2419 1 (0.4406996 0.5593004)
## 12) age< -0.2352699 2864 1276 0 (0.5544693 0.4455307) *
## 13) age>=-0.2352699 2625 831 1 (0.3165714 0.6834286) *
## 7) delta1_lovedTracks>=-0.202196 7399 1551 1 (0.2096229 0.7903771) *
# Plot decision tree
prp(dtree_fit_delta1$finalModel, box.palette = "Blue", tweak = 1.2)
# Predict on test data
dtree_predict_delta1 <- predict(dtree_fit_delta1, newdata = highnote_x_test_delta1)
# predict probabilities
dtree_predict_prob_delta1 <- predict(dtree_fit_delta1, newdata = highnote_x_test_delta1, type="prob")
# Print Confusion matrix, Accuarcy, Sensitivity etc
confusionMatrix(dtree_predict_delta1, highnote_y_test, positive ="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 19446 696
## 1 5591 1071
##
## Accuracy : 0.7654
## 95% CI : (0.7603, 0.7705)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1674
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60611
## Specificity : 0.77669
## Pos Pred Value : 0.16076
## Neg Pred Value : 0.96545
## Prevalence : 0.06592
## Detection Rate : 0.03996
## Detection Prevalence : 0.24854
## Balanced Accuracy : 0.69140
##
## 'Positive' Class : 1
##
# Add results into clf_results dataframe
x2_d <- confusionMatrix(dtree_predict_delta1, highnote_y_test, positive ="1" )[["overall"]]
y2_d <- confusionMatrix(dtree_predict_delta1, highnote_y_test, positive ="1" )[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Decision Tree (Past)",
Accuracy = round (x2_d[["Accuracy"]],3),
Precision = round (y2_d[["Precision"]],3),
Recall = round (y2_d[["Recall"]],3),
F1 = round (y2_d[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x2_d[["Accuracy"]],3), "and F1 is ", round (y2_d[["F1"]],3) )
## Accuracy is 0.765 and F1 is 0.254
The performance of the Decision tree model based on past data is actually better than the Decision tree model based on All data, with a higher F1.
# Add results into cost_benefit_df dataframe for cost benefit analysis
a2_d <- confusionMatrix(dtree_predict_delta1, highnote_y_test, positive ="1" )
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Decision Tree (Past)",
TP = a2_d[["table"]][4],
FN = a2_d[["table"]][3],
FP = a2_d[["table"]][2],
TN = a2_d[["table"]][1])
3.3 K-Nearest Neighbour Classification 3.3.1 KNN for All Data
# Cross validation
cross_validation <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated three times
repeats = 3)
# Hyper-paramter tuning
# k = number of nearest neighbours
Param_Grid <- expand.grid(k = 1:10)
# fit the model to training data
knn_clf_fit <- train(highnote_x_train_ohe,
highnote_y_train,
method = "knn",
tuneGrid = Param_Grid,
trControl = cross_validation )
# check the accuracy for different models
knn_clf_fit
## k-Nearest Neighbors
##
## 21784 samples
## 32 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 19605, 19605, 19606, 19606, 19605, 19606, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.7087620 0.4175238
## 2 0.6836824 0.3673645
## 3 0.7032685 0.4065364
## 4 0.6935520 0.3871033
## 5 0.6971630 0.3943250
## 6 0.6936279 0.3872552
## 7 0.6984481 0.3968961
## 8 0.6950210 0.3900417
## 9 0.6988768 0.3977533
## 10 0.6974843 0.3949688
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
# Plot accuracies for different k values
plot(knn_clf_fit)
# print the best model
print(knn_clf_fit$finalModel)
## 1-nearest neighbor model
## Training set outcome distribution:
##
## 0 1
## 10892 10892
# Predict on test data
knnPredict <- predict(knn_clf_fit, newdata = highnote_x_test_ohe)
knnProb <- predict(knn_clf_fit, newdata=highnote_x_test_ohe, type="prob")
# Print Confusion matrix, Accuracy, Sensitivity etc
confusionMatrix(knnPredict, highnote_y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16980 957
## 1 8057 810
##
## Accuracy : 0.6637
## 95% CI : (0.658, 0.6694)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0476
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.67820
## Specificity : 0.45840
## Pos Pred Value : 0.94665
## Neg Pred Value : 0.09135
## Prevalence : 0.93408
## Detection Rate : 0.63349
## Detection Prevalence : 0.66919
## Balanced Accuracy : 0.56830
##
## 'Positive' Class : 0
##
# Add results into clf_results dataframe
x3 <- confusionMatrix(knnPredict, highnote_y_test, positive ="1" )[["overall"]]
y3 <- confusionMatrix(knnPredict, highnote_y_test, positive ="1" )[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "KNN (All Time)",
Accuracy = round (x3[["Accuracy"]],3),
Precision = round (y3[["Precision"]],3),
Recall = round (y3[["Recall"]],3),
F1 = round (y3[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x3[["Accuracy"]],3), "and F1 is ", round (y3[["F1"]],3) )
## Accuracy is 0.664 and F1 is 0.152
# Add results into cost_benefit_df dataframe for cost benefit analysis
a3 <- confusionMatrix(knnPredict, highnote_y_test, positive ="1")
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "KNN (All Time)",
TP = a3[["table"]][4],
FN = a3[["table"]][3],
FP = a3[["table"]][2],
TN = a3[["table"]][1])
3.3.2 KNN for Change in the past 3 months Data
# Cross validation
cross_validation <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated three times
repeats = 3)
# Hyper-paramter tuning
# k = number of nearest neighbours
Param_Grid <- expand.grid(k = 1:10)
# fit the model to training data
knn_clf_fit_delta1 <- train(highnote_x_train_delta1,
highnote_y_train,
method = "knn",
tuneGrid = Param_Grid,
trControl = cross_validation )
# check the accuracy for different models
knn_clf_fit_delta1
## k-Nearest Neighbors
##
## 21784 samples
## 22 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 19605, 19605, 19606, 19606, 19606, 19606, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.7127851 0.4255693
## 2 0.6916073 0.3832146
## 3 0.7037121 0.4074238
## 4 0.6964892 0.3929780
## 5 0.7011101 0.4022197
## 6 0.6976060 0.3952113
## 7 0.6991977 0.3983951
## 8 0.6957242 0.3914481
## 9 0.6977132 0.3954260
## 10 0.6938576 0.3877149
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
# Plot accuracies for different k values
plot(knn_clf_fit_delta1)
# print the best model
print(knn_clf_fit_delta1$finalModel)
## 1-nearest neighbor model
## Training set outcome distribution:
##
## 0 1
## 10892 10892
# Predict on test data
knnPredict_delta1 <- predict(knn_clf_fit_delta1, newdata = highnote_x_test_delta1)
# Predict probabilities
knnProb_delta1 <- predict(knn_clf_fit_delta1, newdata=highnote_x_test_delta1, type="prob")
# Print Confusion matrix, Accuracy, Sensitivity etc
confusionMatrix(knnPredict_delta1, highnote_y_test, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18240 965
## 1 6797 802
##
## Accuracy : 0.7104
## 95% CI : (0.7049, 0.7158)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.072
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.45388
## Specificity : 0.72852
## Pos Pred Value : 0.10554
## Neg Pred Value : 0.94975
## Prevalence : 0.06592
## Detection Rate : 0.02992
## Detection Prevalence : 0.28350
## Balanced Accuracy : 0.59120
##
## 'Positive' Class : 1
##
# Add results into clf_results dataframe
x3_d <- confusionMatrix(knnPredict_delta1, highnote_y_test, positive ="1")[["overall"]]
y3_d <- confusionMatrix(knnPredict_delta1, highnote_y_test, positive ="1")[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "KNN (Past)",
Accuracy = round (x3_d[["Accuracy"]],3),
Precision = round (y3_d[["Precision"]],3),
Recall = round (y3_d[["Recall"]],3),
F1 = round (y3_d[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x3_d[["Accuracy"]],3), "and F1 is ", round (y3_d[["F1"]],3) )
## Accuracy is 0.71 and F1 is 0.171
# Add results into cost_benefit_df dataframe for cost benefit analysis
a3_d <- confusionMatrix(knnPredict_delta1, highnote_y_test, positive ="1")
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "KNN (Past)",
TP = a3_d[["table"]][4],
FN = a3_d[["table"]][3],
FP = a3_d[["table"]][2],
TN = a3_d[["table"]][1])
For both Knn and Dtree, the probability calculations are fairly coarse, hence while we tested for multiple threshold from 0.4 to 0.8 for all models, the accuracy wasn’t dramatically different or the f-measure wasn’t significantly better. In certain cases the confusion matrix didn’t change at all for different threshold levels.
3.4 XG Boost Classification 3.4.1 XG Boost for All Data
#Training the Model
XG_clf_fit <- train(highnote_x_train_ohe,
highnote_y_train,
verbosity = 0,
method = "xgbTree",
preProc = c("center", "scale"))
# print the final model
XG_clf_fit$finalModel
## ##### xgb.Booster
## raw: 170.1 Kb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, objective = "binary:logistic",
## verbosity = 0)
## params (as set within xgb.train):
## eta = "0.3", max_depth = "3", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "0.75", objective = "binary:logistic", verbosity = "0", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 32
## niter: 150
## nfeatures : 32
## xNames : delta1_good_country 0 delta1_good_country UNK delta1_good_country 1 delta1_good_country -1 good_country 1 good_country UNK good_country 0 male 0 male UNK male 1 age friend_cnt avg_friend_age avg_friend_male friend_country_cnt subscriber_friend_cnt songsListened lovedTracks posts playlists shouts tenure delta1_friend_cnt delta1_avg_friend_age delta1_avg_friend_male delta1_friend_country_cnt delta1_subscriber_friend_cnt delta1_songsListened delta1_lovedTracks delta1_posts delta1_playlists delta1_shouts
## problemType : Classification
## tuneValue :
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 42 150 3 0.3 0 0.6 1 0.75
## obsLevels : 0 1
## param :
## $verbosity
## [1] 0
# Predict on test data
XG_clf_predict <- predict(XG_clf_fit,newdata=highnote_x_test_ohe)
# predict probabilities
XG_clf_predict_prob <- predict(XG_clf_fit, newdata=highnote_x_test_ohe, type="prob")
# convert probability outcome into categorical ones.
y_pred_num_xg <- ifelse(XG_clf_predict_prob[,2] > 0.6, "1", "0")
#Highest value of f-measure across all threshold is at 0.6, and even accuracy is the highest
# Print Confusion matrix, Accuracy, Sensitivity etc
confusionMatrix(as.factor(y_pred_num_xg), as.factor(highnote_y_test), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22467 932
## 1 2570 835
##
## Accuracy : 0.8693
## 95% CI : (0.8653, 0.8734)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2585
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.47255
## Specificity : 0.89735
## Pos Pred Value : 0.24523
## Neg Pred Value : 0.96017
## Prevalence : 0.06592
## Detection Rate : 0.03115
## Detection Prevalence : 0.12703
## Balanced Accuracy : 0.68495
##
## 'Positive' Class : 1
##
# Add results into clf_results dataframe
x4 <- confusionMatrix(as.factor(y_pred_num_xg), as.factor(highnote_y_test), positive = "1" )[["overall"]]
y4 <- confusionMatrix(as.factor(y_pred_num_xg), as.factor(highnote_y_test), positive = "1" )[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "XG Boost (All Time)",
Accuracy = round (x4[["Accuracy"]],3),
Precision = round (y4[["Precision"]],3),
Recall = round (y4[["Recall"]],3),
F1 = round (y4[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x4[["Accuracy"]],3), "and F1 is ", round (y4[["F1"]],3) )
## Accuracy is 0.869 and F1 is 0.323
# Add results into cost_benefit_df dataframe for cost benefit analysis
a4 <- confusionMatrix(XG_clf_predict, highnote_y_test, positive="1" )
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "XG Boost (All Time)",
TP = a4[["table"]][4],
FN = a4[["table"]][3],
FP = a4[["table"]][2],
TN = a4[["table"]][1])
3.4.2 XG Boost for the change/delta in data for the past 3 months:
# Train the model
XG_clf_fit_delta1 <- train(highnote_x_train_delta1,
highnote_y_train,
verbosity = 0,
method = "xgbTree",
preProc = c("center", "scale"))
# print the final model
XG_clf_fit_delta1$finalModel
## ##### xgb.Booster
## raw: 167.6 Kb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, objective = "binary:logistic",
## verbosity = 0)
## params (as set within xgb.train):
## eta = "0.3", max_depth = "3", gamma = "0", colsample_bytree = "0.8", min_child_weight = "1", subsample = "1", objective = "binary:logistic", verbosity = "0", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 22
## niter: 150
## nfeatures : 22
## xNames : delta1_good_country 0 delta1_good_country UNK delta1_good_country 1 delta1_good_country -1 good_country 1 good_country UNK good_country 0 male 0 male UNK male 1 age tenure delta1_friend_cnt delta1_avg_friend_age delta1_avg_friend_male delta1_friend_country_cnt delta1_subscriber_friend_cnt delta1_songsListened delta1_lovedTracks delta1_posts delta1_playlists delta1_shouts
## problemType : Classification
## tuneValue :
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 54 150 3 0.3 0 0.8 1 1
## obsLevels : 0 1
## param :
## $verbosity
## [1] 0
# Predict on test data
XG_clf_predict_delta1 <- predict(XG_clf_fit_delta1,newdata=highnote_x_test_delta1)
# predict probabilities
XG_clf_predict_delta1_prob <- predict(XG_clf_fit_delta1, newdata=highnote_x_test_delta1, type="prob")
# Print Confusion matrix, Accuracy, Sensitivity etc
confusionMatrix(XG_clf_predict_delta1, highnote_y_test, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21330 835
## 1 3707 932
##
## Accuracy : 0.8305
## 95% CI : (0.826, 0.835)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2161
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.52745
## Specificity : 0.85194
## Pos Pred Value : 0.20091
## Neg Pred Value : 0.96233
## Prevalence : 0.06592
## Detection Rate : 0.03477
## Detection Prevalence : 0.17307
## Balanced Accuracy : 0.68969
##
## 'Positive' Class : 1
##
x4_d <- confusionMatrix(XG_clf_predict_delta1, highnote_y_test, positive="1" )[["overall"]]
y4_d <- confusionMatrix(XG_clf_predict_delta1, highnote_y_test, positive="1" )[["byClass"]]
#a default 0.5 threshold has the best f-measure albeit at a lesser accuracy than a 0.6 threshold. Going for f-measure as the defining metric to pick the right model
clf_results[nrow(clf_results) + 1,] <- list(Model = "XG Boost (Past)",
Accuracy = round (x4_d[["Accuracy"]],3),
Precision = round (y4_d[["Precision"]],3),
Recall = round (y4_d[["Recall"]],3),
F1 = round (y4_d[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x4_d[["Accuracy"]],3), "and F1 is ", round (y4_d[["F1"]],3) )
## Accuracy is 0.831 and F1 is 0.291
# Add results into cost_benefit_df dataframe for cost benefit analysis
a4_d <- confusionMatrix(XG_clf_predict_delta1, highnote_y_test, positive="1" )
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "XG Boost (Past)",
TP = a4_d[["table"]][4],
FN = a4_d[["table"]][3],
FP = a4_d[["table"]][2],
TN = a4_d[["table"]][1])
3.5 Neural Network Classification 3.5.1 NN for All Data
my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 7))
nn_clf_fit <- train(highnote_x_train_ohe,
highnote_y_train,
method = "nnet",
threshold=0.01,
stepmax = 100,
trace = F, linout = 0,
tuneGrid = my.grid)
print(nn_clf_fit)
## Neural Network
##
## 21784 samples
## 32 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 21784, 21784, 21784, 21784, 21784, 21784, ...
## Resampling results across tuning parameters:
##
## decay size Accuracy Kappa
## 0.1 5 0.7388745 0.4777467
## 0.1 7 0.7419210 0.4838560
## 0.5 5 0.7292708 0.4585945
## 0.5 7 0.7294157 0.4588638
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.1.
# Plotting neural network
plotnet(nn_clf_fit$finalModel, y_names = "adopter")
# prediction on test data
nn_predict <- predict(nn_clf_fit, newdata = highnote_x_test_ohe)
# predict probabilities
nn_predict_prob <- predict(nn_clf_fit, newdata=highnote_x_test_ohe, type="prob")
# convert probability outcome into categorical ones.
y_pred_num_nn <- ifelse(nn_predict_prob[,2] > 0.6, "1", "0")
# Print Confusion matrix, Accuracy, Sensitivity etc
confusionMatrix(as.factor(y_pred_num_nn), as.factor(highnote_y_test), positive= "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22029 1118
## 1 3008 649
##
## Accuracy : 0.8461
## 95% CI : (0.8417, 0.8504)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1651
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.36729
## Specificity : 0.87986
## Pos Pred Value : 0.17747
## Neg Pred Value : 0.95170
## Prevalence : 0.06592
## Detection Rate : 0.02421
## Detection Prevalence : 0.13643
## Balanced Accuracy : 0.62357
##
## 'Positive' Class : 1
##
# Add results into clf_results data frame
x5 <- confusionMatrix(as.factor(y_pred_num_nn), as.factor(highnote_y_test), positive= "1")[["overall"]]
y5 <- confusionMatrix(as.factor(y_pred_num_nn), as.factor(highnote_y_test), positive ="1")[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Neural Network (All Time)",
Accuracy = round (x5[["Accuracy"]],3),
Precision = round (y5[["Precision"]],3),
Recall = round (y5[["Recall"]],3),
F1 = round (y5[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x5[["Accuracy"]],3), "and F1 is ", round (y5[["F1"]],3) )
## Accuracy is 0.846 and F1 is 0.239
# Add results into cost_benefit_df dataframe for cost benefit analysis
a5 <- confusionMatrix(nn_predict, highnote_y_test, positive ="1")
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Neural Network (All Time)",
TP = a5[["table"]][4],
FN = a5[["table"]][3],
FP = a5[["table"]][2],
TN = a5[["table"]][1])
3.5.2 NN for Data capturing Change/Delta in past 3 months
my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 7))
nn_clf_fit_delta1 <- train(highnote_x_train_delta1,
highnote_y_train,
method = "nnet",
threshold=0.01,
stepmax = 100,
trace = F, linout = 0,
tuneGrid = my.grid)
print(nn_clf_fit_delta1)
## Neural Network
##
## 21784 samples
## 22 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 21784, 21784, 21784, 21784, 21784, 21784, ...
## Resampling results across tuning parameters:
##
## decay size Accuracy Kappa
## 0.1 5 0.7061388 0.4121987
## 0.1 7 0.7095989 0.4191031
## 0.5 5 0.7004434 0.4007779
## 0.5 7 0.7006438 0.4011657
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.1.
# Plotting neural network
plotnet(nn_clf_fit_delta1$finalModel, y_names = "adopter")
# prediction on test data
nn_predict_delta1 <- predict(nn_clf_fit_delta1, newdata = highnote_x_test_delta1)
# predict probabilities
nn_predict_prob_delta1 <- predict(nn_clf_fit_delta1, newdata=highnote_x_test_delta1, type="prob")
# Print Confusion matrix, Accuracy, Sensitivity etc
confusionMatrix(nn_predict_delta1, highnote_y_test, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20523 1080
## 1 4514 687
##
## Accuracy : 0.7913
## 95% CI : (0.7864, 0.7962)
## No Information Rate : 0.9341
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1096
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.38879
## Specificity : 0.81971
## Pos Pred Value : 0.13209
## Neg Pred Value : 0.95001
## Prevalence : 0.06592
## Detection Rate : 0.02563
## Detection Prevalence : 0.19404
## Balanced Accuracy : 0.60425
##
## 'Positive' Class : 1
##
# Add results into clf_results dataframe
x5_d <- confusionMatrix(nn_predict_delta1, highnote_y_test, positive ="1")[["overall"]]
y5_d <- confusionMatrix(nn_predict_delta1, highnote_y_test, positive ="1")[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Neural Network (Past)",
Accuracy = round (x5_d[["Accuracy"]],3),
Precision = round (y5_d[["Precision"]],3),
Recall = round (y5_d[["Recall"]],3),
F1 = round (y5_d[["F1"]],3))
# Print Accuracy and F1 score
cat("Accuracy is ", round(x5_d[["Accuracy"]],3), "and F1 is ", round (y5_d[["F1"]],3) )
## Accuracy is 0.791 and F1 is 0.197
# Add results into cost_benefit_df dataframe for cost benefit analysis
a5_d <- confusionMatrix(nn_predict_delta1, highnote_y_test, positive ="1")
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Neural Network (Past)",
TP = a5_d[["table"]][4],
FN = a5_d[["table"]][3],
FP = a5_d[["table"]][2],
TN = a5_d[["table"]][1])
4.1 Compare Accuracy, Precision, Recall, and F-1 Score
# compare cost benefit
print(cost_benefit_df)
## Model TP FN FP TN
## 1 Logistic Regression (All Time) 720 1047 2677 22360
## 2 Logistic Regression (Past) 553 1214 2415 22622
## 3 Decision Tree (All Time) 1182 585 6881 18156
## 4 Decision Tree (Past) 1071 696 5591 19446
## 5 KNN (All Time) 810 957 8057 16980
## 6 KNN (Past) 802 965 6797 18240
## 7 XG Boost (All Time) 1050 717 3960 21077
## 8 XG Boost (Past) 932 835 3707 21330
## 9 Neural Network (All Time) 920 847 5049 19988
## 10 Neural Network (Past) 687 1080 4514 20523
# compare classification results
print(clf_results)
## Model Accuracy Precision Recall F1
## 1 Logistic Regression (All Time) 0.861 0.212 0.407 0.279
## 2 Logistic Regression (Past) 0.865 0.186 0.313 0.234
## 3 Decision Tree (All Time) 0.721 0.147 0.669 0.240
## 4 Decision Tree (Past) 0.765 0.161 0.606 0.254
## 5 KNN (All Time) 0.664 0.091 0.458 0.152
## 6 KNN (Past) 0.710 0.106 0.454 0.171
## 7 XG Boost (All Time) 0.869 0.245 0.473 0.323
## 8 XG Boost (Past) 0.831 0.201 0.527 0.291
## 9 Neural Network (All Time) 0.846 0.177 0.367 0.239
## 10 Neural Network (Past) 0.791 0.132 0.389 0.197
# Plot accuracy for all the Classification Models
ggplot(clf_results[1:10,] %>% arrange(desc(Accuracy)) %>%
mutate(Model=factor(Model, levels=Model) ),
aes(x = Model, y = Accuracy)) +
geom_bar(stat = "identity" , width=0.3, fill="steelblue") +
coord_cartesian(ylim = c(0.2, 1)) +
geom_hline(aes(yintercept = mean(Accuracy)),
colour = "green",linetype="dashed") +
ggtitle("Compare Accuracy for all Models") +
theme(plot.title = element_text(color="black", size=10, hjust = 0.5),
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
# Plot F-1 scores for all the Classification Models
ggplot(clf_results[1:10,] %>% arrange(desc(F1)) %>%
mutate(Model=factor(Model, levels=Model) ),
aes(x = Model, y = F1)) +
geom_bar(stat = "identity" , width=0.3, fill="steelblue") +
coord_cartesian(ylim = c(0.2, 0.4)) +
geom_hline(aes(yintercept = mean(Accuracy)),
colour = "green",linetype="dashed") +
ggtitle("Compare F-1 scores for all Models") +
theme(plot.title = element_text(color="black", size=10, hjust = 0.5),
axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
Based on the above, it seems the XGBoost (All Time) model is the most accurate model at an accuracy of 86.9%. Also, considering the dataset here is highly imbalanced, we need a class specific metric to choose the right model. For this purpose, we choose the model with the best F-measure, which is the XG Boost(All Time) model. The XG Boost(All Time) model also has the best Precision in positive prediction of all models.
4.2 Comparing ROC-AUC and Lift Curves for all models.
# List of predictions
preds_list <- list(glm_predict_prob[,2], glm_predict_prob_delta1[,2],
dtree_predict_prob[,2], dtree_predict_prob_delta1[,2],
knnProb[,2], knnProb_delta1[,2],
XG_clf_predict_prob[,2], XG_clf_predict_delta1_prob[,2],
nn_predict_prob[,2], nn_predict_prob_delta1[,2])
# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(highnote_y_test), m)
# Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")
# calculate AUC for all models
AUC_models <- performance(pred, "auc")
auc_lr = round(AUC_models@y.values[[1]], 3)
auc_lr_delta1 = round(AUC_models@y.values[[2]], 3)
auc_dt = round(AUC_models@y.values[[3]], 3)
auc_dt_delta1 = round(AUC_models@y.values[[4]], 3)
auc_knn = round(AUC_models@y.values[[5]], 3)
auc_knn_delta1 = round(AUC_models@y.values[[6]], 3)
auc_xg = round(AUC_models@y.values[[7]], 3)
auc_xg_delta1 = round(AUC_models@y.values[[8]], 3)
auc_nn = round(AUC_models@y.values[[9]], 3)
auc_nn_delta1 = round(AUC_models@y.values[[10]], 3)
# Plot the ROC curves
plot(rocs, col = as.list(1:m), main = "ROC-AUC Curves of Different Models (All Time and the Past)")
legend(x = "bottomright",
legend = c( paste0("Logistic Regression (All) - ", auc_lr),
paste0("Logistic Regression (Past) -", auc_lr_delta1),
paste0("Decision Tree (All) - ", auc_dt),
paste0("Decision Tree (Past) -", auc_dt_delta1),
paste0("KNN (All) - ", auc_knn),
paste0("KNN (Past) -", auc_knn_delta1),
paste0("XG Boost (All) - ", auc_xg),
paste0("XG Boost (Past) -", auc_xg_delta1),
paste0("Neural Net (All) - ", auc_nn),
paste0("Neural Net (Past) -", auc_nn_delta1)), fill = 1:m)
The thumb rule in case of identifying the best classifier on an ROC curve is the proximity of the classifier to the top left corner, which primarily tells us the ratio between true positive & false positive rates at different thresholds. On the basis of this the XG Boost(All Time) model outperforms all other classifiers across all thresholds.
AUC similarly is also the highest for the XG Boost(All Time) model as inferred visually here. Hence that’s an indication that the XG Boost(All Time) seems to be best among all classifiers.
Lift Curve essentially tells us how do the classifiers stack up against a scenario without a predictive model. This is more useful as a determining factor in scenarios where you are doing classification followed by ranking. In our case, as we need to identify the top 1000 most likely free users and give them a two-month trial in a bid to make them convert. Hence the better our classifier is at identifying the most important 1000 free users, the more it is useful for our use case.
lifts <- performance(pred, "lift", "rpp")
# Plot the Lift curves
plot(lifts, col = as.list(1:m), main = "Lift Curves of Different Models (All Time and the Past)")
legend(x = "topright",
legend = c( "Logistic Regression (All)",
"Logistic Regression (Past)",
"Decision Tree (All)",
"Decision Tree (Past)",
"KNN (All)",
"KNN (Past)",
"XG Boost (All)",
"XG Boost (Past)",
"Neural Net (All)",
"Neural Net (Past)"), fill = 1:m)
The total number of test instances in our case is 100k free users. XG
Boost(All) model outperforms all the other models here as well. Hence it
seems to be best model for our use case.
Based on the above metrics, and understanding the relevance of the above metrics to our use case, it seems that XG Boost(All) is the best performing classifier. It performs the best in terms of F-1 Scores, is the best model in terms of precision and delivers the highest lift as per the lift curves. The 3rd metric is especially important as we are combining classification with ranking and are only interested in top 1k ranked customers most likely to convert to premium.
Hence, we would now apply the XGBoost(All) model to predict the top 1000 users likely to respond positively to the two-month trial and convert to a premium customer.
# remove net_user id and adopter information for the free users data set
highnote_id <- highnote %>% select(c("net_user", "adopter"))
highnote_factors <- highnote %>% select(c("delta1_good_country", "good_country", "male"))
highnote_numeric <- highnote %>% select(-c("net_user", "adopter", "male", "delta1_good_country", "good_country"))
# do one-hot encoding for the free users data
highnote_factors_ohe_1 <- data.frame(highnote_factors)
highnote_factors_ohe <- dummy.data.frame(data=highnote_factors_ohe_1, sep=" ")
# Normalize numeric data
highnote_numeric_norm <- as.data.frame(lapply(highnote_numeric, normalize))
#Combine numeric & factor variables in one dataframe
highnote_all <- cbind(highnote_id,highnote_factors_ohe, highnote_numeric_norm)
# create a new data set that contains only free users
highnote_free <- highnote_all %>% filter(adopter==0)
highnote_free_x <- highnote_free %>% select(-c("net_user", "adopter"))
# predict the data
predict_free_prob <- predict(XG_clf_fit, newdata = highnote_free_x, type="prob")
prob_of_adopting <- predict_free_prob[,2]
# put net user id back to the table
highnote_free_id <- highnote_free %>% select(c("net_user"))
adopters <- cbind(highnote_free_id, prob_of_adopting)
# select the top 1,000 mostly likely free users who have the highest probabilities of converting to premium users
adopters_top1k <- head(adopters[order(-adopters$prob_of_adopting),], 1000)
# show these 1,000 users' net id and the corresponding probability of adopting
summary(adopters_top1k)
## net_user prob_of_adopting
## Length:1000 Min. :0.8800
## Class :character 1st Qu.:0.8991
## Mode :character Median :0.9164
## Mean :0.9218
## 3rd Qu.:0.9424
## Max. :0.9990
# export the data into a csv file
write.csv(x=adopters_top1k, file = "adopters_top1k.csv", row.names = TRUE)
successNum <- adopters_top1k %>% filter(prob_of_adopting>0.90) %>% tally()
successNum
## n
## 1 734
Answer: In the Highnote case, we are essentially providing a two-month free trial to free users with the highest likelihood of converting to a premium customer. As we are going to offer up the free trial to top 1000 users most likely to subscribe to premium services, we should use the lift to determine the correct ROI from the proposed model. For the XGBoost(All) model chosen here, the average probability of adoption for the top 1000 predicted customer is 92.1%, which means of the top 1000 customers, 921 are likely to adopt the premium service. Comparing this with the average conversion rate of free users to premium based on such campaigns historically gives us the lift. We can make the lift metric cost aware by multiplying the number of incremental customer gained due to the model with the profit from acquiring a new premium customer. Hence, this will give us the true ROI from the proposed model.