Situation

You are all familiar with the HighNote freemium monetization case. HighNote is an online community, similar to Spotify (and to LinkedIn and OkCupid), in which free users and premium users co-exist. You covered this case at a high level using summary statistics in Module 1 with Prof. Anindya Ghose. Now you have the chance to build on that knowledge using additional micro data at the individual level, which could help identify whom specifically to target to convert from free to fee (premium).

Complication

A general challenge in freemium communities is getting people to pay for premium when they can also use the service for free (at the cost of seeing ads, say, or missing some features). However, free users are not as profitable as premium users, because competition for brands' ad dollars is intense. HighNote would like to use a data-driven approach to convert more free users into premium users. They are willing to offer a two-month free trial, but they do not know whom to give this promotion to. They have hired you as a consultant to help them design a targeting strategy.

Key Question

Who are the top 1000 free users most likely to convert to premium?


To answer this question, I will build several types of models and compare their performance. I will train these models on two datasets. The first dataset contains only the previous time period's change variables, which carry the "delta" prefix in the main data table. The second dataset contains both past and current data. Models built on past data carry a "past" suffix and models built on all data carry an "all" suffix. The models I picked for this assignment are logistic regression, KNN, neural network, XGBoost, ridge regression, and decision tree.


## The first step is importing required libraries
library("tidyverse")
library("skimr")
library("readxl") # used to read excel files
library("dplyr") # used for data munging 
library("FNN") # used for knn regression (knn.reg function)
library("caret") # used for various predictive models
library("class") # for using confusion matrix function
library("rpart.plot") # used to plot decision tree
library("rpart")  # used for Regression tree
library("glmnet") # used for Lasso and Ridge regression
library('NeuralNetTools') # used to plot Neural Networks
library("PRROC") # top plot ROC curve
library("ROCR") # top plot lift curve

We start by loading the data.

##loading data
HN_data_PostModule <- read_csv("HN_data_PostModule.csv", col_types = "fnfnnnnnnnnnnnfnnnnnnnnnnff")

## Changing NA factor values to UNK (note: recode() matches the literal string
## "NA"; true missing values would need e.g. tidyr::replace_na())
HN_data_PostModule <- HN_data_PostModule %>%
  mutate(net_user = as.character(net_user)) %>%
  mutate(net_user = as.factor(recode(net_user, "NA" = "UNK")) )%>%
  mutate(male = as.character(male)) %>%
  mutate(male = as.factor(recode(male, "NA" = "UNK"))) %>%
  mutate(adopter = as.character(adopter)) %>%
  mutate(adopter = as.factor(recode(adopter, "NA" = "UNK"))) %>%
  mutate(good_country = as.character(good_country)) %>%
  mutate(good_country = as.factor(recode(good_country, "NA" = "UNK"))) %>%
  mutate(delta1_good_country = as.character(delta1_good_country)) %>%
  mutate(delta1_good_country = as.factor(recode(delta1_good_country, "NA" = "UNK"))) 


## Impute missing numeric values with the median or mean, depending on the variable

HN_data_PostModule <- HN_data_PostModule %>%
  group_by(male) %>%
  mutate(age = ifelse(is.na(age), mean(age, na.rm = TRUE), age)) %>%
  mutate(friend_cnt = ifelse(is.na(friend_cnt), median(friend_cnt, na.rm = TRUE), friend_cnt)) %>%
  mutate(avg_friend_age = ifelse(is.na(avg_friend_age), mean(avg_friend_age, na.rm = TRUE), avg_friend_age)) %>%
  mutate(avg_friend_male = ifelse(is.na(avg_friend_male), mean(avg_friend_male, na.rm = TRUE), avg_friend_male)) %>%
  mutate(friend_country_cnt = ifelse(is.na(friend_country_cnt), median(friend_country_cnt, na.rm = TRUE), friend_country_cnt)) %>%
  mutate(subscriber_friend_cnt = ifelse(is.na(subscriber_friend_cnt), median(subscriber_friend_cnt, na.rm = TRUE), subscriber_friend_cnt)) %>%
  mutate(songsListened = ifelse(is.na(songsListened), median(songsListened, na.rm = TRUE), songsListened)) %>%
  mutate(lovedTracks = ifelse(is.na(lovedTracks), median(lovedTracks, na.rm = TRUE), lovedTracks)) %>%
  mutate(posts = ifelse(is.na(posts), median(posts, na.rm = TRUE), posts)) %>%
  mutate(playlists = ifelse(is.na(playlists), median(playlists, na.rm = TRUE), playlists)) %>%
  mutate(shouts = ifelse(is.na(shouts), median(shouts, na.rm = TRUE), shouts)) %>%
  mutate(tenure = ifelse(is.na(tenure), mean(tenure, na.rm = TRUE), tenure)) %>%
  
  mutate(delta1_friend_cnt = ifelse(is.na(delta1_friend_cnt), median(delta1_friend_cnt, na.rm = TRUE), delta1_friend_cnt)) %>%
  mutate(delta1_avg_friend_age = ifelse(is.na(delta1_avg_friend_age), mean(delta1_avg_friend_age, na.rm = TRUE), delta1_avg_friend_age)) %>%
  mutate(delta1_avg_friend_male = ifelse(is.na(delta1_avg_friend_male), mean(delta1_avg_friend_male, na.rm = TRUE), delta1_avg_friend_male)) %>%
  mutate(delta1_friend_country_cnt = ifelse(is.na(delta1_friend_country_cnt), median(delta1_friend_country_cnt, na.rm = TRUE), delta1_friend_country_cnt)) %>%
  mutate(delta1_subscriber_friend_cnt = ifelse(is.na(delta1_subscriber_friend_cnt), median(delta1_subscriber_friend_cnt, na.rm = TRUE), delta1_subscriber_friend_cnt)) %>%
  mutate(delta1_songsListened = ifelse(is.na(delta1_songsListened), median(delta1_songsListened, na.rm = TRUE), delta1_songsListened)) %>%
  mutate(delta1_lovedTracks = ifelse(is.na(delta1_lovedTracks), median(delta1_lovedTracks, na.rm = TRUE), delta1_lovedTracks)) %>%
  mutate(delta1_posts = ifelse(is.na(delta1_posts), median(delta1_posts, na.rm = TRUE), delta1_posts)) %>%
  mutate(delta1_playlists = ifelse(is.na(delta1_playlists), median(delta1_playlists, na.rm = TRUE), delta1_playlists)) %>%
  mutate(delta1_shouts = ifelse(is.na(delta1_shouts), median(delta1_shouts, na.rm = TRUE), delta1_shouts)) %>%

  ungroup()
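
The chained mutate() calls above are verbose; the sketch below is a compact, behavior-equivalent alternative using dplyr::across() (dplyr >= 1.0), shown for reference only.

## same group-wise imputation written with across(): mean for the
## age/tenure-type variables, median for the skewed count variables
mean_vars <- c("age", "avg_friend_age", "avg_friend_male", "tenure",
               "delta1_avg_friend_age", "delta1_avg_friend_male")
median_vars <- c("friend_cnt", "friend_country_cnt", "subscriber_friend_cnt",
                 "songsListened", "lovedTracks", "posts", "playlists", "shouts",
                 "delta1_friend_cnt", "delta1_friend_country_cnt",
                 "delta1_subscriber_friend_cnt", "delta1_songsListened",
                 "delta1_lovedTracks", "delta1_posts", "delta1_playlists",
                 "delta1_shouts")

HN_data_PostModule <- HN_data_PostModule %>%
  group_by(male) %>%
  mutate(across(all_of(mean_vars), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)),
         across(all_of(median_vars), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  ungroup()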

## using log transform on skewed variables 
HN_data_PostModule$friend_cnt<-log(HN_data_PostModule$friend_cnt+1)
HN_data_PostModule$playlists<-log(HN_data_PostModule$playlists+1)
HN_data_PostModule$posts<-log(HN_data_PostModule$posts+1)
HN_data_PostModule$songsListened<-log(HN_data_PostModule$songsListened+1)

Create a function that normalizes columns since the scale for each column might be different.

## function to min-max normalize numeric columns to the [0, 1] range
## (non-numeric columns pass through unchanged; a constant column would
## yield NaN because max == min)

normalize <- function(vec){
  if (is.numeric(vec)) {
    vec <- (vec - min(vec)) / (max(vec) - min(vec))
  }
  return(vec)
}
##normalize data
HN_data_PostModule <- as.data.frame(lapply(HN_data_PostModule, normalize))

#skim(HN_data_PostModule)

Pick the past-period variables for the "past" dataset; the full table serves as the "all" dataset.

HN_data_past <- HN_data_PostModule[, c('net_user', 'age', 'male', 'tenure',
                                       'delta1_friend_cnt', 'delta1_avg_friend_age',
                                       'delta1_avg_friend_male', 'delta1_friend_country_cnt',
                                       'delta1_subscriber_friend_cnt', 'delta1_songsListened',
                                       'delta1_lovedTracks', 'delta1_posts', 'delta1_playlists',
                                       'delta1_shouts', 'delta1_good_country', 'adopter')]


## create Y and X data frames
HN_data_past_Y = HN_data_past %>% pull("adopter") 
HN_data_Y = HN_data_PostModule %>% pull("adopter") 

HN_data_past_X = HN_data_past %>% select(-c("adopter"))
HN_data_X = HN_data_PostModule %>% select(-c("adopter"))

##remove id from dataframe and store separate
HN_data_past_X_id<-HN_data_past_X %>% select(c("net_user"))
HN_data_X_id<-HN_data_X %>% select(c("net_user"))


HN_data_past_X<-HN_data_past_X %>% select(-c("net_user"))
HN_data_X<-HN_data_X %>% select(-c("net_user"))

Split the data into training and test sets.

## 75% of the data is used for training and the rest for testing
smp_size <- floor(0.75 * nrow(HN_data_past_X))

## randomly select row numbers for training data set
set.seed(12345)
train_ind <- sample(seq_len(nrow(HN_data_past_X)), size = smp_size)

## creating test and training sets for x
HN_data_x_train_past <- HN_data_past_X[train_ind, ]
HN_data_x_train <-      HN_data_X[train_ind, ]

HN_data_x_test_past <- HN_data_past_X[-train_ind, ]
HN_data_x_test      <- HN_data_X[-train_ind, ]


## creating test and training sets for y
HN_data_y_train_past <- HN_data_past_Y[train_ind]
HN_data_y_train <- HN_data_Y[train_ind]


HN_data_y_test_past <- HN_data_past_Y[-train_ind]
HN_data_y_test <- HN_data_Y[-train_ind]



## Check the class proportions across all 3 datasets.
round(prop.table(table(select(HN_data_past, adopter), exclude = NULL)), 4) * 100
## 
##     0     1 
## 93.27  6.73
round(prop.table(table(HN_data_y_train_past)), 4) * 100
## HN_data_y_train_past
##     0     1 
## 93.33  6.67
round(prop.table(table(HN_data_y_test_past)), 4) * 100
## HN_data_y_test_past
##     0     1 
## 93.11  6.89
## The classes are heavily imbalanced, so we use SMOTE to balance 0s and 1s in the training set


library("abind")

install.packages( "https://cran.r-project.org/src/contrib/Archive/DMwR/DMwR_0.4.1.tar.gz", repos=NULL, type="source" )
library("DMwR")
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
set.seed(1234)

##create the full training dataset with X and y variable
HN_data_past_train <-  cbind(HN_data_x_train_past, HN_data_y_train_past)
HN_data_train <-  cbind(HN_data_x_train, HN_data_y_train)

HN_train_balanced_past <- SMOTE(HN_data_y_train_past ~ ., data.frame(HN_data_past_train), perc.over = 100, perc.under = 200)

HN_train_balanced <- SMOTE(HN_data_y_train ~ ., data.frame(HN_data_train), perc.over = 100, perc.under = 200)

## Check the proportions 

round(prop.table(table(HN_train_balanced$HN_data_y_train)), 4) * 100
## 
##  0  1 
## 50 50
round(prop.table(table(HN_train_balanced_past$HN_data_y_train_past)), 4) * 100
## 
##  0  1 
## 50 50
##remove the Y column from the newly balanced training set
HN_data_x_train_past <- HN_train_balanced_past %>% select(-HN_data_y_train_past)
HN_data_x_train <- HN_train_balanced %>% select(-HN_data_y_train)

##store the Y column
HN_data_y_train_past <- HN_train_balanced_past %>% pull(HN_data_y_train_past) %>% as.factor()
HN_data_y_train <- HN_train_balanced %>% pull(HN_data_y_train) %>% as.factor()

Create data frames to store classification results and cost-benefit values.

## Create an empty data frame to store results from different models
clf_results <- data.frame(matrix(ncol = 5, nrow = 0))
names(clf_results) <- c("Model", "Accuracy", "Precision", "Recall", "F1")

## Create an empty data frame to store TP, TN, FP and FN values
cost_benefit_df <- data.frame(matrix(ncol = 5, nrow = 0))
names(cost_benefit_df) <- c("Model", "TP", "FN", "FP", "TN")


After preparing the data, we can start building models.


Logistic regression past

## below model is logistic regression
yyy_fit <- train(HN_data_x_train_past,
                 HN_data_y_train_past, 
                 method = "glm",
                 family = "binomial",
                 preProc = c("center", "scale"))

##predict the model based on test data
yyy_predict <- predict(yyy_fit, newdata = HN_data_x_test_past, positive="1" )
glm_prob <- predict(yyy_fit, newdata = HN_data_x_test_past, type = "prob")



## Add results into clf_results dataframe
x1 <- confusionMatrix(yyy_predict, HN_data_y_test_past )[["overall"]]
y1 <- confusionMatrix(yyy_predict, HN_data_y_test_past)[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Log Regression", 
                                             Accuracy = round (x1[["Accuracy"]],3), 
                                            Precision = round (y1[["Precision"]],3), 
                                            Recall = round (y1[["Recall"]],3), 
                                            F1 = round (y1[["F1"]],3))

## Add results into cost_benefit_df dataframe 
a1 <- confusionMatrix(yyy_predict, HN_data_y_test_past)

## confusionMatrix table is column-major: [1] = TN, [2] = FP, [3] = FN, [4] = TP
cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Log Regression", 
                                             TP = a1[["table"]][4], 
                                             FN = a1[["table"]][3], 
                                             FP = a1[["table"]][2], 
                                             TN = a1[["table"]][1])

Logistic regression all

## below model is logistic regression
yyy_fit_2 <- train(HN_data_x_train,
                 HN_data_y_train, 
                 method = "glm",
                 family = "binomial",
                 preProc = c("center", "scale"))

##predict the model based on test data
yyy_predict_2 <- predict(yyy_fit_2, newdata = HN_data_x_test, positive="1" )
glm_prob_2 <- predict(yyy_fit_2, newdata = HN_data_x_test, type = "prob")
## Add results into clf_results dataframe
xa <- confusionMatrix(yyy_predict_2, HN_data_y_test )[["overall"]]
ya <- confusionMatrix(yyy_predict_2, HN_data_y_test)[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Log Regression 2", 
                                             Accuracy = round (xa[["Accuracy"]],3), 
                                            Precision = round (ya[["Precision"]],3), 
                                            Recall = round (ya[["Recall"]],3), 
                                            F1 = round (ya[["F1"]],3))

## Add results into cost_benefit_df dataframe 
aa <- confusionMatrix(yyy_predict_2, HN_data_y_test )

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Log Regression 2", 
                                             TP = aa[["table"]][4], 
                                             FN = aa[["table"]][3], 
                                             FP = aa[["table"]][2], 
                                             TN = aa[["table"]][1])

KNN Model past

# change y value to "yes/no"
HN_data_y_train_past <- as.factor(ifelse(HN_data_y_train_past =="1", "YES", "NO"))
HN_data_y_test_past <- as.factor(ifelse(HN_data_y_test_past =="1", "YES", "NO"))


## one-hot encoding for train and test data

library( dummies)
## dummies-1.5.6 provided by Decision Patterns
HN_data_x_train_past <- data.frame(HN_data_x_train_past) #convert from a tibble to a data frame
HN_data_x_train_past <- dummy.data.frame(data=HN_data_x_train_past, sep="_")  #does one-hot encoding
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
HN_data_x_test_past <- data.frame(HN_data_x_test_past) #convert from a tibble to a data frame
HN_data_x_test_past <- dummy.data.frame(data=HN_data_x_test_past, sep="_")  #does one-hot encoding
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
##Fit KNN Model


##Cross validation 
cross_validation <- trainControl(## 10-fold CV
                                method = "repeatedcv",
                                number = 10,
                                ## repeated once
                                repeats = 1,
                                summaryFunction=twoClassSummary, 
                                classProbs=TRUE)
# Hyperparameter tuning
# k = number of nearest neighbours
Param_Grid <-  expand.grid( k = 1:9)

# fit the model to training data
knn_clf_fit <- train(HN_data_x_train_past,
                     HN_data_y_train_past,
                     method = "knn",
                     metric = "ROC",
                     tuneGrid = Param_Grid,
                     trControl = cross_validation )

## Plot cross-validated ROC (the tuning metric) for different k values
plot(knn_clf_fit)

## predict the model based on test data
knn_Predict <- predict(knn_clf_fit, newdata = HN_data_x_test_past,positive = "YES") 

knn_prob <- predict(knn_clf_fit, newdata = HN_data_x_test_past, type = "prob")


## Add results into clf_results dataframe
x2 <- confusionMatrix(knn_Predict, HN_data_y_test_past )[["overall"]]
y2 <- confusionMatrix(knn_Predict, HN_data_y_test_past)[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "KNN", 
                                             Accuracy = round (x2[["Accuracy"]],3), 
                                            Precision = round (y2[["Precision"]],3), 
                                            Recall = round (y2[["Recall"]],3), 
                                            F1 = round (y2[["F1"]],3))

## Add results into cost_benefit_df dataframe 
a2 <- confusionMatrix(knn_Predict, HN_data_y_test_past)

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "KNN", 
                                             TP = a2[["table"]][4], 
                                             FN = a2[["table"]][3], 
                                             FP = a2[["table"]][2], 
                                             TN = a2[["table"]][1])

KNN Model all

## change y value to "yes/no"
HN_data_y_train <- as.factor(ifelse(HN_data_y_train =="1", "YES", "NO"))
HN_data_y_test <- as.factor(ifelse(HN_data_y_test =="1", "YES", "NO"))

## one-hot encoding for train and test data


library( dummies)

HN_data_x_train <- data.frame(HN_data_x_train) #convert from a tibble to a data frame
HN_data_x_train <- dummy.data.frame(data=HN_data_x_train, sep="_")  #does one-hot encoding
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
HN_data_x_test <- data.frame(HN_data_x_test) #convert from a tibble to a data frame
HN_data_x_test <- dummy.data.frame(data=HN_data_x_test, sep="_")  #does one-hot encoding
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Cross validation 
cross_validation <- trainControl(## 10-fold CV
                                method = "repeatedcv",
                                number = 10,
                                ## repeated once
                                repeats = 1,
                                summaryFunction=twoClassSummary, 
                                classProbs=TRUE)
# Hyperparameter tuning
# k = number of nearest neighbours
Param_Grid <-  expand.grid( k = 1:9)

# fit the model to training data
knn_clf_fit_2 <- train(HN_data_x_train,
                     HN_data_y_train,
                     method = "knn",
                     metric = "ROC",
                     tuneGrid = Param_Grid,
                     trControl = cross_validation )

## Plot cross-validated ROC (the tuning metric) for different k values
plot(knn_clf_fit_2)

##predict the model based on test data
knn_Predict_2 <- predict(knn_clf_fit_2, newdata = HN_data_x_test,positive = "YES") 

knn_prob_2 <- predict(knn_clf_fit_2, newdata = HN_data_x_test, type = "prob")


## Add results into clf_results dataframe
xb <- confusionMatrix(knn_Predict_2, HN_data_y_test )[["overall"]]
yb <- confusionMatrix(knn_Predict_2, HN_data_y_test)[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "KNN 2", 
                                             Accuracy = round (xb[["Accuracy"]],3), 
                                            Precision = round (yb[["Precision"]],3), 
                                            Recall = round (yb[["Recall"]],3), 
                                            F1 = round (yb[["F1"]],3))

## Add results into cost_benefit_df dataframe 
ab <- confusionMatrix(knn_Predict_2, HN_data_y_test)

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "KNN 2", 
                                             TP = ab[["table"]][4], 
                                             FN = ab[["table"]][3], 
                                             FP = ab[["table"]][2], 
                                             TN = ab[["table"]][1])

XGboost past

XG_clf_fit <- train(HN_data_x_train_past, 
                    HN_data_y_train_past,
                    method = "xgbTree",
                    preProc = c("center", "scale"))

## Predict on test data
XG_clf_predict <- predict(XG_clf_fit,HN_data_x_test_past, positive = "YES")

xg_prob <- predict(XG_clf_fit, newdata = HN_data_x_test_past, type = "prob")


## Add results into clf_results dataframe
x3 <- confusionMatrix(XG_clf_predict,  HN_data_y_test_past )[["overall"]]
y3 <- confusionMatrix(XG_clf_predict,  HN_data_y_test_past )[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "XG Boost", 
                                             Accuracy = round (x3[["Accuracy"]],3), 
                                            Precision = round (y3[["Precision"]],3), 
                                            Recall = round (y3[["Recall"]],3), 
                                            F1 = round (y3[["F1"]],3))


## Add results into cost_benefit_df dataframe 
a3 <- confusionMatrix(XG_clf_predict,  HN_data_y_test_past)

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "XG Boost", 
                                             TP = a3[["table"]][4], 
                                             FN = a3[["table"]][3], 
                                             FP = a3[["table"]][2], 
                                             TN = a3[["table"]][1])

XGboost all

XG_clf_fit_2 <- train(HN_data_x_train, 
                    HN_data_y_train,
                    method = "xgbTree",
                    preProc = c("center", "scale"))

## Predict on test data
XG_clf_predict_2 <- predict(XG_clf_fit_2,HN_data_x_test, positive = "YES")

xg_prob_2 <- predict(XG_clf_fit_2, newdata = HN_data_x_test, type = "prob")



## Add results into clf_results dataframe
xc <- confusionMatrix(XG_clf_predict_2,  HN_data_y_test )[["overall"]]
yc <- confusionMatrix(XG_clf_predict_2,  HN_data_y_test )[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "XG Boost 2", 
                                             Accuracy = round (xc[["Accuracy"]],3), 
                                            Precision = round (yc[["Precision"]],3), 
                                            Recall = round (yc[["Recall"]],3), 
                                            F1 = round (yc[["F1"]],3))


## Add results into cost_benefit_df dataframe 
ac <- confusionMatrix(XG_clf_predict_2,  HN_data_y_test)

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "XG Boost 2", 
                                             TP = ac[["table"]][4], 
                                             FN = ac[["table"]][3], 
                                             FP = ac[["table"]][2], 
                                             TN = ac[["table"]][1])

Neural Network classification past

## tune the weight decay and the size of the hidden layer
my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 7))

nn_clf_fit <- train(HN_data_x_train_past, 
                    HN_data_y_train_past,
                    method = "nnet",
                    trace = F,
                    tuneGrid = my.grid,
                    linout = 0,
                    stepmax = 100,
                    threshold = 0.01 )

## Plot Neural Network 
plotnet(nn_clf_fit$finalModel, y_names = "adopter")

## Predict on test data
nn_clf_predict <- predict(nn_clf_fit,HN_data_x_test_past, positive = "1")

NN_prob <- predict(nn_clf_fit, newdata = HN_data_x_test_past, type = "prob")



## Add results into clf_results dataframe
x4 <- confusionMatrix(nn_clf_predict,HN_data_y_test_past)[["overall"]]
y4 <- confusionMatrix(nn_clf_predict,HN_data_y_test_past)[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Neural Network", 
                                             Accuracy = round (x4[["Accuracy"]],3), 
                                            Precision = round (y4[["Precision"]],3), 
                                            Recall = round (y4[["Recall"]],3), 
                                            F1 = round (y4[["F1"]],3))

## Add results into cost_benefit_df dataframe 
a4 <- confusionMatrix(nn_clf_predict,HN_data_y_test_past)

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Neural Network", 
                                             TP = a4[["table"]][4], 
                                             FN = a4[["table"]][3], 
                                             FP = a4[["table"]][2], 
                                             TN = a4[["table"]][1])

Neural Network classification all

## tune the weight decay and the size of the hidden layer
my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 7))

nn_clf_fit_2 <- train(HN_data_x_train, 
                    HN_data_y_train,
                    method = "nnet",
                    trace = F,
                    tuneGrid = my.grid,
                    linout = 0,
                    stepmax = 100,
                    threshold = 0.01 )
print(nn_clf_fit_2)
## Neural Network 
## 
## 21468 samples
##    32 predictor
##     2 classes: 'NO', 'YES' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 21468, 21468, 21468, 21468, 21468, 21468, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  Accuracy   Kappa    
##   0.1    5     0.7523900  0.5049620
##   0.1    7     0.7540660  0.5083174
##   0.5    5     0.7431458  0.4864760
##   0.5    7     0.7450599  0.4902684
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.1.
## Plot Neural Network 
plotnet(nn_clf_fit_2$finalModel, y_names = "adopter")

## Predict on test data
nn_clf_predict_2 <- predict(nn_clf_fit_2,HN_data_x_test, positive = "1")

NN_prob_2 <- predict(nn_clf_fit_2, newdata = HN_data_x_test, type = "prob")


## Add results into clf_results dataframe
xd <- confusionMatrix(nn_clf_predict_2,HN_data_y_test)[["overall"]]
yd <- confusionMatrix(nn_clf_predict_2,HN_data_y_test)[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Neural Network 2", 
                                             Accuracy = round (xd[["Accuracy"]],3), 
                                            Precision = round (yd[["Precision"]],3), 
                                            Recall = round (yd[["Recall"]],3), 
                                            F1 = round (yd[["F1"]],3))

## Add results into cost_benefit_df dataframe 
ad <- confusionMatrix(nn_clf_predict_2,HN_data_y_test)

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Neural Network 2", 
                                             TP = ad[["table"]][4], 
                                             FN = ad[["table"]][3], 
                                             FP = ad[["table"]][2], 
                                             TN = ad[["table"]][1])

Ridge regression past

## select a range of lambdas
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)

## Setting alpha = 0 implements ridge regression
ridge_cv <- cv.glmnet(HN_data_x_train_past %>% as.matrix(), HN_data_y_train_past %>% as.matrix(), 
                      alpha = 0, lambda = lambdas_to_try, family = 'binomial',
                      standardize = TRUE)

## plot binomial deviance vs log(lambda)
plot(ridge_cv)

## select the best cross-validated lambda
lambda_cv <- ridge_cv$lambda.min

cat("Best lambda for Ridge Regression (alpha = 0)  is", lambda_cv)
## Best lambda for Ridge Regression (alpha = 0)  is 0.001747528
## Train the model using the best lambda
model_ridge_cv <- glmnet(HN_data_x_train_past %>% as.matrix(), HN_data_y_train_past %>% as.matrix(), 
                   alpha = 0, lambda = lambda_cv, family="binomial", standardize = TRUE)

## prediction on test data
ridge_predict_res <- predict(model_ridge_cv, HN_data_x_test_past %>% as.matrix(),type="response")

ridge_predict_class <- predict(model_ridge_cv, HN_data_x_test_past %>% as.matrix(),type="class")

## type = "response" returns P(YES); keep P(YES) in column 1 of ridge_prob
## (the ROC code below reads ridge_prob[, 1])
ridge_prob_yes <- ridge_predict_res
ridge_prob_no <- 1 - ridge_predict_res
ridge_prob <- cbind(ridge_prob_yes, ridge_prob_no)
confusionMatrix(as.factor(ridge_predict_class), HN_data_y_test_past,positive="YES")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    NO   YES
##        NO  19180   905
##        YES  5778   941
##                                           
##                Accuracy : 0.7507          
##                  95% CI : (0.7454, 0.7558)
##     No Information Rate : 0.9311          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1252          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.50975         
##             Specificity : 0.76849         
##          Pos Pred Value : 0.14005         
##          Neg Pred Value : 0.95494         
##              Prevalence : 0.06887         
##          Detection Rate : 0.03511         
##    Detection Prevalence : 0.25067         
##       Balanced Accuracy : 0.63912         
##                                           
##        'Positive' Class : YES             
## 
## Add results into clf_results dataframe


x5 <- confusionMatrix(as.factor(ridge_predict_class), HN_data_y_test_past)[["overall"]]
y5 <- confusionMatrix(as.factor(ridge_predict_class), HN_data_y_test_past)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <-  list(Model = "Ridge regression", 
                                             Accuracy = round (x5[["Accuracy"]],3), 
                                            Precision = round (y5[["Precision"]],3), 
                                            Recall = round (y5[["Recall"]],3), 
                                            F1 = round (y5[["F1"]],3))

## Add results into cost_benefit_df dataframe 
a5 <- confusionMatrix(as.factor(ridge_predict_class), HN_data_y_test_past)

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Ridge regression", 
                                             TP = a5[["table"]][4], 
                                             FN = a5[["table"]][3], 
                                             FP = a5[["table"]][2], 
                                             TN = a5[["table"]][1])

Ridge regression all

## select a range of lambdas
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)

## Setting alpha = 0 implements ridge regression
ridge_cv_2 <- cv.glmnet(HN_data_x_train %>% as.matrix(), HN_data_y_train %>% as.matrix(), 
                      alpha = 0, lambda = lambdas_to_try, family = 'binomial',
                      standardize = TRUE)

## plot binomial deviance vs log(lambda)
plot(ridge_cv_2)

## select the best cross-validated lambda
lambda_cv_2 <- ridge_cv_2$lambda.min

cat("Best lambda for Ridge Regression (alpha = 0)  is", lambda_cv_2)
## Best lambda for Ridge Regression (alpha = 0)  is 0.001
## Train the model using the best lambda
model_ridge_cv_2 <- glmnet(HN_data_x_train %>% as.matrix(), HN_data_y_train %>% as.matrix(), 
                   alpha = 0, lambda = lambda_cv_2, family="binomial", standardize = TRUE)

## prediction on test data
ridge_predict_res_2 <- predict(model_ridge_cv_2, HN_data_x_test %>% as.matrix(),type="response")

ridge_predict_class_2 <- predict(model_ridge_cv_2, HN_data_x_test %>% as.matrix(),type="class")


## type = "response" returns P(YES); keep P(YES) in column 1 of ridge_prob_2
ridge_prob_yes_2 <- ridge_predict_res_2
ridge_prob_no_2 <- 1 - ridge_predict_res_2
ridge_prob_2 <- cbind(ridge_prob_yes_2, ridge_prob_no_2)
confusionMatrix(as.factor(ridge_predict_class_2), HN_data_y_test,positive="YES")
## Add results into clf_results dataframe

xe <- confusionMatrix(as.factor(ridge_predict_class_2), HN_data_y_test)[["overall"]]
ye <- confusionMatrix(as.factor(ridge_predict_class_2), HN_data_y_test)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <-  list(Model = "Ridge regression 2", 
                                             Accuracy = round (xe[["Accuracy"]],3), 
                                            Precision = round (ye[["Precision"]],3), 
                                            Recall = round (ye[["Recall"]],3), 
                                            F1 = round (ye[["F1"]],3))

ae <- confusionMatrix(as.factor(ridge_predict_class_2), HN_data_y_test)

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Ridge regression 2", 
                                             TP = ae[["table"]][4], 
                                             FN = ae[["table"]][3], 
                                             FP = ae[["table"]][2], 
                                             TN = ae[["table"]][1])

Decision tree past

## Cross validation
cross_validation <- trainControl(## 10-fold CV
                                method = "repeatedcv",
                                number = 10,
                                ## repeated three times
                                repeats = 3)

Param_Grid <-  expand.grid(maxdepth = 2:10)

dtree_fit <- train(HN_data_x_train_past,
                   HN_data_y_train_past, 
                   method = "rpart2",
                   # split - criteria to split nodes
                   parms = list(split = "gini"),
                  tuneGrid = Param_Grid,
                   trControl = cross_validation,
                  # preProc -  perform listed pre-processing to predictor dataframe
                   preProc = c("center", "scale"))


## Predict on test data
dtree_predict <- predict(dtree_fit, newdata = HN_data_x_test_past)

dtree_prob <- predict(dtree_fit, newdata = HN_data_x_test_past, type = "prob")


## Add results into clf_results dataframe
x6 <- confusionMatrix(dtree_predict,  HN_data_y_test_past )[["overall"]]
y6 <- confusionMatrix(dtree_predict,  HN_data_y_test_past )[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Decision Tree", 
                                             Accuracy = round (x6[["Accuracy"]],3), 
                                            Precision = round (y6[["Precision"]],3), 
                                            Recall = round (y6[["Recall"]],3), 
                                            F1 = round (y6[["F1"]],3))

a6 <- confusionMatrix(dtree_predict,  HN_data_y_test_past )

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Decision Tree", 
                                             TP = a6[["table"]][4], 
                                             FN = a6[["table"]][3], 
                                             FP = a6[["table"]][2], 
                                             TN = a6[["table"]][1])

Decision tree all

## Cross validation
cross_validation <- trainControl(## 10-fold CV
                                method = "repeatedcv",
                                number = 10,
                                ## repeated three times
                                repeats = 3)

Param_Grid <-  expand.grid(maxdepth = 2:10)

dtree_fit_2 <- train(HN_data_x_train,
                   HN_data_y_train, 
                   method = "rpart2",
                   # split - criteria to split nodes
                   parms = list(split = "gini"),
                  tuneGrid = Param_Grid,
                   trControl = cross_validation,
                  # preProc -  perform listed pre-processing to predictor dataframe
                   preProc = c("center", "scale"))




## Predict on test data
dtree_predict_2 <- predict(dtree_fit_2, newdata = HN_data_x_test)
dtree_prob_2 <- predict(dtree_fit_2, newdata = HN_data_x_test, type = "prob")



## Add results into clf_results dataframe
xg <- confusionMatrix(dtree_predict_2,  HN_data_y_test )[["overall"]]
yg <- confusionMatrix(dtree_predict_2,  HN_data_y_test )[["byClass"]]

clf_results[nrow(clf_results) + 1,] <-  list(Model = "Decision Tree 2", 
                                             Accuracy = round (xg[["Accuracy"]],3), 
                                            Precision = round (yg[["Precision"]],3), 
                                            Recall = round (yg[["Recall"]],3), 
                                            F1 = round (yg[["F1"]],3))
ag <- confusionMatrix(dtree_predict_2,  HN_data_y_test  )

cost_benefit_df[nrow(cost_benefit_df) + 1,] <-  list(Model = "Decision Tree 2", 
                                             TP = ag[["table"]][4], 
                                             FN = ag[["table"]][3], 
                                             FP = ag[["table"]][2], 
                                             TN = ag[["table"]][1])

Summarize model performance

##Compare models outputs

print(cost_benefit_df)
##                 Model   TP   FN   FP    TN
## 1      Log Regression  940  906 5748 19210
## 2    Log Regression 2 1245  601 7681 17277
## 3                 KNN  777 1069 6471 18487
## 4               KNN 2  983  863 6717 18241
## 5            XG Boost  928  918 3951 21007
## 6          XG Boost 2 1075  771 4099 20859
## 7      Neural Network  932  914 6637 18321
## 8    Neural Network 2 1046  800 4982 19976
## 9    Ridge regression  941  905 5778 19180
## 10 Ridge regression 2 1247  599 7704 17254
## 11      Decision Tree 1048  798 5325 19633
## 12    Decision Tree 2 1145  701 6300 18658
print(clf_results)
##                 Model Accuracy Precision Recall    F1
## 1      Log Regression    0.752     0.955  0.770 0.852
## 2    Log Regression 2    0.691     0.966  0.692 0.807
## 3                 KNN    0.719     0.945  0.741 0.831
## 4               KNN 2    0.717     0.955  0.731 0.828
## 5            XG Boost    0.818     0.958  0.842 0.896
## 6          XG Boost 2    0.818     0.964  0.836 0.895
## 7      Neural Network    0.718     0.952  0.734 0.829
## 8    Neural Network 2    0.784     0.961  0.800 0.874
## 9    Ridge regression    0.751     0.955  0.768 0.852
## 10 Ridge regression 2    0.690     0.966  0.691 0.806
## 11      Decision Tree    0.772     0.961  0.787 0.865
## 12    Decision Tree 2    0.739     0.964  0.748 0.842
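
The TP/FN/FP/TN counts above were collected so the models can also be compared in dollar terms. As a sketch, using placeholder unit economics (the $8 net benefit per converted user and $1 cost per wasted trial are assumptions, not HighNote figures), an expected campaign value per model could be computed as follows.

## placeholder unit economics - assumed values, not HighNote figures
benefit_tp <- 8   # assumed net value of each correctly targeted converter
cost_fp <- 1      # assumed cost of each free trial wasted on a non-converter

cost_benefit_df %>%
  mutate(Expected_Value = TP * benefit_tp - FP * cost_fp) %>%
  arrange(desc(Expected_Value))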


Now that we have classification results for all the models, we can compare F1 score and accuracy across models on a chart.

Plot Accuracy & F1 score

## Plot accuracy for all the Classification Models

ggplot(clf_results[1:12,] %>% arrange(desc(Accuracy)) %>%
       mutate(Model=factor(Model, levels=Model) ), 
       aes(x = Model, y = Accuracy)) +
  geom_bar(stat = "identity" , width=0.3, fill="steelblue") + 
  coord_cartesian(ylim = c(0.2, 1)) +
  geom_hline(aes(yintercept = mean(Accuracy)),
             colour = "green",linetype="dashed") +
  ggtitle("Compare Accuracy for all Models") +
  theme(plot.title = element_text(color="black", size=10, hjust = 0.5),axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

## Plot F1 score for all the Classification Models

ggplot(clf_results[1:12,] %>% arrange(desc(F1)) %>%
       mutate(Model=factor(Model, levels=Model) ), 
       aes(x = Model, y = F1)) +
  geom_bar(stat = "identity" , width=0.3, fill="steelblue") + 
  coord_cartesian(ylim = c(0.2, 1)) +
  geom_hline(aes(yintercept = mean(F1)),
             colour = "green",linetype="dashed") +
  ggtitle("Compare F1 Score for all Models") +
  theme(plot.title = element_text(color="black", size=10, hjust = 0.5),
        axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

Looking at F1 scores and accuracy across all models, XGBoost 2, trained on all the data, has the highest F1 score and the highest accuracy. However, accuracy is only a reliable measure when the target classes are nearly balanced, and the HighNote test set is heavily imbalanced (about 93% non-adopters). I will therefore compare ROC curves as well, since the area under the ROC curve is a more robust summary of model performance in this setting.


## Plotting ROC curves

preds_list_past <- list(glm_prob[,2],knn_prob[,2],xg_prob[,2],NN_prob[,2],ridge_prob[,1],dtree_prob[,2])

preds_list_all <- list(glm_prob_2[,2],knn_prob_2[,2],xg_prob_2[,2],NN_prob_2[,2],ridge_prob_2[,1],dtree_prob_2[,2])

## List of actual values 
m <- length(preds_list_past)
actuals_list_past <- rep(list(HN_data_y_test_past), m)
actuals_list_all <- rep(list(HN_data_y_test), m)


## Plot the ROC curves
pred_past <- prediction(preds_list_past, actuals_list_past)
pred_all <- prediction(preds_list_all, actuals_list_all)

rocs_past <- performance(pred_past, "tpr", "fpr")
rocs_all <- performance(pred_all, "tpr", "fpr")

AUC_past_models <- performance(pred_past, "auc")
AUC_all_models <- performance(pred_all, "auc")

auc_lr = round(AUC_past_models@y.values[[1]], 3)
auc_lr2 = round(AUC_all_models@y.values[[1]], 3)
auc_knn= round(AUC_past_models@y.values[[2]], 3)
auc_knn2= round(AUC_all_models@y.values[[2]], 3)
auc_xg = round(AUC_past_models@y.values[[3]], 3)
auc_xg2 = round(AUC_all_models@y.values[[3]], 3)
auc_nn = round(AUC_past_models@y.values[[4]], 3)
auc_nn2 = round(AUC_all_models@y.values[[4]], 3)
auc_ridge = round(AUC_past_models@y.values[[5]], 3)
auc_ridge2 = round(AUC_all_models@y.values[[5]], 3)
auc_dtree = round(AUC_past_models@y.values[[6]], 3)
auc_dtree2 = round(AUC_all_models@y.values[[6]], 3)


# Plot the ROC curves for past data
plot(rocs_past, col = as.list(1:m), main = "ROC Curves of different models (past data)")
legend(x = "bottomright", 
       legend = c(
    paste0("log regression past_", auc_lr),
    paste0("knn past- ", auc_knn),
    paste0("XG boost past - ", auc_xg),
    paste0("Neural network past - ", auc_nn),
    paste0("ridge past - ", auc_ridge),
    paste0("dtree past - ", auc_dtree)
  
     ), 
    fill = 1:m)

# Plot the ROC curves for all data
plot(rocs_all, col = as.list(1:m), main = "ROC Curves of different models (all data)")
legend(x = "bottomright", 
       legend = c(
  
     paste0("log regression all_", auc_lr2),
    paste0("knn all - ", auc_knn2),
    paste0("XG boost all - ", auc_xg2),
    paste0("Neural network all - ", auc_nn2),
    paste0("ridge all- ", auc_ridge2),
    paste0("dtree all - ", auc_dtree2)
  
     ), 
    fill = 1:m)
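
ROCR was also loaded to plot lift curves, and since the deliverable is a top-1000 target list, a cumulative lift chart is a useful complement to the ROC comparison. A minimal sketch reusing the existing pred_all object:

## lift vs. rate of positive predictions for the models trained on all data
lift_all <- performance(pred_all, "lift", "rpp")
plot(lift_all, col = as.list(1:m), main = "Lift curves of different models (all data)")
legend(x = "topright",
       legend = c("log regression all", "knn all", "XG boost all",
                  "Neural network all", "ridge all", "dtree all"),
       fill = 1:m)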


After plotting the ROC curves and comparing AUC values, XGBoost trained on all the data still performs best. So I will pick that XGBoost model as the final model to forecast the probability that free subscribers switch to premium.


Now that I have picked my final model, I go back to the data and collect all free users in a new dataset.

Pick all free users

HN_data_free_users<-HN_data_PostModule%>% filter(adopter==0)

## remove id and adopter

HN_data_free_users_test<-HN_data_free_users %>% select(-c("net_user","adopter"))
HN_data_free_users_id<-HN_data_free_users %>% select(c("net_user"))


#  one-hot encoding for data
HN_data_free_users_test <- data.frame(HN_data_free_users_test) #convert from a tibble to a data frame
HN_data_free_users_test_dummy <- dummy.data.frame(data=HN_data_free_users_test, sep="_")  #does one-hot encoding


Finally, we can forecast the probability of switching to premium for all free users and pick the top 1000 users who have the highest probability of switching.

# predict values on new data

predict_adopting_probability<- predict(XG_clf_fit_2, newdata = HN_data_free_users_test_dummy, type = "prob")
probability_of_adopting<-predict_adopting_probability[,2]

#adding id to the table
adopters_list<-cbind(HN_data_free_users_id, probability_of_adopting)
summary(adopters_list)
##            net_user     probability_of_adopting
##  #NAME?        :    9   Min.   :0.0006899      
##  ____foursticks:    1   1st Qu.:0.0616172      
##  ____mih       :    1   Median :0.1625387      
##  ___8          :    1   Mean   :0.2424639      
##  ___ashley_    :    1   3rd Qu.:0.3769027      
##  ___cake       :    1   Max.   :0.9956692      
##  (Other)       :99986
top_1000_adopters<- head(adopters_list[order(-adopters_list$probability_of_adopting), ],1000)


summary(top_1000_adopters)
##             net_user   probability_of_adopting
##  _underyourspell:  1   Min.   :0.8606         
##  -aceton-       :  1   1st Qu.:0.8766         
##  -bitchslap-    :  1   Median :0.8974         
##  -opium-        :  1   Mean   :0.9045         
##  ab11111111111  :  1   3rd Qu.:0.9274         
##  abysser        :  1   Max.   :0.9957         
##  (Other)        :994
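
To hand the list to the marketing team, the result can be written out; the file name below is illustrative.

## save the target list (file name is illustrative)
write_csv(top_1000_adopters, "top_1000_adopters.csv")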

How can we make sure that, if we target these people, we can compute the correct ROI from the proposed model? Describe your strategy for actually deploying the targeting recommendations from the predictive model.

Targeting subscribers based on predicted probability alone may not always be the best approach. In the HighNote scenario, for example, one could argue that free subscribers with a high predicted probability would switch even without a special promotion, and that the campaign would be more productive if it targeted only subscribers who have a high probability and have been free users for more than a year. In that case, focusing on probability alone as the benchmark may not meet the campaign's expectations. To test this strategy, we can randomly split a set of comparable candidates (similar predicted probability, free subscribers for less than a year) into two groups, send the promotion to one group only, and test whether the promoted group converts to premium at a higher rate than the unpromoted group (a sketch of this randomized test appears at the end of this section).

Another strategy for measuring ROI precisely is to include an expected value/cost analysis in the targeting process. In the HighNote scenario, each targeted member is offered a two-month free trial, which means there is a cost associated with the campaign. On the other hand, if the member continues with a premium subscription, that brings extra revenue to HighNote. Considering this, we can calculate the true expected value of targeting each member m as:

$$P_c(m) \cdot V_c(m) + [1 - P_c(m)] \cdot V_{nc}(m)$$

in this expression:

$P_c(m)$ is the probability that the targeted subscriber continues the premium membership after the free trial.

$V_c(m)$ is the value each targeted subscriber brings to HighNote if they continue the premium membership, equal to the monthly subscription fee minus the forgone ad revenue.

$1 - P_c(m)$ is the probability that the targeted subscriber does not continue the premium membership at the end of the two months.

$V_{nc}(m)$ is the cost to HighNote of targeting a subscriber who does not convert.

Since we don't want to lose money on this campaign, we will target member $m$ only if

$$P_c(m) \cdot V_c(m) + [1 - P_c(m)] \cdot V_{nc}(m) > 0$$
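
A minimal sketch of this rule in R, using the model's adoption probability as a proxy for $P_c(m)$; the subscription fee, ad revenue, trial cost, and one-year horizon below are placeholder assumptions, not HighNote figures.

## expected value per scored free user - all dollar figures are assumptions
monthly_fee <- 5                 # assumed premium fee per month
ad_revenue <- 1                  # assumed monthly ad revenue from a free user
trial_cost <- 2 * monthly_fee    # assumed cost of the two-month free trial

targeting_df <- adopters_list %>%
  mutate(Pc = probability_of_adopting,
         Vc = 12 * (monthly_fee - ad_revenue) - trial_cost,  # assumed one-year value
         Vnc = -trial_cost,
         expected_value = Pc * Vc + (1 - Pc) * Vnc) %>%
  filter(expected_value > 0) %>%
  arrange(desc(expected_value))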

As a result of the expected value analysis, we might skip targeting even high-probability users, because they still would not bring the organization the value anticipated from the promotion.
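
Returning to the randomized test proposed above, here is a minimal sketch that assigns comparable candidates to promotion and holdout groups; the candidate pool size and the 50/50 split are arbitrary choices for illustration.

## randomly assign comparable candidates to promotion vs. holdout
set.seed(2021)
candidates <- head(adopters_list[order(-adopters_list$probability_of_adopting), ], 2000)
treat_idx <- sample(seq_len(nrow(candidates)), size = nrow(candidates) / 2)

treatment_group <- candidates[treat_idx, ]   # receives the two-month free trial
control_group <- candidates[-treat_idx, ]    # receives no promotion

## after the trial window, compare conversion rates between the groups
## (e.g., with prop.test) to estimate the promotion's incremental effect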