You are all familiar with the HighNote freemium monetization case. HighNote is an online community, similar to Spotify (and LinkedIn and OkCupid), in which free users and premium users co-exist. You covered this case at a high level using summary statistics in Module 1 with Prof Anindya Ghose. Now you have the chance to build on your knowledge using additional micro data at the individual level. This could be useful for figuring out who specifically to target to convert from free to fee (premium).

## Complication

A general challenge in freemium communities is getting people to pay for premium when they can also use the service for free (at the expense, say, of seeing ads or missing some features). However, free users are not as profitable as premium users because the fight for brands' ad dollars is intense. HighNote would like to use a data-driven approach to convert more free users into premium users. They are willing to offer a two-month free trial, but they do not know who to give this promotion to. They have hired you as a consultant to help them design a targeting strategy.
## Loading data
HN_data_PostModule <- read_csv("HN_data_PostModule.csv", col_types = "fnfnnnnnnnnnnnfnnnnnnnnnnff")
## Change NA values in factor variables to UNK
HN_data_PostModule <- HN_data_PostModule %>%
mutate(net_user = as.character(net_user)) %>%
mutate(net_user = as.factor(recode(net_user, "NA" = "UNK")) )%>%
mutate(male = as.character(male)) %>%
mutate(male = as.factor(recode(male, "NA" = "UNK"))) %>%
mutate(adopter = as.character(adopter)) %>%
mutate(adopter = as.factor(recode(adopter, "NA" = "UNK"))) %>%
mutate(good_country = as.character(good_country)) %>%
mutate(good_country = as.factor(recode(good_country, "NA" = "UNK"))) %>%
mutate(delta1_good_country = as.character(delta1_good_country)) %>%
mutate(delta1_good_country = as.factor(recode(delta1_good_country, "NA" = "UNK")))
## Replace NA values in numeric variables with the group mean or median, depending on the variable
HN_data_PostModule <- HN_data_PostModule %>%
group_by(male) %>%
mutate(age = ifelse(is.na(age), mean(age, na.rm = TRUE), age)) %>%
mutate(friend_cnt = ifelse(is.na(friend_cnt), median(friend_cnt, na.rm = TRUE), friend_cnt)) %>%
mutate(avg_friend_age = ifelse(is.na(avg_friend_age), mean(avg_friend_age, na.rm = TRUE), avg_friend_age)) %>%
mutate(avg_friend_male = ifelse(is.na(avg_friend_male), mean(avg_friend_male, na.rm = TRUE), avg_friend_male)) %>%
mutate(friend_country_cnt = ifelse(is.na(friend_country_cnt), median(friend_country_cnt, na.rm = TRUE), friend_country_cnt)) %>%
mutate(subscriber_friend_cnt = ifelse(is.na(subscriber_friend_cnt), median(subscriber_friend_cnt, na.rm = TRUE), subscriber_friend_cnt)) %>%
mutate(songsListened = ifelse(is.na(songsListened), median(songsListened, na.rm = TRUE), songsListened)) %>%
mutate(lovedTracks = ifelse(is.na(lovedTracks), median(lovedTracks, na.rm = TRUE), lovedTracks)) %>%
mutate(posts = ifelse(is.na(posts), median(posts, na.rm = TRUE), posts)) %>%
mutate(playlists = ifelse(is.na(playlists), median(playlists, na.rm = TRUE), playlists)) %>%
mutate(shouts = ifelse(is.na(shouts), median(shouts, na.rm = TRUE), shouts)) %>%
mutate(tenure = ifelse(is.na(tenure), mean(tenure, na.rm = TRUE), tenure)) %>%
mutate(delta1_friend_cnt = ifelse(is.na(delta1_friend_cnt), median(delta1_friend_cnt, na.rm = TRUE), delta1_friend_cnt)) %>%
mutate(delta1_avg_friend_age = ifelse(is.na(delta1_avg_friend_age), mean(delta1_avg_friend_age, na.rm = TRUE), delta1_avg_friend_age)) %>%
mutate(delta1_avg_friend_male = ifelse(is.na(delta1_avg_friend_male), mean(delta1_avg_friend_male, na.rm = TRUE), delta1_avg_friend_male)) %>%
mutate(delta1_friend_country_cnt = ifelse(is.na(delta1_friend_country_cnt), median(delta1_friend_country_cnt, na.rm = TRUE), delta1_friend_country_cnt)) %>%
mutate(delta1_subscriber_friend_cnt = ifelse(is.na(delta1_subscriber_friend_cnt), median(delta1_subscriber_friend_cnt, na.rm = TRUE), delta1_subscriber_friend_cnt)) %>%
mutate(delta1_songsListened = ifelse(is.na(delta1_songsListened), median(delta1_songsListened, na.rm = TRUE), delta1_songsListened)) %>%
mutate(delta1_lovedTracks = ifelse(is.na(delta1_lovedTracks), median(delta1_lovedTracks, na.rm = TRUE), delta1_lovedTracks)) %>%
mutate(delta1_posts = ifelse(is.na(delta1_posts), median(delta1_posts, na.rm = TRUE), delta1_posts)) %>%
mutate(delta1_playlists = ifelse(is.na(delta1_playlists), median(delta1_playlists, na.rm = TRUE), delta1_playlists)) %>%
mutate(delta1_shouts = ifelse(is.na(delta1_shouts), median(delta1_shouts, na.rm = TRUE), delta1_shouts)) %>%
ungroup()
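The group-wise imputation above could also be written more compactly with dplyr::across(); the following is just a sketch of an equivalent formulation (it assumes dplyr >= 1.0 and the same mean/median split per variable; the explicit chain above is what was actually run):
## Variables imputed with the group mean vs. the group median (same split as above)
mean_vars <- c("age", "avg_friend_age", "avg_friend_male", "tenure",
               "delta1_avg_friend_age", "delta1_avg_friend_male")
median_vars <- c("friend_cnt", "friend_country_cnt", "subscriber_friend_cnt",
                 "songsListened", "lovedTracks", "posts", "playlists", "shouts",
                 "delta1_friend_cnt", "delta1_friend_country_cnt",
                 "delta1_subscriber_friend_cnt", "delta1_songsListened",
                 "delta1_lovedTracks", "delta1_posts", "delta1_playlists", "delta1_shouts")
HN_data_PostModule <- HN_data_PostModule %>%
  group_by(male) %>%
  mutate(across(all_of(mean_vars), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)),
         across(all_of(median_vars), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  ungroup()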
## using log transform on skewed variables
HN_data_PostModule$friend_cnt<-log(HN_data_PostModule$friend_cnt+1)
HN_data_PostModule$playlists<-log(HN_data_PostModule$playlists+1)
HN_data_PostModule$posts<-log(HN_data_PostModule$posts+1)
HN_data_PostModule$songsListened<-log(HN_data_PostModule$songsListened+1)
## Function to rescale numeric vectors to the 0-1 range
## (assumes max(vec) > min(vec); a constant column would produce NaN)
normalize <- function(vec) {
  if (is.numeric(vec)) {
    vec <- (vec - min(vec)) / (max(vec) - min(vec))
  }
  return(vec)
}
## Normalize the data
HN_data_PostModule <- as.data.frame(lapply(HN_data_PostModule, normalize))
#skim(HN_data_PostModule)
HN_data_past<-HN_data_PostModule[,c('net_user', 'age', 'male', 'tenure', 'delta1_friend_cnt', 'delta1_avg_friend_age', 'delta1_avg_friend_male', 'delta1_friend_country_cnt', 'delta1_subscriber_friend_cnt', 'delta1_songsListened', 'delta1_lovedTracks', 'delta1_posts', 'delta1_playlists', 'delta1_shouts', 'delta1_good_country', 'adopter')]
## create Y and X data frames
HN_data_past_Y = HN_data_past %>% pull("adopter")
HN_data_Y = HN_data_PostModule %>% pull("adopter")
HN_data_past_X = HN_data_past %>% select(-c("adopter"))
HN_data_X = HN_data_PostModule %>% select(-c("adopter"))
## Remove the id column from the data frames and store it separately
HN_data_past_X_id<-HN_data_past_X %>% select(c("net_user"))
HN_data_X_id<-HN_data_X %>% select(c("net_user"))
HN_data_past_X<-HN_data_past_X %>% select(-c("net_user"))
HN_data_X<-HN_data_X %>% select(-c("net_user"))
## 75% of the data is used for training and the rest for testing
smp_size <- floor(0.75 * nrow(HN_data_past_X))
## randomly select row numbers for training data set
set.seed(12345)
train_ind <- sample(seq_len(nrow(HN_data_past_X)), size = smp_size)
## creating test and training sets for x
HN_data_x_train_past <- HN_data_past_X[train_ind, ]
HN_data_x_train <- HN_data_X[train_ind, ]
HN_data_x_test_past <- HN_data_past_X[-train_ind, ]
HN_data_x_test <- HN_data_X[-train_ind, ]
## creating test and training sets for y
HN_data_y_train_past <- HN_data_past_Y[train_ind]
HN_data_y_train <- HN_data_Y[train_ind]
HN_data_y_test_past <- HN_data_past_Y[-train_ind]
HN_data_y_test <- HN_data_Y[-train_ind]
## Check the class proportions across the full, training, and test sets.
round(prop.table(table(select(HN_data_past, adopter), exclude = NULL)), 4) * 100
##
## 0 1
## 93.27 6.73
round(prop.table(table(HN_data_y_train_past)), 4) * 100
## HN_data_y_train_past
## 0 1
## 93.33 6.67
round(prop.table(table(HN_data_y_test_past)), 4) * 100
## HN_data_y_test_past
## 0 1
## 93.11 6.89
## The adopter classes are highly imbalanced, so we use SMOTE to give 0 and 1 roughly equal weight in training
## (perc.over = 100 creates one synthetic minority case per original one; perc.under = 200 keeps two majority cases per synthetic case, yielding the 50/50 split below)
library("abind")
install.packages( "https://cran.r-project.org/src/contrib/Archive/DMwR/DMwR_0.4.1.tar.gz", repos=NULL, type="source" )
library("DMwR")
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
set.seed(1234)
##create the full training dataset with X and y variable
HN_data_past_train <- cbind(HN_data_x_train_past, HN_data_y_train_past)
HN_data_train <- cbind(HN_data_x_train, HN_data_y_train)
HN_train_balanced_past <- SMOTE(HN_data_y_train_past ~ ., data.frame(HN_data_past_train), perc.over = 100, perc.under = 200)
HN_train_balanced <- SMOTE(HN_data_y_train ~ ., data.frame(HN_data_train), perc.over = 100, perc.under = 200)
## Check the proportions
round(prop.table(table(HN_train_balanced$HN_data_y_train)), 4) * 100
##
## 0 1
## 50 50
round(prop.table(table(HN_train_balanced_past$HN_data_y_train_past)), 4) * 100
##
## 0 1
## 50 50
##remove the Y column from the newly balanced training set
HN_data_x_train_past <- HN_train_balanced_past %>% select(-HN_data_y_train_past)
HN_data_x_train <- HN_train_balanced %>% select(-HN_data_y_train)
##store the Y column
HN_data_y_train_past <- HN_train_balanced_past %>% pull(HN_data_y_train_past) %>% as.factor()
HN_data_y_train <- HN_train_balanced %>% pull(HN_data_y_train) %>% as.factor()
## Create an empty data frame to store results from different models
clf_results <- data.frame(matrix(ncol = 5, nrow = 0))
names(clf_results) <- c("Model", "Accuracy", "Precision", "Recall", "F1")
## Create an empty data frame to store TP, TN, FP and FN values
cost_benefit_df <- data.frame(matrix(ncol = 5, nrow = 0))
names(cost_benefit_df) <- c("Model", "TP", "FN", "FP", "TN")
After preparing the data, we can start building models.
## Logistic regression on the past-data features
yyy_fit <- train(HN_data_x_train_past,
HN_data_y_train_past,
method = "glm",
family = "binomial",
preProc = c("center", "scale"))
## Generate predictions on the test data
yyy_predict <- predict(yyy_fit, newdata = HN_data_x_test_past, positive="1" )
glm_prob <- predict(yyy_fit, newdata = HN_data_x_test_past, type = "prob")
## Add results into clf_results dataframe
x1 <- confusionMatrix(yyy_predict, HN_data_y_test_past )[["overall"]]
y1 <- confusionMatrix(yyy_predict, HN_data_y_test_past)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Log Regression",
Accuracy = round (x1[["Accuracy"]],3),
Precision = round (y1[["Precision"]],3),
Recall = round (y1[["Recall"]],3),
F1 = round (y1[["F1"]],3))
## Add results into cost_benefit_df dataframe
a1 <- confusionMatrix(yyy_predict, HN_data_y_test_past)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Log Regression",
TP = a1[["table"]][4],
FN = a1[["table"]][3],
FP = a1[["table"]][2],
TN = a1[["table"]][1])
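The same bookkeeping (confusion matrix, then one row each in clf_results and cost_benefit_df) is repeated for every model below. A small helper along the following lines could replace the copy-pasted blocks; add_model_results() is hypothetical and only a sketch, not used in the rest of this report:
## Hypothetical helper: compute the confusion matrix once and append one row
## to each results data frame (sketch only)
add_model_results <- function(model_name, predictions, actuals, clf_df, cb_df) {
  cm <- confusionMatrix(predictions, actuals)
  clf_df[nrow(clf_df) + 1, ] <- list(Model = model_name,
                                     Accuracy  = round(cm$overall[["Accuracy"]], 3),
                                     Precision = round(cm$byClass[["Precision"]], 3),
                                     Recall    = round(cm$byClass[["Recall"]], 3),
                                     F1        = round(cm$byClass[["F1"]], 3))
  cb_df[nrow(cb_df) + 1, ] <- list(Model = model_name,
                                   TP = cm$table[4], FN = cm$table[3],
                                   FP = cm$table[2], TN = cm$table[1])
  list(clf_results = clf_df, cost_benefit_df = cb_df)
}
## Example call (would replace the block above):
## res <- add_model_results("Log Regression", yyy_predict, HN_data_y_test_past, clf_results, cost_benefit_df)
## clf_results <- res$clf_results; cost_benefit_df <- res$cost_benefit_df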
## Logistic regression on all features (current + delta)
yyy_fit_2 <- train(HN_data_x_train,
HN_data_y_train,
method = "glm",
family = "binomial",
preProc = c("center", "scale"))
## Generate predictions on the test data
yyy_predict_2 <- predict(yyy_fit_2, newdata = HN_data_x_test, positive="1" )
glm_prob_2 <- predict(yyy_fit_2, newdata = HN_data_x_test, type = "prob")
## Add results into clf_results dataframe
xa <- confusionMatrix(yyy_predict_2, HN_data_y_test )[["overall"]]
ya <- confusionMatrix(yyy_predict_2, HN_data_y_test)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Log Regression 2",
Accuracy = round (xa[["Accuracy"]],3),
Precision = round (ya[["Precision"]],3),
Recall = round (ya[["Recall"]],3),
F1 = round (ya[["F1"]],3))
## Add results into cost_benefit_df dataframe
aa <- confusionMatrix(yyy_predict_2, HN_data_y_test )
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Log Regression 2",
TP = aa[["table"]][4],
FN = aa[["table"]][3],
FP = aa[["table"]][2],
TN = aa[["table"]][1])
# Convert y to the factor levels "YES"/"NO"; caret needs valid R names for class levels when classProbs = TRUE
HN_data_y_train_past <- as.factor(ifelse(HN_data_y_train_past =="1", "YES", "NO"))
HN_data_y_test_past <- as.factor(ifelse(HN_data_y_test_past =="1", "YES", "NO"))
## One-hot encoding for train and test data
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
HN_data_x_train_past <- data.frame(HN_data_x_train_past) #convert from a tibble to a data frame
HN_data_x_train_past <- dummy.data.frame(data=HN_data_x_train_past, sep="_") #does one-hot encoding
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
HN_data_x_test_past <- data.frame(HN_data_x_test_past) #convert from a tibble to a data frame
HN_data_x_test_past <- dummy.data.frame(data=HN_data_x_test_past, sep="_") #does one-hot encoding
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
##Fit KNN Model
##Cross validation
cross_validation <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated once
repeats = 1,
summaryFunction=twoClassSummary,
classProbs=TRUE)
# Hyperparameter tuning
# k = number of nearest neighbours
Param_Grid <- expand.grid( k = 1:9)
# fit the model to training data
knn_clf_fit <- train(HN_data_x_train_past,
HN_data_y_train_past,
method = "knn",
metric = "ROC",
tuneGrid = Param_Grid,
trControl = cross_validation )
## Plot cross-validated ROC for different k values
plot(knn_clf_fit)
## Generate predictions on the test data
knn_Predict <- predict(knn_clf_fit, newdata = HN_data_x_test_past,positive = "YES")
knn_prob <- predict(knn_clf_fit, newdata = HN_data_x_test_past, type = "prob")
## Add results into clf_results dataframe
x2 <- confusionMatrix(knn_Predict, HN_data_y_test_past )[["overall"]]
y2 <- confusionMatrix(knn_Predict, HN_data_y_test_past)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "KNN",
Accuracy = round (x2[["Accuracy"]],3),
Precision = round (y2[["Precision"]],3),
Recall = round (y2[["Recall"]],3),
F1 = round (y2[["F1"]],3))
## Add results into cost_benefit_df dataframe
a2 <- confusionMatrix(knn_Predict, HN_data_y_test_past)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "KNN",
TP = a2[["table"]][4],
FN = a2[["table"]][3],
FP = a2[["table"]][2],
TN = a2[["table"]][1])
## Convert y to the factor levels "YES"/"NO"; caret needs valid R names for class levels when classProbs = TRUE
HN_data_y_train <- as.factor(ifelse(HN_data_y_train =="1", "YES", "NO"))
HN_data_y_test <- as.factor(ifelse(HN_data_y_test =="1", "YES", "NO"))
## One-hot encoding for train and test data
library(dummies)
HN_data_x_train <- data.frame(HN_data_x_train) #convert from a tibble to a data frame
HN_data_x_train <- dummy.data.frame(data=HN_data_x_train, sep="_") #does one-hot encoding
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
HN_data_x_test <- data.frame(HN_data_x_test) #convert from a tibble to a data frame
HN_data_x_test <- dummy.data.frame(data=HN_data_x_test, sep="_") #does one-hot encoding
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Cross validation
cross_validation <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated once
repeats = 1,
summaryFunction=twoClassSummary,
classProbs=TRUE)
# Hyperparameter tuning
# k = number of nearest neighbours
Param_Grid <- expand.grid( k = 1:9)
# fit the model to training data
knn_clf_fit_2 <- train(HN_data_x_train,
HN_data_y_train,
method = "knn",
metric = "ROC",
tuneGrid = Param_Grid,
trControl = cross_validation )
## Plot cross-validated ROC for different k values
plot(knn_clf_fit_2)
## Generate predictions on the test data
knn_Predict_2 <- predict(knn_clf_fit_2, newdata = HN_data_x_test,positive = "YES")
knn_prob_2 <- predict(knn_clf_fit_2, newdata = HN_data_x_test, type = "prob")
## Add results into clf_results dataframe
xb <- confusionMatrix(knn_Predict_2, HN_data_y_test )[["overall"]]
yb <- confusionMatrix(knn_Predict_2, HN_data_y_test)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "KNN 2",
Accuracy = round (xb[["Accuracy"]],3),
Precision = round (yb[["Precision"]],3),
Recall = round (yb[["Recall"]],3),
F1 = round (yb[["F1"]],3))
## Add results into cost_benefit_df dataframe
ab <- confusionMatrix(knn_Predict_2, HN_data_y_test)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "KNN 2",
TP = ab[["table"]][4],
FN = ab[["table"]][3],
FP = ab[["table"]][2],
TN = ab[["table"]][1])
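## XGBoost on the past-data features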
XG_clf_fit <- train(HN_data_x_train_past,
HN_data_y_train_past,
method = "xgbTree",
preProc = c("center", "scale"))
## Predict on test data
XG_clf_predict <- predict(XG_clf_fit,HN_data_x_test_past, positive = "YES")
xg_prob <- predict(XG_clf_fit, newdata = HN_data_x_test_past, type = "prob")
## Add results into clf_results dataframe
x3 <- confusionMatrix(XG_clf_predict, HN_data_y_test_past )[["overall"]]
y3 <- confusionMatrix(XG_clf_predict, HN_data_y_test_past )[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "XG Boost",
Accuracy = round (x3[["Accuracy"]],3),
Precision = round (y3[["Precision"]],3),
Recall = round (y3[["Recall"]],3),
F1 = round (y3[["F1"]],3))
## Add results into cost_benefit_df dataframe
a3 <- confusionMatrix(XG_clf_predict, HN_data_y_test_past)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "XG Boost",
TP = a3[["table"]][4],
FN = a3[["table"]][3],
FP = a3[["table"]][2],
TN = a3[["table"]][1])
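## XGBoost on all features (current + delta)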
XG_clf_fit_2 <- train(HN_data_x_train,
HN_data_y_train,
method = "xgbTree",
preProc = c("center", "scale"))
## Predict on test data
XG_clf_predict_2 <- predict(XG_clf_fit_2,HN_data_x_test, positive = "YES")
xg_prob_2 <- predict(XG_clf_fit_2, newdata = HN_data_x_test, type = "prob")
## Add results into clf_results dataframe
xc <- confusionMatrix(XG_clf_predict_2, HN_data_y_test )[["overall"]]
yc <- confusionMatrix(XG_clf_predict_2, HN_data_y_test )[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "XG Boost 2",
Accuracy = round (xc[["Accuracy"]],3),
Precision = round (yc[["Precision"]],3),
Recall = round (yc[["Recall"]],3),
F1 = round (yc[["F1"]],3))
## Add results into cost_benefit_df dataframe
ac <- confusionMatrix(XG_clf_predict_2, HN_data_y_test)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "XG Boost 2",
TP = ac[["table"]][4],
FN = ac[["table"]][3],
FP = ac[["table"]][2],
TN = ac[["table"]][1])
## Neural network on the past-data features: tune decay and the size of the hidden layer
my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 7))
nn_clf_fit <- train(HN_data_x_train_past,
HN_data_y_train_past,
method = "nnet",
trace = F,
tuneGrid = my.grid,
linout = 0,
stepmax = 100,
threshold = 0.01 )
## Plot Neural Network
plotnet(nn_clf_fit$finalModel, y_names = "adopter")
## Predict on test data
nn_clf_predict <- predict(nn_clf_fit,HN_data_x_test_past, positive = "1")
NN_prob <- predict(nn_clf_fit, newdata = HN_data_x_test_past, type = "prob")
## Add results into clf_results dataframe
x4 <- confusionMatrix(nn_clf_predict,HN_data_y_test_past)[["overall"]]
y4 <- confusionMatrix(nn_clf_predict,HN_data_y_test_past)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Neural Network",
Accuracy = round (x4[["Accuracy"]],3),
Precision = round (y4[["Precision"]],3),
Recall = round (y4[["Recall"]],3),
F1 = round (y4[["F1"]],3))
## Add results into cost_benefit_df dataframe
a4 <- confusionMatrix(nn_clf_predict,HN_data_y_test_past)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Neural Network",
TP = a4[["table"]][4],
FN = a4[["table"]][3],
FP = a4[["table"]][2],
TN = a4[["table"]][1])
## Neural network on all features: tune decay and the size of the hidden layer
my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 7))
nn_clf_fit_2 <- train(HN_data_x_train,
HN_data_y_train,
method = "nnet",
trace = F,
tuneGrid = my.grid,
linout = 0,
stepmax = 100,
threshold = 0.01 )
print(nn_clf_fit_2)
## Neural Network
##
## 21468 samples
## 32 predictor
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 21468, 21468, 21468, 21468, 21468, 21468, ...
## Resampling results across tuning parameters:
##
## decay size Accuracy Kappa
## 0.1 5 0.7523900 0.5049620
## 0.1 7 0.7540660 0.5083174
## 0.5 5 0.7431458 0.4864760
## 0.5 7 0.7450599 0.4902684
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 7 and decay = 0.1.
## Plot Neural Network
plotnet(nn_clf_fit_2$finalModel, y_names = "adopter")
## Predict on test data
nn_clf_predict_2 <- predict(nn_clf_fit_2,HN_data_x_test, positive = "1")
NN_prob_2 <- predict(nn_clf_fit_2, newdata = HN_data_x_test, type = "prob")
## Add results into clf_results dataframe
xd <- confusionMatrix(nn_clf_predict_2,HN_data_y_test)[["overall"]]
yd <- confusionMatrix(nn_clf_predict_2,HN_data_y_test)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Neural Network 2",
Accuracy = round (xd[["Accuracy"]],3),
Precision = round (yd[["Precision"]],3),
Recall = round (yd[["Recall"]],3),
F1 = round (yd[["F1"]],3))
## Add results into cost_benefit_df dataframe
ad <- confusionMatrix(nn_clf_predict_2,HN_data_y_test)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Neural Network 2",
TP = ad[["table"]][4],
FN = ad[["table"]][3],
FP = ad[["table"]][2],
TN = ad[["table"]][1])
## Ridge regression on the past-data features
## Select a range of lambdas
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
## Setting alpha = 0 implements ridge regression
ridge_cv <- cv.glmnet(HN_data_x_train_past %>% as.matrix(), HN_data_y_train_past %>% as.matrix(),
alpha = 0, lambda = lambdas_to_try, family = 'binomial',
standardize = TRUE)
## plot MSE vs Log(lambda)
plot(ridge_cv)
## select the best cross-validated lambda
lambda_cv <- ridge_cv$lambda.min
cat("Best lambda for Ridge Regression (alpha = 0) is", lambda_cv)
## Best lambda for Ridge Regression (alpha = 0) is 0.001747528
## Train the model using the best lambda
model_ridge_cv <- glmnet(HN_data_x_train_past %>% as.matrix(), HN_data_y_train_past %>% as.matrix(),
alpha = 0, lambda = lambda_cv, family="binomial", standardize = TRUE)
## prediction on test data
ridge_predict_res <- predict(model_ridge_cv, HN_data_x_test_past %>% as.matrix(),type="response")
ridge_predict_class <- predict(model_ridge_cv, HN_data_x_test_past %>% as.matrix(),type="class")
## glmnet with family = "binomial" returns P(class = "YES") for type = "response";
## keep that probability in column 1 of ridge_prob (this column is used for the ROC curve later)
ridge_predict_res_0<-ridge_predict_res
ridge_predict_res_1<-1-ridge_predict_res
ridge_prob<-cbind(ridge_predict_res_0,ridge_predict_res_1)
confusionMatrix(as.factor(ridge_predict_class), HN_data_y_test_past,positive="YES")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 19180 905
## YES 5778 941
##
## Accuracy : 0.7507
## 95% CI : (0.7454, 0.7558)
## No Information Rate : 0.9311
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1252
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.50975
## Specificity : 0.76849
## Pos Pred Value : 0.14005
## Neg Pred Value : 0.95494
## Prevalence : 0.06887
## Detection Rate : 0.03511
## Detection Prevalence : 0.25067
## Balanced Accuracy : 0.63912
##
## 'Positive' Class : YES
##
## Add results into clf_results dataframe
x5 <- confusionMatrix(as.factor(ridge_predict_class), HN_data_y_test_past)[["overall"]]
y5 <- confusionMatrix(as.factor(ridge_predict_class), HN_data_y_test_past)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Ridge regression",
Accuracy = round (x5[["Accuracy"]],3),
Precision = round (y5[["Precision"]],3),
Recall = round (y5[["Recall"]],3),
F1 = round (y5[["F1"]],3))
## Add results into cost_benefit_df dataframe
a5 <- confusionMatrix(as.factor(ridge_predict_class), HN_data_y_test_past)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Ridge regression",
TP = a5[["table"]][4],
FN = a5[["table"]][3],
FP = a5[["table"]][2],
TN = a5[["table"]][1])
## Ridge regression on all features (current + delta)
## Select a range of lambdas
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
## Setting alpha = 0 implements ridge regression
ridge_cv_2 <- cv.glmnet(HN_data_x_train %>% as.matrix(), HN_data_y_train %>% as.matrix(),
alpha = 0, lambda = lambdas_to_try, family = 'binomial',
standardize = TRUE)
## plot MSE vs Log(lambda)
plot(ridge_cv_2)
## select the best cross-validated lambda
lambda_cv_2 <- ridge_cv_2$lambda.min
cat("Best lambda for Ridge Regression (alpha = 0) is", lambda_cv_2)
## Best lambda for Ridge Regression (alpha = 0) is 0.001
## Train the model using the best lambda
model_ridge_cv_2 <- glmnet(HN_data_x_train %>% as.matrix(), HN_data_y_train %>% as.matrix(),
alpha = 0, lambda = lambda_cv_2, family="binomial", standardize = TRUE)
## prediction on test data
ridge_predict_res_2 <- predict(model_ridge_cv_2, HN_data_x_test %>% as.matrix(),type="response")
ridge_predict_class_2 <- predict(model_ridge_cv_2, HN_data_x_test %>% as.matrix(),type="class")
## As above, type = "response" returns P(class = "YES"); keep it in column 1 of ridge_prob_2
ridge_predict_res_0<-ridge_predict_res_2
ridge_predict_res_1<-1-ridge_predict_res_2
ridge_prob_2<-cbind(ridge_predict_res_0,ridge_predict_res_1)
confusionMatrix(as.factor(ridge_predict_class_2), HN_data_y_test, positive = "YES")
## Add results into clf_results dataframe
xe <- confusionMatrix(as.factor(ridge_predict_class_2), HN_data_y_test)[["overall"]]
ye <- confusionMatrix(as.factor(ridge_predict_class_2), HN_data_y_test)[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Ridge regression 2",
Accuracy = round (xe[["Accuracy"]],3),
Precision = round (ye[["Precision"]],3),
Recall = round (ye[["Recall"]],3),
F1 = round (ye[["F1"]],3))
ae <- confusionMatrix(as.factor(ridge_predict_class_2), HN_data_y_test)
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Ridge regression 2",
TP = ae[["table"]][4],
FN = ae[["table"]][3],
FP = ae[["table"]][2],
TN = ae[["table"]][1])
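## Decision tree on the past-data features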
## Cross validation
cross_validation <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated three times
repeats = 3)
Param_Grid <- expand.grid(maxdepth = 2:10)
dtree_fit <- train(HN_data_x_train_past,
HN_data_y_train_past,
method = "rpart2",
# split - criteria to split nodes
parms = list(split = "gini"),
tuneGrid = Param_Grid,
trControl = cross_validation,
# preProc - perform listed pre-processing to predictor dataframe
preProc = c("center", "scale"))
## Predict on test data
dtree_predict <- predict(dtree_fit, newdata = HN_data_x_test_past)
dtree_prob <- predict(dtree_fit, newdata = HN_data_x_test_past, type = "prob")
## Add results into clf_results dataframe
x6 <- confusionMatrix(dtree_predict, HN_data_y_test_past )[["overall"]]
y6 <- confusionMatrix(dtree_predict, HN_data_y_test_past )[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Decision Tree",
Accuracy = round (x6[["Accuracy"]],3),
Precision = round (y6[["Precision"]],3),
Recall = round (y6[["Recall"]],3),
F1 = round (y6[["F1"]],3))
a6 <- confusionMatrix(dtree_predict, HN_data_y_test_past )
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Decision Tree",
TP = a6[["table"]][4],
FN = a6[["table"]][3],
FP = a6[["table"]][2],
TN = a6[["table"]][1])
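## Decision tree on all features (current + delta)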
## Cross validation
cross_validation <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated three times
repeats = 3)
Param_Grid <- expand.grid(maxdepth = 2:10)
dtree_fit_2 <- train(HN_data_x_train,
HN_data_y_train,
method = "rpart2",
# split - criteria to split nodes
parms = list(split = "gini"),
tuneGrid = Param_Grid,
trControl = cross_validation,
# preProc - perform listed pre-processing to predictor dataframe
preProc = c("center", "scale"))
## Predict on test data
dtree_predict_2 <- predict(dtree_fit_2, newdata = HN_data_x_test)
dtree_prob_2 <- predict(dtree_fit_2, newdata = HN_data_x_test, type = "prob")
## Add results into clf_results dataframe
xg <- confusionMatrix(dtree_predict_2, HN_data_y_test )[["overall"]]
yg <- confusionMatrix(dtree_predict_2, HN_data_y_test )[["byClass"]]
clf_results[nrow(clf_results) + 1,] <- list(Model = "Decision Tree 2",
Accuracy = round (xg[["Accuracy"]],3),
Precision = round (yg[["Precision"]],3),
Recall = round (yg[["Recall"]],3),
F1 = round (yg[["F1"]],3))
ag <- confusionMatrix(dtree_predict_2, HN_data_y_test )
cost_benefit_df[nrow(cost_benefit_df) + 1,] <- list(Model = "Decision Tree 2",
TP = ag[["table"]][4],
FN = ag[["table"]][3],
FP = ag[["table"]][2],
TN = ag[["table"]][1])
## Compare model outputs
print(cost_benefit_df)
## Model TP FN FP TN
## 1 Log Regression 940 906 5748 19210
## 2 Log Regression 2 1245 601 7681 17277
## 3 KNN 777 1069 6471 18487
## 4 KNN 2 983 863 6717 18241
## 5 XG Boost 928 918 3951 21007
## 6 XG Boost 2 1075 771 4099 20859
## 7 Neural Network 932 914 6637 18321
## 8 Neural Network 2 1046 800 4982 19976
## 9 Ridge regression 941 905 5778 19180
## 10 Ridge regression 2 1247 599 7704 17254
## 11 Decision Tree 1048 798 5325 19633
## 12 Decision Tree 2 1145 701 6300 18658
print(clf_results)
## Model Accuracy Precision Recall F1
## 1 Log Regression 0.752 0.955 0.770 0.852
## 2 Log Regression 2 0.691 0.966 0.692 0.807
## 3 KNN 0.719 0.945 0.741 0.831
## 4 KNN 2 0.717 0.955 0.731 0.828
## 5 XG Boost 0.818 0.958 0.842 0.896
## 6 XG Boost 2 0.818 0.964 0.836 0.895
## 7 Neural Network 0.718 0.952 0.734 0.829
## 8 Neural Network 2 0.784 0.961 0.800 0.874
## 9 Ridge regression 0.751 0.955 0.768 0.852
## 10 Ridge regression 2 0.690 0.966 0.691 0.806
## 11 Decision Tree 0.772 0.961 0.787 0.865
## 12 Decision Tree 2 0.739 0.964 0.748 0.842
Now that we have classification results for all models, we can compare F1 scores and accuracy across models on a chart.
## Plot accuracy for all the Classification Models
ggplot(clf_results[1:12,] %>% arrange(desc(Accuracy)) %>%
mutate(Model=factor(Model, levels=Model) ),
aes(x = Model, y = Accuracy)) +
geom_bar(stat = "identity" , width=0.3, fill="steelblue") +
coord_cartesian(ylim = c(0.2, 1)) +
geom_hline(aes(yintercept = mean(Accuracy)),
colour = "green",linetype="dashed") +
ggtitle("Compare Accuracy for all Models") +
theme(plot.title = element_text(color="black", size=10, hjust = 0.5),axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Plot F1 score for all the Classification Models
ggplot(clf_results[1:12,] %>% arrange(desc(F1)) %>%
mutate(Model=factor(Model, levels=Model) ),
aes(x = Model, y = F1)) +
geom_bar(stat = "identity" , width=0.3, fill="steelblue") +
coord_cartesian(ylim = c(0.2, 1)) +
geom_hline(aes(yintercept = mean(F1)),
colour = "green",linetype="dashed") +
ggtitle("Compare F1 Score for all Models") +
theme(plot.title = element_text(color="black", size=10, hjust = 0.5),
axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
Looking at F1 scores and accuracy across all models, the two XG Boost models come out on top, with XG Boost 2 (trained on all the data) essentially tied for the highest F1 score and accuracy. However, accuracy is only a good measure when the target classes are nearly balanced, which is not the case here, so comparing models on accuracy alone can be misleading in the High Note scenario. I will therefore also compare ROC curves; the area under the ROC curve (AUC) is a better single-number summary of each model's performance.
## Plotting ROC curves
preds_list_past <- list(glm_prob[,2],knn_prob[,2],xg_prob[,2],NN_prob[,2],ridge_prob[,1],dtree_prob[,2])
preds_list_all <- list(glm_prob_2[,2],knn_prob_2[,2],xg_prob_2[,2],NN_prob_2[,2],ridge_prob_2[,1],dtree_prob_2[,2])
## List of actual values
m <- length(preds_list_past)
actuals_list_past <- rep(list(HN_data_y_test_past), m)
actuals_list_all <- rep(list(HN_data_y_test), m)
## Plot the ROC curves
pred_past <- prediction(preds_list_past, actuals_list_past)
pred_all <- prediction(preds_list_all, actuals_list_all)
rocs_past <- performance(pred_past, "tpr", "fpr")
rocs_all <- performance(pred_all, "tpr", "fpr")
AUC_past_models <- performance(pred_past, "auc")
AUC_all_models <- performance(pred_all, "auc")
auc_lr = round(AUC_past_models@y.values[[1]], 3)
auc_lr2 = round(AUC_all_models@y.values[[1]], 3)
auc_knn= round(AUC_past_models@y.values[[2]], 3)
auc_knn2= round(AUC_all_models@y.values[[2]], 3)
auc_xg = round(AUC_past_models@y.values[[3]], 3)
auc_xg2 = round(AUC_all_models@y.values[[3]], 3)
auc_nn = round(AUC_past_models@y.values[[4]], 3)
auc_nn2 = round(AUC_all_models@y.values[[4]], 3)
auc_ridge = round(AUC_past_models@y.values[[5]], 3)
auc_ridge2 = round(AUC_all_models@y.values[[5]], 3)
auc_dtree = round(AUC_past_models@y.values[[6]], 3)
auc_dtree2 = round(AUC_all_models@y.values[[6]], 3)
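As an aside, the twelve AUC values could also be pulled out in one pass instead of element by element; a small sketch that would produce the same numbers as the auc_* objects above (model_names, auc_past, and auc_all are new names introduced only for this illustration):
## Compact alternative: extract all AUCs at once
model_names <- c("log regression", "knn", "XG boost", "neural network", "ridge", "dtree")
auc_past <- round(unlist(AUC_past_models@y.values), 3)
auc_all  <- round(unlist(AUC_all_models@y.values), 3)
data.frame(model = model_names, AUC_past = auc_past, AUC_all = auc_all)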
# Plot the ROC curves for past data
plot(rocs_past, col = as.list(1:m), main = "ROC Curves of different models (past data)")
legend(x = "bottomright",
legend = c(
paste0("log regression past_", auc_lr),
paste0("knn past- ", auc_knn),
paste0("XG boost past - ", auc_xg),
paste0("Neural network past - ", auc_nn),
paste0("ridge past - ", auc_ridge),
paste0("dtree past - ", auc_dtree)
),
fill = 1:m)
# Plot the ROC curves for all data
plot(rocs_all, col = as.list(1:m), main = "ROC Curves of different models (all data)")
legend(x = "bottomright",
legend = c(
paste0("log regression all_", auc_lr2),
paste0("knn all - ", auc_knn2),
paste0("XG boost all - ", auc_xg2),
paste0("Neural network all - ", auc_nn2),
paste0("ridge all- ", auc_ridge2),
paste0("dtree all - ", auc_dtree2)
),
fill = 1:m)
After plotting the ROC curves and comparing AUC values, XG Boost (all data) still performs best. So I will pick the XG Boost model trained on all the data as the final model to forecast the probability of switching to premium among free users.
Now that I have picked the final model, I go back to the data and put all the free users into a new dataset.
HN_data_free_users<-HN_data_PostModule%>% filter(adopter==0)
## remove id and adopter
HN_data_free_users_test<-HN_data_free_users %>% select(-c("net_user","adopter"))
HN_data_free_users_id<-HN_data_free_users %>% select(c("net_user"))
# one-hot encoding for data
HN_data_free_users_test <- data.frame(HN_data_free_users_test) #convert from a tibble to a data frame
HN_data_free_users_test_dummy <- dummy.data.frame(data=HN_data_free_users_test, sep="_") #does one-hot encoding
Finally, we can forecast the probability of switching to premium for all free users and pick the top 1000 users who have the highest probability of switching.
# predict values on new data
predict_adopting_probability<- predict(XG_clf_fit_2, newdata = HN_data_free_users_test_dummy, type = "prob")
probability_of_adopting<-predict_adopting_probability[,2]
#adding id to the table
adopters_list<-cbind(HN_data_free_users_id, probability_of_adopting)
summary(adopters_list)
## net_user probability_of_adopting
## #NAME? : 9 Min. :0.0006899
## ____foursticks: 1 1st Qu.:0.0616172
## ____mih : 1 Median :0.1625387
## ___8 : 1 Mean :0.2424639
## ___ashley_ : 1 3rd Qu.:0.3769027
## ___cake : 1 Max. :0.9956692
## (Other) :99986
top_1000_adopters<- head(adopters_list[order(-adopters_list$probability_of_adopting), ],1000)
summary(top_1000_adopters)
## net_user probability_of_adopting
## _underyourspell: 1 Min. :0.8606
## -aceton- : 1 1st Qu.:0.8766
## -bitchslap- : 1 Median :0.8974
## -opium- : 1 Mean :0.9045
## ab11111111111 : 1 3rd Qu.:0.9274
## abysser : 1 Max. :0.9957
## (Other) :994
Targeting subscribers based on probability alone may not always be the best approach. In the High Note scenario, for example, one could argue that new free users with a high predicted probability would switch even without a special promotion, so the campaign would be more productive if we targeted only subscribers who have a high probability and have been free users for more than a year. In that case, focusing on probability alone does not capture what the campaign is trying to achieve. To test this, we could randomly pick two groups of free users with similar predicted probabilities who have been subscribers for less than a year, send the promotion to only one group, and check whether the promoted group switches to premium at a higher rate than the non-promoted group.

Another way to compare the ROI more precisely is to include an expected value/cost analysis in the targeting process. In the High Note scenario, each targeted member is offered a two-month free trial, so the campaign has a cost. On the other hand, if the member continues with a premium subscription, that brings extra revenue to High Note. With this in mind, we can calculate the expected value of targeting each member as:
Pc(m) · Vc(m) + [1 − Pc(m)] · Vnc(m)

In this expression:

Pc(m) is the probability that targeted member m continues the premium membership after the trial.

Vc(m) is the value member m brings to High Note if they continue as a premium subscriber, equal to the monthly subscription fee minus the ad revenue High Note gives up.

1 − Pc(m) is the probability that member m does not continue the premium membership at the end of the two months.

Vnc(m) is the (negative) value, i.e., the cost to High Note, if member m does not continue.

Since we do not want to lose money during this campaign, we target a member only if

Pc(m) · Vc(m) + [1 − Pc(m)] · Vnc(m) > 0
As a result of this expected-value analysis, we might skip targeting even high-probability users, because targeting them would still not bring the organization the value anticipated from the promotion.
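As an illustration, the expected-value rule could be applied to the scored free users along the following lines. The fee, ad-revenue, and horizon numbers are hypothetical placeholders (not High Note's actual figures), and the model's adoption probability is used as a stand-in for Pc(m):
## Hypothetical expected-value targeting sketch (illustrative numbers only)
monthly_fee  <- 3    # assumed premium fee per month
ad_revenue   <- 1    # assumed monthly ad revenue from a free user
trial_months <- 2    # length of the free trial
V_c  <- 10 * (monthly_fee - ad_revenue)  # assumed value if the member stays premium (~10 months of incremental revenue)
V_nc <- -trial_months * ad_revenue       # assumed cost if the member churns: ad revenue forgone during the free trial
adopters_list <- adopters_list %>%
  mutate(expected_value = probability_of_adopting * V_c +
           (1 - probability_of_adopting) * V_nc) %>%
  arrange(desc(expected_value))
## Target only members with a positive expected value
target_list <- adopters_list %>% filter(expected_value > 0)
## To test the campaign, the target list could then be split at random into a
## promotion group and a hold-out group, as discussed above
set.seed(2021)
target_list$group <- sample(c("promo", "holdout"), nrow(target_list), replace = TRUE)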