In this paper, I work on the High Note case, the ongoing use case from MIT Digital Marketing Analytics.
I base my work on the study by G. Oestreicher-Singer and L. Zalmanson, “Content or Community? A Digital Business Strategy for Content Providers in the Social Age,” MIS Quarterly 37, no. 2 (June 2013): 591-616.
I try to derive a predictive equation for the likelihood that a customer becomes a subscriber, based on demographic, content consumption, content participation, community participation and social characteristics.
I do so first on a dataset balanced between adopters and non-adopters. The balancing is done by random sampling.
In a second step, I look at the following question: is the fact that subscribers have more friends a causal reason why they are subscribers, or is it just correlation?
I run a propensity score matching and then rerun the logistic regression in order to test whether the relation between subscribing and having subscriber friends is causal.
Oestreicher-Singer and Zalmanson (2013) summarized the studies undertaken on social levels of participation. For my analysis of the High Note case, I used their theory and classified the available data accordingly:
I was not able to find variables for the Community leadership stage. Instead, I added:
Let’s have a look at the data and compare the means:
| Type_of_Metric | Variable | Free_mean | Free_median | Free_sd | Adopters_mean | Adopters_median | Adopters_sd |
|---|---|---|---|---|---|---|---|
| Content Consumption | #songsListened | 11,919.30 | 3,023.00 | 23,437.23 | 25,959.55 | 13,018.00 | 40,438.58 |
| Content Organization | #lovedTracks | 67.06 | 7.00 | 228.10 | 226.13 | 83.00 | 674.96 |
| | #playlists | 0.49 | 0.00 | 1.52 | 1.15 | 1.00 | 22.97 |
| Friends | #friends | 11.07 | 3.00 | 42.90 | 28.38 | 9.00 | 93.42 |
| Subscriber Friends | #subscriber friends | 0.27 | 0.00 | 1.79 | 1.25 | 0.00 | 4.60 |
| Community Participation | #posts | 2.84 | 0.00 | 70.89 | 16.72 | 0.00 | 247.75 |
| | #shouts | 17.14 | 2.00 | 116.57 | 73.45 | 3.00 | 915.28 |
| Demographics | age | 24.22 | 23.00 | 6.78 | 26.30 | 25.00 | 7.24 |
| | gender | 0.62 | 1.00 | 0.49 | 0.72 | 1.00 | 0.45 |
| | tenure | 39.41 | 38.00 | 19.24 | 41.51 | 40.00 | 19.76 |
| Location | Good Country | 0.37 | 0.00 | 0.48 | 0.32 | 0.00 | 0.46 |
| Type_of_Metric | Variable | Free_mean | Adopters_mean | Ratio |
|---|---|---|---|---|
| Content Consumption | #songsListened | 11,919.30 | 25,959.55 | 2.18 |
| Content Organization | #lovedTracks | 67.06 | 226.13 | 3.37 |
| | #playlists | 0.49 | 1.15 | 2.38 |
| Friends | #friends | 11.07 | 28.38 | 2.56 |
| Subscriber Friends | #subscriber friends | 0.27 | 1.25 | 4.67 |
| Community Participation | #posts | 2.84 | 16.72 | 5.89 |
| | #shouts | 17.14 | 73.45 | 4.29 |
| Demographics | age | 24.22 | 26.30 | 1.09 |
| | gender | 0.62 | 0.72 | 1.17 |
| | tenure | 39.41 | 41.51 | 1.05 |
| Location | Good Country | 0.37 | 0.32 | 0.85 |
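The Ratio column and the "X% more" figures used in the comments on these tables can be reproduced from the two mean columns. The following is a minimal sketch using a few rows copied from the table; the vector names are chosen here for illustration only.

```r
# Group means copied from the table above (a subset of rows)
free_mean    <- c(songsListened = 11919.30, lovedTracks = 67.06, friends = 11.07, posts = 2.84)
adopter_mean <- c(songsListened = 25959.55, lovedTracks = 226.13, friends = 28.38, posts = 16.72)

# Ratio of adopter mean to free-user mean
ratio <- adopter_mean / free_mean

# "X% more" is the ratio minus one, expressed as a percentage
pct_more <- (ratio - 1) * 100

round(ratio, 2)
round(pct_more)
```

A ratio of 2.18 therefore reads as "118% more", not "218% more".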
Content Consumption
Subscribers consume 118% more music than do their nonpaying peers.
Content Organization
On average, subscribers create 138% more playlists and they choose to mark 237% more tracks as loved.
Community Participation
Subscribers are substantially more active in the site’s community: compared with nonpaying users, paying subscribers write 489% more posts and send 329% more shouts.
Social
Moreover, paying subscribers have more friends listed on their pages. The average non-paying user has 11 friends, whereas the average subscriber has 28, that is, subscribers have on average 156% more friends.
Service adoption decisions of consumers may be influenced by the actions of their peers (Choi et al. 2009). Indeed, the average subscriber has 1.25 subscriber friends, compared to only 0.27 subscriber friends for the average nonpaying user.
Demographics
It seems there is no significant difference in activity levels or in propensity to subscribe based on gender.
However, subscribers are on average 2 years older than nonpaying users. Given the relatively small subscription fee of $3 per month, this could be a consequence of income differences.
Interestingly, subscribers make their subscription decisions after using the site for 41 months (3.4 years!) on average. Therefore, the conversion process requires a lot of patience!
Data cleaning
The dataset consists of 107,213 observations of 38 variables.
I concentrate my analysis on the 12 variables previously selected.
I therefore reduce the data to the minimum required and look at the percentage of missing values (NA) in each variable.
library(dplyr)
# Keep only the 12 variables used in the analysis
select <- select(data, c("male", "age", "tenure", "good_country", "friend_cnt", "subscriber_friend_cnt",
"songsListened", "lovedTracks", "playlists", "posts", "shouts", "adopter"))
# Percentage of missing values (NA) in a vector
p <- function(x) {sum(is.na(x))/length(x)*100}
apply(select,2,p)
## male age tenure
## 36.3295495882 47.4373443519 0.0298471267
## good_country friend_cnt subscriber_friend_cnt
## 36.5207577439 0.0009327227 0.0009327227
## songsListened lovedTracks playlists
## 0.0000000000 0.0000000000 0.0000000000
## posts shouts adopter
## 0.0000000000 1.7973566638 0.0000000000
About 47% of the values are missing for age, 36% for gender (male) and 37% for good_country. I therefore remove all observations with an NA in any variable.
select <- na.exclude(select)
The dataset now has 48,708 observations of 12 variables.
Correlation analysis
| | male | age | tenure | good_country | friend_cnt | subscriber_friend_cnt | songsListened | lovedTracks | playlists | posts | shouts | adopter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| male | 1 | | | | | | | | | | | |
| age | 0.17 | 1 | | | | | | | | | | |
| tenure | 0.09 | 0.28 | 1 | | | | | | | | | |
| good_country | 0.01 | 0.11 | 0.12 | 1 | | | | | | | | |
| friend_cnt | -0.01 | -0.04 | 0.01 | -0.04 | 1 | | | | | | | |
| subscriber_friend_cnt | 0.01 | 0.06 | 0.02 | 0.01 | 0.78 | 1 | | | | | | |
| songsListened | 0.11 | 0.01 | 0.24 | 0.02 | 0.22 | 0.14 | 1 | | | | | |
| lovedTracks | 0.02 | 0.05 | 0.01 | 0.02 | 0.20 | 0.18 | 0.24 | 1 | | | | |
| playlists | -0.01 | 0.11 | 0.07 | 0.00 | 0.05 | 0.08 | 0.07 | 0.13 | 1 | | | |
| posts | 0.01 | 0.00 | 0.04 | 0.00 | 0.05 | 0.06 | 0.09 | 0.06 | 0.02 | 1 | | |
| shouts | -0.02 | -0.02 | 0.02 | -0.02 | 0.19 | 0.13 | 0.12 | 0.09 | 0.02 | 0.12 | 1 | |
| adopter | 0.06 | 0.08 | 0.02 | -0.04 | 0.09 | 0.11 | 0.14 | 0.17 | 0.08 | 0.04 | 0.05 | 1 |
The subscription decision is not highly correlated with any single variable (all correlations with adopter are below 0.2), which is a good thing for our analysis. Note, however, that friend_cnt and subscriber_friend_cnt are strongly correlated with each other (0.78).
Looking further, subscription seems more correlated with the first stages of the ladder of participation (0.14 for songsListened and 0.17 for lovedTracks) than with the higher stages (0.08 for playlists, 0.04 for posts and 0.05 for shouts).
Nevertheless, adoption also seems correlated (0.11) with the number of subscriber friends. I will try later to distinguish whether this is a simple correlation or a causal link.
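A lower-triangle correlation table like the one above can be produced with base R. The sketch below runs on a small simulated data frame (`toy`, with invented values) rather than on the High Note data, so the numbers it prints are not those of the paper; only the technique is illustrated.

```r
set.seed(1)

# Toy stand-in for the cleaned data frame (hypothetical values)
toy <- data.frame(
  friend_cnt    = rpois(200, 10),
  songsListened = rpois(200, 1000),
  adopter       = rbinom(200, 1, 0.1)
)
# Subscriber friends are a subset of friends, hence correlated with friend_cnt
toy$subscriber_friend_cnt <- rbinom(200, toy$friend_cnt, 0.1)

# Pearson correlations, rounded to two decimals
cor_mat <- round(cor(toy), 2)

# Blank out the upper triangle so that only the lower triangle is shown
cor_mat[upper.tri(cor_mat)] <- NA
print(cor_mat, na.print = "")
```

On the real data, `toy` would be replaced by the cleaned 12-variable data frame, provided all columns are still numeric at that point.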
Balance between adopters and non adopters
But for now, we have only 3,864 adopters for 44,844 non-adopters in the dataset (a ratio of 8.62%). Therefore, in order to get better results in the regression analysis to come, I first have to reduce the number of non-adopters in the dataset.
I target a 25%-75% ratio between adopters and non-adopters.
I keep the 3,864 adopters and deliberately under-sample the non-adopters, randomly selecting 11,566 non-adopter observations with the following code.
I finish by merging the adopters and the selected non-adopters into a training dataset.
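set.seed(1234)
# Split the cleaned data by adoption status (assumed definitions, not shown in the original)
select_adopters <- select[select$adopter == 1, ]
select_non_adopters <- select[select$adopter == 0, ]
# Keep each non-adopter with probability 0.2585
ind <- sample(2, nrow(select_non_adopters), replace = TRUE, prob = c(0.7415, 0.2585))
under_non_adopters <- select_non_adopters[ind == 2, ]
train <- rbind(select_adopters, under_non_adopters)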
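The code above keeps each non-adopter with a fixed probability, so the number of non-adopters retained depends on the random draw. As a hedged alternative (not the method used in this paper), the sketch below draws an exact number of non-adopters without replacement and checks the resulting class shares. It uses a toy label vector with the same class counts as the cleaned dataset.

```r
set.seed(1234)

# Toy labels with the same class counts as the cleaned dataset
adopter <- c(rep(1, 3864), rep(0, 44844))

# Row indices of each class
adopter_idx     <- which(adopter == 1)
non_adopter_idx <- which(adopter == 0)

# Draw exactly 11,566 non-adopters without replacement
keep <- c(adopter_idx, sample(non_adopter_idx, 11566))

# Class shares in the balanced sample (close to 75% / 25%)
prop.table(table(adopter[keep]))
```

Either approach leads to roughly one adopter for three non-adopters; the exact-count variant only makes the sample size deterministic.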
Correcting the intercept
Following Manski & Lerman (1977), I need to correct the intercept estimate by subtracting a constant equal to \(\log(S_i/P_i)\), where \(S_i\) is the percentage of observations \(i\) in the sample and \(P_i\) is the percentage in the population.
This constant is 1.0647907.
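The value of this constant can be checked numerically. The inputs are not stated explicitly in the text; the sketch below assumes a sample share of 25% and a population share of 8.62%, which is consistent with the reported value but remains an inference. The raw intercept is the one estimated by the first logistic regression reported in this paper.

```r
# Assumed inputs (inferred, not stated in the text)
S <- 0.25     # share of adopters in the balanced sample
P <- 0.0862   # share of adopters before balancing

# Correction constant log(S / P)
correction <- log(S / P)
correction

# Raw intercept from the first logistic regression of this paper
raw_intercept <- -2.4595750513

# Corrected intercept used in the final equation
raw_intercept - correction
```

Under these assumptions the corrected intercept is about -3.524.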
Transform categorical data into factors for regression to work later on
train$male <- as.factor(train$male)
train$adopter <- as.factor(train$adopter)
train$good_country <- as.factor(train$good_country)
I am looking for a logistic regression in order to find the equation of \(Y_i\), where: \(Y_i = \alpha_0 + \alpha_1 songsListened_i + \alpha_2 lovedTracks_i + \alpha_3 playlists_i + \alpha_4 posts_i + \alpha_5 shouts_i + \alpha_6 friendcnt_i + \alpha_7 subscriberfriendcnt_i + \alpha_8 age_i + \alpha_9 male_i + \alpha_{10} tenure_i + \alpha_{11} goodcountry_i + \epsilon_i = V_i + \epsilon_i\)
The probability \(P_i\) that a customer \(i\) becomes a paying customer is then given by: \(P_i = \frac{\exp(V_i)}{1+\exp(V_i)}\)
mymodel <- glm(adopter ~ songsListened + lovedTracks + playlists + posts + shouts +
friend_cnt + subscriber_friend_cnt + age + male + tenure + good_country ,
data = train, family = 'binomial')
summary(mymodel)
##
## Call:
## glm(formula = adopter ~ songsListened + lovedTracks + playlists +
## posts + shouts + friend_cnt + subscriber_friend_cnt + age +
## male + tenure + good_country, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.7137 -0.5870 0.0094 2.2167
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4595750513 0.0806942070 -30.480 < 0.0000000000000002
## songsListened 0.0000104973 0.0000007425 14.137 < 0.0000000000000002
## lovedTracks 0.0010328594 0.0000772159 13.376 < 0.0000000000000002
## playlists 0.1178354030 0.0187757918 6.276 0.000000000348
## posts 0.0007909951 0.0003799579 2.082 0.03736
## shouts 0.0001146605 0.0001031491 1.112 0.26631
## friend_cnt -0.0018685280 0.0005237641 -3.567 0.00036
## subscriber_friend_cnt 0.2174699255 0.0174254655 12.480 < 0.0000000000000002
## age 0.0374079789 0.0029242981 12.792 < 0.0000000000000002
## male1 0.4284249910 0.0449937898 9.522 < 0.0000000000000002
## tenure -0.0069803465 0.0011023016 -6.333 0.000000000241
## good_country1 -0.4047090008 0.0436723956 -9.267 < 0.0000000000000002
##
## (Intercept) ***
## songsListened ***
## lovedTracks ***
## playlists ***
## posts *
## shouts
## friend_cnt ***
## subscriber_friend_cnt ***
## age ***
## male1 ***
## tenure ***
## good_country1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17368 on 15429 degrees of freedom
## Residual deviance: 15496 on 15418 degrees of freedom
## AIC: 15520
##
## Number of Fisher Scoring iterations: 6
The result is the following: \(Y_i = -3.5243658 + 0.0000105\, songsListened_i + 0.0010329\, lovedTracks_i + 0.1178354\, playlists_i + 0.000791\, posts_i + 0.0001147\, shouts_i - 0.0018685\, friendcnt_i + 0.2174699\, subscriberfriendcnt_i + 0.037408\, age_i + 0.428425\, male_i - 0.0069803\, tenure_i - 0.404709\, goodcountry_i + \epsilon_i = V_i + \epsilon_i\)
Remember, I correct the intercept coefficient by a constant.
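To make the equation concrete, the sketch below plugs a hypothetical user into it and converts the linear predictor \(V_i\) into a subscription probability with the logistic function. The coefficients are the estimates reported above (with the corrected intercept); the user profile is invented purely for illustration.

```r
# Coefficients from the fitted model, with the corrected intercept
coefs <- c(intercept = -3.5243658, songsListened = 0.0000104973,
           lovedTracks = 0.0010328594, playlists = 0.1178354030,
           posts = 0.0007909951, shouts = 0.0001146605,
           friend_cnt = -0.0018685280, subscriber_friend_cnt = 0.2174699255,
           age = 0.0374079789, male = 0.4284249910,
           tenure = -0.0069803465, good_country = -0.4047090008)

# Hypothetical user (values invented for illustration)
user <- c(intercept = 1, songsListened = 12000, lovedTracks = 70,
          playlists = 1, posts = 0, shouts = 10,
          friend_cnt = 10, subscriber_friend_cnt = 1,
          age = 25, male = 1, tenure = 40, good_country = 0)

# Linear predictor V_i: sum of coefficient times value
V <- sum(coefs * user[names(coefs)])

# Logistic transform: exp(V) / (1 + exp(V))
P <- plogis(V)
P
```

Changing `subscriber_friend_cnt` from 1 to 2 in `user` shows how one more subscriber friend moves the predicted probability.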
Now let’s discuss the impact of each variable on the likelihood that a customer \(i\) becomes a subscriber.
Content Consumption
Content consumption, in terms of songs listened to, has a positive and significant association with the subscription decision, but the effect per song is small.
Looking at our full model, the effect of creating a playlist is equal to that of listening to around 11,000 more songs.
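The equivalence between one playlist and a number of songs follows from the ratio of the two coefficients in the regression output. A minimal check, with variable names chosen here for illustration:

```r
# Coefficients from the regression output
b_playlists <- 0.1178354030
b_songs     <- 0.0000104973

# Number of additional songs with the same effect on V_i as one more playlist
songs_per_playlist <- b_playlists / b_songs
round(songs_per_playlist)
```

The result is a little over 11,000 songs per playlist.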
Content Organization
Marking tracks as loved and creating playlists are both positively associated with subscription behavior (odds ratio = 1.001 for each track marked as loved, and odds ratio = 1.125 for each playlist created).
Community Participation
Unfortunately, posting comments has only a weak association with the subscription decision (significant at the 5% level only), and sending shouts has no significant association.
Therefore, this does not allow me to confirm the ladder of participation theory.
It may also mean that the website does not provide enough community participation and community leadership features.
As building these features is costly and has no immediate ROI, I would suggest starting with a social media campaign built around a page on Facebook and Twitter. This is easy and would only require small adjustments to the website in order to ask premium customers to share their likes and playlists there.
Nevertheless, I would recommend, in the long term, building features that allow premium customers to interact more and more with the website: groups, blog entries and forums. According to Oestreicher-Singer and Zalmanson, making the social experience central to a content website would help build a strong relationship with customers, and therefore increase customer lifetime value.
Demographics
Age and being male are positively associated with the likelihood of subscription (odds ratio = 1.038 for each additional year, and odds ratio = 1.535 if the user is male).
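The odds ratios quoted in this discussion are the exponentials of the corresponding logistic regression coefficients. A minimal sketch that recomputes them from the estimates in the regression output:

```r
# Coefficients copied from the regression output
b <- c(lovedTracks = 0.0010328594,
       playlists   = 0.1178354030,
       age         = 0.0374079789,
       male        = 0.4284249910)

# Odds ratio for a one-unit increase in each variable
odds_ratio <- exp(b)
round(odds_ratio, 3)
```

An odds ratio of 1.535 for `male` means that, other variables held constant, the odds of subscribing are about 53% higher for male users in this model.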
Finally, the time since the user started using the website (tenure) is negatively associated with the subscription decision. The longer users stay on the free plan, the less likely they are to become premium users.
Location
Surprisingly, the likelihood of subscription is negatively associated with a user’s location in the USA, the UK or Germany. It might be interesting to look further at this data in order to identify whether there is a location where the web service is more appreciated.
Social Influence
As expected, the number of subscriber friends is associated with a strong positive effect on the user’s likelihood to become a premium member.
Interestingly, the number of friends without a subscription has a small negative influence on the subscription likelihood.
Furthermore, is this just a correlation or a causality?
* In the case of causality, viral marketing would be a strong tool to boost subscriptions (give an incentive to existing clients if they convince free users through an affiliation program). But these programs cost a lot and drastically lower the customer lifetime value.
* Otherwise, a safer and cheaper way is to invest in network targeting, using the homophily principle to convince friends of premium customers to subscribe.

Finally, I run a propensity score matching in order to decide whether the number of subscriber friends has a causal effect on the likelihood of becoming a premium member or whether it is simply a correlation.
Running matchit on all the data with the nearest neighbour method, I obtain the following balance for the matched data.
library(MatchIt)
match <- matchit(adopter ~ songsListened + lovedTracks + playlists + posts + shouts +
friend_cnt + subscriber_friend_cnt + age + male + tenure + good_country,
data = train, method = 'nearest')
summary(match)
##
## Call:
## matchit(formula = adopter ~ songsListened + lovedTracks + playlists +
## posts + shouts + friend_cnt + subscriber_friend_cnt + age +
## male + tenure + good_country, data = train, method = "nearest")
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff
## distance 0.3493 0.2174 0.1196 0.1319
## songsListened 32053.5411 16306.1022 26269.8789 15747.4390
## lovedTracks 254.2637 85.2726 273.7272 168.9911
## playlists 0.8993 0.5468 0.9839 0.3526
## posts 19.6330 3.6587 37.8340 15.9744
## shouts 101.2474 26.6532 141.4194 74.5942
## friend_cnt 36.8468 16.5819 58.5347 20.2649
## subscriber_friend_cnt 1.5199 0.3764 3.1390 1.1436
## age 26.2798 24.1599 6.7822 2.1199
## male0 0.2632 0.3784 0.4850 -0.1152
## male1 0.7368 0.6216 0.4850 0.1152
## tenure 44.7935 43.2401 19.8933 1.5534
## good_country1 0.3007 0.3667 0.4819 -0.0660
## eQQ Med eQQ Mean eQQ Max
## distance 0.0943 0.1319 0.3877
## songsListened 13068.5000 15720.5626 396822.0000
## lovedTracks 90.0000 168.8028 3508.0000
## playlists 0.0000 0.3489 80.0000
## posts 0.0000 15.7896 6585.0000
## shouts 4.0000 73.9058 58136.0000
## friend_cnt 9.0000 19.8794 1168.0000
## subscriber_friend_cnt 0.0000 1.1082 33.0000
## age 2.0000 2.1297 5.0000
## male0 0.0000 0.1152 1.0000
## male1 0.0000 0.1152 1.0000
## tenure 2.0000 1.5577 9.0000
## good_country1 0.0000 0.0660 1.0000
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff
## distance 0.3493 0.3149 0.1541 0.0344
## songsListened 32053.5411 28557.5499 37984.8272 3495.9912
## lovedTracks 254.2637 175.9633 446.2267 78.3005
## playlists 0.8993 0.7350 1.4431 0.1643
## posts 19.6330 7.6123 62.1898 12.0207
## shouts 101.2474 48.4860 226.5427 52.7614
## friend_cnt 36.8468 26.9156 91.7784 9.9312
## subscriber_friend_cnt 1.5199 0.8652 5.3599 0.6548
## age 26.2798 27.1638 8.7437 -0.8841
## male0 0.2632 0.1969 0.3977 0.0663
## male1 0.7368 0.8031 0.3977 -0.0663
## tenure 44.7935 44.4762 19.5731 0.3173
## good_country1 0.3007 0.2676 0.4428 0.0331
## eQQ Med eQQ Mean eQQ Max
## distance 0.0012 0.0344 0.1949
## songsListened 3127.0000 3537.4156 396822.0000
## lovedTracks 65.0000 80.3905 2302.0000
## playlists 0.0000 0.1654 80.0000
## posts 0.0000 12.0207 6585.0000
## shouts 2.0000 52.7707 58136.0000
## friend_cnt 5.0000 10.1972 1168.0000
## subscriber_friend_cnt 0.0000 0.6661 22.0000
## age 1.0000 1.1335 9.0000
## male0 0.0000 0.0663 1.0000
## male1 0.0000 0.0663 1.0000
## tenure 1.0000 0.6610 5.0000
## good_country1 0.0000 0.0331 1.0000
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 73.9143 98.7089 73.9058 49.7370
## songsListened 77.7996 76.0722 77.4982 0.0000
## lovedTracks 53.6659 27.7778 52.3761 34.3786
## playlists 53.3863 0.0000 52.5964 0.0000
## posts 24.7501 0.0000 23.8695 0.0000
## shouts 29.2688 50.0000 28.5973 0.0000
## friend_cnt 50.9933 44.4444 48.7047 0.0000
## subscriber_friend_cnt 42.7438 0.0000 39.8879 33.3333
## age 58.2971 50.0000 46.7736 -80.0000
## male0 42.4649 0.0000 42.4719 0.0000
## male1 42.4649 0.0000 42.4719 0.0000
## tenure 79.5743 50.0000 57.5677 44.4444
## good_country1 49.7733 0.0000 49.8039 0.0000
##
## Sample sizes:
## Control Treated
## All 11566 3864
## Matched 3864 3864
## Unmatched 7702 0
## Discarded 0 0
plot(match, type = "jitter", interactive = FALSE)
plot(match,type = "hist")
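As most of the eQQ values (median and mean) are smaller after matching than before, I consider that the matching improved the balance, although some imbalance remains (for example, the balance improvement is only about 25% for posts, and the eQQ max of age increased from 5 to 9). This is confirmed by the graphs: the distribution is similar between the treated and control groups.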
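The "Percent Balance Improvement" figures in the matchit summary can be reproduced from the mean differences before and after matching. The sketch below does so for a few variables, using values copied from the two balance tables; the formula shown (relative reduction of the absolute mean difference) matches the reported numbers up to rounding.

```r
# Mean differences (treated minus control) copied from the matchit summary
before <- c(songsListened = 15747.4390, subscriber_friend_cnt = 1.1436,
            posts = 15.9744, age = 2.1199)
after  <- c(songsListened = 3495.9912, subscriber_friend_cnt = 0.6548,
            posts = 12.0207, age = -0.8841)

# Relative reduction of the absolute mean difference, in percent
improvement <- 100 * (abs(before) - abs(after)) / abs(before)
round(improvement, 1)
```

Values well below 100% indicate that a noticeable gap between the treated and control groups remains after matching.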
The dataset containing only the matched data is then used to rerun the logistic regression with the same variables. The result is the following.
m.data <- match.data(match, group = "all")
mymodel2 <- glm(adopter ~ songsListened + lovedTracks + playlists + posts + shouts +
friend_cnt + subscriber_friend_cnt + age + male + tenure + good_country,
data = m.data, family = 'binomial')
summary(mymodel2)
##
## Call:
## glm(formula = adopter ~ songsListened + lovedTracks + playlists +
## posts + shouts + friend_cnt + subscriber_friend_cnt + age +
## male + tenure + good_country, family = "binomial", data = m.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.1971 -1.1317 -0.3154 1.1792 1.6239
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.43494910709 0.09734208844 4.468 0.0000078861 ***
## songsListened 0.00000004679 0.00000066729 0.070 0.944099
## lovedTracks 0.00029241358 0.00006243000 4.684 0.0000028152 ***
## playlists 0.04304060818 0.01652440512 2.605 0.009196 **
## posts 0.00058746523 0.00031519516 1.864 0.062348 .
## shouts 0.00010146898 0.00009556274 1.062 0.288324
## friend_cnt -0.00151117848 0.00051034215 -2.961 0.003065 **
## subscriber_friend_cnt 0.08190044350 0.01453631911 5.634 0.0000000176 ***
## age -0.01737307357 0.00317685231 -5.469 0.0000000453 ***
## male1 -0.30147482811 0.05607412461 -5.376 0.0000000760 ***
## tenure 0.00124677329 0.00126806718 0.983 0.325505
## good_country1 0.18637373158 0.05239767797 3.557 0.000375 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10713 on 7727 degrees of freedom
## Residual deviance: 10508 on 7716 degrees of freedom
## AIC: 10532
##
## Number of Fisher Scoring iterations: 5
The result of the logistic regression is now the following: \(Y_i = -0.6298416 + 0.00000005\, songsListened_i + 0.0002924\, lovedTracks_i + 0.0430406\, playlists_i + 0.0005875\, posts_i + 0.0001015\, shouts_i - 0.0015112\, friendcnt_i + 0.0819004\, subscriberfriendcnt_i - 0.0173731\, age_i - 0.3014748\, male_i + 0.0012468\, tenure_i + 0.1863737\, goodcountry_i + \epsilon_i = V_i + \epsilon_i\)
Discussion
With the matched data, the number of songs listened to is no longer significant, which is quite intuitive as the mean time spent on the website before becoming a premium member is higher than 3 years.
The Community Participation data (posts and shouts) are also not significant predictors of a customer becoming a premium member. This is less intuitive and weakens the ladder of participation theory.
On the contrary, the number of subscriber friends is still associated with a strong positive effect on the user’s likelihood to become a premium member: the estimated coefficient (0.0819) is significant and the odds ratio is 1.085. Therefore, I conclude that part of the relationship between having subscriber friends and becoming a subscriber is causal. Hence, viral marketing might be a good solution to increase the number of subscribers.
Nevertheless, this estimated coefficient was 0.2174 without propensity score matching, i.e. about 2.65 times the estimate obtained on matched data (165% more).
This means that, in terms of customer lifetime value, the viral marketing budget might have to be divided by 2.65 compared to the initial discussion.
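The comparison between the two estimates is simple arithmetic on the coefficients of `subscriber_friend_cnt` in the two regression outputs. A minimal check, with variable names chosen here for illustration:

```r
# Coefficient of subscriber_friend_cnt in each regression
b_unmatched <- 0.2174699255   # balanced sample, before matching
b_matched   <- 0.0819004435   # matched sample

# How many times larger the unmatched estimate is
coef_ratio <- b_unmatched / b_matched
round(coef_ratio, 2)

# Odds ratio per additional subscriber friend on the matched data
or_matched <- exp(b_matched)
round(or_matched, 3)
```

The ratio comes out slightly above 2.65 and the odds ratio at about 1.085.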
Furthermore, this is only backtesting on historical data. The best way to confirm this causality would be to test it with real actions. I would suggest creating three groups with similar characteristics (demographics, social, consumption, participation…):
Only the results would confirm that the link between the likelihood of becoming a premium customer and the number of subscriber friends is made of two factors: homophily (\(2/3\)) and causality (\(1/3\)).