1 Introduction

In this paper, I work on the MIT Digital Marketing Analytics ongoing use case, the High Note case.

I base my work on the study by G. Oestreicher-Singer and L. Zalmanson, “Content or Community? A Digital Business Strategy for Content Providers in the Social Age,” MIS Quarterly 37, no. 2 (June 2013): 591-616.

I try to derive a predictive equation for the likelihood that a customer becomes a subscriber, based on demographic, content consumption, content organization, community participation and social characteristics.

I do so first on a dataset balanced between adopters and non-adopters. The balancing is done by random under-sampling.

In a second step, I look at the following question: is the fact that subscribers have more friends a causal reason why they’re subscribers, or is that just correlation?

I run a propensity score matching and then a second logistic regression on the matched data, in order to test whether the relation between subscribing and having subscriber friends is causal.

2 The Ladder of participation

Oestreicher-Singer and Zalmanson (2013) summarized the studies undertaken on social levels of participation. For my analysis of the High Note case, I used their theory and classified the available data accordingly:

I was not able to find variables for the Community leadership stage. Instead, I added:

Let’s look at the data and compare the means:

| Type_of_Metric | Variable | Free_mean | Free_median | Free_sd | Adopters_mean | Adopters_median | Adopters_sd |
|---|---|---|---|---|---|---|---|
| Content Consumption | #songsListened | 11,919.30 | 3,023.00 | 23,437.23 | 25,959.55 | 13,018.00 | 40,438.58 |
| Content Organization | #lovedTracks | 67.06 | 7.00 | 228.10 | 226.13 | 83.00 | 674.96 |
| | #playlists | 0.49 | 0.00 | 1.52 | 1.15 | 1.00 | 22.97 |
| Friends | #friends | 11.07 | 3.00 | 42.90 | 28.38 | 9.00 | 93.42 |
| Subscriber Friends | #subscriber friends | 0.27 | 0.00 | 1.79 | 1.25 | 0.00 | 4.60 |
| Community Participation | #posts | 2.84 | 0.00 | 70.89 | 16.72 | 0.00 | 247.75 |
| | #shouts | 17.14 | 2.00 | 116.57 | 73.45 | 3.00 | 915.28 |
| Demographics | age | 24.22 | 23.00 | 6.78 | 26.30 | 25.00 | 7.24 |
| | gender | 0.62 | 1.00 | 0.49 | 0.72 | 1.00 | 0.45 |
| | tenure | 39.41 | 38.00 | 19.24 | 41.51 | 40.00 | 19.76 |
| Location | Good Country | 0.37 | 0.00 | 0.48 | 0.32 | 0.00 | 0.46 |

| Type_of_Metric | Variable | Free_mean | Adopters_mean | Ratio |
|---|---|---|---|---|
| Content Consumption | #songsListened | 11,919.30 | 25,959.55 | 2.18 |
| Content Organization | #lovedTracks | 67.06 | 226.13 | 3.37 |
| | #playlists | 0.49 | 1.15 | 2.38 |
| Friends | #friends | 11.07 | 28.38 | 2.56 |
| Subscriber Friends | #subscriber friends | 0.27 | 1.25 | 4.67 |
| Community Participation | #posts | 2.84 | 16.72 | 5.89 |
| | #shouts | 17.14 | 73.45 | 4.29 |
| Demographics | age | 24.22 | 26.30 | 1.09 |
| | gender | 0.62 | 0.72 | 1.17 |
| | tenure | 39.41 | 41.51 | 1.05 |
| Location | Good Country | 0.37 | 0.32 | 0.85 |

Content Consumption

Subscribers consume 118% more music than do their nonpaying peers.

Content Organization

On average, subscribers create 138% more playlists and they choose to mark 237% more tracks as loved.

Community Participation

Subscribers are substantially more active in the site’s community: compared with nonpaying users, paying subscribers write 489% more posts and send 329% more shouts.

Social

Moreover, paying subscribers have more friends listed on their pages. The average non-paying user has 11 friends and the average subscriber has 28; that is, subscribers have on average 156% more friends.

Service adoption decisions of consumers may be influenced by the actions of their peers (Choi et al. 2009). Indeed, the average subscriber has 1.25 subscriber friends, compared to only 0.27 subscriber friends for the average nonpaying user.
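The percentages quoted in this section follow from the Ratio column of the table: a ratio r between the adopters’ mean and the free users’ mean corresponds to (r - 1) x 100 percent more. As a minimal sketch, using the rounded ratios printed in the table (the object names `ratio` and `pct_more` are illustrative and not part of the original analysis):

```r
# Ratios (adopters' mean / free users' mean) copied from the table
ratio <- c(songsListened = 2.18, lovedTracks = 3.37, playlists = 2.38,
           friends = 2.56, posts = 5.89, shouts = 4.29)

# "x% more" is the ratio minus one, expressed as a percentage
pct_more <- round((ratio - 1) * 100)
print(pct_more)
```

Working from the Ratio column rather than from the rounded means avoids small discrepancies caused by rounding the means to two decimals.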

Demographics

It seems there is no large difference in the propensity to subscribe based on gender: 72% of subscribers are male, compared with 62% of nonpaying users.

However, subscribers are on average 2 years older than nonpaying users. Given the relatively small subscription fee of $3 per month, this could be a consequence of income differences.

Interestingly, subscribers make their subscription decisions after using the site for 41 months (3.4 years!) on average. The conversion process therefore requires a lot of patience!

3 Data analysis

Data cleaning

The dataset consists of 107,213 observations of 38 variables.

I concentrate my analysis on the 12 variables previously selected.

I therefore reduce the data to the minimum required and look at the percentage of missing values (NA) in each variable.

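library(dplyr)  # select() is assumed to come from dplyr

# Keep only the 12 variables of interest
select <- select(data, c("male", "age", "tenure", "good_country", "friend_cnt", "subscriber_friend_cnt", 
                         "songsListened", "lovedTracks", "playlists", "posts", "shouts", "adopter"))
# Percentage of missing values (NA) in a column
p <- function(x) {sum(is.na(x))/length(x)*100}
apply(select,2,p)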
##                  male                   age                tenure 
##         36.3295495882         47.4373443519          0.0298471267 
##          good_country            friend_cnt subscriber_friend_cnt 
##         36.5207577439          0.0009327227          0.0009327227 
##         songsListened           lovedTracks             playlists 
##          0.0000000000          0.0000000000          0.0000000000 
##                 posts                shouts               adopter 
##          0.0000000000          1.7973566638          0.0000000000

47% of the values are missing for age, 36% for gender (male) and 37% for good_country. Therefore, let’s remove all observations with an NA in any variable.

select <- na.exclude(select)

The dataset is now 48,708 observations of 12 variables.

Correlation analysis

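| | male | age | tenure | good_country | friend_cnt | subscriber_friend_cnt | songsListened | lovedTracks | playlists | posts | shouts | adopter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| male | 1 | | | | | | | | | | | |
| age | 0.17 | 1 | | | | | | | | | | |
| tenure | 0.09 | 0.28 | 1 | | | | | | | | | |
| good_country | 0.01 | 0.11 | 0.12 | 1 | | | | | | | | |
| friend_cnt | -0.01 | -0.04 | 0.01 | -0.04 | 1 | | | | | | | |
| subscriber_friend_cnt | 0.01 | 0.06 | 0.02 | 0.01 | 0.78 | 1 | | | | | | |
| songsListened | 0.11 | 0.01 | 0.24 | 0.02 | 0.22 | 0.14 | 1 | | | | | |
| lovedTracks | 0.02 | 0.05 | 0.01 | 0.02 | 0.20 | 0.18 | 0.24 | 1 | | | | |
| playlists | -0.01 | 0.11 | 0.07 | 0.00 | 0.05 | 0.08 | 0.07 | 0.13 | 1 | | | |
| posts | 0.01 | 0.00 | 0.04 | 0.00 | 0.05 | 0.06 | 0.09 | 0.06 | 0.02 | 1 | | |
| shouts | -0.02 | -0.02 | 0.02 | -0.02 | 0.19 | 0.13 | 0.12 | 0.09 | 0.02 | 0.12 | 1 | |
| adopter | 0.06 | 0.08 | 0.02 | -0.04 | 0.09 | 0.11 | 0.14 | 0.17 | 0.08 | 0.04 | 0.05 | 1 |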

Adoption is not highly correlated with any single variable (the largest correlation is 0.17, with lovedTracks). Among the predictors, only friend_cnt and subscriber_friend_cnt are strongly correlated with each other (0.78).

Looking further, subscription seems more correlated with the first stages of the ladder of participation (0.14 for songsListened and 0.17 for lovedTracks) than with the higher stages (0.08 for playlists, 0.04 for posts and 0.05 for shouts).

Nevertheless, the adoption process seems correlated (0.11) with the number of subscriber friends. I will try later to distinguish whether this is a simple correlation or a causal effect.
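A lower-triangular correlation table like the one in this section can be produced with `cor()`. The sketch below runs on a small simulated data frame; the `toy` data and its values are assumptions for illustration only and are not the High Note data.

```r
set.seed(42)

# Simulated stand-in for three of the twelve variables (illustrative values)
toy <- data.frame(
  friend_cnt    = rpois(200, 10),
  songsListened = rpois(200, 1000),
  adopter       = rbinom(200, 1, 0.25)
)

# Pearson correlations, rounded to two decimals
corr <- round(cor(toy), 2)

# Blank out the upper triangle so that only the lower triangle is shown
corr[upper.tri(corr)] <- NA
print(corr, na.print = "")
```

On the real data the same two steps would be applied to the cleaned data frame, with all twelve columns kept numeric.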

Balance between adopters and non adopters

For now, the dataset contains only 3,864 adopters for 44,844 non-adopters (a ratio of 8.62%, or 7.93% of all observations). To get better results in the regression analysis that follows, I first reduce the number of non-adopters in the dataset.

I target a 25%-75% ratio between adopters and non-adopters.

I keep the 3,864 adopters and deliberately under-sample the non-adopters, randomly selecting 11,566 of them with the following code.

I finish by merging the adopters and the selected non-adopters into a training dataset.

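# Split by adoption status (assumption: adopter is still coded 0/1 at this point)
select_adopters <- select[select$adopter == 1, ]
select_non_adopters <- select[select$adopter == 0, ]
set.seed(1234)  # reproducible draw
# Keep each non-adopter with probability 0.2585 (group 2)
ind <- sample(2, nrow(select_non_adopters), replace = TRUE, prob = c(0.7415, 0.2585))
under_non_adopters <- select_non_adopters[ind==2,]
train <- rbind(select_adopters,under_non_adopters)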
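The sampling probability 0.2585 used in the code above can be derived from the 25%-75% target: with 3,864 adopters, three times as many non-adopters are needed. The sketch below only redoes this arithmetic; the object names are illustrative.

```r
n_adopters     <- 3864
n_non_adopters <- 44844

# Share of adopters before balancing
share_before <- n_adopters / (n_adopters + n_non_adopters)

# A 25%-75% split needs three non-adopters per adopter
target_non_adopters <- 3 * n_adopters

# Probability of keeping each non-adopter in the random draw
keep_prob <- target_non_adopters / n_non_adopters

print(round(c(share_before = share_before, keep_prob = keep_prob), 4))
```

Because each non-adopter is kept independently with this probability, the realized number of selected non-adopters (11,566) differs slightly from the expected 11,592.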

Correcting the intercept estimate

Following Manski & Lerman (1977), I need to correct the intercept estimate by subtracting a constant equal to \(\log(S_i/P_i)\), where \(S_i\) is the percentage of observations \(i\) in the sample and \(P_i\) is the percentage in the population.

This constant is 1.0647907.
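The reported constant can be reproduced, up to rounding, by taking S = 0.25 (the target share of adopters) and P = 0.0862 (the 8.62% figure quoted before balancing). That these are the two inputs actually used is an assumption inferred from the reported value, so the following is only a sketch.

```r
S <- 0.25     # assumed: target share of adopters in the balanced sample
P <- 0.0862   # assumed: the 8.62% figure reported before balancing

# Constant to subtract from the estimated intercept
correction <- log(S / P)
print(correction)
```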

Transform categorical data into factors for regression to work later on

train$male <- as.factor(train$male)
train$adopter <- as.factor(train$adopter)
train$good_country <- as.factor(train$good_country)

4 Logistic regression model on the selected data

I am looking for a logistic regression in order to find the equation of \(Y_i\), where: \(Y_i = \alpha_0 + \alpha_1 songsListened_i + \alpha_2 lovedTracks_i + \alpha_3 playlists_i + \alpha_4 posts_i + \alpha_5 shouts_i + \alpha_6 friendcnt_i + \alpha_7 subscriberfriendcnt_i + \alpha_8 age_i + \alpha_9 male_i + \alpha_{10} tenure_i + \alpha_{11} goodcountry_i + \epsilon_i = V_i + \epsilon_i\)

The probability \(P_i\) that a customer \(i\) becomes a paying customer is then given by: \(P_i = \frac{\exp(V_i)}{1+\exp(V_i)}\)
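The transform from the linear predictor \(V_i\) to the probability \(P_i\) can be written directly in R. In this minimal sketch the helper name `logistic` is illustrative; base R’s `plogis()` computes the same quantity.

```r
# Logistic (inverse-logit) transform: maps the linear predictor V to a probability
logistic <- function(v) exp(v) / (1 + exp(v))

v <- c(-2, 0, 2)
p <- logistic(v)
print(p)

# Base R's plogis() gives the same values
print(all.equal(p, plogis(v)))
```

A linear predictor of 0 corresponds to a probability of 0.5; negative values map below 0.5 and positive values above it.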

mymodel <- glm(adopter ~ songsListened + lovedTracks + playlists + posts + shouts +
                       friend_cnt + subscriber_friend_cnt + age + male + tenure + good_country ,
               data = train, family = 'binomial')
summary(mymodel)
## 
## Call:
## glm(formula = adopter ~ songsListened + lovedTracks + playlists + 
##     posts + shouts + friend_cnt + subscriber_friend_cnt + age + 
##     male + tenure + good_country, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.7137  -0.5870   0.0094   2.2167  
## 
## Coefficients:
##                            Estimate    Std. Error z value             Pr(>|z|)
## (Intercept)           -2.4595750513  0.0806942070 -30.480 < 0.0000000000000002
## songsListened          0.0000104973  0.0000007425  14.137 < 0.0000000000000002
## lovedTracks            0.0010328594  0.0000772159  13.376 < 0.0000000000000002
## playlists              0.1178354030  0.0187757918   6.276       0.000000000348
## posts                  0.0007909951  0.0003799579   2.082              0.03736
## shouts                 0.0001146605  0.0001031491   1.112              0.26631
## friend_cnt            -0.0018685280  0.0005237641  -3.567              0.00036
## subscriber_friend_cnt  0.2174699255  0.0174254655  12.480 < 0.0000000000000002
## age                    0.0374079789  0.0029242981  12.792 < 0.0000000000000002
## male1                  0.4284249910  0.0449937898   9.522 < 0.0000000000000002
## tenure                -0.0069803465  0.0011023016  -6.333       0.000000000241
## good_country1         -0.4047090008  0.0436723956  -9.267 < 0.0000000000000002
##                          
## (Intercept)           ***
## songsListened         ***
## lovedTracks           ***
## playlists             ***
## posts                 *  
## shouts                   
## friend_cnt            ***
## subscriber_friend_cnt ***
## age                   ***
## male1                 ***
## tenure                ***
## good_country1         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17368  on 15429  degrees of freedom
## Residual deviance: 15496  on 15418  degrees of freedom
## AIC: 15520
## 
## Number of Fisher Scoring iterations: 6

The result is the following: \(Y_i = -3.5243658 + 0.0000105\, songsListened_i + 0.0010329\, lovedTracks_i + 0.1178354\, playlists_i + 0.000791\, posts_i + 0.0001147\, shouts_i - 0.0018685\, friendcnt_i + 0.2174699\, subscriberfriendcnt_i + 0.037408\, age_i + 0.428425\, male_i - 0.0069803\, tenure_i - 0.404709\, goodcountry_i + \epsilon_i = V_i + \epsilon_i\)

Remember that the intercept is corrected by the constant computed earlier: \(-2.4595751 - 1.0647907 = -3.5243658\).
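To illustrate how the fitted equation is used, the sketch below computes \(V_i\) and \(P_i\) for one user. The coefficients are copied from the `glm()` summary; the user’s characteristics are purely hypothetical values chosen for illustration and do not come from the dataset.

```r
# Coefficients copied from the glm() summary
coefs <- c(songsListened = 0.0000104973, lovedTracks = 0.0010328594,
           playlists = 0.1178354030, posts = 0.0007909951,
           shouts = 0.0001146605, friend_cnt = -0.0018685280,
           subscriber_friend_cnt = 0.2174699255, age = 0.0374079789,
           male = 0.4284249910, tenure = -0.0069803465,
           good_country = -0.4047090008)

# Estimated intercept minus the correction constant
intercept <- -2.4595750513 - 1.0647907

# Hypothetical user (illustrative values, not taken from the dataset)
user <- c(songsListened = 10000, lovedTracks = 50, playlists = 1, posts = 0,
          shouts = 0, friend_cnt = 10, subscriber_friend_cnt = 1, age = 25,
          male = 1, tenure = 40, good_country = 0)

# Linear predictor and subscription probability
V <- intercept + sum(coefs * user[names(coefs)])
P <- plogis(V)   # exp(V) / (1 + exp(V))
print(c(V = V, P = P))
```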

Now let’s discuss the impact of each variable on the likelihood that a customer \(i\) becomes a subscriber.

Content Consumption

Content consumption, in terms of songs listened to, has a positive and significant association with the subscription decision, but the effect is small.

Looking at our full model, the effect of creating a playlist is equal to that of listening to around 11,000 more songs.

Content Organization

Marking tracks as loved and creating playlists are positively associated with subscription behavior (odds ratio = 1.001 for each track marked as loved, and odds ratio = 1.125 for each playlist created).
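The odds ratios quoted here are the exponentials of the estimated coefficients, and the playlist-to-songs equivalence is the ratio of two coefficients. A minimal sketch, with coefficients copied from the regression output and illustrative object names:

```r
# Coefficients copied from the glm() summary
coefs <- c(songsListened = 0.0000104973,
           lovedTracks   = 0.0010328594,
           playlists     = 0.1178354030)

# Odds ratio for a one-unit increase in each variable
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Number of additional songs with the same effect on V as one more playlist
songs_per_playlist <- coefs[["playlists"]] / coefs[["songsListened"]]
print(round(songs_per_playlist))
```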

Community Participation

Unfortunately, posting comments and sending shouts have at best a weak association with the subscription decision: shouts are not significant (p = 0.27), and posts are significant only at the 5% level (p = 0.037), with a very small coefficient.

Therefore, this doesn’t allow me to confirm the ladder of participation theory.

It may also mean that the website does not provide enough Community Participation and Community Leadership features.

As building these features is costly and without immediate ROI, I would suggest building a social media campaign in order to build a page on Facebook and Twitter. This is easy and would only require small adjustments of the website in order to ask premium customers to share their likes and playlists there.

Nevertheless, I would recommend, in the long term, building features that allow premium customers to interact more and more with the website: groups, blog entries and forums. According to Oestreicher-Singer and Zalmanson, making the social experience central to the content website would help build a strong relationship with customers, and therefore increase customer lifetime value.

Demographics

The age and the gender of the user are positively associated with the likelihood of subscription (odds ratio = 1.038 for each additional year, and odds ratio = 1.535 if the user is male).

Finally, tenure (the time since the user started using the website) is negatively associated with the subscription decision. The longer users stay on the free tier, the less likely they are to become premium users.

Location

Surprisingly, the likelihood of subscription is negatively associated with a user’s location in the USA, the UK or Germany. It might be interesting to look further at this data in order to identify whether there is a location where the web service is more appreciated.

Social Influence

As expected, the number of subscriber friends is associated with a strong positive effect on the user’s likelihood to become a premium member.

Interestingly, the number of friends without a subscription has a small negative influence on the subscription likelihood.

Furthermore, is this just a correlation or a causal effect?

* In the case of causality, viral marketing would be a strong tool to boost subscriptions (give an incentive to existing clients if they convince free users through an affiliation program). But these programs cost a lot and lower the customer lifetime value drastically.
* Otherwise, a safer and cheaper way is to invest in network targeting, using the homophily principle to convince friends of premium customers to subscribe.

5 Propensity Score Matching

Finally, I run a propensity score matching in order to decide whether the number of subscriber friends has a causal effect on the likelihood of becoming a premium member or whether it is simply a correlation.

Running matchit on all the data with the nearest-neighbour method, I obtain the following balance summary for the matched data.

library(MatchIt)
match <- matchit(adopter ~ songsListened + lovedTracks + playlists + posts + shouts +
                         friend_cnt + subscriber_friend_cnt + age + male + tenure + good_country,
                 data = train, method = 'nearest')
summary(match)
## 
## Call:
## matchit(formula = adopter ~ songsListened + lovedTracks + playlists + 
##     posts + shouts + friend_cnt + subscriber_friend_cnt + age + 
##     male + tenure + good_country, data = train, method = "nearest")
## 
## Summary of balance for all data:
##                       Means Treated Means Control SD Control  Mean Diff
## distance                     0.3493        0.2174     0.1196     0.1319
## songsListened            32053.5411    16306.1022 26269.8789 15747.4390
## lovedTracks                254.2637       85.2726   273.7272   168.9911
## playlists                    0.8993        0.5468     0.9839     0.3526
## posts                       19.6330        3.6587    37.8340    15.9744
## shouts                     101.2474       26.6532   141.4194    74.5942
## friend_cnt                  36.8468       16.5819    58.5347    20.2649
## subscriber_friend_cnt        1.5199        0.3764     3.1390     1.1436
## age                         26.2798       24.1599     6.7822     2.1199
## male0                        0.2632        0.3784     0.4850    -0.1152
## male1                        0.7368        0.6216     0.4850     0.1152
## tenure                      44.7935       43.2401    19.8933     1.5534
## good_country1                0.3007        0.3667     0.4819    -0.0660
##                          eQQ Med   eQQ Mean     eQQ Max
## distance                  0.0943     0.1319      0.3877
## songsListened         13068.5000 15720.5626 396822.0000
## lovedTracks              90.0000   168.8028   3508.0000
## playlists                 0.0000     0.3489     80.0000
## posts                     0.0000    15.7896   6585.0000
## shouts                    4.0000    73.9058  58136.0000
## friend_cnt                9.0000    19.8794   1168.0000
## subscriber_friend_cnt     0.0000     1.1082     33.0000
## age                       2.0000     2.1297      5.0000
## male0                     0.0000     0.1152      1.0000
## male1                     0.0000     0.1152      1.0000
## tenure                    2.0000     1.5577      9.0000
## good_country1             0.0000     0.0660      1.0000
## 
## 
## Summary of balance for matched data:
##                       Means Treated Means Control SD Control Mean Diff
## distance                     0.3493        0.3149     0.1541    0.0344
## songsListened            32053.5411    28557.5499 37984.8272 3495.9912
## lovedTracks                254.2637      175.9633   446.2267   78.3005
## playlists                    0.8993        0.7350     1.4431    0.1643
## posts                       19.6330        7.6123    62.1898   12.0207
## shouts                     101.2474       48.4860   226.5427   52.7614
## friend_cnt                  36.8468       26.9156    91.7784    9.9312
## subscriber_friend_cnt        1.5199        0.8652     5.3599    0.6548
## age                         26.2798       27.1638     8.7437   -0.8841
## male0                        0.2632        0.1969     0.3977    0.0663
## male1                        0.7368        0.8031     0.3977   -0.0663
## tenure                      44.7935       44.4762    19.5731    0.3173
## good_country1                0.3007        0.2676     0.4428    0.0331
##                         eQQ Med  eQQ Mean     eQQ Max
## distance                 0.0012    0.0344      0.1949
## songsListened         3127.0000 3537.4156 396822.0000
## lovedTracks             65.0000   80.3905   2302.0000
## playlists                0.0000    0.1654     80.0000
## posts                    0.0000   12.0207   6585.0000
## shouts                   2.0000   52.7707  58136.0000
## friend_cnt               5.0000   10.1972   1168.0000
## subscriber_friend_cnt    0.0000    0.6661     22.0000
## age                      1.0000    1.1335      9.0000
## male0                    0.0000    0.0663      1.0000
## male1                    0.0000    0.0663      1.0000
## tenure                   1.0000    0.6610      5.0000
## good_country1            0.0000    0.0331      1.0000
## 
## Percent Balance Improvement:
##                       Mean Diff. eQQ Med eQQ Mean  eQQ Max
## distance                 73.9143 98.7089  73.9058  49.7370
## songsListened            77.7996 76.0722  77.4982   0.0000
## lovedTracks              53.6659 27.7778  52.3761  34.3786
## playlists                53.3863  0.0000  52.5964   0.0000
## posts                    24.7501  0.0000  23.8695   0.0000
## shouts                   29.2688 50.0000  28.5973   0.0000
## friend_cnt               50.9933 44.4444  48.7047   0.0000
## subscriber_friend_cnt    42.7438  0.0000  39.8879  33.3333
## age                      58.2971 50.0000  46.7736 -80.0000
## male0                    42.4649  0.0000  42.4719   0.0000
## male1                    42.4649  0.0000  42.4719   0.0000
## tenure                   79.5743 50.0000  57.5677  44.4444
## good_country1            49.7733  0.0000  49.8039   0.0000
## 
## Sample sizes:
##           Control Treated
## All         11566    3864
## Matched      3864    3864
## Unmatched    7702       0
## Discarded       0       0
plot(match,type = "jitter")

plot(match,type = "hist")

Most of the eQQ medians and means are markedly smaller after matching than before, although the eQQ maxima are largely unchanged (and increase for age, from 5 to 9). I therefore consider the matching reasonably successful. This is confirmed by the graphs: the distribution is similar between the treated and control groups.
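The "Percent Balance Improvement" figures in the summary can be checked by hand from the mean differences before and after matching. As a sketch, assuming the improvement is computed as 100 x (|before| - |after|) / |before| and using the rounded values printed in the output (so the last decimals may differ slightly):

```r
# Mean differences (treated - control) copied from the matchit summary
before <- c(distance = 0.1319, songsListened = 15747.4390,
            subscriber_friend_cnt = 1.1436, age = 2.1199)
after  <- c(distance = 0.0344, songsListened = 3495.9912,
            subscriber_friend_cnt = 0.6548, age = -0.8841)

# Relative reduction of the absolute mean difference, in percent
improvement <- 100 * (abs(before) - abs(after)) / abs(before)
print(round(improvement, 1))
```

For the variable of interest, subscriber_friend_cnt, the remaining mean difference after matching is still more than half of the original one.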

The dataset containing only the matched observations is then used to run the logistic regression again with the same variables. The result is the following.

m.data <- match.data(match, group = "all")
mymodel2 <- glm(adopter ~ songsListened + lovedTracks + playlists + posts + shouts +
                        friend_cnt + subscriber_friend_cnt + age + male + tenure + good_country,
               data = m.data, family = 'binomial')
summary(mymodel2)
## 
## Call:
## glm(formula = adopter ~ songsListened + lovedTracks + playlists + 
##     posts + shouts + friend_cnt + subscriber_friend_cnt + age + 
##     male + tenure + good_country, family = "binomial", data = m.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.1971  -1.1317  -0.3154   1.1792   1.6239  
## 
## Coefficients:
##                             Estimate     Std. Error z value     Pr(>|z|)    
## (Intercept)            0.43494910709  0.09734208844   4.468 0.0000078861 ***
## songsListened          0.00000004679  0.00000066729   0.070     0.944099    
## lovedTracks            0.00029241358  0.00006243000   4.684 0.0000028152 ***
## playlists              0.04304060818  0.01652440512   2.605     0.009196 ** 
## posts                  0.00058746523  0.00031519516   1.864     0.062348 .  
## shouts                 0.00010146898  0.00009556274   1.062     0.288324    
## friend_cnt            -0.00151117848  0.00051034215  -2.961     0.003065 ** 
## subscriber_friend_cnt  0.08190044350  0.01453631911   5.634 0.0000000176 ***
## age                   -0.01737307357  0.00317685231  -5.469 0.0000000453 ***
## male1                 -0.30147482811  0.05607412461  -5.376 0.0000000760 ***
## tenure                 0.00124677329  0.00126806718   0.983     0.325505    
## good_country1          0.18637373158  0.05239767797   3.557     0.000375 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10713  on 7727  degrees of freedom
## Residual deviance: 10508  on 7716  degrees of freedom
## AIC: 10532
## 
## Number of Fisher Scoring iterations: 5

The result of the logistic regression is now the following (with the intercept again corrected by the constant): \(Y_i = -0.6298416 + 0.00000005\, songsListened_i + 0.0002924\, lovedTracks_i + 0.0430406\, playlists_i + 0.0005875\, posts_i + 0.0001015\, shouts_i - 0.0015112\, friendcnt_i + 0.0819004\, subscriberfriendcnt_i - 0.0173731\, age_i - 0.3014748\, male_i + 0.0012468\, tenure_i + 0.1863737\, goodcountry_i + \epsilon_i = V_i + \epsilon_i\)

Discussion

With the matched data, the number of songs listened to is no longer significant, which is quite intuitive given that the mean time spent on the website before becoming a premium member is more than 3 years.

The Community Participation variables (posts and shouts) are also not significant predictors of becoming a premium member. This is less intuitive and weakens the ladder of participation theory.

By contrast, the number of subscriber friends is still associated with a strong positive effect on the user’s likelihood to become a premium member: the estimated coefficient (0.0819) is significant, with an odds ratio of 1.085. Therefore, I conclude that there is a causal relationship between having subscriber friends and becoming a subscriber. Hence, viral marketing might be a good solution to increase the number of subscribers.

Nevertheless, this estimated coefficient was 0.2174 without propensity score matching, i.e. about 2.65 times (165% more than) the value obtained on matched data.

This means that, in terms of customer lifetime value, the viral marketing budget might be divided by 2.65 compared with the initial discussion.
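The figures in this discussion can be recomputed from the two estimated coefficients of subscriber_friend_cnt. The sketch below uses the unrounded coefficients from the two regression outputs, so the ratio comes out marginally above the 2.65 obtained from the rounded values. Reading the matched-to-unmatched share as the causal part of the association is this paper’s interpretation, not a formal decomposition.

```r
# Coefficients of subscriber_friend_cnt from the two regressions
b_unmatched <- 0.2174699255   # balanced training sample
b_matched   <- 0.0819004435   # matched sample

# Odds ratio per additional subscriber friend, after matching
or_matched <- exp(b_matched)

# How many times larger the unmatched coefficient is
ratio <- b_unmatched / b_matched

# Share of the unmatched coefficient that remains after matching
share_remaining <- b_matched / b_unmatched

print(round(c(or_matched = or_matched, ratio = ratio,
              share_remaining = share_remaining), 3))
```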

Furthermore, these are only backtesting statistics. The best way to confirm this causality would be to test it with a real-world experiment. I would suggest creating three groups with similar characteristics (demographics, social, consumption, participation…):

Only the results would confirm that the link between the likelihood of becoming a premium customer and the number of subscriber friends is made of two factors: homophily (about \(2/3\)) and causality (about \(1/3\)).