require(dplyr)
require(knitr)
require(ggplot2)
require(lubridate) # for date and time
require(rjson) #for loading json file
require(jsonlite) # for loading json file
require(readr) # for speeding up reading the csv file
require(parallel) # to speed up some codes
require(tidyjson)
require(purrr)
require(plotly)
require(smooth)
require(forecast)
require(gplots) # for balloonplot
require(tidyr) # for spreading df
require(alluvial)
require(venneuler)
require(caTools) # for sample split
require(caret) # for confusionMatrix
require(ROCR) # for roc curve
require(parallel) # for parallel computing
# dyn.load(paste0(system2('/usr/libexec/java_home', stdout = TRUE), '/jre/lib/server/libjvm.dylib'))
# library(rJava)
# install.packages("venneuler")
# require(VennDiagram)
# install.packages("VennDiagram")
#require(data.table)
#install.packages("jsonlite")
This is an analytical report on mobile gaming data from one of the top 10 companies in the field. I received this data in the summer of 2017 as part of the company's application process for a senior data science position; however, I did not send the report back, because I was travelling at the time and had no opportunity to even play with the data.
This report is based on understanding the data after cleaning it, and then trying to find patterns, regularities, and irregularities in it. The ultimate goal is finding insight in the data and generating new hypotheses and questions for the sake of business insight. This is why the last section of the report is dedicated to business insights.
The data includes three datasets: transactions, interactions, and users. Unfortunately, these datasets are not fully related; for instance, the users dataset includes users that are not in the transactions and interactions datasets! This mars a big part of the analysis and the possibility of extracting great insights from the data. Such unsatisfied curiosities are therefore raised as questions in the business insights section.
The starting point of our analysis is reading the datasets. Two datasets are provided in CSV format, and one in JSON.
interactions_data <- read_csv(file = "/Users/Shaahin/Downloads/wetransfer-f4f0aa/interactions.csv" )
interactions_data$interaction <- factor(interactions_data$interaction)
# dimension
dim(interactions_data)
# structure of the data
str(interactions_data)
#head
kable(head(interactions_data))
# levels
levels(interactions_data$interaction)
# number of unique players
interactions_data %>%
select(user_id) %>%
unique() %>%
count()
# frequency of interactions by players
interactions_data %>%
select(user_id) %>%
group_by(user_id) %>%
mutate(freq = n()) %>%
unique() %>%
arrange(desc(freq))
glimpse(interactions_data)
interactions_data$year <- factor(year(interactions_data$time))
interactions_data$month <- factor(month(interactions_data$time))
interactions_data$wday <- factor(wday(interactions_data$time, label = TRUE))
interactions_data$hour <- factor(hour(interactions_data$time))
interactions_data$minute <- factor(minute(interactions_data$time))
interactions_data$second <- factor( second(interactions_data$time))
glimpse(interactions_data)
## Observations: 9,980,862
## Variables: 9
## $ user_id <int> 60552216, 60552216, 60552216, 59522538, 60551821, ...
## $ interaction <fct> NA, singleplayer_mission, item_interaction, item_i...
## $ time <dttm> 2017-03-07 08:25:45, 2017-03-06 08:30:21, 2017-03...
## $ year <fct> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20...
## $ month <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ wday <ord> Tue, Mon, Sun, Sat, Sun, Wed, Fri, Sat, Wed, Fri, ...
## $ hour <fct> 8, 8, 11, 14, 4, 9, 14, 4, 11, 18, 10, 10, 2, 9, 2...
## $ minute <fct> 25, 30, 54, 23, 36, 47, 9, 11, 4, 9, 0, 23, 54, 32...
## $ second <fct> 45, 21, 11, 45, 28, 47, 16, 55, 24, 53, 12, 12, 5,...
The interactions dataset includes around 10,000,000 records, each mainly composed of the type of interaction a specific player performed at a specific time.
# users <- jsonlite::fromJSON(txt = "/Users/Shaahin/Downloads/wetransfer-f4f0aa/users.json")
users <- read_csv("/Users/Shaahin/Downloads/wetransfer-f4f0aa/users_df.csv")
users$platform <- factor(users$platform)
users$country <- factor(users$country)
users %>%
glimpse()
## Observations: 1,368,355
## Variables: 5
## $ user_id <int> 905912, 1065185, 799911, 1132235, 1080677, 662457...
## $ install_date <date> 2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01, ...
## $ platform <fct> platform_2, platform_1, platform_2, platform_1, p...
## $ country <fct> country_2, country_2, NA, country_1, country_6, c...
## $ nickname <chr> "nickname_852668912434373", "nickname_64724137213...
The users dataset was in JSON format, and it took me a long time to transform it into a tidy data frame. The dataset includes around 1.3 million users along with their country and platform. However, these users are not related to the users in the transactions or interactions datasets!
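For reference, one plausible way to flatten the raw JSON into a tidy data frame is sketched below (hypothetical: the exact structure of users.json is not shown in this report, so the flattening step is an assumption):
# Hypothetical flattening of the raw JSON; fromJSON() with flatten = TRUE
# turns nested records into plain columns, which can then be saved as CSV.
users_raw <- jsonlite::fromJSON("/Users/Shaahin/Downloads/wetransfer-f4f0aa/users.json",
                                flatten = TRUE)
users_df <- as_tibble(users_raw)
# write_csv(users_df, "/Users/Shaahin/Downloads/wetransfer-f4f0aa/users_df.csv")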
transactions <- read_csv("/Users/Shaahin/Downloads/wetransfer-f4f0aa/transactions.csv")
transactions$concept <- factor(transactions$concept)
transactions$year <- factor(year(transactions$time))
transactions$month <- factor(month(transactions$time))
transactions$wday <- factor(wday(transactions$time, label = TRUE))
transactions$hour <- factor(hour(transactions$time))
transactions$minute <- factor(minute(transactions$time))
transactions$second <- factor( second(transactions$time))
glimpse(transactions)
## Observations: 74,558
## Variables: 10
## $ user_id <int> 60685086, 61131996, 61858418, 60344003, 61072701, 6106...
## $ time <dttm> 2017-03-03 20:14:13, 2017-03-20 19:22:18, 2017-03-14 ...
## $ concept <fct> NA, item_03, item_01, boost_12, currency_pack_07, curr...
## $ eur <dbl> 240927.18, 2.07, 3.99, 12.05, 2.74, 2.99, 11.99, 5.99,...
## $ year <fct> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, ...
## $ month <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ wday <ord> Fri, Mon, Tue, Wed, Sun, Thu, Wed, Sun, Mon, Sun, Thu,...
## $ hour <fct> 20, 19, 17, 23, 18, 23, 2, 11, 21, 20, 6, 5, 22, 20, 2...
## $ minute <fct> 14, 22, 37, 42, 9, 52, 39, 15, 33, 49, 9, 13, 5, 17, 4...
## $ second <fct> 13, 18, 53, 51, 30, 44, 28, 49, 23, 32, 16, 56, 3, 43,...
# write_csv(x = transactions ,
# path = "/Users/Shaahin/Downloads/wetransfer-f4f0aa/transactions.csv")
Around 75,000 transactions are gathered in the transactions dataset. This dataset has a considerable intersection with the interactions dataset; thanks, data provider!
I am curious about the transactions data. What is more natural than a histogram of payments?
# glimpse(transactions)
#
# head(transactions)
transactions %>%
arrange(desc(eur)) %>%
ggplot() +
geom_histogram(aes(eur), bins = 100)
(quantile(x = transactions$eur , probs = seq(0,1,0.1)))
## 0% 10% 20% 30% 40%
## -126.99 2.71 2.99 3.65 4.41
## 50% 60% 70% 80% 90%
## 5.99 6.99 11.21 12.99 23.99
## 100%
## 106449754.65
transactions %>%
select(eur) %>%
summary()
## eur
## Min. : -127
## 1st Qu.: 3
## Median : 6
## Mean : 71718
## 3rd Qu.: 12
## Max. :106449755
# concept not available
t<- transactions %>%
filter(is.na(concept)) %>%
dplyr::select(eur)
quantile(t$eur,probs = seq(0,1,0.1))
## 0% 10% 20% 30% 40%
## -126.990 -2.223 2.950 2.990 3.990
## 50% 60% 70% 80% 90%
## 5.990 6.990 11.990 18.472 120.112
## 100%
## 106449754.650
# concept available
tt <- transactions %>%
filter(!is.na(concept)) %>%
dplyr::select(eur)
quantile(tt$eur,probs = seq(0,1,0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80%
## 0.070 2.797 2.990 3.750 4.570 5.990 6.990 11.090 12.930
## 90% 100%
## 23.030 181.770
The distribution of payments is so skewed that the histogram is literally a single bar. For such skewed distributions, histograms reveal little about the raw data.
What to do? I check the quantiles instead. It can be seen that there are very strange values among the purchases, from negative purchases to astronomical spending. Possibly both are outliers, right?
Some of the payments are for items that have missing values in the dataset. I got curious whether these non-existent concepts are related to the irrational payments, so I divided the dataset into two, based on the missing values of the concept variable, and the findings endorsed my guess.
So, whenever the concept variable is not available, we have very strange eur values, from negative payments to 106 million euros! It is better to work with complete cases then; however, these strange transactions should be investigated further.
transactions<- transactions %>%
mutate(suspicious = is.na(concept))
transactions %>%
glimpse()
## Observations: 74,558
## Variables: 11
## $ user_id <int> 60685086, 61131996, 61858418, 60344003, 61072701, 6...
## $ time <dttm> 2017-03-03 20:14:13, 2017-03-20 19:22:18, 2017-03-...
## $ concept <fct> NA, item_03, item_01, boost_12, currency_pack_07, c...
## $ eur <dbl> 240927.18, 2.07, 3.99, 12.05, 2.74, 2.99, 11.99, 5....
## $ year <fct> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201...
## $ month <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ wday <ord> Fri, Mon, Tue, Wed, Sun, Thu, Wed, Sun, Mon, Sun, T...
## $ hour <fct> 20, 19, 17, 23, 18, 23, 2, 11, 21, 20, 6, 5, 22, 20...
## $ minute <fct> 14, 22, 37, 42, 9, 52, 39, 15, 33, 49, 9, 13, 5, 17...
## $ second <fct> 13, 18, 53, 51, 30, 44, 28, 49, 23, 32, 16, 56, 3, ...
## $ suspicious <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
Now the dataset seems okay with the new suspicious variable. The deeper exploration is done in separate sections.
summary(transactions)
# how many unique users do transactions
transactions %>%
filter(suspicious == FALSE ) %>%
select(user_id) %>%
distinct() %>%
count()
#how many transactions they do
transactions %>%
filter(suspicious == FALSE ) %>%
count()
In this section some of the KPIs are calculated and analyzed.
DAU: Daily Active Users. DAU is the number of unique users who start at least one session in the app on any given day.
This is a time series. I also calculate the ADAU; a short sketch of that computation follows the DAU plot below.
dau_data <- interactions_data %>%
mutate(day = day(time)) %>%
select(user_id,day,wday) %>%
distinct() %>%
group_by(day,wday) %>%
count()
# plot(dau_data, type = "b")
# fit.ex4 <- lm(data = dau_data , formula = n ~ day )
# lines(fitted(fit.ex4),col="blue")
# sma(data = dau_data$n , order = 2 )
dau_g <- dau_data %>%
ggplot() +
geom_point(aes(x = day , y = n)) +
#scale_color_brewer(palette = "Dark2") +
geom_line(aes(x = day, y = n )) +
geom_hline(aes(yintercept = mean(n)), color = "green") +
geom_smooth(aes(x = day, y = n),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "Number of Active Users") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light()
ggplotly(dau_g)
# lm_model <- (lm(data = dau_data , formula = n~ day ))
# lm_model_string <- paste("DAU=",lm_model$coefficients[1],
# lm_model$coefficients[2],"DAY"
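A minimal sketch of the ADAU (taking ADAU to mean the average DAU over the observed period, which is also the level drawn as the green reference line above):
# ADAU as the mean of the daily unique-active-user counts held in dau_data.
adau <- mean(dau_data$n)
adau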
As can be seen, the trend is downward, so the DAU is shrinking relatively fast. It seems that we have lost almost 20,000 active users from the first Saturday to the third one.
It is a good point to do some exploratory data analysis on the interactions data.
glimpse(interactions_data)
## Observations: 9,980,862
## Variables: 9
## $ user_id <int> 60552216, 60552216, 60552216, 59522538, 60551821, ...
## $ interaction <fct> NA, singleplayer_mission, item_interaction, item_i...
## $ time <dttm> 2017-03-07 08:25:45, 2017-03-06 08:30:21, 2017-03...
## $ year <fct> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20...
## $ month <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ wday <ord> Tue, Mon, Sun, Sat, Sun, Wed, Fri, Sat, Wed, Fri, ...
## $ hour <fct> 8, 8, 11, 14, 4, 9, 14, 4, 11, 18, 10, 10, 2, 9, 2...
## $ minute <fct> 25, 30, 54, 23, 36, 47, 9, 11, 4, 9, 0, 23, 54, 32...
## $ second <fct> 45, 21, 11, 45, 28, 47, 16, 55, 24, 53, 12, 12, 5,...
# frequency of each player
t <- interactions_data %>%
group_by(user_id) %>%
count(user_id, sort = TRUE)
summary(t$n)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 3.000 7.385 7.000 1177.000
quantile(t$n, probs = seq(0,1,0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1 1 1 1 2 3 4 6 9 17 1177
Since the frequency distribution of users' interactions is so skewed, a barplot or histogram is not helpful. The output above therefore shows the six summary statistics and the quantiles of the users' interaction frequencies.
It is interesting that 50% of our users have 3 or fewer interactions.
g_interactions <- interactions_data %>%
group_by(interaction) %>%
count(interaction, sort = TRUE) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(interaction,n)),
fill = "skyblue") +
xlab(label = "Interaction Type") +
theme_light()
ggplotly(g_interactions)
Single-player missions are by far the most popular interaction. There is also a considerable number of interactions with a missing type. This must be addressed by the data engineers.
# weekdays
g_wday <- interactions_data %>%
#group_by(wday) %>%
count(wday, sort = FALSE) %>%
ggplot() +
geom_col(aes(y = n , x = (wday)),
fill = "skyblue") +
xlab(label = "Weekdays") +
ylab(label = "Number of Interactions") +
theme_light()
ggplotly(g_wday)
# hours
g_hours <- interactions_data %>%
#group_by(wday) %>%
count(hour, sort = FALSE) %>%
ggplot() +
geom_col(aes(y = n , x = (hour)),
fill = "skyblue") +
xlab(label = "Hours") +
ylab(label = "Number of Interactions") +
theme_light()
ggplotly(g_hours)
Most interactions happen on weekends, on Saturday followed by Sunday; the fewest on Tuesday.
The rush hours of interactions are 12 to 16. At 22-23 we have the fewest interactions.
t <- interactions_data %>%
select(wday,hour) %>%
table() %>%
data.frame()
g_wday_hour <- t %>%
ggplot() +
geom_line(aes(x = hour , y = Freq , group = wday, color = wday)) +
scale_color_brewer(palette = "Dark2") +
ylab("Number of Interactions") +
theme_light()
ggplotly(g_wday_hour)
# balloonplot(x = t, label = FALSE, main = "Wday vs Hour")
So, more or less, the distributions of the number of interactions over time are similar on the different weekdays. No special pattern.
t<- interactions_data %>%
select(interaction,wday) %>%
table() %>%
data.frame()
g_interact_wday <- t %>%
ggplot() +
geom_bar(aes(x = wday, y = Freq, fill = interaction),
stat = "identity") +
ylab("Number of Interactions") +
theme_light()
ggplotly(g_interact_wday)
# with line chart
g_interact_wday_line <- t %>%
ggplot() +
geom_line(aes(x = wday, y = Freq ,
group = interaction , color =interaction )) +
scale_color_brewer(palette = "Dark2") +
ylab("Number of Interactions") +
theme_light()
ggplotly(g_interact_wday_line)
Most of the interactions are "singleplayer", then "item interaction", and then "multiplayer". The multiplayer interactions seem to fluctuate less than the singleplayer ones. The number of singleplayer interactions drops drastically on Tuesday and rises steeply afterwards. It is an interesting pattern for sure; maybe the users of these two modes are different, or maybe multiplayer is less demanding and therefore more compatible with the middle of the week.
t<- interactions_data %>%
select(interaction,hour) %>%
table() %>%
data.frame()
g_interact_hour <- t %>%
ggplot() +
geom_bar(aes(x = hour, y = Freq, fill = interaction),
stat = "identity") +
ylab("Number of Interactions") +
theme_light()
ggplotly(g_interact_hour)
# with line chart
g_interact_hour_line <- t %>%
ggplot() +
geom_line(aes(x = hour, y = Freq ,
group = interaction , color =interaction )) +
scale_color_brewer(palette = "Dark2") +
ylab("Number of Interactions") +
theme_light()
ggplotly(g_interact_hour_line)
No specific pattern based on the 24 hours. It seems to me that players play both single- and multiplayer, but they are more inclined towards singleplayer.
t<- interactions_data %>%
select(interaction,hour) %>%
table() %>%
data.frame()
g_heatmap <- ggplot(t, aes(x = hour, y = interaction )) +
geom_tile(aes(fill = Freq) , color = "white") +
scale_fill_gradient(low = "blue", high = "red") +
ylab("interactions") +
xlab("hours") +
theme(legend.title = element_text(size = 10),
legend.text = element_text(size = 12),
plot.title = element_text(size=16),
axis.title=element_text(size=14,face="bold"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(fill = "Frequency")
ggplotly(g_heatmap)
Single-player is by far more popular than multiplayer. Was this expected beforehand?
To explore the relations of three categorical variables, we can use three comparable plots, i.e. faceting. It is also possible to use heatmaps.
t <- interactions_data %>%
group_by(interaction,wday,hour) %>%
count()
g_wday_hour_int <- t %>%
ggplot() +
geom_line(aes(x=hour,y=n ,group = wday, color = wday)) +
facet_grid(interaction ~ . ) +
scale_color_brewer(palette = "Dark2") +
theme_light()
ggplotly(g_wday_hour_int)
The interesting point in the above facet lineplot is the distribution of NAs. The rest is more or less similar to what we expected.
t <- interactions_data %>%
group_by(user_id,interaction) %>%
count() %>%
spread(key = interaction, value = n, fill = 0 ) %>%
mutate(item_interaction = ifelse(test = item_interaction != 0,
yes = 1, no = 0) ) %>%
mutate(multiplayer_mission =ifelse( test = multiplayer_mission != 0,
yes = 1, no = 0) ) %>%
mutate(singleplayer_mission =ifelse( test = singleplayer_mission != 0,
yes = 1, no = 0) )
t$item_interaction <- factor(t$item_interaction)
t$multiplayer_mission <- factor(t$multiplayer_mission)
t$singleplayer_mission <- factor(t$singleplayer_mission)
alluvial_df <- t %>%
group_by(item_interaction,multiplayer_mission,singleplayer_mission) %>%
count()
alluvial_df <- alluvial_df[,c(3,2,1,4)]
alluvial(alluvial_df[,1:3], freq = alluvial_df$n,
col = ifelse(alluvial_df$singleplayer_mission == 1, "gold", "grey"),
border = ifelse(alluvial_df$singleplayer_mission == 1, "gold", "grey"),
cex = 0.7)
The alluvial diagram is quite interesting. The majority of players play singleplayer missions, a relative majority play multiplayer, and a relative majority do item interactions. That is the interaction types considered in isolation; combining them, we get 8 groups of players. The majority play single and multi and interact with items. The second largest group play single, do not play multi, and do not interact with items. The third largest group play single, do not play multi, but interact with items.
This is very important for learning the behaviour of our players. The diagram would be much more useful when augmented with a response variable such as churn. Then we could understand which groups of gamers churn more than others.
Paying Active Users (PAU): the number of unique active users who pay for the game on each specific date.
pau_data <- transactions %>%
filter(suspicious == FALSE) %>%
mutate(day = day(time)) %>%
select(user_id, day) %>%
distinct() %>%
group_by(day) %>%
count()
g_pau <- pau_data %>%
ggplot() +
geom_point(aes(x = day , y = n ))+
geom_line(aes(x = day , y = n )) +
geom_hline(aes(yintercept = mean(n)), color = "green") +
geom_smooth(aes(x = day, y = n),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "PAU") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light()
ggplotly(g_pau)
So the trend of paying active users is downward. I used the transactions dataset, since each paying user is also a daily active user. The numbers are the unique payers per day. Especially after day 11, the trend plunged quickly, even though on days 18 and 19 it rose a little.
Now for the new daily active users (NDAU) and the new paying active users (NPAU).
ndau_data <- interactions_data %>%
mutate(day = day(time)) %>%
select(user_id, day) %>%
arrange(day) %>%
distinct(user_id, .keep_all = TRUE) %>%
group_by(day) %>%
count()
g_ndau <- ndau_data %>%
ggplot() +
geom_point(aes(x = day , y = n ))+
geom_line(aes(x = day , y = n )) +
geom_hline(aes(yintercept = mean(n)), color = "green") +
geom_smooth(aes(x = day, y = n),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "NDAU") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light()
ggplotly(g_ndau)
New daily active users are still joining the game, but the trend is downward and worrying.
But what about the NPAU?
npau_data <- transactions %>%
filter(suspicious== FALSE) %>%
mutate(day = day(time)) %>%
arrange(day) %>%
select(user_id,day) %>%
distinct(user_id,.keep_all = TRUE) %>%
group_by(day) %>%
count()
g_npau <- npau_data %>%
ggplot() +
geom_point(aes(x = day , y = n ))+
geom_line(aes(x = day , y = n )) +
geom_hline(aes(yintercept = mean(n)), color = "green") +
geom_smooth(aes(x = day, y = n),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "NPAU") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light()
ggplotly(g_npau)
The number of unique payers has gone down drastically from the first day. Scary!
How many transactions did each user make? The output below shows the frequency of transactions per user in quantile format.
transaction_quantile <- transactions %>%
mutate(user_id = factor(user_id)) %>%
group_by(user_id) %>%
count(sort = TRUE)
quantile(x = transaction_quantile$n , probs = seq(0,1,0.1) )
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1 1 1 1 1 1 2 2 3 5 295
Since the distribution is very right-skewed, a bar plot would be less informative than a simple quantile table. As can be seen, 60% of the paying users purchased only once, and 90% made fewer than 6 purchases.
I think it would be helpful for later purposes to classify the users based on their frequency of transactions. The three categories that I define are: casual, enthusiast, and avid. Below 10 transactions is considered casual, between 10 and 20 is enthusiast, and above 20 transactions is called avid.
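Before the actual tagging below, here is a minimal sketch of how these three labels could be attached directly with cut() (hypothetical; it reuses transaction_quantile and the same breaks as the chunk that follows):
# Hypothetical labeled version of the transaction-frequency binning.
transaction_quantile %>%
  ungroup() %>%
  mutate(category = cut(n,
                        breaks = c(1, 10, 20, 295),
                        labels = c("casual", "enthusiast", "avid"),
                        include.lowest = TRUE)) %>%
  count(category)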
#transactions$user_id <- factor(transactions$user_id)
transactions$user_id <- as.integer(as.character(transactions$user_id))
transaction_quantile$user_id <- as.integer(as.character(transaction_quantile$user_id))
transactions_tagged <- transaction_quantile %>%
mutate(quantile = cut(n,
breaks = c(1,10,20,295),
include.lowest = TRUE)
) %>%
inner_join(transactions,by = "user_id")
#head(transactions_tagged)
prop.table(table(transactions_tagged$quantile))
##
## [1,10] (10,20] (20,295]
## 0.7644787 0.1054213 0.1301001
So I have tagged the top 13% of the paying users as "avid". They are not all whales in terms of payments, so there should be another classification based on the payments.
quantile(transactions_tagged$eur[transactions_tagged$suspicious==FALSE])
## 0% 25% 50% 75% 100%
## 0.07 3.02 5.99 11.99 181.77
transactions_tagged <- transactions_tagged %>%
filter(suspicious==FALSE) %>%
group_by(user_id) %>%
mutate(tot_payment = sum(eur)) %>%
mutate(payment = cut(tot_payment,
breaks = c(0,8,100,4200),
include.lowest = TRUE)
) %>%
ungroup()
quantile(transactions_tagged$tot_payment[transactions_tagged$suspicious==FALSE])
## 0% 25% 50% 75% 100%
## 0.99 7.98 26.97 106.94 4150.67
prop.table(table(transactions_tagged$payment))
##
## [0,8] (8,100] (100,4.2e+03]
## 0.2533339 0.4861890 0.2604770
According to the quantiles of the transaction payments, I categorize the users into three categories: low, middle, and high. The high category could possibly be named whales, or at least part of it. The middle category starts from 8 euros, and the whales start from 100 euros. Calling 25% of the users whales is probably not very accurate, but it serves our purpose for now.
# g_concept_na <- transactions_tagged %>%
# group_by(concept) %>%
# count(concept, sort = TRUE) %>%
# ggplot() +
# geom_col(aes(x = reorder(concept,nn) , y = nn)) +
# theme_light() +
# theme(axis.text.x = element_text(angle = 90, hjust = 1))
#
# ggplotly(g_concept_na)
g_concept <- transactions_tagged %>%
filter(suspicious==FALSE) %>%
group_by(concept) %>%
count(concept, sort = TRUE) %>%
ggplot() +
geom_col(aes(x = reorder(concept,nn) , y = nn)) +
theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("item frequency")
ggplotly(g_concept)
As can be seen, the most frequent purchase is NA, something that has to be investigated further by the data engineers. Removing the NAs, the most frequent concepts are item_01, item_03, and boost_12. These are good clues for further promotion.
g_trans_wday <- transactions_tagged %>%
filter(suspicious==FALSE) %>%
group_by(wday) %>%
count() %>%
ggplot() +
geom_col(aes(x = wday , y = nn )) +
theme_light() +
ggtitle("frequency of transactions")
ggplotly(g_trans_wday)
There is not much new information in the frequency of purchases per weekday. The minimum occurs on Tuesdays, and we already knew from the EDA section that the number of players is also lowest on Tuesdays.
g_trans_eur <- transactions_tagged %>%
filter(suspicious == FALSE) %>%
ggplot() +
geom_histogram(aes(x = eur),binwidth = 2) +
theme_light()
ggplotly(g_trans_eur)
As can be seen, most of the purchases are in the bracket of 2-4 euros. We previously evaluated this variable through its quantiles.
Now let's check the frequency of transactions per day, and the average payment per day.
g_trans_day <- transactions_tagged %>%
mutate(day = day(time)) %>%
filter(suspicious==FALSE) %>%
group_by(day) %>%
count() %>%
ggplot() +
geom_point(aes(x = day , y = nn ))+
geom_line(aes(x = day , y = nn )) +
geom_hline(aes(yintercept = mean(nn)), color = "green") +
geom_smooth(aes(x = day, y = nn),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "Transaction per day") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light() +
ggtitle("Total Transactions Per Day")
ggplotly(g_trans_day)
g_trans_avg_rev <- transactions_tagged %>%
mutate(day = day(time)) %>%
filter(suspicious==FALSE) %>%
select(day,eur) %>%
group_by(day) %>%
summarise(avg_rev = mean(eur)) %>%
ggplot() +
geom_point(aes(x = day , y = avg_rev ))+
geom_line(aes(x = day , y = avg_rev )) +
geom_hline(aes(yintercept = mean(avg_rev)), color = "green") +
geom_smooth(aes(x = day, y = avg_rev),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "Average Purchase") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light() +
ggtitle("Average Payment Per Day")
ggplotly(g_trans_avg_rev)
Clearly, the game is declining in terms of the number of transactions per day. While the average purchase per day does not show much of a decline, the number of transactions shows a clear drop. It can be concluded that large purchases are more common at the end of the series, and small purchases at the start of it.
How has the total revenue per day changed since the first day?
g_trans_tot_rev <- transactions_tagged %>%
mutate(day = day(time)) %>%
filter(suspicious==FALSE) %>%
select(day,eur) %>%
group_by(day) %>%
summarise(tot_rev = sum(eur)) %>%
ggplot() +
geom_point(aes(x = day , y = tot_rev ))+
geom_line(aes(x = day , y = tot_rev )) +
geom_hline(aes(yintercept = mean(tot_rev)), color = "green") +
geom_smooth(aes(x = day, y = tot_rev),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "Revenue") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light() +
ggtitle("Total Revenue Per Day")
ggplotly(g_trans_tot_rev)
Total revenue per day shows a decline. Our game is not on a good path. But what about the average revenue per active user?
tot_r_pday <- transactions_tagged %>%
filter(suspicious == FALSE) %>%
mutate(day = day(time)) %>%
group_by(day) %>%
summarise(tot_r = sum(eur))
arpu_data <- interactions_data %>%
mutate(day = day(time)) %>%
group_by(day) %>%
select(day,user_id) %>%
dplyr::distinct() %>%
summarise(active_users = n()) %>%
inner_join(y = tot_r_pday, by = "day")
g_arpu <- arpu_data %>%
mutate(arpu = tot_r/active_users) %>%
ggplot() +
geom_point(aes(x = day, y = arpu)) +
geom_line(aes(x = day , y = arpu )) +
geom_hline(aes(yintercept = mean(arpu)), color = "green") +
geom_smooth(aes(x = day, y = arpu),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "ARPU") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light() +
ggtitle("Average Revenue Per Active User")
ggplotly(g_arpu)
The ARPU is computed by dividing the total revenue of a day by the number of unique active users on that day. This KPI is declining and the trend is downward, regardless of the increase in the last few days of the time series.
Since this index is a compound index, the decline may be due to an increase of the denominator, i.e. the number of active users, or a decrease of the numerator, i.e. the total revenue. However, we previously found that the number of active users is declining, so the ARPU trend is due to the reduction of total revenue per day. This finding is compatible with our previous explorations.
In general, I would rather go for simple indices than compound ones.
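One way to check this decomposition numerically is to compare the daily trends of the two components; a minimal sketch (assuming arpu_data as built above, where both slopes are expected to be negative):
# Linear trend (slope per day) of the ARPU numerator (total revenue)
# versus its denominator (number of unique active users).
coef(lm(tot_r ~ day, data = arpu_data))["day"]
coef(lm(active_users ~ day, data = arpu_data))["day"]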
user_per_day <- transactions_tagged %>%
filter(suspicious == FALSE) %>%
mutate(day = day(time)) %>%
group_by(day) %>%
select(user_id, day ) %>%
distinct() %>%
count()
tot_rev <- transactions_tagged %>%
filter(suspicious == FALSE) %>%
mutate(day = day(time)) %>%
group_by(day) %>%
summarise(daily_rev = sum(eur))
g_arppu <- user_per_day %>%
inner_join(y = tot_rev , by = "day") %>%
mutate(arppu = daily_rev / n ) %>%
ggplot() +
geom_point(aes(x = day, y = arppu)) +
geom_line(aes(x = day , y = arppu )) +
geom_hline(aes(yintercept = mean(arppu)), color = "green") +
geom_smooth(aes(x = day, y = arppu),
method = "lm" , color = "blue", se = FALSE) +
ylab(label = "ARPPU") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light() +
ggtitle("ARPPU")
ggplotly(g_arppu)
The ARPPU is not declining that much. Even though we are losing users and total revenue is going down, the number of paying users is shrinking as well, so the ARPPU is almost steady. However, it does not show any positive sign either.
In this section, I am going to make sense of the third and last table: users.
users %>%
summary()
## user_id install_date platform
## Min. : 1 Min. :2017-03-01 platform_1:728685
## 1st Qu.: 342090 1st Qu.:2017-03-03 platform_2:522120
## Median : 684178 Median :2017-03-07 platform_3:103866
## Mean : 684178 Mean :2017-03-08 NA's : 13684
## 3rd Qu.:1026266 3rd Qu.:2017-03-13
## Max. :1368355 Max. :2017-03-20
## NA's :1368
## country nickname
## country_1:422897 Length:1368355
## country_2:322064 Class :character
## country_3:179486 Mode :character
## country_4: 67287
## country_5: 51494
## country_6: 51456
## NA's :273671
users <- users %>%
mutate(install_day = day(install_date))
The first variable to look at is platform.
g_platform <- users %>%
count(platform,sort = TRUE) %>%
ggplot() +
geom_col(aes(x = reorder(platform,n) , y = n)) +
theme_light() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("User Platforms") +
ylab("Number of Users")
ggplotly(g_platform)
So platform_1 is the most popular, followed by platform_2 and platform_3.
g_country <- users %>%
count(country, sort = TRUE) %>%
ggplot() +
geom_col(aes(x = reorder(country,n) , y = n)) +
theme_light() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("User Country") +
ylab("Number of Users")
ggplotly(g_country)
The countries are sorted by their number of users.
Lastly, installs per day.
g_install <- users %>%
count(install_day) %>%
ggplot() +
geom_point(aes(x = install_day, y = n)) +
geom_line(aes(x = install_day, y = n)) +
geom_hline(aes(yintercept = mean(n)), color = "green") +
geom_smooth(aes(x = install_day, y = n),
method = "lm" ,
color = "blue",
se = FALSE) +
ylab(label = "Install Occurances") +
scale_x_continuous(breaks=seq(0,20,1)) +
theme_light() +
ggtitle("Installations Per Day")
ggplotly(g_install)
As expected, the number of installs has gone down in two distinct phases: first drastically after the first few days, then slightly after the 6th day.
What is the role of advertisements here? How visible is the game? Does the game have an invitation feature with rewards?
It is very important for us to see whether a user returns to play our game each day. Retention is, in general, the index for quantifying this.
# for day 2
retention_data <- interactions_data %>%
mutate(day = day(time)) %>%
select(user_id, day) %>%
group_by(day ) %>%
distinct()
new_users <- interactions_data %>%
mutate(day = day(time)) %>%
select(user_id, day) %>%
arrange(day) %>%
distinct(user_id, .keep_all = TRUE) %>%
group_by(day)
retention_final <- data.frame(day = NA, retention = NA)
#can be speeded up using parallel package
for (d in 1:19) {
d_i <- new_users %>% filter(day == d)
d_ii <- retention_data %>% filter(day == (d+1))
retention_final[d,1] <- d
retention_final[d,2] <- mean(d_i$user_id %in% d_ii$user_id)
}
g_1day_ret <- retention_final %>%
mutate(retention = round(retention * 100,2) ) %>%
ggplot() +
geom_point(aes(x = day , y = retention)) +
geom_line(aes(x = day , y = retention )) +
geom_hline(aes(yintercept = mean(retention)), color = "green") +
geom_smooth(aes(x = day,
y = retention),
method = "lm" ,
color = "blue",
se = FALSE) +
ylab(label = "Retention %") +
scale_x_continuous(breaks=seq(0,20,1)) +
scale_y_continuous(breaks=seq(20,50,5)) +
theme_light() +
ggtitle("Retention Day1")
ggplotly(g_1day_ret)
The day-1 retention shows that, after a loss of 15% of the users on the first day, who did not come back to the game the day after, the retention rate has been more or less steady, though inclined to decline.
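As the comments in the retention chunks note, the per-day loop can be parallelized. A minimal sketch using mclapply() from the parallel package (assuming new_users and retention_data as built above, and a platform where forking is available):
# Hypothetical parallel version of the day-1 retention loop.
ret_list <- mclapply(1:19, function(d) {
  d_i  <- new_users %>% filter(day == d)
  d_ii <- retention_data %>% filter(day == (d + 1))
  data.frame(day = d, retention = mean(d_i$user_id %in% d_ii$user_id))
}, mc.cores = max(1, detectCores() - 1))
retention_final <- do.call(rbind, ret_list)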
retention_final <- data.frame(day = NA, retention = NA)
#can be speeded up using parallel package
for (d in 1:17) {
d_i <- new_users %>% filter(day == d)
d_ii <- retention_data %>% filter(day == (d+3))
retention_final[d,1] <- d
retention_final[d,2] <- mean(d_i$user_id %in% d_ii$user_id)
}
g_3day_ret <- retention_final %>%
mutate(retention = round(retention * 100,2) ) %>%
ggplot() +
geom_point(aes(x = day , y = retention)) +
geom_line(aes(x = day , y = retention )) +
geom_hline(aes(yintercept = mean(retention)), color = "green") +
geom_smooth(aes(x = day,
y = retention),
method = "lm" ,
color = "blue",
se = FALSE) +
ylab(label = "Retention %") +
scale_x_continuous(breaks=seq(0,20,1)) +
scale_y_continuous(breaks=seq(30,70,5)) +
theme_light() +
ggtitle("Retention Day3")
ggplotly(g_3day_ret)
Again, the 3-day retention shows that we lost many users on the launch day and the day after. Is it because of bugs?
retention_final <- data.frame(day = NA, retention = NA)
#can be speeded up using parallel package
for (d in 1:13) {
d_i <- new_users %>% filter(day == d)
d_ii <- retention_data %>% filter(day == (d+7) )
retention_final[d,1] <- d
retention_final[d,2] <- mean(d_i$user_id %in% d_ii$user_id)
}
g_7day_ret <- retention_final %>%
mutate(retention = round(retention * 100,2) ) %>%
ggplot() +
geom_point(aes(x = day , y = retention)) +
geom_line(aes(x = day , y = retention )) +
geom_hline(aes(yintercept = mean(retention)), color = "green") +
geom_smooth(aes(x = day,
y = retention),
method = "lm" ,
color = "blue",
se = FALSE) +
ylab(label = "Retention %") +
scale_x_continuous(breaks=seq(0,20,1)) +
scale_y_continuous(breaks=seq(40,75,5)) +
theme_light() +
ggtitle("Retention Day7")
ggplotly(g_7day_ret)
The patterns of the day-1, day-3, and day-7 retentions are similar. The highest retention belongs to the players of the launch day. Is it because of a promotion? Because of their advantage over the users who joined the game afterwards?
It is important to me to investigate the last session of the players before they abandon the game. Is there any specific reason that causes them to leave the game, or is it because the game loses its attraction?
Churn can help here.
I define a churned user as a user who has not played the game for 10 consecutive days. It is possible to define this metric in various ways.
churn_final <- data.frame(day = NA, churn = NA)
#can be speeded up using parallel package
for (d in 1:10) {
d_i <- new_users %>% filter(day == d)
d_ii <- retention_data %>% filter(day > d & day <= (d+10))
churn_final[d,1] <- d
churn_final[d,2] <- mean(!(d_i$user_id %in% d_ii$user_id))
}
g_10day_churn <- churn_final %>%
mutate(churn = round(churn * 100,2) ) %>%
ggplot() +
geom_point(aes(x = day , y = churn)) +
geom_line(aes(x = day , y = churn )) +
geom_hline(aes(yintercept = mean(churn)), color = "green") +
geom_smooth(aes(x = day,
y = churn),
method = "lm" ,
color = "blue",
se = FALSE) +
ylab(label = "Churn %") +
scale_x_continuous(breaks=seq(0,20,1)) +
scale_y_continuous(breaks=seq(20,55,5)) +
theme_light() +
ggtitle("10Days Churn")
ggplotly(g_10day_churn)
To validate the result, we can check the 10-day retention.
retention_final <- data.frame(day = NA, retention = NA)
#can be speeded up using parallel package
for (d in 1:10) {
d_i <- new_users %>% filter(day == d)
d_ii <- retention_data %>% filter(day <= (d+10) & day > d)
retention_final[d,1] <- d
retention_final[d,2] <- mean(d_i$user_id %in% d_ii$user_id)
}
g_10day_ret <- retention_final %>%
mutate(retention = round(retention * 100,2) ) %>%
ggplot() +
geom_point(aes(x = day , y = retention)) +
geom_line(aes(x = day , y = retention )) +
geom_hline(aes(yintercept = mean(retention)), color = "green") +
geom_smooth(aes(x = day,
y = retention),
method = "lm" ,
color = "blue",
se = FALSE) +
ylab(label = "Retention %") +
scale_x_continuous(breaks=seq(0,20,1)) +
scale_y_continuous(breaks=seq(20,80,5)) +
theme_light() +
ggtitle("Retention 10days")
ggplotly(g_10day_ret)
The 10-day churn is lowest for the first couple of days, then it increases to ~50% and remains steady. This trend points back to the previously asked question: what is special about the first-day users that gives them the lowest churn rate and the highest retention rate?
Unfortunately, the users dataset has nothing in common with the transactions and interactions datasets!
So what I can do is focus on the two latter datasets.
concept_avg_price <- transactions_tagged %>%
ungroup() %>%
filter(suspicious == FALSE) %>%
group_by(concept) %>%
summarise(price = mean(eur))
concept_price_data <- transactions_tagged %>%
ungroup() %>%
filter(suspicious == FALSE) %>%
count(concept) %>%
inner_join(concept_avg_price, by = "concept")
g_concept_price <- concept_price_data %>%
ggplot() +
geom_point(aes(x = price , y = nn), alpha = 0.5) +
theme_light() +
geom_text(aes(x = price , y = nn, label = concept )
, alpha = 0) +
ylab("Purchase Frequency of the Concept") +
xlab("Concept Price") +
ggtitle("Price vs Purchase Frequency")
ggplotly(g_concept_price)
So identical concepts have different prices; I therefore used the mean price of each concept. The scatterplot shows that concepts priced above 15 euros are not very popular. However, the expensive concepts may be more profitable.
churn_vec <- vector()
for (d in 1:10) {
d_i <- new_users %>% filter(day == d)
d_ii <- retention_data %>% filter(day > d & day <= (d+10))
c <- d_i %>%
filter(!(d_i$user_id %in% d_ii$user_id))
churn_vec <- append(x = churn_vec , values = c$user_id)
}
# first and last session day
first_interaction <- interactions_data %>%
ungroup() %>%
mutate(day = day(time)) %>%
arrange(day) %>%
select(user_id,interaction,day) %>%
distinct(user_id, .keep_all = TRUE)
last_interaction <- interactions_data %>%
ungroup() %>%
mutate(day = day(time)) %>%
arrange(desc(day)) %>%
select(user_id,interaction,day) %>%
distinct(user_id, .keep_all = TRUE)
churn_10days_users <- first_interaction %>%
filter(day <=10)
churn_10days_users <- churn_10days_users$user_id
#length(churn_10days_users)
interactions_churn_data<- interactions_data %>%
filter(user_id %in% churn_10days_users) %>%
mutate(churn = ifelse(user_id %in% churn_vec, TRUE, FALSE))
# test to see the churn rate
# interactions_churn_data %>%
# select(user_id, churn) %>%
# distinct() %>%
# summarise(mean(churn))
So we have tagged the users of interactions_data with whether they churned or not. Now we can investigate the relation between interaction type and churn in general, and the last interaction and churn in particular. Moreover, using the transactions data, we can further investigate the relation between churn and the transaction variables, and finally build a predictive model of churn.
churn_heatmap_df<- interactions_churn_data %>%
distinct(user_id, .keep_all = TRUE) %>%
select(interaction,churn) %>%
table() %>%
prop.table(margin = 1) %>%
data.frame()
g_churn_heat <- ggplot(churn_heatmap_df, aes(x = interaction, y = churn )) +
geom_tile(aes(fill = Freq) , color = "white") +
scale_fill_gradient(low = "blue", high = "red") +
ylab("churn") +
xlab("interactions") +
theme(legend.title = element_text(size = 10),
legend.text = element_text(size = 12),
plot.title = element_text(size=16),
axis.title=element_text(size=14,face="bold"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(fill = "Frequency")
ggplotly(g_churn_heat)
The heatmap does not give us decisive information. It shows that the probability of churn is roughly constant given that a user has interacted with items, multiplayer, or singleplayer.
I guess we can get further information with an alluvial diagram.
alluvial_df <- interactions_churn_data %>%
group_by(user_id,interaction) %>%
count() %>%
spread(key = interaction, value = n, fill = 0 ) %>%
mutate(item_interaction = ifelse(test = item_interaction != 0,
yes = 1, no = 0) ) %>%
mutate(multiplayer_mission =ifelse( test = multiplayer_mission != 0,
yes = 1, no = 0) ) %>%
mutate(singleplayer_mission =ifelse( test = singleplayer_mission != 0,
yes = 1, no = 0) ) %>%
distinct()
alluvial_df <- alluvial_df %>%
inner_join( y = interactions_churn_data %>%
select(user_id,churn) %>%
distinct(),
by = "user_id")
alluvial_df$item_interaction <- factor(alluvial_df$item_interaction)
alluvial_df$multiplayer_mission <- factor(alluvial_df$multiplayer_mission)
alluvial_df$singleplayer_mission <- factor(alluvial_df$singleplayer_mission)
alluvial_df <- alluvial_df %>%
group_by(item_interaction,multiplayer_mission,
singleplayer_mission,churn) %>%
count()
alluvial_df <- alluvial_df[,c(3,2,1,4,5)]
alluvial(alluvial_df[,1:4], freq = alluvial_df$n,
col = ifelse(alluvial_df$churn == TRUE, "gold", "grey"),
border = ifelse(alluvial_df$churn == TRUE, "gold", "grey"),
cex = 0.7)
The alluvial diagram shows some interesting patterns. For instance, the absolute majority of the players who play multiplayer do not churn within 10 days. On the other hand, the absolute majority of the users who do not play singleplayer churn within 10 days.
# interactions_churn_data %>%
# select(interaction,churn) %>%
# count(interaction,churn)
last_session_churn <- last_interaction %>%
filter(day <= 10) %>%
distinct() %>%
filter(user_id %in% churn_10days_users) %>%
mutate(churn = ifelse(user_id %in% churn_vec, TRUE, FALSE)) %>%
select(interaction,churn) %>%
count(interaction,churn)
last_session_churn<- last_session_churn[1:6,]
colnames(last_session_churn) <- c("last interaction","churn","n")
alluvial(last_session_churn[,1:2], freq = last_session_churn$n,
col = ifelse(last_session_churn$churn == TRUE, "gold", "grey"),
border = ifelse(last_session_churn$churn == TRUE, "gold", "grey"),
cex = 0.7)
# last_session_churn <- first_interaction %>%
# filter(day <= 10) %>%
# distinct() %>%
# filter(user_id %in% churn_10days_users) %>%
# mutate(churn = ifelse(user_id %in% churn_vec, TRUE, FALSE)) %>%
# select(interaction,churn) %>%
# count(interaction,churn)
#
# last_session_churn<- last_session_churn[1:6,]
# colnames(last_session_churn) <- c("last interaction","churn","n")
# alluvial(last_session_churn[,1:2], freq = last_session_churn$n,
# col = ifelse(last_session_churn$churn == TRUE, "gold", "grey"),
# border = ifelse(last_session_churn$churn == TRUE, "gold", "grey"),
# cex = 0.7)
The above plot shows the relation between the last interaction and churn. It seems that there is no specific pattern, and no relation between the last interaction type and leaving the game. However, further investigation would help. For instance, at which level do single-player users leave the game?
# transactions_tagged %>%
# inner_join(interactions_churn_data)
alluvial_df <- interactions_churn_data %>%
group_by(user_id,interaction) %>%
count() %>%
spread(key = interaction, value = n, fill = 0 ) %>%
mutate(item_interaction = ifelse(test = item_interaction != 0,
yes = 1, no = 0) ) %>%
mutate(multiplayer_mission =ifelse( test = multiplayer_mission != 0,
yes = 1, no = 0) ) %>%
mutate(singleplayer_mission =ifelse( test = singleplayer_mission != 0,
yes = 1, no = 0) ) %>%
distinct()
alluvial_df <- alluvial_df %>%
inner_join( y = interactions_churn_data %>%
select(user_id,churn) %>%
distinct(),
by = "user_id")
a<- alluvial_df %>%
inner_join(y = (transactions_tagged %>%
filter(suspicious == FALSE)%>%
select(user_id,quantile,payment ))
, by = "user_id")
a$item_interaction <- factor(a$item_interaction)
a$multiplayer_mission <- factor(a$multiplayer_mission)
a$singleplayer_mission <- factor(a$singleplayer_mission)
a <- a %>%
group_by(item_interaction,multiplayer_mission,
singleplayer_mission,churn,quantile) %>%
count()
alluvial(a[,c(3,2,1,5,4)], freq = a$n,
col = ifelse(a$churn == TRUE, "gold", "grey"),
border = ifelse(a$churn == TRUE, "gold", "grey"),
cex = 0.7)
Seemingly there is no clear relation between interaction, quantile, and churn. No pattern is detected.
Now some heatmaps to reveal the paying patterns of our users.
t<- transactions_tagged %>%
filter(suspicious== FALSE) %>%
select(quantile,payment) %>%
table() %>%
data.frame()
g_heatmap <- ggplot(t, aes(x = quantile, y = payment )) +
geom_tile(aes(fill = Freq) , color = "white") +
scale_fill_gradient(low = "blue", high = "red") +
ylab("payment") +
xlab("quantile") +
theme(legend.title = element_text(size = 10),
legend.text = element_text(size = 12),
plot.title = element_text(size=16),
axis.title=element_text(size=14,face="bold"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(fill = "Frequency") +
ggtitle("Freq of Purchase vs. Revenue")
ggplotly(g_heatmap)
It is interesting that most of our paying users prefer to make 1 to 10 purchases with a total payment below 100 euros.
# Churn vs payment alluvial
# a <- transactions_tagged %>%
# filter(suspicious==FALSE) %>%
# right_join(y = interactions_churn_data, by = "user_id") %>%
# select(payment,quantile,churn) %>%
# group_by(payment, churn) %>%
# count()
#
#
#
# alluvial(a[,c(1,2)], freq = a$n,
# col = ifelse(a$churn == TRUE, "gold", "grey"),
# border = ifelse(a$churn == TRUE, "gold", "grey"),
# cex = 0.7)
# churn vs payment heatmap
t <- transactions_tagged %>%
filter(suspicious==FALSE) %>%
inner_join(y = interactions_churn_data, by = "user_id") %>%
select(payment,quantile,churn)
t<- t %>%
select(payment, churn) %>%
table() %>%
data.frame()
g_heatmap <- ggplot(t, aes(x = payment, y = churn )) +
geom_tile(aes(fill = Freq) , color = "white") +
scale_fill_gradient(low = "blue", high = "red") +
ylab("churn") +
xlab("payment") +
theme(legend.title = element_text(size = 10),
legend.text = element_text(size = 12),
plot.title = element_text(size=16),
axis.title=element_text(size=14,face="bold"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(fill = "Frequency") +
ggtitle("Payment vs. Churn")
ggplotly(g_heatmap)
At this point, in order to show some machine learning applications, I use logistic regression to predict churn within 10 days, based on the first-day activities.
#library(parallel)
churn_list <- list()
transactions$user_id <- as.integer(as.character(transactions$user_id))
transactions_tagged$user_id <- as.integer(as.character(transactions_tagged$user_id))
# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(no_cores)
t <- mclapply(X = 1:10 , FUN = function(d){
temp <- transactions %>%
filter(suspicious == FALSE) %>%
mutate(day = day(time)) %>%
filter(day == d) %>%
filter(user_id %in% new_users$user_id[new_users$day==d] ) %>%
group_by(user_id ) %>%
mutate(avg_purchase = mean(eur), tot_purchase = sum(eur)) %>%
ungroup() %>%
select(user_id,avg_purchase,tot_purchase) %>%
distinct()
# nrow(temp)
churn_list[[d]] <- interactions_churn_data %>%
mutate(day = day(time)) %>%
filter(day == d) %>%
filter(user_id %in% new_users$user_id[new_users$day==d] ) %>%
group_by(user_id,interaction) %>%
count() %>%
spread(key = interaction, value = n, fill = 0 ) %>%
distinct() %>%
inner_join(temp, by = "user_id")
})
# for (d in 1:10 ){
#
#
# temp <- transactions_tagged %>%
# filter(suspicious == FALSE) %>%
# mutate(day = day(time)) %>%
# filter(day == d) %>%
# filter(user_id %in% new_users$user_id[new_users$day==d] ) %>%
# group_by(user_id ) %>%
# mutate(avg_purchase = mean(eur), tot_purchase = sum(eur)) %>%
# ungroup() %>%
# select(user_id,avg_purchase,quantile,tot_purchase) %>%
# distinct()
#
# #nrow(temp)
#
# churn_list[[d]] <- interactions_churn_data %>%
# mutate(day = day(time)) %>%
# filter(day == d) %>%
# filter(user_id %in% new_users$user_id[new_users$day==d] ) %>%
# group_by(user_id,interaction) %>%
# count() %>%
# spread(key = interaction, value = n, fill = 0 ) %>%
# distinct() %>%
# inner_join(temp, by = "user_id")
#
#
# }
#new_churn_df <- do.call(rbind, churn_list)
new_churn_df <- do.call(rbind, t)
new_churn_df <- new_churn_df %>%
inner_join( y = interactions_churn_data %>%
select(user_id,churn) %>%
distinct(),
by = "user_id")
stopCluster(cl)
#head(new_churn_df)
#--------
# d <- 1
# transactions %>%
# filter(suspicious == FALSE) %>%
# mutate(day = day(time)) %>%
# filter(day == d) %>%
# filter(user_id %in% new_users$user_id[new_users$day==d] ) %>%
# group_by(user_id ) %>%
# mutate(avg_purchase = mean(eur), tot_purchase = sum(eur)) %>%
# ungroup() %>%
# select(user_id,avg_purchase,tot_purchase) %>%
# distinct()
What I want to do is build a model to predict the 10-day churn based on the first-day activity of a new user.
set.seed(7)
train_vec <- sample.split(Y = new_churn_df$churn)
train_df <- new_churn_df[train_vec,]
test_df <- new_churn_df[!train_vec, ]
logit_model <- glm(data = train_df ,
formula = churn ~ . ,
family = "binomial")
summary(logit_model)
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8971 -0.5835 -0.3733 -0.1491 3.4652
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.232e-01 7.309e-01 0.305 0.760109
## user_id -1.165e-08 1.269e-08 -0.918 0.358650
## item_interaction -3.842e-01 1.075e-01 -3.575 0.000350 ***
## multiplayer_mission -3.125e-01 1.421e-01 -2.200 0.027825 *
## singleplayer_mission -4.778e-01 8.937e-02 -5.347 8.97e-08 ***
## `<NA>` -1.917e-01 1.538e-01 -1.246 0.212663
## avg_purchase -7.283e-02 2.048e-02 -3.555 0.000377 ***
## tot_purchase 1.814e-02 5.112e-03 3.549 0.000386 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 917.27 on 1281 degrees of freedom
## Residual deviance: 796.49 on 1274 degrees of freedom
## AIC: 812.49
##
## Number of Fisher Scoring iterations: 6
logit_pred <- predict(object = logit_model,
newdata = test_df ,
type = "response" )
confusionMatrix(data = as.integer(logit_pred>0.1) ,
reference = as.integer(test_df$churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 295 15
## 1 272 59
##
## Accuracy : 0.5523
## 95% CI : (0.5128, 0.5912)
## No Information Rate : 0.8846
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1265
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5203
## Specificity : 0.7973
## Pos Pred Value : 0.9516
## Neg Pred Value : 0.1782
## Prevalence : 0.8846
## Detection Rate : 0.4602
## Detection Prevalence : 0.4836
## Balanced Accuracy : 0.6588
##
## 'Positive' Class : 0
##
roc_pred <- prediction(predictions = logit_pred , labels = test_df$churn)
roc_perf <- performance(roc_pred , "tpr" , "fpr")
plot(roc_perf,
colorize = TRUE,
print.cutoffs.at= seq(0,1,0.05),
text.adj=c(-0.2,1.7))
At first the ordering of the reported sensitivity and specificity looks odd, since 59/(59+15)=0.80 is the true positive rate for the churn class; the reason is that caret's confusionMatrix() takes the first factor level, here 0 (not churned), as the positive class by default, so the churn-class true positive rate is reported as specificity. Either way, the model works well at predicting churn; in other words, it is very good at flagging a user who will not come back to the app within the churn window when the user actually does not return. The false positive rate of the model is a little high, but that may not be as important to us as false negatives.
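To have caret report the metrics with churn as the positive class, the positive level can be set explicitly; the AUC can also be pulled from the ROCR objects above. A minimal sketch (assuming logit_pred, test_df, and roc_pred from the chunks above, and the same 0.1 cutoff):
# Hypothetical re-run of the confusion matrix with churn ("1") as the positive class.
pred_class <- factor(as.integer(logit_pred > 0.1), levels = c("0", "1"))
ref_class <- factor(as.integer(test_df$churn), levels = c("0", "1"))
confusionMatrix(data = pred_class, reference = ref_class, positive = "1")
# Threshold-free summary of the same model: area under the ROC curve.
performance(roc_pred, "auc")@y.values[[1]]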
Due to lack of data, many questions remained unanswered. However, based on the data I had, these are the insights that I gained and the evidence that I found.
From the data engineering team, I would like to ask about the reasons behind the missing items in the interactions dataset. Is it possible for a player to do something besides playing singleplayer, playing multiplayer, and interacting with the item section? Why are there so many missing values in the user interactions?
The users without any specific interactions, i.e. missing values, may be users who experience an app crash each time they start the game. This is a very serious issue that could be checked with the game logs.
The payments are also very suspicious. There are myriad strange payments, from negative values to millions of euros. Of course, both are irrational, but what is the reason? Bugs in the system? Can players buy items without actually paying for them?
I would like to talk with the game designers about the users' preference for singleplayer_mission over multiplayer, i.e. most of the players play single player. Is it something expected due to the nature of the game?
According to the alluvial diagrams in the further investigation section, most of the players who did not play single-player but played multiplayer churned after 10 days. However, most of the players who play both campaigns remain in the game. Moreover, if a player played multiplayer, it is very likely that he/she remains in the game. Are these findings helpful in redesigning the incentives to play multiplayer or even single player? We can test such changes by A/B testing.
Also, there are some very popular items to purchase, e.g. item_01, item_03, boost_12, and item_11, and very unpopular items, such as most of the bundles. What are the reasons behind such popularity? Are the items and their effects on the game imbalanced? The popular ones can be used for promotions, since they seem more attractive to users, for whatever reason.
On Tuesdays we have the fewest players, so maybe with some promotions and incentives we can make Tuesdays more popular! It is important to engage users with the game daily, I guess.
The change in the behaviour of our paying users is interesting. The number of paying users is going down, while the average revenue per paying user is steadier. It means that the remaining paying users are paying more and more, probably making the game imbalanced. Having the game logs would help here.
Day-1 retention for the first day is very high, in line with the day-3 and day-7 retentions of the first-day users. Does it mean that there is some sort of advantage for the first-day users over the others which makes the game imbalanced? Rationally, the game should be equally attractive for the first-day users and the users who joined the game on the second day. Or maybe these are players from different countries or different platforms? Churn for the day-1 users is also relatively lower.
Finally, the marketing team may be able to help in understanding the decline of new DAU after the first day! Maybe the game is not visible enough except on the first day?
I wish I knew more about the game, so I could come up with more accurate questions. It is very important for a data scientist to have intuition about the game itself.