The user engagement data provided by Showwcase records key metrics for each user over October 2019. The analysis below looks for insights into user engagement and for significant trends that may benefit Showwcase's business strategy.

Getting Started

The dataset is based on 300 log-ins made by 48 distinct users. The chart below shows the number of log-ins per day. Every log-in is counted, including repeat log-ins by the same user on the same day.

With a daily mean of 10 log-ins and a standard deviation of 4.04, one data point (October 26th) appears at first glance to be an outlier. We will discuss this further in the next section. Besides plotting the daily number of log-ins, one can also analyze the number of log-ins by each Customer ID, as seen below:

The average number of log-ins per customer ID is 6.25, with a standard deviation of 6.06. Since the standard deviation is nearly as large as the mean, the spread between users is relatively wide. To follow up on this finding, we can try to group the users by their key metrics and check whether any natural grouping exists.
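The per-user figures above can be reproduced from a raw log-in table in a few lines. The Python sketch below is illustrative only: the function name `logins_per_user` and the toy log are hypothetical, the original analysis appears to have been done in R, and the sample (n - 1) standard deviation is an assumption because the source does not say which variant it used.

```python
from collections import Counter
from statistics import mean, stdev


def logins_per_user(login_user_ids):
    """Return (mean, sample standard deviation) of log-in counts per user.

    `login_user_ids` holds one entry per log-in event, so a user who logs in
    three times appears three times.
    """
    per_user = list(Counter(login_user_ids).values())
    return mean(per_user), stdev(per_user)


# Hypothetical toy log, not the Showwcase data: user "a" logs in three times,
# "b" once, and "c" twice.
toy_log = ["a", "a", "a", "b", "c", "c"]
avg, sd = logins_per_user(toy_log)
```

A standard deviation close to the mean, as in the report (6.06 against 6.25), signals that a few heavy users account for a large share of the log-ins.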

User Grouping and Session Duration Modeling

Earlier, we saw that there is a relatively wide spread between users in the number of log-ins made. To better understand whether there exists any commonality between these users, we can try grouping such users and note their activities online.

Each point in the graph above represents one user (48 users = 48 points). The number of engagements is the sum of the likes, comments, and projects that the user posted during October. Likewise, the total session duration is the cumulative time the user spent on the platform during October. Some points cluster in the lower-left corner and others in the upper-right corner, but we want to know whether there is an optimal number of groups k such that the groups accurately reflect each user's level of engagement. The method we use here is k-means: we split the dataset into k groups for varying k and examine two outputs. The first is the total distance from each data point to its assigned cluster center, which shrinks as k grows. The second is the R-squared value, the share of the total variation explained by the clusters, which rises as k grows. Because both measures keep improving as clusters are added, we look for the value of k beyond which the improvement levels off.
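The two quantities tracked for each k can be sketched as follows. This is a minimal pure-Python version of Lloyd's k-means algorithm, not the author's code: the helper names (`kmeans`, `fit_quality`), the seeded random initialization, and the toy points are all assumptions, and the "total distance" is taken here to be the within-cluster sum of squared distances, with R-squared defined as 1 minus that sum divided by the total sum of squares.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = centroid(members)
    return centers, labels


def fit_quality(points, centers, labels):
    """Return (within-cluster sum of squares, R-squared)."""
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    grand = centroid(points)
    tss = sum(sq_dist(p, grand) for p in points)
    return wss, 1 - wss / tss


# Hypothetical (engagements, session duration) pairs, not the Showwcase data.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in range(1, 5):
    centers, labels = kmeans(toy_points, k)
    wss, r2 = fit_quality(toy_points, centers, labels)
    print(k, round(wss, 3), round(r2, 3))
```

Plotting `wss` and `r2` against k gives the two curves in which the elbow is read. In practice the two features would usually be put on a comparable scale first, since session duration and engagement counts are measured in very different units.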

The two graphs above are elbow diagrams, in which we look for the 'kink' in the curve to determine the ideal number of clusters. The kink marks the point where the marginal benefit of adding another cluster becomes small. In the R-squared against clusters diagram, the elbow occurs at k = 4 or k = 5. In the total distance against clusters diagram, k = 5 appears to be a reasonable value. Based on these two graphs, we therefore divide the 48 users into five groups. The plot below shows the five groups with their corresponding data points (i.e. users).

The five clusters above represent the levels of engagement that users displayed in October, with one being the most engaged and five the least engaged. Future analyses can examine whether the same pattern holds when the dataset is enlarged to a greater number of users. Showwcase could then target specific groups to generate more engagement.

Examining the above graph further, we can ask whether it is possible to model the time a user spends as a function of engagement activity. At a glance, there seems to be a positive correlation between total session duration and total number of engagements, which is confirmed by the findings below:

## 
## Call:
## lm(formula = customer_tot_session_dur ~ 0 + customer_tot_engagements, 
##     data = df5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7095.8 -1258.3  -261.2  1260.0  9846.0 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## customer_tot_engagements   81.067      2.964   27.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2587 on 47 degrees of freedom
## Multiple R-squared:  0.9409, Adjusted R-squared:  0.9396 
## F-statistic: 747.9 on 1 and 47 DF,  p-value: < 2.2e-16

We find that, with a coefficient of determination of 94.09%, the cumulative time a user spends in a month can be modeled as Session Time = 81.067*(Number of Engagements). The confidence band also shows that as the number of engagements increases, the variability in the time spent increases as well. Furthermore, the model assumes that a user who spends no time on the platform performs no engagements. Session time and number of engagements are therefore modeled as proportional to each other (i.e. the fitted line passes through the origin).
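For a single predictor and no intercept, the least-squares slope has a closed form: the sum of x*y divided by the sum of x squared. The Python sketch below shows that calculation together with the R-squared that R reports for no-intercept models, which measures variation around zero rather than around the mean of y and so is not directly comparable to the R-squared of a model with an intercept. The function name `fit_through_origin` and the toy numbers are hypothetical.

```python
def fit_through_origin(x, y):
    """Least-squares fit of y = slope * x (no intercept).

    Returns (slope, uncentered R-squared), where the uncentered R-squared is
    1 - RSS / sum(y_i^2), the definition used when the intercept is omitted.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    slope = sxy / sxx
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    r2_uncentered = 1 - rss / sum(b * b for b in y)
    return slope, r2_uncentered


# Hypothetical engagement counts and session durations, not the Showwcase data.
engagements = [1, 2]
durations = [1, 3]
slope, r2 = fit_through_origin(engagements, durations)
```

In the report's model the same calculation yields the slope of 81.067 per engagement, in whatever time unit the session durations were recorded.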

With this in mind, we want to break the result down by component. Using regression, we can split the engagements into the number of likes, comments, and projects to determine the impact each has on time spent in Showwcase. Note that we need to scale each engagement type to the same range so that every component is weighted equally. Otherwise, the number of likes and comments would dominate the model. Our approach is to scale the number of projects, likes, comments, and time spent to [0,1], with 0 being the lowest value of the respective attribute and 1 the highest.
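The [0,1] scaling described here is commonly called min-max scaling: subtract the attribute's minimum and divide by its range. A minimal Python sketch follows; the function name `min_max_scale` and the handling of a constant column (mapped to all zeros) are assumptions, not something the source specifies.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-user like counts, not the Showwcase data.
scaled_likes = min_max_scale([2, 4, 6])
```

Each attribute is scaled independently, so a coefficient in the resulting model describes the effect of moving from that attribute's lowest observed value to its highest.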

## 
## Call:
## lm(formula = customer_tot_session_dur ~ 0 + customer_tot_likes + 
##     customer_tot_comments + customer_tot_projects, data = df8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14794 -0.04704 -0.00179  0.03668  0.29171 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## customer_tot_likes     0.42868    0.11693   3.666 0.000648 ***
## customer_tot_comments  0.55155    0.12112   4.554    4e-05 ***
## customer_tot_projects -0.04402    0.10242  -0.430 0.669398    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07879 on 45 degrees of freedom
## Multiple R-squared:  0.9513, Adjusted R-squared:  0.948 
## F-statistic: 292.9 on 3 and 45 DF,  p-value: < 2.2e-16

From the model summary, with an R-squared value of 95.13%, we can see that both likes and comments contribute positively and significantly to the time spent in Showwcase. The coefficient for projects is slightly negative (-0.044), but with a p-value of 0.669 it is not statistically significant, so the data do not show that posting projects reduces the time a user spends in Showwcase. While we only model the data above using multiple linear regression (MLR), other regression approaches should be examined to possibly uncover better insights into which engagement interactions are significant. Furthermore, testing the model against more attributes (e.g. number of unique page visits per session, user's first date as a member, etc.) would be another way to dig deeper into user engagement.
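The t values in the summary table determine which coefficients count as significant, and they come from dividing each estimate by its standard error. The Python sketch below fits a no-intercept multiple regression through the normal equations and returns those t values. It is an illustration under stated assumptions: the helper names (`invert`, `ols_no_intercept`) and the toy data are hypothetical, and a statistics library would be the usual choice for real work.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_no_intercept(rows, y):
    """Fit y = X b without an intercept; return (coefficients, t_values)."""
    n, p = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    inv = invert(xtx)
    coefs = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    rss = sum(
        (t - sum(c * v for c, v in zip(coefs, r))) ** 2 for r, t in zip(rows, y)
    )
    sigma2 = rss / (n - p)  # residual variance on n - p degrees of freedom
    t_values = [c / math.sqrt(sigma2 * inv[i][i]) for i, c in enumerate(coefs)]
    return coefs, t_values


# Hypothetical scaled predictors (two columns) and response, not the report's data.
toy_rows = [(1, 0), (0, 1), (1, 1), (2, 1)]
toy_y = [2.1, -0.1, 2.0, 4.0]
coefs, t_values = ols_no_intercept(toy_rows, toy_y)
```

In the report's table, the projects row has a t value of -0.430, far smaller in magnitude than those for likes (3.666) and comments (4.554), which is why its negative sign should not be read as a real effect.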

Summary

As a tech worker community platform, Showwcase showed positive trends in October 2019. First, we analyzed the outlier in the number of log-ins on October 26th. There were indeed many likes, comments, and projects posted on that day, which happened to be a Saturday. While not every weekend day mimics the pattern we saw on October 26th, Change-Detection analysis showed that the number of comments rose from the beginning to the end of the month. Moreover, we were able to group the 48 users into 5 groups that indicate the different levels of engagement intensity shown by the users. Lastly, we used regression to measure the contribution of each type of engagement to the time a user spends on the website. The number of likes and comments had a significant positive impact on session duration, while the projects coefficient was slightly negative but not statistically significant. In the future, Showwcase can validate this result against more data points and attributes to uncover more insights and better understand the patterns of its users.