Showwcase User Engagement Analysis

The user engagement data provided by Showwcase records key metrics for each user over the month of October 2019. The analysis below looks for insights on user engagement and highlights notable trends that may be beneficial to Showwcase's business strategy.

Getting Started

The dataset is based on the 300 log-ins made by 48 distinct users. The frequency of daily log-ins can be seen in the chart below. Every log-in is counted, even when the same user logs in more than once on the same day.

With a daily mean of 10 log-ins and a standard deviation of 4.04, one data point (October 26th) appears at first glance to be an outlier. We will discuss this further in the next section. Besides plotting the daily number of log-ins, one can also analyze the number of log-ins by each Customer ID as seen below:

The average number of log-ins made by each customer ID is 6.25 with a standard deviation of 6.06. This suggests a relatively wide spread between users. To follow up on these findings, we can try grouping the users based on the key metrics and check whether any natural grouping exists.
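The two summaries above can be sketched in base R as follows. This is a minimal illustration on synthetic data: the `sessions` data frame, its column names (`customer_id`, `login_date`), and the 30-day range (inferred from 300 log-ins at a daily mean of 10) are assumptions, not the real Showwcase log.

```r
# Synthetic stand-in for the session log: one row per log-in.
# Column names and values are assumptions, not the real dataset.
set.seed(1)
all_days  <- seq(as.Date("2019-10-01"), by = "day", length.out = 30)
all_users <- 1:48
sessions <- data.frame(
  customer_id = sample(all_users, 300, replace = TRUE),
  login_date  = sample(all_days, 300, replace = TRUE)
)

# Log-ins per day: every row counts, including repeat log-ins by one user.
# Explicit factor levels keep days with zero log-ins in the count.
daily_logins <- as.vector(table(factor(as.character(sessions$login_date),
                                       levels = as.character(all_days))))

# Log-ins per customer ID, again keeping users with zero log-ins.
logins_per_user <- as.vector(table(factor(sessions$customer_id,
                                          levels = all_users)))

daily_summary <- c(mean = mean(daily_logins), sd = sd(daily_logins))
user_summary  <- c(mean = mean(logins_per_user), sd = sd(logins_per_user))
print(daily_summary)
print(user_summary)
```

With 300 rows spread over 30 days and 48 users, the means are 10 and 6.25 by construction; the standard deviations depend on the synthetic draw and will not match the reported values.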

Investigating Trends in Daily Number of Log-ins

To verify whether the data point on October 26 is an outlier, we can visualize it using a boxplot and test it using the grubbs.test function (from the outliers package).

## 
##  Grubbs test for one outlier
## 
## data:  df1$login_count
## G = 2.96819, U = 0.68573, p-value = 0.01908
## alternative hypothesis: highest value 22 is an outlier

Using an alpha = 0.05 significance level, the test rejects the null hypothesis (p = 0.019), which supports treating the October 26 data point as an outlier. Instead of discarding it, Showwcase can investigate further by exploring events that might have occurred on that day (e.g. a popular project was added, which caused people to log in). Using the data available, one can also check whether the number of comments, likes, and projects posted increased on that same day.
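The Grubbs statistics reported above can be checked by hand. The sketch below uses only base R: G is the distance of the maximum from the mean in units of the sample standard deviation, and the p-value uses the t-distribution-based formula commonly given for this test. The helper names `grubbs_stat` and `grubbs_p` are hypothetical, n = 30 is inferred from 300 log-ins at a daily mean of 10, and that the outliers package computes its p-value in exactly this way is an assumption.

```r
# Grubbs statistic for the largest value: (max - mean) / sd.
grubbs_stat <- function(x) (max(x) - mean(x)) / sd(x)

# One-sided p-value from the t-distribution-based formula,
# capped at 1 because it is a Bonferroni-style bound.
grubbs_p <- function(G, n) {
  t_val <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
  min(1, n * pt(t_val, df = n - 2, lower.tail = FALSE))
}

# G is taken from the test output; n = 30 days is an inferred assumption.
G <- 2.96819
n <- 30

# U as reported by the test can be recovered from G and n.
U <- 1 - n * G^2 / (n - 1)^2
p <- grubbs_p(G, n)

print(c(G_from_summary = (22 - 10) / 4.04, U = U, p = p))
```

The first printed value recomputes G from the rounded summary figures (maximum 22, mean 10, standard deviation 4.04), so it should land close to, but not exactly on, the reported G.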

The graph above summarizes the total number of comments, likes, and projects posted on each date. As expected, a lot of engagement occurred on October 26th, which coincides with the relatively high number of log-ins. Moreover, the graph follows a cyclical pattern: engagement is relatively low on some days and high on others. The two highest engagement days, October 26 and October 5, both fall on a Saturday, whereas the lowest engagement day, October 14, falls on a Monday. One plausible explanation is that people have more spare time on weekends and are therefore more active on social media, which would account for the spikes.

One can also check whether the number of bugs increases as more users are online.

The bugs do not seem to occur randomly. The more users are online, the more bugs appear, which is expected. To understand the bugs further, data such as bug types would need to be included so that the data can be sliced more finely. Nevertheless, the number of bugs follows the user engagement pattern.

While the weekend pattern does not always hold (e.g. October 21st, a Monday, has the third-highest number of engagements), one can group the engagements by day of the week and test whether the means differ significantly between days. One method to do this is an ANOVA test.

The approach taken is to sum the number of comments, likes, and projects per day and separate the daily totals into 7 groups, one per day of the week. Then, we can test whether the group means differ significantly from one another. However, prior to the ANOVA test, one should check whether the data are approximately normally distributed, which can be assessed using a QQ plot. Moreover, to check homogeneity of variance across the groups, one can use Bartlett's test and observe its p-value.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  tot by days
## Bartlett's K-squared = 7.7195, df = 6, p-value = 0.2594

Judging from the QQ plot and the Bartlett test's p-value (0.2594 > 0.05), it is reasonable to assume normality and homogeneity of variance. Therefore, we can proceed with the ANOVA test.
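The assumption checks described above could be run along these lines. The names `tot`, `days`, and `df4` follow the test output, but the data here are synthetic, the weekday labels are rebuilt from the calendar dates, and checking normality on the ANOVA residuals is one common choice assumed here rather than something the report states.

```r
# Synthetic daily totals (comments + likes + projects) for October 1-30.
# The values are made up; only the structure mirrors the analysis.
set.seed(42)
dates <- seq(as.Date("2019-10-01"), by = "day", length.out = 30)
day_labels <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
df4 <- data.frame(
  tot  = round(rnorm(30, mean = 200, sd = 60)),
  # $wday is 0 for Sunday through 6 for Saturday, independent of locale.
  days = factor(day_labels[as.POSIXlt(dates)$wday + 1])
)

# Homogeneity of variance across the seven weekday groups.
bt <- bartlett.test(tot ~ days, data = df4)

# QQ coordinates of the one-way ANOVA residuals; plot.it = FALSE returns
# the points without drawing, set it to TRUE to see the plot.
fit <- aov(tot ~ days, data = df4)
qq  <- qqnorm(residuals(fit), plot.it = FALSE)

print(bt)
print(table(df4$days))
```

A large Bartlett p-value gives no evidence against equal variances, and QQ points lying close to a straight line are consistent with normal residuals.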

##             Df Sum Sq Mean Sq F value Pr(>F)
## days         6  43103    7184   1.545  0.208
## Residuals   23 106927    4649

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tot ~ days, data = df4)
## 
## $days
##               diff        lwr       upr     p adj
## Mon-Fri      12.50 -142.89968 167.89968 0.9999687
## Sat-Fri      65.50  -89.89968 220.89968 0.8172430
## Sun-Fri       9.25 -146.14968 164.64968 0.9999947
## Thurs-Fri   -25.50 -180.89968 129.89968 0.9980915
## Tues-Fri    -28.50 -175.92508 118.92508 0.9953077
## Wed-Fri     -62.90 -210.32508  84.52508 0.8089546
## Sat-Mon      53.00 -102.39968 208.39968 0.9219440
## Sun-Mon      -3.25 -158.64968 152.14968 1.0000000
## Thurs-Mon   -38.00 -193.39968 117.39968 0.9838775
## Tues-Mon    -41.00 -188.42508 106.42508 0.9694743
## Wed-Mon     -75.40 -222.82508  72.02508 0.6543214
## Sun-Sat     -56.25 -211.64968  99.14968 0.8994414
## Thurs-Sat   -91.00 -246.39968  64.39968 0.5075915
## Tues-Sat    -94.00 -241.42508  53.42508 0.4098086
## Wed-Sat    -128.40 -275.82508  19.02508 0.1166811
## Thurs-Sun   -34.75 -190.14968 120.64968 0.9898263
## Tues-Sun    -37.75 -185.17508 109.67508 0.9796646
## Wed-Sun     -72.15 -219.57508  75.27508 0.6970837
## Tues-Thurs   -3.00 -150.42508 144.42508 1.0000000
## Wed-Thurs   -37.40 -184.82508 110.02508 0.9805910
## Wed-Tues    -34.40 -173.39370 104.59370 0.9828630

While Saturdays often appear to have spikes, the ANOVA test does not support a day-of-week effect. At the alpha = 0.05 significance level, there is no significant difference in mean engagement between the seven days (p = 0.208). Moreover, none of the Tukey pairwise comparisons is significant. With only four or five observations per weekday, however, the test has limited power to detect such differences. The finding is corroborated by the plot of total engagements by day.
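The ANOVA and Tukey comparisons above could be reproduced with a sketch like the following. The formula `tot ~ days` and the name `df4` come from the output; the data are synthetic, so the resulting numbers will differ from the report. The last lines recompute the reported p-value from the printed mean squares and degrees of freedom.

```r
# Synthetic daily totals for October 1-30, grouped by weekday.
set.seed(42)
dates <- seq(as.Date("2019-10-01"), by = "day", length.out = 30)
day_labels <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
df4 <- data.frame(
  tot  = round(rnorm(30, mean = 200, sd = 60)),
  days = factor(day_labels[as.POSIXlt(dates)$wday + 1])
)

# One-way ANOVA: does mean engagement differ between weekdays?
fit <- aov(tot ~ days, data = df4)
anova_tab <- summary(fit)[[1]]
anova_p   <- anova_tab[["Pr(>F)"]][1]

# Tukey's honest significant differences for all weekday pairs.
tukey       <- TukeyHSD(fit, "days", conf.level = 0.95)
pair_table  <- tukey$days              # columns: diff, lwr, upr, p adj
significant <- pair_table[pair_table[, "p adj"] < 0.05, , drop = FALSE]

# Check of the reported table: F = Mean Sq (days) / Mean Sq (residuals),
# with the p-value taken from the F distribution on 6 and 23 df.
F_reported <- 7184 / 4649
p_reported <- pf(F_reported, df1 = 6, df2 = 23, lower.tail = FALSE)

print(anova_tab)
print(c(F = F_reported, p = p_reported))
```

Seven weekdays give 21 pairwise comparisons, which is why the Tukey table has 21 rows.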

Even though mean engagement does not differ significantly between days, one can analyze whether there is an increasing trend in the amount of engagement throughout October. A change-detection analysis helps us examine this.

The change-detection analysis above is done for each engagement type. The number of projects remained roughly constant throughout October, with no significant increases or decreases. Meanwhile, there were observable increases in the number of likes at the beginning and the end of the month, with a slight decrease in mid-October. Comments, by contrast, rose towards the end of October, which suggests an increasing trend in user commenting.
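The report does not say which change-detection method was used. One common choice is a one-sided CUSUM, and the sketch below is an assumption in that spirit rather than the author's procedure. The function `cusum_increase`, its parameters (`mu` for the baseline mean, `C` for the slack, `threshold` for the alarm level), and the toy series are all hypothetical.

```r
# One-sided CUSUM for detecting an upward shift.
# S_t = max(0, S_{t-1} + (x_t - mu - C)); an alarm is raised once S_t
# reaches the threshold.
cusum_increase <- function(x, mu, C, threshold) {
  s <- numeric(length(x))
  prev <- 0
  for (t in seq_along(x)) {
    prev <- max(0, prev + (x[t] - mu - C))
    s[t] <- prev
  }
  # which(...)[1] is NA when the threshold is never reached.
  list(s = s, first_alarm = which(s >= threshold)[1])
}

# Toy daily comment counts: flat at 10, then a jump to 20 from day 16.
comments <- c(rep(10, 15), rep(20, 15))
res <- cusum_increase(comments, mu = 10, C = 2, threshold = 20)

# The statistic stays at 0 through day 15, then grows by 8 per day,
# so the first alarm falls on day 18 for this toy series.
print(res$first_alarm)
```

The slack `C` keeps small fluctuations from accumulating, and the threshold trades detection speed against false alarms; both would need tuning on real data.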

User Grouping and Session Duration Modeling

Earlier, we saw that there is a relatively wide spread between users in the number of log-ins made. To better understand whether there exists any commonality between these users, we can try grouping such users and note their activities online.

Each point in the graph above denotes one user (48 users, 48 points). The number of engagements is calculated as the sum of the likes, comments, and projects that the user posted over the month of October. Likewise, the total session duration is the accumulated time a user spent on the platform over October. While some points are clustered in the lower-left corner and others in the upper-right corner, we want to know whether there is a number of groups k for which the grouping accurately depicts each user's level of engagement. The method used here is k-means: we split the dataset into k groups for varying k and examine two outputs. The first is the total distance of each data point to its assigned cluster center, which should stop falling appreciably once k is large enough. The second is the R-squared value, which should stop improving appreciably at a similar point.

The two graphs above are elbow diagrams, in which we look for the 'kink' in the curve to determine a suitable number of clusters. This 'kink' is where the marginal benefit of adding another cluster becomes small. In the diagram of R-squared against the number of clusters, the elbow occurs at around k = 4 or k = 5. In the diagram of total distance against the number of clusters, k = 5 appears to be a reasonable value. Therefore, based on these two graphs, five is an appropriate number of groups for the 48 users. The plot below shows the 5 groups with their corresponding data points (i.e. users).
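The two elbow curves can be computed with base R's `kmeans`, as in the sketch below. The column names follow the regression output later in the report, but the data are synthetic. Standardizing the columns with `scale()` and reading "R-squared" as between-cluster sum of squares divided by total sum of squares are both assumptions.

```r
# Synthetic users: a low-engagement and a high-engagement group.
set.seed(7)
users <- data.frame(
  customer_tot_engagements = c(rnorm(24, 20, 5), rnorm(24, 120, 15)),
  customer_tot_session_dur = c(rnorm(24, 1500, 400), rnorm(24, 9500, 1200))
)

# Standardize so that neither column dominates the distance (assumption).
x <- scale(users)

# Fit k-means for a range of k; nstart restarts reduce the risk of a
# poor local optimum.
ks <- 2:8
fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 25, iter.max = 50))

# Curve 1: total within-cluster sum of squares.
tot_within <- sapply(fits, function(f) f$tot.withinss)

# Curve 2: share of total variation explained by the clustering.
r_squared <- sapply(fits, function(f) f$betweenss / f$totss)

print(data.frame(k = ks, tot_within = tot_within, r_squared = r_squared))
```

Plotting `tot_within` and `r_squared` against `ks` gives the two elbow diagrams; the chosen k is where both curves flatten.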

The five clusters seen above denote the levels of engagement that users displayed in October, with cluster one being the most engaged and cluster five the least engaged. A future analysis could examine whether the same pattern holds when the dataset is enlarged to a greater number of users. Showwcase could then target specific groups to generate more engagement.

Examining the above graph further, we can ask whether it is possible to model the time that a user spends against their engagement activity. At a glance, there seems to be a positive correlation between the total session duration and the total number of engagements, which is confirmed by our findings below:

## 
## Call:
## lm(formula = customer_tot_session_dur ~ 0 + customer_tot_engagements, 
##     data = df5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7095.8 -1258.3  -261.2  1260.0  9846.0 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## customer_tot_engagements   81.067      2.964   27.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2587 on 47 degrees of freedom
## Multiple R-squared:  0.9409, Adjusted R-squared:  0.9396 
## F-statistic: 747.9 on 1 and 47 DF,  p-value: < 2.2e-16

We find that, with a coefficient of determination of 94.09%, the cumulative time a user spends in a month can be modeled as Session Time = 81.067*(Number of Engagements). Because the model has no intercept, this R-squared is computed relative to zero rather than to the mean, so it is not directly comparable to the R-squared of a model with an intercept. The confidence band also shows that as the number of engagements increases, the variability in the time spent increases. Furthermore, in the model we assumed that a user who does not spend any time will not perform any kind of engagement. Therefore, session time and number of engagements are modeled as proportional to each other (i.e. the line passes through the origin).
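A regression through the origin of this kind can be sketched as follows. The formula and the name `df5` mirror the output above, while the data are synthetic. The block also shows the closed-form slope and the uncentered R-squared that R reports when the intercept is dropped.

```r
# Synthetic users: session duration roughly proportional to engagements,
# with noise that grows with engagement (values are made up).
set.seed(3)
df5 <- data.frame(customer_tot_engagements = round(runif(48, 0, 200)))
df5$customer_tot_session_dur <-
  80 * df5$customer_tot_engagements +
  rnorm(48, sd = 1 + 15 * sqrt(df5$customer_tot_engagements))

# "0 +" removes the intercept, forcing the fitted line through the origin.
fit0  <- lm(customer_tot_session_dur ~ 0 + customer_tot_engagements, data = df5)
slope <- unname(coef(fit0))

# Closed form for a single predictor without intercept: sum(x*y) / sum(x^2).
slope_manual <- with(df5,
  sum(customer_tot_engagements * customer_tot_session_dur) /
    sum(customer_tot_engagements^2))

# Without an intercept, R-squared is measured against zero, not the mean.
r2_uncentered <- 1 - sum(residuals(fit0)^2) / sum(df5$customer_tot_session_dur^2)

print(c(slope = slope, slope_manual = slope_manual,
        r2_reported = summary(fit0)$r.squared, r2_uncentered = r2_uncentered))
```

Fitting the same data with an intercept and comparing the two models is a simple way to check whether the proportionality assumption is reasonable.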

With this in mind, we want to know the breakdown by component. Using regression, we can break down the engagements into the number of likes, comments, and projects to determine the impact each has on time spent in Showwcase. Note that we need to scale each engagement type to the same range so that the components are weighted equally. Otherwise, likes and comments would dominate the model. The approach we take is to scale the number of projects, likes, comments, and time spent to [0,1], with 0 being the lowest value of the respective attribute and 1 the highest.
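The min-max scaling described above, followed by the no-intercept multiple regression, might look like this. The helper `rescale01` is hypothetical, the column names and `df8` follow the output below, and the data are synthetic.

```r
# Min-max scaling to [0, 1]: lowest value maps to 0, highest to 1.
rescale01 <- function(v) {
  rng <- range(v, na.rm = TRUE)
  # A constant column has no spread; return zeros to avoid dividing by 0.
  if (rng[1] == rng[2]) return(rep(0, length(v)))
  (v - rng[1]) / (rng[2] - rng[1])
}

# Synthetic per-user totals (values are made up).
set.seed(11)
raw <- data.frame(
  customer_tot_likes    = rpois(48, 60),
  customer_tot_comments = rpois(48, 40),
  customer_tot_projects = rpois(48, 3)
)
raw$customer_tot_session_dur <-
  50 * raw$customer_tot_likes + 70 * raw$customer_tot_comments +
  rnorm(48, sd = 500)

# Scale every column, response included, to the same [0, 1] range.
df8 <- as.data.frame(lapply(raw, rescale01))

# Multiple regression without intercept on the scaled data.
fit <- lm(customer_tot_session_dur ~ 0 + customer_tot_likes +
            customer_tot_comments + customer_tot_projects, data = df8)

print(sapply(df8, range))
print(coef(summary(fit)))
```

Because all columns share the same range after scaling, the coefficient sizes can be compared across engagement types.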

## 
## Call:
## lm(formula = customer_tot_session_dur ~ 0 + customer_tot_likes + 
##     customer_tot_comments + customer_tot_projects, data = df8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14794 -0.04704 -0.00179  0.03668  0.29171 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## customer_tot_likes     0.42868    0.11693   3.666 0.000648 ***
## customer_tot_comments  0.55155    0.12112   4.554    4e-05 ***
## customer_tot_projects -0.04402    0.10242  -0.430 0.669398    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07879 on 45 degrees of freedom
## Multiple R-squared:  0.9513, Adjusted R-squared:  0.948 
## F-statistic: 292.9 on 3 and 45 DF,  p-value: < 2.2e-16

From the model summary, with an R-squared value of 95.13%, we can see that both likes and comments contribute positively and significantly to the time spent in Showwcase. The coefficient for projects is slightly negative (-0.044), but it is not statistically significant (p = 0.669), so we cannot conclude that posting projects reduces the time a user spends in Showwcase. While we only considered multiple linear regression (MLR) here, other regression approaches should be examined to possibly uncover better insights into which engagement interactions are significant. Furthermore, testing the model against more attributes (e.g. number of unique page visits per session, user's first date as a member, etc.) would be another way to dig deeper into user engagement.

Summary

As a community platform for tech workers, Showwcase showed positive trends in October 2019. First, we analyzed the outlier in the number of log-ins on October 26th. There were indeed a lot of likes, comments, and projects posted on that day, which was a Saturday. While not every weekend day mimics the pattern seen on October 26th, the change-detection analysis showed that the number of comments rose towards the end of the month. Moreover, we were able to group the 48 users into 5 groups that indicate the different levels of engagement intensity shown by the users. Lastly, we used regression to measure the contribution of each type of engagement to the time a user spends on the website. The numbers of likes and comments had a significant positive impact on session duration, while the effect of projects was slightly negative but not statistically significant. In the future, Showwcase can validate these results against more data points and attributes to uncover more insights and better understand the behavior of their users.

William Parwoto Wirono

September 15, 2020
