0.1 Introduction

MDSI Slack has been used among MDSI students since 2016. It is an efficient collaboration tool for students to chat, work in a team, discuss ideas, or share resources. Regarding to data perspective, Slack provides access to all chat log data for public conversations. In this assignment, a subset of data was put into SQL database for us to access and perform analysis. A combination of SQL, R and Python is explored and exploited its power to produce few insights about this dataset.

Regarding to Slack data structure, there are three tables in this database:

0.2 SQL

Connect database

## Loading required package: DBI

Find popular channel

We are looking for the popular and active channels which has more than 10 users. We assumed that these groups are official study group, not the private small group.

SELECT
  c.channel_id               AS channel_id,
  c.channel_name             AS channel_name,
  COUNT(DISTINCT m.user_id)  AS number_users,
  COUNT(*)                   AS number_messages,
  ROUND(COUNT(*)::DECIMAL / COUNT(DISTINCT m.user_id)::DECIMAL, 1) AS average_messages_per_user
FROM
  messages AS m
  LEFT JOIN users AS u
    ON m.user_id = u.user_id
  LEFT JOIN channels AS c
    ON m.channel_id = c.channel_id
WHERE
  c.channel_is_archived = FALSE
GROUP BY c.channel_id, c.channel_name
HAVING count(DISTINCT m.user_id) > 10
ORDER BY number_users DESC
Displaying records 1 - 10
channel_id channel_name number_users number_messages average_messages_per_user
C18SWDACD mdsi_announcements 116 1735 15.0
C1U4T4GCR mdsi_electives 87 1161 13.3
C1924SRPG dev_r 48 3653 76.1
C192JHDND ds_hackathons 44 741 16.8
C4GL5GF4Y mdsi_dvn_aut_18 44 338 7.7
C1E873E2E ds_jobs 39 280 7.2
C191P7JE6 ds_protips 37 248 6.7
C4P5NMR0D mdsi_datacamp 33 162 4.9
C1CHS0P45 dev_data_vis 32 458 14.3
C5SQ1Q1UH mdsi_deeplearn_spr_17 32 991 31.0

Results show that there are 24 active channels having more than 10 users. The ‘mdsi_announcements’ channel accounts for the greatest number of users and 1735 messages. However the dev_r which only has 48 users but gives most significant number of messages and the average number of messages per user.

Find popular users on each channel

The specific question here is who are the most popular users in above list channel, how many messages of that user and ratio of that to the average.


WITH active_channel AS (
  SELECT
    c.channel_id               AS channel_id,
    c.channel_name             AS channel_name,
    COUNT(DISTINCT m.user_id)  AS number_users,
    COUNT(*)                   AS number_messages,
    channel_creator            AS channel_creator,
    ROUND(COUNT(*)::DECIMAL / COUNT(DISTINCT m.user_id)::DECIMAL, 1) as average_messages_per_user
  FROM
    messages AS m
    LEFT JOIN users AS u
      ON m.user_id = u.user_id
    LEFT JOIN channels AS c
      ON m.channel_id = c.channel_id
  GROUP BY c.channel_id, c.channel_name, c.channel_creator
  HAVING count(DISTINCT m.user_id) > 10
  ORDER BY number_users DESC
),

aggregate_users AS (
  SELECT
    m.channel_id  AS channel_id,
    u.user_id     AS user_id,
    u.user_name   AS user_name,
    count(*)      AS number_messages
  FROM
    messages AS m
    LEFT join users AS u
      ON m.user_id = u.user_id
  GROUP BY m.channel_id, u.user_id, u.user_name
  ORDER BY number_messages DESC
),

active_users AS (
  SELECT
    DISTINCT ON (ag.channel_id) channel_id,
    ag.user_id     AS user_id,
    ag.user_name   AS user_name,
    ag.number_messages
  FROM
    aggregate_users ag
  ORDER BY ag.channel_id, ag.number_messages DESC
)

SELECT 
  ac.channel_id       AS channel_id,
  ac.channel_name     AS channel_name,
  ac.number_users     AS number_users,
  au.user_name        AS most_actived_user,
  au.number_messages  AS messages_of_user,
  ROUND (au.number_messages / ac.average_messages_per_user, 1) as ratio_with_average,
  CASE
    WHEN ac.channel_creator = au.user_id THEN 'Creator'
    ELSE 'Not Creator'
  END 
  AS creator
FROM
  active_channel ac
  INNER JOIN active_users au
    ON ac.channel_id = au.channel_id
ORDER BY ratio_with_average DESC
Displaying records 1 - 10
channel_id channel_name number_users most_actived_user messages_of_user ratio_with_average creator
C1924SRPG dev_r 48 Perry Stephenson 960 12.6 Not Creator
C4GL5GF4Y mdsi_dvn_aut_18 44 Rory Angus 91 11.8 Not Creator
C1WS2QTNK ds_hackathon_unearth 30 Perry Stephenson 190 11.4 Creator
C1DPDS65N mdsi_dam_aut_18 48 Alex Scriven 317 10.5 Not Creator
C18SWDACD mdsi_announcements 116 Perry Stephenson 157 10.5 Not Creator
C1U4T4GCR mdsi_electives 87 Alex Scriven 126 9.5 Not Creator
CBVTSQ0SZ mdsi_dam_spr_18 20 Alex Scriven 122 8.9 Creator
C192JHDND ds_hackathons 44 Perry Stephenson 134 8.0 Not Creator
C8PRYCL74 mdsi_dadm_aut_18 33 Rory Angus 110 7.6 Creator
C1E873E2E ds_jobs 39 Perry Stephenson 48 6.7 Not Creator

Perry Stephenson is quite popular in a few channels at the top of the list in which he is either a creator or not-a-creator. He also accounts for the highest number of messages of all channels, 960 and contributes twelve times higher than the average number of messages. Following Perry on the list, we also have Alex Scriven, who is also quite active in a few channels.

0.3 Import & Merge data by R & Python

R - Read data from database

users <- dbGetQuery(con, "select * from users")
channels <- dbGetQuery(con, "select * from channels")
messages <- dbGetQuery(con, "select * from messages")

Python - Merge data

import pandas as pd

users = r.users
channels = r.channels
messages = r.messages

df_all = messages.merge(users, 
                          on = 'user_id', 
                          how='left')
df_all = df_all.merge(channels, 
                          on = 'channel_id', 
                          how='left')
df_all.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28693 entries, 0 to 28692
## Data columns (total 20 columns):
## channel_id                  28693 non-null object
## user_id                     28693 non-null object
## message_text                28693 non-null object
## message_timestamp           28693 non-null datetime64[ns]
## message_attachment_count    28693 non-null float64
## message_reply_count         28693 non-null float64
## user_nickname               28061 non-null object
## user_name                   28061 non-null object
## user_timezone               28061 non-null object
## user_photo                  28061 non-null object
## user_is_admin               28061 non-null object
## user_is_owner               28061 non-null object
## user_is_bot                 28061 non-null object
## user_is_deleted             28061 non-null object
## channel_name                28693 non-null object
## channel_created             28693 non-null datetime64[ns]
## channel_creator             28693 non-null object
## channel_is_archived         28693 non-null bool
## channel_topic               28693 non-null object
## channel_purpose             28693 non-null object
## dtypes: bool(1), datetime64[ns](2), float64(2), object(15)
## memory usage: 4.4+ MB
library(DataExplorer)
df_all = merge(messages, 
               users, 
               by = "user_id", 
               all.x = TRUE)

df_all = merge(df_all, 
               channels, 
               by = "channel_id", 
               all.x = TRUE)
plot_str(df_all)

We make three steps:

From summary information, we have 28693 observations and 20 variables. We could see detail of each variable on the above graph.

0.4 Exploration analysis

R - plot missing

plot_missing(df_all)

We had missing data for few variables “user_is_deleted”, “user_is_bot”, “user_is_owner” and so on. The missing percentage of these variables are the same. It is quite understandable for missing value at “user_photo”, “user_nickname” because these are optional inputs. However, “user_name” is a required input. Few variables such as “user_is_deleted”, “user_is_bot”, “user_is_owner” could be filled by its dominant value.

plot_histogram(df_all)

Plot histogram presents that the number of message without attachment dominates the one with attachment. And it is the same situation as the number of messages without the reply and reply. It is quite understandable because people do not need to make a thread to answer on Slack.

par(mfrow=c(2,3))
for (col in c("user_is_deleted", 
              "user_is_bot", 
              "user_is_owner", 
              "user_is_admin", 
              "channel_is_archived"))
  barplot(table(df_all[[col]]), main = col)

Bar plot in R shows information: - There is no deleted user. - It could probably the same rate between owner user and admin user. Or maybe both are the same users. - Quite a few channels are archived. These information could be assisted in filling the missing values at above step.

0.7 Conclusion

A combination of SQL, R and Python is very effective to provide insights for our data analysis. SQL is a structured query language which is used to retrieve subset information that represent info in limited ways. Given data queried by SQL language, R and Python are powerful tools to perform data manipulation, data exploration and data visualization. With reticulate library, R and Python can convert variables to each other to leverage the power of each language.

In this article, SQL executes query to find out popular users and channel information. Then, a combination of R and Python presents this information by plotting onto histogram or boxplot. We found that the number of users less than 20 and the number of channels having no messages account significantly. Number of channels having 10 to 100 messages were given a considerable amount. Daily messages per weekday having no remarkable differences among days in the week. Finally, the use of wordcloud tool given by R and Python presents that Perry Stephenson and Alex Scriven are most popular users while dev_r, mdsi_annoucement are popular channels.