MDSI Slack has been used among MDSI students since 2016. It is an efficient collaboration tool for students to chat, work in a team, discuss ideas, or share resources. Regarding to data perspective, Slack provides access to all chat log data for public conversations. In this assignment, a subset of data was put into SQL database for us to access and perform analysis. A combination of SQL, R and Python is explored and exploited its power to produce few insights about this dataset.
Regarding to Slack data structure, there are three tables in this database:
Connect database
## Loading required package: DBI
Find popular channel
We are looking for the popular and active channels which has more than 10 users. We assumed that these groups are official study group, not the private small group.
SELECT
c.channel_id AS channel_id,
c.channel_name AS channel_name,
COUNT(DISTINCT m.user_id) AS number_users,
COUNT(*) AS number_messages,
ROUND(COUNT(*)::DECIMAL / COUNT(DISTINCT m.user_id)::DECIMAL, 1) AS average_messages_per_user
FROM
messages AS m
LEFT JOIN users AS u
ON m.user_id = u.user_id
LEFT JOIN channels AS c
ON m.channel_id = c.channel_id
WHERE
c.channel_is_archived = FALSE
GROUP BY c.channel_id, c.channel_name
HAVING count(DISTINCT m.user_id) > 10
ORDER BY number_users DESC
| channel_id | channel_name | number_users | number_messages | average_messages_per_user |
|---|---|---|---|---|
| C18SWDACD | mdsi_announcements | 116 | 1735 | 15.0 |
| C1U4T4GCR | mdsi_electives | 87 | 1161 | 13.3 |
| C1924SRPG | dev_r | 48 | 3653 | 76.1 |
| C192JHDND | ds_hackathons | 44 | 741 | 16.8 |
| C4GL5GF4Y | mdsi_dvn_aut_18 | 44 | 338 | 7.7 |
| C1E873E2E | ds_jobs | 39 | 280 | 7.2 |
| C191P7JE6 | ds_protips | 37 | 248 | 6.7 |
| C4P5NMR0D | mdsi_datacamp | 33 | 162 | 4.9 |
| C1CHS0P45 | dev_data_vis | 32 | 458 | 14.3 |
| C5SQ1Q1UH | mdsi_deeplearn_spr_17 | 32 | 991 | 31.0 |
Results show that there are 24 active channels having more than 10 users. The ‘mdsi_announcements’ channel accounts for the greatest number of users and 1735 messages. However the dev_r which only has 48 users but gives most significant number of messages and the average number of messages per user.
Find popular users on each channel
The specific question here is who are the most popular users in above list channel, how many messages of that user and ratio of that to the average.
WITH active_channel AS (
SELECT
c.channel_id AS channel_id,
c.channel_name AS channel_name,
COUNT(DISTINCT m.user_id) AS number_users,
COUNT(*) AS number_messages,
channel_creator AS channel_creator,
ROUND(COUNT(*)::DECIMAL / COUNT(DISTINCT m.user_id)::DECIMAL, 1) as average_messages_per_user
FROM
messages AS m
LEFT JOIN users AS u
ON m.user_id = u.user_id
LEFT JOIN channels AS c
ON m.channel_id = c.channel_id
GROUP BY c.channel_id, c.channel_name, c.channel_creator
HAVING count(DISTINCT m.user_id) > 10
ORDER BY number_users DESC
),
aggregate_users AS (
SELECT
m.channel_id AS channel_id,
u.user_id AS user_id,
u.user_name AS user_name,
count(*) AS number_messages
FROM
messages AS m
LEFT join users AS u
ON m.user_id = u.user_id
GROUP BY m.channel_id, u.user_id, u.user_name
ORDER BY number_messages DESC
),
active_users AS (
SELECT
DISTINCT ON (ag.channel_id) channel_id,
ag.user_id AS user_id,
ag.user_name AS user_name,
ag.number_messages
FROM
aggregate_users ag
ORDER BY ag.channel_id, ag.number_messages DESC
)
SELECT
ac.channel_id AS channel_id,
ac.channel_name AS channel_name,
ac.number_users AS number_users,
au.user_name AS most_actived_user,
au.number_messages AS messages_of_user,
ROUND (au.number_messages / ac.average_messages_per_user, 1) as ratio_with_average,
CASE
WHEN ac.channel_creator = au.user_id THEN 'Creator'
ELSE 'Not Creator'
END
AS creator
FROM
active_channel ac
INNER JOIN active_users au
ON ac.channel_id = au.channel_id
ORDER BY ratio_with_average DESC
| channel_id | channel_name | number_users | most_actived_user | messages_of_user | ratio_with_average | creator |
|---|---|---|---|---|---|---|
| C1924SRPG | dev_r | 48 | Perry Stephenson | 960 | 12.6 | Not Creator |
| C4GL5GF4Y | mdsi_dvn_aut_18 | 44 | Rory Angus | 91 | 11.8 | Not Creator |
| C1WS2QTNK | ds_hackathon_unearth | 30 | Perry Stephenson | 190 | 11.4 | Creator |
| C1DPDS65N | mdsi_dam_aut_18 | 48 | Alex Scriven | 317 | 10.5 | Not Creator |
| C18SWDACD | mdsi_announcements | 116 | Perry Stephenson | 157 | 10.5 | Not Creator |
| C1U4T4GCR | mdsi_electives | 87 | Alex Scriven | 126 | 9.5 | Not Creator |
| CBVTSQ0SZ | mdsi_dam_spr_18 | 20 | Alex Scriven | 122 | 8.9 | Creator |
| C192JHDND | ds_hackathons | 44 | Perry Stephenson | 134 | 8.0 | Not Creator |
| C8PRYCL74 | mdsi_dadm_aut_18 | 33 | Rory Angus | 110 | 7.6 | Creator |
| C1E873E2E | ds_jobs | 39 | Perry Stephenson | 48 | 6.7 | Not Creator |
Perry Stephenson is quite popular in a few channels at the top of the list in which he is either a creator or not-a-creator. He also accounts for the highest number of messages of all channels, 960 and contributes twelve times higher than the average number of messages. Following Perry on the list, we also have Alex Scriven, who is also quite active in a few channels.
R - Read data from database
users <- dbGetQuery(con, "select * from users")
channels <- dbGetQuery(con, "select * from channels")
messages <- dbGetQuery(con, "select * from messages")
Python - Merge data
import pandas as pd
users = r.users
channels = r.channels
messages = r.messages
df_all = messages.merge(users,
on = 'user_id',
how='left')
df_all = df_all.merge(channels,
on = 'channel_id',
how='left')
df_all.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28693 entries, 0 to 28692
## Data columns (total 20 columns):
## channel_id 28693 non-null object
## user_id 28693 non-null object
## message_text 28693 non-null object
## message_timestamp 28693 non-null datetime64[ns]
## message_attachment_count 28693 non-null float64
## message_reply_count 28693 non-null float64
## user_nickname 28061 non-null object
## user_name 28061 non-null object
## user_timezone 28061 non-null object
## user_photo 28061 non-null object
## user_is_admin 28061 non-null object
## user_is_owner 28061 non-null object
## user_is_bot 28061 non-null object
## user_is_deleted 28061 non-null object
## channel_name 28693 non-null object
## channel_created 28693 non-null datetime64[ns]
## channel_creator 28693 non-null object
## channel_is_archived 28693 non-null bool
## channel_topic 28693 non-null object
## channel_purpose 28693 non-null object
## dtypes: bool(1), datetime64[ns](2), float64(2), object(15)
## memory usage: 4.4+ MB
library(DataExplorer)
df_all = merge(messages,
users,
by = "user_id",
all.x = TRUE)
df_all = merge(df_all,
channels,
by = "channel_id",
all.x = TRUE)
plot_str(df_all)
We make three steps:
From summary information, we have 28693 observations and 20 variables. We could see detail of each variable on the above graph.
R - plot missing
plot_missing(df_all)
We had missing data for few variables “user_is_deleted”, “user_is_bot”, “user_is_owner” and so on. The missing percentage of these variables are the same. It is quite understandable for missing value at “user_photo”, “user_nickname” because these are optional inputs. However, “user_name” is a required input. Few variables such as “user_is_deleted”, “user_is_bot”, “user_is_owner” could be filled by its dominant value.
plot_histogram(df_all)
Plot histogram presents that the number of message without attachment dominates the one with attachment. And it is the same situation as the number of messages without the reply and reply. It is quite understandable because people do not need to make a thread to answer on Slack.
par(mfrow=c(2,3))
for (col in c("user_is_deleted",
"user_is_bot",
"user_is_owner",
"user_is_admin",
"channel_is_archived"))
barplot(table(df_all[[col]]), main = col)
Bar plot in R shows information: - There is no deleted user. - It could probably the same rate between owner user and admin user. Or maybe both are the same users. - Quite a few channels are archived. These information could be assisted in filling the missing values at above step.
Python - Find number of messages per channel
df_message_by_channel = df_all.groupby('channel_name')['channel_creator'] \
.count() \
.reset_index(name='count') \
.sort_values(['count'], ascending=False)
df_message_by_channel.head(10)
## channel_name count
## 26 dev_r 3653
## 13 dam-assign-3 1874
## 67 mdsi_announcements 1735
## 102 spring_stds_assign2 1733
## 71 mdsi_dam_aut_18 1450
## 1 36100decepticons 1437
## 84 mdsi_stds_proj 1378
## 79 mdsi_electives 1161
## 14 databusters 1074
## 76 mdsi_deeplearn_spr_17 991
R - Plot distribution of messages per channel
plot_histogram(py$df_message_by_channel)
At this step, Python is utilised to prepare data and transfer that variable to R to plot histogram. Results show that ‘dev_r’ channel accounts for most significant number of messages while eight other channels made major contribution over 1000 messages. Surprisingly, 60 channels have no message. It could be because that these channels were made in mistake, or were abandoned.
R - Find number of users per channel
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df_users_by_channel <- df_all %>%
select('channel_name', 'user_id') %>%
group_by(channel_name, user_id) %>%
summarize(n = n()) %>%
group_by(channel_name) %>%
summarize(num_users = n()) %>%
arrange(desc(num_users))
head(df_users_by_channel)
## # A tibble: 6 x 2
## channel_name num_users
## <chr> <int>
## 1 mdsi_announcements 116
## 2 mdsi_electives 87
## 3 dev_r 48
## 4 mdsi_dam_aut_18 48
## 5 ds_hackathons 44
## 6 mdsi_dvn_aut_18 44
Python - Box plot number of users per channel
r.df_users_by_channel.plot.box(by = 'num_users')
In opposite of previous step, R makes data munging first then Python performs the plot. 116 is the highest number of users per channel but less than 20 number accounts for the majority of cases. ‘mdsi_announcements’ and ‘mdsi_electives’ are the two largest channels which their total number of users are threefold the following channel in the list. Outliers could be found majorly in the range of 30 to 40 number of users.
R - Find daily messages per channel
library(dplyr)
df_daily_messages_per_channel <- df_all %>%
select('channel_name', 'message_timestamp') %>%
mutate(message_date = as.Date(message_timestamp)) %>%
group_by(channel_name, message_date) %>%
summarize(num_messages = n()) %>%
arrange(desc(num_messages))
head(df_daily_messages_per_channel)
## # A tibble: 6 x 3
## # Groups: channel_name [3]
## channel_name message_date num_messages
## <chr> <date> <int>
## 1 dam-assign-3 2018-06-09 458
## 2 spring_stds_assign2 2018-09-30 364
## 3 dam-assign-3 2018-06-08 341
## 4 dam-assign-3 2018-06-10 325
## 5 dev_r 2016-09-09 267
## 6 spring_stds_assign2 2018-09-27 233
Python - Boxplot daily messages per channel
r.df_daily_messages_per_channel.plot.box(by = 'num_messages')
Since there is a large number of channels having no message so the centre boxplot is quite near the 0. The highest number of messages, 458 belongs to ‘dam-assign-3’ channel on 2018-06-09. Outliers could be found significantly in range of 10 to above 100.
Python - Boxplot frequency of channel per weekday
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ readr 1.3.1
## ✔ tibble 2.1.3 ✔ purrr 0.3.2
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ ggplot2 3.2.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
df_messages_per_weekday <- df_all %>%
select('channel_name', 'message_timestamp') %>%
mutate(weekday = strftime(message_timestamp,'%a')) %>%
group_by(channel_name, weekday) %>%
summarize(num_messages = n()) %>%
arrange(desc(num_messages))
boxplot(num_messages ~ weekday,
data = df_messages_per_weekday,
xlab = "Weekday",
ylab = "Number of message",
names = c("Mon","Tue","Wed", "Thu", "Fri", "Sat", "Sun"))
The mean of daily messages per channel is quite low. Monday could be a little higher than the rest in terms of the number of daily messages. There are quite a large amount of outliers, which could be due to the few abandoned channels. Generally, there is no day in the week which could give the remarkable difference.
Python - Word cloud for the popular user
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
text = ','.join(str(v) for v in df_all['user_name'])
wordcloud = WordCloud(
width = 300,
height = 200,
background_color = 'black',
stopwords = STOPWORDS).generate(str(text))
# Display the generated image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
## (-0.5, 299.5, 199.5, -0.5)
plt.show()
Perry Stephenson and Alex Scriven are popular name on the wordcloud image, which is same as above SQL result. Better than SQL, with wordcloud we can also view few other names as well such as Ho jacky, Jason Nguyen, Jacky Wong, dan Booth, Ka Ho and so on.
R - Word cloud for the popular channel
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
jeopCorpus <- VCorpus(VectorSource(df_all$channel_name))
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
#jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
Same as above SQL result, dev_r, mdsi_annoucement are the most popular channels.
A combination of SQL, R and Python is very effective to provide insights for our data analysis. SQL is a structured query language which is used to retrieve subset information that represent info in limited ways. Given data queried by SQL language, R and Python are powerful tools to perform data manipulation, data exploration and data visualization. With reticulate library, R and Python can convert variables to each other to leverage the power of each language.
In this article, SQL executes query to find out popular users and channel information. Then, a combination of R and Python presents this information by plotting onto histogram or boxplot. We found that the number of users less than 20 and the number of channels having no messages account significantly. Number of channels having 10 to 100 messages were given a considerable amount. Daily messages per weekday having no remarkable differences among days in the week. Finally, the use of wordcloud tool given by R and Python presents that Perry Stephenson and Alex Scriven are most popular users while dev_r, mdsi_annoucement are popular channels.