Assignment DSP 2B

0.1 Introduction
0.2 SQL
0.3 Import & Merge data by R & Python
0.4 Exploration analysis
0.5 Popular channel analysis
0.6 Find the most popular user and channel by word cloud in both R and Python
0.7 Conclusion

0.1 Introduction

MDSI Slack has been used among MDSI students since 2016. It is an efficient collaboration tool for students to chat, work in a team, discuss ideas, or share resources. Regarding to data perspective, Slack provides access to all chat log data for public conversations. In this assignment, a subset of data was put into SQL database for us to access and perform analysis. A combination of SQL, R and Python is explored and exploited its power to produce few insights about this dataset.

Regarding to Slack data structure, there are three tables in this database:

users: data about each user
channels: data about each public channel
messages: all messages posted in public channels.

0.2 SQL

Connect database

## Loading required package: DBI

Find popular channel

We are looking for the popular and active channels which has more than 10 users. We assumed that these groups are official study group, not the private small group.

SELECT
  c.channel_id               AS channel_id,
  c.channel_name             AS channel_name,
  COUNT(DISTINCT m.user_id)  AS number_users,
  COUNT(*)                   AS number_messages,
  ROUND(COUNT(*)::DECIMAL / COUNT(DISTINCT m.user_id)::DECIMAL, 1) AS average_messages_per_user
FROM
  messages AS m
  LEFT JOIN users AS u
    ON m.user_id = u.user_id
  LEFT JOIN channels AS c
    ON m.channel_id = c.channel_id
WHERE
  c.channel_is_archived = FALSE
GROUP BY c.channel_id, c.channel_name
HAVING count(DISTINCT m.user_id) > 10
ORDER BY number_users DESC

Displaying records 1 - 10
channel_id	channel_name	number_users	number_messages	average_messages_per_user
C18SWDACD	mdsi_announcements	116	1735	15.0
C1U4T4GCR	mdsi_electives	87	1161	13.3
C1924SRPG	dev_r	48	3653	76.1
C192JHDND	ds_hackathons	44	741	16.8
C4GL5GF4Y	mdsi_dvn_aut_18	44	338	7.7
C1E873E2E	ds_jobs	39	280	7.2
C191P7JE6	ds_protips	37	248	6.7
C4P5NMR0D	mdsi_datacamp	33	162	4.9
C1CHS0P45	dev_data_vis	32	458	14.3
C5SQ1Q1UH	mdsi_deeplearn_spr_17	32	991	31.0

Results show that there are 24 active channels having more than 10 users. The ‘mdsi_announcements’ channel accounts for the greatest number of users and 1735 messages. However the dev_r which only has 48 users but gives most significant number of messages and the average number of messages per user.

Find popular users on each channel

The specific question here is who are the most popular users in above list channel, how many messages of that user and ratio of that to the average.


WITH active_channel AS (
  SELECT
    c.channel_id               AS channel_id,
    c.channel_name             AS channel_name,
    COUNT(DISTINCT m.user_id)  AS number_users,
    COUNT(*)                   AS number_messages,
    channel_creator            AS channel_creator,
    ROUND(COUNT(*)::DECIMAL / COUNT(DISTINCT m.user_id)::DECIMAL, 1) as average_messages_per_user
  FROM
    messages AS m
    LEFT JOIN users AS u
      ON m.user_id = u.user_id
    LEFT JOIN channels AS c
      ON m.channel_id = c.channel_id
  GROUP BY c.channel_id, c.channel_name, c.channel_creator
  HAVING count(DISTINCT m.user_id) > 10
  ORDER BY number_users DESC
),

aggregate_users AS (
  SELECT
    m.channel_id  AS channel_id,
    u.user_id     AS user_id,
    u.user_name   AS user_name,
    count(*)      AS number_messages
  FROM
    messages AS m
    LEFT join users AS u
      ON m.user_id = u.user_id
  GROUP BY m.channel_id, u.user_id, u.user_name
  ORDER BY number_messages DESC
),

active_users AS (
  SELECT
    DISTINCT ON (ag.channel_id) channel_id,
    ag.user_id     AS user_id,
    ag.user_name   AS user_name,
    ag.number_messages
  FROM
    aggregate_users ag
  ORDER BY ag.channel_id, ag.number_messages DESC
)

SELECT 
  ac.channel_id       AS channel_id,
  ac.channel_name     AS channel_name,
  ac.number_users     AS number_users,
  au.user_name        AS most_actived_user,
  au.number_messages  AS messages_of_user,
  ROUND (au.number_messages / ac.average_messages_per_user, 1) as ratio_with_average,
  CASE
    WHEN ac.channel_creator = au.user_id THEN 'Creator'
    ELSE 'Not Creator'
  END 
  AS creator
FROM
  active_channel ac
  INNER JOIN active_users au
    ON ac.channel_id = au.channel_id
ORDER BY ratio_with_average DESC

Displaying records 1 - 10
channel_id	channel_name	number_users	most_actived_user	messages_of_user	ratio_with_average	creator
C1924SRPG	dev_r	48	Perry Stephenson	960	12.6	Not Creator
C4GL5GF4Y	mdsi_dvn_aut_18	44	Rory Angus	91	11.8	Not Creator
C1WS2QTNK	ds_hackathon_unearth	30	Perry Stephenson	190	11.4	Creator
C1DPDS65N	mdsi_dam_aut_18	48	Alex Scriven	317	10.5	Not Creator
C18SWDACD	mdsi_announcements	116	Perry Stephenson	157	10.5	Not Creator
C1U4T4GCR	mdsi_electives	87	Alex Scriven	126	9.5	Not Creator
CBVTSQ0SZ	mdsi_dam_spr_18	20	Alex Scriven	122	8.9	Creator
C192JHDND	ds_hackathons	44	Perry Stephenson	134	8.0	Not Creator
C8PRYCL74	mdsi_dadm_aut_18	33	Rory Angus	110	7.6	Creator
C1E873E2E	ds_jobs	39	Perry Stephenson	48	6.7	Not Creator

Perry Stephenson is quite popular in a few channels at the top of the list in which he is either a creator or not-a-creator. He also accounts for the highest number of messages of all channels, 960 and contributes twelve times higher than the average number of messages. Following Perry on the list, we also have Alex Scriven, who is also quite active in a few channels.

0.3 Import & Merge data by R & Python

R - Read data from database

users <- dbGetQuery(con, "select * from users")
channels <- dbGetQuery(con, "select * from channels")
messages <- dbGetQuery(con, "select * from messages")

Python - Merge data

import pandas as pd

users = r.users
channels = r.channels
messages = r.messages

df_all = messages.merge(users, 
                          on = 'user_id', 
                          how='left')
df_all = df_all.merge(channels, 
                          on = 'channel_id', 
                          how='left')
df_all.info()

## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28693 entries, 0 to 28692
## Data columns (total 20 columns):
## channel_id                  28693 non-null object
## user_id                     28693 non-null object
## message_text                28693 non-null object
## message_timestamp           28693 non-null datetime64[ns]
## message_attachment_count    28693 non-null float64
## message_reply_count         28693 non-null float64
## user_nickname               28061 non-null object
## user_name                   28061 non-null object
## user_timezone               28061 non-null object
## user_photo                  28061 non-null object
## user_is_admin               28061 non-null object
## user_is_owner               28061 non-null object
## user_is_bot                 28061 non-null object
## user_is_deleted             28061 non-null object
## channel_name                28693 non-null object
## channel_created             28693 non-null datetime64[ns]
## channel_creator             28693 non-null object
## channel_is_archived         28693 non-null bool
## channel_topic               28693 non-null object
## channel_purpose             28693 non-null object
## dtypes: bool(1), datetime64[ns](2), float64(2), object(15)
## memory usage: 4.4+ MB

library(DataExplorer)
df_all = merge(messages, 
               users, 
               by = "user_id", 
               all.x = TRUE)

df_all = merge(df_all, 
               channels, 
               by = "channel_id", 
               all.x = TRUE)
plot_str(df_all)

We make three steps:

Read data from database by using R
Convert variable from R to Python and display info
Merge data by R

From summary information, we have 28693 observations and 20 variables. We could see detail of each variable on the above graph.

0.4 Exploration analysis

R - plot missing

plot_missing(df_all)

We had missing data for few variables “user_is_deleted”, “user_is_bot”, “user_is_owner” and so on. The missing percentage of these variables are the same. It is quite understandable for missing value at “user_photo”, “user_nickname” because these are optional inputs. However, “user_name” is a required input. Few variables such as “user_is_deleted”, “user_is_bot”, “user_is_owner” could be filled by its dominant value.

plot_histogram(df_all)

Plot histogram presents that the number of message without attachment dominates the one with attachment. And it is the same situation as the number of messages without the reply and reply. It is quite understandable because people do not need to make a thread to answer on Slack.

par(mfrow=c(2,3))
for (col in c("user_is_deleted", 
              "user_is_bot", 
              "user_is_owner", 
              "user_is_admin", 
              "channel_is_archived"))
  barplot(table(df_all[[col]]), main = col)

Bar plot in R shows information: - There is no deleted user. - It could probably the same rate between owner user and admin user. Or maybe both are the same users. - Quite a few channels are archived. These information could be assisted in filling the missing values at above step.

0.5 Popular channel analysis

Python - Find number of messages per channel

df_message_by_channel = df_all.groupby('channel_name')['channel_creator'] \
                              .count() \
                              .reset_index(name='count') \
                              .sort_values(['count'], ascending=False)
df_message_by_channel.head(10)

##               channel_name  count
## 26                   dev_r   3653
## 13            dam-assign-3   1874
## 67      mdsi_announcements   1735
## 102    spring_stds_assign2   1733
## 71         mdsi_dam_aut_18   1450
## 1         36100decepticons   1437
## 84          mdsi_stds_proj   1378
## 79          mdsi_electives   1161
## 14             databusters   1074
## 76   mdsi_deeplearn_spr_17    991

R - Plot distribution of messages per channel

plot_histogram(py$df_message_by_channel)

At this step, Python is utilised to prepare data and transfer that variable to R to plot histogram. Results show that ‘dev_r’ channel accounts for most significant number of messages while eight other channels made major contribution over 1000 messages. Surprisingly, 60 channels have no message. It could be because that these channels were made in mistake, or were abandoned.

R - Find number of users per channel

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df_users_by_channel <- df_all %>% 
  select('channel_name', 'user_id') %>%
  group_by(channel_name, user_id) %>%
  summarize(n = n()) %>%
  group_by(channel_name) %>%
  summarize(num_users = n()) %>%
  arrange(desc(num_users))

head(df_users_by_channel)

## # A tibble: 6 x 2
##   channel_name       num_users
##   <chr>                  <int>
## 1 mdsi_announcements       116
## 2 mdsi_electives            87
## 3 dev_r                     48
## 4 mdsi_dam_aut_18           48
## 5 ds_hackathons             44
## 6 mdsi_dvn_aut_18           44

Python - Box plot number of users per channel

r.df_users_by_channel.plot.box(by = 'num_users')

In opposite of previous step, R makes data munging first then Python performs the plot. 116 is the highest number of users per channel but less than 20 number accounts for the majority of cases. ‘mdsi_announcements’ and ‘mdsi_electives’ are the two largest channels which their total number of users are threefold the following channel in the list. Outliers could be found majorly in the range of 30 to 40 number of users.

R - Find daily messages per channel

library(dplyr)
df_daily_messages_per_channel <- df_all %>% 
  select('channel_name', 'message_timestamp') %>%
  mutate(message_date = as.Date(message_timestamp)) %>%
  group_by(channel_name, message_date) %>%
  summarize(num_messages = n()) %>%
  arrange(desc(num_messages))

head(df_daily_messages_per_channel)

## # A tibble: 6 x 3
## # Groups:   channel_name [3]
##   channel_name        message_date num_messages
##   <chr>               <date>              <int>
## 1 dam-assign-3        2018-06-09            458
## 2 spring_stds_assign2 2018-09-30            364
## 3 dam-assign-3        2018-06-08            341
## 4 dam-assign-3        2018-06-10            325
## 5 dev_r               2016-09-09            267
## 6 spring_stds_assign2 2018-09-27            233

Python - Boxplot daily messages per channel

r.df_daily_messages_per_channel.plot.box(by = 'num_messages')

Since there is a large number of channels having no message so the centre boxplot is quite near the 0. The highest number of messages, 458 belongs to ‘dam-assign-3’ channel on 2018-06-09. Outliers could be found significantly in range of 10 to above 100.

Python - Boxplot frequency of channel per weekday

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ readr   1.3.1
## ✔ tibble  2.1.3     ✔ purrr   0.3.2
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ ggplot2 3.2.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

df_messages_per_weekday <- df_all %>%
  select('channel_name', 'message_timestamp') %>%
  mutate(weekday = strftime(message_timestamp,'%a')) %>%
  group_by(channel_name, weekday) %>%
  summarize(num_messages = n()) %>%
  arrange(desc(num_messages))

boxplot(num_messages ~ weekday, 
    data = df_messages_per_weekday, 
    xlab = "Weekday",
    ylab = "Number of message", 
    names = c("Mon","Tue","Wed", "Thu", "Fri", "Sat", "Sun"))

The mean of daily messages per channel is quite low. Monday could be a little higher than the rest in terms of the number of daily messages. There are quite a large amount of outliers, which could be due to the few abandoned channels. Generally, there is no day in the week which could give the remarkable difference.

0.6 Find the most popular user and channel by word cloud in both R and Python

Python - Word cloud for the popular user

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

text = ','.join(str(v) for v in df_all['user_name'])
wordcloud = WordCloud(
    width = 300,
    height = 200,
    background_color = 'black',
    stopwords = STOPWORDS).generate(str(text))

# Display the generated image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

## (-0.5, 299.5, 199.5, -0.5)

plt.show()

Perry Stephenson and Alex Scriven are popular name on the wordcloud image, which is same as above SQL result. Better than SQL, with wordcloud we can also view few other names as well such as Ho jacky, Jason Nguyen, Jacky Wong, dan Booth, Ka Ho and so on.

R - Word cloud for the popular channel

library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(SnowballC)
library(wordcloud)

## Loading required package: RColorBrewer

jeopCorpus <- VCorpus(VectorSource(df_all$channel_name))
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
#jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

Same as above SQL result, dev_r, mdsi_annoucement are the most popular channels.

0.7 Conclusion

A combination of SQL, R and Python is very effective to provide insights for our data analysis. SQL is a structured query language which is used to retrieve subset information that represent info in limited ways. Given data queried by SQL language, R and Python are powerful tools to perform data manipulation, data exploration and data visualization. With reticulate library, R and Python can convert variables to each other to leverage the power of each language.

In this article, SQL executes query to find out popular users and channel information. Then, a combination of R and Python presents this information by plotting onto histogram or boxplot. We found that the number of users less than 20 and the number of channels having no messages account significantly. Number of channels having 10 to 100 messages were given a considerable amount. Daily messages per weekday having no remarkable differences among days in the week. Finally, the use of wordcloud tool given by R and Python presents that Perry Stephenson and Alex Scriven are most popular users while dev_r, mdsi_annoucement are popular channels.