This report is an analysis of the Masters of Data Science and Innovation (MDSI) slack community. It demonstrates the use of two common languages among data scientists, R and Python, along with SQL to extract, analyse and make visualizations of the data. Additionally, this report demonstrates the use of R’s Plumber API to host the data in localhost and how it is pulled and parsed in Python.
The visualizations included in this report are a wordcloud of the most active users in the community, a stacked bar chart of the admins' message frequency over the months, and a wordcloud of the most common collocations in the mdsi_electives channel.
For the purpose of this report, we are going to analyse messages from the MDSI Slack channels, covering May 2016 to March 2019. This data has been dumped into a database hosted on AWS, so we are going to connect to that database and construct our dataset. We have been told that there are three tables: messages, users and channels.
The following code is used to connect to the database, query into it and get the data that we need.
#
# filename: Analysis1.R
# load libraries
library(RPostgreSQL)
library(sqldf)
# Database connection string
con <- dbConnect(drv = dbDriver('PostgreSQL'),
                 host = 'mdsislack.clnutj7nhgyn.us-east-2.rds.amazonaws.com',
                 port = 5432,
                 user = '<username>',     # hidden for security purposes
                 password = '<password>', # hidden for security purposes
                 dbname = 'mdsislack')
# Peek into the tables that are there
users <- dbGetQuery(con, "select * from users")
head(users)
channels <- dbGetQuery(con, "select * from channels")
head(channels)
messages <- dbGetQuery(con, "select * from messages")
head(messages)
# Join the tables on their IDs and filter out messages generated by bot users
messages_users <- dbGetQuery(con, "select *
                                   from messages
                                   inner join users
                                     on messages.user_id = users.user_id
                                   inner join channels
                                     on messages.channel_id = channels.channel_id
                                   where users.user_is_bot is FALSE")
#disconnect the database connections
dbDisconnect(con)
#see what we have got
head(messages_users)
A wordcloud is a great way of visualizing word usage in a dataset. We are going to use a wordcloud to see which users have posted the most in the MDSI Slack community.
#
# filename: Analysis2.R
#load libraries RColorBrewer and wordcloud
library(RColorBrewer)
library(wordcloud)
#Extract only user_nickname and messages text for analysis from the dataset
nick_msgs <- messages_users[, c("user_nickname", "message_text")]
# Get the frequency of posts of each user (for wordcloud)
# Note: For sqldf, we need to include the parameter drv to evaluate on our local datasets
count_msgs <- sqldf("select user_nickname,
                            count(1) as msg_frequency
                     from nick_msgs
                     group by user_nickname", drv = "SQLite")
# Create the wordcloud
wordcloud(words = count_msgs$user_nickname, freq = count_msgs$msg_frequency,
          min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.25,
          colors = brewer.pal(8, "Dark2"))
So, we get an output as follows:
Looking at the above visualization, the nicknames shown in the largest sizes, such as perry and ajscriven, belong to the admins of the MDSI Slack community. So it is also helpful to see how our admins have been participating in the discussions going on in the community. For this purpose, let's create an API using Plumber and host it on localhost.
#
##filename: plumber_api.R
# Extract only the message_timestamp, user_nickname and user_is_admin columns
# from the main dataset
subset_user_time_msgs <-
  messages_users[, c("message_timestamp", "user_nickname", "user_is_admin")]
# Format message_timestamp as a date in the Sydney time zone
subset_user_time_msgs$msg_day <-
  as.Date(as.POSIXct(subset_user_time_msgs$message_timestamp,
                     tz = 'Australia/Sydney'),
          tz = 'Australia/Sydney')
# Filter only the messages where user_is_admin parameter is TRUE (i.e. admin's messages
# only)
admin_time_msgs <- sqldf("select msg_day,
                                 user_nickname,
                                 count(1) as msg_frequency
                          from subset_user_time_msgs
                          where user_is_admin is TRUE
                          group by msg_day, user_nickname",
                         drv = "SQLite")
# The following code constructs the API using the dataset constructed above
#' Get admin's interaction data
#' @get /interaction
function(msg){
  list(admin_time_msgs)
}
Now, we can use Plumber to host this API on localhost.
#
##filename: plumber.R
# load the library Plumber
library(plumber)
# Host the file plumber_api.R
r <- plumb('plumber_api.R')
r$run(port=8000)
Plumber hosts the API on localhost, along with interactive documentation generated via Swagger.
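Once the API is running, we can do a quick sanity check from another R session before moving to Python. The snippet below is a minimal sketch, assuming the httr and jsonlite packages are installed; it simply calls the /interaction endpoint and inspects the parsed result.
#
## quick check of the hosted endpoint (illustrative sketch only)
library(httr)
library(jsonlite)
# call the endpoint exposed by plumber_api.R
res <- GET("http://127.0.0.1:8000/interaction")
# proceed only if the request succeeded (HTTP 200)
if (status_code(res) == 200) {
  # parse the JSON body into R objects (a list containing the admin data)
  admin_data <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
  str(admin_data, max.level = 2)
}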
For the visualization step, Python will be used. The following code loads the required libraries and defines a function that parses the JSON-formatted string obtained from the API.
#
##file name: Analysis3.py
#import necessary libraries
import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
import numpy as np
from dfply import *
# create a data frame from the response received as a parameter and return it
def myData(response):
    # Check, and evaluate only if the status of the request is successful (200 = success)
    if response.status_code == 200:
        myMessagesList = []
        data = response.json()
        for i in data:
            for j in i:
                myMessagesList.append([j['user_nickname'], j['msg_day'], j['msg_frequency']])
        # print(myMessagesList)
        df = pd.DataFrame(myMessagesList, columns=['nick', 'day', 'freq'])
        return df
    else:
        print("Error getting API data")
        exit(1)
Now that we have our function ready, we can call the API and use the function to parse the response.
#
##file name: Analysis3.py (continued...)
# Calling the API
response = requests.get("http://127.0.0.1:8000/interaction")
#send the response data to the function myData and store its returned dataframe to df
df = myData(response)
#format the column day as datetime
df['day'] = pd.to_datetime(df['day'])
#extract the year and month from the datetime and save it as a new dataframe
df1 = df >> mutate(yearmonth = X.day.dt.strftime('%Y%m'))>>select(X.nick, X.day, X.yearmonth, X.freq)
#the following code groups the data frame by yearmonth and nick, sums up the frequency
# and unstacks them
# unstack creates a pivot table with yearmonth on rows and nick on columns
df1 = df1.groupby(['yearmonth', 'nick']).sum()['freq'].unstack().fillna(0)
# display our new data
print(df1.head())
# set the colors for the plot
colors = ['red', 'blue', 'green', 'violet', 'indigo']
# plot a stacked bar graph for the data
ax = df1.loc[:, ['ajscriven', 'dan', 'perry', 'roryangus', 'shan.yang']].plot.bar(stacked=True, color=colors, figsize=(10, 7))
# set the title and axis labels for the plot (plot.bar returns a matplotlib Axes)
ax.set_title("Message frequency distribution of Slack admins over months")
ax.set_xlabel("Months")
ax.set_ylabel("Messages")
#show the plot
plt.show()
The plot generated from the code is:
Now we will once again step into R. We will try to analyse the buzz words (or topics) in the mdsi_electives channel. Since half of the MDSI course consists of electives, a common concern for students is which elective subjects are available and can be chosen for study.
A wordcloud could be constructed simply, like the wordcloud of the users posting the most, but this may not be practical. In any natural language, collocations are widely used rather than isolated words. For example, the term "machine learning" is always used as a single unit in MDSI. So, if we form a wordcloud of individual words, 'machine' and 'learning' would have equal frequencies and would be displayed separately. However, if we form collocations, 'machine learning' becomes one term. 'Machine learning' can also be distinguished from the similar term 'deep learning' via collocations.
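To make this idea concrete, here is a small illustrative sketch using quanteda's textstat_collocations on a few made-up sentences; it shows 'machine learning' and 'deep learning' being scored as distinct two-word terms. (In recent quanteda releases this function has moved to the companion quanteda.textstats package, so the extra library call below may be needed.)
#
## toy collocation example (illustrative only)
library(quanteda)
# library(quanteda.textstats)  # required instead in quanteda >= 3
# three made-up sentences mentioning the two terms
toy <- c("machine learning electives are popular",
         "deep learning builds on machine learning",
         "machine learning and deep learning overlap")
# two-word collocations occurring at least twice
toy_colloc <- textstat_collocations(toy, size = 2, min_count = 2)
# 'machine learning' and 'deep learning' show up as distinct collocations
toy_colloc[, c("collocation", "count")]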
Hence, we start by loading the libraries for text mining (tm) and quantitative analysis of text data (quanteda).
#
##file name: Analysis4.R
library(quanteda)
library(tm)
#get a subset of messages
subset_messages <- messages_users[, c("message_text", "channel_name")]
#filter out the messages from mdsi_electives channel
mdsi_electives_msgs <- sqldf("select * from subset_messages
                              where channel_name = 'mdsi_electives'", drv = "SQLite")
#get only messages
messages <- mdsi_electives_msgs$message_text
#convert the messages to corpus for text analysis
docs <- Corpus(VectorSource(messages))
# a helper to replace matched patterns in the text with a space
# (defined for extra cleaning, although it is not used below)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
#convert the messages to lowercase
docs <- tm_map(docs, content_transformer(tolower))
#remove numbers
docs <- tm_map(docs, removeNumbers)
#remove punctuation
docs <- tm_map(docs, removePunctuation)
#remove white space
docs <- tm_map(docs, stripWhitespace)
#remove common stop words
docs <- tm_map(docs, removeWords, stopwords('en'))
# convert the corpus back to text for finding collocations via the quanteda library
msgs_text <- data.frame(text = sapply(docs, as.character),
                        stringsAsFactors = FALSE)
# compute collocations of words, using the quanteda library
colloq <- textstat_collocations(msgs_text$text, min_count = 2)
The function textstat_collocations identifies sequences of words that occur together in the text (collocations) and also gives their counts. We can use this information in our wordcloud function to generate a wordcloud.
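Before plotting, it is worth peeking at the strongest results. The one-liner below is a quick sanity check that simply sorts the collocations computed above by frequency; collocation and count are columns of the textstat_collocations result.
# inspect the ten most frequent collocations before building the wordcloud
head(colloq[order(-colloq$count), c("collocation", "count")], 10)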
We will be using two-word collocations for this analysis. Since collocations are longer than single words, we first need to increase the device size so they fit into the plot.
#
##file name: Analysis4.R (continued...)
#need to increase the size of the device (plot) to incorporate long words
dev.new(width = 1000, height = 1000, unit = "px")
#create a wordcloud based on the collocations computed
wordcloud(words = colloq$collocation, freq = colloq$count,
          min.freq = 2, max.words = 200, random.order = FALSE, rot.per = 0.25,
          scale = c(4, .25), colors = brewer.pal(8, "Dark2"))
So, the wordcloud thus generated is as follows:
Data scientists may need to switch between languages, pass data between them and perform analyses accordingly, so that the main strength of each language can be utilized. Python is a very important language when it comes to machine learning and creating flow networks, whereas R makes it easy to work with data and create visualizations. Along with that, since data is usually hosted in large databases, a good knowledge of SQL helps us talk to those databases and pull the required data with ease.
The above task is very helpful, as it asks us to analyse the Slack community's message data, which is both structured and unstructured and is hosted in a database on the internet. We had to access it with a database connection string and get the data by querying it. After that, the data was analysed and visualized in R, as well as passed to Python via an API, which is the modern way of communicating information.
My takeaway from this task is that I found it quite difficult to create visualizations in Python. I had passed the data in a format that would need only one or two lines of R code to visualize, but in Python I had to make many modifications and try out many commands to get the visualization. For me, R is an easy language, and working with data and visualizations in it is very quick. I have decided that I will be using Python for machine learning tasks and R for creating visualizations. Though I haven't done it here, I will be passing data from Python to R and back via CSV files.