Introduction

In this project, I visualise the network structure of an online multi-participant chat room. The goal is to create an informative visualisation that captures the relationships between the people in the chat room based on their interactions.

What is Social Network Analysis

The goal of social network analysis is to uncover the relationships between actors by visualising the connections between them. In a network, the actors are represented as nodes and the relationships between them are represented as links or edges. The idea behind social network analysis is that the relationships between actors are not random, but are instead representative of the social processes that underlie them (Luke, 2015).

Data

The data I am working with comes from the NPS Chat Corpus (Forsyth, Lin and Martell), which is distributed as part of the NLTK package. The dataset includes 14 snapshots of chat rooms, each covering all activity within a time frame of a few hours. As part of a project at the University of Birmingham, some of these chats have been annotated for their response structure: each turn by a chat room participant is coded for whether it a) initiates a new topic, b) responds to a previous turn (and if so, which one), or c) is too ambiguous to classify as either a) or b).

A Look at the Data

First, we read in the coded data from an Excel file.

## # A tibble: 593 × 4
##    USER     MESSAGE                                TURN AGREEMENT
##    <chr>    <chr>                                 <dbl> <chr>    
##  1 STURGEON PUMA, i'll stalk to ya soon also          1 ?        
##  2 MEALYBUG have a good night                         2 ?        
##  3 DINGO    good u LUNGFISH?                          3 ?        
##  4 PUMA     lmao STURGEON                             4 1        
##  5 POMCHI   i got fired from dunkin donuts            5 X        
##  6 STURGEON me? a good nite? pssssh                   6 2        
##  7 GOBERIAN hello                                     7 X        
##  8 MEALYBUG lol                                       8 6        
##  9 BEAGADOR hello GOBERIAN                            9 7        
## 10 STURGEON that involves havin a good day. later    10 2        
## # … with 583 more rows

We can see that the data we just loaded has 593 rows and four columns: USER, MESSAGE, TURN and AGREEMENT. The USER column contains the anonymised names of the chat room participants. The MESSAGE column contains the messages sent in the chat room. The TURN column contains the number of the turn, with each individual contribution to the chat room (every time a participant hits enter) counting as one turn, starting with 1 at the first turn. The AGREEMENT column contains the final code for a given turn. The values in this column can be X, ?, or one or more numbers between 1 and the second-to-last turn of the chat.

Creating a Node List

There are two components we need to generate a network: a node list and an edge list. The node list is essentially just a list of all the unique participants in the chat room. Creating the node list is therefore quite simple: we just take all the unique items from the USER column of the chat data. Additionally, we will add an ID column, so that each user is identifiable by a unique number (we will use these IDs later for the edge list as well).

## # A tibble: 44 × 2
##       id user    
##    <int> <chr>   
##  1     1 STURGEON
##  2     2 MEALYBUG
##  3     3 DINGO   
##  4     4 PUMA    
##  5     5 POMCHI  
##  6     6 GOBERIAN
##  7     7 BEAGADOR
##  8     8 LUNGFISH
##  9     9 KODKOD  
## 10    10 BEEFALO 
## # … with 34 more rows

We see that we have 44 unique participants.
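The node list step described above can be sketched in base R. The code behind this report is not shown, so this is only one possible way to do it; the `chat` data frame is a hypothetical stand-in built from the first six rows of the table shown earlier.

```r
# Hypothetical stand-in for the coded chat data (first six rows shown earlier).
chat <- data.frame(
  USER = c("STURGEON", "MEALYBUG", "DINGO", "PUMA", "POMCHI", "STURGEON"),
  TURN = 1:6,
  AGREEMENT = c("?", "?", "?", "1", "X", "2"),
  stringsAsFactors = FALSE
)

# One row per unique participant, in order of first appearance,
# plus a running integer ID that the edge list can refer to later.
users <- unique(chat$USER)
nodes <- data.frame(
  id = seq_along(users),
  user = users,
  stringsAsFactors = FALSE
)
print(nodes)
```

Because `unique()` keeps the order of first appearance, the IDs follow the order in which participants first speak, which matches the numbering in the node list above.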

Creating an Edge List

While the node list records all the entities in a network, the edge list records the relationships between them. Edges are also called links or ties. In this case, a link (or edge) between two participants is established when one responds to the other. Each time participant A responds to participant B, we want to record a link of weight 1 from A to B. Because the act of responding to someone is directed, we are going to distinguish between the person who is doing the responding and the person who is being responded to. So if person A responds to person B, we are going to record it as A → B, and if person B responds to person A, we will record it as B → A. In other words, we are building a directed network.

To create an edge list, then, we will go through all the turns in our data that contain responses and record who is responding to whom. We will add this information in a ‘respondee’ column (who is being responded to) and a ‘weight’ column (how much this response counts towards the link between the two participants).

## # A tibble: 328 × 5
##    USER     TURN  AGREEMENT respondee weight
##    <chr>    <chr> <chr>     <chr>      <dbl>
##  1 PUMA     4     1         STURGEON       1
##  2 STURGEON 6     2         MEALYBUG       1
##  3 MEALYBUG 8     6         STURGEON       1
##  4 BEAGADOR 9     7         GOBERIAN       1
##  5 STURGEON 10    2         MEALYBUG       1
##  6 BEEFALO  16.1  15        BEAGADOR       2
##  7 BEEFALO  16.2  15        BEAGADOR       2
##  8 LUNGFISH 17    13        KODKOD         1
##  9 BEAGADOR 19.1  16        BEEFALO        2
## 10 BEAGADOR 19.2  16        BEEFALO        2
## # … with 318 more rows

We now have a table in which each row records the interaction between two participants. As you may notice, the weight column has some 1 values and some 2 values. The 2 values are for instances where someone’s turn was split over two turns (someone wrote a line, hit enter and then followed this up with another turn right after). When someone now responds to those “two in one” turns, the weighting gets recorded as 2 instead of 1.
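As a rough illustration of how the respondee can be looked up, here is a minimal base R sketch. It assumes that X and ? mark turns that are not responses, and it only handles turns coded with a single turn number. Multiple numbers and the split turns with weight 2 described above are left out. The `chat` data frame is again a hypothetical stand-in built from the first ten rows of the data.

```r
# Hypothetical stand-in for the coded chat data (first ten rows shown earlier).
chat <- data.frame(
  USER = c("STURGEON", "MEALYBUG", "DINGO", "PUMA", "POMCHI",
           "STURGEON", "GOBERIAN", "MEALYBUG", "BEAGADOR", "STURGEON"),
  TURN = 1:10,
  AGREEMENT = c("?", "?", "?", "1", "X", "2", "X", "6", "7", "2"),
  stringsAsFactors = FALSE
)

# Keep only turns whose code is a single turn number, i.e. responses.
# Assumption: "X" and "?" are treated as non-responses.
is_response <- grepl("^[0-9]+$", chat$AGREEMENT)
responses <- chat[is_response, ]

# The respondee is the author of the turn that is being responded to.
target_turn <- as.integer(responses$AGREEMENT)
responses$respondee <- chat$USER[match(target_turn, chat$TURN)]

# Simplification: every response counts once; split turns are not handled.
responses$weight <- 1
print(responses[, c("USER", "TURN", "AGREEMENT", "respondee", "weight")])
```

On these ten rows the sketch reproduces the first five rows of the table above, from PUMA responding to STURGEON through to STURGEON responding to MEALYBUG.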

We can now add together all those interactions, so that we have a cumulative weight for each participant pairing (in both directions). In the same step, we will also delete the TURN and AGREEMENT columns and replace the pseudonyms of the participants with the IDs we created in the node list. The result is our edge list.

## # A tibble: 101 × 3
##     from    to weight
##    <int> <int>  <int>
##  1    36     7      1
##  2    36    35      2
##  3    36     5      1
##  4    36     4      3
##  5    36    24      4
##  6     7    36      1
##  7     7    10      3
##  8     7    18      2
##  9     7    12     11
## 10     7     6      1
## # … with 91 more rows
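The aggregation step could look roughly like the following base R sketch. It is an illustration, not the code that produced the table above: `responses` and `nodes` are small hypothetical inputs that reuse the pseudonyms and IDs from the earlier tables.

```r
# Hypothetical per-response table (one row per response).
responses <- data.frame(
  USER = c("PUMA", "STURGEON", "MEALYBUG", "BEAGADOR", "STURGEON"),
  respondee = c("STURGEON", "MEALYBUG", "STURGEON", "GOBERIAN", "MEALYBUG"),
  weight = c(1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical excerpt of the node list, using the IDs shown earlier.
nodes <- data.frame(
  id = c(1, 2, 4, 6, 7),
  user = c("STURGEON", "MEALYBUG", "PUMA", "GOBERIAN", "BEAGADOR"),
  stringsAsFactors = FALSE
)

# Sum the weights per directed pair: A -> B is kept separate from B -> A.
edges <- aggregate(weight ~ USER + respondee, data = responses, FUN = sum)

# Replace the pseudonyms with node IDs and keep only from / to / weight.
edges$from <- nodes$id[match(edges$USER, nodes$user)]
edges$to <- nodes$id[match(edges$respondee, nodes$user)]
edges <- edges[, c("from", "to", "weight")]
print(edges)
```

Here STURGEON responds to MEALYBUG twice, so that pair ends up as a single row with weight 2, while MEALYBUG's response to STURGEON stays a separate row.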

Data Visualisation

Now we can create a social network diagram. There is a variety of packages in R to facilitate the plotting of social networks. Here, we work with visNetwork, because it comes with inbuilt interactive features.

The idea of this section is to move from the default network to a more and more informative and visually appealing visualisation.

Default Network

This is the default network visNetwork creates. You can move the nodes, zoom in and click on a node to highlight the connections this node has with other nodes.

Add arrows

There are many options for customizing this basic plot. For example, we can add arrows to mark the direction of the relationship.
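A minimal sketch of the default network and the arrow option, using a tiny hypothetical node and edge list. The call is wrapped in `requireNamespace()` so that the snippet still runs where visNetwork is not installed. The plots in this report may have been produced with different options.

```r
# Tiny hypothetical node and edge lists in the shape visNetwork expects:
# nodes need an `id` column, edges need `from` and `to`.
nodes <- data.frame(
  id = c(1, 2, 4),
  label = c("STURGEON", "MEALYBUG", "PUMA"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c(4, 1, 2),
  to = c(1, 2, 1),
  weight = c(1, 2, 1)
)

if (requireNamespace("visNetwork", quietly = TRUE)) {
  # Default interactive network.
  net <- visNetwork::visNetwork(nodes, edges)

  # Same network, with an arrowhead at the `to` end of every edge,
  # i.e. pointing at the participant who is being responded to.
  net_arrows <- visNetwork::visEdges(net, arrows = "to")

  # Printing `net_arrows` (or leaving it as the last expression of an
  # R Markdown chunk) renders the interactive widget.
}
```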

Change Colors Globally

We can also change the colors of both the edges and the nodes. To get more interesting color palettes, I am using the wesanderson package.

Adjust Node Size

Let’s make the node size informative, so that the node size represents information about how many turns a participant takes. The bigger a participant’s node, the more times they write something in the chat.
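One way to get there, sketched in base R: count the turns per participant and store the count in a `value` column, which visNetwork uses to scale node size. The `chat_users` vector is a hypothetical stand-in for the USER column (first ten turns only).

```r
# Hypothetical stand-in for the USER column (first ten turns shown earlier).
chat_users <- c("STURGEON", "MEALYBUG", "DINGO", "PUMA", "POMCHI",
                "STURGEON", "GOBERIAN", "MEALYBUG", "BEAGADOR", "STURGEON")

users <- unique(chat_users)
nodes <- data.frame(
  id = seq_along(users),
  label = users,
  stringsAsFactors = FALSE
)

# Number of turns per participant, looked up by pseudonym.
turn_counts <- table(chat_users)
nodes$value <- as.integer(turn_counts[nodes$label])
print(nodes)
```

Passing this `nodes` data frame to `visNetwork()` then draws participants with more turns as larger nodes.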

Adjust Node Color for Response Ratio

Another interesting way of characterising this data is by looking at the number of responses a participant receives and how many responses they themselves make. This can be captured as a ratio of responses in to responses out and can be encoded in the color of each node. We can do this with RGB. A participant who only responds to other people but gets no responses in turn will be neon purple. A participant who only receives responses but does no responding will be neon yellow. All other ratios will fall on a color spectrum between those poles.

## # A tibble: 44 × 8
##       id label    color   value links_in links_out in_ratio out_ratio
##    <int> <chr>    <chr>   <int>    <int>     <int>    <dbl>     <dbl>
##  1     1 STURGEON #D69C4E     3        2         2    0.5       0.5  
##  2     2 MEALYBUG #D69C4E    14       11         9    0.55      0.45 
##  3     3 DINGO    #D69C4E     1        0         0  NaN       NaN    
##  4     4 PUMA     #D69C4E   109       42        75    0.359     0.641
##  5     5 POMCHI   #D69C4E    27       24         9    0.727     0.273
##  6     6 GOBERIAN #D69C4E    25       18        12    0.6       0.4  
##  7     7 BEAGADOR #D69C4E    54       25        38    0.397     0.603
##  8     8 LUNGFISH #D69C4E     4        0         2    0         1    
##  9     9 KODKOD   #D69C4E     2        2         0    1         0    
## 10    10 BEEFALO  #D69C4E    95       50        49    0.505     0.495
## # … with 34 more rows

We can see that the nodes in the central cluster all fall somewhere in between the extremes. On the edges, we have two pairs of nodes that each consist of one participant with only responses in and one participant with only responses out. All nodes that do not have any edges (no responses in or out) remain a pale blue.
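The mapping from response ratio to color can be sketched as a linear blend between two RGB values. The exact colors used in this report are not given, so the purple, yellow and pale blue below are assumptions, as is the choice of a simple linear interpolation. The counts are taken from the table above.

```r
# Response counts for a few participants, taken from the table above.
nodes <- data.frame(
  label = c("STURGEON", "DINGO", "PUMA", "LUNGFISH", "KODKOD"),
  links_in = c(2, 0, 42, 0, 2),
  links_out = c(2, 0, 75, 2, 0),
  stringsAsFactors = FALSE
)

# Share of a participant's links that are incoming; 0 / 0 gives NaN.
nodes$in_ratio <- nodes$links_in / (nodes$links_in + nodes$links_out)

purple <- c(191, 0, 255)  # assumed "neon purple": only responses out
yellow <- c(255, 255, 0)  # assumed "neon yellow": only responses in

ratio_to_hex <- function(r) {
  # Nodes without any links keep an assumed pale blue.
  if (is.nan(r)) return("#ADD8E6")
  # Linear blend between the two poles.
  mix <- (1 - r) * purple + r * yellow
  rgb(mix[1], mix[2], mix[3], maxColorValue = 255)
}

nodes$color <- vapply(nodes$in_ratio, ratio_to_hex, character(1))
print(nodes)
```

With these assumptions LUNGFISH (only responses out) comes out pure purple, KODKOD (only responses in) pure yellow, and DINGO (no links at all) pale blue.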

Adjust Edge Width for Number of Interactions

We can also adjust the edge width as a function of the weight column, which is based on the number of interactions between two nodes. The more interactions two nodes have, the wider the link connecting them will be.
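In visNetwork, edge width can be set through a `width` column in the edge data frame. A minimal sketch, using four rows from the edge list above and assuming the weight is used directly as the width with no rescaling:

```r
# Four rows taken from the edge list shown earlier.
edges <- data.frame(
  from = c(36, 36, 7, 7),
  to = c(4, 24, 10, 12),
  weight = c(3, 4, 3, 11)
)

# Assumption: use the cumulative weight directly as the line width.
edges$width <- edges$weight
print(edges)
```

For very heavy links it may look better to rescale or cap the width, but that is a design choice beyond this sketch.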

Dropping Edges

Removing edges that are low in their weighting can help make a network diagram clearer, as it removes some of the visual noise. Theoretically, a low weighting of an edge represents a weak link between two actors. Therefore, excluding edges with low weighting scores means removing weak links in the network. However, what constitutes a ‘weak’ link will vary from network to network. Let us therefore have a look at the distribution of weights across this particular chat.

Edge Weight Distribution

As we can see, the most frequent weighting of directed links in this data is 1. Let’s remove links with a weight of 1 and compare that to the original network with all edges included.
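Both steps, tabulating the weights and dropping the weakest links, can be sketched in a few lines of base R. The `edges` data frame below contains only the first ten rows of the edge list shown earlier, so the counts are illustrative and not those of the full network.

```r
# First ten rows of the edge list shown earlier.
edges <- data.frame(
  from = c(36, 36, 36, 36, 36, 7, 7, 7, 7, 7),
  to = c(7, 35, 5, 4, 24, 36, 10, 18, 12, 6),
  weight = c(1, 2, 1, 3, 4, 1, 3, 2, 11, 1)
)

# How often does each weight occur?
weight_counts <- table(edges$weight)
print(weight_counts)

# Keep only links with a weight above 1.
strong_edges <- edges[edges$weight > 1, ]
print(strong_edges)
```

The node list is left untouched, so participants whose only links had a weight of 1 stay in the plot as unconnected nodes.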

Network All Edge Weights

Network with Edge Weight > 1

We can see that the central cluster becomes less busy. Four nodes that only have one-weighted links have dropped out of the center and moved to the edge.