Social Network Analysis of a Discussion on Reddit

James Speckart ECI 588 Spring 2023

Introduction

Text discussions on the internet can take many forms and have many different structures, power dynamics, and levels of oversight. Each venue for discussion, whether a website forum or a smartphone app, can influence how its users converse, whether by rewarding short or long comments, amplifying debate, or providing heavy or light oversight.

While many discussion sites have conversations that can last for months or years, discussions on the social media website Reddit.com tend to last about a day or two before they lose popularity. This analysis looks at the structure of a discussion on Reddit, one of the most popular websites in the world, and one that is built on a combination of openness and user control.

More specifically, this analysis takes data from a popular discussion forum (a “subreddit”) at reddit.com/r/teachers, gathering data from a discussion thread titled “I strongly believe that the hatred of remote learning is that parents are being called out for lack of parenting.” This conversation among hundreds of Reddit users, most of whom are likely to be teachers, is a useful test case to see where this Reddit community falls between two extremes: is it an open dialogue among equals, or is it a highly controlled conversation in which a few users hold clear power over the content of the conversation?

About Reddit

Reddit is a free-to-use link sharing and discussion platform that is one of the more popular sites on the internet. As of 2022, it is the fifth most visited website in the United States, with 430 million active users each month and over 2.8 million separate discussion forums called “subreddits”, where users can post and discuss news stories, images, videos, or personal posts. Each subreddit is moderated by selected users and has its own rules about what can be posted or discussed, although the Reddit administration has occasionally stepped in to moderate subreddits when they break Reddit rules about content or behavior. Roughly 41% of the Reddit user base is in the United States, making it disproportionately American in demographic makeup.

The Teachers subreddit (at https://www.reddit.com/r/teachers) was created in December 2008, and currently has 426,000 “members”, who are Reddit users who have subscribed to the subreddit to receive announcements or news updates. While Reddit is free to use and anyone can join the Teachers subreddit, this analysis assumes that most users in the Teachers subreddit have chosen to join the discussions because they are teachers themselves who wish to communicate with fellow teachers. Most Reddit users prefer the anonymity of screen names and share little to no identifiable data, so the validity of this assumption is very difficult to confirm, but the author believes it is reasonable that anyone who seeks out the Teachers subreddit and posts there has a high likelihood of being a current or former teacher.

Data Preparation

Data were downloaded in R through the RedditExtractoR package, which allows direct queries of the Reddit API. Due to limits placed on the Reddit API, this package cannot download all of the comments and associated metadata from large threads. For this study, I went to the Teachers subreddit and used the sorting options to view the most active discussion threads, and then selected one that covered a topical subject about instructional methods titled “I strongly believe that the hatred of remote learning is that parents are being called out for lack of parenting” (https://www.reddit.com/r/Teachers/comments/jvtat3/i_strongly_believe_that_the_hatred_of_remote/). This thread had 898 total comments, and the RedditExtractoR package was able to download 471 comments with their associated metadata.

Extracting edge data from the resulting dataframe involved some custom coding. The nested structure of discussion on Reddit is described in RedditExtractoR data as a column of text where the first reply to a submission is numbered “1”, the second reply is numbered “2”, and so on. A comment on the first reply is given the string “1_1”, the first comment on the second reply is numbered “2_1”, and this pattern continues for subsequent comments so that the second comment on the second reply would be numbered “2_2”, and the first response to that comment is marked as “2_2_1”. Thus a branched structure for each initial post is represented by a string where the numbers indicate the order of comments. With this information, we can determine what post a particular comment is responding to. For example, comment “2_2_1” is the first comment to respond to comment “2_2”, and we can determine the parent comment of any child comment by unraveling the nested numerical sequence. Once this is done, an edgelist can be constructed of senders and receivers, where the sender is the author of a particular response, and the receiver is the post that they are responding to.
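The parent lookup described above can be sketched in base R. This is a minimal illustration rather than the code used for this study: the example comments, the author names, and the `original_poster` value are hypothetical, and the sketch assumes the receiver is recorded as the author of the parent post.

```r
# Hypothetical example data: `structure` mimics the nesting string described above.
comments <- data.frame(
  structure = c("1", "2", "1_1", "2_1", "2_2", "2_2_1"),
  author    = c("user_a", "user_b", "user_c", "user_a", "user_d", "user_b"),
  stringsAsFactors = FALSE
)
original_poster <- "user_op"  # hypothetical author of the thread's initial post

# Drop the final "_<number>" segment to recover the parent's structure string.
# Top-level replies contain no underscore and have no parent comment, so they get NA.
parent_structure <- function(structure) {
  ifelse(grepl("_", structure), sub("_[0-9]+$", "", structure), NA_character_)
}

comments$parent <- parent_structure(comments$structure)

# Look up the author of each parent comment.
parent_author <- comments$author[match(comments$parent, comments$structure)]

# Top-level replies respond to the initial post, so they point to its author.
parent_author[is.na(comments$parent)] <- original_poster

edgelist <- data.frame(
  sender   = comments$author,
  receiver = parent_author,
  stringsAsFactors = FALSE
)
print(edgelist)
```

Because only part of a large thread can be downloaded, a parent comment may be missing from the data; in this sketch such a reply would keep an `NA` receiver and would need to be handled separately.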

Initial graph

The resulting network of users has one component, which makes intuitive sense as all Reddit posts in a discussion thread are tied to the original post:

 network_reddit |> 
    ggraph(layout = "stress") +  
    geom_edge_link(color= "#CC6000") +
    geom_node_point() +
    ggtitle("Reddit discussion by teachers on parents and remote learning") +
    theme_graph() +
    theme(plot.title = element_text(size = 12, face = "bold"))

The main cluster of nodes consists of direct responses to the initial post, although there are a number of branched discussions that continue several layers deep, reflecting discussions between multiple users.

Basic Network Metrics

The resulting network has 302 nodes and 479 edges. It has one component with an overall graph density of 0.005, making it a sparse graph. It has a diameter of 15, with a mean distance between nodes of 5.5 edges. Its reciprocity measures 0.26, and has a transitivity measure of 0.009.
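As a quick check on the density figure, the calculation can be reproduced by hand. This sketch assumes the network is treated as directed with no self-loops, so that each of the 302 authors could send a tie to each of the other 301.

```r
n_nodes <- 302
n_edges <- 479

# Number of possible directed ties among distinct authors
possible_edges <- n_nodes * (n_nodes - 1)

# Density is the share of possible ties that are actually present
density <- n_edges / possible_edges
round(density, 3)
```

With these counts, about 0.5% of all possible author-to-author ties are present, which is why the graph is described as sparse.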


The degree centralization of the network, based on total degree, is 0.27. This shows that many of the reply “branches” from the initial discussion “trunk” are quite short, and many are solitary responses directly to the initial post. These numbers may reflect a tendency of many individual users to contribute a single comment in response to the main post and not engage in further discussions. A minority of users participate in longer discussion chains, which themselves may not contain many “branches”.

The average node betweenness centrality is equal to the number of edges, 471, and the average edge betweenness centrality is 286. These high numbers indicate a tightly knit conversation, which makes sense with the linear branched structure of Reddit discussions.

network_reddit |> 
    ggraph(layout = "fr") +
    aes(color= clr) +
    geom_label(aes(x=x, y=y, label = author), nudge_y = 0.1, label.size = NA, color = "darkgray", alpha = .2) +
    geom_edge_link(color= "#CC6000") +
    scale_color_gradient("ggplot2") +
    geom_node_point(aes(size=total_comment_score, color = total_comment_score)) +
    ggtitle("Reddit discussion by teachers on parents and remote learning") +
    theme_graph() +
    theme(plot.title = element_text(size = 12, face = "bold"))

We see a large cluster around the main thread post where commenters replied directly to the original post and not to other authors. But we also see a number of edges that denote conversation threads that branch away from the original comment.

A question arose whether the heaviest Reddit users have a dominating impact on the discussions in this data, indicated by filling connector roles between separate clusters, or whether we see evidence of a more democratic discussion where individual authors have less control over the conversations. Reddit forums can exhibit either dynamic: some are highly dominated by a small circle of users, but others have discussions that show involvement by a wide diversity of independent actors.

When we look at the node measures for in-degree and out-degree, we see that both average about 1.6, while the median out-degree is 1 and the median in-degree is 0. An out-degree of 1 indicates that an author has written one post in the thread, and an in-degree of 1 indicates that one reply was made to that author’s posts.

summary(node_measures)
##     author          total_comment_score   gold_count     activity_proportion
##  Length:302         Min.   :    5       Min.   :0.0000   Min.   :0.0010     
##  Class :character   1st Qu.: 3822       1st Qu.:0.0000   1st Qu.:0.5847     
##  Mode  :character   Median : 8027       Median :0.0000   Median :0.9920     
##                     Mean   : 9581       Mean   :0.3643   Mean   :0.7720     
##                     3rd Qu.:12881       3rd Qu.:0.0000   3rd Qu.:1.0000     
##                     Max.   :68861       Max.   :5.0000   Max.   :1.0000     
##                     NA's   :11          NA's   :11                          
##    in_degree         out_degree    
##  Min.   :  0.000   Min.   : 1.000  
##  1st Qu.:  0.000   1st Qu.: 1.000  
##  Median :  0.000   Median : 1.000  
##  Mean   :  1.586   Mean   : 1.586  
##  3rd Qu.:  1.000   3rd Qu.: 1.000  
##  Max.   :157.000   Max.   :57.000  
## 

The top in-degree is for the initial post, which had 157 users respond directly to it. The top out-degree measure of 57 captures an artifact of Reddit data: some posts are deleted after posting, but edges to these deleted posts remain in this data. The 57 out-degrees are shared by a collective author name of “[deleted]”, and the top named authors have out-degrees of 10 and 7, indicating that they wrote 10 and 7 comments in the thread respectively.
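To make the degree measures concrete, the sketch below counts in-degree and out-degree directly from an edgelist in base R. The edgelist and author names are hypothetical and only illustrate the counting, including how a shared “[deleted]” name accumulates out-degree from several separate comments.

```r
# Hypothetical edgelist: each row is one comment (sender) replying to a receiver.
edges <- data.frame(
  sender   = c("user_a", "user_b", "user_c", "user_a", "[deleted]", "[deleted]"),
  receiver = c("user_op", "user_op", "user_a", "user_b", "user_op", "user_c"),
  stringsAsFactors = FALSE
)

authors <- unique(c(edges$sender, edges$receiver))

# Out-degree: comments written; in-degree: replies received.
out_degree <- table(factor(edges$sender, levels = authors))
in_degree  <- table(factor(edges$receiver, levels = authors))

degrees <- data.frame(
  author     = authors,
  in_degree  = as.integer(in_degree),
  out_degree = as.integer(out_degree),
  stringsAsFactors = FALSE
)
print(degrees)

# Every edge adds one to someone's out-degree and one to someone's in-degree,
# so both means equal edges / nodes.
mean_degree <- nrow(edges) / length(authors)
```

The same identity explains why the mean in-degree and mean out-degree in the summary table are both 1.586: 479 edges divided by 302 nodes.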

One simple way of looking at the power of heavy Reddit users in this data is to look at the network of just the users with the most lifetime activity on Reddit. Lifetime activity was captured through the RedditExtractoR users() function by summing the total upvote/downvote scores of all posts for each account. If we look at the heaviest Reddit users, as measured by authors who have lifetime post scores in the top 25% of authors in the dataset, we can see that the top authors are not in heavily reciprocated dyads or triads that dominate the discussion. Instead, their network is relatively diffuse:

network_reddit |> 
  activate(nodes) |>
  filter(activityproportion >.750) |>
  ggraph(layout = "fr") +
  aes(color= clr) +
  geom_label(aes(x=x, y=y, label = author), nudge_y = 0.1, label.size = NA, color = "darkgray", alpha = .2) +
  geom_edge_link(color= "#CC6000") +
  scale_color_gradient("ggplot2") +
  geom_node_point(aes(size=total_comment_score, color = total_comment_score)) +
  ggtitle("Heaviest Reddit users: discussion by teachers on parents and remote learning") +
  theme_graph() +
  theme(plot.title = element_text(size = 12, face = "bold"))

There are three clusters of heavy Reddit users, with two large clusters and one smaller cluster. These are clusters of heavy Reddit users engaging in discussion with each other. However, the majority of nodes are isolates, indicating that they posted individual “hit and run” comments to the discussion without regard to the potential power brokers in the network.

When we look at the bottom quartile of Reddit users in the data, as measured by authors whose lifetime Reddit post scores are in the bottom 25% of authors in the dataset, we also see a very sparse network:

network_reddit |> 
  activate(nodes) |>
  filter(activityproportion <.250) |>
  ggraph(layout = "fr") +
  aes(color= clr) +
  geom_label(aes(x=x, y=y, label = author), nudge_y = 0.1, label.size = NA, color = "darkgray", alpha = .2) +
  geom_edge_link(color= "#CC6000") +
  scale_color_gradient("ggplot2") +
  geom_node_point(aes(size=total_comment_score, color = total_comment_score)) +
  ggtitle("Lightest Reddit users: discussion by teachers on parents and remote learning") +
  theme_graph() +
  theme(plot.title = element_text(size = 12, face = "bold"))

Both of these networks show little evidence that Reddit power users are brokers for discussion in the data, which is backed up by the average betweenness measure being equal to the total number of edges. This discussion has a structure that befits an open discussion by equals, where many comments are one-time posts by infrequent users.

To get a sense of what the most popular comments are in the discussion, we can look at the six comments with the highest upvote score:

buildingnetwork |>
  arrange(desc(score)) |>
  select(comment) |>
  head() |>
  kable()
|comment|
|:---|
|In the past, parents always had enough distance to say “oh it’s the teacher’s fault” “he gets distracted by his friends” blah, blah, blah. Now they can see that the root of the problem is their kid. You can’t wake up on time, open your computer and listen? You can’t stay off YouTube for an hour? This is an unpleasant realization after years of throwing teachers, admin and other students under the bus. Get them out of my house so I can return to my illusion.|
|I think it also shines a light on the fact that parents were supposed to have been involved in the educational process all along and many were not. Being a parent actually involves effort - academically as well - and school exists to be part of a student’s educational process, not as an opportunity for you to hand your kids off to a teacher to “babysit” 8 hours per day every weeekday and believe that an educated child will magically emerge 13 academic years later. The parent is also an essential part of learning and while that’s more true now, it’s always been true, but now it’s pretty obvious who recognizes that responsibility and who doesn’t. Not lacking empathy for parents who are working from home while managing kids. I get that it’s hard (we are doing it too). But there’s a huge difference between the “this is hard” and the “this shouldn’t be my problem” people and it’s very clear.|
|Another realization that many parents are having is that they relied on not having to be a parent for the majority of the day. Schools have taken on so much responsibility in what used to be the realm of parents that parents dont know what to do.|
|Employers have cut wages so low that both parents have to work and neither of them have time to parent the child, let alone be able to develop parenting skills.|
|My district (CCSD) has been inflating graduation rates for a decade. Now that everything is online and digitally recorded, our passing rates are back to what our graduation rates were 10 years ago. The superintendent is claiming that there is an “academic crisis” and that we must reopen schools because they can’t hide that half the kids aren’t doing work anymore.|
|The reason I’ve decided against kids is that, while I would enjoy children, I would fucking hate the day-to-day work involved in raising them, and I think a lot of people feel that way, but had kids anyway, and now we’re seeing it.|


Sentiment Analysis

The vader package for R was used to calculate a sentiment score for each post in the data. The vader package identifies words in the data that have matches in its lexicon, each of which carries a predetermined positive or negative valence. These valences are combined across all matching words in a post, with adjustments for context such as negation and emphasis, and normalized into a single compound score between -1 and 1 that represents the overall positivity or negativity of that Reddit post.

In this data, the mean compound score is 0.17, indicating slightly positive sentiment overall, although individual posts range from strongly negative (-0.97) to strongly positive (1.00).

summary(vaderscores$compound)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9650 -0.2060  0.2360  0.1678  0.6420  0.9970

The variability of sentiment across posts is easier to see in a barplot of the compound sentiment of each post (i.e. each post gets one sentiment value that is either positive or negative):

barplot(height=vaderscores$compound, col=ifelse(vaderscores$compound > 0, 3, 2), border=NA, main = "Comment Sentiment Barplot")
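The barplot colors posts by the sign of their compound score. A common convention from the VADER documentation instead treats scores of 0.05 or higher as positive, scores of -0.05 or lower as negative, and anything in between as neutral. The sketch below applies that convention to a few hypothetical compound scores; the `example_scores` vector is made up for illustration and is not taken from the thread.

```r
# Hypothetical compound scores, not values from the Reddit data
example_scores <- c(-0.965, -0.206, 0, 0.03, 0.236, 0.997)

# Label each score using the conventional +/- 0.05 cutoffs
classify_sentiment <- function(compound) {
  ifelse(compound >= 0.05, "positive",
         ifelse(compound <= -0.05, "negative", "neutral"))
}

sentiment_label <- classify_sentiment(example_scores)

# Count how many posts fall into each category
table(sentiment_label)
```

Applied to the real compound scores, a tabulation like this would show how many posts are effectively neutral, which the two-color barplot cannot distinguish.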

Exponential Random Graph Model

Finally, an Exponential Random Graph Model (ERGM) was run to see whether the network structure in the data could be expected purely through chance. An ERGM models the probability of ties in the observed network as a function of chosen structural features and node attributes, and estimates its coefficients by comparing the observed network with randomly simulated networks on the same set of nodes. If a feature’s coefficient is significantly different from zero, that feature appears more (or less) often than chance alone would produce, and we can say that we can draw meaningful insights from our structural analysis.

For this study, an ERGM was fit with a mutual term, which counts reciprocated ties between pairs of authors, under the assumption that an author is more likely to reply to another author who has already replied to them. In addition, the measure of total Reddit user score was included as a node covariate to see if heavy Reddit users are more likely to make connections than other Reddit users. The results are shown below:
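The intuition of comparing the observed network against chance can be illustrated with a much simpler baseline than an ERGM. The sketch below, which is not the model used in this study, simulates random directed graphs with the same node and edge counts (302 and 479) and compares their reciprocity with the observed value of 0.26. It assumes reciprocity is defined as the share of edges whose reverse edge also exists, and the seed and number of simulations are arbitrary choices.

```r
set.seed(588)  # arbitrary seed for reproducibility

n_nodes <- 302
n_edges <- 479
observed_reciprocity <- 0.26

# Reciprocity of one random directed graph with no self-loops or duplicate edges
random_reciprocity <- function(n_nodes, n_edges) {
  # Sample distinct ordered pairs (from, to) by index, without replacement
  n_pairs <- n_nodes * (n_nodes - 1)
  idx  <- sample.int(n_pairs, n_edges)
  from <- (idx - 1) %/% (n_nodes - 1) + 1
  k    <- (idx - 1) %% (n_nodes - 1) + 1
  to   <- ifelse(k >= from, k + 1, k)  # shift past `from` to avoid self-loops

  forward  <- paste(from, to)
  backward <- paste(to, from)
  mean(backward %in% forward)  # share of edges that are reciprocated
}

simulated <- replicate(200, random_reciprocity(n_nodes, n_edges))

mean(simulated)                          # typical reciprocity under pure chance
mean(simulated >= observed_reciprocity)  # share of random graphs matching the data
```

In a sparse random graph the chance that any given reply is reciprocated is roughly the graph density, so the simulated values should sit far below the observed reciprocity. An ERGM goes further by estimating how strongly each feature shifts the odds of a tie while accounting for the other terms in the model.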

summary(ergm_activity)
## Call:
## ergm(formula = teacher_network ~ edges + mutual + nodecov("activity_proportion"))
## 
## Monte Carlo Maximum Likelihood Results:
## 
##                             Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges                       -5.79913    0.15377      0 -37.713   <1e-04 ***
## mutual                       4.69654    0.16572      0  28.341   <1e-04 ***
## nodecov.activity_proportion  0.08812    0.09106      0   0.968    0.333    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 126436  on 91204  degrees of freedom
##  Residual Deviance:   5090  on 91201  degrees of freedom
##  
## AIC: 5096  BIC: 5124  (Smaller is better. MC Std. Err. = 0.7883)

Our ERGM indicates a strong tendency toward reciprocity: the mutual term has a log-odds estimate of 4.7, statistically significant at p < 0.0001, meaning that a reply from one author to another is far more likely when a reply already exists in the opposite direction. However, being a heavy Reddit user is not associated with forming ties: the activity_proportion estimate (0.09) is not statistically significant (p = 0.33). This indicates that heavy Reddit users are no more likely to form ties than other Reddit users.
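Because ERGM coefficients are reported as log-odds, it can help to convert them to odds and probabilities. The sketch below uses the estimates from the summary table and, as a simplifying assumption, ignores the small and non-significant activity_proportion term.

```r
coef_edges  <- -5.79913
coef_mutual <-  4.69654

# Multiplicative change in the odds of a tie when it would reciprocate an existing tie
odds_ratio_mutual <- exp(coef_mutual)

# Conditional probability of a tie that would NOT complete a reciprocated pair
p_unreciprocated <- plogis(coef_edges)

# Conditional probability of a tie that WOULD complete a reciprocated pair
p_reciprocating <- plogis(coef_edges + coef_mutual)

round(c(odds_ratio = odds_ratio_mutual,
        p_unreciprocated = p_unreciprocated,
        p_reciprocating = p_reciprocating), 3)
```

Under these assumptions, the odds of a tie are roughly 110 times higher when it reciprocates an existing tie, which moves the conditional tie probability from about 0.003 to about 0.25.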

Conclusion

This analysis of a large discussion thread on Reddit shows an egalitarian power structure among users in the reddit.com/r/teachers forum. The data were taken from a discussion on a contentious topic, the role of parents in resistance to distance learning, and yet the discussion itself is characterized by a large number of posts from new and old Reddit users alike. Even a sentiment analysis of the posts shows that while highly emotional words were common, the overall discussion was balanced between positive and negative sentiments.

While different subreddits have different rules about allowable posts, and some have featured heavy manipulation by forum moderators, this discussion on r/teachers seems to embody the open, free-flowing discussion that is often touted by free discussion websites such as Reddit. Because Reddit is free for users, and because r/teachers allows any Reddit user to post to its discussion threads, this social network analysis reveals a nearly non-existent power structure within the discussion flow. Heavy Reddit users do not control the flow of communication in the discussion, as seen both graphically in the network of the top 25% heaviest Reddit users and also in the results of our ERGM. Our analysis indicates that r/teachers is a democratic space for teachers to discuss even controversial topics.

It is possible that moderators have selectively deleted some posts, which might greatly alter the network structure of this discussion, and would be invisible to our data gathering tools. However, we retained all the data that could be obtained, including posts that were later deleted by their authors, in order to control for this concern as much as possible. Follow-up research could include cooperation with Reddit to obtain all data, including moderation decisions, to build a maximally complete network for future analysis.