Analyzing Conversations Between Teachers in the r/teaching SubReddit Using Bigrams

The link aggregation and discussion website Reddit (https://www.reddit.com) is one of the internet’s most popular sites, and it provides a medium for users to create their own discussion groups based on their own interests. Each discussion group is called a subreddit, and one subreddit named r/teaching has been created by teachers (predominantly in the U.S.) to discuss their job-related concerns and stories.

This subreddit has over 107,000 members and contains many thousands of posts. The question of concern here is whether text mining data analysis methods, specifically bigrams, can be informative about the most common topics discussed by teachers in the r/teaching subreddit.

Data Discovery

The author used the post sorting tool on Reddit to find the most heavily read posts of all time in r/teaching. Because several of these threads had occurred in the past 12 months, a two-month period near the start of the current school year was chosen for web scraping. All posts from September 1, 2022 through October 31, 2022 were scraped using the Reddit search tool at https://camas.unddit.com. This produced a dataset of 15,866 posts in JSON format, which was imported into R. Some spam posts and automatically generated moderator notifications were removed, leaving a dataset of 14,776 posts. The body text of each post was written to a .csv file for text analysis.
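
The import and cleaning step described above might look roughly like the following. This is a minimal sketch rather than the author's code: the field names `author` and `body`, the `AutoModerator` account filter, and the `[removed]`/`[deleted]` placeholders are assumptions about the export format, and the inline JSON is made-up sample data.

```r
library(jsonlite)

# Made-up sample in an assumed export layout (one JSON object per post)
sample_json <- '[
  {"author": "teacher_a", "body": "Any tips for pacing a 45 minute class?"},
  {"author": "AutoModerator", "body": "Your post was removed automatically."},
  {"author": "teacher_b", "body": "[removed]"}
]'

# An array of objects with matching keys is simplified to a data frame
raw_posts <- fromJSON(sample_json)

# Drop moderator notices and removed/deleted placeholders (assumed markers)
is_bot <- raw_posts$author == "AutoModerator"
is_empty <- raw_posts$body %in% c("[removed]", "[deleted]")
clean_posts <- raw_posts[!is_bot & !is_empty, , drop = FALSE]

# Keep only the body text in a column named x, mirroring the CSV read in below
body_text <- data.frame(x = clean_posts$body)
```

Spam detection is harder to automate than these two filters and would likely need manual review or keyword rules specific to the dataset.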

# Load packages (tidyverse supplies readr, stringr, dplyr, tidyr, and ggplot2)

library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
library(rmarkdown)

# Read in data

redditposts <- read_csv("~/LD and T Courses/r teaching for text analysis.csv")
## New names:
## • `` -> `...1`
## Rows: 14776 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): x
## dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

By count, many posts were still very short and contained inconsequential content. Therefore, all posts under 50 characters in length were eliminated so that only more in-depth, content-rich posts would be considered for bigram analysis.

# Count characters in each entry, remove posts under 50 characters

posts <- data.frame(text = redditposts$x[str_length(redditposts$x) >= 50])
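
One subtlety of a length filter like this is how missing values behave: `str_length()` returns `NA` for an `NA` input, and indexing with an `NA` logical produces an `NA` element rather than dropping it. The sketch below uses a made-up vector (`toy_text` is hypothetical, standing in for `redditposts$x`) to show a variant that wraps the comparison in `which()` so that missing bodies are dropped.

```r
library(stringr)

# Hypothetical stand-in for redditposts$x, including one missing body
toy_text <- c(
  "Thanks!",
  NA,
  "I have taught middle school science for ten years and still struggle with pacing."
)

# which() keeps only positions where the comparison is TRUE, so NA is dropped
keep <- which(str_length(toy_text) >= 50)
toy_posts <- data.frame(text = toy_text[keep])

toy_posts
```

Whether this matters for the actual dataset depends on whether any post bodies were empty in the CSV, which the source does not state.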

Creating Bigrams

Next, the data was tokenized into bigrams using the unnest_tokens() function in R’s tidytext package. The first and second words of each bigram were then split into separate columns of the dataframe so that stop words could be removed from each column, including custom stop words that catch most of the remaining spam posts and HTML remnants. Finally, the filtered rows were counted and the two words in each row were reunited into a single bigram.

invisible({capture.output({
# tokenize text

reddit_bigrams <- posts %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# create dataframe with bigrams by count

reddit_bigrams %>%
  count(bigram, sort = TRUE)

# create dataframe with bigram words split into separate columns

reddit_bigrams_separated <- reddit_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# filter out stop words from separated columns

reddit_bigrams_filtered <- reddit_bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# create custom stop word dictionary

custom_stops <- data.frame(word = c("https","amp","utm_name","ios_app","utm_medium","belledelphine","08","onlyfans","f0lder"))

# filter out custom stop words from separated columns

reddit_bigrams_filtered <- reddit_bigrams_filtered %>%
  filter(!word1 %in% custom_stops$word) %>%
  filter(!word2 %in% custom_stops$word)


# create dataframe with separated columns by count

reddit_bigram_counts <- reddit_bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)


# Reunite separate words into bigrams after filtering and counting complete

reddit_bigrams_united <- reddit_bigram_counts %>%
  unite(bigram, word1, word2, sep = " ")

})})
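
To make the effect of each step easier to see, the same pipeline can be run on a couple of made-up sentences. This is an illustrative sketch only: `toy_posts` and its text are invented, and only the standard `stop_words` list is applied.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up posts for illustration only
toy_posts <- tibble(text = c(
  "I love writing lesson plans.",
  "My lesson plans are about the mental health of students."
))

toy_bigrams <- toy_posts %>%
  # overlapping word pairs, lowercased, punctuation stripped
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # split each pair so each word can be checked on its own
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # drop any pair containing a stop word, such as "my lesson"
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # count surviving pairs, most frequent first
  count(word1, word2, sort = TRUE) %>%
  # glue the words back together for display
  unite(bigram, word1, word2, sep = " ")

toy_bigrams
```

Because "lesson plans" appears in both sentences and every other surviving pair appears once, it sorts to the top with a count of 2.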

Table of Bigrams

The full list of the 53,320 remaining bigrams in the data can be navigated below, sorted by frequency.

paged_table(reddit_bigrams_united)

Data Analysis

The large number of bigrams in the data is unwieldy, and scrolling through pages of them is an inefficient method of analysis. Even so, the first page of results shows some topics of note. In particular, “mental health” ranks as the third most common bigram, just after “lesson plans”, which suggests that much discussion in the r/teaching subreddit concerns psychological health, probably that of the teachers themselves. Student psychological health also appears to be a top concern: “special education” and “special ed” rank as the 9th and 10th most common bigrams, and the 4th most common bigram, “classroom management,” incorporates elements of student psychology as well.
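
Rather than paging through the table, the rank of specific bigrams can be looked up directly. The sketch below assumes a data frame with the same `bigram` and `n` columns as `reddit_bigrams_united`; the counts in `toy_counts` are hypothetical and do not come from the dataset.

```r
library(dplyr)

# Hypothetical counts standing in for reddit_bigrams_united (columns: bigram, n)
toy_counts <- tibble(
  bigram = c("lesson plans", "mental health", "classroom management", "water bottle"),
  n = c(40, 35, 30, 5)
)

topics_of_interest <- c("mental health", "classroom management")

ranked <- toy_counts %>%
  mutate(rank = min_rank(desc(n))) %>%   # rank 1 = most frequent; ties share a rank
  filter(bigram %in% topics_of_interest)

ranked
```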

Visualizing Using Network Graphs

To gain a different sense of the top discussion topics in r/teaching, a network analysis was also performed. To avoid creating a snowball graph of all 53,320 bigrams, the network was reduced to just the top 60 bigrams by limiting the data to the bigrams that appeared more than 30 times.

# convert bigrams into graph of format: word 1 -> word 2
# (use the separated counts so that word1 and word2 become the edge endpoints)

reddit_bigram_graph <- reddit_bigram_counts %>%
  graph_from_data_frame()

# reduce data to top 60 bigrams (cutoff: bigrams that appear more than 30 times)

reddit_bigram_graph_filtered <- reddit_bigram_counts %>%
  filter(n > 30) %>%
  graph_from_data_frame()

set.seed(100)

ggraph(reddit_bigram_graph_filtered, layout = "fr") +
  # constant edge styling is set outside aes() so it is not treated as a mapped variable
  geom_edge_link(edge_alpha = 0.1, edge_colour = "gray", show.legend = FALSE) +
  geom_node_point(color = "cornflowerblue") +
  geom_node_text(aes(label = name), vjust = 1.5, hjust = 1) +
  labs(title = "Network Graph of Top 60 Bigrams of r/Teaching Sept-Oct 2022") +
  theme_void()
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
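
The count cutoff determines how many bigrams reach the graph, so it can help to tabulate how many survive at several candidate thresholds before settling on one. The following is a sketch under assumptions: `toy_counts` holds hypothetical counts in the same shape as `reddit_bigram_counts`, and the candidate cutoffs are arbitrary.

```r
library(dplyr)

# Hypothetical counts in the same shape as reddit_bigram_counts
toy_counts <- tibble(
  word1 = c("lesson", "mental", "classroom", "special", "water"),
  word2 = c("plans", "health", "management", "education", "bottle"),
  n = c(120, 95, 80, 31, 12)
)

# How many bigrams survive each candidate cutoff?
cutoffs <- c(10, 30, 50, 100)
kept <- sapply(cutoffs, function(k) sum(toy_counts$n > k))
data.frame(cutoff = cutoffs, bigrams_kept = kept)

# Alternative: request a fixed number of bigrams instead of a count cutoff
top3 <- toy_counts %>%
  slice_max(order_by = n, n = 3, with_ties = FALSE)
```

A count cutoff keeps all bigrams tied at the boundary, while `slice_max()` with `with_ties = FALSE` returns exactly the requested number, so the two approaches can differ slightly at the margin.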

The resulting graph has several larger clusters of words that appear in bigrams in different combinations. The largest cluster, with 17 nodes, has central nodes of “school” and “teacher”; it covers major administrative distinctions such as public/private schools but, interestingly, also includes the bigram “teacher shortage”. The next two largest clusters refer to lesson plan lengths and school grades, but there are some more insightful bigrams as well. A cluster around the words “bad”, “feel”, “don’t”, and “care” illustrates the posts concerning teacher psychological health, as do the bigrams “mental health”, “life balance/easier,” and “sick days.” There are also a number of bigrams that refer to neutral topics, such as “water bottle/s”, “contract hours,” and “morning person.”
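
Cluster sizes like these can also be read from the graph object instead of counted by eye. The sketch below treats each visual cluster as a weakly connected component, which is an assumption about how the clusters were identified. The edge list `toy_edges` is hypothetical and much smaller than the real graph.

```r
library(igraph)

# Hypothetical edge list in the same word1 -> word2 shape as the filtered bigram counts
toy_edges <- data.frame(
  word1 = c("public", "private", "school", "teacher", "mental", "lesson"),
  word2 = c("school", "school", "teacher", "shortage", "health", "plans")
)

toy_graph <- graph_from_data_frame(toy_edges)

# Weakly connected components ignore edge direction, as a reader does when viewing the plot
comp <- components(toy_graph, mode = "weak")

n_clusters <- comp$no                       # number of separate clusters
largest_size <- max(comp$csize)             # node count of the largest cluster
largest_words <- names(which(comp$membership == which.max(comp$csize)))
```

Applied to `reddit_bigram_graph_filtered`, the same calls would list the words in each cluster, which could then be checked against the plot.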

Conclusion

The r/teaching subreddit of the link aggregation and discussion website Reddit.com is a diverse and active community of teachers and members of the general public. The posts in this forum cover many topics related to teaching, from lesson plans to teacher mental health. A bigram analysis of the top discussion topics in the subreddit over a two-month span near the beginning of the 2022 school year showed that practical ideas on lesson plans and classroom management were common topics, but there were also many bigrams that reflect teacher stress and the politics of education.

Researchers with more specific areas of concern could drill down into this data to find more patterns related to a variety of research questions, from sentiment analysis of posts on particular topics over several weeks to reactions to school shootings on a particular calendar date. The relatively easy access to the data is useful, although we must remain vigilant about acknowledging the potential limitations to Reddit data: it is likely to reflect the activity of a demographic that skews toward the higher end of technical literacy in the United States, and its open (but moderated) nature means that it will reflect its own particular kind of discussion that is not exhaustive of the possible styles of discussion about teaching.

Addendum on Trigrams

Another avenue for network analysis could be using trigrams instead of bigrams to generate triads as the units of analysis. That would require additional assistance for accuracy and insight, but we can see that trigrams in this dataset provide a different set of word combinations, and would likely lead to different interpretations, which we leave for future potential research.

invisible({capture.output({

# tokenize text to trigrams

reddit_trigrams <- posts %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

# create dataframe with trigrams by count

reddit_trigrams %>%
  count(trigram, sort = TRUE)

# create dataframe with trigram words split into separate columns

reddit_trigrams_separated <- reddit_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

# filter out stop words from separated columns

reddit_trigrams_filtered <- reddit_trigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word)

# filter out custom stop words from separated columns

reddit_trigrams_filtered <- reddit_trigrams_filtered %>%
  filter(!word1 %in% custom_stops$word) %>%
  filter(!word2 %in% custom_stops$word) %>%
  filter(!word3 %in% custom_stops$word)


# create dataframe with separated columns by count

reddit_trigram_counts <- reddit_trigrams_filtered %>% 
  count(word1, word2, word3, sort = TRUE)

reddit_trigram_counts

# Reunite separate words into trigrams after filtering and counting complete

reddit_trigrams_united <- reddit_trigram_counts %>%
  unite(trigram, word1, word2, word3, sep = " ") 
})})
paged_table(reddit_trigrams_united)
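
Since the bigram and trigram chunks differ only in the number of words, the shared steps could be wrapped in one function. The helper below, `count_ngrams()`, is hypothetical and not part of tidytext or of the original analysis. It is a sketch that assumes the input has a `text` column and that stop words are supplied as a character vector.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical helper: count n-grams of any size after removing stop words
count_ngrams <- function(posts, n, stops) {
  word_cols <- paste0("word", seq_len(n))   # "word1", "word2", ...
  posts %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram)) %>%               # texts shorter than n words can yield NA
    separate(ngram, word_cols, sep = " ") %>%
    # keep a row only if none of its words is a stop word
    filter(if_all(all_of(word_cols), ~ !.x %in% stops)) %>%
    count(across(all_of(word_cols)), sort = TRUE) %>%
    unite(ngram, all_of(word_cols), sep = " ")
}

# Made-up posts and a tiny stop word list, for illustration only
toy_posts <- tibble(text = c(
  "the state of mental health support",
  "mental health support matters"
))

toy_trigrams <- count_ngrams(toy_posts, n = 3, stops = c("the", "of"))
toy_trigrams
```

With the real data, the stop list would presumably combine `stop_words$word` and `custom_stops$word`, and calling the function with `n = 2` would reproduce the bigram counts.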