Analyzing Conversations Between Teachers in the r/teaching SubReddit
Using Bigrams
The link aggregation and discussion website Reddit (https://www.reddit.com) is
one of the internet’s most popular sites, and it provides a medium for
users to create their own discussion groups based on their own
interests. Each discussion group is called a subreddit, and one
subreddit named r/teaching has been created by teachers (predominantly
in the U.S.) to discuss their job-related concerns and stories.
This subreddit has over 107,000 members and contains many thousands
of posts. The question here is whether text-mining methods,
specifically bigram analysis, can reveal the most common topics that
teachers discuss in the r/teaching subreddit.
Data Discovery
The author used the post sorting tool on Reddit to find the most
heavily read posts of all time in r/teaching. Because several of these
threads had occurred in the past 12 months, a two-month period near the
start of the current school year was chosen for web scraping. All posts
from September 1, 2022 through October 31, 2022 were scraped using the
Reddit search tool at https://camas.unddit.com. This resulted in a dataset of
15,866 posts in JSON format, which were imported into R. Spam posts and
automatically generated moderator notifications were removed, leaving a
dataset of 14,776 posts. The body text from each post was placed into a
.csv file for text analysis.
# Load the tidyverse (provides read_csv() and str_length()), then read in data
library(tidyverse)
redditposts <- read_csv("~/LD and T Courses/r teaching for text analysis.csv")
## New names:
## • `` -> `...1`
## Rows: 14776 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): x
## dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Many of the remaining posts were still very short and contained
inconsequential content. Therefore, all posts under 50 characters in
length were eliminated so that only more in-depth, content-rich posts
would be considered for bigram analysis.
# Count characters in each entry, remove posts under 50 characters
posts <- data.frame(text = redditposts$x[str_length(redditposts$x) >= 50])
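The length filter can be checked on a few made-up strings. The sketch below assumes the stringr package is installed and uses a hypothetical vector, `toy_posts`, that is not drawn from the scraped data. One detail worth noting: `str_length()` returns `NA` for missing text, and indexing with `NA` would carry empty rows forward, so this version wraps the test in `which()` to drop them. That is a suggested variation, not the filter used above.

```r
library(stringr)

# Hypothetical example posts (not from the r/teaching data)
toy_posts <- c(
  "Thanks!",
  NA,
  "Does anyone have advice for managing a class of thirty sixth graders?"
)

# Keep only posts with at least 50 characters; which() also drops NA entries
keep <- which(str_length(toy_posts) >= 50)
toy_filtered <- data.frame(text = toy_posts[keep])

# Number of posts that survive the filter
nrow(toy_filtered)
```

Only the third string is long enough to be kept, and the missing entry is dropped rather than retained as an `NA` row.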
Creating Bigrams
Next, the data was tokenized into bigrams using the unnest_tokens()
function in R’s tidytext package. Each bigram was then split with
separate() so that its first and second words sat in separate columns of
the dataframe. Stop words were removed from each column, including
custom stop words chosen to catch most of the remaining spam posts and
HTML code remnants. The remaining word pairs in each row of the
dataframe were then counted and reunited into two-word bigrams.
invisible({capture.output({
  # tokenize text into bigrams
  reddit_bigrams <- posts %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2)

  # count bigrams before any filtering
  reddit_bigrams %>%
    count(bigram, sort = TRUE)

  # create dataframe with bigram words split into separate columns
  reddit_bigrams_separated <- reddit_bigrams %>%
    separate(bigram, c("word1", "word2"), sep = " ")

  # filter out stop words from separated columns
  reddit_bigrams_filtered <- reddit_bigrams_separated %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word)

  # create custom stop word dictionary
  custom_stops <- data.frame(
    word = c("https", "amp", "utm_name", "ios_app", "utm_medium",
             "belledelphine", "08", "onlyfans", "f0lder")
  )

  # filter out custom stop words from separated columns
  reddit_bigrams_filtered <- reddit_bigrams_filtered %>%
    filter(!word1 %in% custom_stops$word) %>%
    filter(!word2 %in% custom_stops$word)

  # create dataframe with separated columns by count
  reddit_bigram_counts <- reddit_bigrams_filtered %>%
    count(word1, word2, sort = TRUE)

  # reunite separate words into bigrams after filtering and counting
  reddit_bigrams_united <- reddit_bigram_counts %>%
    unite(bigram, word1, word2, sep = " ")
})})
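To make the tokenize, separate, filter, and unite steps concrete, here is a minimal sketch run on a single made-up sentence. It assumes the dplyr, tidyr, and tidytext packages are installed; the data frame `toy` is hypothetical and is not part of the scraped dataset.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-post data frame (not from the scraped data)
toy <- data.frame(text = "Lesson plans and classroom management take time")

toy_bigrams <- toy %>%
  # one row per overlapping two-word sequence, lowercased by default
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # split each bigram into its two words
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # drop any bigram in which either word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # glue the surviving word pairs back together
  unite(bigram, word1, word2, sep = " ")

toy_bigrams
```

Bigrams that contain a stop word, such as "plans and" or "and classroom", are removed, while content pairs such as "lesson plans" and "classroom management" remain.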
Table of Bigrams
The full list of the 53,320 remaining bigrams in the data can be
navigated below, sorted by frequency.
paged_table(reddit_bigrams_united)
Data Analysis
The large number of bigrams makes the data unwieldy, and scrolling
through pages of them is an inefficient method of analysis. Even so,
the first page of results shows some topics of note. In particular,
“mental health” ranks as the third most common bigram, just after
“lesson plans”, which indicates that much discussion in the r/teaching
subreddit is about psychological health, probably that of the teachers
themselves. Student psychological health also appears to be a top
concern: the bigrams “special education” and “special ed” rank as the
9th and 10th most common, and the 4th most common bigram, “classroom
management,” incorporates elements of student psychology as well.
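Rather than paging through the full table, the top-ranked bigrams can be pulled out directly. The following is a minimal sketch that assumes dplyr is installed. The helper `top_bigrams()` and the data frame `toy_counts` are hypothetical, and the counts are invented placeholders rather than results from the r/teaching data.

```r
library(dplyr)

# Hypothetical helper: return the `top` most frequent rows with an explicit rank
top_bigrams <- function(counts, top = 10) {
  counts %>%
    slice_max(order_by = n, n = top, with_ties = FALSE) %>%
    mutate(rank = row_number())
}

# Placeholder count table in the same shape as reddit_bigrams_united
toy_counts <- data.frame(
  bigram = c("alpha beta", "gamma delta", "epsilon zeta", "eta theta"),
  n = c(12, 40, 7, 25)
)

toy_top <- top_bigrams(toy_counts, top = 3)
toy_top
```

Applied to the real table, a call such as `top_bigrams(reddit_bigrams_united, top = 10)` would be expected to return the rows discussed above with their ranks attached.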
Visualizing Using Network Graphs
To gain a different sense of what the top discussion topics are in
r/teaching, a network analysis was also performed. To avoid creating an
unreadably dense graph of all 53,320 bigrams, the network was reduced
to just the top 60 bigrams by limiting the data to the bigrams that
appeared more than 30 times.
# convert bigram counts into a graph of format: word 1 -> word 2
# (uses the separated word1/word2 columns so that each word becomes a node)
reddit_bigram_graph <- reddit_bigram_counts %>%
  graph_from_data_frame()
# reduce data to the top 60 bigrams (cutoff: bigrams that appear more than 30 times)
reddit_bigram_graph_filtered <- reddit_bigram_counts %>%
  filter(n > 30) %>%
  graph_from_data_frame()
set.seed(100)
ggraph(reddit_bigram_graph_filtered, layout = "fr") +
  # constant edge styling is set outside aes() so it is not treated as a mapping
  geom_edge_link(edge_alpha = 0.1, edge_colour = "gray", show.legend = FALSE) +
  geom_node_point(color = "cornflowerblue") +
  geom_node_text(aes(label = name), vjust = 1.5, hjust = 1) +
  labs(title = "Network Graph of Top 60 Bigrams of r/teaching Sept-Oct 2022") +
  theme_void()
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
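Because the size of the network depends entirely on the frequency cutoff, it can help to check how many bigrams survive each candidate threshold before plotting. The sketch below uses only base R. The table `toy_counts` is hypothetical, with invented counts, and stands in for `reddit_bigram_counts`.

```r
# Hypothetical count table in the same shape as reddit_bigram_counts
toy_counts <- data.frame(
  word1 = c("a", "b", "c", "d", "e"),
  word2 = c("v", "w", "x", "y", "z"),
  n     = c(120, 45, 31, 30, 6)
)

# Number of bigrams that survive each candidate cutoff (strictly greater than)
cutoffs <- c(5, 30, 100)
survivors <- sapply(cutoffs, function(k) sum(toy_counts$n > k))
names(survivors) <- paste0("n > ", cutoffs)

survivors
```

Note that `n > 30` excludes a bigram that appears exactly 30 times, so the choice between `>` and `>=` matters near the cutoff.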

The resulting graph has several larger clusters of words that combine
into different bigrams. The largest cluster, with 17 nodes, has central
nodes of “school” and “teacher”. It covers major administrative
distinctions such as public/private schools but, interestingly, also
includes the bigram “teacher shortage”. The two next-largest clusters
refer to lesson plan lengths and school grades, but there are some more
insightful bigrams as well. A cluster around the words “bad”, “feel”,
“don’t”, and “care” illustrates the posts concerning teachers’
psychological health, as do the bigrams “mental health”, “life
balance/easier,” and “sick days.” There are also a number of bigrams
that refer to neutral topics, such as “water bottle/s”, “contract
hours,” and “morning person.”
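Cluster sizes like the ones described above can also be computed rather than counted by eye. This is a minimal sketch assuming the igraph package is installed. The edge list `toy_edges` is hypothetical and only borrows a few words from the discussion for illustration; it does not reproduce the actual graph.

```r
library(igraph)

# Hypothetical edge list in the same word1 -> word2 shape as reddit_bigram_counts
toy_edges <- data.frame(
  word1 = c("public", "private", "high", "mental", "sick"),
  word2 = c("school", "school", "school", "health", "days")
)

toy_graph <- graph_from_data_frame(toy_edges)

# Weakly connected components ignore edge direction, matching the visual clusters
comp <- components(toy_graph, mode = "weak")

# Cluster sizes in nodes, largest first
cluster_sizes <- sort(comp$csize, decreasing = TRUE)
cluster_sizes
```

Running `components()` on `reddit_bigram_graph_filtered` in the same way would give the node count of each cluster in the plotted network.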
Conclusion
The r/teaching subreddit of the link aggregation and discussion
website Reddit.com is a diverse and active community of teachers and
members of the general public. The posts in this forum cover many topics
related to teaching, from lesson plans to teacher mental health. A
bigram analysis of the top discussion topics in the subreddit over a
two-month timespan near the beginning of the 2022 school year showed
that practical ideas on lesson plans and classroom management were
common topics, but there were also many bigrams that reflect teacher
stress and the politics of education.
Researchers with more specific areas of concern could drill down into
this data to find more patterns related to a variety of research
questions, from sentiment analysis of posts on particular topics over
several weeks to reactions to school shootings on a particular calendar
date. The relatively easy access to the data is useful, although we must
remain vigilant about acknowledging the potential limitations of Reddit
data. It is likely to reflect the activity of a demographic that skews
toward the higher end of technical literacy in the United States, and
its open (but moderated) nature means that it will reflect its own
particular kind of discussion, which does not exhaust the possible
styles of discussion about teaching.
Addendum on Trigrams
Another avenue for network analysis could be to use trigrams instead
of bigrams, generating triads as the units of analysis. That would
require additional assistance for accuracy and insight, but we can see
that trigrams in this dataset provide a different set of word
combinations and would likely lead to different interpretations, which
we leave for potential future research.
invisible({capture.output({
  # tokenize text into trigrams
  reddit_trigrams <- posts %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3)

  # count trigrams before any filtering
  reddit_trigrams %>%
    count(trigram, sort = TRUE)

  # create dataframe with trigram words split into separate columns
  reddit_trigrams_separated <- reddit_trigrams %>%
    separate(trigram, c("word1", "word2", "word3"), sep = " ")

  # filter out stop words from separated columns
  reddit_trigrams_filtered <- reddit_trigrams_separated %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    filter(!word3 %in% stop_words$word)

  # filter out custom stop words from separated columns
  reddit_trigrams_filtered <- reddit_trigrams_filtered %>%
    filter(!word1 %in% custom_stops$word) %>%
    filter(!word2 %in% custom_stops$word) %>%
    filter(!word3 %in% custom_stops$word)

  # create dataframe with separated columns by count
  reddit_trigram_counts <- reddit_trigrams_filtered %>%
    count(word1, word2, word3, sort = TRUE)
  reddit_trigram_counts

  # reunite separate words into trigrams after filtering and counting
  reddit_trigrams_united <- reddit_trigram_counts %>%
    unite(trigram, word1, word2, word3, sep = " ")
})})
paged_table(reddit_trigrams_united)