options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("tidytext")
install.packages("vader")
install.packages("tidyverse")
install.packages("wordcloud2")
install.packages("RedditExtractoR")
install.packages("modeest")
install.packages("reshape2")
install.packages("wordcloud")
install.packages("multcomp")
School Psychology Sentiments
Exploring the Sentiments of r/schoolpsychology
School psychologists serve a pivotal role within U.S. public schools, providing students, staff, and school systems with social-emotional, behavioral, and academic support. A critical function of school psychologists is to conduct individualized psychoeducational evaluations for students suspected of having a disability that may impact their education. Despite this essential role, there is currently an acute national shortage of practicing school psychologists. The National Association of School Psychologists (NASP, n.d.) recommends a ratio of 1 school psychologist for every 500 students; however, the current ratio is 1:1119, with large variations across states. Shortages within the field are due to multiple factors, including a lack of qualified faculty to teach in graduate training programs, burnout leading practitioners to leave the field, and a lack of awareness of the field among high school and college-age students (APA, n.d.; Schilling and Randolph 2021). Several initiatives aim to boost the number of school psychologists, including the NASP Exposure Project, in which local school psychologists give presentations to high school and college-age students that present school psychology as a viable career pathway (“NASP Exposure Project (NASP-EP),” n.d.). While underexplored in the school psychology literature, online communities designed for school psychologists and those interested in the field may also represent an important avenue for providing accurate information and recruiting students.
Online communities for professionals on Facebook, X, Reddit, and other social media websites serve as hubs for knowledge, promote belonging, and can increase satisfaction with work (Oksanen et al. 2024). However, these communities can also serve as spaces to express negative sentiments and seek support from peers. While this is potentially beneficial at the individual level, research indicates that negative sentiments may get more engagement than positive perspectives (Davis and Graham 2021). By extension, this may paint a negative overall picture of the field and dissuade prospective applicants from joining. To date, however, no research has directly examined the sentiments expressed in online school psychology spaces. In this report, I examine the sentiments expressed on r/schoolpsychology, a subreddit on Reddit. Subreddits are online communities dedicated to a specific topic or interest. Users primarily engage with one another through posts made by individual users, comments by other users on those posts, and points assigned to those posts through upvotes (adding points) and downvotes (removing points). r/schoolpsychology has over 13,000 members, including people interested in school psychology, graduate students, practitioners, and trainers of school psychologists (“Reddit for School Psychologists!” n.d.). Using data sourced from Reddit, I answer the following questions:
- What are the sentiments expressed on r/schoolpsychology?
- Is there a difference between sentiments expressed on posts compared to comments?
- How have sentiments changed over time?
- Do negative sentiments get more engagement than positive sentiments (comments, upvotes, downvotes)?
Before we go further, let’s install the necessary packages.
- tidytext - provides functions for handling and analyzing text data using tidy data principles (Silge et al. 2024).
- VADER (Valence Aware Dictionary and sEntiment Reasoner) - a rule-based sentiment analysis tool tuned to social media text. VADER produces positive, neutral, and negative scores along with a compound score on a -1 to +1 scale (Hutto and Gilbert 2014).
- tidyverse - a collection of R packages for data manipulation that share an underlying design and language (Wickham 2007).
- wordcloud2 - creates interactive word clouds (Lang and Chien 2018).
- wordcloud - creates static word clouds (Fellows 2018).
- RedditExtractoR - uses Reddit’s API to scrape and extract data from Reddit, including posts, comments, and metadata (Rivera 2023).
- modeest - provides tools for estimating the mode of a dataset (Paul and Clarke 2016).
- reshape2 - transforms data between wide and long formats (Wickham 2007).
- multcomp - provides several post hoc comparison procedures useful after ANOVA (Hothorn et al. 2024).
To begin, I need to load these packages into the current session.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
Loading required package: RColorBrewer
Loading required package: stats4
Loading required package: splines
Step 1: Wrangle
The first step is the most exciting one: gathering Reddit data! The RedditExtractoR package allows us to scrape any subreddit for post data, up to 999 posts per call. Since r/schoolpsychology does not get posts very often, that should give us a few years’ worth of posts.
To begin scraping data, I first use the find_thread_urls() function. This function requires that you specify the subreddit(s) you want to scrape and how you want to sort the results. In this case, I sort by top posts to get a sample of the most popular topics/posts. Since I want the largest sample possible, the time period is specified as “all” to get up to 999 posts.
reddschoolpsych <- find_thread_urls(subreddit = "schoolpsychology",
sort_by = "top", period = "all")
commschoolpsych <- get_thread_content(reddschoolpsych$url)
Let’s see what we’ve got!
glimpse(reddschoolpsych)
Rows: 999
Columns: 7
$ date_utc <chr> "2022-08-16", "2022-07-28", "2022-06-21", "2022-06-16", "202…
$ timestamp <dbl> 1660675004, 1658977235, 1655845096, 1655417025, 1648994396, …
$ title <chr> "Calming/wellness room for high school", "report writing in …
$ text <chr> "Hi,\nHas anyone successfully established a calming or welln…
$ subreddit <chr> "schoolpsychology", "schoolpsychology", "schoolpsychology", …
$ comments <dbl> 4, 13, 28, 12, 73, 80, 27, 4, 22, 20, 6, 4, 8, 4, 6, 7, 7, 9…
$ url <chr> "https://www.reddit.com/r/schoolpsychology/comments/wq1xgx/c…
glimpse(commschoolpsych)
List of 2
$ threads :'data.frame': 260 obs. of 15 variables:
..$ url : chr [1:260] "https://www.reddit.com/r/schoolpsychology/comments/wq1xgx/calmingwellness_room_for_high_school/" "https://www.reddit.com/r/schoolpsychology/comments/w9y61b/report_writing_in_different_states/" "https://www.reddit.com/r/schoolpsychology/comments/vhnez1/masters_level_school_psychs_and_iq_tests/" "https://www.reddit.com/r/schoolpsychology/comments/vdx8q1/looking_for_good_sel_curriculum/" ...
..$ author : chr [1:260] "ana_banana_obp" "bageltechnician" "msolorio79" "feedthebite" ...
..$ date : chr [1:260] "2022-08-16" "2022-07-28" "2022-06-21" "2022-06-16" ...
..$ timestamp : num [1:260] 1.66e+09 1.66e+09 1.66e+09 1.66e+09 1.65e+09 ...
..$ title : chr [1:260] "Calming/wellness room for high school" "report writing in different states" "Masters level School Psychs and IQ tests" "Looking for good SEL curriculum" ...
..$ text : chr [1:260] "Hi,\nHas anyone successfully established a calming or wellness room that was manageable to monitor along with a"| __truncated__ "Is report writing required in your state? I\031m in Georgia and it\031s required here.\n\nI heard Washington an"| __truncated__ "I sat through a presentation today by a BCBA and one of the things he said caught my attention. He said that s"| __truncated__ "Hi everyone, my district is getting a grant for SEL MTSS. Do you know of good researched based SEL that could b"| __truncated__ ...
..$ subreddit : chr [1:260] "schoolpsychology" "schoolpsychology" "schoolpsychology" "schoolpsychology" ...
..$ score : num [1:260] 10 10 9 11 10 10 10 11 10 10 ...
..$ upvotes : num [1:260] 10 10 9 11 10 10 10 11 10 10 ...
..$ downvotes : num [1:260] 0 0 0 0 0 0 0 0 0 0 ...
..$ up_ratio : num [1:260] 1 1 0.92 1 1 1 1 1 1 0.92 ...
..$ total_awards_received: num [1:260] 0 0 0 0 0 0 0 0 0 0 ...
..$ golds : num [1:260] 0 0 0 0 0 0 0 0 0 0 ...
..$ cross_posts : num [1:260] 0 0 0 0 0 0 0 0 0 0 ...
..$ comments : num [1:260] 4 13 28 12 73 80 27 4 22 20 ...
$ comments:'data.frame': 4133 obs. of 10 variables:
..$ url : chr [1:4133] "https://www.reddit.com/r/schoolpsychology/comments/wq1xgx/calmingwellness_room_for_high_school/" "https://www.reddit.com/r/schoolpsychology/comments/wq1xgx/calmingwellness_room_for_high_school/" "https://www.reddit.com/r/schoolpsychology/comments/wq1xgx/calmingwellness_room_for_high_school/" "https://www.reddit.com/r/schoolpsychology/comments/wq1xgx/calmingwellness_room_for_high_school/" ...
..$ author : chr [1:4133] "sendapicofyourkitty" "odd-42" "VaginaPirate" "lmidor" ...
..$ date : chr [1:4133] "2022-08-16" "2022-08-17" "2022-08-17" "2022-08-17" ...
..$ timestamp : num [1:4133] 1.66e+09 1.66e+09 1.66e+09 1.66e+09 1.66e+09 ...
..$ score : num [1:4133] 10 9 3 2 17 4 6 4 3 2 ...
..$ upvotes : num [1:4133] 10 9 3 2 17 4 6 4 3 2 ...
..$ downvotes : num [1:4133] 0 0 0 0 0 0 0 0 0 0 ...
..$ golds : num [1:4133] 0 0 0 0 0 0 0 0 0 0 ...
..$ comment : chr [1:4133] "I tried one and it failed fairly spectacularly- it just wasn\031t a safe space. Certain groups of kids would co"| __truncated__ "Our social worker tried this. It blew up similarly. \034Cool kids\035 claimed it as \034theirs.\035 The othe"| __truncated__ "yes, has to be associated to specific needs of individuals who will be using it. behavior accommodations and/o"| __truncated__ "My high school set one up successfully last year. The students had to get a pass from either the mental health"| __truncated__ ...
..$ comment_id: chr [1:4133] "1" "1_1" "2" "3" ...
Next, since I am interested in both posts and comments, I use the get_thread_content() function, which uses each post’s URL to scrape its comments along with some other useful metadata. However, after doing so, I noticed that Reddit’s API limits the number of comments I am able to extract, leaving a large number of posts without comment data. We will need to address this missing data.
Now that we have our initial dataset, I’ll use glimpse() to get an initial sense of how the data are structured.
Rows: 260
Columns: 15
$ url <chr> "https://www.reddit.com/r/schoolpsychology/comme…
$ author <chr> "ana_banana_obp", "bageltechnician", "msolorio79…
$ date <chr> "2022-08-16", "2022-07-28", "2022-06-21", "2022-…
$ timestamp <dbl> 1660675004, 1658977235, 1655845096, 1655417025, …
$ title <chr> "Calming/wellness room for high school", "report…
$ text <chr> "Hi,\nHas anyone successfully established a calm…
$ subreddit <chr> "schoolpsychology", "schoolpsychology", "schoolp…
$ score <dbl> 10, 10, 9, 11, 10, 10, 10, 11, 10, 10, 9, 10, 9,…
$ upvotes <dbl> 10, 10, 9, 11, 10, 10, 10, 11, 10, 10, 9, 10, 9,…
$ downvotes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ up_ratio <dbl> 1.00, 1.00, 0.92, 1.00, 1.00, 1.00, 1.00, 1.00, …
$ total_awards_received <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ golds <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cross_posts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ comments <dbl> 4, 13, 28, 12, 73, 80, 27, 4, 22, 20, 6, 4, 8, 4…
Rows: 4,133
Columns: 10
$ url <chr> "https://www.reddit.com/r/schoolpsychology/comments/wq1xgx/…
$ author <chr> "sendapicofyourkitty", "odd-42", "VaginaPirate", "lmidor", …
$ date <chr> "2022-08-16", "2022-08-17", "2022-08-17", "2022-08-17", "20…
$ timestamp <dbl> 1660691647, 1660697989, 1660700124, 1660751725, 1658978871,…
$ score <dbl> 10, 9, 3, 2, 17, 4, 6, 4, 3, 2, 3, 3, 5, 3, 5, 2, 3, 70, 13…
$ upvotes <dbl> 10, 9, 3, 2, 17, 4, 6, 4, 3, 2, 3, 3, 5, 3, 5, 2, 3, 70, 13…
$ downvotes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ golds <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ comment <chr> "I tried one and it failed fairly spectacularly- it just wa…
$ comment_id <chr> "1", "1_1", "2", "3", "1", "1_1", "2", "2_1", "3", "3_1", "…
Examining the scraped data, we can see that we have 999 posts (observations) and 7 columns in the reddschoolpsych data frame, including:
date_utc - the date of the post, in year-month-day format
timestamp - a time stamp for each post
title - the title description for each post
text - the main text for each post
subreddit - in this case, r/schoolpsychology
comments - the number of comments each post gets - this also includes a count of hidden, deleted or blocked comments made on the post
URL - a direct link to each post
Using the URL for each post, we can use get_thread_content() to scrape the comments for each individual post. After doing so, we get a list that includes two data frames. The first has 260 observations and 15 columns. While this data frame has extensive information, such as the author of each post, upvotes, downvotes, and awards given, we primarily need the score and url columns.
The second data frame is much larger than the first and contains 4,133 comments (observations) from the posts whose threads were returned. Aside from the comments themselves, there are 9 other variables; however, from this data frame we primarily need the comment and url columns for our analysis.
Because the data are spread across several tables and contain several extraneous columns, I’ll need to do some cleaning and data processing, which is what I turn to next.
Step 2: Pre-Processing
In Step 2, I’ll begin cleaning the data and creating useful new variables. This is a process that will be revisited throughout this analysis, but this initial cleaning will give a good starting point.
To begin, I want to take the list that contains two data frames and split it into two separate data frames. This is necessary because, while lists are useful for storing multiple types of data, they are difficult to perform calculations on directly.
After creating the new data frames, we can extract useful columns from them and add them to the main data frame that contains post information. An interesting quirk is that the data frames differ in size: comm_sepschoolpsych has 4133 observations, whereas reddschoolpsych has only 999. If you try to join these data frames as is, R will “helpfully” duplicate each post’s information once for every one of its comments to fill in the 3134 extra rows. To avoid this, I first combine all of the comments associated with each post by its shared URL using the summarize() function, separating comments with a semicolon. Now when we add them to the reddschoolpsych data frame, each post with comments available will have all of its comments stored in a single cell.
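The row duplication described above can be illustrated with a toy example. This is a minimal sketch using made-up data; the data frames `posts` and `comments` and their values are hypothetical and only mirror the shape of the Reddit data.

```r
library(dplyr)

# Hypothetical toy data: two posts, three comments (names are illustrative).
posts <- tibble(
  url   = c("post_a", "post_b"),
  title = c("First post", "Second post")
)
comments <- tibble(
  url     = c("post_a", "post_a", "post_b"),
  comment = c("Great idea", "I agree", "Not sure")
)

# Joining directly repeats each post once per comment (3 rows, not 2).
joined_raw <- left_join(posts, comments, by = "url")
nrow(joined_raw)

# Collapsing comments first keeps exactly one row per post.
comments_collapsed <- comments |>
  group_by(url) |>
  summarize(comments = paste(comment, collapse = "; "))

joined_collapsed <- left_join(posts, comments_collapsed, by = "url")
nrow(joined_collapsed)
joined_collapsed$comments
```

Collapsing before joining trades per-comment detail for a one-row-per-post layout, which suits an analysis where the post is the unit of observation.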
Next, I focus on combining information from the thread_sepschoolpsych data frame with our posts and comments. I use right_join() for this and match each post to its score by its associated URL.
After doing this, there are just a few more steps needed to get a nice clean dataset. I want to rename the columns to something more descriptive, separate the dates so that each post’s year, month, and day are stored in their own columns, and choose which columns to keep.
Finally, to address missing data, I will use listwise deletion via na.omit(), which drops any post with an “NA” in one of the retained columns, such as its score. While this will reduce the number of observations in our data, it will leave us with a more complete dataset.
#Separating comments and threads
comm_sepschoolpsych <- commschoolpsych$comments
thread_sepschoolpsych <- commschoolpsych$threads
#joining the data so that comments are collapsed into one cell and are associated with each post.
comments_combined <- comm_sepschoolpsych |>
group_by(url) |>
summarize(comments = paste(comment, collapse = "; "))
posts_comm_comb <- reddschoolpsych |>
full_join(comments_combined, by = "url")
#joining threads data so that we can get post rating information
complete_reddit_data <- thread_sepschoolpsych |>
select(url, score) |>
group_by(url) |>
right_join(posts_comm_comb, by = "url") |>
ungroup(url)
#renaming columns
complete_reddit_data <- complete_reddit_data |>
rename(num_comments = comments.x,
comb_comments = comments.y)
# separating date
complete_reddit_data <- complete_reddit_data |>
separate(date_utc, c("Year", "Month", "Day"), sep = "-")
# selecting columns to retain
complete_reddit_data <- complete_reddit_data |>
select(title, text, comb_comments, score, num_comments, Year, Month, url)
# addressing NA values
complete_reddit_data <- na.omit(complete_reddit_data)
Now, let’s see if our hard work paid off.
glimpse(complete_reddit_data)
Rows: 251
Columns: 8
$ title <chr> "Calming/wellness room for high school", "report writing…
$ text <chr> "Hi,\nHas anyone successfully established a calming or w…
$ comb_comments <chr> "I tried one and it failed fairly spectacularly- it just…
$ score <dbl> 10, 10, 9, 11, 10, 10, 10, 11, 10, 10, 9, 10, 9, 10, 9, …
$ num_comments <dbl> 4, 13, 28, 12, 73, 80, 27, 4, 22, 20, 6, 4, 8, 4, 6, 7, …
$ Year <chr> "2022", "2022", "2022", "2022", "2022", "2022", "2022", …
$ Month <chr> "08", "07", "06", "06", "04", "03", "03", "03", "03", "0…
$ url <chr> "https://www.reddit.com/r/schoolpsychology/comments/wq1x…
Excellent! It looks like we have a decent starting point for analysis. The new data frame complete_reddit_data has 251 observations and 8 columns: title, text, comb_comments (each post’s combined comments), score, num_comments (the number of comments each post received), Year, Month, and url. The significant reduction in observations was due to Reddit’s API limiting the number of comments that could be gathered. Thus, 748 posts did not have comments or score information and were excluded from analysis.
Now that I have a clean dataset, I can begin to do some initial analysis and explore the data further.
Step 3: Analyze
The first thing we can do with our data is to get a sense of its shape and distribution. Using ggplot2 I create histograms for each post’s scores and number of comments.
#visualizing numeric data
complete_reddit_data |>
ggplot(aes(score)) +
geom_histogram() +
ggtitle("Histogram of Post Scores") +
theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
complete_reddit_data |>
ggplot(aes(num_comments)) +
geom_histogram() +
labs(title = "Histogram of Number of Comments",
x = "Number of Comments") +
theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at the histogram for the number of comments, we can see that the data have a strong positive skew and appear leptokurtic. Most posts have only a few comments, with some outliers exceeding 150 comments!
Similarly, for post score, we can see that there is also a right-tailed skew, with scores clustered toward the low end of their 8 to 24 range and 10 being the most common score.
We can also quantify the distributions using the summary() function and the mfv() function, which returns the most frequent value(s), that is, the mode of the data.
# summary of numeric data
summary(select(complete_reddit_data,
score,
num_comments))
score num_comments
Min. : 8.00 Min. : 1.0
1st Qu.:11.00 1st Qu.: 6.0
Median :12.00 Median : 11.0
Mean :14.29 Mean : 16.3
3rd Qu.:19.00 3rd Qu.: 16.5
Max. :24.00 Max. :171.0
mfv(complete_reddit_data$score, na_rm = TRUE)
[1] 10
mfv(complete_reddit_data$num_comments, na_rm = TRUE)
[1] 6 9
Looking at the data, we can again see evidence of a positive skew: in both cases the mean exceeds the median. Scores have a mean of 14.29, a median of 12, and a mode of 10. Similarly, the number of comments has a mean of 16.3, a median of 11, and is bimodal at 6 and 9.
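To make the idea of a mode, and of a bimodal result, concrete, here is a minimal base R sketch of a most-frequent-value function. It is not how modeest implements mfv(); the helper name `most_frequent` and the input vectors are hypothetical.

```r
# Return every value tied for the highest frequency (the mode or modes).
most_frequent <- function(x) {
  x <- x[!is.na(x)]                       # ignore missing values
  counts <- table(x)                      # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]
  as.numeric(modes)
}

scores <- c(10, 10, 12, 9, 10, 12)   # illustrative values, not the Reddit data
most_frequent(scores)

n_comments <- c(6, 9, 6, 9, 11)      # two values tie, so the result is bimodal
most_frequent(n_comments)
```

Returning all tied values, rather than only the first, is what allows a bimodal result to surface.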
Next, we can get a sense of what is being discussed in the data by looking at posts directly.
#viewing top posts
complete_reddit_data |>
arrange(desc(score)) |>
slice(1:10)
# A tibble: 10 × 8
title text comb_comments score num_comments Year Month url
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 Considering school … "I a… "1. Right no… 24 10 2021 02 http…
2 Struggling with con… "Hi,… "One great p… 23 7 2021 04 http…
3 School psychologist… "Sch… "Honestly? I… 23 29 2020 11 http…
4 Advice for Internsh… "Hey… "Be flexible… 23 4 2019 07 http…
5 How to emphasize im… "Hi … "Friendly re… 22 6 2024 02 http…
6 Academic testing fo… "Aft… "I think it … 22 21 2023 09 http…
7 How many evaluation… "In … "I do about … 22 15 2022 07 http…
8 Can I ask to shadow… "Hel… "My district… 22 16 2021 07 http…
9 When does the anxie… "So … "[deleted]; … 22 14 2021 05 http…
10 Best moment as a sc… "Wha… "Years ago I… 22 11 2021 01 http…
#viewing bottom posts
complete_reddit_data |>
arrange(score) |>
slice(1:10)
# A tibble: 10 × 8
title text comb_comments score num_comments Year Month url
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 Anyone here work in… "Hel… "Looks like … 8 1 2021 07 http…
2 Graduate School, Ad… "Hel… "Hi everyone… 8 44 2021 06 http…
3 Masters level Schoo… "I s… "BCBAs, seco… 9 28 2022 06 http…
4 Counseling outside … "I w… "I'm a bot. … 9 6 2022 02 http…
5 Automatic referral … "Yes… "\"What is t… 9 8 2022 01 http…
6 Contemplating my ca… "Hel… "Pros and co… 9 6 2022 01 http…
7 Giving the BASC-3 t… "If … "I would att… 9 9 2021 12 http…
8 Salary Schedules "Whe… "I'm a bot. … 9 4 2021 09 http…
9 On Pedagogy, reform… "Do … "Following. … 9 3 2021 08 http…
10 How to Email School… "I a… "Hello, 2nd … 9 13 2021 07 http…
Looking at the top and bottom posts, we can see that the highest-rated posts are generally about people seeking advice on how to become school psychologists, credentialing standards, and evaluation considerations. However, three of the top posts, “Struggling with consultation”, “School psychologist Frustration”, and “When does the anxiety stop?” seem to be expressing negative sentiments about the field and experiences within it.
The lowest-rated posts pertain to educational programs, jobs/compensation, research, and district-level procedures. On the surface, at least, nothing about these posts appears negative.
Now that we have a general sense of what the posts are talking about, I can begin preparing our data for sentiment analysis. Sentiment analysis examines subjective elements, such as words, for their emotional quality (Taboada 2016). It compares words to dictionaries that researchers have pre-rated for emotional quality, either by assigning a qualitative label (e.g., positive, negative) or by assigning a quantitative value (e.g., 1 [negative] to 5 [positive]) (Silge and Robinson, n.d.). For the present study, I used the Valence Aware Dictionary and sEntiment Reasoner (VADER; Hutto and Gilbert 2014), as it was designed to analyze the sentiment of short social media posts.
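The dictionary-lookup idea can be sketched in a few lines of base R. The mini-lexicon below is hypothetical: its words and scores are invented for illustration and are not taken from VADER’s dictionary. The cutoffs of 0.05 and -0.05 match the ones applied to the compound score later in this analysis.

```r
# Hypothetical mini-lexicon: the words and scores are made up for illustration
# and are NOT taken from VADER's dictionary.
toy_lexicon <- c(love = 0.8, helpful = 0.6, hard = -0.4, anxiety = -0.7)

words <- c("love", "district", "anxiety", "helpful", "testing")

# Look up each word; words missing from the lexicon get a score of 0.
scores <- unname(toy_lexicon[words])
scores[is.na(scores)] <- 0

# Label each word using +/-0.05 cutoffs.
labels <- ifelse(scores >= 0.05, "positive",
                 ifelse(scores <= -0.05, "negative", "neutral"))

data.frame(word = words, score = scores, sentiment = labels)
```

Note that words absent from the lexicon end up labelled neutral, which is one reason a lexicon-based analysis can report a large neutral share.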
posts_unested <- select(complete_reddit_data, title, text, Year, score, num_comments)
comments_unested <- select(complete_reddit_data, comb_comments, Year)
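# Note: both calls write to `word`, so the second call replaces the title
# tokens with text tokens and repeats each text token once per title token.
posts_unested <- posts_unested |>
unnest_tokens(output = word,
input = title) |>
unnest_tokens(output = word,
input = text)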
comments_unested <- comments_unested |>
unnest_tokens(output = cword,
input = comb_comments)
# Remove stop words from both posts and comments
posts_unested <- posts_unested |>
anti_join(stop_words, by = "word")
comments_unested <- comments_unested |>
anti_join(stop_words, by = c("cword" = "word"))
glimpse(posts_unested)
Rows: 55,746
Columns: 4
$ Year <chr> "2022", "2022", "2022", "2022", "2022", "2022", "2022", "…
$ score <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
$ num_comments <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ word <chr> "successfully", "established", "calming", "wellness", "ma…
glimpse(comments_unested)
Rows: 71,682
Columns: 2
$ Year <chr> "2022", "2022", "2022", "2022", "2022", "2022", "2022", "2022", …
$ cword <chr> "failed", "fairly", "spectacularly", "wasn", "safe", "space", "k…
For this analysis, words from each post’s title and text have been unnested into a new object, posts_unested. Similarly, comments have been unnested into comments_unested. I also performed some initial cleaning of the data by removing “stop words”, common words that do not convey much information on their own, such as “the”, “and”, and “if”. After unnesting, we end up with 55,746 word tokens for posts and 71,682 for comments. Let’s now take a closer look at the words. Using ggplot(), we can create a column chart that displays the most frequent words in each data frame.
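One way to keep both title and text tokens in a single word column is to paste the two fields together before tokenizing, so each word is counted once. The following is a minimal sketch on a made-up post, not the pipeline that produced the counts reported here; `toy_posts` and `post_text` are hypothetical names.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-post example; the column names mirror the post data frame.
toy_posts <- tibble(
  title = "Calming room advice",
  text  = "Has anyone established the room successfully?",
  score = 10
)

toy_tokens <- toy_posts |>
  mutate(post_text = paste(title, text)) |>   # combine title and body first
  select(-title, -text) |>
  unnest_tokens(output = word, input = post_text)

# Drop common stop words such as "the" and "has".
toy_clean <- toy_tokens |>
  anti_join(stop_words, by = "word")

toy_tokens$word
toy_clean$word
```

Because the title and body are tokenized in a single pass, the number of rows equals the number of words in the combined string.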
#identifying top words
posts_unested|>
count(word, sort = TRUE) |>
filter(n > 200) |>
mutate(word = reorder(word, n)) |>
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)
comments_unested |>
count(cword, sort = TRUE) |>
filter(n > 200) |>
mutate(cword = reorder(cword, n)) |>
ggplot(aes(n, cword)) +
geom_col() +
labs(y = NULL)
Looking through the charts, we can see a lot of undesirable tokens, such as “NA”; word fragments, such as “ve” and “don”; and words that are obvious given where the data came from, such as “school”, “psych”, and “psychology”. To get more usable data, I will need to remove these words. Luckily, I can use a procedure similar to the one described above and specify my own custom stop words.
# custom stopwords
restop <- c("school", "schools", "schoolpsychology", "psychology", "psych", "psychs",
"psychologist", "psychologists","grad", "graduate", "graduates", "NA", "N/A", "ve", "reddit",
"old.reddit.com", "don", "thread", "https", ".com", "faq", "wiki",
"2", "1", "3", "4", "5", "6", "7", "8", "9", "10", "deleted", "ll", "11",
"amp", NA)
posts_unested <-
posts_unested |>
filter(!word %in% restop)
comments_unested <-
comments_unested |>
filter(!cword %in% restop)
# reviewing top words
posts_unested |>
count(word, sort = TRUE) |>
filter(n > 200) |>
mutate(word = reorder(word, n)) |>
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL, x = "Number of Occurrences", title = "Top Post Words")
comments_unested |>
count(cword, sort = TRUE) |>
filter(n > 200) |>
mutate(cword = reorder(cword, n)) |>
ggplot(aes(n, cword)) +
geom_col() +
labs(y = NULL, x = "Number of Occurrences", title = "Top Comment Words")
Looking at the top post words, we can see that they mostly pertain to time, jobs, districts, students, and feelings. Comments refer to programs, time, districts, jobs, and students. There are some clear overlaps between what is commonly discussed in posts and comments. Another way to visualize words is through word clouds.
# creating word clouds
posts_unested |>
select(word) |>
count(word, sort = TRUE) |>
slice_max(order_by = n, n = 50) |>
wordcloud2(size = 0.5)
In this view, the same words for posts are represented as they were in the previous chart. However, this lets us better visualize the relative frequency of each word.
Now that the data are prepared, we’re ready to begin answering some questions! The first is, what are the sentiments expressed on r/schoolpsychology? The second is, is there a difference between sentiments expressed in posts compared to comments? To answer these questions, I’ll need to conduct a sentiment analysis on the words. The vader package provides a lexicon of words that have already been rated from negative (-1) to positive (1), which we can compare our words against to assign a rating to each.
# Sentiment Analysis
sentiments_posts <- vader_df(posts_unested$word)
sentiments_comments <- vader_df(comments_unested$cword)
vader_posts_summary <- sentiments_posts |>
mutate(sentiment = ifelse(compound >= 0.05, "positive",
ifelse(compound <= -0.05, "negative", "neutral"))) |>
count(sentiment, sort = TRUE) |>
spread(sentiment, n) |>
relocate(positive) |>
mutate(ratio = positive/negative)
vader_comments_summary <- sentiments_comments |>
mutate(sentiment = ifelse(compound >= 0.05, "positive",
ifelse(compound <= -0.05, "negative", "neutral"))) |>
count(sentiment, sort = TRUE) |>
spread(sentiment, n) |>
relocate(positive) |>
mutate(ratio = positive/negative)
vader_posts_summary
positive negative neutral ratio
1 4047 2436 42061 1.66133
vader_comments_summary
positive negative neutral ratio
1 6108 2993 56557 2.040762
A few things become apparent from the vader analysis. First, comments have a higher ratio of positive to negative words than posts. For every 100 negative words in the comments, there are about 204 positive words, whereas posts have a ratio of 1.66, or about 166 positive words for every 100 negative words. Taken together, it seems that there is a neutral to positive overall sentiment expressed on r/schoolpsychology.
The second observation is that there is a large proportion of words that vader classified as neutral: 86.6% of post words and 86.1% of comment words. This may be because these words are truly neutral (not having a negative or positive connotation), or because they are not included in vader’s dictionary, in which case vader cannot assign a valence rating. We can take a look at which words were assigned negative, positive, and neutral ratings next.
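The ratios and neutral percentages quoted above follow directly from the counts in the two summary tables. As a quick check, the arithmetic can be reproduced in base R; the variable names here are illustrative.

```r
# Counts copied from the vader summaries printed above.
posts    <- c(positive = 4047, negative = 2436, neutral = 42061)
comments <- c(positive = 6108, negative = 2993, neutral = 56557)

# Positive-to-negative ratios.
posts_ratio    <- posts[["positive"]] / posts[["negative"]]        # about 1.66
comments_ratio <- comments[["positive"]] / comments[["negative"]]  # about 2.04

# Share of words labelled neutral, as a percentage of all rated words.
posts_neutral_pct    <- round(100 * posts[["neutral"]] / sum(posts), 1)        # 86.6
comments_neutral_pct <- round(100 * comments[["neutral"]] / sum(comments), 1)  # 86.1

c(posts_ratio, comments_ratio)
c(posts_neutral_pct, comments_neutral_pct)
```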
# most common positive and negative sentiments
sentiments_posts <- sentiments_posts |>
mutate(sentiment = ifelse(compound >= 0.05, "positive",
ifelse(compound <= -0.05, "negative", "neutral")))
sentiments_comments <- sentiments_comments |>
mutate(sentiment = ifelse(compound >= 0.05, "positive",
ifelse(compound <= -0.05, "negative", "neutral")))
sentiments_posts |>
count(text, sentiment, sort = TRUE) |>
filter(sentiment == "positive") |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup()
# A tibble: 11 × 3
text sentiment n
<chr> <chr> <int>
1 love positive 154
2 curious positive 133
3 support positive 123
4 helpful positive 109
5 special positive 107
6 pretty positive 90
7 honestly positive 84
8 hope positive 80
9 appreciated positive 76
10 emotional positive 73
11 feeling positive 73
sentiments_posts |>
count(text, sentiment, sort = TRUE) |>
filter(sentiment == "negative") |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup()
# A tibble: 11 × 3
text sentiment n
<chr> <chr> <int>
1 hard negative 124
2 anxiety negative 107
3 low negative 86
4 pay negative 80
5 bad negative 57
6 difficult negative 57
7 tough negative 55
8 nervous negative 45
9 worried negative 45
10 crazy negative 40
11 leave negative 40
sentiments_posts |>
count(text, sentiment, sort = TRUE) |>
filter(sentiment == "neutral") |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup()
# A tibble: 10 × 3
text sentiment n
<chr> <chr> <int>
1 district neutral 520
2 job neutral 449
3 student neutral 435
4 time neutral 420
5 students neutral 407
6 program neutral 383
7 feel neutral 368
8 programs neutral 291
9 experience neutral 284
10 testing neutral 281
sentiments_comments |>
count(text, sentiment, sort = TRUE) |>
filter(sentiment == "positive") |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup()
# A tibble: 10 × 3
text sentiment n
<chr> <chr> <int>
1 love positive 167
2 pretty positive 156
3 luck positive 154
4 accepted positive 124
5 hope positive 115
6 lol positive 109
7 helpful positive 105
8 special positive 103
9 support positive 97
10 honestly positive 75
sentiments_comments |>
count(text, sentiment, sort = TRUE) |>
filter(sentiment == "negative") |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup()
# A tibble: 11 × 3
text sentiment n
<chr> <chr> <int>
1 pay negative 185
2 hard negative 125
3 leave negative 76
4 bad negative 75
5 anxiety negative 64
6 difficult negative 61
7 stress negative 50
8 rejected negative 40
9 tough negative 37
10 low negative 35
11 nervous negative 35
sentiments_comments |>
count(text, sentiment, sort = TRUE) |>
filter(sentiment == "neutral") |>
group_by(sentiment) |>
slice_max(n, n = 10) |>
ungroup()
# A tibble: 10 × 3
text sentiment n
<chr> <chr> <int>
1 program neutral 727
2 time neutral 671
3 district neutral 542
4 job neutral 480
5 student neutral 371
6 students neutral 371
7 lot neutral 360
8 interview neutral 345
9 people neutral 342
10 experience neutral 328
The tables generated above display the top 10 words for posts and comments. In brief, they include:
Posts
| Valence | Top Words |
|---|---|
| Positive | love, curious, support, helpful, special |
| Neutral | district, job, student, time, students |
| Negative | hard, anxiety, low, pay, bad |
Comments
| Valence | Top Words |
|---|---|
| Positive | love, pretty, luck, accepted, hope |
| Neutral | program, time, district, job, student |
| Negative | pay, hard, leave, bad, anxiety |
We can also see positive and negative words expressed visually through a “comparison cloud” that plots the most frequent positive and negative words next to each other (the call below caps the plot at 50 words via max.words). This view allows us to directly compare which words are scored as positive and negative in the data, as well as their relative frequency.
suppressWarnings({sentiments_posts |>
filter(sentiment %in% c("positive", "negative")) |>
count(text, sentiment, sort = TRUE) |>
acast(text ~ sentiment, value.var = "n", fill = 0) |>
comparison.cloud(colors = c("darkblue", "goldenrod3"),
max.words = 50)
})
Looking through the words, it does appear that vader did a decent job at selecting truly positive, negative, and neutral words. The positive words across posts and comments seem to deal with positive emotions, as well as being accepted. In this case, “accepted” is likely coming from a weekly graduate student thread posted on r/schoolpsychology, where prospective or current students can ask questions or share news of their acceptance into graduate programs. Neutral words also seem to be more objectively neutral, such as time, district, students, etc. Of course, the context of words matters quite a bit. A limitation of examining each word individually is that we don’t know the context in which it was used. For example, if “pay” is preceded by “high”, then we might consider that a positive sentiment (getting high pay); however, if “pay” is preceded by “low”, we might interpret that as a complaint about receiving “low pay”. To get a bit more context for the data, I also generate bigrams, or two-word sequences, rather than looking at one word at a time. Doing so will require some tweaking to the code above, but the process is largely the same.
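To make the idea concrete before applying it to the full dataset, here is a small base-R sketch of what a bigram is. `make_bigrams()` is a hypothetical helper written for illustration only; it is a simplified stand-in for the ngram tokenizer used by tidytext, and the two sentences are made up.

```r
# Hypothetical helper: lowercase a string, split it into words, and pair each
# word with the word that follows it.
make_bigrams <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]              # drop empty pieces
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))    # word i joined with word i + 1
}

# Made-up sentences: "pay" appears in both, but its neighbor changes the meaning
low_pay  <- make_bigrams("The low pay is hard")
high_pay <- make_bigrams("High pay helps")
print(low_pay)
print(high_pay)
```

The unigram “pay” is identical in both sentences, while the bigrams “low pay” and “high pay” keep the word that carries the valence.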
#creating bigrams
posts_unested_big <- select(complete_reddit_data, title, text, Year, score, num_comments)
comments_unested_big <- select(complete_reddit_data, comb_comments, Year)
post_bigram <- posts_unested_big |>
unnest_tokens(output = word,
input = title,
token = "ngrams",
n = 2) |>
unnest_tokens(output = word,
input = text,
token = "ngrams",
n = 2)
comments_bigram <- comments_unested_big |>
unnest_tokens(output = cword,
input = comb_comments,
token = "ngrams",
n = 2)
posts_bigrams_separated <- post_bigram |>
separate(word, c("word1", "word2"), sep = " ")
comm_bigrams_separated <- comments_bigram |>
separate(cword, c("word1", "word2"), sep = " ")
post_clean_bigrams <- posts_bigrams_separated |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word) |>
filter(!word1 %in% restop) |>
filter(!word2 %in% restop) |>
unite(word, word1, word2, sep = " ")
comments_clean_bigrams <- comm_bigrams_separated |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word) |>
filter(!word1 %in% restop) |>
filter(!word2 %in% restop) |>
unite(cword, word1, word2, sep = " ")
post_clean_bigrams |> count(word, sort = TRUE)
# A tibble: 1,527 × 2
word n
<chr> <int>
1 questions related 96
2 programs admissions 85
3 training programs 85
4 special education 69
5 post questions 53
6 iep meetings 48
7 social emotional 48
8 hard time 42
9 mental health 42
10 assignments practicum 36
# ℹ 1,517 more rows
comments_clean_bigrams |> count(cword, sort = TRUE)
# A tibble: 13,979 × 2
cword n
<chr> <int>
1 mental health 78
2 special education 57
3 report writing 47
4 private practice 39
5 phd program 36
6 feel free 33
7 nasp approved 33
8 eds program 30
9 special ed 27
10 message compose 26
# ℹ 13,969 more rows
Using bigrams gave a bit more context for the data. Most of the frequently occurring bigrams in posts appear to range from neutral (e.g., “programs admissions”, “special education”) to positive (e.g., “greatly appreciated”). Comments deal with similar topics, such as “mental health” and “report writing”. Thus, it seems that using single words (unigrams) in our analysis is capturing the true sentiments expressed on r/schoolpsychology.
Now that we’ve looked at the general sentiments expressed on r/schoolpsychology, we can turn our attention to the third question: How have sentiments changed over time? This can be accomplished fairly easily using ggplot() as well as features in the dataset that I’ve already created. Namely, I will generate a bar chart that plots the ratio of positive to negative scores by year.
#sentiments over time
posts_unested<- bind_cols(posts_unested, select(sentiments_posts, compound, sentiment))
comments_unested<- bind_cols(comments_unested, select(sentiments_comments, compound, sentiment))
posts_time <- posts_unested |>
group_by(Year, sentiment) |>
count(sentiment, sort = TRUE) |>
spread(sentiment, n) |>
relocate(Year) |>
mutate(ratio = positive/negative)
comments_time <- comments_unested |>
group_by(Year, sentiment) |>
count(sentiment, sort = TRUE) |>
spread(sentiment, n) |>
relocate(Year) |>
mutate(ratio = positive/negative)
posts_time |> ggplot(aes(Year, ratio)) +
geom_col() +
ggtitle ("Positive & Negative Posts by Year") +
theme_minimal()
comments_time |> ggplot(aes(Year, ratio)) +
geom_col() +
ggtitle ("Positive & Negative Comments by Year") +
theme_minimal()
Looking at the bar charts, we can see that for posts, the year with the lowest ratio of positive to negative words was 2022. In addition, it appears that positive posts have somewhat decreased over time; however, there continue to be more positive than negative posts, as a proportion, each year. In contrast, for comments, 2018 had the lowest ratio of positive to negative words, and the ratio increased until 2021, after which it has remained consistently high. In short, posts have become somewhat less positive over time and comments have become more positive, but there have consistently been more positive than negative sentiments expressed since 2018.
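As a reading aid for the ratio plotted above, here is a minimal base-R sketch on made-up data. The `toy` data frame and its values are hypothetical and are not drawn from the Reddit dataset; the sketch only shows how a positive-to-negative ratio per year is formed and how to read it.

```r
# Made-up word-level labels for two years (illustration only)
toy <- data.frame(
  Year = c(2018, 2018, 2018, 2019, 2019, 2019, 2019),
  sentiment = c("positive", "positive", "negative",
                "positive", "negative", "negative", "neutral")
)

# Cross-tabulate year by sentiment, then divide positive by negative counts
counts <- table(toy$Year, toy$sentiment)
toy_ratio <- counts[, "positive"] / counts[, "negative"]
print(toy_ratio)
```

A ratio above 1 means positive words outnumber negative words in that year, and a ratio below 1 means the reverse. Neutral words do not enter the ratio at all, and a year with no negative words would produce an infinite value.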
Now that we’ve examined how posts have changed over time, we’re ready to answer our final question, Do negative sentiments get more engagement than positive sentiments (comments, upvotes, downvotes)? To assess this, we could just examine the average sentiment ratings, points, and the number of comments of positive and negative words and see which appears to be larger. However, a more robust approach would be to use the aov function in order to conduct a one-way ANOVA, which will allow us to examine if mean differences in scores and number of comments between positive, negative, and neutral words are significant. However, an ANOVA will only tell us if there are significant differences between groups. Tukey’s HSD will be used as a post-hoc comparison in order to examine which groups were significantly different from each other.
# Do posts with negative sentiments get more engagement?
posts_unested <- posts_unested %>%
mutate(across(c(score, num_comments), as.numeric))
posts_unested |>
group_by(sentiment) |>
summarize(mean_num_comments = mean(num_comments, na.rm = TRUE),
mean_score = mean(score, na.rm = TRUE))
# A tibble: 3 × 3
sentiment mean_num_comments mean_score
<chr> <dbl> <dbl>
1 negative 12.8 16.1
2 neutral 14.2 15.0
3 positive 12.1 15.0
posts_unested$sentiment <- factor(posts_unested$sentiment,
levels = c("negative", "neutral", "positive"))
scoredif <- aov(score ~ sentiment, data = posts_unested)
summary(scoredif)
Df Sum Sq Mean Sq F value Pr(>F)
sentiment 2 2844 1421.9 68.64 <2e-16 ***
Residuals 48541 1005606 20.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
commdif <- aov(num_comments ~ sentiment, data = posts_unested)
summary(commdif)
Df Sum Sq Mean Sq F value Pr(>F)
sentiment 2 19878 9939 41.24 <2e-16 ***
Residuals 48541 11697126 241
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(scoredif)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = score ~ sentiment, data = posts_unested)
$sentiment
diff lwr upr p adj
neutral-negative -1.109385866 -1.3316905 -0.8870812 0.0000000
positive-negative -1.100359264 -1.3739141 -0.8268044 0.0000000
positive-neutral 0.009026602 -0.1665406 0.1845938 0.9920269
TukeyHSD(commdif)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = num_comments ~ sentiment, data = posts_unested)
$sentiment
diff lwr upr p adj
neutral-negative 1.4221443 0.6639609 2.180328 0.0000328
positive-negative -0.6813163 -1.6142915 0.251659 0.2008148
positive-neutral -2.1034606 -2.7022431 -1.504678 0.0000000
Let’s first look at the mean number of comments and scores for positive, negative, and neutral posts. The analysis indicates that posts containing negative sentiments get, on average, 12.79 comments, neutral posts get 14.22 comments, and positive posts get only 12.11 comments. It appears that positive posts get marginally fewer comments than negative or neutral posts. Looking at scores, however, tells a slightly different story. Posts containing negative sentiments get 16.09 points, neutral posts get 14.98 points, and positive posts get 14.99 points. It appears that posts that express more negative sentiments get more points. We can also see if these differences are statistically significant.
In terms of scores, findings from a one-way ANOVA indicate that there was a significant difference between sentiment groups (F(2, 48541) = 68.64, p < .001). Tukey’s HSD indicated that negative sentiments (M = 16.09) had significantly higher scores than neutral (M = 14.98) and positive (M = 14.99) sentiments, while neutral and positive sentiments did not differ from each other. This indicates that negative sentiments received higher scores than positive or neutral sentiments overall.
In terms of the number of comments, a one-way ANOVA similarly indicated that there was a significant difference between sentiment groups (F(2, 48541) = 41.24, p < .001). However, Tukey’s HSD indicated that neutral sentiments (M = 14.22) received significantly more comments than positive (M = 12.11) and negative (M = 12.79) sentiments, and that posts with positive and negative sentiments did not significantly differ from each other.
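With roughly 48,500 residual degrees of freedom, even small mean differences can reach significance, so it may help to look at effect size alongside the p-values. The sketch below is a supplementary check that was not part of the original analysis: it computes eta squared (the share of total variance attributable to sentiment group) from the rounded sums of squares printed in the two ANOVA tables above. `eta_squared()` is a hypothetical helper defined here, not a function from a package.

```r
# Hypothetical helper: eta squared = SS_effect / (SS_effect + SS_residual)
eta_squared <- function(ss_effect, ss_residual) {
  ss_effect / (ss_effect + ss_residual)
}

# Sums of squares copied from the printed ANOVA tables above (rounded values)
eta_sq_score    <- eta_squared(2844, 1005606)     # score ~ sentiment
eta_sq_comments <- eta_squared(19878, 11697126)   # num_comments ~ sentiment

print(round(c(score = eta_sq_score, num_comments = eta_sq_comments), 4))
```

Both values come out below 0.01, meaning sentiment group accounts for well under 1% of the variance in either outcome. Under these assumptions, the differences are statistically reliable but modest in size.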
Thus, in reference to the fourth question, how “engagement” is defined matters. Negative sentiments get more points than positive sentiments, but posts with negative and positive sentiments get similar numbers of comments.
We can speculate further about why these results may have occurred. It may be that posts that express negative sentiments are generally seen as more relatable to other school psychologists. Thus, they are more likely to get upvoted when they occur. The finding that both posts with negative and positive sentiments get similar numbers of comments indicates that people may be engaging or supporting others when positive or celebratory posts are made. Similarly, negative posts may also elicit more empathy and support from others when they come up. For example, the post entitled “When does the anxiety stop?” received 22 upvotes and 14 comments. The majority of comments appear to be supportive of the post creator. For example:
- “For real though, I finally got to the point of not having panic attacks when I learned to take the onus off of myself. I am not the decision maker. I am not responsible for what the numbers look like (unless I made a mistake.) I am not the teacher. I am a part of a team and the team, including the parent, is responsible for decisions made. Honestly, I am probably one of the least important at the table, but I’m sure if you’re like me, you didn’t feel like that at the beginning of your career. And again, remember its your career, your job, not your life.”
Step 4: Communicate
Let’s briefly recap and summarize our four research questions:
- What are the sentiments expressed on r/schoolpsychology?
- In general, sentiments appear to be neutral to positive.
- Is there a difference between sentiments expressed on posts compared to comments?
- Yes, comments appear to contain more positive sentiments than posts.
- How have sentiments changed over time?
- Posts have become somewhat less positive over time and comments have become more positive, but positive sentiments have outnumbered negative ones in every year.
- Do negative sentiments get more engagement than positive sentiments (comments, upvotes, downvotes)?
- It depends on the measure: negative sentiments receive more points than positive sentiments, but a similar number of comments.
What does this mean for online school psychology communities?
From the current study, it appears that r/schoolpsychology is generally a positive place for current and prospective school psychologists. However, while the sentiments expressed on r/schoolpsychology are largely positive, there were proportionally more neutral sentiments, as well as a fairly sizable proportion of negative sentiments. In addition, negative sentiments were more likely to get more points. Posts with negative sentiments may lead to more discussion or sympathy for the negative feelings expressed. It appears that school psychologists may use r/schoolpsychology to vent or express frustration, and when they do, they receive positive support from others. This is consistent with prior research indicating that online professional communities can offer a space for professionals to vent and receive support from colleagues (Oksanen et al. 2024).
While posts expressing negative sentiments appear to be proportionally fewer than positive sentiments, research indicates that negative information may capture a disproportionate amount of our attention (Ito et al. 1998). Thus, these posts may still be what a prospective applicant thinks about when they think about the field of school psychology. This may be especially likely when prospective applicants view r/schoolpsychology by filtering by the “top” posts or by “hot” posts. Both show posts to users based on the total number of upvotes given to each post, with “hot” also favoring more recently created posts. Ranking by “hot” is the default method that Reddit uses to show posts. Thus, since negative sentiments are likely to get more upvotes, users browsing the platform are more likely to see posts containing these sentiments and are less likely to see the proportionally larger amount of positive or neutral posts.
Limitations
A key limitation of this study is that while sentiment analysis is able to provide a general understanding of sentiments expressed across a large corpus of text, the approach may miss more subtle nuances in language or context that change the meaning of the message being conveyed. For this reason, qualitative or mixed methods approaches examining posts directly in conjunction with sentiment analysis may provide more accurate insights. In addition, there was a significant amount of missing data, which may have biased the present results. This limitation may be overcome by gaining full access from Reddit to use its API for research through applications such as Pushshift.io (“NCRI Reddit Access,” n.d.). Another limitation is that the views expressed on Reddit may not be representative of all school psychologists. However, research on the perceptions of school psychologists has rarely accounted for the views of school psychologists online. Given that r/schoolpsychology has over 13,000 members, the views of this community are likely to represent a significant, and under-researched, perspective in the field. Future research may supplement the perspectives of school psychologists online with qualitative approaches such as in-person interviews or questionnaires.
There are also several ethical considerations for the present study. First, data were taken from Reddit users who did not necessarily consent to being part of a study. This concern may be mitigated by the fact that Reddit is an anonymous social networking website, so the data used cannot be traced to a user’s real name or other personally identifiable information unless they have chosen to disclose that information. In addition, usernames were not used in this analysis. A related potential legal concern is the use of Reddit data for analysis. At this time, using the Reddit API to pull post information for research may still be allowed by Reddit’s terms of service, and packages like RedditExtractoR are still allowed to function. However, there is a lack of clarity around this issue, as Reddit recently updated its terms of service to ban the use of its API for machine learning specifically (“Data API Terms,” n.d.). Since this is a research project and is not seeking to build a commercial product, it is likely that use of Reddit data in the present analysis is legally defensible.
Next Steps
While it is heartening to see that the majority of sentiments expressed on r/schoolpsychology are positive, there remains a sizable portion of negativity being expressed. These negative sentiments may provide an avenue for future efforts to intervene and improve working conditions for current and future practitioners. Many of the negative sentiments expressed deal with stress, anxiety, workload, and pay, among other concerns. These issues are not new in the field of school psychology and are major contributors to burnout and attrition (Schilling, Randolph, and Boan-Lenzo 2018). These issues require advocacy at the national, state, and regional levels through organizations such as NASP to promote better working conditions, pay, and support for new and seasoned school psychologists alike. For instance, NASP could advocate for policies that set limits on the number of evaluations school psychologists are required to complete each year, helping to reduce burnout and ensure that students receive high quality services. Additionally, addressing workforce shortages by investing in recruitment efforts is crucial. This could involve raising awareness about the field among undergraduate students and establishing structured pathways that offer research and practice-oriented experiences, which are often necessary for admission into school psychology graduate programs. Doing so may lead to a more organic reduction in negative sentiments expressed online and make school psychology a more attractive field for current and future practitioners.