Text as Data: Blog #2

Data Set

Context
This project uses data from four subreddits, covering 2019 through 2021. Three represent major faith communities (r/Christianity, r/Islam, and r/Judaism), and the fourth is a COVID-19 group, r/COVID19.

Data Acquisition
Using the RedditExtractoR package in RStudio, I pulled the data with the find_thread_urls function. This function returns the date/time stamp, title, body text, subreddit name, and URL of each thread. I searched each subreddit separately, sorting by the top threads of all time, which yielded approximately 1000 observations per subreddit. I chose the “all” time period to capture multiple years, since the function only allows searches within the last year, month, week, or day, or across all time. Note that I could only retrieve up to 1000 observations per search.
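The pull described above can be sketched roughly as follows. This is a minimal sketch, not my exact script: pull_top_threads is a hypothetical helper name, and I am assuming that find_thread_urls accepts subreddit, sort_by, and period arguments as in RedditExtractoR version 3. The fetcher argument exists only so the loop can be tried with a stand-in function, without querying Reddit.

```r
# Hypothetical helper (not part of RedditExtractoR): pull the top threads
# for several subreddits and stack the results into one data frame.
pull_top_threads <- function(subreddits,
                             period = "all",
                             fetcher = RedditExtractoR::find_thread_urls) {
  pulls <- lapply(subreddits, function(sr) {
    # One search per subreddit, sorted by top threads for the chosen period.
    fetcher(subreddit = sr, sort_by = "top", period = period)
  })
  # Stack the per-subreddit data frames row-wise.
  do.call(rbind, pulls)
}

# Not run here because it queries Reddit:
# threads <- pull_top_threads(c("Christianity", "Islam", "Judaism", "COVID19"))
```

Because each search is capped at roughly 1000 threads, running one search per subreddit is what gives each group a comparable number of observations.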

Tidying
After gathering each subreddit’s threads, I combined the separate pulls into one data frame and tidied it. This included splitting the date into year, month, and day, and combining the title and text columns into a single field. Combining was necessary because 90% of the threads had no body text (this was particularly true for r/COVID19). Finally, I filtered out threads dated before 2019 or after 2021. The remaining range provides a base year, 2019, and the two years under review, 2020-2021. The final data set contains 3481 observations.
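A minimal base R sketch of these tidying steps is below. The function name tidy_threads is hypothetical, the input column names (date_utc, timestamp, title, text, subreddit, comments) are my assumption about what find_thread_urls returns, and the three toy rows are made up purely to exercise the code.

```r
# Hypothetical helper: split the date, merge title and text, keep 2019-2021.
tidy_threads <- function(threads) {
  date <- as.Date(threads$date_utc)
  out <- data.frame(
    year      = as.numeric(format(date, "%Y")),
    month     = format(date, "%m"),   # zero-padded character, e.g. "01"
    day       = format(date, "%d"),
    timestamp = threads$timestamp,
    # Title and body joined with "_"; an empty body leaves a trailing "_".
    all_text  = paste(threads$title, threads$text, sep = "_"),
    subreddit = toupper(threads$subreddit),
    comments  = threads$comments,
    stringsAsFactors = FALSE
  )
  # Keep the base year (2019) and the two years under review (2020-2021).
  out <- out[out$year >= 2019 & out$year <= 2021, ]
  # Order from earliest to most recent.
  out[order(out$timestamp), ]
}

# Made-up toy input, for illustration only.
toy <- data.frame(
  date_utc  = c("2018-12-31", "2019-01-13", "2020-03-01"),
  timestamp = c(1546214400, 1547392685, 1583020800),
  title     = c("Old thread", "Love torah memes", "Example title"),
  text      = c("", "", "Example body"),
  subreddit = c("Judaism", "Judaism", "COVID19"),
  comments  = c(3, 15, 40),
  stringsAsFactors = FALSE
)
tidy <- tidy_threads(toy)
print(tidy)
```

The 2018 row is dropped by the year filter, and the thread with no body text keeps its title followed by the separator.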

Structure of Data Set

Character variables: all_text (combines the original title and text of each thread, separated by a '_'), subreddit, month, day (month and day are stored as zero-padded strings)
Numeric variables: year, time stamp, comments (number of comments associated with each thread)

The data sample below is ordered by time stamp, from earliest to most recent.

# A tibble: 10 x 7
    year month day    timestamp all_text            subreddit comments
   <dbl> <chr> <chr>      <dbl> <chr>               <chr>        <dbl>
 1  2019 01    01    1546341461 "Alhamdulillah, I ~ ISLAM          151
 2  2019 01    01    1546375892 "Lettered Bible ve~ CHRISTIA~       97
 3  2019 01    04    1546642199 "Unexpected conseq~ JUDAISM         51
 4  2019 01    06    1546798310 "This morning\u001~ CHRISTIA~       78
 5  2019 01    06    1546803241 "A WTF tweet from ~ JUDAISM        127
 6  2019 01    07    1546832194 "Happy 160th birth~ JUDAISM         25
 7  2019 01    08    1546949876 "Ukraine: Bar Mitz~ JUDAISM         19
 8  2019 01    09    1547042752 "... it never occu~ JUDAISM        102
 9  2019 01    11    1547249236 "Jesus expels the ~ CHRISTIA~      426
10  2019 01    13    1547392685 "Love torah memes_" JUDAISM         15

The next set of tables and plots shows the distribution of threads per subreddit group over 2019 – 2021.

Frequency of subreddit threads by year:

      subreddit
year   CHRISTIANITY COVID19 ISLAM JUDAISM
  2019          186       0    68     194
  2020          351     648   377     383
  2021          254     298   421     301

The total number of threads is 3481 overall: 791 (r/CHRISTIANITY), 946 (r/COVID19), 866 (r/ISLAM), and 878 (r/JUDAISM).

Portion of total observations by subreddit per year (in percentages):

      subreddit
year   CHRISTIANITY   COVID19     ISLAM   JUDAISM
  2019     5.343292  0.000000  1.953462  5.573111
  2020    10.083309 18.615340 10.830221 11.002585
  2021     7.296754  8.560758 12.094226  8.646941

Summed across all three years, each subreddit’s share of total threads is 22.72% (r/CHRISTIANITY), 27.18% (r/COVID19), 24.88% (r/ISLAM), and 25.22% (r/JUDAISM).
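As a check on the arithmetic, the totals and shares can be recomputed from the counts in the frequency table. This sketch types the counts in by hand; with the tidy data frame itself, something like table(year, subreddit) would produce the same matrix.

```r
# Thread counts copied from the frequency table (rows = year, columns = subreddit).
counts <- matrix(
  c(186,   0,  68, 194,
    351, 648, 377, 383,
    254, 298, 421, 301),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    year = c("2019", "2020", "2021"),
    subreddit = c("CHRISTIANITY", "COVID19", "ISLAM", "JUDAISM")
  )
)

total         <- sum(counts)               # all threads
per_subreddit <- colSums(counts)           # threads per subreddit
cell_pct      <- 100 * prop.table(counts)  # each cell as a percent of all threads
subreddit_pct <- round(100 * per_subreddit / total, 2)

print(total)
print(per_subreddit)
print(subreddit_pct)
```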

View Plot of Total Threads by Subreddit and Year Here

Overall, I found this distribution acceptable: each subreddit is well represented, and no group is overly skewed. Note that the r/COVID19 subreddit was formed in February 2020 and thus has no threads for 2019. Also, r/Christianity has slightly fewer observations than r/Islam and r/Judaism because its all-time top threads included more posts from earlier years, which the date filter removed. Nonetheless, this distribution appears appropriate to move forward.

Literature Review

Context
Researchers from MIT and Harvard published a study in the Journal of Medical Internet Research that looked for negative semantic changes in mental health support groups on Reddit. The study used natural language processing (NLP) to characterize changes in 15 of the world’s largest mental health support subreddits and 11 non-mental health subreddits during the first few months of the COVID-19 pandemic.

Methods
Using posts from over 820,000 unique users from 2018 to 2020, researchers developed machine learning models to analyze trends in text-derived features such as sentiment and semantic categories. They started with feature extraction for various measures (e.g., sentiment analysis, readability, term frequency) and manually built lexicons for other topics of interest (e.g., economic stress, isolation, substance abuse). They employed supervised machine learning to classify posts and interpret how different mental health problems manifested in language, and unsupervised methods such as topic modeling and clustering to discover concerns before and during the pandemic.

For the unsupervised clustering, researchers reduced both the sample of posts per subreddit and the set of term frequency n-grams before applying the SpectralClustering function, identifying 20 clusters. They then used the Wilcoxon rank sum test to determine cluster characteristics and annotate each theme, verifying the annotations by reviewing the posts. This process was carried out separately for pre-pandemic and mid-pandemic posts, and the results were then compared.

For topic modeling, researchers used Latent Dirichlet Allocation (LDA) from the gensim library to determine which sets of words frequently appeared together. They once again reduced the number of posts per subreddit for a pre-pandemic and a mid-pandemic sample. A final model of 10 pre-pandemic topics and a separate model of 10 mid-pandemic topics were selected. The models were then applied to all subreddits to record the topic distribution in each period. Finally, using a two-sided Wilcoxon signed rank test with the Benjamini-Hochberg procedure, topic distributions were compared to test for change across subreddits.
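To make that last testing step concrete for myself, here is a minimal base R sketch of a paired two-sided Wilcoxon signed rank test per topic, followed by a Benjamini-Hochberg correction. It is one plausible reading of the procedure, not the authors' code, and the topic weights are simulated numbers rather than anything from the study.

```r
# Simulated (made-up) topic weights: rows = subreddits, columns = topics.
set.seed(42)
n_subreddits <- 15
n_topics     <- 10
pre <- matrix(runif(n_subreddits * n_topics), nrow = n_subreddits)
mid <- pre + matrix(rnorm(n_subreddits * n_topics, mean = 0.05, sd = 0.1),
                    nrow = n_subreddits)

# One paired, two-sided signed rank test per topic, across subreddits.
p_raw <- sapply(seq_len(n_topics), function(k) {
  wilcox.test(pre[, k], mid[, k], paired = TRUE,
              alternative = "two.sided")$p.value
})

# Benjamini-Hochberg adjustment controls the false discovery rate
# across the 10 simultaneous tests.
p_bh <- p.adjust(p_raw, method = "BH")

print(round(cbind(p_raw, p_bh), 4))
```

The adjustment matters because running one test per topic inflates the chance of a spurious "significant" change. The adjusted p-values are never smaller than the raw ones.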

For supervised dimensionality reduction, researchers employed Uniform Manifold Approximation and Projection (UMAP) to capture nonlinear structure in the data. They then measured the asymmetric Hausdorff distance between subreddits over time to estimate which subreddits were becoming more similar. Once again, the sample was reduced. Posts were grouped into 15-day windows, and the procedure was repeated over 50 samples to obtain the median distance.
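The asymmetric (directed) Hausdorff distance was new to me, so here is a minimal base R sketch of the definition: for every point in set A, find the distance to its nearest point in set B, then take the largest of those. The function name directed_hausdorff and the two tiny point sets are my own illustration. In the study the points would be posts in the reduced space, which I do not reproduce here.

```r
# Directed Hausdorff distance from A to B.
# A and B are numeric matrices: one row per point, same number of columns.
directed_hausdorff <- function(A, B) {
  nearest <- apply(A, 1, function(a) {
    # Euclidean distance from point a to every row of B; keep the smallest.
    min(sqrt(colSums((t(B) - a)^2)))
  })
  # The worst case over all points in A.
  max(nearest)
}

# Two made-up point sets in two dimensions.
A <- rbind(c(0, 0), c(1, 0))
B <- rbind(c(0, 0), c(0, 3))

d_ab <- directed_hausdorff(A, B)  # how far A strays from B
d_ba <- directed_hausdorff(B, A)  # how far B strays from A
print(c(A_to_B = d_ab, B_to_A = d_ba))
```

The two directions give different values here (1 and 3), which is what makes the measure asymmetric: one group can sit close to another without the reverse being true.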

Results
Researchers identified trends in COVID-related posts across each mental health support group, using the number of confirmed cases and major COVID-related events as markers. They saw which features (e.g., economic stress and isolation) increased or decreased during the first few months of the pandemic and charted the semantic change of the various mental health subreddits. Unsupervised clustering also revealed which topics emerged, grew, and became more closely associated from the pre-pandemic to the mid-pandemic period.

Conclusion
Ultimately, researchers uncovered patterns in how specific mental health problems manifest in language, identifying users potentially at risk and showing the distribution of these concerns across Reddit. Text analysis was particularly useful for discovering concerns in real time and for identifying vulnerable groups and alarming themes, an approach that could be applied during other world-changing events.

Take Away
Overall, I found the methods appropriate and thorough for answering the research question. I was curious about the sub-sampling, since it took place at almost every step with varying pool sizes, and I would have liked more information on how the sub-samples were determined. Still, the sub-sampling was encouraging: I had concerns about the size of my data set, but this study routinely used smaller samples of under 3000 posts per subreddit. I will employ some of these methods in my own research.

Next Steps - Preprocessing

The next steps in preprocessing are assembling my corpus, then tokenizing and annotating my data set. I will use the following natural language processing (NLP) techniques:
• N-grams – likely bi- and tri-grams to uncover any phrases of relevance;
• Term Frequency – what are the most common terms or phrases;
• Named Entity Recognition (NER) to reveal the people, organizations, or locations of interest; and
• Part-of-Speech Tagging to document the actions, descriptions, and objects of interest. This could later inform sentiment about a particular topic.

I will use functions from the cleanNLP, quanteda, and quanteda.textstats packages.
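As a first pass at the n-gram and term frequency steps, a minimal sketch with quanteda and quanteda.textstats might look like the following. The two example texts are made up in the same title_text shape as all_text. Replacing the '_' separator with a space before tokenizing is a precaution on my part, so that a title and body cannot be glued into a single token. Named entity recognition and part-of-speech tagging with cleanNLP are not shown.

```r
library(quanteda)
library(quanteda.textstats)

# Made-up example texts shaped like the all_text column.
texts <- c(
  "Love torah memes_",
  "Love torah memes and love torah study_More thoughts on torah study"
)

# Precaution: turn the title/body separator into a space before tokenizing.
texts <- gsub("_", " ", texts, fixed = TRUE)

corp <- corpus(texts)
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
# Dropping stopwords first means n-grams can span the removed words.
toks <- tokens_remove(toks, stopwords("en"))

# Bi- and tri-grams, joined with "_" by default.
ngrams <- tokens_ngrams(toks, n = 2:3)

# Most frequent phrases across the corpus.
freq <- textstat_frequency(dfm(ngrams), n = 5)
print(freq[, c("feature", "frequency", "docfreq")])
```

On the real data the corpus would be built from the all_text column, with subreddit and year kept as document variables so that frequencies can be compared across groups and periods.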

Sources

Low DM, Rumker L, Talkar T, Torous J, Cecchi G, Ghosh SS. Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of Medical Internet Research. 2020;22(10):e22635. doi:10.2196/22635 [https://www.jmir.org/2020/10/e22635]