Introduction

A few months ago I noticed that one subreddit I frequent, r/science, had a growing proportion of pop-psychology posts getting many upvotes. Reddit is a social media site, and scientific journals are notoriously hard to access, so I expected a gap between the types of science content they feature. Still, I decided to test my gut feeling that the subreddit’s content was changing.

In this project I analyzed the top posts of r/science, and investigated which topics interest r/science users the most. I also analyzed post titles through sentiment analysis and word clouds, since titles are what users first see when engaging with content.

Scraping Reddit with PRAW

I used a Python script to scrape r/science top 100 posts over three time periods: the past month, the past year, and all-time. Within these time periods I obtained data for 9 different variables:

...1 ranks the post from 0-99 (0 being the highest rank).
id is a unique string of letters and numbers that identify each post.
created_unix_utc is the time the post was created (as a unix number).
post_url is the url for the website the post links to. This can be from an online news outlet, journal, etc.
post_title is the title of the post.
flair is a tag attached to each post by a moderator that places that post in a category. It represents the post’s topic.
score is the number of times a post was upvoted (each user can upvote a post once).
num_comments is the number of unique comments on a post.
upvote_ratio is the ratio of upvotes (indicating user approval) to downvotes (indicating user disapproval).

The script won’t run here; to run this code yourself, you’ll need to follow the instructions in this guide using your own credentials.

# import packages
import praw
import pandas

# read-only instance
reddit_read_only = praw.Reddit(
    client_id="", #your info here
    client_secret="", #your info here
    user_agent="", #your info here
)

# extract subreddit information

subreddit = reddit_read_only.subreddit("science")

# display subreddit name

print("Display Name:", subreddit.display_name)

# display subreddit title
print("Title:", subreddit.title)

# display subreddit description
print("Description:", subreddit.description)

# get top posts this from time period
# all = all time
# year = past year
# month = past month
posts = subreddit.top("all")

posts_dict = { 
    "id": [],
    "created_unix_utc": [],
    "post_url": [],
    "post_title": [],
    "flair": [],
    "score": [],
    "num_comments": [],
    "upvote_ratio": []
}

for post in posts:
    posts_dict["id"].append(post.id)
    posts_dict["created_unix_utc"].append(post.created_utc)
    posts_dict["post_url"].append(post.url)
    posts_dict["post_title"].append(post.title)
    posts_dict["flair"].append(post.link_flair_text)
    posts_dict["score"].append(post.score)
    posts_dict["num_comments"].append(post.num_comments)
    posts_dict["upvote_ratio"].append(post.upvote_ratio)

# change this when scraping different time periods
top_posts_all= pandas.DataFrame(posts_dict)
top_posts_all

top_posts_all.to_csv("Top-Posts-All.csv")
# repeat for past year, past month

Data Cleaning

I conducted all cleaning, analysis, and visualization in R.

library(tidyverse) # data processing and analysis

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate) # wrangle dates

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(skimr) # skim data frames
library(urltools) # wrangle urls
library(tidytext) # NLP toolkit
library(textdata) # Sentiment analysis
library(wordcloud2) # Word clouds
library(gghighlight) # Adding highlight to graphs
library(wesanderson) # graph colors

Get the Raw Data

All time

# create data frame with top 100 posts from all time
top_all <- read_csv("Top-Posts-All.csv")

## New names:
## Rows: 100 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): id, post_url, post_title, flair dbl (5): ...1, created_unix_utc, score,
## num_comments, upvote_ratio
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

summary(top_all)

##       ...1            id            created_unix_utc      post_url        
##  Min.   : 0.00   Length:100         Min.   :1.438e+09   Length:100        
##  1st Qu.:24.75   Class :character   1st Qu.:1.555e+09   Class :character  
##  Median :49.50   Mode  :character   Median :1.590e+09   Mode  :character  
##  Mean   :49.50                      Mean   :1.581e+09                     
##  3rd Qu.:74.25                      3rd Qu.:1.610e+09                     
##  Max.   :99.00                      Max.   :1.668e+09                     
##   post_title           flair               score         num_comments  
##  Length:100         Length:100         Min.   : 74211   Min.   : 1113  
##  Class :character   Class :character   1st Qu.: 77535   1st Qu.: 1950  
##  Mode  :character   Mode  :character   Median : 83020   Median : 2882  
##                                        Mean   : 88449   Mean   : 3358  
##                                        3rd Qu.: 91584   3rd Qu.: 4048  
##                                        Max.   :199299   Max.   :10930  
##   upvote_ratio   
##  Min.   :0.7100  
##  1st Qu.:0.8400  
##  Median :0.9000  
##  Mean   :0.8819  
##  3rd Qu.:0.9225  
##  Max.   :0.9600

Past year

# create data frame with top 100 posts from last year
top_year <- read_csv("Top-Posts-Year.csv")

## New names:
## Rows: 100 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): id, post_url, post_title, flair dbl (5): ...1, created_unix_utc, score,
## num_comments, upvote_ratio
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

summary(top_all)

##       ...1            id            created_unix_utc      post_url        
##  Min.   : 0.00   Length:100         Min.   :1.438e+09   Length:100        
##  1st Qu.:24.75   Class :character   1st Qu.:1.555e+09   Class :character  
##  Median :49.50   Mode  :character   Median :1.590e+09   Mode  :character  
##  Mean   :49.50                      Mean   :1.581e+09                     
##  3rd Qu.:74.25                      3rd Qu.:1.610e+09                     
##  Max.   :99.00                      Max.   :1.668e+09                     
##   post_title           flair               score         num_comments  
##  Length:100         Length:100         Min.   : 74211   Min.   : 1113  
##  Class :character   Class :character   1st Qu.: 77535   1st Qu.: 1950  
##  Mode  :character   Mode  :character   Median : 83020   Median : 2882  
##                                        Mean   : 88449   Mean   : 3358  
##                                        3rd Qu.: 91584   3rd Qu.: 4048  
##                                        Max.   :199299   Max.   :10930  
##   upvote_ratio   
##  Min.   :0.7100  
##  1st Qu.:0.8400  
##  Median :0.9000  
##  Mean   :0.8819  
##  3rd Qu.:0.9225  
##  Max.   :0.9600

Past month

# create data frame with top 100 posts from last year
top_month <- read_csv("Top-Posts-Month.csv")

## New names:
## Rows: 100 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): id, post_url, post_title, flair dbl (5): ...1, created_unix_utc, score,
## num_comments, upvote_ratio
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

summary(top_month)

##       ...1            id            created_unix_utc      post_url        
##  Min.   : 0.00   Length:100         Min.   :1.668e+09   Length:100        
##  1st Qu.:24.75   Class :character   1st Qu.:1.669e+09   Class :character  
##  Median :49.50   Mode  :character   Median :1.670e+09   Mode  :character  
##  Mean   :49.50                      Mean   :1.670e+09                     
##  3rd Qu.:74.25                      3rd Qu.:1.670e+09                     
##  Max.   :99.00                      Max.   :1.671e+09                     
##   post_title           flair               score        num_comments   
##  Length:100         Length:100         Min.   : 3818   Min.   :  39.0  
##  Class :character   Class :character   1st Qu.: 6746   1st Qu.: 367.2  
##  Mode  :character   Mode  :character   Median :12932   Median : 715.0  
##                                        Mean   :18299   Mean   :1054.4  
##                                        3rd Qu.:25152   3rd Qu.:1342.0  
##                                        Max.   :75014   Max.   :6759.0  
##   upvote_ratio   
##  Min.   :0.6700  
##  1st Qu.:0.8875  
##  Median :0.9400  
##  Mean   :0.9114  
##  3rd Qu.:0.9600  
##  Max.   :0.9800

Data Transformation

Column 1 represents post rank. I changed the column name and added 1 to all rankings (so that posts would be ranked from 1 to 100).

# change column 1 name from ...1 to rank
colnames(top_all)[1] <- "all_rank"
colnames(top_year)[1] <- "year_rank"
colnames(top_month)[1] <- "month_rank"

# add 1 to all the rankings for clarity
top_all$all_rank <- as.numeric(top_all$all_rank) + 1
top_year$year_rank <- as.numeric(top_year$year_rank) + 1
top_month$month_rank <- as.numeric(top_month$month_rank) + 1

I converted the created_unix field from a number to a datetime, and save this as created_utc. This represents the time each post was made in UTC.

top_all$created_utc <- as_datetime(top_all$created_unix_utc)
top_year$created_utc <- as_datetime(top_year$created_unix_utc)
top_month$created_utc <- as_datetime(top_month$created_unix_utc)

I isolated the domain name from each url, and saved this as post_url. The domain is the source website for each post. Sources include news outlets, science magazines, journals, blogs, etc.

top_all$domain <- domain(top_all$post_url)
top_year$domain <- domain(top_year$post_url)
top_month$domain <- domain(top_month$post_url)

I saved the cleaned and transformed data to fresh data frames. I also created data frames containing just the post titles.

# create clean df - all data
all_clean <- top_all
year_clean <- top_year
month_clean <- top_month

# create text df - titles only
text_month <- all_clean[c("post_title")]
text_year <- year_clean[c("post_title")]
text_all <- month_clean[c("post_title")]

Cleaning Text

I analyzed word frequencies in post titles. To do this I needed to restructure the data frame so that each title word has its own column (each row still represents one post).

# restructure one token per row: unnest tokens
text_all <- text_all %>% 
  unnest_tokens(word, post_title)

text_month <-text_month %>% 
  unnest_tokens(word, post_title)

text_year <- text_year %>% 
  unnest_tokens(word, post_title)

I removed stopwords using the tidytext package’s stopword list. Stopwords are common words like “a” and “the” that are very common and not meaningful. There were a few words generic science words in the data sets that I wanted to filter out, such as “study”, “found”, “suggests”, and “research”. I also filtered out numerals.

# get stop words list
data("stop_words")

num_text_all <- text_all %>% 
  arrange(word) %>% 
  head(75)

num_text_year <- text_year %>% 
  arrange(word) %>% 
  head(89)

num_text_month <- text_month %>% 
  arrange(word) %>% 
  head(91)

stop_science <- c("study", "found", "scientist", "scientists", "research", "researchers", "suggests", "finding")
stop_science <- as.data.frame(stop_science)
colnames(stop_science)[1] <- "word"

Word Frequencies Lists

I created lists of the most frequently used words in titles.

All time

# filter stopwords
# filter stopwords
text_all_clean <- text_all %>% 
  anti_join(stop_words, by = "word")

text_all_clean <- text_all_clean %>% 
  anti_join(num_text_year, by = "word")

text_all_clean <- text_all_clean %>% 
  anti_join(stop_science, by = "word")

# show 10 most frequent words
text_all_clean %>% 
  count(word, sort = TRUE)%>% 
  head(10)

## # A tibble: 10 × 2
##    word        n
##    <chr>   <int>
##  1 covid      16
##  2 people     15
##  3 sexual     10
##  4 risk        8
##  5 cancer      7
##  6 women       7
##  7 health      6
##  8 vaccine     6
##  9 white       6
## 10 brain       5

Past year

# filter stopwords
text_year_clean <- text_year %>% 
  anti_join(stop_words, by = "word")

text_year_clean <- text_year_clean %>% 
  anti_join(num_text_year, by = "word")

text_year_clean <- text_year_clean %>% 
  anti_join(stop_science, by = "word")

# show 10 most frequent words
text_year_clean %>% 
  count(word, sort = TRUE) %>% 
  head(10)

## # A tibble: 10 × 2
##    word          n
##    <chr>     <int>
##  1 people       11
##  2 black        10
##  3 covid        10
##  4 americans     9
##  5 women         8
##  6 increased     7
##  7 lungs         7
##  8 rates         7
##  9 adults        6
## 10 billion       6

Past month

# filter stopwords
text_month_clean <- text_month %>% 
  anti_join(stop_words, by = "word")

text_month_clean <- text_month_clean %>% 
  anti_join(num_text_year, by = "word")

text_month_clean <- text_month_clean %>% 
  anti_join(stop_science, by = "word")

# show 10 most frequent words
text_month_clean %>% 
  count(word, sort = TRUE) %>% 
  head(10)

## # A tibble: 10 × 2
##    word               n
##    <chr>          <int>
##  1 people            12
##  2 human             10
##  3 children           8
##  4 covid              8
##  5 sex                8
##  6 students           8
##  7 time               8
##  8 cannabis           7
##  9 life               7
## 10 administration     6

Data Exploration

Get the Clean Data

I created individual data frames for the top 100 posts of all time, the last year, and the last month. I also created data frames containing just the post titles for each time period. I did not combine these data frames because they each represent a different timescale.

All time

The oldest post on the all time list is from 2015, and the median year for posts is from 2020. The subreddit has been growing over time.

# Whole dataset
all_clean <- read.csv("~/Documents/Projects/reddit-science/all_clean.csv")

# Post titles only
text_all_clean <- read.csv("~/Documents/Projects/reddit-science/text_all_clean.csv")

summary(top_all)

##     all_rank           id            created_unix_utc      post_url        
##  Min.   :  1.00   Length:100         Min.   :1.438e+09   Length:100        
##  1st Qu.: 25.75   Class :character   1st Qu.:1.555e+09   Class :character  
##  Median : 50.50   Mode  :character   Median :1.590e+09   Mode  :character  
##  Mean   : 50.50                      Mean   :1.581e+09                     
##  3rd Qu.: 75.25                      3rd Qu.:1.610e+09                     
##  Max.   :100.00                      Max.   :1.668e+09                     
##   post_title           flair               score         num_comments  
##  Length:100         Length:100         Min.   : 74211   Min.   : 1113  
##  Class :character   Class :character   1st Qu.: 77535   1st Qu.: 1950  
##  Mode  :character   Mode  :character   Median : 83020   Median : 2882  
##                                        Mean   : 88449   Mean   : 3358  
##                                        3rd Qu.: 91584   3rd Qu.: 4048  
##                                        Max.   :199299   Max.   :10930  
##   upvote_ratio     created_utc                        domain         
##  Min.   :0.7100   Min.   :2015-07-27 11:42:28.00   Length:100        
##  1st Qu.:0.8400   1st Qu.:2019-04-15 00:35:53.50   Class :character  
##  Median :0.9000   Median :2020-05-23 04:22:58.00   Mode  :character  
##  Mean   :0.8819   Mean   :2020-02-06 22:46:05.92                     
##  3rd Qu.:0.9225   3rd Qu.:2021-01-06 09:58:01.25                     
##  Max.   :0.9600   Max.   :2022-11-12 11:52:32.00

Past year

# Whole dataset
text_year_clean <- read.csv("~/Documents/Projects/reddit-science/text_year_clean.csv")

# Post titles only
year_clean <- read.csv("~/Documents/Projects/reddit-science/year_clean.csv")

summary(top_year)

##    year_rank           id            created_unix_utc      post_url        
##  Min.   :  1.00   Length:100         Min.   :1.639e+09   Length:100        
##  1st Qu.: 25.75   Class :character   1st Qu.:1.647e+09   Class :character  
##  Median : 50.50   Mode  :character   Median :1.654e+09   Mode  :character  
##  Mean   : 50.50                      Mean   :1.654e+09                     
##  3rd Qu.: 75.25                      3rd Qu.:1.663e+09                     
##  Max.   :100.00                      Max.   :1.671e+09                     
##   post_title           flair               score         num_comments 
##  Length:100         Length:100         Min.   : 44534   Min.   : 659  
##  Class :character   Class :character   1st Qu.: 49362   1st Qu.:1680  
##  Mode  :character   Mode  :character   Median : 54370   Median :2280  
##                                        Mean   : 56867   Mean   :2857  
##                                        3rd Qu.: 60662   3rd Qu.:3239  
##                                        Max.   :123352   Max.   :9765  
##   upvote_ratio     created_utc                        domain         
##  Min.   :0.6100   Min.   :2021-12-12 11:27:10.00   Length:100        
##  1st Qu.:0.8000   1st Qu.:2022-03-07 18:18:16.25   Class :character  
##  Median :0.8700   Median :2022-05-29 08:36:12.00   Mode  :character  
##  Mean   :0.8528   Mean   :2022-06-05 13:56:23.91                     
##  3rd Qu.:0.9200   3rd Qu.:2022-09-15 20:45:52.25                     
##  Max.   :0.9500   Max.   :2022-12-11 18:14:55.00

Past month

# Whole dataset
month_clean <- read.csv("~/Documents/Projects/reddit-science/month_clean.csv")

# Post titles only
text_month_clean <- read.csv("~/Documents/Projects/reddit-science/text_month_clean.csv")

summary(top_month)

##    month_rank          id            created_unix_utc      post_url        
##  Min.   :  1.00   Length:100         Min.   :1.668e+09   Length:100        
##  1st Qu.: 25.75   Class :character   1st Qu.:1.669e+09   Class :character  
##  Median : 50.50   Mode  :character   Median :1.670e+09   Mode  :character  
##  Mean   : 50.50                      Mean   :1.670e+09                     
##  3rd Qu.: 75.25                      3rd Qu.:1.670e+09                     
##  Max.   :100.00                      Max.   :1.671e+09                     
##   post_title           flair               score        num_comments   
##  Length:100         Length:100         Min.   : 3818   Min.   :  39.0  
##  Class :character   Class :character   1st Qu.: 6746   1st Qu.: 367.2  
##  Mode  :character   Mode  :character   Median :12932   Median : 715.0  
##                                        Mean   :18299   Mean   :1054.4  
##                                        3rd Qu.:25152   3rd Qu.:1342.0  
##                                        Max.   :75014   Max.   :6759.0  
##   upvote_ratio     created_utc                        domain         
##  Min.   :0.6700   Min.   :2022-11-12 11:52:32.00   Length:100        
##  1st Qu.:0.8875   1st Qu.:2022-11-20 10:02:02.00   Class :character  
##  Median :0.9400   Median :2022-11-28 00:23:33.00   Mode  :character  
##  Mean   :0.9114   Mean   :2022-11-27 14:23:23.82                     
##  3rd Qu.:0.9600   3rd Qu.:2022-12-04 23:15:25.25                     
##  Max.   :0.9800   Max.   :2022-12-11 18:14:55.00

Understanding Post Scores

Reddit uses a complex sorting algorithm to rank posts. I confirmed that post rank roughly scaled with post score – the higher a post is ranked, the more people upvoted it, representing engagement with and interest in the post. A notable outlier is present in the past-month graph.

All time

all_clean %>% 
  ggplot(aes(x=all_rank, y = score)) +
  geom_point() +
  geom_smooth() +
    labs(
    title = "Relationship between post rank and score",
    subtitle = "Top 100 posts of all time",
    x = "Rank",
    y = "Score"
  ) +
  theme_minimal()

Past year

year_clean %>% 
  ggplot(aes(x=year_rank, y = score)) +
  geom_point() +
  geom_smooth() +
    labs(
    title = "Relationship between post rank and score",
    subtitle = "Top 100 posts last year (2022)",
    x = "Rank",
    y = "Score"
  ) +
  theme_minimal()

Past month

The no.6 ranked post has score 71142, which is the second highest score for this time period. The no.6 post also has an upvote ratio of 0.78, which is lower than the mean upvote ratio 0.91. Notably, the no.2 ranked post also has a high score and low upvote ratio (64060, 0.75). These posts are about controversial Social Science topics.

month_clean %>% 
  ggplot(aes(x=month_rank, y = score)) +
  geom_point() +
  geom_smooth() +
    labs(
    title = "Relationship between post rank and score",
    subtitle = "Top 100 posts last month (December 2022)",
    x = "Rank",
    y = "Score"
  ) +
  theme_minimal()

month_clean %>% 
  filter(score == 71142 | score == 64060) %>% 
  select(post_title, month_rank)

##                                                                                                                                                                                                                         post_title
## 1 Greta Thunberg effect evident among Norwegian youth. Norwegian youth from all over the country and across social affiliations cite teen activist Greta Thunberg as a role model and source of inspiration for climate engagement
## 2                                                                                                                When women do more household labor, they see their partner as a dependent and sexual desire dwindles, study finds
##   month_rank
## 1          2
## 2          6

Analysis and Visualizations

Finding 1: Relationship Between Post Topic and Score

Three topics dominate the top-post lists: Psychology, Health, and Social Science.

All time

Psychology, Health, and Social Science make up 51% of the top posts of all time

# all-time
by_flair_all <- top_all %>% 
  group_by(flair) %>% 
  summarize(count_id=n_distinct(id)) %>% 
  arrange(desc(count_id)) %>% 
  ggplot(aes(x = count_id, y=reorder(flair, count_id), fill=flair)) +
  geom_col(show.legend=FALSE) +
  gghighlight(count_id > 10) +
  labs(
    title = "Top r/science posts by topic",
    subtitle = "Top 100 posts of all time",
    x = "Number of Posts",
    y = "Topic"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=3)) 

by_flair_all

Past year

Psychology, Health, and Social Science make up 56% of the top posts of the past year.

# past year
by_flair_year <- top_year %>% 
  group_by(flair) %>% 
  summarize(count_id=n_distinct(id)) %>% 
  arrange(desc(count_id)) %>% 
  ggplot(aes(x = count_id, y=reorder(flair, count_id), fill=flair)) +
  geom_col(show.legend=FALSE) +
  gghighlight(count_id > 10) +
  labs(
    title = "Top r/science posts by topic",
    subtitle = "Top 100 posts last year (2022)",
    x = "Number of Posts",
    y = "Topic"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=3)) 
by_flair_year

Past month

Psychology, Health, and Social Science make up 59% of the top posts of the past month

# past month
by_flair_month <- top_month %>% 
  group_by(flair) %>% 
  summarize(count_id=n_distinct(id)) %>% 
  arrange(desc(count_id)) %>% 
  ggplot(aes(x = count_id, y=reorder(flair, count_id), fill=flair)) +
  geom_col(show.legend=FALSE) +
  gghighlight(count_id > 10) +
  labs(
    title = "Top r/science posts by topic",
    subtitle = "Top 100 posts last month (December 2022)",
    x = "Number of Posts",
    y = "Topic"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=3)) 

by_flair_month

Finding 2: Relationship Between Post Source and Score

R/science posts come from 60+ source websites (identified by domain name, e.g. academictimes.com).There has been an significant increase in the popularity of posts from psypost.com in the past year and past month. The all-time sources are more heterogeneous than the past-year and past-month sources.

All time

# all-time
by_domain_all <- top_all %>% 
  group_by(domain) %>% 
  summarize(count_id=n_distinct(id)) %>% 
  arrange(desc(count_id)) %>% 
  head(10) %>% 
  ggplot(aes(x = count_id, y=reorder(domain, count_id), fill=domain)) +
  geom_col(show.legend=FALSE) +
  gghighlight(count_id > 10) +
  labs(
    title = "Most common sources for r/science posts",
    subtitle = "Top 100 posts of all time",
    x = "Number of Posts",
    y = "Source"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=3)) 

by_domain_all

Past year

# past year
by_domain_year <- top_year %>% 
  group_by(domain) %>% 
  summarize(count_id=n_distinct(id)) %>% 
  arrange(desc(count_id)) %>% 
  head(10) %>% 
  ggplot(aes(x = count_id, y=reorder(domain, count_id), fill=domain)) +
  geom_col(show.legend=FALSE) +
  gghighlight(count_id > 10 ) +
  labs(
    title = "Most common sources for r/science posts",
    subtitle = "Top 100 posts last year (2022)",
    x = "Number of Posts",
    y = "Source"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=3)) 

by_domain_year

Past month

# past month
by_domain_month <- top_month %>% 
  group_by(domain) %>% 
  summarize(count_id=n_distinct(id)) %>% 
  arrange(desc(count_id)) %>% 
  head(10) %>% 
  ggplot(aes(x = count_id, y=reorder(domain, count_id), fill=domain)) +
  geom_col(show.legend=FALSE) +
  gghighlight(count_id > 10) +
  labs(
    title = "Most common sources for r/science posts",
    subtitle = "Top 100 posts last month (December 2022)",
    x = "Number of Posts",
    y = "Source"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=3)) 

by_domain_month

Finding 3: Common Words in Post Titles

I created word clouds to show the most frequent words in the titles of the top posts.

All time

all_words <- text_all_clean %>% 
  count(word, sort=TRUE)
wordcloud2(all_words)

Past year

year_words <- text_year_clean %>% count(word, sort=TRUE)
wordcloud2(year_words)

Past month

month_words <- text_month_clean %>% count(word, sort=TRUE)
wordcloud2(month_words)

Finding 4: Sentiment Analysis of Titles

I used the NRC Word-Emotion Association Lexicon (C) to conduct sentiment analysis on post titles. Posts contain more positive words than negative words. Other common emotions in post titles are “trust”, “anticipation”, and “fear”. Sentiment analysis showed similar results for all 3 time periods.

All time

# sentiment analysis

get_sentiments("nrc")

all_time_sentiment <- text_all_clean %>% 
  inner_join(get_sentiments("nrc"), by = "word")

sentiment_all_plot <- all_time_sentiment %>% 
  group_by(sentiment) %>% 
  summarize(num_words = n()) %>% 
  arrange(desc(num_words)) %>% 
  ggplot(aes(x = num_words, y=reorder(sentiment, num_words), fill=sentiment)) +
  geom_col(show.legend=FALSE) +
  gghighlight(num_words > 100) +
  labs(
    title = "Sentiment analysis of post titles",
    subtitle = "Top 100 posts of all time",
    x = "Number of words",
    y = "Sentiment"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=4)) 


sentiment_all_plot

Past year

# sentiment analysis

get_sentiments("nrc")

year_sentiment <- text_year_clean %>% 
  inner_join(get_sentiments("nrc"), by = "word")

sentiment_year_plot <- year_sentiment %>% 
  group_by(sentiment) %>% 
  summarize(num_words = n()) %>% 
  arrange(desc(num_words)) %>% 
  ggplot(aes(x = num_words, y=reorder(sentiment, num_words), fill=sentiment)) +
  geom_col(show.legend=FALSE) +
  gghighlight(num_words > 100) +
  labs(
    title = "Sentiment analysis of post titles",
    subtitle = "Top 100 posts of all time",
    x = "Number of words",
    y = "Sentiment"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=4)) 


sentiment_year_plot

Past month

# sentiment analysis

get_sentiments("nrc")
month_sentiment <- text_month_clean %>% 
  inner_join(get_sentiments("nrc"), by = "word")

sentiment_month_plot <- month_sentiment %>% 
  group_by(sentiment) %>% 
  summarize(num_words = n()) %>% 
  arrange(desc(num_words)) %>% 
  ggplot(aes(x = num_words, y=reorder(sentiment, num_words), fill=sentiment)) +
  geom_col(show.legend=FALSE) +
  gghighlight(num_words > 100) +
  labs(
    title = "Sentiment analysis of post titles",
    subtitle = "Top 100 posts last month (Dec 2022)",
    x = "Number of words",
    y = "Sentiment"
  ) +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("GrandBudapest1",n=4)) 


sentiment_month_plot

Finding 5: Comparing Title Word Frequency to Another Website

The words in r/science top posts titles are very weakly correlated with the words from the homepages of a popular science source, Frontiers. I chose Frontiers because it is an open-source journal with a webpage that posts article summaries and science news, much like r/science does. There are some similarities between this word cloud and the all-time word cloud, but some of the more frequently-used words in the Reddit word cloud are missing from the Frontiers word cloud (e.g. “sex”, “life”, and “U.S.”.

# cleaned same as reddit titles - data_analysis.R

frontiers_clean <- read_csv("~/Documents/Projects/reddit-science/frontiers_clean.csv", show_col_types = FALSE)

# word cloud
frontiers_words <- frontiers_clean %>% count(word, sort=TRUE)
wordcloud2(frontiers_words, size = 1.6)

Conclusion

Scientific discourse on a social media forum like r/science is quite different from that of scientific journals. This is understandable, given that journals are difficult to access for most people. Social media algorithms also elevate content that is controversial in nature. Websites that summarize science news get more engagement on r/science. Recently, a trend toward pop-sci and pop-psych content has emerged, and a disproportionate number of posts come from psypost.org.

Citations

This report makes use of the NRC Word-Emotion Association Lexicon (C), created by Dr. Saif M. Mohammad and Dr. Peter Turney(s) at the National Research Council Canada.”

Lexicon homepage

Contact email)

Analyzing post topics and titles on reddit.com/r/science

Laura Fontanills

30 January 2023

Introduction

Scraping Reddit with PRAW

Data Cleaning

Get the Raw Data

All time

Past year

Past month

Data Transformation

Cleaning Text

Word Frequencies Lists

All time

Past year

Past month

Data Exploration

Get the Clean Data

All time

Past year

Past month

Understanding Post Scores

All time

Past year

Past month

Analysis and Visualizations

Finding 1: Relationship Between Post Topic and Score

All time

Past year

Past month

Finding 2: Relationship Between Post Source and Score

All time

Past year

Past month

Finding 3: Common Words in Post Titles

All time

Past year

Past month

Finding 4: Sentiment Analysis of Titles

All time

Past year

Past month

Finding 5: Comparing Title Word Frequency to Another Website

Conclusion

Citations