A few months ago I noticed that one subreddit I frequent, r/science, had a growing proportion of pop-psychology posts getting many upvotes. Reddit is a social media site, and scientific journals are notoriously hard to access, so I expected a gap between the types of science content they feature. Still, I decided to test my gut feeling that the subreddit’s content was changing.
In this project I analyzed the top posts of r/science, and investigated which topics interest r/science users the most. I also analyzed post titles through sentiment analysis and word clouds, since titles are what users first see when engaging with content.
I used a Python script to scrape the top 100 posts of r/science over three time periods: the past month, the past year, and all time. Within these time periods I obtained data for 9 different variables:

- ...1 ranks the post from 0 to 99 (0 being the highest rank).
- id is a unique string of letters and numbers that identifies each post.
- created_unix_utc is the time the post was created (as a Unix timestamp).
- post_url is the URL of the website the post links to. This can be an online news outlet, journal, etc.
- post_title is the title of the post.
- flair is a tag attached to each post by a moderator that places that post in a category. It represents the post’s topic.
- score is the number of times a post was upvoted (each user can upvote a post once).
- num_comments is the number of unique comments on a post.
- upvote_ratio is the ratio of upvotes (indicating user approval) to downvotes (indicating user disapproval).

The script won’t run here; to run this code yourself, you’ll need to follow the instructions in this guide using your own credentials.
# import packages
import praw
import pandas
# read-only instance
reddit_read_only = praw.Reddit(
client_id="", #your info here
client_secret="", #your info here
user_agent="", #your info here
)
# extract subreddit information
subreddit = reddit_read_only.subreddit("science")
# display subreddit name
print("Display Name:", subreddit.display_name)
# display subreddit title
print("Title:", subreddit.title)
# display subreddit description
print("Description:", subreddit.description)
# get top posts from this time period
# all = all time
# year = past year
# month = past month
posts = subreddit.top(time_filter="all", limit=100)
posts_dict = {
    "id": [],
    "created_unix_utc": [],
    "post_url": [],
    "post_title": [],
    "flair": [],
    "score": [],
    "num_comments": [],
    "upvote_ratio": []
}
for post in posts:
    posts_dict["id"].append(post.id)
    posts_dict["created_unix_utc"].append(post.created_utc)
    posts_dict["post_url"].append(post.url)
    posts_dict["post_title"].append(post.title)
    posts_dict["flair"].append(post.link_flair_text)
    posts_dict["score"].append(post.score)
    posts_dict["num_comments"].append(post.num_comments)
    posts_dict["upvote_ratio"].append(post.upvote_ratio)
# change this when scraping different time periods
top_posts_all = pandas.DataFrame(posts_dict)
top_posts_all
top_posts_all.to_csv("Top-Posts-All.csv")
# repeat for past year, past month
I conducted all cleaning, analysis, and visualization in R.
library(tidyverse) # data processing and analysis
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate) # wrangle dates
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(skimr) # skim data frames
library(urltools) # wrangle urls
library(tidytext) # NLP toolkit
library(textdata) # Sentiment analysis
library(wordcloud2) # Word clouds
library(gghighlight) # Adding highlight to graphs
library(wesanderson) # graph colors
# create data frame with top 100 posts from all time
top_all <- read_csv("Top-Posts-All.csv")
## New names:
## Rows: 100 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): id, post_url, post_title, flair dbl (5): ...1, created_unix_utc, score,
## num_comments, upvote_ratio
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
summary(top_all)
## ...1 id created_unix_utc post_url
## Min. : 0.00 Length:100 Min. :1.438e+09 Length:100
## 1st Qu.:24.75 Class :character 1st Qu.:1.555e+09 Class :character
## Median :49.50 Mode :character Median :1.590e+09 Mode :character
## Mean :49.50 Mean :1.581e+09
## 3rd Qu.:74.25 3rd Qu.:1.610e+09
## Max. :99.00 Max. :1.668e+09
## post_title flair score num_comments
## Length:100 Length:100 Min. : 74211 Min. : 1113
## Class :character Class :character 1st Qu.: 77535 1st Qu.: 1950
## Mode :character Mode :character Median : 83020 Median : 2882
## Mean : 88449 Mean : 3358
## 3rd Qu.: 91584 3rd Qu.: 4048
## Max. :199299 Max. :10930
## upvote_ratio
## Min. :0.7100
## 1st Qu.:0.8400
## Median :0.9000
## Mean :0.8819
## 3rd Qu.:0.9225
## Max. :0.9600
# create data frame with top 100 posts from last year
top_year <- read_csv("Top-Posts-Year.csv")
## New names:
## Rows: 100 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): id, post_url, post_title, flair dbl (5): ...1, created_unix_utc, score,
## num_comments, upvote_ratio
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
summary(top_year)
# create data frame with top 100 posts from last month
top_month <- read_csv("Top-Posts-Month.csv")
## New names:
## Rows: 100 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): id, post_url, post_title, flair dbl (5): ...1, created_unix_utc, score,
## num_comments, upvote_ratio
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
summary(top_month)
## ...1 id created_unix_utc post_url
## Min. : 0.00 Length:100 Min. :1.668e+09 Length:100
## 1st Qu.:24.75 Class :character 1st Qu.:1.669e+09 Class :character
## Median :49.50 Mode :character Median :1.670e+09 Mode :character
## Mean :49.50 Mean :1.670e+09
## 3rd Qu.:74.25 3rd Qu.:1.670e+09
## Max. :99.00 Max. :1.671e+09
## post_title flair score num_comments
## Length:100 Length:100 Min. : 3818 Min. : 39.0
## Class :character Class :character 1st Qu.: 6746 1st Qu.: 367.2
## Mode :character Mode :character Median :12932 Median : 715.0
## Mean :18299 Mean :1054.4
## 3rd Qu.:25152 3rd Qu.:1342.0
## Max. :75014 Max. :6759.0
## upvote_ratio
## Min. :0.6700
## 1st Qu.:0.8875
## Median :0.9400
## Mean :0.9114
## 3rd Qu.:0.9600
## Max. :0.9800
Column 1 represents post rank. I changed the column name and added 1 to all rankings (so that posts would be ranked from 1 to 100).
# change column 1 name from ...1 to rank
colnames(top_all)[1] <- "all_rank"
colnames(top_year)[1] <- "year_rank"
colnames(top_month)[1] <- "month_rank"
# add 1 to all the rankings for clarity
top_all$all_rank <- as.numeric(top_all$all_rank) + 1
top_year$year_rank <- as.numeric(top_year$year_rank) + 1
top_month$month_rank <- as.numeric(top_month$month_rank) + 1
I converted the created_unix_utc field from a number to a datetime, and saved this as created_utc. This represents the time each post was made, in UTC.
top_all$created_utc <- as_datetime(top_all$created_unix_utc)
top_year$created_utc <- as_datetime(top_year$created_unix_utc)
top_month$created_utc <- as_datetime(top_month$created_unix_utc)
I isolated the domain name from each URL, and saved this as domain. The domain is the source website for each post. Sources include news outlets, science magazines, journals, blogs, etc.
top_all$domain <- domain(top_all$post_url)
top_year$domain <- domain(top_year$post_url)
top_month$domain <- domain(top_month$post_url)
I saved the cleaned and transformed data to fresh data frames. I also created data frames containing just the post titles.
# create clean df - all data
all_clean <- top_all
year_clean <- top_year
month_clean <- top_month
# create text df - titles only
text_all <- all_clean[c("post_title")]
text_year <- year_clean[c("post_title")]
text_month <- month_clean[c("post_title")]
I analyzed word frequencies in post titles. To do this I needed to restructure the data frames so that each word has its own row (one token per row).
# restructure one token per row: unnest tokens
text_all <- text_all %>%
unnest_tokens(word, post_title)
text_month <- text_month %>%
unnest_tokens(word, post_title)
text_year <- text_year %>%
unnest_tokens(word, post_title)
I removed stopwords using the tidytext package’s stopword list. Stopwords are words like “a” and “the” that are very common and not meaningful. There were also a few generic science words in the data sets that I wanted to filter out, such as “study”, “found”, “suggests”, and “research”. I also filtered out numerals.
# get stop words list
data("stop_words")
# numerals sort before letters, so after arrange(word) the first rows
# are the numeric tokens; head() captures just those rows for removal
num_text_all <- text_all %>%
arrange(word) %>%
head(75)
num_text_year <- text_year %>%
arrange(word) %>%
head(89)
num_text_month <- text_month %>%
arrange(word) %>%
head(91)
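This arrange-and-head trick works because digits sort before letters, but the head() counts have to be checked by hand for each data set. A more robust sketch (using stringr, which the tidyverse loads; drop_numerals is a hypothetical helper, not part of my pipeline) filters numeric tokens directly:
library(dplyr)
library(stringr)
# hypothetical helper: drop tokens that are purely digits,
# optionally with separators (e.g. "2020", "1,000", "3.5", "50%")
drop_numerals <- function(df) {
  df %>% filter(!str_detect(word, "^[0-9][0-9.,%]*$"))
}
# usage: text_all <- drop_numerals(text_all)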
stop_science <- c("study", "found", "scientist", "scientists", "research", "researchers", "suggests", "finding")
stop_science <- as.data.frame(stop_science)
colnames(stop_science)[1] <- "word"
I created lists of the most frequently used words in titles.
# filter stopwords
text_all_clean <- text_all %>%
anti_join(stop_words, by = "word")
text_all_clean <- text_all_clean %>%
anti_join(num_text_all, by = "word")
text_all_clean <- text_all_clean %>%
anti_join(stop_science, by = "word")
# show 10 most frequent words
text_all_clean %>%
count(word, sort = TRUE)%>%
head(10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 covid 16
## 2 people 15
## 3 sexual 10
## 4 risk 8
## 5 cancer 7
## 6 women 7
## 7 health 6
## 8 vaccine 6
## 9 white 6
## 10 brain 5
# filter stopwords
text_year_clean <- text_year %>%
anti_join(stop_words, by = "word")
text_year_clean <- text_year_clean %>%
anti_join(num_text_year, by = "word")
text_year_clean <- text_year_clean %>%
anti_join(stop_science, by = "word")
# show 10 most frequent words
text_year_clean %>%
count(word, sort = TRUE) %>%
head(10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 people 11
## 2 black 10
## 3 covid 10
## 4 americans 9
## 5 women 8
## 6 increased 7
## 7 lungs 7
## 8 rates 7
## 9 adults 6
## 10 billion 6
# filter stopwords
text_month_clean <- text_month %>%
anti_join(stop_words, by = "word")
text_month_clean <- text_month_clean %>%
anti_join(num_text_month, by = "word")
text_month_clean <- text_month_clean %>%
anti_join(stop_science, by = "word")
# show 10 most frequent words
text_month_clean %>%
count(word, sort = TRUE) %>%
head(10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 people 12
## 2 human 10
## 3 children 8
## 4 covid 8
## 5 sex 8
## 6 students 8
## 7 time 8
## 8 cannabis 7
## 9 life 7
## 10 administration 6
I created individual data frames for the top 100 posts of all time, the last year, and the last month. I also created data frames containing just the post titles for each time period. I did not combine these data frames because they each represent a different timescale.
The oldest post on the all-time list is from 2015, and the median post date falls in 2020. The subreddit has been growing over time.
# Whole dataset
all_clean <- read.csv("~/Documents/Projects/reddit-science/all_clean.csv")
# Post titles only
text_all_clean <- read.csv("~/Documents/Projects/reddit-science/text_all_clean.csv")
summary(top_all)
## all_rank id created_unix_utc post_url
## Min. : 1.00 Length:100 Min. :1.438e+09 Length:100
## 1st Qu.: 25.75 Class :character 1st Qu.:1.555e+09 Class :character
## Median : 50.50 Mode :character Median :1.590e+09 Mode :character
## Mean : 50.50 Mean :1.581e+09
## 3rd Qu.: 75.25 3rd Qu.:1.610e+09
## Max. :100.00 Max. :1.668e+09
## post_title flair score num_comments
## Length:100 Length:100 Min. : 74211 Min. : 1113
## Class :character Class :character 1st Qu.: 77535 1st Qu.: 1950
## Mode :character Mode :character Median : 83020 Median : 2882
## Mean : 88449 Mean : 3358
## 3rd Qu.: 91584 3rd Qu.: 4048
## Max. :199299 Max. :10930
## upvote_ratio created_utc domain
## Min. :0.7100 Min. :2015-07-27 11:42:28.00 Length:100
## 1st Qu.:0.8400 1st Qu.:2019-04-15 00:35:53.50 Class :character
## Median :0.9000 Median :2020-05-23 04:22:58.00 Mode :character
## Mean :0.8819 Mean :2020-02-06 22:46:05.92
## 3rd Qu.:0.9225 3rd Qu.:2021-01-06 09:58:01.25
## Max. :0.9600 Max. :2022-11-12 11:52:32.00
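To illustrate that growth, here is a minimal sketch (assuming all_clean keeps the created_utc column as read above) that counts the all-time top posts by the year they were created:
# count all-time top posts by year created
all_clean %>%
  mutate(created_year = year(as_datetime(created_utc))) %>%
  count(created_year) %>%
  ggplot(aes(x = created_year, y = n)) +
  geom_col() +
  labs(
    title = "All-time top posts by year created",
    x = "Year",
    y = "Number of posts"
  ) +
  theme_minimal()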
# Whole dataset
year_clean <- read.csv("~/Documents/Projects/reddit-science/year_clean.csv")
# Post titles only
text_year_clean <- read.csv("~/Documents/Projects/reddit-science/text_year_clean.csv")
summary(top_year)
## year_rank id created_unix_utc post_url
## Min. : 1.00 Length:100 Min. :1.639e+09 Length:100
## 1st Qu.: 25.75 Class :character 1st Qu.:1.647e+09 Class :character
## Median : 50.50 Mode :character Median :1.654e+09 Mode :character
## Mean : 50.50 Mean :1.654e+09
## 3rd Qu.: 75.25 3rd Qu.:1.663e+09
## Max. :100.00 Max. :1.671e+09
## post_title flair score num_comments
## Length:100 Length:100 Min. : 44534 Min. : 659
## Class :character Class :character 1st Qu.: 49362 1st Qu.:1680
## Mode :character Mode :character Median : 54370 Median :2280
## Mean : 56867 Mean :2857
## 3rd Qu.: 60662 3rd Qu.:3239
## Max. :123352 Max. :9765
## upvote_ratio created_utc domain
## Min. :0.6100 Min. :2021-12-12 11:27:10.00 Length:100
## 1st Qu.:0.8000 1st Qu.:2022-03-07 18:18:16.25 Class :character
## Median :0.8700 Median :2022-05-29 08:36:12.00 Mode :character
## Mean :0.8528 Mean :2022-06-05 13:56:23.91
## 3rd Qu.:0.9200 3rd Qu.:2022-09-15 20:45:52.25
## Max. :0.9500 Max. :2022-12-11 18:14:55.00
# Whole dataset
month_clean <- read.csv("~/Documents/Projects/reddit-science/month_clean.csv")
# Post titles only
text_month_clean <- read.csv("~/Documents/Projects/reddit-science/text_month_clean.csv")
summary(top_month)
## month_rank id created_unix_utc post_url
## Min. : 1.00 Length:100 Min. :1.668e+09 Length:100
## 1st Qu.: 25.75 Class :character 1st Qu.:1.669e+09 Class :character
## Median : 50.50 Mode :character Median :1.670e+09 Mode :character
## Mean : 50.50 Mean :1.670e+09
## 3rd Qu.: 75.25 3rd Qu.:1.670e+09
## Max. :100.00 Max. :1.671e+09
## post_title flair score num_comments
## Length:100 Length:100 Min. : 3818 Min. : 39.0
## Class :character Class :character 1st Qu.: 6746 1st Qu.: 367.2
## Mode :character Mode :character Median :12932 Median : 715.0
## Mean :18299 Mean :1054.4
## 3rd Qu.:25152 3rd Qu.:1342.0
## Max. :75014 Max. :6759.0
## upvote_ratio created_utc domain
## Min. :0.6700 Min. :2022-11-12 11:52:32.00 Length:100
## 1st Qu.:0.8875 1st Qu.:2022-11-20 10:02:02.00 Class :character
## Median :0.9400 Median :2022-11-28 00:23:33.00 Mode :character
## Mean :0.9114 Mean :2022-11-27 14:23:23.82
## 3rd Qu.:0.9600 3rd Qu.:2022-12-04 23:15:25.25
## Max. :0.9800 Max. :2022-12-11 18:14:55.00
Reddit uses a complex sorting algorithm to rank posts. I confirmed that post rank roughly tracks post score: the higher a post is ranked, the more users upvoted it, reflecting engagement with and interest in the post. A notable outlier is present in the past-month graph.
all_clean %>%
ggplot(aes(x=all_rank, y = score)) +
geom_point() +
geom_smooth() +
labs(
title = "Relationship between post rank and score",
subtitle = "Top 100 posts of all time",
x = "Rank",
y = "Score"
) +
theme_minimal()
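The rank-score relationship can also be quantified. A one-line sketch using Spearman’s rank correlation (a strong negative value means lower rank numbers pair with higher scores):
# Spearman correlation between rank and score (rank 1 = top post)
cor(all_clean$all_rank, all_clean$score, method = "spearman")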
year_clean %>%
ggplot(aes(x=year_rank, y = score)) +
geom_point() +
geom_smooth() +
labs(
title = "Relationship between post rank and score",
subtitle = "Top 100 posts last year (2022)",
x = "Rank",
y = "Score"
) +
theme_minimal()
In the past-month graph below, the no.6 ranked post has a score of 71142, the second-highest score for this time period. It also has an upvote ratio of 0.78, well below the mean of 0.91. Notably, the no.2 ranked post likewise pairs a high score with a low upvote ratio (64060, 0.75). Both posts cover controversial Social Science topics.
month_clean %>%
ggplot(aes(x=month_rank, y = score)) +
geom_point() +
geom_smooth() +
labs(
title = "Relationship between post rank and score",
subtitle = "Top 100 posts last month (December 2022)",
x = "Rank",
y = "Score"
) +
theme_minimal()
month_clean %>%
filter(score == 71142 | score == 64060) %>%
select(post_title, month_rank)
## post_title
## 1 Greta Thunberg effect evident among Norwegian youth. Norwegian youth from all over the country and across social affiliations cite teen activist Greta Thunberg as a role model and source of inspiration for climate engagement
## 2 When women do more household labor, they see their partner as a dependent and sexual desire dwindles, study finds
## month_rank
## 1 2
## 2 6
Three topics dominate the top-post lists: Psychology, Health, and Social Science.
Psychology, Health, and Social Science make up 51% of the top posts of all time.
# all-time
by_flair_all <- top_all %>%
group_by(flair) %>%
summarize(count_id=n_distinct(id)) %>%
arrange(desc(count_id)) %>%
ggplot(aes(x = count_id, y=reorder(flair, count_id), fill=flair)) +
geom_col(show.legend=FALSE) +
gghighlight(count_id > 10) +
labs(
title = "Top r/science posts by topic",
subtitle = "Top 100 posts of all time",
x = "Number of Posts",
y = "Topic"
) +
theme_minimal() +
scale_fill_manual(values=wes_palette("GrandBudapest1",n=3))
by_flair_all
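The 51% figure can be checked directly from the flair counts; a quick sketch, assuming top_all from above:
# share of the all-time top 100 held by the three most common flairs
top_all %>%
  count(flair, sort = TRUE) %>%
  slice_head(n = 3) %>%
  summarize(top3_share = sum(n) / nrow(top_all))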
Psychology, Health, and Social Science make up 56% of the top posts of the past year.
# past year
by_flair_year <- top_year %>%
group_by(flair) %>%
summarize(count_id=n_distinct(id)) %>%
arrange(desc(count_id)) %>%
ggplot(aes(x = count_id, y=reorder(flair, count_id), fill=flair)) +
geom_col(show.legend=FALSE) +
gghighlight(count_id > 10) +
labs(
title = "Top r/science posts by topic",
subtitle = "Top 100 posts last year (2022)",
x = "Number of Posts",
y = "Topic"
) +
theme_minimal() +
scale_fill_manual(values=wes_palette("GrandBudapest1",n=3))
by_flair_year
Psychology, Health, and Social Science make up 59% of the top posts of the past month.
# past month
by_flair_month <- top_month %>%
group_by(flair) %>%
summarize(count_id=n_distinct(id)) %>%
arrange(desc(count_id)) %>%
ggplot(aes(x = count_id, y=reorder(flair, count_id), fill=flair)) +
geom_col(show.legend=FALSE) +
gghighlight(count_id > 10) +
labs(
title = "Top r/science posts by topic",
subtitle = "Top 100 posts last month (December 2022)",
x = "Number of Posts",
y = "Topic"
) +
theme_minimal() +
scale_fill_manual(values=wes_palette("GrandBudapest1",n=3))
by_flair_month
r/science posts come from 60+ source websites (identified by domain name, e.g. academictimes.com). There has been a significant increase in the popularity of posts from psypost.org over the past year and past month. The all-time sources are more heterogeneous than the past-year and past-month sources.
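The “60+” count is easy to verify; a quick sketch, assuming the top_* data frames from above:
# number of distinct source domains in each time period
c(
  all_time   = n_distinct(top_all$domain),
  past_year  = n_distinct(top_year$domain),
  past_month = n_distinct(top_month$domain)
)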
# all-time
by_domain_all <- top_all %>%
group_by(domain) %>%
summarize(count_id=n_distinct(id)) %>%
arrange(desc(count_id)) %>%
head(10) %>%
ggplot(aes(x = count_id, y=reorder(domain, count_id), fill=domain)) +
geom_col(show.legend=FALSE) +
gghighlight(count_id > 10) +
labs(
title = "Most common sources for r/science posts",
subtitle = "Top 100 posts of all time",
x = "Number of Posts",
y = "Source"
) +
theme_minimal() +
scale_fill_manual(values=wes_palette("GrandBudapest1",n=3))
by_domain_all
# past year
by_domain_year <- top_year %>%
group_by(domain) %>%
summarize(count_id=n_distinct(id)) %>%
arrange(desc(count_id)) %>%
head(10) %>%
ggplot(aes(x = count_id, y=reorder(domain, count_id), fill=domain)) +
geom_col(show.legend=FALSE) +
gghighlight(count_id > 10 ) +
labs(
title = "Most common sources for r/science posts",
subtitle = "Top 100 posts last year (2022)",
x = "Number of Posts",
y = "Source"
) +
theme_minimal() +
scale_fill_manual(values=wes_palette("GrandBudapest1",n=3))
by_domain_year
# past month
by_domain_month <- top_month %>%
group_by(domain) %>%
summarize(count_id=n_distinct(id)) %>%
arrange(desc(count_id)) %>%
head(10) %>%
ggplot(aes(x = count_id, y=reorder(domain, count_id), fill=domain)) +
geom_col(show.legend=FALSE) +
gghighlight(count_id > 10) +
labs(
title = "Most common sources for r/science posts",
subtitle = "Top 100 posts last month (December 2022)",
x = "Number of Posts",
y = "Source"
) +
theme_minimal() +
scale_fill_manual(values=wes_palette("GrandBudapest1",n=3))
by_domain_month
I created word clouds to show the most frequent words in the titles of the top posts.
all_words <- text_all_clean %>%
count(word, sort=TRUE)
wordcloud2(all_words)
year_words <- text_year_clean %>% count(word, sort=TRUE)
wordcloud2(year_words)
month_words <- text_month_clean %>% count(word, sort=TRUE)
wordcloud2(month_words)
I used the NRC Word-Emotion Association Lexicon to conduct sentiment analysis on post titles. Post titles contain more positive words than negative words. The other most common sentiments in titles are “trust”, “anticipation”, and “fear”. Sentiment analysis showed similar results for all three time periods.
# sentiment analysis
get_sentiments("nrc")
all_time_sentiment <- text_all_clean %>%
inner_join(get_sentiments("nrc"), by = "word")
sentiment_all_plot <- all_time_sentiment %>%
group_by(sentiment) %>%
summarize(num_words = n()) %>%
arrange(desc(num_words)) %>%
ggplot(aes(x = num_words, y=reorder(sentiment, num_words), fill=sentiment)) +
geom_col(show.legend=FALSE) +
gghighlight(num_words > 100) +
labs(
title = "Sentiment analysis of post titles",
subtitle = "Top 100 posts of all time",
x = "Number of words",
y = "Sentiment"
) +
theme_minimal() +
scale_fill_manual(values=wes_palette("GrandBudapest1",n=4))
sentiment_all_plot
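To back up the claim that the three time periods look similar, here is a sketch (assuming each cleaned title frame from above has a word column) that compares sentiment proportions side by side:
# label each period, match words to the NRC lexicon, compare proportions
nrc <- get_sentiments("nrc")
sentiment_by_period <- bind_rows(
  text_all_clean %>% mutate(period = "all time"),
  text_year_clean %>% mutate(period = "past year"),
  text_month_clean %>% mutate(period = "past month")
) %>%
  inner_join(nrc, by = "word") %>%
  count(period, sentiment) %>%
  group_by(period) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
ggplot(sentiment_by_period,
       aes(x = prop, y = reorder(sentiment, prop), fill = period)) +
  geom_col(position = "dodge") +
  labs(
    title = "Sentiment proportions by time period",
    x = "Share of sentiment-matched words",
    y = "Sentiment"
  ) +
  theme_minimal()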
The words in r/science top post titles are only weakly correlated with the words on the homepages of a popular science source, Frontiers. I chose Frontiers because it is an open-access journal whose website posts article summaries and science news, much like r/science does. There are some similarities between this word cloud and the all-time word cloud, but some of the most frequently used words in the Reddit word cloud are missing from the Frontiers word cloud (e.g. “sex”, “life”, and “U.S.”).
# cleaned same as reddit titles - data_analysis.R
frontiers_clean <- read_csv("~/Documents/Projects/reddit-science/frontiers_clean.csv", show_col_types = FALSE)
# word cloud
frontiers_words <- frontiers_clean %>% count(word, sort=TRUE)
wordcloud2(frontiers_words, size = 1.6)
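One way to put a number on “weakly correlated”: join the two frequency tables on their shared words and correlate the counts. This sketch (assuming all_words and frontiers_words from above) only sees words that appear in both corpora:
# correlate word frequencies across the two corpora
shared <- inner_join(all_words, frontiers_words,
                     by = "word", suffix = c("_reddit", "_frontiers"))
cor(shared$n_reddit, shared$n_frontiers, method = "spearman")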
Scientific discourse on a social media forum like r/science is quite different from that of scientific journals. This is understandable, given that journals are difficult for most people to access. Social media algorithms also elevate controversial content. Websites that summarize science news get more engagement on r/science than journals themselves. Recently, a trend toward pop-sci and pop-psych content has emerged, with a disproportionate number of posts coming from psypost.org.
This report makes use of the NRC Word-Emotion Association Lexicon, created by Dr. Saif M. Mohammad and Dr. Peter Turney at the National Research Council Canada.