Today we’re going to learn how to source social media data, specifically via APIs and scraping. We will practice using an API by requesting data from Bluesky, and scraping by pulling post data from Reddit.
First, we look at Bluesky.
Bluesky is a relatively new social media platform built primarily as an alternative to Twitter. Its key distinction is its federated structure: you can move your data between servers, and even to other platforms that use Bluesky’s underlying protocol (the AT Protocol), making its data and governance structures more decentralized. Bluesky’s second distinguishing feature is greater user control, particularly over algorithmic curation. Any Bluesky user can build their own feed (e.g., a feed for a certain topic, a feed that prioritizes posts with low engagement, a feed of users who don’t typically post often, and so on).
Before we get started, you will need to go to bsky.app and create an account.
We will use the atrrr package, built by Johannes Gruber, Ben Guinaudeau, Fabio Votta, and Jenny Bryan. You can view the code on GitHub here: https://github.com/JBGruber/atrrr/
First, we install the package (if you haven’t already) and load the library.
#install.packages("atrrr")
#library(atrrr)
Here, you enter your username (e.g., name.bsky.social) in place of “aliaelkattan.com” to access the API. Running this will open your Bluesky account, where you need to create a new “application” to use when requesting information from the platform. You will then choose or generate a password to use each time you access this application. Make sure you store it somewhere safe, as you will be asked to provide it in the RStudio environment.
Note: uncomment the commented-out code lines (the ones starting with # or ##) to run them in your environment.
#Uncomment the line below in your RStudio environment
#auth("aliaelkattan.com")
Here, we can collect the followers and followed accounts of any Bluesky account, such as U.S. Rep. Alexandria Ocasio-Cortez.
#followers <- get_followers(actor = "aoc.bsky.social", limit = 4000)
#follows <- get_follows(actor = "aoc.bsky.social", limit = 4000)
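Before joining or filtering, it is worth a quick look at what came back. A minimal check, assuming the calls above returned parsed data frames:
##nrow(followers) # how many follower records were retrieved
##head(follows) # peek at columns such as did, actor_handle, actor_description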
There are many things we can do with this data, such as finding a user’s mutuals (accounts they follow that also follow them back).
##library(dplyr)
## find mutuals -- accounts they follow that also follow them back
##mutuals <- followers %>% dplyr::inner_join(follows, by = "did")
Or people who follow them but they don’t follow back, or vice versa.
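For instance, here is a minimal sketch using dplyr’s anti_join, assuming the followers and follows data frames collected above, keyed on the did column:
##library(dplyr)
## accounts they follow that don't follow them back
##not_followed_back <- follows %>% anti_join(followers, by = "did")
## accounts that follow them but aren't followed back
##fans <- followers %>% anti_join(follows, by = "did")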
We can filter the accounts they follow by specific keywords in their profile bio, such as all the accounts that have the word “news” in the actor_description column.
# Filter for a keyword, e.g. "left" or "news"
##filtered_list <- follows %>% filter(grepl("news", actor_description, ignore.case = TRUE))
# make a clickable URL
##filtered_list$profile.url <- paste0("https://bsky.app/profile/",filtered_list$actor_handle)
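We can then peek at the matches and their profile links, using the actor_handle and profile.url columns from above:
##head(filtered_list[, c("actor_handle", "profile.url")])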
And of course, we can download all their posts to analyze their content and behavior on the platform.
#posts <- get_skeets_authored_by(actor = "aoc.bsky.social", limit = 500, parse = TRUE)
#convert all columns to character so the data frame writes cleanly to csv
#posts_ch <- sapply(posts, as.character)
#write.csv(posts_ch,"aoc_posts.csv")
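Note that flattening to character can be lossy for nested list-columns. If you want to keep the full structure for later R sessions, saving the parsed object directly with base R’s saveRDS is a simple alternative:
#saveRDS(posts, "aoc_posts.rds")
#posts <- readRDS("aoc_posts.rds") # reload later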
Next, we will try scraping, specifically using Reddit.
One of the nice things about Reddit is that it allows non-authenticated calls to its API: you simply add “.json” to the appropriate part of a given URL. To get started, let’s look at the main politics subreddit and then compare it to the same URL with “.json” appended. If we want to read this information into R, we can do the following:
library(jsonlite) # fromJSON
library(tidyverse)
library(plyr) # rbind.fill; loaded before dplyr so dplyr's functions win
library(dplyr)
library(httr)
library(xml2)
library(rvest)
library(RSelenium)
library(knitr)
url <- "https://www.reddit.com/r/politics/"
pol <- fromJSON(paste0(url,".json"))
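The parsed result is a deeply nested list. To get oriented before drilling down, base R’s str() shows its top levels:
str(pol, max.level = 2)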
Easy peasy. To access the information relevant to our purposes, we need to parse through this nested list a little; the fields we care about live under data$children:
out <- pol$data$children$data
as_tibble(out)
## # A tibble: 26 × 108
## approved_at_utc subreddit selftext author_fullname saved mod_reason_title
## <lgl> <chr> <chr> <chr> <lgl> <lgl>
## 1 NA politics "The addres… t2_onl9u FALSE NA
## 2 NA politics "" t2_58iti FALSE NA
## 3 NA politics "" t2_deaxrt FALSE NA
## 4 NA politics "" t2_9xamyg0fi FALSE NA
## 5 NA politics "" t2_35fk2 FALSE NA
## 6 NA politics "" t2_1ip0s01a FALSE NA
## 7 NA politics "" t2_avat FALSE NA
## 8 NA politics "" t2_jfh27y2c FALSE NA
## 9 NA politics "" t2_35fk2 FALSE NA
## 10 NA politics "" t2_edeol FALSE NA
## # ℹ 16 more rows
## # ℹ 102 more variables: gilded <int>, clicked <lgl>, title <chr>,
## # link_flair_richtext <list>, subreddit_name_prefixed <chr>, hidden <lgl>,
## # pwls <int>, link_flair_css_class <chr>, downs <int>,
## # thumbnail_height <int>, top_awarded_type <lgl>, hide_score <lgl>,
## # name <chr>, quarantine <lgl>, link_flair_text_color <chr>,
## # upvote_ratio <dbl>, author_flair_background_color <chr>, …
There are a few fields which may be of interest to us here. The first is the “title” field:
out$title
## [1] "Discussion Thread: President Biden Delivers Final Foreign Policy Address"
## [2] "Marjorie Taylor Greene Launches Unhinged Call for Officials to Manipulate the Weather to Stop L.A. Wildfires"
## [3] "Suddenly Donald Trump doesn’t want to talk so much about the economy"
## [4] "Why the legacy media suddenly sound like Bernie Sanders | \nBernie Sanders was right"
## [5] "Team Trump Suddenly Backtracks on Key Campaign Promise - Donald Trump’s Ukraine envoy made a damning confession on the likelihood of the war ending."
## [6] "Gov. Gavin Newsom launches website to fight misinformation about California’s fires"
## [7] "Group of Experts Says R.F.K. Jr. Would ‘Significantly Undermine’ Public Health"
## [8] "Insurrectionists Melt Down After Vance Says Trump Shouldn’t Pardon Violent J6ers"
## [9] "Democrats Raise Alarm Over Pete Hegseth's FBI Check: 'Cover-Up'"
## [10] "Defense pick Pete Hegseth repeatedly criticized removing names of Confederate generals from US bases"
## [11] "\"There will be strings attached\": GOP Sen. says Los Angeles wildfire aid won't be \"blank check\""
## [12] "‘I Think Things Are Going to Be Bad, Really Bad’: The US Military Debates Possible Deployment on US Soil Under Trump"
## [13] "Donald Trump isn’t even in office yet and silly season has already begun"
## [14] "Kansas House speaker bans reporters from chamber floor, doesn’t say why"
## [15] "Trump appears desperate to keep Jack Smith’s findings under wraps"
## [16] "Steve Bannon Vows To Oust Elon Musk From Donald Trump’s Inner Circle Before The Inauguration: ‘I made it my personal thing to take this guy down”"
## [17] "Trump’s tariffs will create a hunger games landscape where the little guy is guaranteed to lose"
## [18] "Florida Voters Sue Gov. DeSantis For Failing to Call Special Elections"
## [19] "Trump’s deportation vows only for ‘rabid’ Republicans and will fail, says Newt Gingrich"
## [20] "Medical debt banned from credit reports by new Biden administration rule"
## [21] "Project 2025’s Plan to Dismantle Public Education—And Screw Over Disabled Kids"
## [22] "Incoming Trump team is questioning civil servants at National Security Council about their loyalty"
## [23] "Wray defends Mar-a-Lago search: ‘Part of the FBI’s job is to safeguard classified information’"
## [24] "Russia once floated Greenland invasion in fake letter to Trump ally"
## [25] "Elon Musk Is an ‘Evil Person,’ Steve Bannon Says"
## [26] "Enabling Trump is a bad look for Fetterman | Pennsylvania's senior senator was elected as a progressive Democrat. His normalization of Donald Trump is the epitome of a sellout."
This all looks great, but there are only a few posts! What if we wanted to grab more of them? The trick is to use the older version of Reddit (old.reddit.com), which paginates results with “next” links instead of the “never ending reddit” infinite scroll. The technique is exactly the same:
url <- "https://old.reddit.com/r/politics/"
pol_old <- fromJSON(paste0(url,".json"))
out_old <- pol_old$data$children$data
out_old$title
## [1] "Discussion Thread: President Biden Delivers Final Foreign Policy Address"
## [2] "Marjorie Taylor Greene Launches Unhinged Call for Officials to Manipulate the Weather to Stop L.A. Wildfires"
## [3] "Suddenly Donald Trump doesn’t want to talk so much about the economy"
## [4] "Why the legacy media suddenly sound like Bernie Sanders | \nBernie Sanders was right"
## [5] "Team Trump Suddenly Backtracks on Key Campaign Promise - Donald Trump’s Ukraine envoy made a damning confession on the likelihood of the war ending."
## [6] "Gov. Gavin Newsom launches website to fight misinformation about California’s fires"
## [7] "Group of Experts Says R.F.K. Jr. Would ‘Significantly Undermine’ Public Health"
## [8] "Insurrectionists Melt Down After Vance Says Trump Shouldn’t Pardon Violent J6ers"
## [9] "Democrats Raise Alarm Over Pete Hegseth's FBI Check: 'Cover-Up'"
## [10] "Defense pick Pete Hegseth repeatedly criticized removing names of Confederate generals from US bases"
## [11] "\"There will be strings attached\": GOP Sen. says Los Angeles wildfire aid won't be \"blank check\""
## [12] "‘I Think Things Are Going to Be Bad, Really Bad’: The US Military Debates Possible Deployment on US Soil Under Trump"
## [13] "Donald Trump isn’t even in office yet and silly season has already begun"
## [14] "Kansas House speaker bans reporters from chamber floor, doesn’t say why"
## [15] "Trump appears desperate to keep Jack Smith’s findings under wraps"
## [16] "Steve Bannon Vows To Oust Elon Musk From Donald Trump’s Inner Circle Before The Inauguration: ‘I made it my personal thing to take this guy down”"
## [17] "Trump’s tariffs will create a hunger games landscape where the little guy is guaranteed to lose"
## [18] "Florida Voters Sue Gov. DeSantis For Failing to Call Special Elections"
## [19] "Trump’s deportation vows only for ‘rabid’ Republicans and will fail, says Newt Gingrich"
## [20] "Medical debt banned from credit reports by new Biden administration rule"
## [21] "Project 2025’s Plan to Dismantle Public Education—And Screw Over Disabled Kids"
## [22] "Incoming Trump team is questioning civil servants at National Security Council about their loyalty"
## [23] "Wray defends Mar-a-Lago search: ‘Part of the FBI’s job is to safeguard classified information’"
## [24] "Russia once floated Greenland invasion in fake letter to Trump ally"
## [25] "Elon Musk Is an ‘Evil Person,’ Steve Bannon Says"
## [26] "Enabling Trump is a bad look for Fetterman | Pennsylvania's senior senator was elected as a progressive Democrat. His normalization of Donald Trump is the epitome of a sellout."
To make this actionable, we need to make a few modifications after studying how the links work. Clicking “next” on old Reddit yields a URL of the form ?count=25&after=<id>, where <id> is the “name” (the fullname ID) of the last post on the current page. Those names are already in our data, and “.json” gets inserted into the appropriate place:
out_old$name
## [1] "t3_1i0iirv" "t3_1i0d68v" "t3_1i0dzpj" "t3_1i0fojn" "t3_1i0i2dy"
## [6] "t3_1i0c7y9" "t3_1i0c72q" "t3_1i03lwn" "t3_1i0gi2g" "t3_1i0btwv"
## [11] "t3_1hzyqam" "t3_1i00zl3" "t3_1i0bmvp" "t3_1i0fifu" "t3_1i0jgjn"
## [16] "t3_1i02g8h" "t3_1i05wbs" "t3_1hzviqu" "t3_1i0chxd" "t3_1hzwck4"
## [21] "t3_1i0d1m6" "t3_1i0dncq" "t3_1i0hgu4" "t3_1i0dztm" "t3_1i0h9fd"
## [26] "t3_1hzohs4"
next_page <- paste0("https://old.reddit.com/r/politics/.json?count=25&after=",out_old$name[nrow(out_old)])
pol_next <- fromJSON(next_page)
out_next <- pol_next$data$children$data
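As a quick sanity check that we actually advanced a page, the post IDs of the two batches should not overlap (and the titles should differ):
intersect(out_old$name, out_next$name) # expect character(0)
head(out_next$title)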
After playing around a little bit, the following works for grabbing multiple pages:
moar_reddit <- function(subreddit, n_pages = 1){
  base_url <- paste0("https://old.reddit.com/r/", subreddit, "/.json")
  out <- list()
  out[[1]] <- fromJSON(base_url, flatten = T)$data$children
  if(n_pages > 1){
    for(i in 2:n_pages){
      Sys.sleep(1) # be polite: pause between requests
      tmp <- out[[i - 1]]
      last_id <- tmp[nrow(tmp), "data.name"] # fullname of the last post on the previous page
      out[[i]] <- fromJSON(paste0(base_url, "?after=", last_id), flatten = T)$data$children
    }
  }
  do.call("rbind.fill", out) # plyr::rbind.fill pads mismatched columns
}
Let’s take it for a spin!
##some_more_reddit <- moar_reddit("politics",5)
##as_tibble(some_more_reddit)
Excellent! What sort of useful text can we grab from this data? Most immediately, we might grab the titles of the posts:
##head(some_more_reddit$data.title)
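Other fields in the flattened frame work the same way; for example, the self-text and upvote ratio (column names carry the data. prefix that flatten = T produces):
##head(some_more_reddit$data.selftext)
##head(some_more_reddit$data.upvote_ratio)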
1a. Write a script that collects the follower and followed-account data for one Republican-leaning and one Democrat-leaning political actor in a category of your choice (e.g., politician, activist, news).
1b. How many followers does each account have? How many accounts does each of them follow?
1c. Produce a dataframe of the follower accounts they have in common. How many users follow both of them?
2a. Write a script that collects the posts (or “skeets”) made by one of the same users. Save this to your computer as a csv; you will need it next class.
2b. Create a column that identifies whether a post is an original or a repost. (Hint: one way is by checking the ‘author_handle’ column).
2c. Are original posts or reposts more likely to receive a higher count of likes? Answer this question using a regression.