An R script for capturing the text, creation timestamp, retweet count, favorite count, language, and URL for each of the most recent 3,500 tweets on a specified Twitter user’s timeline. It also exports the data to a .csv file stored in R’s current working directory, which is typically the folder that contains the script. The script requires a one-time-per-machine Twitter API authorization using a valid Twitter login and password.
For demonstration purposes, this script captures tweets from the timeline of @sandyhook, the Twitter account of Sandy Hook Promise, a gun safety advocacy group. The output from each block of code appears below the block.
This script was developed using R version 4.2.1 (2022-06-23 ucrt).
The rtweet package is required, as are three supplementary packages: httpuv, tidyverse, and readr. This code checks whether each package is already installed and installs it if it isn’t. The code then loads each package into memory.
if (!require("rtweet")) install.packages("rtweet")
## Loading required package: rtweet
if (!require("httpuv")) install.packages("httpuv")
## Loading required package: httpuv
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks rtweet::flatten()
## ✖ dplyr::lag() masks stats::lag()
if (!require("readr")) install.packages("readr")
library(rtweet)
library(httpuv)
library(tidyverse)
library(readr)
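The install-if-missing pattern above can also be expressed as a single check over a vector of package names. The following is a minimal sketch rather than the script's own method: the helper name missing_packages is hypothetical, and the install step is left commented out so that the sketch only reports what is absent.

```r
# Hypothetical helper: return the names of packages that are not installed.
# requireNamespace() returns FALSE, without raising an error, when a package is absent.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("rtweet", "httpuv", "tidyverse", "readr")
to_install <- missing_packages(needed)

# Uncomment to install only the packages that are absent.
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

An empty result means every package in the vector is already available, and the library() calls can proceed.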
The script pulls data from the Twitter Application Programming Interface, or API, and the API requires the user to provide access credentials. The first time you run the code below, the script will open a browser window and ask you to sign in with a valid Twitter login and password. The rtweet package then saves the resulting authorization on the host computer. In all future executions of the script on the host computer, the script will automatically retrieve and reuse the stored authorization, as the output below shows.
auth_setup_default()
## Using default authentication available.
## Reading auth from 'C:\Users\kblake\AppData\Roaming/R/config/R/rtweet/default.rds'
The user name sandyhook can be replaced with the valid name of any other Twitter user. The target of 3500 tweets can be adjusted downward. The Twitter API is not guaranteed to provide the exact number of tweets specified by the target. The number retrieved depends partly on the API’s built-in pagination.
UserTweets <- get_timeline("sandyhook",
n = 3500,
retryonratelimit = TRUE,
verbose = TRUE)
The initial data pull results in a 43-variable data frame called UserTweets. Some of the variables are entirely blank. Others hold their data in nested lists, which beginning users may find difficult to unpack.
This code creates a second data frame, FinalData, that omits all but six variables: the creation timestamp, ID, text, retweet count, favorite count, and source language of each captured tweet. If you open the UserTweets data frame and see additional variables you would like to retain, you may add their names to those in the UserTweetsSubset vector. Just be sure to match the syntax shown. Specifically, surround each variable name with quotes, and place a comma after it, unless, like lang, it is the last item in the definition.
The code also uses the rm() function to delete the UserTweets data frame and the UserTweetsSubset vector. You may delete or skip that line if you want to retain these items for some reason.
UserTweetsSubset <- c("created_at",
"id_str",
"full_text",
"retweet_count",
"favorite_count",
"lang")
FinalData <- data.frame(UserTweets[UserTweetsSubset])
rm("UserTweets", "UserTweetsSubset")
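To make the subsetting step concrete, here is a small sketch that applies the same name-based selection to a stand-in data frame. The data frame toy and its values are invented for illustration and are not real API output. The final setdiff() call is an optional check, not part of the original script, for catching misspelled column names before subsetting.

```r
# Stand-in for UserTweets: invented values, with one extra column to drop.
toy <- data.frame(
  created_at = as.POSIXct(c("2022-07-04 18:30:00", "2022-07-05 02:15:00"), tz = "UTC"),
  id_str = c("1000000000000000001", "1000000000000000002"),
  full_text = c("First example tweet", "Second example tweet"),
  retweet_count = c(3L, 0L),
  source = c("Web", "Mobile"),
  stringsAsFactors = FALSE
)

keep <- c("created_at", "id_str", "full_text", "retweet_count")

# Indexing with a character vector keeps only the named columns, in that order.
toy_subset <- data.frame(toy[keep])
print(names(toy_subset))

# Optional check: list any requested names that are not columns of the data.
not_found <- setdiff(c(keep, "lang"), names(toy))
print(not_found)
```

Here the check reports lang, because the stand-in data frame has no such column. Requesting a missing name in the subsetting step itself would stop with an "undefined columns selected" error.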
By default, each tweet’s timestamp is retrieved in Coordinated Universal Time (UTC). This code adds a localtime column that transforms the timestamp into a user-specified time zone. This demonstration uses America/Chicago to set the local time to U.S. Central Time. You may change the time zone specification. U.S. options include America/New_York, America/Denver, and America/Los_Angeles. To learn the specification for the time zone your computer is using, run the function Sys.timezone().
Additionally, the code uses each tweet’s ID to create a URL that, if pasted into a web browser, will retrieve the associated tweet, assuming the tweet hasn’t been deleted since collection took place. The code stores the URL in a column labeled URL and deletes the id_str column that contained only the ID.
FinalData$localtime <- as.POSIXct(FinalData$created_at,tz="GMT")
FinalData$localtime <- format(FinalData$localtime, tz = "America/Chicago", usetz = TRUE)
FinalData$URL <- paste("https://twitter.com/user/status/",FinalData$id_str, sep = "")
FinalData <- subset(FinalData, select = -c(id_str))
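The two transformations above can be tried on a single value before running them on the full data frame. This is a minimal sketch using one invented timestamp and one invented tweet ID; neither comes from the captured data.

```r
# One invented UTC timestamp and tweet ID, for illustration only.
created_at <- as.POSIXct("2022-07-04 18:30:00", tz = "UTC")
id_str <- "1000000000000000001"

# format() renders the same instant in another time zone and returns text.
# In July, America/Chicago is five hours behind UTC, so 18:30 becomes 13:30.
localtime <- format(created_at, tz = "America/Chicago", usetz = TRUE)

# paste0() is equivalent to paste(..., sep = "").
url <- paste0("https://twitter.com/user/status/", id_str)

print(localtime)
print(url)
```

Because format() returns a character string, the localtime column is text rather than a date-time value. The created_at column remains the one to use for date arithmetic or chronological sorting.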
Finally, this code exports the FinalData data frame to a comma-separated values file called RetrievedTweets.csv. The name of the .csv file can be customized as needed. If the specified file already exists, R will overwrite it with neither permission nor apology, so beware. The file will be stored in R’s current working directory, which is typically the folder that contains the script. This file can be imported back into R as needed, or imported into other applications, including Microsoft Excel.
write_excel_csv(FinalData, file = "RetrievedTweets.csv")
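One way to guard against the silent overwrite described above is to choose a different file name whenever the target already exists. The sketch below is an optional addition, not part of the original script. The helper name safe_csv_name is hypothetical, and the demonstration runs in a temporary directory so that no real file is touched.

```r
# Hypothetical helper: return `path` unchanged if no such file exists,
# otherwise return a variant with a timestamp inserted before the extension.
safe_csv_name <- function(path) {
  if (!file.exists(path)) return(path)
  stamp <- format(Sys.time(), "%Y%m%d-%H%M%S")
  paste0(tools::file_path_sans_ext(path), "-", stamp, ".", tools::file_ext(path))
}

# Demonstration: create an empty file, then ask for a safe name for the same path.
demo_path <- file.path(tempdir(), "RetrievedTweets.csv")
file.create(demo_path)
print(safe_csv_name(demo_path))
```

In the script itself, the export line could then pass safe_csv_name("RetrievedTweets.csv") as the file argument of write_excel_csv(), which would leave any earlier export in place.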