Overview

An R script for capturing the text, creation timestamp, retweet count, favorite count, language, and URL for each of the most recent 3,500 tweets on a specified Twitter user’s timeline. It also exports the data to a .csv file stored in the same R subdirectory as the script. The script requires a one-time-per-machine Twitter API authorization using a valid Twitter login and password.

For demonstration purposes, this script captures tweets from the timeline of @sandyhook, the Twitter account of Sandy Hook Promise, a gun safety advocacy group. The output from each block of code appears below the block.

This script was developed using R version 4.2.1 (2022-06-23 ucrt).

Required packages

The rtweet package is required, as are three supplementary packages. This code checks whether each package is already installed and installs it if it isn't. The code then loads each package into memory.

if (!require("rtweet")) install.packages("rtweet")
## Loading required package: rtweet
if (!require("httpuv")) install.packages("httpuv")
## Loading required package: httpuv
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks rtweet::flatten()
## ✖ dplyr::lag()     masks stats::lag()
if (!require("readr")) install.packages("readr")

library(rtweet)
library(httpuv)
library(tidyverse)
library(readr)

Authentication

The script pulls data from the Twitter Application Programming Interface, or API, and the API requires the user to provide access credentials. The first time you run the code below, the script will open a browser window and ask you to authorize the connection with a valid Twitter login and password. rtweet will save the resulting authorization token on the host computer. In all future executions of the script on the host computer, the script will automatically retrieve and use the stored token.

auth_setup_default()
## Using default authentication available.
## Reading auth from 'C:\Users\kblake\AppData\Roaming/R/config/R/rtweet/default.rds'
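If you want to confirm which credentials the script is using, rtweet provides a pair of helper functions. A quick sketch (both functions are part of rtweet 1.x; the output will vary by machine):

# List the names of all authentication tokens saved on this machine
auth_list()

# Return the token rtweet is currently using for API calls
auth_get()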

Querying the Twitter API

The user name sandyhook can be replaced with the valid name of any other Twitter user. The target of 3500 tweets can be adjusted downward. The Twitter API is not guaranteed to return the exact number of tweets requested; the number retrieved depends partly on the API's built-in pagination.

UserTweets <- get_timeline("sandyhook",
                           n = 3500,
                           retryonratelimit = TRUE,
                           verbose = TRUE)
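Before paring the data, it can be worth confirming what the pull actually returned. A quick check (a sketch; the counts and dates will differ by account and by when you run the script):

# Report the number of rows (tweets) and columns (variables) retrieved
dim(UserTweets)

# Report the date range the retrieved tweets cover
range(UserTweets$created_at)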

Paring the data frame

The initial data pull results in a 43-variable data frame called UserTweets. Some of the variables are entirely blank. Others hold their data in nested lists, which beginning users may find difficult to unpack.

This code creates a second data frame, FinalData, that omits all but six variables: the creation timestamp, ID, text, retweet count, favorite count, and source language of each captured tweet. If you open the data frame and see additional variables you would like to retain, you may add their names to those in the UserTweetsSubset vector. Just be sure to match the syntax shown. Specifically, surround each variable name with quotes, and place a comma after it unless, like lang, it is the last item in the vector definition.

The code also uses the rm() function to delete the UserTweets data frame and the UserTweetsSubset vector. You may delete or skip that line if you want to retain these items for some reason.
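To see all 43 variable names without opening the data frame, you can list them first. A quick sketch (run it before the rm() call below, while UserTweets still exists):

# Print the names of every variable in the initial data pull
names(UserTweets)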

UserTweetsSubset <- c("created_at",
                      "id_str",
                      "full_text",
                      "retweet_count",
                      "favorite_count",
                      "lang")
FinalData <- data.frame(UserTweets[UserTweetsSubset])
rm("UserTweets", "UserTweetsSubset")

Adding some useful columns

By default, each tweet’s timestamp is retrieved in Coordinated Universal Time (UTC). This code adds a localtime column that converts the timestamp into a user-specified time zone. This demonstration uses America/Chicago to set the local time to U.S. Central Time. You may change the time zone specification; U.S. options include America/New_York, America/Denver, and America/Los_Angeles. To learn the specification for the time zone your computer is using, run the function Sys.timezone().
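If you are unsure which specification to use, base R can report your computer's time zone and list every zone name it recognizes. A quick sketch:

# Report the time zone this computer is using
Sys.timezone()

# List all time zone names R recognizes (the Olson/IANA database)
OlsonNames()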

Additionally, the code uses each tweet’s ID to create a URL that, if pasted into a web browser, will retrieve the associated tweet, assuming the tweet hasn’t been deleted since collection took place. The code stores the URL in a column labeled URL and deletes the id_str column that contained only the ID.

FinalData$localtime <- as.POSIXct(FinalData$created_at, tz = "GMT")
FinalData$localtime <- format(FinalData$localtime, tz = "America/Chicago", usetz = TRUE)
FinalData$URL <- paste("https://twitter.com/user/status/", FinalData$id_str, sep = "")
FinalData <- subset(FinalData, select = -c(id_str))
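As a quick sanity check (a sketch; your values will differ), you can preview the new columns to confirm the conversion and URL construction look right:

# Preview the converted timestamps and constructed URLs
head(FinalData[, c("localtime", "URL")])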

Exporting the data

Finally, this code exports the FinalData data frame to a comma-separated values file called RetrievedTweets.csv. The name of the .csv file can be customized as needed. If R finds that the specified file already exists, R will overwrite it with neither permission nor apology, so beware. The file will be stored in the same R subdirectory as the script. The write_excel_csv() function is used here rather than write_csv() because it adds a byte-order mark that helps Microsoft Excel display non-ASCII characters, such as emoji, correctly. This file can be imported back into R as needed, or imported into other applications, including Microsoft Excel.

write_excel_csv(FinalData, file = "RetrievedTweets.csv")
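To bring the data back into R later, readr's read_csv() will rebuild the data frame (a sketch; ReimportedTweets is a hypothetical name, and the file name must match whatever you exported):

# Read the exported tweets back into a data frame
ReimportedTweets <- read_csv("RetrievedTweets.csv")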