Intro

This R script uses the GDELT 2.0 DOC API to fetch individual article URLs, headlines, and publication dates for online news articles that mention one or more user-specified keywords or phrases in their body copy and that were published by one or more user-specified online news outlets during a user-specified time period. The GDELT API offers data for stories published in January 2017 or later.

Additionally, the script can export the fetched data in comma-separated value (.csv) format, produce a sorted word frequency list of the headline text, code user-specified keywords as either present (1) or absent (0) in each headline, aggregate the data to produce weekly and daily counts of headlines that mention each issue, and graph these weekly and daily counts as interactive stacked-area charts.

Questions about this script may be directed to Dr. Ken Blake. See: https://drkblake.com/. Scroll to the end of this page for the complete script. The sections immediately below divide the script into code chunks and offer instructions and explanations.

Required packages

The script requires several R packages. This first batch is needed for the script’s basic retrieval, export, and word count functions. The code installs any packages that haven’t already been installed and ensures that the requisite package libraries are activated for the session.

#########################
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("readr")) install.packages("readr")
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidytext")) install.packages("tidytext")
library(tidyverse)
library(readr)
library(dplyr)
library(tidytext)
library(stringr) # Part of the tidyverse package

Query setup

The next block of code is where you tell the script which terms to look for, which news outlets to search, and the date range to be queried. Single words, multi-word phrases, and combinations of the two can be searched. Here are some example queries, along with the word or word combination each one will search stories for:

query <- "'Biden'" : The word “Biden”

query <- "'Joe Biden'" : The phrase “Joe Biden”

query <- "'Joe Biden' 'Jill Biden'" : Both “Joe Biden” and “Jill Biden”

query <- "('Joe Biden' OR 'Jill Biden')" : Either “Joe Biden” or “Jill Biden,” or both

query <- "'Joe Biden' AND ('Jill Biden' OR 'Hunter Biden' OR 'Beau Biden')" : “Joe Biden” and at least one of the other Biden family members named

Punctuation is critical in all of these query variations. Be sure every double quotation mark, single quotation mark, and, if present, parenthesis is included and placed correctly.
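
If you want to catch punctuation mistakes before running a query, a small helper can count the quotation marks and parentheses for you. Here is a minimal, optional sketch (check_query is a hypothetical function, not part of the main script) that uses str_count() from stringr, which the first code chunk already loads:

#########################
#Optional sanity check: flag unbalanced single quotes or parentheses
#in a query string. check_query() is a hypothetical helper.
check_query <- function(q) {
  quotes_ok <- str_count(q, "'") %% 2 == 0                 #quotes pair up
  parens_ok <- str_count(q, "\\(") == str_count(q, "\\)")  #parens match
  if (quotes_ok && parens_ok) {
    print("Query punctuation looks balanced.")
  } else {
    print("Check your quotation marks and/or parentheses.")
  }
}
check_query("('Joe Biden' OR 'Jill Biden')")
## [1] "Query punctuation looks balanced."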

This example sets up a search of two news outlets for all stories that mention the phrase “Joe Biden” and that were published between Jan. 2 and Feb. 19 of 2023.

#########################
query <- "'Joe Biden'" #Enter search term(s)
startdate <- "20230102" #Enter preferred start date
enddate <- "20230219" #Enter preferred end date
sources <- c("washingtonpost.com", 
             "nytimes.com") #Enter sources to search

Estimating run time

Each fetch will take approximately 2.5 seconds. Two of the seconds come from a pause built into the code to avoid exceeding the GDELT API’s rate limit. The number of fetches will equal the number of specified sources, multiplied by the number of days in the specified date range. Thus, run times can lengthen quickly if you search many sources over a long period of time. Running this code will display an estimate of how much time, both in minutes and in hours, your query will require.

#########################
#Generating a sequence of dates
startdate2 <- as.Date(startdate,"%Y%m%d")
enddate2 <- as.Date(enddate,"%Y%m%d")
dates <- seq(as.Date(startdate2), as.Date(enddate2), "days")
dates <- format(dates, "%Y%m%d")
#Estimating run time for query
Minutes <- round((length(sources)*(length(dates)*2.5/60)), digits = 1)
Hours <- round((length(sources)*(length(dates)*2.5/3600)), digits = 1)
print(paste0("In minutes: ",Minutes,". In hours: ",Hours,"."))
## [1] "In minutes: 4.1. In hours: 0.1."

Getting loopy

If your query’s estimated run time is lengthy, you can go do things in other applications while the nested “for loops” in this block of code run, or you can go to bed and let the code run while you sleep. The code will display information about each fetch as it cycles through the loops. Just be sure you have a reliable internet connection. A service disruption - sometimes even a slight one - can break the loops and halt the program.
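
If a flaky connection is a concern, one option (not part of the original script) is to wrap each fetch in tryCatch() so that a failed request is retried instead of halting the loops. A minimal sketch, assuming up to three attempts per fetch; fetch_with_retry is a hypothetical helper:

#########################
#Optional: retry a failed fetch a few times before giving up.
#fetch_with_retry() is a hypothetical helper, not part of the script.
fetch_with_retry <- function(url, attempts = 3) {
  for (i in seq_len(attempts)) {
    result <- tryCatch(read_csv(url, show_col_types = FALSE),
                       error = function(e) NULL)
    if (!is.null(result)) return(result)  #Success: return the data
    print(paste0("Fetch failed; retrying (", i, " of ", attempts, ")"))
    Sys.sleep(5)                          #Pause before the next attempt
  }
  return(NULL)                            #All attempts failed
}

Inside the loop below, you would replace the read_csv() line with Thisfetch <- fetch_with_retry(URL_encoded) and skip the iteration when it returns NULL.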

In sum, the code iterates through your query, building a Headlines data frame as it goes. Once finished, it eliminates any duplicate records, displays a CountsBySource data frame containing the number of articles retrieved per source, and exports the data as a .csv file named for the query used and the date range searched (e.g., 'Joe Biden'20230102to20230219.csv). The file is stored on your computer in the same sub-directory as this R script. Finally, the code deletes data frames and variables that were generated during the query but are no longer needed, then opens and displays the Headlines data frame.

One tip: The optional code line:

#URL_p2 <- " repeat3:\"Biden\" domainis:"

… gives you a way to retrieve only articles that mention “Biden” three or more times. Uncommenting the line can help avoid retrieving articles that mention “Biden” only incidentally, or as part of a teaser or link for a different story. You also can edit the line to specify some number of repetitions of some other term. For example, if you were searching for articles mentioning “Kamala Harris,” you might want to change the code to read:

URL_p2 <- " repeat5:\"Harris\" domainis:"

… to retrieve only those articles that mention “Harris” five or more times. Don’t accidentally omit either of the \" characters; doing so will cause the script to malfunction.

#########################
#Creating the dataframe
Headlines = data.frame(
  Source = character(),
  URL = character(),
  MobileURL = character(),
  Date = character(),
  Title = character(),
  stringsAsFactors = FALSE)

#Defining the query URL parts
URL_p1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
URL_p2 <- " domainis:"
#URL_p2 <- " repeat3:\"Biden\" domainis:"
URL_p3 <- "&mode=artlist&maxrecords=250&sort=datedesc&startdatetime="
URL_p4 <- "000000&enddatetime="
URL_p5 <- "235959&format=CSV"

# For loops
for (thissource in sources)
{  
  for (thisdate in dates)
  {
    URL_raw <- paste0(URL_p1,
                      query,
                      URL_p2,
                      thissource,
                      URL_p3,
                      thisdate,
                      URL_p4,
                      thisdate,
                      URL_p5)
    URL_encoded <- URLencode(URL_raw)
    print("---------------------------------")
    print("Getting data for")
    print(paste0(thisdate," ",thissource))
    print("---------------------------------")
    Thisfetch <- read_csv(URL_encoded, show_col_types = FALSE)
    Thisfetch$Source <- thissource
    print(paste0(nrow(Thisfetch)," rows"))
    Headlines <- rbind(Headlines,Thisfetch)
    Sys.sleep(2)
  }
}
#Deduplicate, check, and export data
Headlines <- Headlines[!duplicated(Headlines$URL),]
CountsBySource <- Headlines %>% 
  group_by(Source) %>% 
  summarize(HeadlineCount = n())
#Show counts by source
View(CountsBySource)
filename <- paste0(query,startdate,"to",enddate,".csv")
write_excel_csv(Headlines,filename)
#Cleanup
rm(Thisfetch,
   dates,
   enddate,
   enddate2,
   filename,
   query,
   sources,
   startdate,
   startdate2,
   thisdate,
   thissource,
   URL_encoded,
   URL_p1,
   URL_p2,
   URL_p3,
   URL_p4,
   URL_p5,
   URL_raw)
#View raw data frame
View(Headlines)
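
In a later session, by the way, you can reload the exported file instead of re-running the whole query. A minimal sketch, assuming the file name produced by the example query above:

#########################
#Reload previously exported data (file name from the example query)
Headlines <- read_csv("'Joe Biden'20230102to20230219.csv",
                      show_col_types = FALSE)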

Headline word counts

You might like to know which words appeared most often in the headlines of the articles your query retrieved. This code breaks the headlines into their component words (“tokenizing”), counts the number of times each word appears, and shows you a data frame called WordCounts with the words sorted in descending order of frequency. Common “stop words,” like “a,” “an,” and “the,” are deleted from the list for your convenience. You can edit the c("and","the","etc.") list before running the code to remove any additional words you don’t want included.

#########################
#Headline word counts
WordCounts <- Headlines %>% 
  unnest_tokens(word,Title) %>% 
  count(word, sort = TRUE)
# Deleting standard stop words
data("stop_words")
WordCounts <- WordCounts %>%
  anti_join(stop_words)
# Deleting custom stop words
my_stopwords <- tibble(word = c("and",
                                "the",
                                "etc."))
WordCounts <- WordCounts %>% 
  anti_join(my_stopwords)
rm(stop_words,
   my_stopwords)
#Viewing word counts
View(WordCounts)
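
If a visual summary would help, the top words can be graphed with ggplot2, which loads with the tidyverse. A minimal sketch, assuming you want the 20 most frequent words:

#########################
#Optional: bar chart of the 20 most frequent headline words
WordCounts %>%
  slice_max(n, n = 20) %>%                   #Keep the 20 highest counts
  ggplot(aes(x = n, y = reorder(word, n))) + #Order bars by frequency
  geom_col() +
  labs(x = "Count", y = NULL, title = "Most frequent headline words")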

Coding the headlines

Optionally, you can edit the code below to have R automatically categorize each headline as containing (1) or not containing (0) one or more words or phrases you specify. You’ll need one code block per categorization task. Each code block runs from searchterms <- to ignore.case = TRUE),1,0). Put the words or phrases you want R to look for between the double quotation marks after the code block’s searchterms <- portion. Putting a | character between words or phrases means “or.” Also, put a unique variable name after the block’s Headlines$ code. In the first code block in the example below, the search terms chosen are "trump|maga", and Trump is the chosen name of the variable that will indicate whether “trump” or “maga” appeared in each headline.

#########################
#Headline coding
searchterms <- "trump|maga"
Headlines$Trump <- ifelse(grepl(searchterms,
                               Headlines$Title,
                               ignore.case = TRUE),1,0)
searchterms <- "house|gop|republican|mccarthy"
Headlines$Republicans <- ifelse(grepl(searchterms,
                               Headlines$Title,
                               ignore.case = TRUE),1,0)
searchterms <- "classified documents"
Headlines$Documents <- ifelse(grepl(searchterms,
                                   Headlines$Title,
                                   ignore.case = TRUE),1,0)
searchterms <- "debt|ceiling"
Headlines$Debt <- ifelse(grepl(searchterms,
                                    Headlines$Title,
                                    ignore.case = TRUE),1,0)
searchterms <- "Jan. 6|capitol riot"
Headlines$Jan6 <- ifelse(grepl(searchterms,
                               Headlines$Title,
                               ignore.case = TRUE),1,0)
searchterms <- "China|Chinese"
Headlines$China <- ifelse(grepl(searchterms,
                               Headlines$Title,
                               ignore.case = TRUE),1,0)
searchterms <- "Ukraine|Ukranian|Russia"
Headlines$Ukraine <- ifelse(grepl(searchterms,
                                Headlines$Title,
                                ignore.case = TRUE),1,0)
searchterms <- "border|migra"
Headlines$Immigration <- ifelse(grepl(searchterms,
                                  Headlines$Title,
                                  ignore.case = TRUE),1,0)
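
Before moving on, it can be worth checking how many headlines matched each category. A quick, optional check using colSums() on the coded columns:

#########################
#Optional check: total headlines matched per coded category
colSums(Headlines[ , c("Trump", "Republicans", "Documents", "Debt",
                       "Jan6", "China", "Ukraine", "Immigration")])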

Graphing the data, Part 1

You might like to see a graph summarizing how many times each coded word or phrase appears in a headline during a given day or week, and how these frequencies have changed over time. That’s the goal this part of the script is heading toward. First, though, the script needs to aggregate those headline categorizations, both by day and by week. To make this block of code work, make sure that any variable name changes you make in the preceding block of code get made in this block of code, too. For example, if you add code to the previous block asking R to create headline coding for a variable called Hunter (as in, the president’s younger son), ensure that there is a line after each summarize( in the code below that reads Hunter = sum(Hunter, na.rm=TRUE), (including the trailing comma). If you’d rather not edit each summarize() call by hand, see the optional alternative after the code chunk.

#########################
#Categorizing by day and also by week
#The lubridate package is required for this code chunk.
if (!require("lubridate")) install.packages("lubridate")
library(lubridate)
#Add "day" and "WeekOf" variables to the data frame
Headlines$Day <- round_date(Headlines$Date,
                            unit = "day")
Headlines$WeekOf <- round_date(Headlines$Date,
                               unit = "week",
                               week_start = getOption("lubridate.week.start",1))
#Aggregating by week
AggByWeek <- Headlines %>%
  group_by(WeekOf) %>% 
  summarize(Trump = sum(Trump, na.rm=TRUE),
            Republicans = sum(Republicans, na.rm=TRUE),
            Documents = sum(Documents, na.rm=TRUE),
            Debt = sum(Debt, na.rm=TRUE),
            Jan6 = sum(Jan6, na.rm=TRUE),
            China = sum(China, na.rm=TRUE),
            Ukraine = sum(Ukraine, na.rm=TRUE),
            Immigration = sum(Immigration, na.rm=TRUE),
            ArticleCount = n())
#Aggregating by day
AggByDay <- Headlines %>%
  group_by(Day) %>% 
  summarize(Trump = sum(Trump, na.rm=TRUE),
            Republicans = sum(Republicans, na.rm=TRUE),
            Documents = sum(Documents, na.rm=TRUE),
            Debt = sum(Debt, na.rm=TRUE),
            Jan6 = sum(Jan6, na.rm=TRUE),
            China = sum(China, na.rm=TRUE),
            Ukraine = sum(Ukraine, na.rm=TRUE),
            Immigration = sum(Immigration, na.rm=TRUE),
            ArticleCount = n())
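
As promised, here is a way to avoid editing each summarize() call by hand: dplyr’s across() can sum every coded column at once. A minimal sketch, assuming topic_vars (a hypothetical name introduced here, not part of the script above) lists all of your coded variables:

#########################
#Optional alternative: sum all coded columns in one step with across().
#topic_vars is a hypothetical name; list your own coded variables here.
topic_vars <- c("Trump", "Republicans", "Documents", "Debt",
                "Jan6", "China", "Ukraine", "Immigration")
AggByWeek <- Headlines %>%
  group_by(WeekOf) %>%
  summarize(across(all_of(topic_vars), ~ sum(.x, na.rm = TRUE)),
            ArticleCount = n())
AggByDay <- Headlines %>%
  group_by(Day) %>%
  summarize(across(all_of(topic_vars), ~ sum(.x, na.rm = TRUE)),
            ArticleCount = n())

With this version, adding a new coded variable only requires adding its name to topic_vars.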

Graphing the data, Part 2

Ready to see the graphs? This code will produce the weekly graph, then the daily graph. Each graph layer’s color can be controlled by editing the six-digit hex color code in the layer’s fillcolor = line. The website https://coolors.co/palettes/trending is an excellent source of hex codes for a selection of color palettes.

#########################
#Graphing by week and by day
if (!require("plotly")) install.packages("plotly")
library(plotly)
# Color palettes: https://coolors.co/palettes/trending
# Graphing by week
fig <- plot_ly(AggByWeek, x = ~WeekOf, y = ~Trump, 
               name = 'Trump', type = 'scatter', 
               mode = 'none', stackgroup = 'one', 
               fillcolor = '#F94144')
fig <- fig %>% add_trace(y = ~Republicans, 
                         name = 'Republicans', 
                         fillcolor = '#F3722C')
fig <- fig %>% add_trace(y = ~Documents, name = 'Documents', 
                         fillcolor = '#F8961E')
fig <- fig %>% add_trace(y = ~Debt, name = 'Debt', 
                         fillcolor = '#F9844A')
fig <- fig %>% add_trace(y = ~Jan6, name = 'Jan. 6', 
                         fillcolor = '#90BE6D')
fig <- fig %>% add_trace(y = ~China, name = 'China', 
                         fillcolor = '#43AA8B')
fig <- fig %>% add_trace(y = ~Ukraine, name = 'Ukraine', 
                         fillcolor = '#4D908E')
fig <- fig %>% add_trace(y = ~Immigration, name = 'Immigration', 
                         fillcolor = '#577590')
fig <- fig %>% layout(title = 'Headline counts, by topic and week',
                      xaxis = list(title = "Week of",
                                   showgrid = FALSE),
                      yaxis = list(title = "Count",
                                   showgrid = TRUE))

fig
# Graphing by day
fig <- plot_ly(AggByDay, x = ~Day, y = ~Trump, 
               name = 'Trump', type = 'scatter', 
               mode = 'none', stackgroup = 'one', 
               fillcolor = '#F94144')
fig <- fig %>% add_trace(y = ~Republicans, 
                         name = 'Republicans', 
                         fillcolor = '#F3722C')
fig <- fig %>% add_trace(y = ~Documents, name = 'Documents', 
                         fillcolor = '#F8961E')
fig <- fig %>% add_trace(y = ~Debt, name = 'Debt', 
                         fillcolor = '#F9844A')
fig <- fig %>% add_trace(y = ~Jan6, name = 'Jan. 6', 
                         fillcolor = '#90BE6D')
fig <- fig %>% add_trace(y = ~China, name = 'China', 
                         fillcolor = '#43AA8B')
fig <- fig %>% add_trace(y = ~Ukraine, name = 'Ukraine', 
                         fillcolor = '#4D908E')
fig <- fig %>% add_trace(y = ~Immigration, name = 'Immigration', 
                         fillcolor = '#577590')
fig <- fig %>% layout(title = 'Headline counts, by topic and day',
                      xaxis = list(title = "Day",
                                   showgrid = FALSE),
                      yaxis = list(title = "Count",
                                   showgrid = TRUE))

fig
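
If you want to keep a chart outside of RStudio’s viewer, the htmlwidgets package (not installed above) can save the most recent figure as a standalone, still-interactive HTML file. A minimal sketch; the file name is a hypothetical example:

#########################
#Optional: save the current chart as a standalone HTML file
if (!require("htmlwidgets")) install.packages("htmlwidgets")
library(htmlwidgets)
saveWidget(fig, "headline_counts.html") #Hypothetical file name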

The whole script

Finally, here’s the entire script, all in one piece, with a few embedded notes and hints, so you can conveniently copy and paste it into your own R script file.

#########################
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("readr")) install.packages("readr")
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidytext")) install.packages("tidytext")
library(tidyverse)
library(readr)
library(dplyr)
library(tidytext)
library(stringr) # Part of the tidyverse package

#########################
query <- "'Joe Biden'" #Enter search term(s)
startdate <- "20230102" #Enter preferred start date
enddate <- "20230219" #Enter preferred end date
sources <- c("washingtonpost.com", 
             "nytimes.com") #Enter sources to search

#########################
#Generating a sequence of dates
startdate2 <- as.Date(startdate,"%Y%m%d")
enddate2 <- as.Date(enddate,"%Y%m%d")
dates <- seq(as.Date(startdate2), as.Date(enddate2), "days")
dates <- format(dates, "%Y%m%d")
#Estimating run time for query
Minutes <- round((length(sources)*(length(dates)*2.5/60)), digits = 1)
Hours <- round((length(sources)*(length(dates)*2.5/3600)), digits = 1)
print(paste0("In minutes: ",Minutes,". In hours: ",Hours,"."))

#########################
#Creating the dataframe
Headlines = data.frame(
  Source = character(),
  URL = character(),
  MobileURL = character(),
  Date = character(),
  Title = character(),
  stringsAsFactors = FALSE)

#Defining the query URL parts
URL_p1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
URL_p2 <- " domainis:"
#URL_p2 <- " repeat3:\"Biden\" domainis:"
URL_p3 <- "&mode=artlist&maxrecords=250&sort=datedesc&startdatetime="
URL_p4 <- "000000&enddatetime="
URL_p5 <- "235959&format=CSV"

# For loops
for (thissource in sources)
{  
  for (thisdate in dates)
  {
    URL_raw <- paste0(URL_p1,
                      query,
                      URL_p2,
                      thissource,
                      URL_p3,
                      thisdate,
                      URL_p4,
                      thisdate,
                      URL_p5)
    URL_encoded <- URLencode(URL_raw)
    print("---------------------------------")
    print("Getting data for")
    print(paste0(thisdate," ",thissource))
    print("---------------------------------")
    Thisfetch <- read_csv(URL_encoded, show_col_types = FALSE)
    Thisfetch$Source <- thissource
    print(paste0(nrow(Thisfetch)," rows"))
    Headlines <- rbind(Headlines,Thisfetch)
    Sys.sleep(2)
  }
}
#Deduplicate, check, and export data
Headlines <- Headlines[!duplicated(Headlines$URL),]
CountsBySource <- Headlines %>% 
  group_by(Source) %>% 
  summarize(HeadlineCount = n())
View(CountsBySource)
filename <- paste0(query,startdate,"to",enddate,".csv")
write_excel_csv(Headlines,filename)
#Cleanup
rm(Thisfetch,
   dates,
   enddate,
   enddate2,
   filename,
   query,
   sources,
   startdate,
   startdate2,
   thisdate,
   thissource,
   URL_encoded,
   URL_p1,
   URL_p2,
   URL_p3,
   URL_p4,
   URL_p5,
   URL_raw)
#View raw data frame
View(Headlines)

#########################
#Headline word counts
WordCounts <- Headlines %>% 
  unnest_tokens(word,Title) %>% 
  count(word, sort = TRUE)
# Deleting standard stop words
data("stop_words")
WordCounts <- WordCounts %>%
  anti_join(stop_words)
# Deleting custom stop words
my_stopwords <- tibble(word = c("and",
                                "the",
                                "etc."))
WordCounts <- WordCounts %>% 
  anti_join(my_stopwords)
rm(stop_words,
   my_stopwords)
#Viewing word counts
View(WordCounts)

#########################
#Headline coding
searchterms <- "trump|maga"
Headlines$Trump <- ifelse(grepl(searchterms,
                               Headlines$Title,
                               ignore.case = TRUE),1,0)
searchterms <- "house|gop|republican|mccarthy"
Headlines$Republicans <- ifelse(grepl(searchterms,
                               Headlines$Title,
                               ignore.case = TRUE),1,0)
searchterms <- "classified documents"
Headlines$Documents <- ifelse(grepl(searchterms,
                                   Headlines$Title,
                                   ignore.case = TRUE),1,0)
searchterms <- "debt|ceiling"
Headlines$Debt <- ifelse(grepl(searchterms,
                                    Headlines$Title,
                                    ignore.case = TRUE),1,0)
searchterms <- "Jan. 6|capitol riot"
Headlines$Jan6 <- ifelse(grepl(searchterms,
                               Headlines$Title,
                               ignore.case = TRUE),1,0)
searchterms <- "China|Chinese"
Headlines$China <- ifelse(grepl(searchterms,
                               Headlines$Title,
                               ignore.case = TRUE),1,0)
searchterms <- "Ukraine|Ukranian|Russia"
Headlines$Ukraine <- ifelse(grepl(searchterms,
                                Headlines$Title,
                                ignore.case = TRUE),1,0)
searchterms <- "border|migra"
Headlines$Immigration <- ifelse(grepl(searchterms,
                                  Headlines$Title,
                                  ignore.case = TRUE),1,0)

#########################
# Categorizing by day and also by week
if (!require("lubridate")) install.packages("lubridate")
library(lubridate)
Headlines$Day <- round_date(Headlines$Date,
                            unit = "day")
Headlines$WeekOf <- round_date(Headlines$Date,
                               unit = "week",
                               week_start = getOption("lubridate.week.start",1))
#Aggregating by week
AggByWeek <- Headlines %>%
  group_by(WeekOf) %>% 
  summarize(Trump = sum(Trump, na.rm=TRUE),
            Republicans = sum(Republicans, na.rm=TRUE),
            Documents = sum(Documents, na.rm=TRUE),
            Debt = sum(Debt, na.rm=TRUE),
            Jan6 = sum(Jan6, na.rm=TRUE),
            China = sum(China, na.rm=TRUE),
            Ukraine = sum(Ukraine, na.rm=TRUE),
            Immigration = sum(Immigration, na.rm=TRUE),
            ArticleCount = n())
#Aggregating by day
AggByDay <- Headlines %>%
  group_by(Day) %>% 
  summarize(Trump = sum(Trump, na.rm=TRUE),
            Republicans = sum(Republicans, na.rm=TRUE),
            Documents = sum(Documents, na.rm=TRUE),
            Debt = sum(Debt, na.rm=TRUE),
            Jan6 = sum(Jan6, na.rm=TRUE),
            China = sum(China, na.rm=TRUE),
            Ukraine = sum(Ukraine, na.rm=TRUE),
            Immigration = sum(Immigration, na.rm=TRUE),
            ArticleCount = n())

#########################
#Graphing by week and by day
if (!require("plotly")) install.packages("plotly")
library(plotly)
# Color palettes: https://coolors.co/palettes/trending
# Graphing by week
fig <- plot_ly(AggByWeek, x = ~WeekOf, y = ~Trump, 
               name = 'Trump', type = 'scatter', 
               mode = 'none', stackgroup = 'one', 
               fillcolor = '#F94144')
fig <- fig %>% add_trace(y = ~Republicans, 
                         name = 'Republicans', 
                         fillcolor = '#F3722C')
fig <- fig %>% add_trace(y = ~Documents, name = 'Documents', 
                         fillcolor = '#F8961E')
fig <- fig %>% add_trace(y = ~Debt, name = 'Debt', 
                         fillcolor = '#F9844A')
fig <- fig %>% add_trace(y = ~Jan6, name = 'Jan. 6', 
                         fillcolor = '#90BE6D')
fig <- fig %>% add_trace(y = ~China, name = 'China', 
                         fillcolor = '#43AA8B')
fig <- fig %>% add_trace(y = ~Ukraine, name = 'Ukraine', 
                         fillcolor = '#4D908E')
fig <- fig %>% add_trace(y = ~Immigration, name = 'Immigration', 
                         fillcolor = '#577590')
fig <- fig %>% layout(title = 'Headline counts, by topic and week',
                      xaxis = list(title = "Week of",
                                   showgrid = FALSE),
                      yaxis = list(title = "Count",
                                   showgrid = TRUE))

fig

# Graphing by day
fig <- plot_ly(AggByDay, x = ~Day, y = ~Trump, 
               name = 'Trump', type = 'scatter', 
               mode = 'none', stackgroup = 'one', 
               fillcolor = '#F94144')
fig <- fig %>% add_trace(y = ~Republicans, 
                         name = 'Republicans', 
                         fillcolor = '#F3722C')
fig <- fig %>% add_trace(y = ~Documents, name = 'Documents', 
                         fillcolor = '#F8961E')
fig <- fig %>% add_trace(y = ~Debt, name = 'Debt', 
                         fillcolor = '#F9844A')
fig <- fig %>% add_trace(y = ~Jan6, name = 'Jan. 6', 
                         fillcolor = '#90BE6D')
fig <- fig %>% add_trace(y = ~China, name = 'China', 
                         fillcolor = '#43AA8B')
fig <- fig %>% add_trace(y = ~Ukraine, name = 'Ukraine', 
                         fillcolor = '#4D908E')
fig <- fig %>% add_trace(y = ~Immigration, name = 'Immigration', 
                         fillcolor = '#577590')
fig <- fig %>% layout(title = 'Headline counts, by topic and day',
                      xaxis = list(title = "Day",
                                   showgrid = FALSE),
                      yaxis = list(title = "Count",
                                   showgrid = TRUE))

fig