Overview

Who got more coverage in U.S. online news content between Jan. 1 and Dec. 31, 2022: Donald Trump or Joe Biden? You might want to know if you are a media researcher, a political strategist, or just a news junkie. This R script will find and graph an answer for you. You can also tweak it to get the same information about any two public figures, for any recent range of dates. The coverage data come from the GDELT 2.0 API.

Installing required packages

First, install and load the tidyverse, dplyr, plotly, and readr packages, which include tools the script will need. The if (!require(...)) code means that R will skip installing any package you already have installed.

# Installing and loading the required packages
if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("dplyr"))
  install.packages("dplyr")
if (!require("plotly"))
  install.packages("plotly")
if (!require("readr"))
  install.packages("readr")
library(tidyverse)
library(dplyr)
library(plotly)
library(ggplot2)
library(readr)

Defining the date range

Now, define the beginning and end dates for the time period you want to search. Use four digits for the year, two digits for the month, and two digits for the day. For example, use 20220101 for Jan. 1, 2022. If you edit the dates below, be sure to leave the quote marks around each date.

startdate <- "20220101"
enddate <- "20221231"
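If you would rather start from ordinary dates than type the digits by hand, a small helper can produce the same strings. This is only a sketch: the helper name to_gdelt_date is hypothetical and is not part of the original script.

```r
# Hypothetical helper (not in the original script): converts a date
# such as "2022-01-01" into the eight-digit YYYYMMDD string that the
# script expects for startdate and enddate.
to_gdelt_date <- function(x) {
  format(as.Date(x), "%Y%m%d")
}

startdate <- to_gdelt_date("2022-01-01")
enddate <- to_gdelt_date("2022-12-31")
startdate
enddate
```

The results are still character strings, so the rest of the script can use them unchanged.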

Searching for “Trump” articles

This code is set to search for stories that mention “Trump” and that were published by U.S.-based online news sources during the specified time frame. You can change Trump to another search term, if you like. Most of the code is about constructing and encoding a URL that the GDELT 2.0 API will recognize and respond to, correctly formatting the data’s “Date” variable, and extracting the data into a data frame called “VolumeTrump.”

### Trump
query <- "'Trump' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url

## [1] "https://api.gdeltproject.org/api/v2/doc/doc?query='Trump'%20SourceCountry:US&mode=timelinevolinfo&startdatetime=20220101000000&enddatetime=20221231000000&format=CSV"

Volume <- read_csv(v_url)

## Rows: 365 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (21): Series, TopArtURL1, TopArtTitle1, TopArtURL2, TopArtTitle2, TopAr...
## dbl   (1): Value
## date  (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeTrump <- Volume

Searching for “Biden” articles

A ‘rinse and repeat’ of the above code. This time, though, it searches for stories that mention Biden and extracts the data to a data frame called “VolumeBiden.”

### Biden
query <- "'Biden' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url

## [1] "https://api.gdeltproject.org/api/v2/doc/doc?query='Biden'%20SourceCountry:US&mode=timelinevolinfo&startdatetime=20220101000000&enddatetime=20221231000000&format=CSV"

Volume <- read_csv(v_url)

## Rows: 365 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (21): Series, TopArtURL1, TopArtTitle1, TopArtURL2, TopArtTitle2, TopAr...
## dbl   (1): Value
## date  (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeBiden <- Volume
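The two searches differ only in the search term, so the URL-building steps could be wrapped in a function. The sketch below assumes the same URL pieces used above; the function name build_gdelt_url is hypothetical and is not part of the original script.

```r
# Hypothetical helper (not in the original script): builds the same
# encoded GDELT 2.0 API URL as the code above, for any search term.
build_gdelt_url <- function(term, startdate, enddate) {
  # Same query pattern as above: quoted term plus U.S. source filter
  query <- paste0("'", term, "' SourceCountry:US")
  text_v_url <- paste0(
    "https://api.gdeltproject.org/api/v2/doc/doc?query=", query,
    "&mode=timelinevolinfo&startdatetime=", startdate,
    "000000&enddatetime=", enddate,
    "000000&format=CSV"
  )
  # Encode the space in the query so the URL is valid
  utils::URLencode(text_v_url)
}

trump_url <- build_gdelt_url("Trump", "20220101", "20221231")
biden_url <- build_gdelt_url("Biden", "20220101", "20221231")
trump_url
biden_url
```

Each URL could then be passed to read_csv() exactly as in the code above.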

Merging and plotting the data

This block of code merges the two data frames and interactively plots, by day, the percentage of all GDELT-monitored articles mentioning “Trump” and the percentage of all GDELT-monitored articles mentioning “Biden.” In keeping with U.S. political news convention, it colors Biden’s line blue and Trump’s line red.

### Merging
VolumeTrumpBiden <- merge(VolumeTrump, VolumeBiden, by = "Date")
VolumeTrumpBiden$TrumpVolume <- VolumeTrumpBiden$Value.x
VolumeTrumpBiden$BidenVolume <- VolumeTrumpBiden$Value.y

#Plotting volume by date
library(plotly)
fig <- plot_ly(
  VolumeTrumpBiden,
  x = ~ Date,
  y = ~ BidenVolume,
  name = 'Biden',
  type = 'scatter',
  mode = 'lines',
  line = list(color = "#005F73")
)
fig <- fig %>% add_trace(
  y = ~ TrumpVolume,
  name = 'Trump',
  mode = 'lines',
  line = list(color = "#AE2012")
)
fig <-
  fig %>% layout(
    title = 'U.S. coverage volume, Biden v. Trump',
    xaxis = list(title = "Date",
                 showgrid = FALSE),
    yaxis = list(title = "Volume",
                 showgrid = TRUE)
  )
fig
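It may help to see why Value.x holds the Trump figures and Value.y holds the Biden figures. Both data frames have a column called Value, so merge() appends .x to the column from the first data frame and .y to the column from the second. The sketch below uses two tiny data frames with made-up numbers (not GDELT data) to show this, along with the optional suffixes argument.

```r
# Toy data with made-up values, for illustration only
trump <- data.frame(Date = as.Date(c("2022-01-01", "2022-01-02")),
                    Value = c(1.0, 1.2))
biden <- data.frame(Date = as.Date(c("2022-01-01", "2022-01-02")),
                    Value = c(1.5, 1.4))

# Default behavior: clashing column names get .x and .y suffixes,
# in the order the data frames were passed to merge()
merged_default <- merge(trump, biden, by = "Date")
names(merged_default)

# Optional: choose more descriptive suffixes at merge time
merged_named <- merge(trump, biden, by = "Date",
                      suffixes = c(".Trump", ".Biden"))
names(merged_named)
```

The script's approach of copying Value.x and Value.y into TrumpVolume and BidenVolume achieves the same clarity after the merge.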

Saving data to a local .csv file

This code will export the merged data in comma-separated value format and save the exported file in R’s current working directory, which is usually the same directory as the script.

write_csv(VolumeTrumpBiden, "VolumeTrumpBiden.csv")

Performing a paired-samples t-test

Finally, here’s paired-samples t-test code for evaluating the null hypothesis that daily coverage of Biden equaled daily coverage of Trump during the specified time period. In the code and output, V1 represents Trump’s volume, and V2 represents Biden’s volume.

# Read the data
mydata <- VolumeTrumpBiden

# Specify the two variables involved
mydata$V1 <- mydata$TrumpVolume
mydata$V2 <- mydata$BidenVolume

# Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1

ggplot(mydata, aes(x = PairDifferences)) +
  geom_histogram(color = "black", fill = "dodgerblue") +
  geom_vline(aes(xintercept = mean(PairDifferences)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Get descriptive statistics for pair differences
mydata %>%
  select(PairDifferences) %>%
  summarise(
    count = n(),
    mean = mean(PairDifferences, na.rm = TRUE),
    sd = sd(PairDifferences, na.rm = TRUE),
    min = min(PairDifferences, na.rm = TRUE),
    max = max(PairDifferences, na.rm = TRUE)
  )

##   count      mean        sd     min    max
## 1   365 0.5332984 0.4021545 -0.7973 2.3018

A paired-samples t-test assumes that the pair differences are normally distributed. The blue histogram offers a visual way to check that assumption. If the histogram looked non-normal, the analysis would require an alternative procedure that works with such data. In this case, though, the distribution looks reasonably normal. As a result, the paired-samples t-test can proceed.

# If pair differences look non-normal, you can use a Shapiro-Wilk test to check
# whether their distribution differs significantly from normal. If the
# Shapiro-Wilk test p-value is less than 0.05, use a Wilcoxon signed rank test
# instead of a paired-samples t-test.

# Shapiro-Wilk test
# options(scipen = 999)
# shapiro.test(mydata$PairDifferences)

# If the pair distribution is non-normal, consider using a Wilcoxon signed
# rank test instead of a paired-samples t-test.

# mydata %>%
#   select(V1, V2) %>%
#   summarise_all(list(Mean = mean, SD = sd))
# wilcox.test(mydata$V1, mydata$V2, paired = TRUE)

# If the pair differences are normally distributed,
# though, you may use a paired-samples t-test.

mydata %>%
  select(V1, V2) %>%
  summarise_all(list(Mean = mean, SD = sd))

##    V1_Mean  V2_Mean     V1_SD     V2_SD
## 1 1.065326 1.598624 0.3238287 0.3729211

The output shows that Trump’s daily coverage volume (V1) averaged about 1.07, while Biden’s daily coverage volume (V2) averaged about 1.60.

options(scipen = 999)
t.test(mydata$V2, mydata$V1,
       paired = TRUE)

## 
##  Paired t-test
## 
## data:  mydata$V2 and mydata$V1
## t = 25.335, df = 364, p-value < 0.00000000000000022
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.4919040 0.5746927
## sample estimates:
## mean difference 
##       0.5332984

On average, Biden’s daily coverage volume exceeded Trump’s daily coverage volume by 0.53. The paired-samples t-test found the difference to be statistically significant (t(364) = 25.3, p < .05).
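To see where the t statistic and confidence interval come from, you can recompute them from the descriptive statistics for the pair differences. The sketch below copies the count, mean, and standard deviation from the output shown earlier; it is a check on the arithmetic, not a replacement for t.test().

```r
# Summary values copied from the pair-differences output above
n <- 365
mean_diff <- 0.5332984
sd_diff <- 0.4021545

# Standard error of the mean difference
se <- sd_diff / sqrt(n)

# t statistic and degrees of freedom for a paired-samples t-test
t_stat <- mean_diff / se
df <- n - 1

# 95 percent confidence interval for the mean difference
ci <- mean_diff + c(-1, 1) * qt(0.975, df) * se

round(t_stat, 3)
df
round(ci, 4)
```

Up to rounding, these values should agree with the t, df, and confidence interval reported by t.test().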

Full script

Just want the script? Here it is, all in one piece, ready to copy and paste.

# Installing and loading the required packages
if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("dplyr"))
  install.packages("dplyr")
if (!require("plotly"))
  install.packages("plotly")
if (!require("readr"))
  install.packages("readr")
library(tidyverse)
library(dplyr)
library(plotly)
library(ggplot2)
library(readr)

### Date range
startdate <- "20220101"
enddate <- "20221231"

### Trump
query <- "'Trump' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
Volume <- read_csv(v_url)
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeTrump <- Volume

### Biden
query <- "'Biden' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
Volume <- read_csv(v_url)
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeBiden <- Volume

### Merging
VolumeTrumpBiden <- merge(VolumeTrump, VolumeBiden, by = "Date")
VolumeTrumpBiden$TrumpVolume <- VolumeTrumpBiden$Value.x
VolumeTrumpBiden$BidenVolume <- VolumeTrumpBiden$Value.y

#Plotting volume by date
library(plotly)
fig <- plot_ly(
  VolumeTrumpBiden,
  x = ~ Date,
  y = ~ BidenVolume,
  name = 'Biden',
  type = 'scatter',
  mode = 'lines',
  line = list(color = "#005F73")
)
fig <- fig %>% add_trace(
  y = ~ TrumpVolume,
  name = 'Trump',
  mode = 'lines',
  line = list(color = "#AE2012")
)
fig <-
  fig %>% layout(
    title = 'U.S. coverage volume, Biden v. Trump',
    xaxis = list(title = "Date",
                 showgrid = FALSE),
    yaxis = list(title = "Volume",
                 showgrid = TRUE)
  )
fig


### Saving the data to a local .csv file
write_csv(VolumeTrumpBiden, "VolumeTrumpBiden.csv")

# Paired-samples t-test

# Read the data
mydata <- VolumeTrumpBiden

# Specify the two variables involved
mydata$V1 <- mydata$TrumpVolume
mydata$V2 <- mydata$BidenVolume

# Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1

ggplot(mydata, aes(x = PairDifferences)) +
  geom_histogram(color = "black", fill = "dodgerblue") +
  geom_vline(aes(xintercept = mean(PairDifferences)))

# Get descriptive statistics for pair differences
mydata %>%
  select(PairDifferences) %>%
  summarise(
    count = n(),
    mean = mean(PairDifferences, na.rm = TRUE),
    sd = sd(PairDifferences, na.rm = TRUE),
    min = min(PairDifferences, na.rm = TRUE),
    max = max(PairDifferences, na.rm = TRUE)
  )

# If pair differences look non-normal, you can use a Shapiro-Wilk test to check
# whether their distribution differs significantly from normal. If the
# Shapiro-Wilk test p-value is less than 0.05, use a Wilcoxon signed rank test
# instead of a paired-samples t-test.

# Shapiro-Wilk test
# options(scipen = 999)
# shapiro.test(mydata$PairDifferences)

# If the pair distribution is non-normal, consider using a Wilcoxon signed
# rank test instead of a paired-samples t-test.

# mydata %>%
#   select(V1, V2) %>%
#   summarise_all(list(Mean = mean, SD = sd))
# wilcox.test(mydata$V1, mydata$V2, paired = TRUE)

# If the pair differences are normally distributed,
# though, you may use a paired-samples t-test.

mydata %>%
  select(V1, V2) %>%
  summarise_all(list(Mean = mean, SD = sd))
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
       paired = TRUE)

Biden v Trump coverage script

Ken Blake

2023-08-09