Which person - Donald Trump or Joe Biden - got more coverage in U.S. online news content between Jan. 1 and Dec. 31 of 2022? You might want to know if you were a media researcher, a political strategist, or just a news junkie. This R script will find and graph an answer for you. Furthermore, you can tweak it to get the same information about any two public figures, for any recent range of dates. The coverage data come from the GDELT 2.0 API.
First, install and load the tidyverse, plotly, and readr packages, which include some tools the script will need. The if (!require()) code means that R will skip package installation if you already have the package installed.
# Installing and loading the required packages
if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("dplyr"))
  install.packages("dplyr")
if (!require("plotly"))
  install.packages("plotly")
if (!require("readr"))
  install.packages("readr")
library(tidyverse)
library(dplyr)
library(plotly)
library(ggplot2)
library(readr)
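If you would rather check a whole list of packages at once, the same install-if-missing idea can be sketched as below. This is only an illustration: missing_packages() is a hypothetical helper name that is not part of the original script, and it uses base R's requireNamespace() to test whether each package is installed without attaching it.

```r
# Hypothetical helper (not in the original script): return the names of
# any packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# List whichever of these packages still need installing
to_install <- missing_packages(c("tidyverse", "plotly"))
to_install

# Uncomment the next line to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the sketch only reports what is missing.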
Now, define the beginning and end dates for the time period you want to search. Use four digits for the year, two digits for the month, and two digits for the day. For example, use 20220101 for Jan. 1, 2022. If you edit the dates below, be sure to leave the quote marks around each date.
startdate <- "20220101"
enddate <- "20221231"
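If your dates start out in another form, R can produce the eight-digit string for you. The sketch below assumes a hypothetical helper named to_gdelt_date(), which is not part of the original script; it simply wraps base R's as.Date() and format().

```r
# Hypothetical helper: turn a date into the YYYYMMDD string used above.
to_gdelt_date <- function(d) format(as.Date(d), "%Y%m%d")

to_gdelt_date("2022-01-01")
# "20220101"

# A date 30 days before today, in the same format
to_gdelt_date(Sys.Date() - 30)
```

Either result could be assigned to startdate or enddate in place of a hand-typed string.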
This code is set to search for stories that mention “Trump” and that were published by U.S.-based online news sources during the specified time frame. You can change Trump to another search term, if you like. Most of the code is about constructing and encoding a URL that the GDELT 2.0 API will recognize and respond to, correctly formatting the data’s “Date” variable, and extracting the data into a data frame called “VolumeTrump.”
### Trump
query <- "'Trump' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
## [1] "https://api.gdeltproject.org/api/v2/doc/doc?query='Trump'%20SourceCountry:US&mode=timelinevolinfo&startdatetime=20220101000000&enddatetime=20221231000000&format=CSV"
Volume <- read_csv(v_url)
## Rows: 365 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (21): Series, TopArtURL1, TopArtTitle1, TopArtURL2, TopArtTitle2, TopAr...
## dbl (1): Value
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeTrump <- Volume
Next comes a ‘rinse and repeat’ of the above code. This time, though, it searches for stories that mention Biden and extracts the data to a data frame called “VolumeBiden.”
### Biden
query <- "'Biden' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
## [1] "https://api.gdeltproject.org/api/v2/doc/doc?query='Biden'%20SourceCountry:US&mode=timelinevolinfo&startdatetime=20220101000000&enddatetime=20221231000000&format=CSV"
Volume <- read_csv(v_url)
## Rows: 365 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (21): Series, TopArtURL1, TopArtTitle1, TopArtURL2, TopArtTitle2, TopAr...
## dbl (1): Value
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeBiden <- Volume
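The Trump and Biden blocks differ only in the search term, so the URL-building step could be wrapped in a function. The sketch below is one possible way to do that, not the script's own approach: build_volume_url() is a hypothetical name, and it only assembles and encodes the URL from the same pieces used above. It does not download anything.

```r
# Hypothetical helper: build the encoded GDELT request URL for one search term.
# It reuses the URL pieces from the script; no data are fetched here.
build_volume_url <- function(term, startdate, enddate) {
  query <- paste0("'", term, "' SourceCountry:US")
  text_url <- paste0(
    "https://api.gdeltproject.org/api/v2/doc/doc?query=", query,
    "&mode=timelinevolinfo&startdatetime=", startdate,
    "000000&enddatetime=", enddate,
    "000000&format=CSV"
  )
  # URLencode() replaces the space in the query with %20
  URLencode(text_url)
}

build_volume_url("Biden", "20220101", "20221231")
```

Passing the returned URL to read_csv(), as the script does, would then retrieve the data for that term.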
This block of code merges the two data frames and interactively plots, by day, the percentage of all GDELT-monitored articles mentioning “Trump,” and the percentage of all GDELT-monitored articles mentioning “Biden.” In keeping with U.S. political news convention, it colors Biden’s line blue and Trump’s line red.
### Merging
VolumeTrumpBiden <- merge(VolumeTrump, VolumeBiden, by = "Date")
VolumeTrumpBiden$TrumpVolume <- VolumeTrumpBiden$Value.x
VolumeTrumpBiden$BidenVolume <- VolumeTrumpBiden$Value.y
#Plotting volume by date
library(plotly)
fig <- plot_ly(
  VolumeTrumpBiden,
  x = ~ Date,
  y = ~ BidenVolume,
  name = 'Biden',
  type = 'scatter',
  mode = 'lines',
  line = list(color = "#005F73")
)
fig <- fig %>% add_trace(
  y = ~ TrumpVolume,
  name = 'Trump',
  mode = 'lines',
  line = list(color = "#AE2012")
)
fig <-
  fig %>% layout(
    title = 'U.S. coverage volume, Biden v. Trump',
    xaxis = list(title = "Date",
                 showgrid = FALSE),
    yaxis = list(title = "Volume",
                 showgrid = TRUE)
  )
fig
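If you are wondering where Value.x and Value.y in the merging code come from: when two data frames share a column name other than the one used for matching, merge() adds the suffixes .x (first data frame) and .y (second data frame). The toy example below uses made-up numbers, not GDELT data, purely to show that renaming.

```r
# Toy data, invented for illustration only (not GDELT output)
a <- data.frame(Date = as.Date(c("2022-01-01", "2022-01-02")),
                Value = c(1.0, 1.2))
b <- data.frame(Date = as.Date(c("2022-01-01", "2022-01-02")),
                Value = c(1.5, 1.4))

# Match rows on Date; the two Value columns get .x and .y suffixes
merged <- merge(a, b, by = "Date")
names(merged)
# "Date" "Value.x" "Value.y"
```

In the script, VolumeTrump is the first data frame, so Value.x holds Trump's figures and Value.y holds Biden's.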
This code will export the merged data in comma-separated value format and save the exported file in R’s current working directory (often the same directory as the script).
write_csv(VolumeTrumpBiden, "VolumeTrumpBiden.csv")
Finally, here’s paired-samples t-test code for evaluating the null hypothesis that daily coverage of Biden equaled daily coverage of Trump during the specified time period. In the code and output, V1 represents Trump’s volume, and V2 represents Biden’s volume.
# Read the data
mydata <- VolumeTrumpBiden
# Specify the two variables involved
mydata$V1 <- mydata$TrumpVolume
mydata$V2 <- mydata$BidenVolume
# Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1
ggplot(mydata, aes(x = PairDifferences)) +
  geom_histogram(color = "black", fill = "dodgerblue") +
  geom_vline(aes(xintercept = mean(PairDifferences)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Get descriptive statistics for pair differences
mydata %>%
  select(PairDifferences) %>%
  summarise(
    count = n(),
    mean = mean(PairDifferences, na.rm = TRUE),
    sd = sd(PairDifferences, na.rm = TRUE),
    min = min(PairDifferences, na.rm = TRUE),
    max = max(PairDifferences, na.rm = TRUE)
  )
## count mean sd min max
## 1 365 0.5332984 0.4021545 -0.7973 2.3018
A paired-samples t-test assumes that the pair differences are normally distributed. The blue histogram offers a visual way to check that assumption. If the histogram looked non-normal, the analysis would require an alternative procedure that works with such data. In this case, though, the distribution looks reasonably normal. As a result, the paired-samples t-test can proceed.
# If pair differences look non-normal, you can use a Shapiro-Wilk test to check
# whether their distribution differs significantly from normal. If the
# Shapiro-Wilk test p-value is less than 0.05, use a Wilcoxon signed rank test
# instead of a paired-samples t-test.
# Shapiro-Wilk test
# options(scipen = 999)
# shapiro.test(mydata$PairDifferences)
# If the pair distribution is non-normal, consider using a Wilcoxon signed rank
# test instead of a paired-samples t-test.
# mydata %>%
# select(V1, V2) %>%
# summarise_all(list(Mean = mean, SD = sd))
# wilcox.test(mydata$V1, mydata$V2, paired = TRUE)
# If the pair differences are normally distributed,
# though, you may use a paired-samples t-test.
mydata %>%
  select(V1, V2) %>%
  summarise_all(list(Mean = mean, SD = sd))
## V1_Mean V2_Mean V1_SD V2_SD
## 1 1.065326 1.598624 0.3238287 0.3729211
It looks like Trump’s coverage volume averaged 1.07, while Biden’s coverage volume averaged 1.60.
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
       paired = TRUE)
##
## Paired t-test
##
## data: mydata$V2 and mydata$V1
## t = 25.335, df = 364, p-value < 0.00000000000000022
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.4919040 0.5746927
## sample estimates:
## mean difference
## 0.5332984
On average, Biden’s daily coverage volume exceeded Trump’s by 0.53 percentage points. The paired-samples t-test found the difference to be statistically significant (t(364) = 25.3, p < .001).
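If you would like to see where the t value comes from, a paired-samples t-test is equivalent to a one-sample t-test on the pair differences. The sketch below recomputes the statistic and the 95 percent confidence interval from the descriptive statistics printed earlier (count, mean, and sd of the pair differences). The variable names are arbitrary, and the results should agree with the t.test() output up to rounding.

```r
# Summary statistics copied from the pair-differences output above
n <- 365
mean_diff <- 0.5332984
sd_diff <- 0.4021545

# Standard error of the mean difference
se <- sd_diff / sqrt(n)

# Paired t statistic: mean difference divided by its standard error
t_stat <- mean_diff / se

# 95 percent confidence interval, using the t distribution with n - 1 df
ci <- mean_diff + c(-1, 1) * qt(0.975, df = n - 1) * se

round(t_stat, 3)
round(ci, 4)
```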
Just want the script? Here it is, all in one piece, ready to copy and paste.
# Installing and loading the required packages
if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("dplyr"))
  install.packages("dplyr")
if (!require("plotly"))
  install.packages("plotly")
if (!require("readr"))
  install.packages("readr")
library(tidyverse)
library(dplyr)
library(plotly)
library(ggplot2)
library(readr)
### Date range
startdate <- "20220101"
enddate <- "20221231"
### Trump
query <- "'Trump' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
Volume <- read_csv(v_url)
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeTrump <- Volume
### Biden
query <- "'Biden' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
Volume <- read_csv(v_url)
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeBiden <- Volume
### Merging
VolumeTrumpBiden <- merge(VolumeTrump, VolumeBiden, by = "Date")
VolumeTrumpBiden$TrumpVolume <- VolumeTrumpBiden$Value.x
VolumeTrumpBiden$BidenVolume <- VolumeTrumpBiden$Value.y
#Plotting volume by date
library(plotly)
fig <- plot_ly(
  VolumeTrumpBiden,
  x = ~ Date,
  y = ~ BidenVolume,
  name = 'Biden',
  type = 'scatter',
  mode = 'lines',
  line = list(color = "#005F73")
)
fig <- fig %>% add_trace(
  y = ~ TrumpVolume,
  name = 'Trump',
  mode = 'lines',
  line = list(color = "#AE2012")
)
fig <-
  fig %>% layout(
    title = 'U.S. coverage volume, Biden v. Trump',
    xaxis = list(title = "Date",
                 showgrid = FALSE),
    yaxis = list(title = "Volume",
                 showgrid = TRUE)
  )
fig
### Saving the data to a local .csv file
write_csv(VolumeTrumpBiden, "VolumeTrumpBiden.csv")
# Paired-samples t-test
# Read the data
mydata <- VolumeTrumpBiden
# Specify the two variables involved
mydata$V1 <- mydata$TrumpVolume
mydata$V2 <- mydata$BidenVolume
# Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1
ggplot(mydata, aes(x = PairDifferences)) +
  geom_histogram(color = "black", fill = "dodgerblue") +
  geom_vline(aes(xintercept = mean(PairDifferences)))
# Get descriptive statistics for pair differences
mydata %>%
  select(PairDifferences) %>%
  summarise(
    count = n(),
    mean = mean(PairDifferences, na.rm = TRUE),
    sd = sd(PairDifferences, na.rm = TRUE),
    min = min(PairDifferences, na.rm = TRUE),
    max = max(PairDifferences, na.rm = TRUE)
  )
# If pair differences look non-normal, you can use a Shapiro-Wilk test to check
# whether their distribution differs significantly from normal. If the
# Shapiro-Wilk test p-value is less than 0.05, use a Wilcoxon signed rank test
# instead of a paired-samples t-test.
# Shapiro-Wilk test
# options(scipen = 999)
# shapiro.test(mydata$PairDifferences)
# If the pair distribution is non-normal, consider using a Wilcoxon signed rank
# test instead of a paired-samples t-test.
# mydata %>%
# select(V1, V2) %>%
# summarise_all(list(Mean = mean, SD = sd))
# wilcox.test(mydata$V1, mydata$V2, paired = TRUE)
# If the pair differences are normally distributed,
# though, you may use a paired-samples t-test.
mydata %>%
  select(V1, V2) %>%
  summarise_all(list(Mean = mean, SD = sd))
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
       paired = TRUE)