Which person - Donald Trump or Joe Biden - got more coverage in U.S. online news content between Jan. 1 and Dec. 31 of 2022? You might want to know if you were a media researcher, a political strategist, or just a news junkie. This R script will find and graph an answer for you. Furthermore, you can tweak it to get the same information about any two public figures, for any recent range of dates. The coverage data come from the GDELT 2.0 API.
First, install and load the tidyverse, plotly, and readr packages,
which include some tools the script will need. The if (!require())
check means that R will skip package installation if you already have
the package installed.
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("plotly")) install.packages("plotly")
if (!require("readr")) install.packages("readr")
library(tidyverse)
library(plotly)
library(readr)
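The same install-if-missing idea can also be written as one check over a list of package names. What follows is only a sketch of an alternative, not part of the original script: missing_packages is a hypothetical helper name, and it relies on base R's requireNamespace(), which reports whether a package is available without attaching it.

```r
# Hypothetical helper (not in the original script): return the names of
# any packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "plotly", "readr")
to_install <- missing_packages(needed)
to_install  # names of any packages that still need installing

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the sketch only reports what is missing.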
Now, define the beginning and end dates for the time period you want
to search. Use four digits for the year, two digits for the month, and
two digits for the day. For example, use 20220101 for Jan. 1, 2022. If
you edit the dates below, be sure to leave the quote marks around each
date.
startdate <- "20220101"
enddate <- "20221231"
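Because the dates are plain strings, a typo will not raise an error until the API request fails or returns something unexpected. The following is a minimal sketch of a sanity check, assuming you want to catch malformed dates before building the URL. The function name is_valid_gdelt_date is hypothetical and not part of the original script.

```r
# Hypothetical check (not in the original script): confirm that a date
# string has exactly eight digits and names a real calendar date.
is_valid_gdelt_date <- function(x) {
  grepl("^[0-9]{8}$", x) && !is.na(as.Date(x, format = "%Y%m%d"))
}

is_valid_gdelt_date("20220101")    # TRUE
is_valid_gdelt_date("2022-01-01")  # FALSE: contains dashes
is_valid_gdelt_date("20221301")    # FALSE: there is no month 13

# The start date should also not fall after the end date.
as.Date("20220101", format = "%Y%m%d") <= as.Date("20221231", format = "%Y%m%d")
```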
This code is set to search for stories that mention “Donald Trump” and
that were published by U.S.-based online news sources during the
specified time frame. You can change Donald Trump to another search
term, if you like. Most of the code is about constructing and encoding
a URL that the GDELT 2.0 API will recognize and respond to, correctly
formatting the data’s “Date” variable, and extracting the data into a
data frame called “VolumeTrump.”
query <- "'Donald Trump' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
## [1] "https://api.gdeltproject.org/api/v2/doc/doc?query='Donald%20Trump'%20SourceCountry:US&mode=timelinevolinfo&startdatetime=20220101000000&enddatetime=20221231000000&format=CSV"
Volume <- read_csv(v_url)
## Rows: 365 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (21): Series, TopArtURL1, TopArtTitle1, TopArtURL2, TopArtTitle2, TopAr...
## dbl (1): Value
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeTrump <- Volume
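The URL-building steps above can be wrapped in a small function, so that changing the search term or dates means changing only the arguments. This is a sketch rather than the script's own method: build_volume_url is a hypothetical name, and the function simply reuses the same URL pieces and the same URLencode() call shown above. It does not contact the API.

```r
# Hypothetical helper (not in the original script): assemble and encode
# a GDELT timelinevolinfo request URL for one search term.
build_volume_url <- function(term, startdate, enddate) {
  query <- paste0("'", term, "' SourceCountry:US")
  text_url <- paste0(
    "https://api.gdeltproject.org/api/v2/doc/doc?query=", query,
    "&mode=timelinevolinfo",
    "&startdatetime=", startdate, "000000",
    "&enddatetime=", enddate, "000000",
    "&format=CSV"
  )
  URLencode(text_url)  # spaces become %20, as in the output above
}

build_volume_url("Donald Trump", "20220101", "20221231")
```

Passing the result to read_csv() would then fetch the data exactly as the code above does.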
This is a ‘rinse and repeat’ of the above code. This time, though, it
searches for stories that mention Joe Biden and extracts the data to a
data frame called “VolumeBiden.”
query <- "'Joe Biden' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
## [1] "https://api.gdeltproject.org/api/v2/doc/doc?query='Joe%20Biden'%20SourceCountry:US&mode=timelinevolinfo&startdatetime=20220101000000&enddatetime=20221231000000&format=CSV"
Volume <- read_csv(v_url)
## Rows: 365 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (21): Series, TopArtURL1, TopArtTitle1, TopArtURL2, TopArtTitle2, TopAr...
## dbl (1): Value
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeBiden <- Volume
This block of code merges the two data frames and interactively plots, by day, the percentage of all GDELT-monitored articles mentioning “Donald Trump” and the percentage of all GDELT-monitored articles mentioning “Joe Biden.”
## Merging
VolumeTrumpBiden <- merge(VolumeTrump, VolumeBiden, by = "Date")
VolumeTrumpBiden$TrumpVolume <- VolumeTrumpBiden$Value.x
VolumeTrumpBiden$BidenVolume <- VolumeTrumpBiden$Value.y
#Plotting volume by date
library(plotly)
fig <- plot_ly(VolumeTrumpBiden, x = ~Date, y = ~TrumpVolume,
name = 'Trump',
type = 'scatter',
mode = 'lines+markers')
fig <- fig %>% add_trace(y = ~BidenVolume,
name = 'Biden',
mode = 'lines+markers')
fig
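The Value.x and Value.y names in the merge step come from merge() itself: when both data frames share a column name other than the key, it appends the suffixes .x and .y. Here is a small illustration on made-up data. The toy values and the custom suffixes are assumptions for demonstration and are not used in the script.

```r
# Toy stand-ins for VolumeTrump and VolumeBiden (made-up values).
trump <- data.frame(
  Date  = as.Date(c("2022-01-01", "2022-01-02")),
  Value = c(0.8, 0.9)
)
biden <- data.frame(
  Date  = as.Date(c("2022-01-01", "2022-01-02")),
  Value = c(1.1, 1.0)
)

# Default suffixes: the shared Value column becomes Value.x and Value.y.
names(merge(trump, biden, by = "Date"))

# Custom suffixes give self-describing column names instead.
merged <- merge(trump, biden, by = "Date", suffixes = c(".Trump", ".Biden"))
names(merged)
```

With the default suffixes, .x always refers to the first data frame passed to merge() and .y to the second, which is why Value.x holds the Trump values in the script.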
This code will export the merged data in comma-separated value format and save the exported file in the same directory as the script.
write_csv(VolumeTrumpBiden,"VolumeTrumpBiden.csv")
Finally, here’s paired-samples t-test code for evaluating the null
hypothesis that daily coverage of Biden equaled daily coverage of Trump
during the specified time period. In the code and output, V1 represents
Biden’s volume, and V2 represents Trump’s volume.
mydata <- VolumeTrumpBiden
mydata$V1 <- mydata$BidenVolume
mydata$V2 <- mydata$TrumpVolume
ggplot(mydata, aes(x = V1))+geom_histogram()+
geom_vline(xintercept = mean(mydata$V1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mydata, aes(x = V2))+geom_histogram()+
geom_vline(xintercept = mean(mydata$V2))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mean.V1 <- mean(mydata$V1, na.rm = TRUE)
mean.V2 <- mean(mydata$V2, na.rm = TRUE)
mean.V1
## [1] 1.057778
mean.V2
## [1] 0.8064332
options(scipen = 999)
t.test(mydata$V1, mydata$V2,
paired = TRUE)
##
## Paired t-test
##
## data: mydata$V1 and mydata$V2
## t = 14.898, df = 364, p-value < 0.00000000000000022
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.2181688 0.2845205
## sample estimates:
## mean difference
## 0.2513447
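To see where these numbers come from, note that a paired t-test is a one-sample test on the day-by-day differences. In the output above, the 364 degrees of freedom are the 365 days minus one, and the mean difference is the gap between the two means printed earlier. The sketch below recomputes the t statistic by hand on a few made-up values, which are illustrative only and are not GDELT data, and checks it against t.test().

```r
# Made-up daily values for illustration only.
v1 <- c(1.10, 0.95, 1.20, 1.05, 0.98)
v2 <- c(0.80, 0.90, 0.85, 0.70, 0.88)

d <- v1 - v2                              # day-by-day differences
n <- length(d)
t_manual <- mean(d) / (sd(d) / sqrt(n))   # t statistic
df_manual <- n - 1                        # degrees of freedom

result <- t.test(v1, v2, paired = TRUE)
all.equal(t_manual, unname(result$statistic))  # TRUE
```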
Just want the script? Here it is, all in one piece, ready to copy and paste.
#Installing and loading the tidyverse, plotly, and readr packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("plotly")) install.packages("plotly")
if (!require("readr")) install.packages("readr")
library(tidyverse)
library(plotly)
library(readr)
### Date range
startdate <- "20220101"
enddate <- "20221231"
### Trump
query <- "'Donald Trump' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
Volume <- read_csv(v_url)
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeTrump <- Volume
### Biden
query <- "'Joe Biden' SourceCountry:US"
#Building the Volume dataframe
vp1 <- "https://api.gdeltproject.org/api/v2/doc/doc?query="
vp2 <- "&mode=timelinevolinfo&startdatetime="
vp3 <- "000000&enddatetime="
vp4 <- "000000&format=CSV"
text_v_url <- paste0(vp1, query, vp2, startdate, vp3, enddate, vp4)
v_url <- URLencode(text_v_url)
v_url
Volume <- read_csv(v_url)
Volume$Date <- as.Date(Volume$Date, "%Y-%m-%d")
VolumeBiden <- Volume
### Merging
VolumeTrumpBiden <- merge(VolumeTrump, VolumeBiden, by = "Date")
VolumeTrumpBiden$TrumpVolume <- VolumeTrumpBiden$Value.x
VolumeTrumpBiden$BidenVolume <- VolumeTrumpBiden$Value.y
#Plotting volume by date
library(plotly)
fig <- plot_ly(VolumeTrumpBiden, x = ~Date, y = ~TrumpVolume,
name = 'Trump',
type = 'scatter',
mode = 'lines+markers')
fig <- fig %>% add_trace(y = ~BidenVolume,
name = 'Biden',
mode = 'lines+markers')
fig
### Saving the data to a local .csv file
write_csv(VolumeTrumpBiden,"VolumeTrumpBiden.csv")
### Running a paired-samples t-test
mydata <- VolumeTrumpBiden
mydata$V1 <- mydata$BidenVolume
mydata$V2 <- mydata$TrumpVolume
ggplot(mydata, aes(x = V1))+geom_histogram()+
geom_vline(xintercept = mean(mydata$V1))
ggplot(mydata, aes(x = V2))+geom_histogram()+
geom_vline(xintercept = mean(mydata$V2))
mean.V1 <- mean(mydata$V1, na.rm = TRUE)
mean.V2 <- mean(mydata$V2, na.rm = TRUE)
mean.V1
mean.V2
options(scipen = 999)
t.test(mydata$V1, mydata$V2,
paired = TRUE)