Setting things up

Load the required R packages, then read the data file into a data frame called “mydata.”

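# Install the required packages if they are missing
if (!require("dplyr"))
  install.packages("dplyr")
if (!require("tidyverse"))
  install.packages("tidyverse")
# Load the packages used below (ggplot2 is installed as part of tidyverse)
library(dplyr)
library(ggplot2)
# Prefer fixed over scientific notation in printed output
options(scipen = 999)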

# Read the data
mydata <- read.csv("GDELT_AI_Artists.csv") # Replace with YOURFILENAME.csv

Define the variables

In a paired-samples t-test, the DV and IV aren’t named directly in the code. Instead, two variables, “V1” and “V2,” hold the DV’s values for the two categories of the IV. In this case, “V1” gives the daily count of articles that include artists, and “V2” gives the daily count of articles that omit artists. The code then plots the distribution of the daily differences, computed as V2 minus V1.

# Specify the two variables involved
mydata$V1 <- mydata$AI.articles.including.artists
mydata$V2 <- mydata$AI.articles.omitting.artists

# Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1

ggplot(mydata, aes(x = PairDifferences)) +
  geom_histogram(color = "black", fill = "#1f78b4") +
  geom_vline(aes(xintercept = mean(PairDifferences)))

# Get descriptive statistics for pair differences
mydata %>%
  select(PairDifferences) %>%
  summarise(
    count = n(),
    mean = mean(PairDifferences, na.rm = TRUE),
    sd = sd(PairDifferences, na.rm = TRUE),
    min = min(PairDifferences, na.rm = TRUE),
    max = max(PairDifferences, na.rm = TRUE)
  )
##   count    mean       sd  min max
## 1   706 39.3881 32.63831 -185 232

Running the analysis

The distribution doesn’t look particularly normal. If the number of cases were small - say, 40 or fewer - an alternative test would be needed. But because we have data for a large number of days - 706, to be exact - we can go ahead and use a paired-samples t-test.
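For the small-sample case just described, one commonly used alternative is the Wilcoxon signed-rank test, which base R provides as wilcox.test(). The sketch below is only an illustration: the data frame “small” and its values are invented toy data, not the GDELT counts, and the choice of this particular test is an assumption rather than something the original analysis specifies.

```r
# Toy data, invented for illustration only (20 hypothetical days)
set.seed(1)
small <- data.frame(
  V1 = rpois(20, lambda = 7),   # hypothetical counts, articles including artists
  V2 = rpois(20, lambda = 46)   # hypothetical counts, articles omitting artists
)

# Wilcoxon signed-rank test on the paired values.
# exact = FALSE requests the normal approximation, which avoids
# warnings when the paired differences contain ties.
wilcox_result <- wilcox.test(small$V2, small$V1,
                             paired = TRUE,
                             exact = FALSE)
print(wilcox_result)
```

With 706 paired days, the large-sample reasoning above applies, so the analysis that follows stays with the paired-samples t-test.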

# Get the mean and standard deviation of each variable
mydata %>%
  select(V1, V2) %>%
  summarise_all(list(Mean = mean, SD = sd))
##    V1_Mean  V2_Mean    V1_SD    V2_SD
## 1 6.766289 46.15439 18.10789 34.09016
# Run the paired-samples t-test (scipen = 999 was already set during setup)
t.test(mydata$V2, mydata$V1,
       paired = TRUE)
## 
##  Paired t-test
## 
## data:  mydata$V2 and mydata$V1
## t = 32.066, df = 705, p-value < 0.00000000000000022
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  36.97642 41.79978
## sample estimates:
## mean difference 
##         39.3881
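
It may help to see why the pair differences matter so much here: a paired-samples t-test gives the same t statistic as a one-sample t-test on the differences, namely the mean difference divided by its standard error. The sketch below checks this on invented toy data (the “toy” data frame is hypothetical) and then recomputes t from the descriptive statistics reported earlier for PairDifferences.

```r
# Toy data, invented for illustration only (not the GDELT counts)
set.seed(42)
toy <- data.frame(
  V1 = rpois(50, lambda = 7),
  V2 = rpois(50, lambda = 46)
)
toy$PairDifferences <- toy$V2 - toy$V1

# Paired t-test on the two columns
paired_result <- t.test(toy$V2, toy$V1, paired = TRUE)

# One-sample t-test on the differences, testing against a mean of 0
one_sample_result <- t.test(toy$PairDifferences, mu = 0)

# The same statistic by hand: mean difference / (sd / sqrt(n))
n <- nrow(toy)
t_by_hand <- mean(toy$PairDifferences) /
  (sd(toy$PairDifferences) / sqrt(n))

# All three values should agree
print(c(paired = unname(paired_result$statistic),
        one_sample = unname(one_sample_result$statistic),
        by_hand = t_by_hand))

# Applying the same formula to the reported mean, sd, and count
# should land close to the reported t = 32.066
t_from_output <- 39.3881 / (32.63831 / sqrt(706))
print(t_from_output)
```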

Interpreting the results

There were, on average, about 46 articles a day that mentioned AI and images but excluded the views of artists. By contrast, there were only about seven articles a day that mentioned AI and images and included the views of artists. The t-test is significant (t = 32.066, df = 705, p < .001), indicating that a disparity this large is unlikely to be due to random variation. It is therefore more plausibly due to something systematic - like news media’s tendency to exclude artists’ views from articles about image-generating AI.