Complete APNews Analysis - Red Sox vs. Yankees

Rationale

First-level agenda setting theory suggests that the amount of coverage a topic receives can influence the audience’s view on its importance. In other words, if a topic gets more coverage, it is deemed more important.

Based on this theory, the purpose of this analysis is to examine how the media allocates attention to competing topics, like the Boston Red Sox and the New York Yankees.

By counting the APNews articles about each team, we can see which team gets more attention and is perceived as more important.

Hypothesis

Weekly APNews coverage volume of the Boston Red Sox and the New York Yankees differed during the first nine months of 2025.

Variables & Method

The dependent variable in the analysis was the weekly APNews.com coverage volume of the two baseball teams. The independent variable was the team, either “Red Sox” or “Yankees.” Key words and phrases used to identify stories about the Red Sox were: “Red Sox”, “Boston Red Sox”, and “BoSox.” Key words and phrases used to identify stories about the Yankees were: “Yankees”, “New York Yankees”, and “Bronx Bombers.”

Eventually, a paired-samples t-test will be used to assess the statistical significance of coverage volume between the two teams.

Results & Discussion

The figure below summarizes each team’s weekly coverage across the first 40 weeks of the year. It appears that the coverage of the New York Yankees exceeded the coverage of the Boston Red Sox.

The most notable difference in APNews.com’s coverage appeared to be in week 26, which was the week from June 23 to June 29. The Yankees had a story count of 45, compared to only 18 stories about the Red Sox. I did some quick searching and Aaron Judge surpassed 30 home runs on June 29th, which made him the first player in MLB history to reach 30 home runs and 110 hits before the month of July. During that same time, the Red Sox dropped five of six games to the LA Dodgers and the Toronto Blue Jays, both teams that eventually advanced to the World Series.

The peak for Boston Red Sox stories came in week 25, with 25 published stories mentioning the team. Upon a deeper review of their schedule, the Red Sox traveled to San Francisco to face the Giants just one week after trading them their best hitter, Rafael Devers. The first game against his former team likely contributed to the spike in media attention.

Overall, the results suggest differing coverage of the two teams and possible nonrandom patterns in media attention, where spikes for one team may correspond to dips in coverage of the other.

The New York Yankees averaged about 19 articles per week across the analysis period. The Boston Red Sox averaged about 12 articles a week, a mean difference of about 7 articles. The Shapiro-Wilk Normality Test showed the data were not normal (p< 0.05), but with a case count of 40, a paired-sample t-test was still suitable for testing whether the average differences were statistically significant.

Below is a box plot of weekly articles for the Red Sox (V1) and Yankees (V2), followed by the results of the paired-samples t-test.

statistic	parameter	p.value	conf.low	conf.high	method
Paired-Samples t-Test
4.9976	39	0.0000	4.0329	9.5171	Paired t-test

V1_Mean	V2_Mean	V1_SD	V2_SD
Group Means and SDs (t-Test)
11.975	18.750	8.062	13.924

The significant t-test result (p<0.001) supports the hypothesis that the volume of weekly APNews.com coverage differed between the Red Sox and Yankees during the first nine months of 2025.

Code:

Here is the code that produced the figures and paired-samples t-test results.

# ============================================
# APNews text analysis (First-level agenda-setting theory version)
# ============================================

# ============================================
# --- Load required libraries ---
# ============================================

if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidytext")) install.packages("tidytext")

library(tidyverse)
library(tidytext)

# ============================================
# --- Load the APNews data ---
# ============================================

# Read the data from the web
FetchedData <- readRDS(url("https://github.com/drkblake/Data/raw/refs/heads/main/APNews.rds"))
# Save the data on your computer
saveRDS(FetchedData, file = "APNews.rds")
# remove the downloaded data from the environment
rm (FetchedData)

APNews <- readRDS("APNews.rds")

# ============================================
# --- Flag Topic1-related stories ---
# ============================================

# --- Define Topic1 phrases ---
phrases <- c(
  "Red Sox",
  "Boston Red Sox",
  "BoSox"
)

# --- Escape regex special characters ---
escaped_phrases <- str_replace_all(
  phrases,
  "([\\^$.|?*+()\\[\\]{}\\\\])",
  "\\\\\\1"
)

# --- Build whole-word/phrase regex pattern ---
pattern <- paste0("\\b", escaped_phrases, "\\b", collapse = "|")

# --- Apply matching to flag Topic1 stories ---
APNews <- APNews %>%
  mutate(
    Full.Text.clean = str_squish(Full.Text),  # normalize whitespace
    Topic1 = if_else(
      str_detect(Full.Text.clean, regex(pattern, ignore_case = TRUE)),
      "Yes",
      "No"
    )
  )

# ============================================
# --- Flag Topic2-related stories ---
# ============================================

# --- Define Topic2 phrases ---
phrases <- c(
  "Yankees",
  "New York Yankees",
  "Bronx Bombers"
)

# --- Escape regex special characters ---
escaped_phrases <- str_replace_all(
  phrases,
  "([\\^$.|?*+()\\[\\]{}\\\\])",
  "\\\\\\1"
)

# --- Build whole-word/phrase regex pattern ---
pattern <- paste0("\\b", escaped_phrases, "\\b", collapse = "|")

# --- Apply matching to flag Topic2 stories ---
APNews <- APNews %>%
  mutate(
    Full.Text.clean = str_squish(Full.Text),
    Topic2 = if_else(
      str_detect(Full.Text.clean, regex(pattern, ignore_case = TRUE)),
      "Yes",
      "No"
    )
  )

# ============================================
# --- Visualize weekly counts of Topic1- and Topic2-related stories ---
# ============================================

# --- Load plotly if needed ---
if (!require("plotly")) install.packages("plotly")
library(plotly)

# --- Summarize weekly counts for Topic1 = "Yes" ---
Topic1_weekly <- APNews %>%
  filter(Topic1 == "Yes") %>%
  group_by(Week) %>%
  summarize(Count = n(), .groups = "drop") %>%
  mutate(Topic = "Red Sox") # Note custom Topic1 label

# --- Summarize weekly counts for Topic2 = "Yes" ---
Topic2_weekly <- APNews %>%
  filter(Topic2 == "Yes") %>%
  group_by(Week) %>%
  summarize(Count = n(), .groups = "drop") %>%
  mutate(Topic = "Yankees") # Note custom Topic2 label

# --- Combine both summaries into one data frame ---
Weekly_counts <- bind_rows(Topic2_weekly, Topic1_weekly)

# --- Fill in missing combinations with zero counts ---
Weekly_counts <- Weekly_counts %>%
  tidyr::complete(
    Topic,
    Week = full_seq(range(Week), 1),  # generate all week numbers
    fill = list(Count = 0)
  ) %>%
  arrange(Topic, Week)

# --- Create interactive plotly line chart ---
AS1 <- plot_ly(
  data = Weekly_counts,
  x = ~Week,
  y = ~Count,
  color = ~Topic,
  colors = c("firebrick", "steelblue"),
  type = "scatter",
  mode = "lines+markers",
  line = list(width = 2),
  marker = list(size = 6)
) %>%
  layout(
    title = "Weekly Coverage: Red Sox vs. Yankees",
    xaxis = list(
      title = "Week Number (starting with Week 1 of 2025)",
      dtick = 1
    ),
    yaxis = list(title = "Number of Articles"),
    legend = list(title = list(text = "Topic")),
    hovermode = "x unified"
  )

# ============================================
# --- Show the chart ---
# ============================================

AS1

# ============================================================
#  Setup: Install and Load Required Packages
# ============================================================
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("plotly")) install.packages("plotly")
if (!require("gt")) install.packages("gt")
if (!require("gtExtras")) install.packages("gtExtras")
if (!require("broom")) install.packages("broom")

library(tidyverse)
library(plotly)
library(gt)
library(gtExtras)
library(broom)

options(scipen = 999)

# ============================================================
#  Data Import
# ============================================================
# Reshape to wide form

mydata <- Weekly_counts %>%
  pivot_wider(names_from = Topic, values_from = Count)

# Specify the two variables involved
mydata$V1 <- mydata$`Red Sox` # <== Customize this
mydata$V2 <- mydata$Yankees # <== Customize this

# ============================================================
#  Compute Pair Differences
# ============================================================
mydata$PairDifferences <- mydata$V2 - mydata$V1

# ============================================================
#  Interactive Histogram of Pair Differences
# ============================================================
hist_plot <- plot_ly(
  data = mydata,
  x = ~PairDifferences,
  type = "histogram",
  marker = list(color = "#1f78b4", line = list(color = "black", width = 1))
) %>%
  layout(
    title = "Distribution of Pair Differences",
    xaxis = list(title = "Pair Differences"),
    yaxis = list(title = "Count"),
    shapes = list(
      list(
        type = "line",
        x0 = mean(mydata$PairDifferences, na.rm = TRUE),
        x1 = mean(mydata$PairDifferences, na.rm = TRUE),
        y0 = 0,
        y1 = max(table(mydata$PairDifferences)),
        line = list(color = "red", dash = "dash")
      )
    )
  )

# ============================================================
#  Descriptive Statistics
# ============================================================
desc_stats <- mydata %>%
  summarise(
    count = n(),
    mean = mean(PairDifferences, na.rm = TRUE),
    sd = sd(PairDifferences, na.rm = TRUE),
    min = min(PairDifferences, na.rm = TRUE),
    max = max(PairDifferences, na.rm = TRUE)
  )

desc_table <- desc_stats %>%
  gt() %>%
  gt_theme_538() %>%
  tab_header(title = "Descriptive Statistics: Pair Differences") %>%
  fmt_number(columns = where(is.numeric), decimals = 3)

# ============================================================
#  Normality Test (Shapiro-Wilk)
# ============================================================
shapiro_res <- shapiro.test(mydata$PairDifferences)
shapiro_table <- tidy(shapiro_res) %>%
  select(statistic, p.value, method) %>%
  gt() %>%
  gt_theme_538() %>%
  tab_header(title = "Normality Test (Shapiro-Wilk)") %>%
  fmt_number(columns = c(statistic, p.value), decimals = 4) %>%
  tab_source_note(
    source_note = "If the P.VALUE is 0.05 or less, the number of pairs is fewer than 40, and the distribution of pair differences shows obvious non-normality or outliers, consider using the Wilcoxon Signed Rank Test results instead of the Paired-Samples t-Test results."
  )

# ============================================================
#  Reshape Data for Repeated-Measures Plot
# ============================================================
df_long <- mydata %>%
  pivot_longer(cols = c(V1, V2),
               names_to = "Measure",
               values_to = "Value")

# ============================================================
#  Repeated-Measures Boxplot (Interactive, with Means)
# ============================================================
group_means <- df_long %>%
  group_by(Measure) %>%
  summarise(mean_value = mean(Value), .groups = "drop")

boxplot_measures <- plot_ly() %>%
  add_trace(
    data = df_long,
    x = ~Measure, y = ~Value,
    type = "box",
    boxpoints = "outliers",   
    marker = list(color = "red", size = 4),
    line = list(color = "black"),
    fillcolor = "royalblue",
    name = ""
  ) %>%
  add_trace(
    data = group_means,
    x = ~Measure, y = ~mean_value,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "diamond", size = 9,
      color = "black", line = list(color = "white", width = 1)
    ),
    text = ~paste0("Mean = ", round(mean_value, 2)),
    hoverinfo = "text",
    name = "Group Mean"
  ) %>%
  layout(
    title = "Boxplot of Repeated Measures (V1 vs V2) with Means",
    xaxis = list(title = "Measure"),
    yaxis = list(title = "Value"),
    showlegend = FALSE
  )

# ============================================================
#  Parametric Test (Paired-Samples t-Test)
# ============================================================
t_res <- t.test(mydata$V2, mydata$V1, paired = TRUE)
t_table <- tidy(t_res) %>%
  select(statistic, parameter, p.value, conf.low, conf.high, method) %>%
  gt() %>%
  gt_theme_538() %>%
  tab_header(title = "Paired-Samples t-Test") %>%
  fmt_number(columns = c(statistic, p.value, conf.low, conf.high), decimals = 4)

t_summary <- mydata %>%
  select(V1, V2) %>%
  summarise_all(list(Mean = mean, SD = sd)) %>%
  gt() %>%
  gt_theme_538() %>%
  tab_header(title = "Group Means and SDs (t-Test)") %>%
  fmt_number(columns = everything(), decimals = 3)

# ============================================================
#  Nonparametric Test (Wilcoxon Signed Rank)
# ============================================================
wilcox_res <- wilcox.test(mydata$V1, mydata$V2, paired = TRUE,
                          exact = FALSE)
wilcox_table <- tidy(wilcox_res) %>%
  select(statistic, p.value, method) %>%
  gt() %>%
  gt_theme_538() %>%
  tab_header(title = "Wilcoxon Signed Rank Test") %>%
  fmt_number(columns = c(statistic, p.value), decimals = 4)

wilcox_summary <- mydata %>%
  select(V1, V2) %>%
  summarise_all(list(Mean = mean, SD = sd)) %>%
  gt() %>%
  gt_theme_538() %>%
  tab_header(title = "Group Means and SDs (Wilcoxon)") %>%
  fmt_number(columns = everything(), decimals = 3)

# ============================================================
#  Results Summary (in specified order)
# ============================================================
hist_plot
desc_table
shapiro_table
boxplot_measures
t_table
t_summary
wilcox_table
wilcox_summary