In this project, we analyze Groundhog Day predictions from a dataset dating back to 1880, focusing on the decades from the 1920s through the 2020s. By grouping the data into decades and comparing the frequency of “shadow seen” versus “no shadow,” we ask whether the balance of these outcomes has varied significantly over time. Using descriptive statistics and a Chi-Square Test of Independence, we assess whether the outcome is independent of decade, that is, whether the proportion of “shadow seen” predictions stays the same from decade to decade.
We filter out missing data, create a decade variable from the year, and then count the number of predictions per decade by shadow outcome. This allows us to analyze trends across time in groundhog predictions.
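# Load the tidyverse (dplyr, tidyr, tibble, ggplot2) used throughout.
# `predictions` is assumed to be already loaded from the TidyTuesday data.
library(tidyverse)

# Step 1: Create a cleaned version of the data with a new decade column
decade_full <- predictions %>%
  filter(!is.na(year), !is.na(shadow)) %>%
  mutate(decade = floor(year / 10) * 10) %>%   # e.g. 1987 -> 1980
  filter(decade >= 1920, decade <= 2020)       # keeps the 1920s through the 2020s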
# Step 2: Count predictions by decade and shadow status
summary_table <- decade_full %>%
  group_by(decade, shadow) %>%
  summarise(count = n(), .groups = "drop")
# Step 3: Convert to base data frame and view if needed
df_groundhog_2 <- as.data.frame(summary_table)
# Step 4: Create a contingency table of counts for the Chi-square test
observed_counts <- df_groundhog_2 %>%
  tidyr::pivot_wider(
    names_from = shadow,
    values_from = count,
    values_fill = list(count = 0)
  )
# First count predictions per decade and outcome.
# Note: unlike Step 1, no decade filter is applied here, so all decades in the data are included.
prediction_counts <- predictions %>%
  filter(!is.na(year), !is.na(shadow)) %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(decade, shadow) %>%
  summarise(count = n(), .groups = "drop")
# Then calculate descriptive stats based on counts
stats_by_shadow <- prediction_counts %>%
  group_by(shadow) %>%
  summarise(
    mean_count = mean(count),
    median_count = median(count),
    sd_count = sd(count),
    se_count = sd(count) / sqrt(n())   # standard error of the mean across decades
  )
stats_by_shadow
## # A tibble: 2 × 5
##   shadow mean_count median_count sd_count se_count
##   <lgl>       <dbl>        <dbl>    <dbl>    <dbl>
## 1 FALSE        54.3            9     81.3     23.5
## 2 TRUE         44.3           19     62.3     16.1
These statistics summarize the per-decade counts for each outcome. Because the code that produces them does not apply the 1920 to 2020 decade filter, they cover every decade in the dataset, and a decade with no predictions of a given outcome is omitted rather than counted as zero. On average, the number of “shadow seen” predictions per decade (mean = 44.3) was lower than the number of “no shadow” predictions (mean = 54.3). The median number of “shadow seen” predictions was 19, meaning that about half of the observed decades recorded 19 or fewer such predictions. In both groups the mean is far above the median, which indicates a right-skewed distribution: a few decades with many predictions pull the mean upward. The standard deviations (62.3 and 81.3) exceed the corresponding means, reflecting large decade-to-decade variability, and the standard errors (16.1 and 23.5) show that the mean count per decade is estimated with considerable uncertainty.
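To make the relationship between these summary measures concrete, the following minimal sketch computes them on a small set of hypothetical per-decade counts. The vector `toy_counts` is invented for illustration and is not taken from the data. It shows how a few large values pull the mean well above the median, and how the standard error follows from the standard deviation and the number of decades.

```r
# Hypothetical per-decade counts (NOT the real data): most decades have few
# predictions, a few decades have many.
toy_counts <- c(2, 3, 5, 8, 9, 12, 40, 95, 150)

toy_mean   <- mean(toy_counts)     # 36: pulled upward by the large values
toy_median <- median(toy_counts)   # 9: the middle value of the nine decades
toy_sd     <- sd(toy_counts)       # sample standard deviation

# Standard error of the mean: sd divided by the square root of the
# number of observations (here, decades).
toy_se <- toy_sd / sqrt(length(toy_counts))

c(mean = toy_mean, median = toy_median, sd = toy_sd, se = toy_se)
```

A large gap between the mean and the median, as in this toy example, is a sign that the median may describe a typical decade better than the mean does.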
ggplot(df_groundhog_2, aes(x = factor(decade), y = count, fill = factor(shadow))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Prediction Outcomes per Decade",
    x = "Decade",
    y = "Count of Predictions",
    fill = "Shadow Seen"
  ) +
  theme_classic() +
  scale_fill_manual(values = c("TRUE" = "#4A90E2", "FALSE" = "#E94B3C")) +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold")
  )
# Find total N
total_N <- sum(df_groundhog_2$count)
total_N
## [1] 1293
# Create contingency table
contingency_table <- df_groundhog_2 %>%
  pivot_wider(names_from = shadow, values_from = count, values_fill = 0) %>%
  column_to_rownames("decade")
# Chi-square test
chi_result <- chisq.test(as.matrix(contingency_table))
chi_result
##
## Pearson's Chi-squared test
##
## data: as.matrix(contingency_table)
## X-squared = 93.423, df = 10, p-value = 1.119e-15
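The statistic printed above comes from comparing observed counts with the counts expected if outcome and decade were unrelated. As a minimal sketch of that calculation, the example below works through a small hypothetical 3 × 2 table (the matrix `toy` and its values are invented, not taken from the data) and compares the hand-computed result with `chisq.test()`. It also checks that all expected counts are at least 5, a common rule of thumb for trusting the chi-square approximation.

```r
# Hypothetical counts (NOT the real data): rows are decades, columns are outcomes.
toy <- matrix(
  c(10, 20, 30,    # column "FALSE" (no shadow)
     5, 25, 60),   # column "TRUE"  (shadow seen)
  ncol = 2,
  dimnames = list(decade = c("1990", "2000", "2010"),
                  shadow = c("FALSE", "TRUE"))
)

# Expected count in each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# Pearson chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_sq <- sum((toy - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(toy) - 1) * (ncol(toy) - 1)

# Upper-tail p-value from the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Compare with the built-in test (no continuity correction is applied
# to tables larger than 2 x 2)
builtin <- chisq.test(toy)
all.equal(chi_sq, unname(builtin$statistic))

# Rule-of-thumb check on the expected counts
all(expected >= 5)
```

Applied to the real table, the same check could be run as `all(chi_result$expected >= 5)`, since `chisq.test()` returns the expected counts in its result.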
We conducted a Chi-Square Test of Independence to examine whether the groundhog’s shadow prediction outcomes (shadow seen vs. not seen) were independent of the decade. Under the null hypothesis, the proportion of “shadow seen” predictions is the same in every decade; a small p-value indicates that the proportions differ across decades. The mosaic plot below illustrates both the number of predictions in each decade and the proportion of each outcome, making it easier to identify any potential trends.
# Create mosaic plot; columns ordered FALSE, TRUE so the colors match the bar charts
mosaicplot(as.matrix(contingency_table[, c("FALSE", "TRUE")]),
           color = c("#E94B3C", "#4A90E2"), xlab = "Decade", ylab = "Shadow Seen",
           main = "Shadow Predictions by Decade", cex.axis = 0.63)
df_groundhog_2 %>%
  group_by(decade) %>%
  mutate(proportion = count / sum(count)) %>%
  ggplot(aes(x = factor(decade), y = proportion, fill = factor(shadow))) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_manual(values = c("TRUE" = "#4A90E2", "FALSE" = "#E94B3C")) +
  labs(
    title = "Proportion of Shadow Outcomes by Decade",
    x = "Decade",
    y = "Proportion",
    fill = "Shadow Seen"
  ) +
  theme_classic() +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))
This project explored historical Groundhog Day predictions to determine whether the distribution of “shadow” versus “no shadow” outcomes has shifted over time. By analyzing data from the 1920s through the 2020s and grouping predictions by decade, we identified clear variations in the frequency of outcomes.
The results were statistically significant, χ²(10, N = 1293) = 93.42, p < 0.001, suggesting that the distribution of groundhog predictions has changed over time. This finding indicates that the proportion of “shadow seen” predictions differs across decades rather than remaining constant over the 20th and early 21st centuries. The test does not, however, identify which decades differ or whether the change follows a consistent direction.
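Statistical significance alone says little about how strong the association is. As a rough supplement that is not part of the original analysis, the sketch below computes Cramér's V, a standard effect size for contingency tables, from the values reported above. The row count of 11 decades is an assumption inferred from df = 10 with two outcome columns, and the variable names are illustrative.

```r
# Values taken from the chisq.test() output and total N in this report
reported_chi_sq <- 93.423
n_total <- 1293

# Assumed table shape: 11 decades (1920 through 2020) by 2 outcomes,
# which is consistent with the reported df = 10.
n_rows <- 11
n_cols <- 2
df_reported <- (n_rows - 1) * (n_cols - 1)

# Cramér's V: sqrt(chi-square / (N * min(rows - 1, cols - 1)))
cramers_v <- sqrt(reported_chi_sq / (n_total * min(n_rows - 1, n_cols - 1)))

# Upper-tail p-value recomputed from the reported statistic
p_value <- pchisq(reported_chi_sq, df = df_reported, lower.tail = FALSE)

round(cramers_v, 3)   # 0.269
p_value < 0.001       # TRUE
```

By Cohen's commonly cited benchmarks for a table with two columns (roughly 0.1 small, 0.3 medium, 0.5 large), a value of about 0.27 would point to a small-to-moderate association, so the very small p-value partly reflects the large sample size.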
These results raise interesting questions about what might influence such variability: changes in weather patterns, cultural narratives, or even inconsistencies in how predictions are reported. Future research could compare predictions to actual weather records (e.g., cloud cover, temperature) or expand the dataset to include other regional groundhog events. Such work would clarify whether these traditions have any empirical accuracy or are simply cultural folklore evolving with time.
Johnson, F. (2024, January 30). Tidytuesday/Data/2024/2024-01-30/readme.md at main · rfordatascience/tidytuesday. GitHub. https://github.com/rfordatascience/tidytuesday/blob/main/data/2024/2024-01-30/readme.md
Wikimedia Foundation. (2025, April 4). Punxsutawney Phil. Wikipedia. https://en.wikipedia.org/wiki/Punxsutawney_Phil