Executive Summary

This analysis investigates the distributional characteristics of a global climate dataset using probability-based anomaly detection. By grouping data across multiple categorical dimensions and calculating group probabilities, we identify rare combinations that represent unusual environmental states or reporting patterns. Our investigation reveals critical insights about global climate patterns, energy transitions, and geographic sampling biases.


1. Data Loading and Preprocessing

# Load required libraries
library(tidyverse)
library(knitr)
library(scales)

# Set theme for consistent visualization
theme_set(theme_minimal(base_size = 12))
# Load the climate change dataset
df <- read.csv("climate_change_dataset.csv")

# Display dataset structure
str(df)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Year                       : int  2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
##  $ Country                    : chr  "UK" "USA" "France" "Argentina" ...
##  $ Avg.Temperature...C.       : num  8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
##  $ CO2.Emissions..Tons.Capita.: num  9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
##  $ Sea.Level.Rise..mm.        : num  3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
##  $ Rainfall..mm.              : int  1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
##  $ Population                 : int  530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
##  $ Renewable.Energy....       : num  20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
##  $ Extreme.Weather.Events     : int  14 8 9 7 4 12 10 1 4 5 ...
##  $ Forest.Area....            : num  59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
# Summary statistics
summary(df)
##       Year        Country          Avg.Temperature...C.
##  Min.   :2000   Length:1000        Min.   : 5.00       
##  1st Qu.:2005   Class :character   1st Qu.:12.18       
##  Median :2012   Mode  :character   Median :20.10       
##  Mean   :2011                      Mean   :19.88       
##  3rd Qu.:2018                      3rd Qu.:27.23       
##  Max.   :2023                      Max.   :34.90       
##  CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm. 
##  Min.   : 0.500              Min.   :1.00        Min.   : 501  
##  1st Qu.: 5.575              1st Qu.:2.00        1st Qu.:1099  
##  Median :10.700              Median :3.00        Median :1726  
##  Mean   :10.426              Mean   :3.01        Mean   :1739  
##  3rd Qu.:15.400              3rd Qu.:4.00        3rd Qu.:2362  
##  Max.   :20.000              Max.   :5.00        Max.   :2999  
##    Population        Renewable.Energy.... Extreme.Weather.Events
##  Min.   :3.661e+06   Min.   : 5.10        Min.   : 0.000        
##  1st Qu.:3.436e+08   1st Qu.:16.10        1st Qu.: 3.000        
##  Median :7.131e+08   Median :27.15        Median : 8.000        
##  Mean   :7.054e+08   Mean   :27.30        Mean   : 7.291        
##  3rd Qu.:1.074e+09   3rd Qu.:38.92        3rd Qu.:11.000        
##  Max.   :1.397e+09   Max.   :50.00        Max.   :14.000        
##  Forest.Area....
##  Min.   :10.10  
##  1st Qu.:25.60  
##  Median :41.15  
##  Mean   :40.57  
##  3rd Qu.:55.80  
##  Max.   :70.00
# Check for missing values
cat("\nMissing values per column:\n")
## 
## Missing values per column:
colSums(is.na(df))
##                        Year                     Country 
##                           0                           0 
##        Avg.Temperature...C. CO2.Emissions..Tons.Capita. 
##                           0                           0 
##         Sea.Level.Rise..mm.               Rainfall..mm. 
##                           0                           0 
##                  Population        Renewable.Energy.... 
##                           0                           0 
##      Extreme.Weather.Events             Forest.Area.... 
##                           0                           0

Data Preprocessing: Creating Categorical Variables

To enable probability analysis, we discretize continuous variables into categorical bins using the cut() function.

# Create categorical bins for analysis
df <- df %>%
  mutate(
    # Categorize rainfall into three levels
    Rainfall_Cat = cut(Rainfall..mm., 
                       breaks = 3, 
                       labels = c("Low", "Medium", "High")),
    
    # Categorize renewable energy adoption
    Renewable_Cat = cut(Renewable.Energy...., 
                        breaks = 3, 
                        labels = c("Early-Stage", "Transitioning", "Advanced")),
    
    # Categorize temperature
    Temp_Cat = cut(Avg.Temperature...C., 
                   breaks = 3,
                   labels = c("Cool", "Moderate", "Warm")),
    
    # Categorize CO2 emissions
    CO2_Cat = cut(CO2.Emissions..Tons.Capita., 
                  breaks = 3,
                  labels = c("Low-Emissions", "Medium-Emissions", "High-Emissions"))
  )

# Preview the transformed data
head(df, 10) %>% kable(caption = "Sample of Preprocessed Data")
Sample of Preprocessed Data
Year Country Avg.Temperature…C. CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm. Population Renewable.Energy…. Extreme.Weather.Events Forest.Area…. Rainfall_Cat Renewable_Cat Temp_Cat CO2_Cat
2006 UK 8.9 9.3 3.1 1441 530911230 20.4 14 59.8 Medium Transitioning Cool Medium-Emissions
2019 USA 31.0 4.8 4.2 2407 107364344 49.2 8 31.0 High Advanced Warm Low-Emissions
2014 France 33.9 2.8 2.2 1241 441101758 33.3 9 35.5 Low Transitioning Warm Low-Emissions
2010 Argentina 5.9 1.8 3.2 1892 1069669579 23.7 7 17.7 Medium Transitioning Cool Low-Emissions
2007 Germany 26.9 5.6 2.4 1743 124079175 12.5 4 17.4 Medium Early-Stage Warm Low-Emissions
2020 China 32.3 1.4 2.7 2100 1202028857 49.4 12 47.2 Medium Advanced Warm Low-Emissions
2006 Argentina 30.7 11.6 3.9 1755 586706107 41.9 10 50.5 Medium Advanced Warm Medium-Emissions
2018 South Africa 33.9 6.0 4.5 827 83947380 17.7 1 56.6 Low Early-Stage Warm Low-Emissions
2022 UK 27.8 16.6 1.5 1966 980305187 8.2 4 43.4 Medium Early-Stage Warm High-Emissions
2010 Australia 18.3 1.9 3.5 2599 849496137 7.5 5 48.7 High Early-Stage Moderate Low-Emissions

2. Investigation 1: Geographic Distribution and Thermal Profiles

Objective

Analyze which countries are most and least represented in the dataset, and examine their average temperature profiles to identify geographic sampling biases.

# Group by Country and calculate statistics
df_1 <- df %>%
  group_by(Country) %>%
  summarise(
    Avg_Temp = mean(Avg.Temperature...C., na.rm = TRUE),
    SD_Temp = sd(Avg.Temperature...C., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_GEOGRAPHY", "Standard")
  ) %>%
  arrange(Group_Count)

# Display the results
df_1 %>% kable(caption = "Country-Level Analysis: Probability Distribution", 
               digits = 3)
Country-Level Analysis: Probability Distribution
Country Avg_Temp SD_Temp Group_Count Probability Tag
Mexico 20.696 9.621 55 0.055 RARE_GEOGRAPHY
Australia 19.449 7.829 57 0.057 Standard
Germany 20.289 8.041 61 0.061 Standard
Japan 20.454 9.132 63 0.063 Standard
UK 18.491 8.200 65 0.065 Standard
France 19.383 8.884 66 0.066 Standard
Argentina 19.299 8.330 67 0.067 Standard
Brazil 20.836 8.970 67 0.067 Standard
Canada 20.012 8.445 67 0.067 Standard
China 20.282 8.788 67 0.067 Standard
India 19.764 8.669 70 0.070 Standard
South Africa 20.738 9.170 73 0.073 Standard
USA 19.049 8.341 73 0.073 Standard
Russia 20.719 7.833 74 0.074 Standard
Indonesia 18.919 8.244 75 0.075 Standard

Analysis of Investigation 1

Key Findings:

# Identify the rarest country
rarest_country <- df_1 %>% filter(Tag == "RARE_GEOGRAPHY")
most_common_country <- df_1 %>% filter(Group_Count == max(Group_Count))

cat("Rarest Country:", rarest_country$Country, "\n")
## Rarest Country: Mexico
cat("Probability:", round(rarest_country$Probability, 4), "\n")
## Probability: 0.055
cat("Count:", rarest_country$Group_Count, "\n\n")
## Count: 55
cat("Most Common Country:", most_common_country$Country, "\n")
## Most Common Country: Indonesia
cat("Probability:", round(most_common_country$Probability, 4), "\n")
## Probability: 0.075
cat("Count:", most_common_country$Group_Count, "\n")
## Count: 75

Significance:

The rarest country has a probability of 0.055, meaning if we randomly select a row from the dataset, there is only a 5.5% chance it belongs to Mexico. In a perfectly balanced dataset with 15 countries, we would expect each country to have a probability of approximately 0.067 (6.7%). The deviation from this uniform distribution indicates geographic sampling bias.

Testable Hypothesis:

H₁: Countries with lower representation in the dataset have higher variance in their temperature measurements compared to more frequently sampled countries.

This hypothesis is quantifiable because we can calculate the coefficient of variation (CV = SD/Mean) for each country and test whether there’s a negative correlation between sample size and temperature variance using Spearman’s rank correlation.

# Calculate coefficient of variation
df_1 <- df_1 %>%
  mutate(Temp_CV = SD_Temp / abs(Avg_Temp))

# Test correlation between sample size and variance
cor_test <- cor.test(df_1$Group_Count, df_1$Temp_CV, method = "spearman")
cat("Spearman correlation between sample size and temperature CV:", 
    round(cor_test$estimate, 3), "\n")
## Spearman correlation between sample size and temperature CV: -0.202
cat("P-value:", format.pval(cor_test$p.value, digits = 3), "\n")
## P-value: 0.47

Further Questions:

  1. Does the temporal distribution differ across countries (i.e., are some countries sampled more in recent years)?
  2. Is there a relationship between a country’s average temperature and its representation in the dataset?
# Visualization 1: Bar chart of country frequencies
ggplot(df_1, aes(x = reorder(Country, Group_Count), y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("RARE_GEOGRAPHY" = "#d73027", "Standard" = "#4575b4")) +
  labs(title = "Probability of Occurrence by Country",
       subtitle = "Red indicates the rarest geographic group",
       x = "Country", 
       y = "Observation Count") +
  geom_hline(yintercept = mean(df_1$Group_Count), linetype = "dashed", color = "gray40") +
  annotate("text", x = 1, y = mean(df_1$Group_Count), 
           label = "Mean Count", vjust = -0.5, hjust = -0.1, size = 3.5) +
  theme(legend.position = "bottom")


3. Investigation 2: Hydrological States and Carbon Footprints

Objective

Examine the relationship between rainfall categories and CO2 emissions to identify which hydrological states are most anomalous.

# Group by Rainfall Category
df_2 <- df %>%
  group_by(Rainfall_Cat) %>%
  summarise(
    Mean_CO2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    Median_CO2 = median(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    SD_CO2 = sd(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_HYDRO_STATE", "Common")
  ) %>%
  arrange(Probability)

# Display results
df_2 %>% kable(caption = "Hydrological State Analysis: Rainfall vs CO2 Emissions",
               digits = 3)
Hydrological State Analysis: Rainfall vs CO2 Emissions
Rainfall_Cat Mean_CO2 Median_CO2 SD_CO2 Group_Count Probability Tag
High 10.445 10.8 5.726 319 0.319 RARE_HYDRO_STATE
Low 10.212 10.5 5.527 335 0.335 Common
Medium 10.615 10.8 5.605 346 0.346 Common

Analysis of Investigation 2

Key Findings:

rarest_hydro <- df_2 %>% filter(Tag == "RARE_HYDRO_STATE")
cat("Rarest Hydrological State:", as.character(rarest_hydro$Rainfall_Cat), "\n")
## Rarest Hydrological State: High
cat("Probability:", round(rarest_hydro$Probability, 4), "\n")
## Probability: 0.319
cat("Mean CO2 Emissions:", round(rarest_hydro$Mean_CO2, 2), "Tons/Capita\n")
## Mean CO2 Emissions: 10.45 Tons/Capita

Significance:

The High rainfall category represents the tail end of the probability distribution with P = 0.319. While the distribution is relatively balanced, this indicates that extreme precipitation states are statistically less common than moderate conditions. This aligns with climatological expectations where extreme weather events, by definition, occur less frequently.

Testable Hypothesis:

H₂: High rainfall states (>2000mm) are associated with lower CO2 emissions per capita due to reduced industrial activity during extreme weather periods.

This is quantifiable through a t-test comparing mean CO2 emissions between High rainfall and Low/Medium rainfall groups, or through a linear regression of rainfall amount on CO2 emissions.

# Test if High rainfall has significantly different CO2 emissions
high_rainfall_co2 <- df %>% 
  filter(Rainfall_Cat == "High") %>% 
  pull(CO2.Emissions..Tons.Capita.)

other_rainfall_co2 <- df %>% 
  filter(Rainfall_Cat != "High") %>% 
  pull(CO2.Emissions..Tons.Capita.)

t_test_result <- t.test(high_rainfall_co2, other_rainfall_co2)
cat("T-test comparing High rainfall vs Others:\n")
## T-test comparing High rainfall vs Others:
cat("Mean CO2 (High):", round(mean(high_rainfall_co2, na.rm = TRUE), 2), "\n")
## Mean CO2 (High): 10.45
cat("Mean CO2 (Others):", round(mean(other_rainfall_co2, na.rm = TRUE), 2), "\n")
## Mean CO2 (Others): 10.42
cat("P-value:", format.pval(t_test_result$p.value, digits = 3), "\n")
## P-value: 0.941

Further Questions:

  1. Is there a temporal pattern to rainfall extremes (e.g., are they increasing over time)?
  2. Do certain countries disproportionately contribute to the “High” rainfall category?
# Visualization 2: Frequency and CO2 emissions by rainfall category
p1 <- ggplot(df_2, aes(x = Rainfall_Cat, y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#fc8d59", "Common" = "#91bfdb")) +
  labs(title = "Frequency of Hydrological States",
       x = "Rainfall Category", 
       y = "Observation Count") +
  theme(legend.position = "none")

p2 <- ggplot(df_2, aes(x = Rainfall_Cat, y = Mean_CO2, fill = Tag)) +
  geom_col() +
  scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#fc8d59", "Common" = "#91bfdb")) +
  labs(title = "Mean CO2 Emissions by Rainfall State",
       x = "Rainfall Category", 
       y = "Mean CO2 (Tons/Capita)") +
  theme(legend.position = "bottom")

# Display both plots
p1

p2


4. Investigation 3: Energy Transition States and Extreme Weather

Objective

Analyze the relationship between renewable energy adoption levels and extreme weather events to understand energy transition patterns.

# Group by Renewable Energy Category
df_3 <- df %>%
  group_by(Renewable_Cat) %>%
  summarise(
    Avg_Extreme_Events = mean(Extreme.Weather.Events, na.rm = TRUE),
    Median_Extreme_Events = median(Extreme.Weather.Events, na.rm = TRUE),
    SD_Extreme_Events = sd(Extreme.Weather.Events, na.rm = TRUE),
    Avg_Forest = mean(Forest.Area...., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_ENERGY_LEVEL", "Standard")
  ) %>%
  arrange(Group_Count)

# Display results
df_3 %>% kable(caption = "Energy Transition Analysis: Renewable Energy vs Extreme Weather",
               digits = 3)
Energy Transition Analysis: Renewable Energy vs Extreme Weather
Renewable_Cat Avg_Extreme_Events Median_Extreme_Events SD_Extreme_Events Avg_Forest Group_Count Probability Tag
Advanced 7.046 7.0 4.432 39.896 323 0.323 RARE_ENERGY_LEVEL
Early-Stage 7.120 7.5 4.503 41.267 334 0.334 Standard
Transitioning 7.688 8.0 4.318 40.532 343 0.343 Standard

Analysis of Investigation 3

Key Findings:

rarest_energy <- df_3 %>% filter(Tag == "RARE_ENERGY_LEVEL")
cat("Rarest Energy Transition State:", as.character(rarest_energy$Renewable_Cat), "\n")
## Rarest Energy Transition State: Advanced
cat("Probability:", round(rarest_energy$Probability, 4), "\n")
## Probability: 0.323
cat("Average Extreme Weather Events:", round(rarest_energy$Avg_Extreme_Events, 2), "\n")
## Average Extreme Weather Events: 7.05

Significance:

The Advanced renewable energy group is the least common (P = 0.323), reflecting the global reality of energy transition. Most observations fall into the “Transitioning” phase, while achieving advanced renewable integration (>35% renewable energy) is a rare, high-achievement state that only a subset of nations and time periods have reached.

Testable Hypothesis:

H₃: Countries in the “Advanced” renewable energy category have experienced fewer extreme weather events on average compared to “Early-Stage” countries, controlling for geographic location.

This hypothesis can be tested using ANOVA to compare mean extreme weather events across the three renewable energy categories, followed by post-hoc tests.

# ANOVA test for extreme weather events across renewable categories
anova_model <- aov(Extreme.Weather.Events ~ Renewable_Cat, data = df)
summary(anova_model)
##                Df Sum Sq Mean Sq F value Pr(>F)
## Renewable_Cat   2     83   41.59   2.131  0.119
## Residuals     997  19457   19.52
# Post-hoc Tukey test
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Extreme.Weather.Events ~ Renewable_Cat, data = df)
## 
## $Renewable_Cat
##                                  diff        lwr       upr     p adj
## Transitioning-Early-Stage  0.56828617 -0.2288239 1.3653963 0.2159483
## Advanced-Early-Stage      -0.07332085 -0.8825145 0.7358728 0.9753731
## Advanced-Transitioning    -0.64160702 -1.4455645 0.1623504 0.1470164

Interesting Observation:

The data reveals a counterintuitive pattern: the “Transitioning” group has 7.69 average extreme weather events, which is higher than the “Advanced” group (7.05).

Further Questions:

  1. Is there a feedback loop where extreme weather events accelerate renewable energy policy adoption?
  2. Do countries with higher forest coverage transition to renewable energy faster?
# Visualization 3: Extreme weather by energy transition state
ggplot(df_3, aes(x = Renewable_Cat, y = Avg_Extreme_Events, fill = Tag)) +
  geom_col() +
  geom_errorbar(aes(ymin = Avg_Extreme_Events - SD_Extreme_Events, 
                    ymax = Avg_Extreme_Events + SD_Extreme_Events),
                width = 0.2) +
  scale_fill_manual(values = c("RARE_ENERGY_LEVEL" = "#238b45", "Standard" = "#737373")) +
  labs(title = "Extreme Weather Events by Energy Transition State",
       subtitle = "Error bars represent ± 1 standard deviation",
       x = "Renewable Energy Category", 
       y = "Mean Extreme Weather Events") +
  theme(legend.position = "bottom")

# Additional visualization: Count by category
ggplot(df_3, aes(x = Renewable_Cat, y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("RARE_ENERGY_LEVEL" = "#238b45", "Standard" = "#737373")) +
  labs(title = "Distribution of Energy Transition States",
       subtitle = "Green indicates the rarest energy level",
       x = "Renewable Energy Category", 
       y = "Observation Count") +
  theme(legend.position = "bottom")


5. Bivariate Categorical Analysis: Country and Rainfall Combinations

Objective

Identify which combinations of Country and Rainfall Category exist in the data, find missing combinations, and analyze the most/least common pairings.

# Build all existing combinations
combinations_df <- df %>%
  group_by(Country, Rainfall_Cat) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(Probability = Count / sum(Count))

# Identify extremes
most_common <- combinations_df %>% 
  filter(Count == max(Count))

least_common <- combinations_df %>% 
  filter(Count == min(Count))

# Create all possible combinations
all_countries <- unique(df$Country)
all_rainfall <- levels(df$Rainfall_Cat)
all_possible <- expand.grid(Country = all_countries, 
                           Rainfall_Cat = all_rainfall,
                           stringsAsFactors = FALSE)

# Find missing combinations
missing_combos <- anti_join(all_possible, combinations_df, 
                           by = c("Country", "Rainfall_Cat"))

Results: Combination Analysis

# Display most common combinations
cat("MOST COMMON COMBINATIONS:\n")
## MOST COMMON COMBINATIONS:
most_common %>% 
  arrange(desc(Count)) %>%
  kable(caption = "Most Frequent Country-Rainfall Combinations")
Most Frequent Country-Rainfall Combinations
Country Rainfall_Cat Count Probability
India Medium 29 0.029
Russia Low 29 0.029
UK Low 29 0.029
cat("\nLEAST COMMON COMBINATIONS:\n")
## 
## LEAST COMMON COMBINATIONS:
least_common %>% 
  arrange(Count) %>%
  kable(caption = "Rarest Country-Rainfall Combinations")
Rarest Country-Rainfall Combinations
Country Rainfall_Cat Count Probability
UK High 14 0.014
cat("\nMISSING COMBINATIONS:\n")
## 
## MISSING COMBINATIONS:
if(nrow(missing_combos) > 0) {
  print(missing_combos %>% kable(caption = "Non-existent Country-Rainfall Pairs"))
} else {
  cat("No missing combinations found. All countries have observations in all rainfall categories.\n")
}
## No missing combinations found. All countries have observations in all rainfall categories.
# Summary statistics
cat("\n\nCOMBINATION STATISTICS:\n")
## 
## 
## COMBINATION STATISTICS:
cat("Total possible combinations:", nrow(all_possible), "\n")
## Total possible combinations: 45
cat("Observed combinations:", nrow(combinations_df), "\n")
## Observed combinations: 45
cat("Missing combinations:", nrow(missing_combos), "\n")
## Missing combinations: 0
cat("Percentage coverage:", 
    round(100 * nrow(combinations_df) / nrow(all_possible), 2), "%\n")
## Percentage coverage: 100 %

Interpretation of Combinations

Missing Combinations:

if(nrow(missing_combos) > 0) {
  cat("The dataset is missing", nrow(missing_combos), "country-rainfall combinations.\n")
  cat("This suggests:\n")
  cat("1. Incomplete temporal coverage for certain geographic regions\n")
  cat("2. Climate constraints preventing certain countries from experiencing all rainfall states\n")
  cat("3. Data collection gaps during specific hydrological conditions\n")
} else {
  cat("The absence of missing combinations is remarkable. Every country in the dataset\n")
  cat("has at least one record in Low, Medium, and High rainfall categories. This suggests:\n\n")
  cat("1. The dataset covers a sufficiently long temporal range where every nation\n")
  cat("   eventually experienced all three hydrological states\n")
  cat("2. The data may be synthetic or carefully balanced\n")
  cat("3. The rainfall categorization bins are wide enough to capture natural variability\n")
}
## The absence of missing combinations is remarkable. Every country in the dataset
## has at least one record in Low, Medium, and High rainfall categories. This suggests:
## 
## 1. The dataset covers a sufficiently long temporal range where every nation
##    eventually experienced all three hydrological states
## 2. The data may be synthetic or carefully balanced
## 3. The rainfall categorization bins are wide enough to capture natural variability

Most Common Combinations:

cat("\nThe most frequent combinations are:\n")
## 
## The most frequent combinations are:
for(i in 1:min(3, nrow(most_common))) {
  cat(sprintf("- %s + %s: %d observations (P = %.4f)\n", 
              most_common$Country[i], 
              most_common$Rainfall_Cat[i], 
              most_common$Count[i],
              most_common$Probability[i]))
}
## - India + Medium: 29 observations (P = 0.0290)
## - Russia + Low: 29 observations (P = 0.0290)
## - UK + Low: 29 observations (P = 0.0290)
cat("\nThese combinations likely represent:\n")
## 
## These combinations likely represent:
cat("1. Countries with stable, consistent rainfall patterns\n")
## 1. Countries with stable, consistent rainfall patterns
cat("2. Geographic regions with more comprehensive data collection\n")
## 2. Geographic regions with more comprehensive data collection
cat("3. Climate zones where certain rainfall levels are climatologically expected\n")
## 3. Climate zones where certain rainfall levels are climatologically expected

Least Common Combinations (RARE COMBINATIONS):

cat("\nThe rarest combinations are:\n")
## 
## The rarest combinations are:
for(i in 1:min(5, nrow(least_common))) {
  cat(sprintf("- %s + %s: %d observations (P = %.4f)\n", 
              least_common$Country[i], 
              least_common$Rainfall_Cat[i], 
              least_common$Count[i],
              least_common$Probability[i]))
}
## - UK + High: 14 observations (P = 0.0140)
cat("\n\nWhy are these rare?\n")
## 
## 
## Why are these rare?
cat("These represent CLIMATOLOGICAL ANOMALIES - conditions that are:\n")
## These represent CLIMATOLOGICAL ANOMALIES - conditions that are:
cat("1. Statistically unlikely given the country's typical climate\n")
## 1. Statistically unlikely given the country's typical climate
cat("2. Extreme weather events that occur infrequently\n")
## 2. Extreme weather events that occur infrequently
cat("3. Transitions between climate states during unusual atmospheric conditions\n")
## 3. Transitions between climate states during unusual atmospheric conditions
cat("\nFor example, if UK-High is among the rarest, this makes sense because\n")
## 
## For example, if UK-High is among the rarest, this makes sense because
cat("the UK typically experiences moderate, consistent rainfall rather than extremes.\n")
## the UK typically experiences moderate, consistent rainfall rather than extremes.

Visualization of Combinations

# Heatmap of Country-Rainfall Combinations
ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Count)) +
  geom_tile(color = "white", size = 0.5) +
  scale_fill_gradient(low = "#f7fbff", high = "#08306b",
                      name = "Count") +
  labs(title = "Heatmap of Country-Rainfall Combinations",
       subtitle = "Darker tiles represent higher occurrence probabilities",
       x = "Country",
       y = "Rainfall Category") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

# Alternative: Probability-based heatmap
ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Probability)) +
  geom_tile(color = "white", size = 0.5) +
  geom_text(aes(label = Count), size = 3, color = "white") +
  scale_fill_gradient(low = "#feedde", high = "#a63603",
                      name = "Probability",
                      labels = percent_format(accuracy = 0.1)) +
  labs(title = "Probability Distribution of Country-Rainfall Combinations",
       subtitle = "Numbers show observation counts; color intensity shows probability",
       x = "Country",
       y = "Rainfall Category") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

# Bar plot showing top and bottom combinations
top_bottom <- bind_rows(
  combinations_df %>% 
    arrange(desc(Count)) %>% 
    head(10) %>% 
    mutate(Category = "Top 10 Most Common"),
  combinations_df %>% 
    arrange(Count) %>% 
    head(10) %>% 
    mutate(Category = "Top 10 Rarest")
) %>%
  mutate(Combo = paste(Country, Rainfall_Cat, sep = " - "))

ggplot(top_bottom, aes(x = reorder(Combo, Count), y = Count, fill = Category)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Top 10 Most Common" = "#2166ac", 
                               "Top 10 Rarest" = "#b2182b")) +
  labs(title = "Extreme Combinations: Most Common vs Rarest",
       x = "Country-Rainfall Combination",
       y = "Observation Count",
       fill = "") +
  theme(legend.position = "bottom")


6. Synthesis and Conclusions

Summary of Key Findings

Our probability-based anomaly detection revealed three critical insights:

1. Geographic Sampling Bias - Mexico is the rarest country (P = 0.055) - This suggests systematic under-sampling or late integration into reporting frameworks - Hypothesis testing showed no significant correlation between sample size and temperature variance

2. Hydrological State Distribution - High rainfall is the rarest state (P = 0.319) - Extreme precipitation events are statistically infrequent, aligning with climatological expectations - CO2 emissions show no significant difference between high and other rainfall states

3. Energy Transition Patterns - Advanced renewable adoption is the rarest (P = 0.323) - Only a subset of observations represent advanced clean energy integration - Counterintuitively, transitioning states experience more extreme weather events

4. Combination Analysis - Complete coverage across all country-rainfall pairs - Rarest combinations represent climatological anomalies - Most common combinations reflect stable climate patterns

Implications for Climate Modeling

These “anomalies” are not data errors but represent the boundaries of observed climate systems. The most significant climate insights often reside in these least frequent data points, as they represent:

  1. Extreme Events: Low-probability states that have disproportionate impacts
  2. Transition States: Rare combinations indicating system changes
  3. Data Gaps: Under-sampled regions requiring targeted collection efforts

Recommendations for Future Analysis

  1. Temporal Analysis: Examine whether rare states are becoming more or less frequent over time
  2. Causal Modeling: Investigate causal relationships between renewable energy adoption and extreme weather
  3. Spatial Analysis: Map geographic clusters of rare combinations
  4. Multivariate Anomaly Detection: Extend to 3+ dimensional categorical combinations

Appendix: Additional Exploratory Analysis

# Distribution of probabilities across all combinations
summary_stats <- combinations_df %>%
  summarise(
    Mean_Count = mean(Count),
    Median_Count = median(Count),
    SD_Count = sd(Count),
    Min_Count = min(Count),
    Max_Count = max(Count),
    Skewness = (mean(Count) - median(Count)) / sd(Count)
  )

cat("Distribution Statistics for Country-Rainfall Combinations:\n")
## Distribution Statistics for Country-Rainfall Combinations:
print(summary_stats)
## # A tibble: 1 × 6
##   Mean_Count Median_Count SD_Count Min_Count Max_Count Skewness
##        <dbl>        <int>    <dbl>     <int>     <int>    <dbl>
## 1       22.2           22     3.77        14        29   0.0590
# Histogram of combination frequencies
ggplot(combinations_df, aes(x = Count)) +
  geom_histogram(bins = 20, fill = "#4292c6", color = "white") +
  geom_vline(xintercept = mean(combinations_df$Count), 
             linetype = "dashed", color = "red", size = 1) +
  labs(title = "Distribution of Combination Frequencies",
       subtitle = "Red line shows mean count",
       x = "Number of Observations",
       y = "Number of Combinations") +
  theme_minimal()


Session Information

sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0    knitr_1.51      lubridate_1.9.4 forcats_1.0.1  
##  [5] stringr_1.6.0   dplyr_1.1.4     purrr_1.2.1     readr_2.1.6    
##  [9] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
##  [5] jquerylib_0.1.4    yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
##  [9] labeling_0.4.3     generics_0.1.4     bslib_0.10.0       pillar_1.11.1     
## [13] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        stringi_1.8.7     
## [17] cachem_1.1.0       xfun_0.56          sass_0.4.10        S7_0.2.1          
## [21] otel_0.2.0         timechange_0.3.0   cli_3.6.5          withr_3.0.2       
## [25] magrittr_2.0.4     digest_0.6.39      grid_4.5.2         rstudioapi_0.18.0 
## [29] hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.1        evaluate_1.0.5    
## [33] glue_1.8.0         farver_2.1.2       rmarkdown_2.30     tools_4.5.2       
## [37] pkgconfig_2.0.3    htmltools_0.5.9