Executive Summary

This report analyses the distributional characteristics of a global climate dataset using probability-based anomaly detection. I grouped the data along several categorical dimensions and calculated each group's probability (its share of all rows) to identify rare combinations that may represent unusual environmental states or reporting patterns.
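The group-count-probability-tag pattern used throughout this report can be sketched in a few lines of base R. The helper name `group_probabilities` and the toy data frame are illustrative assumptions; the analysis itself uses dplyr pipelines in each section.

```r
# Hypothetical helper: count rows per group, convert counts to probabilities,
# and tag the least frequent group(s) as rare.
group_probabilities <- function(data, group_col) {
  counts <- table(data[[group_col]])
  out <- data.frame(
    Group = names(counts),
    Group_Count = as.integer(counts),
    stringsAsFactors = FALSE
  )
  out$Probability <- out$Group_Count / sum(out$Group_Count)
  out$Tag <- ifelse(out$Group_Count == min(out$Group_Count), "RARE", "Standard")
  out[order(out$Group_Count), ]
}

# Toy data for illustration only (not the climate dataset)
toy <- data.frame(Country = c("A", "A", "A", "B", "B", "C"))
result <- group_probabilities(toy, "Country")
print(result)
```

The probabilities sum to one by construction, and the tag always marks at least one group, so "rare" here is relative to the other groups rather than an absolute threshold.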


1. Data Loading and Preprocessing

library(tidyverse)
library(knitr)
library(scales)

theme_set(theme_minimal(base_size = 12))
df <- read.csv("climate_change_dataset.csv")

str(df)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Year                       : int  2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
##  $ Country                    : chr  "UK" "USA" "France" "Argentina" ...
##  $ Avg.Temperature...C.       : num  8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
##  $ CO2.Emissions..Tons.Capita.: num  9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
##  $ Sea.Level.Rise..mm.        : num  3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
##  $ Rainfall..mm.              : int  1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
##  $ Population                 : int  530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
##  $ Renewable.Energy....       : num  20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
##  $ Extreme.Weather.Events     : int  14 8 9 7 4 12 10 1 4 5 ...
##  $ Forest.Area....            : num  59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
summary(df)
##       Year        Country          Avg.Temperature...C.
##  Min.   :2000   Length:1000        Min.   : 5.00       
##  1st Qu.:2005   Class :character   1st Qu.:12.18       
##  Median :2012   Mode  :character   Median :20.10       
##  Mean   :2011                      Mean   :19.88       
##  3rd Qu.:2018                      3rd Qu.:27.23       
##  Max.   :2023                      Max.   :34.90       
##  CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm. 
##  Min.   : 0.500              Min.   :1.00        Min.   : 501  
##  1st Qu.: 5.575              1st Qu.:2.00        1st Qu.:1099  
##  Median :10.700              Median :3.00        Median :1726  
##  Mean   :10.426              Mean   :3.01        Mean   :1739  
##  3rd Qu.:15.400              3rd Qu.:4.00        3rd Qu.:2362  
##  Max.   :20.000              Max.   :5.00        Max.   :2999  
##    Population        Renewable.Energy.... Extreme.Weather.Events
##  Min.   :3.661e+06   Min.   : 5.10        Min.   : 0.000        
##  1st Qu.:3.436e+08   1st Qu.:16.10        1st Qu.: 3.000        
##  Median :7.131e+08   Median :27.15        Median : 8.000        
##  Mean   :7.054e+08   Mean   :27.30        Mean   : 7.291        
##  3rd Qu.:1.074e+09   3rd Qu.:38.92        3rd Qu.:11.000        
##  Max.   :1.397e+09   Max.   :50.00        Max.   :14.000        
##  Forest.Area....
##  Min.   :10.10  
##  1st Qu.:25.60  
##  Median :41.15  
##  Mean   :40.57  
##  3rd Qu.:55.80  
##  Max.   :70.00
cat("\nMissing values per column:\n")
## 
## Missing values per column:
colSums(is.na(df))
##                        Year                     Country 
##                           0                           0 
##        Avg.Temperature...C. CO2.Emissions..Tons.Capita. 
##                           0                           0 
##         Sea.Level.Rise..mm.               Rainfall..mm. 
##                           0                           0 
##                  Population        Renewable.Energy.... 
##                           0                           0 
##      Extreme.Weather.Events             Forest.Area.... 
##                           0                           0

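Continuous variables are discretized into three categorical bins with the cut() function for the probability analysis. With breaks = 3, cut() produces equal-width intervals over each variable's observed range.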

df <- df %>%
  mutate(
    Rainfall_Cat = cut(Rainfall..mm., 
                       breaks = 3, 
                       labels = c("Low", "Medium", "High")),
    
    Renewable_Cat = cut(Renewable.Energy...., 
                        breaks = 3, 
                        labels = c("Early-Stage", "Transitioning", "Advanced")),
    
    Temp_Cat = cut(Avg.Temperature...C., 
                   breaks = 3,
                   labels = c("Cool", "Moderate", "Warm")),
    
    CO2_Cat = cut(CO2.Emissions..Tons.Capita., 
                  breaks = 3,
                  labels = c("Low-Emissions", "Medium-Emissions", "High-Emissions"))
  )

head(df, 10) %>% kable(caption = "Sample of Preprocessed Data")
Sample of Preprocessed Data
Year Country Avg.Temperature...C. CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm. Population Renewable.Energy.... Extreme.Weather.Events Forest.Area.... Rainfall_Cat Renewable_Cat Temp_Cat CO2_Cat
2006 UK 8.9 9.3 3.1 1441 530911230 20.4 14 59.8 Medium Transitioning Cool Medium-Emissions
2019 USA 31.0 4.8 4.2 2407 107364344 49.2 8 31.0 High Advanced Warm Low-Emissions
2014 France 33.9 2.8 2.2 1241 441101758 33.3 9 35.5 Low Transitioning Warm Low-Emissions
2010 Argentina 5.9 1.8 3.2 1892 1069669579 23.7 7 17.7 Medium Transitioning Cool Low-Emissions
2007 Germany 26.9 5.6 2.4 1743 124079175 12.5 4 17.4 Medium Early-Stage Warm Low-Emissions
2020 China 32.3 1.4 2.7 2100 1202028857 49.4 12 47.2 Medium Advanced Warm Low-Emissions
2006 Argentina 30.7 11.6 3.9 1755 586706107 41.9 10 50.5 Medium Advanced Warm Medium-Emissions
2018 South Africa 33.9 6.0 4.5 827 83947380 17.7 1 56.6 Low Early-Stage Warm Low-Emissions
2022 UK 27.8 16.6 1.5 1966 980305187 8.2 4 43.4 Medium Early-Stage Warm High-Emissions
2010 Australia 18.3 1.9 3.5 2599 849496137 7.5 5 48.7 High Early-Stage Moderate Low-Emissions
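Because cut() with breaks = 3 splits the observed range into equal-width intervals, the resulting group sizes depend on how the values are spread. The sketch below contrasts equal-width bins with equal-frequency (tercile) bins on a deliberately skewed toy vector. The vector and the variable names are illustrative assumptions, and tercile binning is shown as a possible alternative, not as the method used in this report.

```r
# Toy, deliberately skewed values (illustration only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Equal-width bins, as produced by cut(..., breaks = 3)
equal_width <- cut(x, breaks = 3, labels = c("Low", "Medium", "High"))

# Equal-frequency bins: use the terciles of x as break points
tercile_breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))
equal_freq <- cut(x, breaks = tercile_breaks,
                  labels = c("Low", "Medium", "High"),
                  include.lowest = TRUE)

table(equal_width)  # Low 8, Medium 0, High 1
table(equal_freq)   # Low 3, Medium 3, High 3
```

With equal-width bins a "rare" category reflects the shape of the underlying distribution, whereas tercile bins are close to equal in size by construction, so the choice of binning determines what a low group probability means.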

2. Geographic Distribution and Thermal Profiles

Objective

This section examines which countries are most and least represented in the dataset, along with their average temperature profiles, to identify possible geographic sampling biases.

df_1 <- df %>%
  group_by(Country) %>%
  summarise(
    Avg_Temp = mean(Avg.Temperature...C., na.rm = TRUE),
    SD_Temp = sd(Avg.Temperature...C., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_GEOGRAPHY", "Standard")
  ) %>%
  arrange(Group_Count)

df_1 %>% kable(caption = "Country-Level Analysis: Probability Distribution", 
               digits = 3)
Country-Level Analysis: Probability Distribution
Country Avg_Temp SD_Temp Group_Count Probability Tag
Mexico 20.696 9.621 55 0.055 RARE_GEOGRAPHY
Australia 19.449 7.829 57 0.057 Standard
Germany 20.289 8.041 61 0.061 Standard
Japan 20.454 9.132 63 0.063 Standard
UK 18.491 8.200 65 0.065 Standard
France 19.383 8.884 66 0.066 Standard
Argentina 19.299 8.330 67 0.067 Standard
Brazil 20.836 8.970 67 0.067 Standard
Canada 20.012 8.445 67 0.067 Standard
China 20.282 8.788 67 0.067 Standard
India 19.764 8.669 70 0.070 Standard
South Africa 20.738 9.170 73 0.073 Standard
USA 19.049 8.341 73 0.073 Standard
Russia 20.719 7.833 74 0.074 Standard
Indonesia 18.919 8.244 75 0.075 Standard

Analysis

rarest_country <- df_1 %>% filter(Tag == "RARE_GEOGRAPHY")
most_common_country <- df_1 %>% filter(Group_Count == max(Group_Count))

cat("Rarest Country:", rarest_country$Country, "\n")
## Rarest Country: Mexico
cat("Probability:", round(rarest_country$Probability, 4), "\n")
## Probability: 0.055
cat("Count:", rarest_country$Group_Count, "\n\n")
## Count: 55
cat("Most Common Country:", most_common_country$Country, "\n")
## Most Common Country: Indonesia
cat("Probability:", round(most_common_country$Probability, 4), "\n")
## Probability: 0.075
cat("Count:", most_common_country$Group_Count, "\n")
## Count: 75

The rarest country has a probability of 0.055: a row drawn at random from the dataset has a 5.5% chance of belonging to Mexico. Under a uniform distribution across the 15 countries, each would account for about 6.7% of rows, so Mexico is under-represented. A gap of this size (55 rows against an expected 67 or so) could also arise from random sampling variation, so it is only weak evidence of geographic sampling bias. One speculation is that countries with lower representation also show higher temperature variance, which might point to limited monitoring infrastructure in some regions.
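Whether the country counts really depart from a uniform split can be checked with a chi-squared goodness-of-fit test. The following is a minimal sketch that uses the group counts copied from the country table above; the variable names are illustrative.

```r
# Group counts copied from the country-level table (Mexico ... Indonesia)
counts <- c(55, 57, 61, 63, 65, 66, 67, 67, 67, 67, 70, 73, 73, 74, 75)

# With no `p` argument, chisq.test() tests against equal probabilities
gof <- chisq.test(counts)

gof$statistic  # chi-squared statistic
gof$parameter  # degrees of freedom (number of groups minus one)
gof$p.value
```

The statistic works out to 7.4 on 14 degrees of freedom, well below the 5% critical value of about 23.7, so by this check the observed counts are compatible with a uniform split across countries.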

Testable Hypothesis: The countries with lower representation in the dataset have higher variance in their temperature measurements compared to the more frequently sampled countries.

df_1 <- df_1 %>%
  mutate(Temp_CV = SD_Temp / abs(Avg_Temp))

cor_test <- cor.test(df_1$Group_Count, df_1$Temp_CV, method = "spearman")
cat("Spearman correlation between sample size and temperature CV:", 
    round(cor_test$estimate, 3), "\n")
## Spearman correlation between sample size and temperature CV: -0.202
cat("P-value:", format.pval(cor_test$p.value, digits = 3), "\n")
## P-value: 0.47
ggplot(df_1, aes(x = reorder(Country, Group_Count), y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("RARE_GEOGRAPHY" = "#d73027", "Standard" = "#4575b4")) +
  labs(title = "Observation Count by Country",
       subtitle = "Red indicates the rarest geographic group",
       x = "Country", 
       y = "Observation Count") +
  geom_hline(yintercept = mean(df_1$Group_Count), linetype = "dashed", color = "gray40") +
  annotate("text", x = 1, y = mean(df_1$Group_Count), 
           label = "Mean Count", vjust = -0.5, hjust = -0.1, size = 3.5) +
  theme(legend.position = "bottom")

This bar chart shows the observation counts per country with the rarest geography highlighted in red.

The rarest country sits below the mean count (the dashed line), possibly because of data collection gaps, although this difference was not tested for significance. Counts range from 55 to 75 across countries. This uneven distribution may bias climate trend analyses toward over-represented regions.


3. Hydrological States and Carbon Footprints

Objective

To examine the relationship between rainfall categories and CO2 emissions to identify which hydrological states are most anomalous.

df_2 <- df %>%
  group_by(Rainfall_Cat) %>%
  summarise(
    Mean_CO2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    Median_CO2 = median(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    SD_CO2 = sd(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_HYDRO_STATE", "Common")
  ) %>%
  arrange(Probability)

df_2 %>% kable(caption = "Hydrological State Analysis: Rainfall vs CO2 Emissions",
               digits = 3)
Hydrological State Analysis: Rainfall vs CO2 Emissions
Rainfall_Cat Mean_CO2 Median_CO2 SD_CO2 Group_Count Probability Tag
High 10.445 10.8 5.726 319 0.319 RARE_HYDRO_STATE
Low 10.212 10.5 5.527 335 0.335 Common
Medium 10.615 10.8 5.605 346 0.346 Common

Analysis

rarest_hydro <- df_2 %>% filter(Tag == "RARE_HYDRO_STATE")
cat("Rarest Hydrological State:", as.character(rarest_hydro$Rainfall_Cat), "\n")
## Rarest Hydrological State: High
cat("Probability:", round(rarest_hydro$Probability, 4), "\n")
## Probability: 0.319
cat("Mean CO2 Emissions:", round(rarest_hydro$Mean_CO2, 2), "Tons/Capita\n")
## Mean CO2 Emissions: 10.45 Tons/Capita

The rarest hydrological state has a probability of 0.319, meaning 31.9% of observations fall into the High rainfall category. Because the bins are equal-width and the three categories hold 31.9%, 33.5%, and 34.6% of the rows, High rainfall is only marginally less common than the others rather than genuinely rare. Mean CO2 emissions are also nearly identical across the categories (10.2 to 10.6 tons per capita), so the table alone gives little indication of a link between rainfall level and emissions.

Testable hypothesis: high-rainfall states have significantly different CO2 emission levels from low- and medium-rainfall states, which can be assessed with a t-test comparing the group means.
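One way to judge whether a share of 31.9% differs meaningfully from an even one-third split is an exact binomial test. The sketch below uses the High count (319) and the total row count (1000) reported above; treating one-third as the reference value is an assumption made for illustration.

```r
# 319 of 1000 rows fall in the High rainfall category (from the table above)
bt <- binom.test(x = 319, n = 1000, p = 1/3)

bt$estimate  # observed proportion
bt$conf.int  # 95% confidence interval for the true proportion
bt$p.value
```

If the confidence interval contains 1/3, the High category cannot be distinguished from an even three-way split on the basis of these counts.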

high_rainfall <- df %>% filter(Rainfall_Cat == "High")
other_rainfall <- df %>% filter(Rainfall_Cat != "High")

t_test_result <- t.test(high_rainfall$CO2.Emissions..Tons.Capita., 
                        other_rainfall$CO2.Emissions..Tons.Capita.)

cat("T-test comparing High vs Other Rainfall States:\n")
## T-test comparing High vs Other Rainfall States:
cat("Mean CO2 (High):", round(mean(high_rainfall$CO2.Emissions..Tons.Capita., na.rm = TRUE), 2), "\n")
## Mean CO2 (High): 10.45
cat("Mean CO2 (Other):", round(mean(other_rainfall$CO2.Emissions..Tons.Capita., na.rm = TRUE), 2), "\n")
## Mean CO2 (Other): 10.42
cat("P-value:", format.pval(t_test_result$p.value, digits = 3), "\n")
## P-value: 0.941
ggplot(df_2, aes(x = Rainfall_Cat, y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#d73027", "Common" = "#91bfdb")) +
  labs(title = "Hydrological State Distribution",
       subtitle = "Frequency of rainfall categories in the dataset",
       x = "Rainfall Category",
       y = "Number of Observations") +
  geom_text(aes(label = paste0("P=", round(Probability, 3))), 
            vjust = -0.5, size = 4) +
  theme(legend.position = "bottom")

This bar chart shows the distribution of observations across rainfall categories with probabilities labeled.

This visualization shows that the three rainfall categories occur with similar frequency, with High (marked in red) slightly less common than Low and Medium. Given the p-value of 0.941 in the t-test above, there is no evidence in this dataset that CO2 emissions differ between high-rainfall and other observations.
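The t-test above compares High against the other two categories pooled. An alternative is to compare all three categories at once, for example with Welch's one-way test or the rank-based Kruskal-Wallis test. The sketch below runs both on synthetic data; the data frame `toy`, its column `CO2`, and the seed are assumptions for illustration and do not reproduce the climate dataset.

```r
set.seed(42)  # synthetic data only, not the climate dataset

toy <- data.frame(
  Rainfall_Cat = factor(rep(c("Low", "Medium", "High"), each = 50),
                        levels = c("Low", "Medium", "High")),
  CO2 = runif(150, min = 0.5, max = 20)
)

# Welch's one-way test: compares the three group means without
# assuming equal variances
welch <- oneway.test(CO2 ~ Rainfall_Cat, data = toy)

# Kruskal-Wallis: rank-based alternative that does not assume normality
kw <- kruskal.test(CO2 ~ Rainfall_Cat, data = toy)

print(welch)
print(kw)
```

On the real data, the same two calls would take `CO2.Emissions..Tons.Capita. ~ Rainfall_Cat` with `data = df`.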


4. Energy Transition and Weather Extremes

Objective

To examine the relationship between renewable energy adoption stages and extreme weather frequency, as a first look at energy-climate interactions.

df_3 <- df %>%
  group_by(Renewable_Cat) %>%
  summarise(
    Max_Temp = max(Avg.Temperature...C., na.rm = TRUE),
    Max_Rainfall = max(Rainfall..mm., na.rm = TRUE),
    Extreme_Weather_Score = max(Avg.Temperature...C., na.rm = TRUE) + 
                            max(Rainfall..mm., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_ENERGY_STATE", "Common")
  ) %>%
  arrange(Probability)

df_3 %>% kable(caption = "Energy Transition Analysis: Renewable Adoption vs Weather Extremes",
               digits = 3)
Energy Transition Analysis: Renewable Adoption vs Weather Extremes
Renewable_Cat Max_Temp Max_Rainfall Extreme_Weather_Score Group_Count Probability Tag
Advanced 34.6 2989 3023.6 323 0.323 RARE_ENERGY_STATE
Early-Stage 34.9 2999 3033.9 334 0.334 Common
Transitioning 34.9 2995 3029.9 343 0.343 Common

Analysis

rarest_energy <- df_3 %>% filter(Tag == "RARE_ENERGY_STATE")
cat("Rarest Energy Transition State:", as.character(rarest_energy$Renewable_Cat), "\n")
## Rarest Energy Transition State: Advanced
cat("Probability:", round(rarest_energy$Probability, 4), "\n")
## Probability: 0.323
cat("Extreme Weather Score:", round(rarest_energy$Extreme_Weather_Score, 2), "\n")
## Extreme Weather Score: 3023.6

The rarest energy state has a probability of 0.323, representing 32.3% of observations. The Advanced category is therefore only slightly less common than Early-Stage (33.4%) and Transitioning (34.3%), which is roughly what equal-width binning of an evenly spread variable would produce. The extreme weather score adds the maximum temperature (in °C) to the maximum rainfall (in mm), so it mixes units and is dominated by the rainfall term. It should be read as a rough descriptive index, and any causal interpretation requires further investigation.
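Since the score adds a temperature in °C to a rainfall total in mm, the rainfall term is roughly a hundred times larger and drives the result. One possible remedy, sketched below on toy values, is to standardize each variable before combining them. The column names `temp_c` and `rain_mm` and the z-score composite are illustrative assumptions, not the score used in this report.

```r
# Toy values for illustration only
toy <- data.frame(
  temp_c  = c(10, 20, 30, 25),
  rain_mm = c(600, 1500, 2900, 1000)
)

# scale() centres each column and divides by its standard deviation,
# putting both variables on a comparable, unitless scale
z <- scale(toy)

# Composite score: sum of the two standardized values per row
toy$score <- rowSums(z)
print(toy)
```

With standardized inputs, a one-standard-deviation change in temperature moves the score as much as a one-standard-deviation change in rainfall.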

Testable Hypothesis: Regions in advanced renewable energy adoption stages experience more extreme weather events compared to early-stage regions (quantifiable by comparing extreme weather scores across categories using ANOVA).

anova_result <- aov(Extreme_Weather_Score ~ Renewable_Cat, data = df_3)
cat("ANOVA testing weather extremes across energy transition states:\n")
## ANOVA testing weather extremes across energy transition states:
print(summary(anova_result))
##               Df Sum Sq Mean Sq
## Renewable_Cat  2  53.93   26.96
ggplot(df_3, aes(x = Renewable_Cat, y = Extreme_Weather_Score, fill = Tag)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("RARE_ENERGY_STATE" = "#d73027", "Common" = "#1a9850")) +
  labs(title = "Energy Transition Stages and Weather Extremes",
       subtitle = "Relationship between renewable adoption and extreme weather",
       x = "Renewable Energy Adoption Stage",
       y = "Extreme Weather Score (Max Temp + Max Rainfall)") +
  geom_text(aes(label = paste0("n=", Group_Count)), 
            vjust = -0.5, size = 4) +
  theme(legend.position = "bottom")

This bar chart displays extreme weather scores across renewable energy adoption stages.

Contrary to the hypothesis, the Early-Stage category has the highest extreme weather score (3033.9), followed by Transitioning (3029.9) and Advanced (3023.6). The differences amount to about 10 units on a score of roughly 3,000, so they are negligible. The ANOVA above was fitted to the three-row summary table, which leaves no residual degrees of freedom, so it reports no F statistic or p-value and cannot test the hypothesis. Advanced adoption (marked in red) is the least common stage, but only marginally. Whether climate impacts accelerate or hinder clean energy deployment remains an open question that these data do not settle.
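An ANOVA needs several observations per group to estimate within-group variance. The sketch below contrasts a fit on one summary row per group with a fit on row-level data, using synthetic values under the same column names as the dataset (`Renewable_Cat`, `Extreme.Weather.Events`). The data, the seed, and the choice of the event count as the response are assumptions for illustration.

```r
set.seed(1)  # synthetic stand-in data, not the climate dataset

toy <- data.frame(
  Renewable_Cat = factor(rep(c("Early-Stage", "Transitioning", "Advanced"),
                             each = 40)),
  Extreme.Weather.Events = rpois(120, lambda = 7)
)

# One summary row per group, analogous to the df_3 table above
toy_summary <- aggregate(Extreme.Weather.Events ~ Renewable_Cat,
                         data = toy, FUN = max)

fit_summary <- aov(Extreme.Weather.Events ~ Renewable_Cat, data = toy_summary)
fit_rows    <- aov(Extreme.Weather.Events ~ Renewable_Cat, data = toy)

df.residual(fit_summary)  # 0: no within-group variation left to test against
df.residual(fit_rows)     # 117: 120 rows minus 3 group means

summary(fit_rows)         # includes an F statistic and p-value
```

On the real data, the row-level version would be `aov(Extreme.Weather.Events ~ Renewable_Cat, data = df)`.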


5. Bivariate Categorical Analysis: Country and Rainfall Combinations

Objective

This section enumerates all possible combinations of country and rainfall category, detects missing combinations, and analyzes the most and least common pairings.

all_countries <- unique(df$Country)
all_rainfall <- c("Low", "Medium", "High")

all_combinations <- expand.grid(
  Country = all_countries,
  Rainfall_Cat = all_rainfall,
  stringsAsFactors = FALSE
)

combinations_df <- df %>%
  group_by(Country, Rainfall_Cat) %>%
  summarise(Count = n(), .groups = "drop") %>%
  right_join(all_combinations, by = c("Country", "Rainfall_Cat")) %>%
  mutate(Count = replace_na(Count, 0)) %>%
  mutate(Probability = Count / sum(Count[Count > 0]))

combinations_df %>% 
  arrange(desc(Count)) %>% 
  head(10) %>%
  kable(caption = "Top 10 Most Common Country-Rainfall Combinations", digits = 4)
Top 10 Most Common Country-Rainfall Combinations
Country Rainfall_Cat Count Probability
India Medium 29 0.029
Russia Low 29 0.029
UK Low 29 0.029
Indonesia Medium 28 0.028
South Africa Low 28 0.028
France High 26 0.026
Indonesia High 26 0.026
Russia Medium 26 0.026
USA Medium 26 0.026
Brazil High 25 0.025

Missing Combinations Analysis

missing_combos <- combinations_df %>%
  filter(Count == 0) %>%
  select(Country, Rainfall_Cat)

cat("Number of missing combinations:", nrow(missing_combos), "\n\n")
## Number of missing combinations: 0
if(nrow(missing_combos) > 0) {
  cat("Missing combinations:\n")
  print(missing_combos)
}
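The same missing-combination check can be sketched without an explicit grid: a two-way table() of a character column against a factor keeps every factor level, so absent pairs appear as zero counts. The toy data and the name `missing_pairs` below are illustrative assumptions.

```r
# Toy data for illustration only
toy <- data.frame(
  Country = c("A", "A", "B", "B", "B", "C"),
  Rainfall_Cat = factor(c("Low", "High", "Low", "Medium", "High", "Low"),
                        levels = c("Low", "Medium", "High"))
)

# Cross-tabulate: every Country x Rainfall_Cat cell is present, zeros included
xtab <- table(toy$Country, toy$Rainfall_Cat)

# Convert to long format and keep the cells with no observations
combos <- as.data.frame(xtab, stringsAsFactors = FALSE)
names(combos) <- c("Country", "Rainfall_Cat", "Count")
missing_pairs <- combos[combos$Count == 0, c("Country", "Rainfall_Cat")]

print(missing_pairs)  # A-Medium, C-Medium, C-High
```

This relies on `Rainfall_Cat` being a factor with all three levels declared; a plain character column would only list the values that actually occur.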

Most and Least Common Combinations

most_common <- combinations_df %>%
  filter(Count > 0) %>%
  arrange(desc(Count)) %>%
  head(10)

least_common <- combinations_df %>%
  filter(Count > 0) %>%
  arrange(Count) %>%
  head(10)

cat("\nMost Common Combinations:\n")
## 
## Most Common Combinations:
most_common %>% 
  select(Country, Rainfall_Cat, Count, Probability) %>%
  kable(digits = 4)
Country Rainfall_Cat Count Probability
India Medium 29 0.029
Russia Low 29 0.029
UK Low 29 0.029
Indonesia Medium 28 0.028
South Africa Low 28 0.028
France High 26 0.026
Indonesia High 26 0.026
Russia Medium 26 0.026
USA Medium 26 0.026
Brazil High 25 0.025
cat("\nLeast Common Combinations:\n")
## 
## Least Common Combinations:
least_common %>% 
  select(Country, Rainfall_Cat, Count, Probability) %>%
  kable(digits = 4)
Country Rainfall_Cat Count Probability
UK High 14 0.014
France Medium 16 0.016
India High 16 0.016
Australia Low 17 0.017
Mexico Low 17 0.017
Germany Low 18 0.018
Australia High 19 0.019
Germany High 19 0.019
Japan Low 19 0.019
Mexico Medium 19 0.019

Interpretation of Combinations

if(nrow(missing_combos) > 0) {
  cat("The dataset is missing", nrow(missing_combos), "country-rainfall combinations.\n")
  cat("This suggests incomplete temporal coverage, climate constraints preventing certain rainfall states, or data collection gaps during specific hydrological conditions.\n")
} else {
  cat("Every country experienced all three rainfall categories, suggesting comprehensive temporal coverage or synthetic data balancing.\n")
}
## Every country experienced all three rainfall categories, suggesting comprehensive temporal coverage or synthetic data balancing.

The most frequent combinations are simply those that hold the most rows (up to 29 each). They may reflect countries where a given rainfall level is typical and monitoring is well established, but the dataset alone cannot confirm this.

The rarest combinations (as few as 14 rows) are less common but still well represented. They could reflect infrequent weather conditions or atypical precipitation for a given country, or simply sampling variation: counts of 14 to 29 around an average of about 22 per pair are not strongly anomalous.
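A rough sense of how unusual the extreme cell counts are can be had by comparing them with what an even split of 1000 rows across 45 cells would give. The sketch below assumes a uniform split, which is a simplification because the country and rainfall margins are not exactly equal; the variable names are illustrative.

```r
n_total <- 1000    # rows in the dataset, from str(df)
n_cells <- 15 * 3  # countries x rainfall categories

# Expected count per cell and its binomial standard deviation
# under the assumed uniform split
expected <- n_total / n_cells
sd_cell  <- sqrt(n_total * (1 / n_cells) * (1 - 1 / n_cells))

# Smallest and largest observed counts, from the tables above
observed <- c(min = 14, max = 29)

# Standardized distance of each extreme from the expected count
z <- (observed - expected) / sd_cell
round(z, 2)
```

Both extremes lie within two standard deviations of the expected count, which is unremarkable for the smallest and largest of 45 cells.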

Visualization of Combinations

ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Count)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_gradient(low = "#f7fbff", high = "#08306b",
                      name = "Count") +
  labs(title = "Heatmap of Country-Rainfall Combinations",
       subtitle = "Darker tiles represent higher observation counts",
       x = "Country",
       y = "Rainfall Category") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

This heatmap visualizes the frequency of all country-rainfall combinations with darker tiles showing higher occurrence.

The heatmap shows how counts vary across country-rainfall pairs. No tile is empty, since every combination occurs at least once; the lightest tiles mark the least frequent pairs. Darker tiles may indicate stable country-climate relationships, while lighter ones may reflect high inter-annual precipitation variability or thinner sampling.

ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Probability)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = Count), size = 3, color = "white") +
  scale_fill_gradient(low = "#feedde", high = "#a63603",
                      name = "Probability",
                      labels = percent_format(accuracy = 0.1)) +
  labs(title = "Probability Distribution of Country-Rainfall Combinations",
       subtitle = "Numbers show observation counts; color intensity shows probability",
       x = "Country",
       y = "Rainfall Category") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

This probability-weighted heatmap shows both counts and relative likelihood of each combination.

By overlaying counts on a probability gradient, this visualization highlights which combinations dominate the dataset and which are comparatively rare. High-probability combinations (dark orange) indicate typical pairings, while low-probability ones (pale orange) may signal emerging climate trends or data quality issues that require validation.

top_bottom <- bind_rows(
  combinations_df %>% 
    arrange(desc(Count)) %>% 
    head(10) %>% 
    mutate(Category = "Top 10 Most Common"),
  combinations_df %>% 
    arrange(Count) %>% 
    filter(Count > 0) %>%
    head(10) %>% 
    mutate(Category = "Top 10 Rarest")
) %>%
  mutate(Combo = paste(Country, Rainfall_Cat, sep = " - "))

ggplot(top_bottom, aes(x = reorder(Combo, Count), y = Count, fill = Category)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Top 10 Most Common" = "#2166ac", 
                               "Top 10 Rarest" = "#b2182b")) +
  labs(title = "Extreme Combinations: Most Common vs Rarest",
       x = "Country-Rainfall Combination",
       y = "Observation Count",
       fill = "") +
  theme(legend.position = "bottom")

This bar plot contrasts the top 10 most common combinations against the 10 rarest non-zero combinations.

The contrast between common and rare combinations shows some spread in the distribution, from 14 to 29 observations per pair. The most common combinations probably represent baseline climate conditions, while the rarest ones are candidates for deeper investigation.


6. Synthesis and Conclusions

Summary of Key Findings

1. Geographic Sampling Bias: Mexico is the rarest country (P = 0.055), slightly below the 6.7% expected under a uniform split. The Spearman correlation between sample size and temperature CV was not significant (rho = -0.202, p = 0.47), so the hypothesis that under-sampled countries have more variable temperatures is not supported.

2. Hydrological State Distribution: High rainfall is the rarest state (P = 0.319). CO2 emissions show no significant difference between high and other rainfall states (p = 0.941), so the data give no evidence of a link between precipitation category and emissions.

3. Energy Transition Patterns: Advanced renewable adoption is the rarest stage (P = 0.323). Extreme weather scores are nearly identical across stages (3023.6 to 3033.9), and the ANOVA on the summary table could not produce a significance test, so no relationship between adoption stage and weather extremes was established.

4. Combination Analysis: All 45 country-rainfall pairs occur, which indicates either comprehensive temporal coverage or synthetic data balancing. The rarest combinations are candidates for closer inspection; the most common may reflect stable climate patterns.


Session Information

sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0    knitr_1.51      lubridate_1.9.4 forcats_1.0.1  
##  [5] stringr_1.6.0   dplyr_1.1.4     purrr_1.2.1     readr_2.1.6    
##  [9] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
##  [5] jquerylib_0.1.4    yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
##  [9] labeling_0.4.3     generics_0.1.4     bslib_0.10.0       pillar_1.11.1     
## [13] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        stringi_1.8.7     
## [17] cachem_1.1.0       xfun_0.56          sass_0.4.10        S7_0.2.1          
## [21] otel_0.2.0         timechange_0.3.0   cli_3.6.5          withr_3.0.2       
## [25] magrittr_2.0.4     digest_0.6.39      grid_4.5.2         rstudioapi_0.18.0 
## [29] hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.1        evaluate_1.0.5    
## [33] glue_1.8.0         farver_2.1.2       rmarkdown_2.30     tools_4.5.2       
## [37] pkgconfig_2.0.3    htmltools_0.5.9