02-02-2026
This analysis examines the distributional characteristics of my global climate dataset using probability-based anomaly detection. I grouped the data across several categorical dimensions and calculated group probabilities to identify rare combinations, which may represent unusual environmental states or reporting patterns.
library(tidyverse)
library(knitr)
library(scales)
theme_set(theme_minimal(base_size = 12))
df <- read.csv("climate_change_dataset.csv")
str(df)
## 'data.frame': 1000 obs. of 10 variables:
## $ Year : int 2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
## $ Country : chr "UK" "USA" "France" "Argentina" ...
## $ Avg.Temperature...C. : num 8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
## $ CO2.Emissions..Tons.Capita.: num 9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
## $ Sea.Level.Rise..mm. : num 3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
## $ Rainfall..mm. : int 1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
## $ Population : int 530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
## $ Renewable.Energy.... : num 20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
## $ Extreme.Weather.Events : int 14 8 9 7 4 12 10 1 4 5 ...
## $ Forest.Area.... : num 59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
summary(df)
## Year Country Avg.Temperature...C.
## Min. :2000 Length:1000 Min. : 5.00
## 1st Qu.:2005 Class :character 1st Qu.:12.18
## Median :2012 Mode :character Median :20.10
## Mean :2011 Mean :19.88
## 3rd Qu.:2018 3rd Qu.:27.23
## Max. :2023 Max. :34.90
## CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm.
## Min. : 0.500 Min. :1.00 Min. : 501
## 1st Qu.: 5.575 1st Qu.:2.00 1st Qu.:1099
## Median :10.700 Median :3.00 Median :1726
## Mean :10.426 Mean :3.01 Mean :1739
## 3rd Qu.:15.400 3rd Qu.:4.00 3rd Qu.:2362
## Max. :20.000 Max. :5.00 Max. :2999
## Population Renewable.Energy.... Extreme.Weather.Events
## Min. :3.661e+06 Min. : 5.10 Min. : 0.000
## 1st Qu.:3.436e+08 1st Qu.:16.10 1st Qu.: 3.000
## Median :7.131e+08 Median :27.15 Median : 8.000
## Mean :7.054e+08 Mean :27.30 Mean : 7.291
## 3rd Qu.:1.074e+09 3rd Qu.:38.92 3rd Qu.:11.000
## Max. :1.397e+09 Max. :50.00 Max. :14.000
## Forest.Area....
## Min. :10.10
## 1st Qu.:25.60
## Median :41.15
## Mean :40.57
## 3rd Qu.:55.80
## Max. :70.00
cat("\nMissing values per column:\n")
##
## Missing values per column:
colSums(is.na(df))
## Year Country
## 0 0
## Avg.Temperature...C. CO2.Emissions..Tons.Capita.
## 0 0
## Sea.Level.Rise..mm. Rainfall..mm.
## 0 0
## Population Renewable.Energy....
## 0 0
## Extreme.Weather.Events Forest.Area....
## 0 0
cut() function probability analysis
df <- df %>%
mutate(
Rainfall_Cat = cut(Rainfall..mm.,
breaks = 3,
labels = c("Low", "Medium", "High")),
Renewable_Cat = cut(Renewable.Energy....,
breaks = 3,
labels = c("Early-Stage", "Transitioning", "Advanced")),
Temp_Cat = cut(Avg.Temperature...C.,
breaks = 3,
labels = c("Cool", "Moderate", "Warm")),
CO2_Cat = cut(CO2.Emissions..Tons.Capita.,
breaks = 3,
labels = c("Low-Emissions", "Medium-Emissions", "High-Emissions"))
)
head(df, 10) %>% kable(caption = "Sample of Preprocessed Data")
| Year | Country | Avg.Temperature...C. | CO2.Emissions..Tons.Capita. | Sea.Level.Rise..mm. | Rainfall..mm. | Population | Renewable.Energy.... | Extreme.Weather.Events | Forest.Area.... | Rainfall_Cat | Renewable_Cat | Temp_Cat | CO2_Cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | UK | 8.9 | 9.3 | 3.1 | 1441 | 530911230 | 20.4 | 14 | 59.8 | Medium | Transitioning | Cool | Medium-Emissions |
| 2019 | USA | 31.0 | 4.8 | 4.2 | 2407 | 107364344 | 49.2 | 8 | 31.0 | High | Advanced | Warm | Low-Emissions |
| 2014 | France | 33.9 | 2.8 | 2.2 | 1241 | 441101758 | 33.3 | 9 | 35.5 | Low | Transitioning | Warm | Low-Emissions |
| 2010 | Argentina | 5.9 | 1.8 | 3.2 | 1892 | 1069669579 | 23.7 | 7 | 17.7 | Medium | Transitioning | Cool | Low-Emissions |
| 2007 | Germany | 26.9 | 5.6 | 2.4 | 1743 | 124079175 | 12.5 | 4 | 17.4 | Medium | Early-Stage | Warm | Low-Emissions |
| 2020 | China | 32.3 | 1.4 | 2.7 | 2100 | 1202028857 | 49.4 | 12 | 47.2 | Medium | Advanced | Warm | Low-Emissions |
| 2006 | Argentina | 30.7 | 11.6 | 3.9 | 1755 | 586706107 | 41.9 | 10 | 50.5 | Medium | Advanced | Warm | Medium-Emissions |
| 2018 | South Africa | 33.9 | 6.0 | 4.5 | 827 | 83947380 | 17.7 | 1 | 56.6 | Low | Early-Stage | Warm | Low-Emissions |
| 2022 | UK | 27.8 | 16.6 | 1.5 | 1966 | 980305187 | 8.2 | 4 | 43.4 | Medium | Early-Stage | Warm | High-Emissions |
| 2010 | Australia | 18.3 | 1.9 | 3.5 | 2599 | 849496137 | 7.5 | 5 | 48.7 | High | Early-Stage | Moderate | Low-Emissions |
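Since cut() with breaks = 3 splits each variable's range into three equal-width intervals, the category sizes depend on how the values are spread. Here is a minimal sketch comparing that with tercile (equal-count) bins. The rainfall vector is just a handful of values taken from the sample rows and summary above, and the tercile version is an alternative I am assuming for illustration, not something used in this analysis.

```r
# Illustrative values only: the dataset minimum and maximum plus the ten sample rows
rainfall <- c(501, 827, 1241, 1441, 1743, 1755, 1892, 1966, 2100, 2407, 2599, 2999)

# Equal-width bins: the range is split into three intervals of the same width
equal_width <- cut(rainfall,
                   breaks = 3,
                   labels = c("Low", "Medium", "High"))

# Tercile bins (assumed alternative): break points are the 1/3 and 2/3 quantiles
tercile_breaks <- quantile(rainfall, probs = c(0, 1/3, 2/3, 1))
tercile <- cut(rainfall,
               breaks = tercile_breaks,
               labels = c("Low", "Medium", "High"),
               include.lowest = TRUE)

# Compare how many observations land in each category under both schemes
print(table(equal_width))
print(table(tercile))
```

With equal-width bins the group probabilities reflect the shape of the distribution, while tercile bins would force them to be roughly equal by construction.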
This section analyzes which countries are most and least represented in the dataset and examines their average temperature profiles to identify geographic sampling biases.
df_1 <- df %>%
group_by(Country) %>%
summarise(
Avg_Temp = mean(Avg.Temperature...C., na.rm = TRUE),
SD_Temp = sd(Avg.Temperature...C., na.rm = TRUE),
Group_Count = n()
) %>%
mutate(
Probability = Group_Count / sum(Group_Count),
Tag = ifelse(Group_Count == min(Group_Count), "RARE_GEOGRAPHY", "Standard")
) %>%
arrange(Group_Count)
df_1 %>% kable(caption = "Country-Level Analysis: Probability Distribution",
digits = 3)
| Country | Avg_Temp | SD_Temp | Group_Count | Probability | Tag |
|---|---|---|---|---|---|
| Mexico | 20.696 | 9.621 | 55 | 0.055 | RARE_GEOGRAPHY |
| Australia | 19.449 | 7.829 | 57 | 0.057 | Standard |
| Germany | 20.289 | 8.041 | 61 | 0.061 | Standard |
| Japan | 20.454 | 9.132 | 63 | 0.063 | Standard |
| UK | 18.491 | 8.200 | 65 | 0.065 | Standard |
| France | 19.383 | 8.884 | 66 | 0.066 | Standard |
| Argentina | 19.299 | 8.330 | 67 | 0.067 | Standard |
| Brazil | 20.836 | 8.970 | 67 | 0.067 | Standard |
| Canada | 20.012 | 8.445 | 67 | 0.067 | Standard |
| China | 20.282 | 8.788 | 67 | 0.067 | Standard |
| India | 19.764 | 8.669 | 70 | 0.070 | Standard |
| South Africa | 20.738 | 9.170 | 73 | 0.073 | Standard |
| USA | 19.049 | 8.341 | 73 | 0.073 | Standard |
| Russia | 20.719 | 7.833 | 74 | 0.074 | Standard |
| Indonesia | 18.919 | 8.244 | 75 | 0.075 | Standard |
rarest_country <- df_1 %>% filter(Tag == "RARE_GEOGRAPHY")
most_common_country <- df_1 %>% filter(Group_Count == max(Group_Count))
cat("Rarest Country:", rarest_country$Country, "\n")
## Rarest Country: Mexico
cat("Probability:", round(rarest_country$Probability, 4), "\n")
## Probability: 0.055
cat("Count:", rarest_country$Group_Count, "\n\n")
## Count: 55
cat("Most Common Country:", most_common_country$Country, "\n")
## Most Common Country: Indonesia
cat("Probability:", round(most_common_country$Probability, 4), "\n")
## Probability: 0.075
cat("Count:", most_common_country$Group_Count, "\n")
## Count: 75
The rarest country has a probability of 0.055, meaning that if I were to randomly select a row from the dataset, there is only a 5.5% chance it belongs to Mexico. This falls below the uniform expectation (about 6.7% per country, or 1/15) and may point to a geographic sampling bias, although the gap is small. Countries with lower representation might also have higher temperature variance, which could suggest limited monitoring infrastructure in certain regions. That is just speculation, though.
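One way to check whether the deviation from a uniform distribution is larger than chance is a chi-square goodness-of-fit test on the country counts. This is a rough sketch using the counts from the table above; treating equal shares across the 15 countries as the null hypothesis is my assumption.

```r
# Observation counts per country, copied from the country-level table
country_counts <- c(
  Mexico = 55, Australia = 57, Germany = 61, Japan = 63, UK = 65,
  France = 66, Argentina = 67, Brazil = 67, Canada = 67, China = 67,
  India = 70, "South Africa" = 73, USA = 73, Russia = 74, Indonesia = 75
)

# With no `p` argument, chisq.test() tests against equal probabilities
gof <- chisq.test(country_counts)

# Expected count per country under uniform sampling (total / number of countries)
print(unname(gof$expected[1]))

# Test statistic, degrees of freedom, and p-value
print(gof)
```

A large p-value here would mean the country counts are compatible with uniform sampling, in which case the "rare" label is only relative.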
Testable Hypothesis: The countries with lower representation in the dataset have higher variance in their temperature measurements compared to the more frequently sampled countries.
df_1 <- df_1 %>%
mutate(Temp_CV = SD_Temp / abs(Avg_Temp))
cor_test <- cor.test(df_1$Group_Count, df_1$Temp_CV, method = "spearman")
cat("Spearman correlation between sample size and temperature CV:",
round(cor_test$estimate, 3), "\n")
## Spearman correlation between sample size and temperature CV: -0.202
cat("P-value:", format.pval(cor_test$p.value, digits = 3), "\n")
## P-value: 0.47
ggplot(df_1, aes(x = reorder(Country, Group_Count), y = Group_Count, fill = Tag)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("RARE_GEOGRAPHY" = "#d73027", "Standard" = "#4575b4")) +
labs(title = "Probability of Occurrence by Country",
subtitle = "Red indicates the rarest geographic group",
x = "Country",
y = "Observation Count") +
geom_hline(yintercept = mean(df_1$Group_Count), linetype = "dashed", color = "gray40") +
annotate("text", x = 1, y = mean(df_1$Group_Count),
label = "Mean Count", vjust = -0.5, hjust = -0.1, size = 3.5) +
theme(legend.position = "bottom")
This bar chart shows the observation counts per country with the rarest geography highlighted in red.
The rarest country sits below the average representation (55 observations against a mean of about 66.7), maybe due to data collection gaps. The dashed line marks the mean count, and the bars show some variation in geographic coverage. This uneven distribution may bias climate trend analyses toward over-represented regions.
This section examines the relationship between rainfall categories and CO2 emissions to identify which hydrological states are most anomalous.
df_2 <- df %>%
group_by(Rainfall_Cat) %>%
summarise(
Mean_CO2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
Median_CO2 = median(CO2.Emissions..Tons.Capita., na.rm = TRUE),
SD_CO2 = sd(CO2.Emissions..Tons.Capita., na.rm = TRUE),
Group_Count = n()
) %>%
mutate(
Probability = Group_Count / sum(Group_Count),
Tag = ifelse(Group_Count == min(Group_Count), "RARE_HYDRO_STATE", "Common")
) %>%
arrange(Probability)
df_2 %>% kable(caption = "Hydrological State Analysis: Rainfall vs CO2 Emissions",
digits = 3)
| Rainfall_Cat | Mean_CO2 | Median_CO2 | SD_CO2 | Group_Count | Probability | Tag |
|---|---|---|---|---|---|---|
| High | 10.445 | 10.8 | 5.726 | 319 | 0.319 | RARE_HYDRO_STATE |
| Low | 10.212 | 10.5 | 5.527 | 335 | 0.335 | Common |
| Medium | 10.615 | 10.8 | 5.605 | 346 | 0.346 | Common |
rarest_hydro <- df_2 %>% filter(Tag == "RARE_HYDRO_STATE")
cat("Rarest Hydrological State:", as.character(rarest_hydro$Rainfall_Cat), "\n")
## Rarest Hydrological State: High
cat("Probability:", round(rarest_hydro$Probability, 4), "\n")
## Probability: 0.319
cat("Mean CO2 Emissions:", round(rarest_hydro$Mean_CO2, 2), "Tons/Capita\n")
## Mean CO2 Emissions: 10.45 Tons/Capita
So based on this, the rarest hydrological state has a probability of 0.319, meaning 31.9% of observations fall into this rainfall category. I expected extreme precipitation to be rare, but the three categories are nearly equally populated (0.319, 0.335, 0.346), so High rainfall is only marginally the rarest. Mean CO2 emissions are also similar across categories (about 10.2 to 10.6 tons per capita), so any linkage between hydrological extremes and industrial activity or energy consumption patterns still needs to be tested.
Testable Hypothesis: High rainfall states are associated with significantly different CO2 emission levels compared to low and medium rainfall states, which can be quantified with a t-test comparing group means.
high_rainfall <- df %>% filter(Rainfall_Cat == "High")
other_rainfall <- df %>% filter(Rainfall_Cat != "High")
t_test_result <- t.test(high_rainfall$CO2.Emissions..Tons.Capita.,
other_rainfall$CO2.Emissions..Tons.Capita.)
cat("T-test comparing High vs Other Rainfall States:\n")
## T-test comparing High vs Other Rainfall States:
cat("Mean CO2 (High):", round(mean(high_rainfall$CO2.Emissions..Tons.Capita., na.rm = TRUE), 2), "\n")
## Mean CO2 (High): 10.45
cat("Mean CO2 (Other):", round(mean(other_rainfall$CO2.Emissions..Tons.Capita., na.rm = TRUE), 2), "\n")
## Mean CO2 (Other): 10.42
cat("P-value:", format.pval(t_test_result$p.value, digits = 3), "\n")
## P-value: 0.941
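A p-value alone does not say how big the difference is, so an effect size can help alongside the t-test. Below is a small sketch of Cohen's d with a pooled standard deviation. The cohens_d helper and the toy vectors are hypothetical, only there to show the calculation.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation of two groups
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled variance weights each group's variance by its degrees of freedom
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy vectors (not from the dataset) to show the calculation
group_a <- c(1, 2, 3)
group_b <- c(2, 3, 4)

# Both groups have variance 1, so d is simply the difference in means
d_toy <- cohens_d(group_a, group_b)
print(d_toy)
```

With the real data this would be called on high_rainfall$CO2.Emissions..Tons.Capita. and other_rainfall$CO2.Emissions..Tons.Capita. from the chunk above.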
ggplot(df_2, aes(x = Rainfall_Cat, y = Group_Count, fill = Tag)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#d73027", "Common" = "#91bfdb")) +
labs(title = "Hydrological State Distribution",
subtitle = "Frequency of rainfall categories in the dataset",
x = "Rainfall Category",
y = "Number of Observations") +
geom_text(aes(label = paste0("P=", round(Probability, 3))),
vjust = -0.5, size = 4) +
theme(legend.position = "bottom")
This bar chart shows the distribution of observations across rainfall categories with probabilities labeled.
This visualization shows that the High rainfall state occurs slightly less often than the Low and Medium states. The rarest hydrological state (marked in red) could represent climatological extremes that may be early warning indicators of climate change, although the differences in frequency are small.
This section examines the relationship between renewable energy adoption stages and extreme weather frequency to understand energy-climate interactions.
df_3 <- df %>%
group_by(Renewable_Cat) %>%
summarise(
Max_Temp = max(Avg.Temperature...C., na.rm = TRUE),
Max_Rainfall = max(Rainfall..mm., na.rm = TRUE),
Extreme_Weather_Score = max(Avg.Temperature...C., na.rm = TRUE) +
max(Rainfall..mm., na.rm = TRUE),
Group_Count = n()
) %>%
mutate(
Probability = Group_Count / sum(Group_Count),
Tag = ifelse(Group_Count == min(Group_Count), "RARE_ENERGY_STATE", "Common")
) %>%
arrange(Probability)
df_3 %>% kable(caption = "Energy Transition Analysis: Renewable Adoption vs Weather Extremes",
digits = 3)
| Renewable_Cat | Max_Temp | Max_Rainfall | Extreme_Weather_Score | Group_Count | Probability | Tag |
|---|---|---|---|---|---|---|
| Advanced | 34.6 | 2989 | 3023.6 | 323 | 0.323 | RARE_ENERGY_STATE |
| Early-Stage | 34.9 | 2999 | 3033.9 | 334 | 0.334 | Common |
| Transitioning | 34.9 | 2995 | 3029.9 | 343 | 0.343 | Common |
rarest_energy <- df_3 %>% filter(Tag == "RARE_ENERGY_STATE")
cat("Rarest Energy Transition State:", as.character(rarest_energy$Renewable_Cat), "\n")
## Rarest Energy Transition State: Advanced
cat("Probability:", round(rarest_energy$Probability, 4), "\n")
## Probability: 0.323
cat("Extreme Weather Score:", round(rarest_energy$Extreme_Weather_Score, 2), "\n")
## Extreme Weather Score: 3023.6
The rarest energy state has a probability of 0.323, representing 32.3% of observations. Advanced renewable energy adoption is the least common stage, though only slightly (0.323 against 0.334 and 0.343). The extreme weather score may hint at relationships between energy infrastructure development and climate vulnerability, though causality requires further investigation.
Testable Hypothesis: Regions in advanced renewable energy adoption stages experience more extreme weather events compared to early-stage regions (quantifiable by comparing extreme weather scores across categories using ANOVA).
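# Note: df_3 has only one summary row per category, so this model has no residual
# degrees of freedom and the summary cannot report an F statistic or p-value.
anova_result <- aov(Extreme_Weather_Score ~ Renewable_Cat, data = df_3)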
cat("ANOVA testing weather extremes across energy transition states:\n")
## ANOVA testing weather extremes across energy transition states:
print(summary(anova_result))
## Df Sum Sq Mean Sq
## Renewable_Cat 2 53.93 26.96
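The ANOVA output above has no F value because the model was fitted on one summary row per category. A possible way around this is to fit the model on row-level data instead. The sketch below uses simulated data with the same column names as my dataset; the toy data frame and the choice of Extreme.Weather.Events as the response are assumptions for illustration, not results from the real data.

```r
# Simulated row-level data (hypothetical): 30 observations per adoption stage
set.seed(42)
toy <- data.frame(
  Renewable_Cat = factor(
    rep(c("Early-Stage", "Transitioning", "Advanced"), each = 30),
    levels = c("Early-Stage", "Transitioning", "Advanced")
  ),
  Extreme.Weather.Events = sample(0:14, 90, replace = TRUE)
)

# Row-level one-way ANOVA: residual variation exists, so an F test is possible
row_fit <- aov(Extreme.Weather.Events ~ Renewable_Cat, data = toy)
anova_table <- summary(row_fit)[[1]]
print(anova_table)

# Collapsing to one value per group first leaves nothing to estimate the error from
collapsed <- aggregate(Extreme.Weather.Events ~ Renewable_Cat, data = toy, FUN = max)
collapsed_fit <- aov(Extreme.Weather.Events ~ Renewable_Cat, data = collapsed)

# Residual degrees of freedom: 87 for the row-level fit, 0 for the collapsed fit
print(c(row_level = df.residual(row_fit), collapsed = df.residual(collapsed_fit)))
```

On the real data the same idea would mean passing df rather than df_3 to aov().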
ggplot(df_3, aes(x = Renewable_Cat, y = Extreme_Weather_Score, fill = Tag)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("RARE_ENERGY_STATE" = "#d73027", "Common" = "#1a9850")) +
labs(title = "Energy Transition Stages and Weather Extremes",
subtitle = "Relationship between renewable adoption and extreme weather",
x = "Renewable Energy Adoption Stage",
y = "Extreme Weather Score (Max Temp + Max Rainfall)") +
geom_text(aes(label = paste0("n=", Group_Count)),
vjust = -0.5, size = 4) +
theme(legend.position = "bottom")
This bar chart displays extreme weather scores across renewable energy adoption stages.
Counterintuitively, the Early-Stage and Transitioning states show slightly higher extreme weather scores (3033.9 and 3029.9) than the Advanced state (3023.6). Maybe it is because climate vulnerability drove energy transitions, or because renewable infrastructure faced deployment challenges in extreme conditions. It could be either reason. Since the score adds a maximum temperature to a maximum rainfall, it is dominated by the rainfall maximum, and the differences between stages are small. Advanced adoption is the rarest stage (marked in red), suggesting that the global energy transition remains incomplete. This pattern warrants investigation into whether climate impacts accelerate or hinder clean energy deployment.
This section identifies all possible combinations of country and rainfall categories, detects missing combinations, and analyzes the most and least common pairings.
all_countries <- unique(df$Country)
all_rainfall <- c("Low", "Medium", "High")
all_combinations <- expand.grid(
Country = all_countries,
Rainfall_Cat = all_rainfall,
stringsAsFactors = FALSE
)
combinations_df <- df %>%
group_by(Country, Rainfall_Cat) %>%
summarise(Count = n(), .groups = "drop") %>%
right_join(all_combinations, by = c("Country", "Rainfall_Cat")) %>%
mutate(Count = replace_na(Count, 0)) %>%
mutate(Probability = Count / sum(Count[Count > 0]))
combinations_df %>%
arrange(desc(Count)) %>%
head(10) %>%
kable(caption = "Top 10 Most Common Country-Rainfall Combinations", digits = 4)
| Country | Rainfall_Cat | Count | Probability |
|---|---|---|---|
| India | Medium | 29 | 0.029 |
| Russia | Low | 29 | 0.029 |
| UK | Low | 29 | 0.029 |
| Indonesia | Medium | 28 | 0.028 |
| South Africa | Low | 28 | 0.028 |
| France | High | 26 | 0.026 |
| Indonesia | High | 26 | 0.026 |
| Russia | Medium | 26 | 0.026 |
| USA | Medium | 26 | 0.026 |
| Brazil | High | 25 | 0.025 |
missing_combos <- combinations_df %>%
filter(Count == 0) %>%
select(Country, Rainfall_Cat)
cat("Number of missing combinations:", nrow(missing_combos), "\n\n")
## Number of missing combinations: 0
if(nrow(missing_combos) > 0) {
cat("Missing combinations:\n")
print(missing_combos)
}
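The same completeness check can also be done in base R, because table() on a factor keeps empty levels as zero counts. This is a minimal sketch on a tiny made-up data frame (the toy rows are hypothetical), just to show how missing combinations would surface if there were any.

```r
# Hypothetical toy data: three countries, not every rainfall category observed
toy <- data.frame(
  Country = c("UK", "UK", "USA", "USA", "USA", "France"),
  Rainfall_Cat = factor(c("Low", "High", "Low", "Medium", "Medium", "High"),
                        levels = c("Low", "Medium", "High"))
)

# Cross-tabulate; unobserved pairs appear with a count of zero
combo_counts <- as.data.frame(
  table(Country = toy$Country, Rainfall_Cat = toy$Rainfall_Cat),
  responseName = "Count"
)

# Probability of each combination relative to all observations
combo_counts$Probability <- combo_counts$Count / sum(combo_counts$Count)

# Combinations that never occur in the toy data
missing_toy <- combo_counts[combo_counts$Count == 0, c("Country", "Rainfall_Cat")]
print(missing_toy)
```

I kept the expand.grid() and right_join() version in the analysis because it fits the tidyverse pipeline, but both approaches should give the same grid.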
most_common <- combinations_df %>%
filter(Count > 0) %>%
arrange(desc(Count)) %>%
head(10)
least_common <- combinations_df %>%
filter(Count > 0) %>%
arrange(Count) %>%
head(10)
cat("\nMost Common Combinations:\n")
##
## Most Common Combinations:
most_common %>%
select(Country, Rainfall_Cat, Count, Probability) %>%
kable(digits = 4)
| Country | Rainfall_Cat | Count | Probability |
|---|---|---|---|
| India | Medium | 29 | 0.029 |
| Russia | Low | 29 | 0.029 |
| UK | Low | 29 | 0.029 |
| Indonesia | Medium | 28 | 0.028 |
| South Africa | Low | 28 | 0.028 |
| France | High | 26 | 0.026 |
| Indonesia | High | 26 | 0.026 |
| Russia | Medium | 26 | 0.026 |
| USA | Medium | 26 | 0.026 |
| Brazil | High | 25 | 0.025 |
cat("\nLeast Common Combinations:\n")
##
## Least Common Combinations:
least_common %>%
select(Country, Rainfall_Cat, Count, Probability) %>%
kable(digits = 4)
| Country | Rainfall_Cat | Count | Probability |
|---|---|---|---|
| UK | High | 14 | 0.014 |
| France | Medium | 16 | 0.016 |
| India | High | 16 | 0.016 |
| Australia | Low | 17 | 0.017 |
| Mexico | Low | 17 | 0.017 |
| Germany | Low | 18 | 0.018 |
| Australia | High | 19 | 0.019 |
| Germany | High | 19 | 0.019 |
| Japan | Low | 19 | 0.019 |
| Mexico | Medium | 19 | 0.019 |
if(nrow(missing_combos) > 0) {
cat("The dataset is missing", nrow(missing_combos), "country-rainfall combinations.\n")
cat("This suggests incomplete temporal coverage, climate constraints preventing certain rainfall states, or data collection gaps during specific hydrological conditions.\n")
} else {
cat("Every country experienced all three rainfall categories, suggesting comprehensive temporal coverage or synthetic data balancing.\n")
}
## Every country experienced all three rainfall categories, suggesting comprehensive temporal coverage or synthetic data balancing.
The most frequent combinations may represent countries with stable rainfall patterns and extensive data collection. These are probably geographic regions where certain rainfall levels are expected and monitoring infrastructure is well established. The most common pairings may show typical climate zones with consistent hydrological patterns across multiple observation periods.
The rarest combinations may represent anomalies that are statistically unlikely given typical climate patterns. These may include extreme weather events occurring infrequently, transitions between climate states during unusual atmospheric conditions, or geographic outliers experiencing atypical precipitation.
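To get a feel for how unusual the rarest and most common pairings really are, the counts can be compared with what pure chance would give. This is a back-of-the-envelope sketch that assumes every one of the 45 country-rainfall cells is equally likely (my simplifying assumption) and uses the minimum and maximum counts from the tables above.

```r
# Dataset size and number of country-rainfall cells (15 countries x 3 categories)
n_obs <- 1000
n_cells <- 15 * 3

# Assumed null model: every cell is equally likely
p_cell <- 1 / n_cells

# Expected count and binomial standard deviation for a single cell
expected_count <- n_obs * p_cell
sd_count <- sqrt(n_obs * p_cell * (1 - p_cell))

# Smallest and largest observed counts (UK-High and the three-way tie at the top)
observed <- c(rarest = 14, most_common = 29)

# Standardized distance of each observed count from the expected count
z_scores <- (observed - expected_count) / sd_count
print(round(z_scores, 2))
```

Both extremes sit within about two standard deviations of the expected count, which is not surprising when 45 cells are examined at once, so these pairings may be sampling noise rather than real anomalies.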
ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Count)) +
geom_tile(color = "white", linewidth = 0.5) +
scale_fill_gradient(low = "#f7fbff", high = "#08306b",
name = "Count") +
labs(title = "Heatmap of Country-Rainfall Combinations",
subtitle = "Darker tiles represent higher observation counts",
x = "Country",
y = "Rainfall Category") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())
This heatmap visualizes the frequency of all country-rainfall combinations with darker tiles showing higher occurrence.
The heatmap shows how observations are spread across country-rainfall pairs. White or very light tiles would represent missing or extremely rare combinations, which could indicate geographic constraints on hydrological variability; here every pair has at least 14 observations. Darker tiles may reflect stable climate-country relationships, while lighter tiles may reflect regions with high interannual precipitation variability or incomplete sampling.
ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Probability)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = Count), size = 3, color = "white") +
scale_fill_gradient(low = "#feedde", high = "#a63603",
name = "Probability",
labels = percent_format(accuracy = 0.1)) +
labs(title = "Probability Distribution of Country-Rainfall Combinations",
subtitle = "Numbers show observation counts; color intensity shows probability",
x = "Country",
y = "Rainfall Category") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())
This probability weighted heatmap shows both counts and relative likelihood of each combination.
By overlaying counts on probability gradients, this visualization highlights which combinations dominate the dataset and which are the least frequent. High-probability combinations (dark orange-brown) are the typical cases, while low-probability ones (pale orange) may signal emerging climate trends or data quality issues requiring validation.
top_bottom <- bind_rows(
combinations_df %>%
arrange(desc(Count)) %>%
head(10) %>%
mutate(Category = "Top 10 Most Common"),
combinations_df %>%
arrange(Count) %>%
filter(Count > 0) %>%
head(10) %>%
mutate(Category = "Top 10 Rarest")
) %>%
mutate(Combo = paste(Country, Rainfall_Cat, sep = " - "))
ggplot(top_bottom, aes(x = reorder(Combo, Count), y = Count, fill = Category)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Top 10 Most Common" = "#2166ac",
"Top 10 Rarest" = "#b2182b")) +
labs(title = "Extreme Combinations: Most Common vs Rarest",
x = "Country-Rainfall Combination",
y = "Observation Count",
fill = "") +
theme(legend.position = "bottom")
This bar plot contrasts the top 10 most common combinations against the 10 rarest non-zero combinations.
The contrast between common and rare combinations shows there is a distributional skew. Most common combinations probably represent baseline climate conditions, while rarest ones flag anomalies requiring deeper investigation.
1. Geographic Sampling Bias: Mexico is the rarest country (P = 0.055), which may indicate under-sampling. Hypothesis testing showed no significant correlation between sample size and the temperature coefficient of variation (Spearman correlation = -0.202, p = 0.47), so the data do not support the idea that less-sampled countries have more variable temperature measurements.
2. Hydrological State Distribution: High rainfall is the rarest state (P = 0.319). CO2 emissions show no significant difference between high and other rainfall states (p = 0.941), so this analysis finds no evidence linking precipitation patterns to industrial activity.
3. Energy Transition Patterns: Advanced renewable adoption is the rarest (P = 0.323). Early-Stage and Transitioning states show slightly higher extreme weather scores than the Advanced state, raising questions about climate vulnerability’s role in energy infrastructure deployment. The ANOVA on group summaries had no residual degrees of freedom, so this difference was not formally tested.
4. Combination Analysis: Complete coverage across all country-rainfall pairs suggests comprehensive temporal sampling or synthetic data balancing. The rarest combinations may represent climatological anomalies; the most common may reflect stable climate patterns.
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.4.0 knitr_1.51 lubridate_1.9.4 forcats_1.0.1
## [5] stringr_1.6.0 dplyr_1.1.4 purrr_1.2.1 readr_2.1.6
## [9] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.2 tidyselect_1.2.1
## [5] jquerylib_0.1.4 yaml_2.3.12 fastmap_1.2.0 R6_2.6.1
## [9] labeling_0.4.3 generics_0.1.4 bslib_0.10.0 pillar_1.11.1
## [13] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.1.7 stringi_1.8.7
## [17] cachem_1.1.0 xfun_0.56 sass_0.4.10 S7_0.2.1
## [21] otel_0.2.0 timechange_0.3.0 cli_3.6.5 withr_3.0.2
## [25] magrittr_2.0.4 digest_0.6.39 grid_4.5.2 rstudioapi_0.18.0
## [29] hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.1 evaluate_1.0.5
## [33] glue_1.8.0 farver_2.1.2 rmarkdown_2.30 tools_4.5.2
## [37] pkgconfig_2.0.3 htmltools_0.5.9