02/02/2026This analysis investigates the distributional characteristics of a global climate dataset using probability-based anomaly detection. By grouping data across multiple categorical dimensions and calculating group probabilities, we identify rare combinations that represent unusual environmental states or reporting patterns. Our investigation reveals critical insights about global climate patterns, energy transitions, and geographic sampling biases.
# Load required libraries
library(tidyverse)
library(knitr)
library(scales)
# Set theme for consistent visualization
theme_set(theme_minimal(base_size = 12))
# Load the climate change dataset
df <- read.csv("climate_change_dataset.csv")
# Display dataset structure
str(df)
## 'data.frame': 1000 obs. of 10 variables:
## $ Year : int 2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
## $ Country : chr "UK" "USA" "France" "Argentina" ...
## $ Avg.Temperature...C. : num 8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
## $ CO2.Emissions..Tons.Capita.: num 9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
## $ Sea.Level.Rise..mm. : num 3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
## $ Rainfall..mm. : int 1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
## $ Population : int 530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
## $ Renewable.Energy.... : num 20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
## $ Extreme.Weather.Events : int 14 8 9 7 4 12 10 1 4 5 ...
## $ Forest.Area.... : num 59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
# Summary statistics
summary(df)
## Year Country Avg.Temperature...C.
## Min. :2000 Length:1000 Min. : 5.00
## 1st Qu.:2005 Class :character 1st Qu.:12.18
## Median :2012 Mode :character Median :20.10
## Mean :2011 Mean :19.88
## 3rd Qu.:2018 3rd Qu.:27.23
## Max. :2023 Max. :34.90
## CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm.
## Min. : 0.500 Min. :1.00 Min. : 501
## 1st Qu.: 5.575 1st Qu.:2.00 1st Qu.:1099
## Median :10.700 Median :3.00 Median :1726
## Mean :10.426 Mean :3.01 Mean :1739
## 3rd Qu.:15.400 3rd Qu.:4.00 3rd Qu.:2362
## Max. :20.000 Max. :5.00 Max. :2999
## Population Renewable.Energy.... Extreme.Weather.Events
## Min. :3.661e+06 Min. : 5.10 Min. : 0.000
## 1st Qu.:3.436e+08 1st Qu.:16.10 1st Qu.: 3.000
## Median :7.131e+08 Median :27.15 Median : 8.000
## Mean :7.054e+08 Mean :27.30 Mean : 7.291
## 3rd Qu.:1.074e+09 3rd Qu.:38.92 3rd Qu.:11.000
## Max. :1.397e+09 Max. :50.00 Max. :14.000
## Forest.Area....
## Min. :10.10
## 1st Qu.:25.60
## Median :41.15
## Mean :40.57
## 3rd Qu.:55.80
## Max. :70.00
# Check for missing values
cat("\nMissing values per column:\n")
##
## Missing values per column:
colSums(is.na(df))
## Year Country
## 0 0
## Avg.Temperature...C. CO2.Emissions..Tons.Capita.
## 0 0
## Sea.Level.Rise..mm. Rainfall..mm.
## 0 0
## Population Renewable.Energy....
## 0 0
## Extreme.Weather.Events Forest.Area....
## 0 0
To enable probability analysis, we discretize continuous variables
into categorical bins using the cut() function.
# Create categorical bins for analysis
df <- df %>%
mutate(
# Categorize rainfall into three levels
Rainfall_Cat = cut(Rainfall..mm.,
breaks = 3,
labels = c("Low", "Medium", "High")),
# Categorize renewable energy adoption
Renewable_Cat = cut(Renewable.Energy....,
breaks = 3,
labels = c("Early-Stage", "Transitioning", "Advanced")),
# Categorize temperature
Temp_Cat = cut(Avg.Temperature...C.,
breaks = 3,
labels = c("Cool", "Moderate", "Warm")),
# Categorize CO2 emissions
CO2_Cat = cut(CO2.Emissions..Tons.Capita.,
breaks = 3,
labels = c("Low-Emissions", "Medium-Emissions", "High-Emissions"))
)
# Preview the transformed data
head(df, 10) %>% kable(caption = "Sample of Preprocessed Data")
| Year | Country | Avg.Temperature…C. | CO2.Emissions..Tons.Capita. | Sea.Level.Rise..mm. | Rainfall..mm. | Population | Renewable.Energy…. | Extreme.Weather.Events | Forest.Area…. | Rainfall_Cat | Renewable_Cat | Temp_Cat | CO2_Cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | UK | 8.9 | 9.3 | 3.1 | 1441 | 530911230 | 20.4 | 14 | 59.8 | Medium | Transitioning | Cool | Medium-Emissions |
| 2019 | USA | 31.0 | 4.8 | 4.2 | 2407 | 107364344 | 49.2 | 8 | 31.0 | High | Advanced | Warm | Low-Emissions |
| 2014 | France | 33.9 | 2.8 | 2.2 | 1241 | 441101758 | 33.3 | 9 | 35.5 | Low | Transitioning | Warm | Low-Emissions |
| 2010 | Argentina | 5.9 | 1.8 | 3.2 | 1892 | 1069669579 | 23.7 | 7 | 17.7 | Medium | Transitioning | Cool | Low-Emissions |
| 2007 | Germany | 26.9 | 5.6 | 2.4 | 1743 | 124079175 | 12.5 | 4 | 17.4 | Medium | Early-Stage | Warm | Low-Emissions |
| 2020 | China | 32.3 | 1.4 | 2.7 | 2100 | 1202028857 | 49.4 | 12 | 47.2 | Medium | Advanced | Warm | Low-Emissions |
| 2006 | Argentina | 30.7 | 11.6 | 3.9 | 1755 | 586706107 | 41.9 | 10 | 50.5 | Medium | Advanced | Warm | Medium-Emissions |
| 2018 | South Africa | 33.9 | 6.0 | 4.5 | 827 | 83947380 | 17.7 | 1 | 56.6 | Low | Early-Stage | Warm | Low-Emissions |
| 2022 | UK | 27.8 | 16.6 | 1.5 | 1966 | 980305187 | 8.2 | 4 | 43.4 | Medium | Early-Stage | Warm | High-Emissions |
| 2010 | Australia | 18.3 | 1.9 | 3.5 | 2599 | 849496137 | 7.5 | 5 | 48.7 | High | Early-Stage | Moderate | Low-Emissions |
Analyze which countries are most and least represented in the dataset, and examine their average temperature profiles to identify geographic sampling biases.
# Group by Country and calculate statistics
df_1 <- df %>%
group_by(Country) %>%
summarise(
Avg_Temp = mean(Avg.Temperature...C., na.rm = TRUE),
SD_Temp = sd(Avg.Temperature...C., na.rm = TRUE),
Group_Count = n()
) %>%
mutate(
Probability = Group_Count / sum(Group_Count),
Tag = ifelse(Group_Count == min(Group_Count), "RARE_GEOGRAPHY", "Standard")
) %>%
arrange(Group_Count)
# Display the results
df_1 %>% kable(caption = "Country-Level Analysis: Probability Distribution",
digits = 3)
| Country | Avg_Temp | SD_Temp | Group_Count | Probability | Tag |
|---|---|---|---|---|---|
| Mexico | 20.696 | 9.621 | 55 | 0.055 | RARE_GEOGRAPHY |
| Australia | 19.449 | 7.829 | 57 | 0.057 | Standard |
| Germany | 20.289 | 8.041 | 61 | 0.061 | Standard |
| Japan | 20.454 | 9.132 | 63 | 0.063 | Standard |
| UK | 18.491 | 8.200 | 65 | 0.065 | Standard |
| France | 19.383 | 8.884 | 66 | 0.066 | Standard |
| Argentina | 19.299 | 8.330 | 67 | 0.067 | Standard |
| Brazil | 20.836 | 8.970 | 67 | 0.067 | Standard |
| Canada | 20.012 | 8.445 | 67 | 0.067 | Standard |
| China | 20.282 | 8.788 | 67 | 0.067 | Standard |
| India | 19.764 | 8.669 | 70 | 0.070 | Standard |
| South Africa | 20.738 | 9.170 | 73 | 0.073 | Standard |
| USA | 19.049 | 8.341 | 73 | 0.073 | Standard |
| Russia | 20.719 | 7.833 | 74 | 0.074 | Standard |
| Indonesia | 18.919 | 8.244 | 75 | 0.075 | Standard |
Key Findings:
# Identify the rarest country
rarest_country <- df_1 %>% filter(Tag == "RARE_GEOGRAPHY")
most_common_country <- df_1 %>% filter(Group_Count == max(Group_Count))
cat("Rarest Country:", rarest_country$Country, "\n")
## Rarest Country: Mexico
cat("Probability:", round(rarest_country$Probability, 4), "\n")
## Probability: 0.055
cat("Count:", rarest_country$Group_Count, "\n\n")
## Count: 55
cat("Most Common Country:", most_common_country$Country, "\n")
## Most Common Country: Indonesia
cat("Probability:", round(most_common_country$Probability, 4), "\n")
## Probability: 0.075
cat("Count:", most_common_country$Group_Count, "\n")
## Count: 75
Significance:
The rarest country has a probability of 0.055, meaning if we randomly select a row from the dataset, there is only a 5.5% chance it belongs to Mexico. In a perfectly balanced dataset with 15 countries, we would expect each country to have a probability of approximately 0.067 (6.7%). The deviation from this uniform distribution indicates geographic sampling bias.
Testable Hypothesis:
H₁: Countries with lower representation in the dataset have higher variance in their temperature measurements compared to more frequently sampled countries.
This hypothesis is quantifiable because we can calculate the coefficient of variation (CV = SD/Mean) for each country and test whether there’s a negative correlation between sample size and temperature variance using Spearman’s rank correlation.
# Calculate coefficient of variation
df_1 <- df_1 %>%
mutate(Temp_CV = SD_Temp / abs(Avg_Temp))
# Test correlation between sample size and variance
cor_test <- cor.test(df_1$Group_Count, df_1$Temp_CV, method = "spearman")
cat("Spearman correlation between sample size and temperature CV:",
round(cor_test$estimate, 3), "\n")
## Spearman correlation between sample size and temperature CV: -0.202
cat("P-value:", format.pval(cor_test$p.value, digits = 3), "\n")
## P-value: 0.47
Further Questions:
# Visualization 1: Bar chart of country frequencies
ggplot(df_1, aes(x = reorder(Country, Group_Count), y = Group_Count, fill = Tag)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("RARE_GEOGRAPHY" = "#d73027", "Standard" = "#4575b4")) +
labs(title = "Probability of Occurrence by Country",
subtitle = "Red indicates the rarest geographic group",
x = "Country",
y = "Observation Count") +
geom_hline(yintercept = mean(df_1$Group_Count), linetype = "dashed", color = "gray40") +
annotate("text", x = 1, y = mean(df_1$Group_Count),
label = "Mean Count", vjust = -0.5, hjust = -0.1, size = 3.5) +
theme(legend.position = "bottom")
Examine the relationship between rainfall categories and CO2 emissions to identify which hydrological states are most anomalous.
# Group by Rainfall Category
df_2 <- df %>%
group_by(Rainfall_Cat) %>%
summarise(
Mean_CO2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
Median_CO2 = median(CO2.Emissions..Tons.Capita., na.rm = TRUE),
SD_CO2 = sd(CO2.Emissions..Tons.Capita., na.rm = TRUE),
Group_Count = n()
) %>%
mutate(
Probability = Group_Count / sum(Group_Count),
Tag = ifelse(Group_Count == min(Group_Count), "RARE_HYDRO_STATE", "Common")
) %>%
arrange(Probability)
# Display results
df_2 %>% kable(caption = "Hydrological State Analysis: Rainfall vs CO2 Emissions",
digits = 3)
| Rainfall_Cat | Mean_CO2 | Median_CO2 | SD_CO2 | Group_Count | Probability | Tag |
|---|---|---|---|---|---|---|
| High | 10.445 | 10.8 | 5.726 | 319 | 0.319 | RARE_HYDRO_STATE |
| Low | 10.212 | 10.5 | 5.527 | 335 | 0.335 | Common |
| Medium | 10.615 | 10.8 | 5.605 | 346 | 0.346 | Common |
Key Findings:
rarest_hydro <- df_2 %>% filter(Tag == "RARE_HYDRO_STATE")
cat("Rarest Hydrological State:", as.character(rarest_hydro$Rainfall_Cat), "\n")
## Rarest Hydrological State: High
cat("Probability:", round(rarest_hydro$Probability, 4), "\n")
## Probability: 0.319
cat("Mean CO2 Emissions:", round(rarest_hydro$Mean_CO2, 2), "Tons/Capita\n")
## Mean CO2 Emissions: 10.45 Tons/Capita
Significance:
The High rainfall category represents the tail end of the probability distribution with P = 0.319. While the distribution is relatively balanced, this indicates that extreme precipitation states are statistically less common than moderate conditions. This aligns with climatological expectations where extreme weather events, by definition, occur less frequently.
Testable Hypothesis:
H₂: High rainfall states (>2000mm) are associated with lower CO2 emissions per capita due to reduced industrial activity during extreme weather periods.
This is quantifiable through a t-test comparing mean CO2 emissions between High rainfall and Low/Medium rainfall groups, or through a linear regression of rainfall amount on CO2 emissions.
# Test if High rainfall has significantly different CO2 emissions
high_rainfall_co2 <- df %>%
filter(Rainfall_Cat == "High") %>%
pull(CO2.Emissions..Tons.Capita.)
other_rainfall_co2 <- df %>%
filter(Rainfall_Cat != "High") %>%
pull(CO2.Emissions..Tons.Capita.)
t_test_result <- t.test(high_rainfall_co2, other_rainfall_co2)
cat("T-test comparing High rainfall vs Others:\n")
## T-test comparing High rainfall vs Others:
cat("Mean CO2 (High):", round(mean(high_rainfall_co2, na.rm = TRUE), 2), "\n")
## Mean CO2 (High): 10.45
cat("Mean CO2 (Others):", round(mean(other_rainfall_co2, na.rm = TRUE), 2), "\n")
## Mean CO2 (Others): 10.42
cat("P-value:", format.pval(t_test_result$p.value, digits = 3), "\n")
## P-value: 0.941
Further Questions:
# Visualization 2: Frequency and CO2 emissions by rainfall category
p1 <- ggplot(df_2, aes(x = Rainfall_Cat, y = Group_Count, fill = Tag)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#fc8d59", "Common" = "#91bfdb")) +
labs(title = "Frequency of Hydrological States",
x = "Rainfall Category",
y = "Observation Count") +
theme(legend.position = "none")
p2 <- ggplot(df_2, aes(x = Rainfall_Cat, y = Mean_CO2, fill = Tag)) +
geom_col() +
scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#fc8d59", "Common" = "#91bfdb")) +
labs(title = "Mean CO2 Emissions by Rainfall State",
x = "Rainfall Category",
y = "Mean CO2 (Tons/Capita)") +
theme(legend.position = "bottom")
# Display both plots
p1
p2
Analyze the relationship between renewable energy adoption levels and extreme weather events to understand energy transition patterns.
# Group by Renewable Energy Category
df_3 <- df %>%
group_by(Renewable_Cat) %>%
summarise(
Avg_Extreme_Events = mean(Extreme.Weather.Events, na.rm = TRUE),
Median_Extreme_Events = median(Extreme.Weather.Events, na.rm = TRUE),
SD_Extreme_Events = sd(Extreme.Weather.Events, na.rm = TRUE),
Avg_Forest = mean(Forest.Area...., na.rm = TRUE),
Group_Count = n()
) %>%
mutate(
Probability = Group_Count / sum(Group_Count),
Tag = ifelse(Group_Count == min(Group_Count), "RARE_ENERGY_LEVEL", "Standard")
) %>%
arrange(Group_Count)
# Display results
df_3 %>% kable(caption = "Energy Transition Analysis: Renewable Energy vs Extreme Weather",
digits = 3)
| Renewable_Cat | Avg_Extreme_Events | Median_Extreme_Events | SD_Extreme_Events | Avg_Forest | Group_Count | Probability | Tag |
|---|---|---|---|---|---|---|---|
| Advanced | 7.046 | 7.0 | 4.432 | 39.896 | 323 | 0.323 | RARE_ENERGY_LEVEL |
| Early-Stage | 7.120 | 7.5 | 4.503 | 41.267 | 334 | 0.334 | Standard |
| Transitioning | 7.688 | 8.0 | 4.318 | 40.532 | 343 | 0.343 | Standard |
Key Findings:
rarest_energy <- df_3 %>% filter(Tag == "RARE_ENERGY_LEVEL")
cat("Rarest Energy Transition State:", as.character(rarest_energy$Renewable_Cat), "\n")
## Rarest Energy Transition State: Advanced
cat("Probability:", round(rarest_energy$Probability, 4), "\n")
## Probability: 0.323
cat("Average Extreme Weather Events:", round(rarest_energy$Avg_Extreme_Events, 2), "\n")
## Average Extreme Weather Events: 7.05
Significance:
The Advanced renewable energy group is the least common (P = 0.323), reflecting the global reality of energy transition. Most observations fall into the “Transitioning” phase, while achieving advanced renewable integration (>35% renewable energy) is a rare, high-achievement state that only a subset of nations and time periods have reached.
Testable Hypothesis:
H₃: Countries in the “Advanced” renewable energy category have experienced fewer extreme weather events on average compared to “Early-Stage” countries, controlling for geographic location.
This hypothesis can be tested using ANOVA to compare mean extreme weather events across the three renewable energy categories, followed by post-hoc tests.
# ANOVA test for extreme weather events across renewable categories
anova_model <- aov(Extreme.Weather.Events ~ Renewable_Cat, data = df)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Renewable_Cat 2 83 41.59 2.131 0.119
## Residuals 997 19457 19.52
# Post-hoc Tukey test
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Extreme.Weather.Events ~ Renewable_Cat, data = df)
##
## $Renewable_Cat
## diff lwr upr p adj
## Transitioning-Early-Stage 0.56828617 -0.2288239 1.3653963 0.2159483
## Advanced-Early-Stage -0.07332085 -0.8825145 0.7358728 0.9753731
## Advanced-Transitioning -0.64160702 -1.4455645 0.1623504 0.1470164
Interesting Observation:
The data reveals a counterintuitive pattern: the “Transitioning” group has 7.69 average extreme weather events, which is higher than the “Advanced” group (7.05).
Further Questions:
# Visualization 3: Extreme weather by energy transition state
ggplot(df_3, aes(x = Renewable_Cat, y = Avg_Extreme_Events, fill = Tag)) +
geom_col() +
geom_errorbar(aes(ymin = Avg_Extreme_Events - SD_Extreme_Events,
ymax = Avg_Extreme_Events + SD_Extreme_Events),
width = 0.2) +
scale_fill_manual(values = c("RARE_ENERGY_LEVEL" = "#238b45", "Standard" = "#737373")) +
labs(title = "Extreme Weather Events by Energy Transition State",
subtitle = "Error bars represent ± 1 standard deviation",
x = "Renewable Energy Category",
y = "Mean Extreme Weather Events") +
theme(legend.position = "bottom")
# Additional visualization: Count by category
ggplot(df_3, aes(x = Renewable_Cat, y = Group_Count, fill = Tag)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("RARE_ENERGY_LEVEL" = "#238b45", "Standard" = "#737373")) +
labs(title = "Distribution of Energy Transition States",
subtitle = "Green indicates the rarest energy level",
x = "Renewable Energy Category",
y = "Observation Count") +
theme(legend.position = "bottom")
Identify which combinations of Country and Rainfall Category exist in the data, find missing combinations, and analyze the most/least common pairings.
# Build all existing combinations
combinations_df <- df %>%
group_by(Country, Rainfall_Cat) %>%
summarise(Count = n(), .groups = "drop") %>%
mutate(Probability = Count / sum(Count))
# Identify extremes
most_common <- combinations_df %>%
filter(Count == max(Count))
least_common <- combinations_df %>%
filter(Count == min(Count))
# Create all possible combinations
all_countries <- unique(df$Country)
all_rainfall <- levels(df$Rainfall_Cat)
all_possible <- expand.grid(Country = all_countries,
Rainfall_Cat = all_rainfall,
stringsAsFactors = FALSE)
# Find missing combinations
missing_combos <- anti_join(all_possible, combinations_df,
by = c("Country", "Rainfall_Cat"))
# Display most common combinations
cat("MOST COMMON COMBINATIONS:\n")
## MOST COMMON COMBINATIONS:
most_common %>%
arrange(desc(Count)) %>%
kable(caption = "Most Frequent Country-Rainfall Combinations")
| Country | Rainfall_Cat | Count | Probability |
|---|---|---|---|
| India | Medium | 29 | 0.029 |
| Russia | Low | 29 | 0.029 |
| UK | Low | 29 | 0.029 |
cat("\nLEAST COMMON COMBINATIONS:\n")
##
## LEAST COMMON COMBINATIONS:
least_common %>%
arrange(Count) %>%
kable(caption = "Rarest Country-Rainfall Combinations")
| Country | Rainfall_Cat | Count | Probability |
|---|---|---|---|
| UK | High | 14 | 0.014 |
cat("\nMISSING COMBINATIONS:\n")
##
## MISSING COMBINATIONS:
if(nrow(missing_combos) > 0) {
print(missing_combos %>% kable(caption = "Non-existent Country-Rainfall Pairs"))
} else {
cat("No missing combinations found. All countries have observations in all rainfall categories.\n")
}
## No missing combinations found. All countries have observations in all rainfall categories.
# Summary statistics
cat("\n\nCOMBINATION STATISTICS:\n")
##
##
## COMBINATION STATISTICS:
cat("Total possible combinations:", nrow(all_possible), "\n")
## Total possible combinations: 45
cat("Observed combinations:", nrow(combinations_df), "\n")
## Observed combinations: 45
cat("Missing combinations:", nrow(missing_combos), "\n")
## Missing combinations: 0
cat("Percentage coverage:",
round(100 * nrow(combinations_df) / nrow(all_possible), 2), "%\n")
## Percentage coverage: 100 %
Missing Combinations:
if(nrow(missing_combos) > 0) {
cat("The dataset is missing", nrow(missing_combos), "country-rainfall combinations.\n")
cat("This suggests:\n")
cat("1. Incomplete temporal coverage for certain geographic regions\n")
cat("2. Climate constraints preventing certain countries from experiencing all rainfall states\n")
cat("3. Data collection gaps during specific hydrological conditions\n")
} else {
cat("The absence of missing combinations is remarkable. Every country in the dataset\n")
cat("has at least one record in Low, Medium, and High rainfall categories. This suggests:\n\n")
cat("1. The dataset covers a sufficiently long temporal range where every nation\n")
cat(" eventually experienced all three hydrological states\n")
cat("2. The data may be synthetic or carefully balanced\n")
cat("3. The rainfall categorization bins are wide enough to capture natural variability\n")
}
## The absence of missing combinations is remarkable. Every country in the dataset
## has at least one record in Low, Medium, and High rainfall categories. This suggests:
##
## 1. The dataset covers a sufficiently long temporal range where every nation
## eventually experienced all three hydrological states
## 2. The data may be synthetic or carefully balanced
## 3. The rainfall categorization bins are wide enough to capture natural variability
Most Common Combinations:
cat("\nThe most frequent combinations are:\n")
##
## The most frequent combinations are:
for(i in 1:min(3, nrow(most_common))) {
cat(sprintf("- %s + %s: %d observations (P = %.4f)\n",
most_common$Country[i],
most_common$Rainfall_Cat[i],
most_common$Count[i],
most_common$Probability[i]))
}
## - India + Medium: 29 observations (P = 0.0290)
## - Russia + Low: 29 observations (P = 0.0290)
## - UK + Low: 29 observations (P = 0.0290)
cat("\nThese combinations likely represent:\n")
##
## These combinations likely represent:
cat("1. Countries with stable, consistent rainfall patterns\n")
## 1. Countries with stable, consistent rainfall patterns
cat("2. Geographic regions with more comprehensive data collection\n")
## 2. Geographic regions with more comprehensive data collection
cat("3. Climate zones where certain rainfall levels are climatologically expected\n")
## 3. Climate zones where certain rainfall levels are climatologically expected
Least Common Combinations (RARE COMBINATIONS):
cat("\nThe rarest combinations are:\n")
##
## The rarest combinations are:
for(i in 1:min(5, nrow(least_common))) {
cat(sprintf("- %s + %s: %d observations (P = %.4f)\n",
least_common$Country[i],
least_common$Rainfall_Cat[i],
least_common$Count[i],
least_common$Probability[i]))
}
## - UK + High: 14 observations (P = 0.0140)
cat("\n\nWhy are these rare?\n")
##
##
## Why are these rare?
cat("These represent CLIMATOLOGICAL ANOMALIES - conditions that are:\n")
## These represent CLIMATOLOGICAL ANOMALIES - conditions that are:
cat("1. Statistically unlikely given the country's typical climate\n")
## 1. Statistically unlikely given the country's typical climate
cat("2. Extreme weather events that occur infrequently\n")
## 2. Extreme weather events that occur infrequently
cat("3. Transitions between climate states during unusual atmospheric conditions\n")
## 3. Transitions between climate states during unusual atmospheric conditions
cat("\nFor example, if UK-High is among the rarest, this makes sense because\n")
##
## For example, if UK-High is among the rarest, this makes sense because
cat("the UK typically experiences moderate, consistent rainfall rather than extremes.\n")
## the UK typically experiences moderate, consistent rainfall rather than extremes.
# Heatmap of Country-Rainfall Combinations
ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Count)) +
geom_tile(color = "white", size = 0.5) +
scale_fill_gradient(low = "#f7fbff", high = "#08306b",
name = "Count") +
labs(title = "Heatmap of Country-Rainfall Combinations",
subtitle = "Darker tiles represent higher occurrence probabilities",
x = "Country",
y = "Rainfall Category") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())
# Alternative: Probability-based heatmap
ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Probability)) +
geom_tile(color = "white", size = 0.5) +
geom_text(aes(label = Count), size = 3, color = "white") +
scale_fill_gradient(low = "#feedde", high = "#a63603",
name = "Probability",
labels = percent_format(accuracy = 0.1)) +
labs(title = "Probability Distribution of Country-Rainfall Combinations",
subtitle = "Numbers show observation counts; color intensity shows probability",
x = "Country",
y = "Rainfall Category") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())
# Bar plot showing top and bottom combinations
top_bottom <- bind_rows(
combinations_df %>%
arrange(desc(Count)) %>%
head(10) %>%
mutate(Category = "Top 10 Most Common"),
combinations_df %>%
arrange(Count) %>%
head(10) %>%
mutate(Category = "Top 10 Rarest")
) %>%
mutate(Combo = paste(Country, Rainfall_Cat, sep = " - "))
ggplot(top_bottom, aes(x = reorder(Combo, Count), y = Count, fill = Category)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Top 10 Most Common" = "#2166ac",
"Top 10 Rarest" = "#b2182b")) +
labs(title = "Extreme Combinations: Most Common vs Rarest",
x = "Country-Rainfall Combination",
y = "Observation Count",
fill = "") +
theme(legend.position = "bottom")
Our probability-based anomaly detection revealed three critical insights:
1. Geographic Sampling Bias - Mexico is the rarest country (P = 0.055) - This suggests systematic under-sampling or late integration into reporting frameworks - Hypothesis testing showed no significant correlation between sample size and temperature variance
2. Hydrological State Distribution - High rainfall is the rarest state (P = 0.319) - Extreme precipitation events are statistically infrequent, aligning with climatological expectations - CO2 emissions show no significant difference between high and other rainfall states
3. Energy Transition Patterns - Advanced renewable adoption is the rarest (P = 0.323) - Only a subset of observations represent advanced clean energy integration - Counterintuitively, transitioning states experience more extreme weather events
4. Combination Analysis - Complete coverage across all country-rainfall pairs - Rarest combinations represent climatological anomalies - Most common combinations reflect stable climate patterns
These “anomalies” are not data errors but represent the boundaries of observed climate systems. The most significant climate insights often reside in these least frequent data points, as they represent:
# Distribution of probabilities across all combinations
summary_stats <- combinations_df %>%
summarise(
Mean_Count = mean(Count),
Median_Count = median(Count),
SD_Count = sd(Count),
Min_Count = min(Count),
Max_Count = max(Count),
Skewness = (mean(Count) - median(Count)) / sd(Count)
)
cat("Distribution Statistics for Country-Rainfall Combinations:\n")
## Distribution Statistics for Country-Rainfall Combinations:
print(summary_stats)
## # A tibble: 1 × 6
## Mean_Count Median_Count SD_Count Min_Count Max_Count Skewness
## <dbl> <int> <dbl> <int> <int> <dbl>
## 1 22.2 22 3.77 14 29 0.0590
# Histogram of combination frequencies
ggplot(combinations_df, aes(x = Count)) +
geom_histogram(bins = 20, fill = "#4292c6", color = "white") +
geom_vline(xintercept = mean(combinations_df$Count),
linetype = "dashed", color = "red", size = 1) +
labs(title = "Distribution of Combination Frequencies",
subtitle = "Red line shows mean count",
x = "Number of Observations",
y = "Number of Combinations") +
theme_minimal()
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.4.0 knitr_1.51 lubridate_1.9.4 forcats_1.0.1
## [5] stringr_1.6.0 dplyr_1.1.4 purrr_1.2.1 readr_2.1.6
## [9] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.2 tidyselect_1.2.1
## [5] jquerylib_0.1.4 yaml_2.3.12 fastmap_1.2.0 R6_2.6.1
## [9] labeling_0.4.3 generics_0.1.4 bslib_0.10.0 pillar_1.11.1
## [13] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.1.7 stringi_1.8.7
## [17] cachem_1.1.0 xfun_0.56 sass_0.4.10 S7_0.2.1
## [21] otel_0.2.0 timechange_0.3.0 cli_3.6.5 withr_3.0.2
## [25] magrittr_2.0.4 digest_0.6.39 grid_4.5.2 rstudioapi_0.18.0
## [29] hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.1 evaluate_1.0.5
## [33] glue_1.8.0 farver_2.1.2 rmarkdown_2.30 tools_4.5.2
## [37] pkgconfig_2.0.3 htmltools_0.5.9