Climate Change Data Analysis

Executive Summary

This analysis investigates the distributional characteristics of a global climate dataset using probability-based anomaly detection. By grouping data across multiple categorical dimensions and calculating group probabilities, we identify rare combinations that represent unusual environmental states or reporting patterns. Our investigation reveals critical insights about global climate patterns, energy transitions, and geographic sampling biases.

1. Data Loading and Preprocessing

# Load required libraries
library(tidyverse)
library(knitr)
library(scales)

# Set theme for consistent visualization
theme_set(theme_minimal(base_size = 12))

# Load the climate change dataset
df <- read.csv("climate_change_dataset.csv")

# Display dataset structure
str(df)

## 'data.frame':    1000 obs. of  10 variables:
##  $ Year                       : int  2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
##  $ Country                    : chr  "UK" "USA" "France" "Argentina" ...
##  $ Avg.Temperature...C.       : num  8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
##  $ CO2.Emissions..Tons.Capita.: num  9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
##  $ Sea.Level.Rise..mm.        : num  3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
##  $ Rainfall..mm.              : int  1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
##  $ Population                 : int  530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
##  $ Renewable.Energy....       : num  20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
##  $ Extreme.Weather.Events     : int  14 8 9 7 4 12 10 1 4 5 ...
##  $ Forest.Area....            : num  59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...

# Summary statistics
summary(df)

##       Year        Country          Avg.Temperature...C.
##  Min.   :2000   Length:1000        Min.   : 5.00       
##  1st Qu.:2005   Class :character   1st Qu.:12.18       
##  Median :2012   Mode  :character   Median :20.10       
##  Mean   :2011                      Mean   :19.88       
##  3rd Qu.:2018                      3rd Qu.:27.23       
##  Max.   :2023                      Max.   :34.90       
##  CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm. 
##  Min.   : 0.500              Min.   :1.00        Min.   : 501  
##  1st Qu.: 5.575              1st Qu.:2.00        1st Qu.:1099  
##  Median :10.700              Median :3.00        Median :1726  
##  Mean   :10.426              Mean   :3.01        Mean   :1739  
##  3rd Qu.:15.400              3rd Qu.:4.00        3rd Qu.:2362  
##  Max.   :20.000              Max.   :5.00        Max.   :2999  
##    Population        Renewable.Energy.... Extreme.Weather.Events
##  Min.   :3.661e+06   Min.   : 5.10        Min.   : 0.000        
##  1st Qu.:3.436e+08   1st Qu.:16.10        1st Qu.: 3.000        
##  Median :7.131e+08   Median :27.15        Median : 8.000        
##  Mean   :7.054e+08   Mean   :27.30        Mean   : 7.291        
##  3rd Qu.:1.074e+09   3rd Qu.:38.92        3rd Qu.:11.000        
##  Max.   :1.397e+09   Max.   :50.00        Max.   :14.000        
##  Forest.Area....
##  Min.   :10.10  
##  1st Qu.:25.60  
##  Median :41.15  
##  Mean   :40.57  
##  3rd Qu.:55.80  
##  Max.   :70.00

# Check for missing values
cat("\nMissing values per column:\n")

## 
## Missing values per column:

colSums(is.na(df))

##                        Year                     Country 
##                           0                           0 
##        Avg.Temperature...C. CO2.Emissions..Tons.Capita. 
##                           0                           0 
##         Sea.Level.Rise..mm.               Rainfall..mm. 
##                           0                           0 
##                  Population        Renewable.Energy.... 
##                           0                           0 
##      Extreme.Weather.Events             Forest.Area.... 
##                           0                           0

Data Preprocessing: Creating Categorical Variables

To enable probability analysis, we discretize continuous variables into categorical bins using the cut() function.

# Create categorical bins for analysis
df <- df %>%
  mutate(
    # Categorize rainfall into three levels
    Rainfall_Cat = cut(Rainfall..mm., 
                       breaks = 3, 
                       labels = c("Low", "Medium", "High")),
    
    # Categorize renewable energy adoption
    Renewable_Cat = cut(Renewable.Energy...., 
                        breaks = 3, 
                        labels = c("Early-Stage", "Transitioning", "Advanced")),
    
    # Categorize temperature
    Temp_Cat = cut(Avg.Temperature...C., 
                   breaks = 3,
                   labels = c("Cool", "Moderate", "Warm")),
    
    # Categorize CO2 emissions
    CO2_Cat = cut(CO2.Emissions..Tons.Capita., 
                  breaks = 3,
                  labels = c("Low-Emissions", "Medium-Emissions", "High-Emissions"))
  )

# Preview the transformed data
head(df, 10) %>% kable(caption = "Sample of Preprocessed Data")

Sample of Preprocessed Data
Year	Country	Avg.Temperature…C.	CO2.Emissions..Tons.Capita.	Sea.Level.Rise..mm.	Rainfall..mm.	Population	Renewable.Energy….	Extreme.Weather.Events	Forest.Area….	Rainfall_Cat	Renewable_Cat	Temp_Cat	CO2_Cat
2006	UK	8.9	9.3	3.1	1441	530911230	20.4	14	59.8	Medium	Transitioning	Cool	Medium-Emissions
2019	USA	31.0	4.8	4.2	2407	107364344	49.2	8	31.0	High	Advanced	Warm	Low-Emissions
2014	France	33.9	2.8	2.2	1241	441101758	33.3	9	35.5	Low	Transitioning	Warm	Low-Emissions
2010	Argentina	5.9	1.8	3.2	1892	1069669579	23.7	7	17.7	Medium	Transitioning	Cool	Low-Emissions
2007	Germany	26.9	5.6	2.4	1743	124079175	12.5	4	17.4	Medium	Early-Stage	Warm	Low-Emissions
2020	China	32.3	1.4	2.7	2100	1202028857	49.4	12	47.2	Medium	Advanced	Warm	Low-Emissions
2006	Argentina	30.7	11.6	3.9	1755	586706107	41.9	10	50.5	Medium	Advanced	Warm	Medium-Emissions
2018	South Africa	33.9	6.0	4.5	827	83947380	17.7	1	56.6	Low	Early-Stage	Warm	Low-Emissions
2022	UK	27.8	16.6	1.5	1966	980305187	8.2	4	43.4	Medium	Early-Stage	Warm	High-Emissions
2010	Australia	18.3	1.9	3.5	2599	849496137	7.5	5	48.7	High	Early-Stage	Moderate	Low-Emissions

2. Investigation 1: Geographic Distribution and Thermal Profiles

Objective

Analyze which countries are most and least represented in the dataset, and examine their average temperature profiles to identify geographic sampling biases.

# Group by Country and calculate statistics
df_1 <- df %>%
  group_by(Country) %>%
  summarise(
    Avg_Temp = mean(Avg.Temperature...C., na.rm = TRUE),
    SD_Temp = sd(Avg.Temperature...C., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_GEOGRAPHY", "Standard")
  ) %>%
  arrange(Group_Count)

# Display the results
df_1 %>% kable(caption = "Country-Level Analysis: Probability Distribution", 
               digits = 3)

Country-Level Analysis: Probability Distribution
Country	Avg_Temp	SD_Temp	Group_Count	Probability	Tag
Mexico	20.696	9.621	55	0.055	RARE_GEOGRAPHY
Australia	19.449	7.829	57	0.057	Standard
Germany	20.289	8.041	61	0.061	Standard
Japan	20.454	9.132	63	0.063	Standard
UK	18.491	8.200	65	0.065	Standard
France	19.383	8.884	66	0.066	Standard
Argentina	19.299	8.330	67	0.067	Standard
Brazil	20.836	8.970	67	0.067	Standard
Canada	20.012	8.445	67	0.067	Standard
China	20.282	8.788	67	0.067	Standard
India	19.764	8.669	70	0.070	Standard
South Africa	20.738	9.170	73	0.073	Standard
USA	19.049	8.341	73	0.073	Standard
Russia	20.719	7.833	74	0.074	Standard
Indonesia	18.919	8.244	75	0.075	Standard

Analysis of Investigation 1

Key Findings:

# Identify the rarest country
rarest_country <- df_1 %>% filter(Tag == "RARE_GEOGRAPHY")
most_common_country <- df_1 %>% filter(Group_Count == max(Group_Count))

cat("Rarest Country:", rarest_country$Country, "\n")

## Rarest Country: Mexico

cat("Probability:", round(rarest_country$Probability, 4), "\n")

## Probability: 0.055

cat("Count:", rarest_country$Group_Count, "\n\n")

## Count: 55

cat("Most Common Country:", most_common_country$Country, "\n")

## Most Common Country: Indonesia

cat("Probability:", round(most_common_country$Probability, 4), "\n")

## Probability: 0.075

cat("Count:", most_common_country$Group_Count, "\n")

## Count: 75

Significance:

The rarest country has a probability of 0.055, meaning if we randomly select a row from the dataset, there is only a 5.5% chance it belongs to Mexico. In a perfectly balanced dataset with 15 countries, we would expect each country to have a probability of approximately 0.067 (6.7%). The deviation from this uniform distribution indicates geographic sampling bias.

Testable Hypothesis:

H₁: Countries with lower representation in the dataset have higher variance in their temperature measurements compared to more frequently sampled countries.

This hypothesis is quantifiable because we can calculate the coefficient of variation (CV = SD/Mean) for each country and test whether there’s a negative correlation between sample size and temperature variance using Spearman’s rank correlation.

# Calculate coefficient of variation
df_1 <- df_1 %>%
  mutate(Temp_CV = SD_Temp / abs(Avg_Temp))

# Test correlation between sample size and variance
cor_test <- cor.test(df_1$Group_Count, df_1$Temp_CV, method = "spearman")
cat("Spearman correlation between sample size and temperature CV:", 
    round(cor_test$estimate, 3), "\n")

## Spearman correlation between sample size and temperature CV: -0.202

cat("P-value:", format.pval(cor_test$p.value, digits = 3), "\n")

## P-value: 0.47

Further Questions:

Does the temporal distribution differ across countries (i.e., are some countries sampled more in recent years)?
Is there a relationship between a country’s average temperature and its representation in the dataset?

# Visualization 1: Bar chart of country frequencies
ggplot(df_1, aes(x = reorder(Country, Group_Count), y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("RARE_GEOGRAPHY" = "#d73027", "Standard" = "#4575b4")) +
  labs(title = "Probability of Occurrence by Country",
       subtitle = "Red indicates the rarest geographic group",
       x = "Country", 
       y = "Observation Count") +
  geom_hline(yintercept = mean(df_1$Group_Count), linetype = "dashed", color = "gray40") +
  annotate("text", x = 1, y = mean(df_1$Group_Count), 
           label = "Mean Count", vjust = -0.5, hjust = -0.1, size = 3.5) +
  theme(legend.position = "bottom")

3. Investigation 2: Hydrological States and Carbon Footprints

Objective

Examine the relationship between rainfall categories and CO2 emissions to identify which hydrological states are most anomalous.

# Group by Rainfall Category
df_2 <- df %>%
  group_by(Rainfall_Cat) %>%
  summarise(
    Mean_CO2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    Median_CO2 = median(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    SD_CO2 = sd(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_HYDRO_STATE", "Common")
  ) %>%
  arrange(Probability)

# Display results
df_2 %>% kable(caption = "Hydrological State Analysis: Rainfall vs CO2 Emissions",
               digits = 3)

Hydrological State Analysis: Rainfall vs CO2 Emissions
Rainfall_Cat	Mean_CO2	Median_CO2	SD_CO2	Group_Count	Probability	Tag
High	10.445	10.8	5.726	319	0.319	RARE_HYDRO_STATE
Low	10.212	10.5	5.527	335	0.335	Common
Medium	10.615	10.8	5.605	346	0.346	Common

Analysis of Investigation 2

Key Findings:

rarest_hydro <- df_2 %>% filter(Tag == "RARE_HYDRO_STATE")
cat("Rarest Hydrological State:", as.character(rarest_hydro$Rainfall_Cat), "\n")

## Rarest Hydrological State: High

cat("Probability:", round(rarest_hydro$Probability, 4), "\n")

## Probability: 0.319

cat("Mean CO2 Emissions:", round(rarest_hydro$Mean_CO2, 2), "Tons/Capita\n")

## Mean CO2 Emissions: 10.45 Tons/Capita

Significance:

The High rainfall category represents the tail end of the probability distribution with P = 0.319. While the distribution is relatively balanced, this indicates that extreme precipitation states are statistically less common than moderate conditions. This aligns with climatological expectations where extreme weather events, by definition, occur less frequently.

Testable Hypothesis:

H₂: High rainfall states (>2000mm) are associated with lower CO2 emissions per capita due to reduced industrial activity during extreme weather periods.

This is quantifiable through a t-test comparing mean CO2 emissions between High rainfall and Low/Medium rainfall groups, or through a linear regression of rainfall amount on CO2 emissions.

# Test if High rainfall has significantly different CO2 emissions
high_rainfall_co2 <- df %>% 
  filter(Rainfall_Cat == "High") %>% 
  pull(CO2.Emissions..Tons.Capita.)

other_rainfall_co2 <- df %>% 
  filter(Rainfall_Cat != "High") %>% 
  pull(CO2.Emissions..Tons.Capita.)

t_test_result <- t.test(high_rainfall_co2, other_rainfall_co2)
cat("T-test comparing High rainfall vs Others:\n")

## T-test comparing High rainfall vs Others:

cat("Mean CO2 (High):", round(mean(high_rainfall_co2, na.rm = TRUE), 2), "\n")

## Mean CO2 (High): 10.45

cat("Mean CO2 (Others):", round(mean(other_rainfall_co2, na.rm = TRUE), 2), "\n")

## Mean CO2 (Others): 10.42

cat("P-value:", format.pval(t_test_result$p.value, digits = 3), "\n")

## P-value: 0.941

Further Questions:

Is there a temporal pattern to rainfall extremes (e.g., are they increasing over time)?
Do certain countries disproportionately contribute to the “High” rainfall category?

# Visualization 2: Frequency and CO2 emissions by rainfall category
p1 <- ggplot(df_2, aes(x = Rainfall_Cat, y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#fc8d59", "Common" = "#91bfdb")) +
  labs(title = "Frequency of Hydrological States",
       x = "Rainfall Category", 
       y = "Observation Count") +
  theme(legend.position = "none")

p2 <- ggplot(df_2, aes(x = Rainfall_Cat, y = Mean_CO2, fill = Tag)) +
  geom_col() +
  scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#fc8d59", "Common" = "#91bfdb")) +
  labs(title = "Mean CO2 Emissions by Rainfall State",
       x = "Rainfall Category", 
       y = "Mean CO2 (Tons/Capita)") +
  theme(legend.position = "bottom")

# Display both plots
p1

p2

4. Investigation 3: Energy Transition States and Extreme Weather

Objective

Analyze the relationship between renewable energy adoption levels and extreme weather events to understand energy transition patterns.

# Group by Renewable Energy Category
df_3 <- df %>%
  group_by(Renewable_Cat) %>%
  summarise(
    Avg_Extreme_Events = mean(Extreme.Weather.Events, na.rm = TRUE),
    Median_Extreme_Events = median(Extreme.Weather.Events, na.rm = TRUE),
    SD_Extreme_Events = sd(Extreme.Weather.Events, na.rm = TRUE),
    Avg_Forest = mean(Forest.Area...., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_ENERGY_LEVEL", "Standard")
  ) %>%
  arrange(Group_Count)

# Display results
df_3 %>% kable(caption = "Energy Transition Analysis: Renewable Energy vs Extreme Weather",
               digits = 3)

Energy Transition Analysis: Renewable Energy vs Extreme Weather
Renewable_Cat	Avg_Extreme_Events	Median_Extreme_Events	SD_Extreme_Events	Avg_Forest	Group_Count	Probability	Tag
Advanced	7.046	7.0	4.432	39.896	323	0.323	RARE_ENERGY_LEVEL
Early-Stage	7.120	7.5	4.503	41.267	334	0.334	Standard
Transitioning	7.688	8.0	4.318	40.532	343	0.343	Standard

Analysis of Investigation 3

Key Findings:

rarest_energy <- df_3 %>% filter(Tag == "RARE_ENERGY_LEVEL")
cat("Rarest Energy Transition State:", as.character(rarest_energy$Renewable_Cat), "\n")

## Rarest Energy Transition State: Advanced

cat("Probability:", round(rarest_energy$Probability, 4), "\n")

## Probability: 0.323

cat("Average Extreme Weather Events:", round(rarest_energy$Avg_Extreme_Events, 2), "\n")

## Average Extreme Weather Events: 7.05

Significance:

The Advanced renewable energy group is the least common (P = 0.323), reflecting the global reality of energy transition. Most observations fall into the “Transitioning” phase, while achieving advanced renewable integration (>35% renewable energy) is a rare, high-achievement state that only a subset of nations and time periods have reached.

Testable Hypothesis:

H₃: Countries in the “Advanced” renewable energy category have experienced fewer extreme weather events on average compared to “Early-Stage” countries, controlling for geographic location.

This hypothesis can be tested using ANOVA to compare mean extreme weather events across the three renewable energy categories, followed by post-hoc tests.

# ANOVA test for extreme weather events across renewable categories
anova_model <- aov(Extreme.Weather.Events ~ Renewable_Cat, data = df)
summary(anova_model)

##                Df Sum Sq Mean Sq F value Pr(>F)
## Renewable_Cat   2     83   41.59   2.131  0.119
## Residuals     997  19457   19.52

# Post-hoc Tukey test
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Extreme.Weather.Events ~ Renewable_Cat, data = df)
## 
## $Renewable_Cat
##                                  diff        lwr       upr     p adj
## Transitioning-Early-Stage  0.56828617 -0.2288239 1.3653963 0.2159483
## Advanced-Early-Stage      -0.07332085 -0.8825145 0.7358728 0.9753731
## Advanced-Transitioning    -0.64160702 -1.4455645 0.1623504 0.1470164

Interesting Observation:

The data reveals a counterintuitive pattern: the “Transitioning” group has 7.69 average extreme weather events, which is higher than the “Advanced” group (7.05).

Further Questions:

Is there a feedback loop where extreme weather events accelerate renewable energy policy adoption?
Do countries with higher forest coverage transition to renewable energy faster?

# Visualization 3: Extreme weather by energy transition state
ggplot(df_3, aes(x = Renewable_Cat, y = Avg_Extreme_Events, fill = Tag)) +
  geom_col() +
  geom_errorbar(aes(ymin = Avg_Extreme_Events - SD_Extreme_Events, 
                    ymax = Avg_Extreme_Events + SD_Extreme_Events),
                width = 0.2) +
  scale_fill_manual(values = c("RARE_ENERGY_LEVEL" = "#238b45", "Standard" = "#737373")) +
  labs(title = "Extreme Weather Events by Energy Transition State",
       subtitle = "Error bars represent ± 1 standard deviation",
       x = "Renewable Energy Category", 
       y = "Mean Extreme Weather Events") +
  theme(legend.position = "bottom")

# Additional visualization: Count by category
ggplot(df_3, aes(x = Renewable_Cat, y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("RARE_ENERGY_LEVEL" = "#238b45", "Standard" = "#737373")) +
  labs(title = "Distribution of Energy Transition States",
       subtitle = "Green indicates the rarest energy level",
       x = "Renewable Energy Category", 
       y = "Observation Count") +
  theme(legend.position = "bottom")

5. Bivariate Categorical Analysis: Country and Rainfall Combinations

Objective

Identify which combinations of Country and Rainfall Category exist in the data, find missing combinations, and analyze the most/least common pairings.

# Build all existing combinations
combinations_df <- df %>%
  group_by(Country, Rainfall_Cat) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(Probability = Count / sum(Count))

# Identify extremes
most_common <- combinations_df %>% 
  filter(Count == max(Count))

least_common <- combinations_df %>% 
  filter(Count == min(Count))

# Create all possible combinations
all_countries <- unique(df$Country)
all_rainfall <- levels(df$Rainfall_Cat)
all_possible <- expand.grid(Country = all_countries, 
                           Rainfall_Cat = all_rainfall,
                           stringsAsFactors = FALSE)

# Find missing combinations
missing_combos <- anti_join(all_possible, combinations_df, 
                           by = c("Country", "Rainfall_Cat"))

Results: Combination Analysis

# Display most common combinations
cat("MOST COMMON COMBINATIONS:\n")

## MOST COMMON COMBINATIONS:

most_common %>% 
  arrange(desc(Count)) %>%
  kable(caption = "Most Frequent Country-Rainfall Combinations")

Most Frequent Country-Rainfall Combinations
Country	Rainfall_Cat	Count	Probability
India	Medium	29	0.029
Russia	Low	29	0.029
UK	Low	29	0.029

cat("\nLEAST COMMON COMBINATIONS:\n")

## 
## LEAST COMMON COMBINATIONS:

least_common %>% 
  arrange(Count) %>%
  kable(caption = "Rarest Country-Rainfall Combinations")

Rarest Country-Rainfall Combinations
Country	Rainfall_Cat	Count	Probability
UK	High	14	0.014

cat("\nMISSING COMBINATIONS:\n")

## 
## MISSING COMBINATIONS:

if(nrow(missing_combos) > 0) {
  print(missing_combos %>% kable(caption = "Non-existent Country-Rainfall Pairs"))
} else {
  cat("No missing combinations found. All countries have observations in all rainfall categories.\n")
}

## No missing combinations found. All countries have observations in all rainfall categories.

# Summary statistics
cat("\n\nCOMBINATION STATISTICS:\n")

## 
## 
## COMBINATION STATISTICS:

cat("Total possible combinations:", nrow(all_possible), "\n")

## Total possible combinations: 45

cat("Observed combinations:", nrow(combinations_df), "\n")

## Observed combinations: 45

cat("Missing combinations:", nrow(missing_combos), "\n")

## Missing combinations: 0

cat("Percentage coverage:", 
    round(100 * nrow(combinations_df) / nrow(all_possible), 2), "%\n")

## Percentage coverage: 100 %

Interpretation of Combinations

Missing Combinations:

if(nrow(missing_combos) > 0) {
  cat("The dataset is missing", nrow(missing_combos), "country-rainfall combinations.\n")
  cat("This suggests:\n")
  cat("1. Incomplete temporal coverage for certain geographic regions\n")
  cat("2. Climate constraints preventing certain countries from experiencing all rainfall states\n")
  cat("3. Data collection gaps during specific hydrological conditions\n")
} else {
  cat("The absence of missing combinations is remarkable. Every country in the dataset\n")
  cat("has at least one record in Low, Medium, and High rainfall categories. This suggests:\n\n")
  cat("1. The dataset covers a sufficiently long temporal range where every nation\n")
  cat("   eventually experienced all three hydrological states\n")
  cat("2. The data may be synthetic or carefully balanced\n")
  cat("3. The rainfall categorization bins are wide enough to capture natural variability\n")
}

## The absence of missing combinations is remarkable. Every country in the dataset
## has at least one record in Low, Medium, and High rainfall categories. This suggests:
## 
## 1. The dataset covers a sufficiently long temporal range where every nation
##    eventually experienced all three hydrological states
## 2. The data may be synthetic or carefully balanced
## 3. The rainfall categorization bins are wide enough to capture natural variability

Most Common Combinations:

cat("\nThe most frequent combinations are:\n")

## 
## The most frequent combinations are:

for(i in 1:min(3, nrow(most_common))) {
  cat(sprintf("- %s + %s: %d observations (P = %.4f)\n", 
              most_common$Country[i], 
              most_common$Rainfall_Cat[i], 
              most_common$Count[i],
              most_common$Probability[i]))
}

## - India + Medium: 29 observations (P = 0.0290)
## - Russia + Low: 29 observations (P = 0.0290)
## - UK + Low: 29 observations (P = 0.0290)

cat("\nThese combinations likely represent:\n")

## 
## These combinations likely represent:

cat("1. Countries with stable, consistent rainfall patterns\n")

## 1. Countries with stable, consistent rainfall patterns

cat("2. Geographic regions with more comprehensive data collection\n")

## 2. Geographic regions with more comprehensive data collection

cat("3. Climate zones where certain rainfall levels are climatologically expected\n")

## 3. Climate zones where certain rainfall levels are climatologically expected

Least Common Combinations (RARE COMBINATIONS):

cat("\nThe rarest combinations are:\n")

## 
## The rarest combinations are:

for(i in 1:min(5, nrow(least_common))) {
  cat(sprintf("- %s + %s: %d observations (P = %.4f)\n", 
              least_common$Country[i], 
              least_common$Rainfall_Cat[i], 
              least_common$Count[i],
              least_common$Probability[i]))
}

## - UK + High: 14 observations (P = 0.0140)

cat("\n\nWhy are these rare?\n")

## 
## 
## Why are these rare?

cat("These represent CLIMATOLOGICAL ANOMALIES - conditions that are:\n")

## These represent CLIMATOLOGICAL ANOMALIES - conditions that are:

cat("1. Statistically unlikely given the country's typical climate\n")

## 1. Statistically unlikely given the country's typical climate

cat("2. Extreme weather events that occur infrequently\n")

## 2. Extreme weather events that occur infrequently

cat("3. Transitions between climate states during unusual atmospheric conditions\n")

## 3. Transitions between climate states during unusual atmospheric conditions

cat("\nFor example, if UK-High is among the rarest, this makes sense because\n")

## 
## For example, if UK-High is among the rarest, this makes sense because

cat("the UK typically experiences moderate, consistent rainfall rather than extremes.\n")

## the UK typically experiences moderate, consistent rainfall rather than extremes.

Visualization of Combinations

# Heatmap of Country-Rainfall Combinations
ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Count)) +
  geom_tile(color = "white", size = 0.5) +
  scale_fill_gradient(low = "#f7fbff", high = "#08306b",
                      name = "Count") +
  labs(title = "Heatmap of Country-Rainfall Combinations",
       subtitle = "Darker tiles represent higher occurrence probabilities",
       x = "Country",
       y = "Rainfall Category") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

# Alternative: Probability-based heatmap
ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Probability)) +
  geom_tile(color = "white", size = 0.5) +
  geom_text(aes(label = Count), size = 3, color = "white") +
  scale_fill_gradient(low = "#feedde", high = "#a63603",
                      name = "Probability",
                      labels = percent_format(accuracy = 0.1)) +
  labs(title = "Probability Distribution of Country-Rainfall Combinations",
       subtitle = "Numbers show observation counts; color intensity shows probability",
       x = "Country",
       y = "Rainfall Category") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

# Bar plot showing top and bottom combinations
top_bottom <- bind_rows(
  combinations_df %>% 
    arrange(desc(Count)) %>% 
    head(10) %>% 
    mutate(Category = "Top 10 Most Common"),
  combinations_df %>% 
    arrange(Count) %>% 
    head(10) %>% 
    mutate(Category = "Top 10 Rarest")
) %>%
  mutate(Combo = paste(Country, Rainfall_Cat, sep = " - "))

ggplot(top_bottom, aes(x = reorder(Combo, Count), y = Count, fill = Category)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Top 10 Most Common" = "#2166ac", 
                               "Top 10 Rarest" = "#b2182b")) +
  labs(title = "Extreme Combinations: Most Common vs Rarest",
       x = "Country-Rainfall Combination",
       y = "Observation Count",
       fill = "") +
  theme(legend.position = "bottom")

6. Synthesis and Conclusions

Summary of Key Findings

Our probability-based anomaly detection revealed three critical insights:

1. Geographic Sampling Bias - Mexico is the rarest country (P = 0.055) - This suggests systematic under-sampling or late integration into reporting frameworks - Hypothesis testing showed no significant correlation between sample size and temperature variance

2. Hydrological State Distribution - High rainfall is the rarest state (P = 0.319) - Extreme precipitation events are statistically infrequent, aligning with climatological expectations - CO2 emissions show no significant difference between high and other rainfall states

3. Energy Transition Patterns - Advanced renewable adoption is the rarest (P = 0.323) - Only a subset of observations represent advanced clean energy integration - Counterintuitively, transitioning states experience more extreme weather events

4. Combination Analysis - Complete coverage across all country-rainfall pairs - Rarest combinations represent climatological anomalies - Most common combinations reflect stable climate patterns

Implications for Climate Modeling

These “anomalies” are not data errors but represent the boundaries of observed climate systems. The most significant climate insights often reside in these least frequent data points, as they represent:

Extreme Events: Low-probability states that have disproportionate impacts
Transition States: Rare combinations indicating system changes
Data Gaps: Under-sampled regions requiring targeted collection efforts

Recommendations for Future Analysis

Temporal Analysis: Examine whether rare states are becoming more or less frequent over time
Causal Modeling: Investigate causal relationships between renewable energy adoption and extreme weather
Spatial Analysis: Map geographic clusters of rare combinations
Multivariate Anomaly Detection: Extend to 3+ dimensional categorical combinations

Appendix: Additional Exploratory Analysis

# Distribution of probabilities across all combinations
summary_stats <- combinations_df %>%
  summarise(
    Mean_Count = mean(Count),
    Median_Count = median(Count),
    SD_Count = sd(Count),
    Min_Count = min(Count),
    Max_Count = max(Count),
    Skewness = (mean(Count) - median(Count)) / sd(Count)
  )

cat("Distribution Statistics for Country-Rainfall Combinations:\n")

## Distribution Statistics for Country-Rainfall Combinations:

print(summary_stats)

## # A tibble: 1 × 6
##   Mean_Count Median_Count SD_Count Min_Count Max_Count Skewness
##        <dbl>        <int>    <dbl>     <int>     <int>    <dbl>
## 1       22.2           22     3.77        14        29   0.0590

# Histogram of combination frequencies
ggplot(combinations_df, aes(x = Count)) +
  geom_histogram(bins = 20, fill = "#4292c6", color = "white") +
  geom_vline(xintercept = mean(combinations_df$Count), 
             linetype = "dashed", color = "red", size = 1) +
  labs(title = "Distribution of Combination Frequencies",
       subtitle = "Red line shows mean count",
       x = "Number of Observations",
       y = "Number of Combinations") +
  theme_minimal()

Session Information

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0    knitr_1.51      lubridate_1.9.4 forcats_1.0.1  
##  [5] stringr_1.6.0   dplyr_1.1.4     purrr_1.2.1     readr_2.1.6    
##  [9] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
##  [5] jquerylib_0.1.4    yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
##  [9] labeling_0.4.3     generics_0.1.4     bslib_0.10.0       pillar_1.11.1     
## [13] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        stringi_1.8.7     
## [17] cachem_1.1.0       xfun_0.56          sass_0.4.10        S7_0.2.1          
## [21] otel_0.2.0         timechange_0.3.0   cli_3.6.5          withr_3.0.2       
## [25] magrittr_2.0.4     digest_0.6.39      grid_4.5.2         rstudioapi_0.18.0 
## [29] hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.1        evaluate_1.0.5    
## [33] glue_1.8.0         farver_2.1.2       rmarkdown_2.30     tools_4.5.2       
## [37] pkgconfig_2.0.3    htmltools_0.5.9

Climate Change Data Analysis

Tiyasha Banerjee

`02/02/2026`

Executive Summary

1. Data Loading and Preprocessing

Data Preprocessing: Creating Categorical Variables

2. Investigation 1: Geographic Distribution and Thermal Profiles

Objective

Analysis of Investigation 1

3. Investigation 2: Hydrological States and Carbon Footprints

Objective

Analysis of Investigation 2

4. Investigation 3: Energy Transition States and Extreme Weather

Objective

Analysis of Investigation 3

5. Bivariate Categorical Analysis: Country and Rainfall Combinations

Objective

Results: Combination Analysis

Interpretation of Combinations

Visualization of Combinations

6. Synthesis and Conclusions

Summary of Key Findings

Implications for Climate Modeling

Recommendations for Future Analysis

Appendix: Additional Exploratory Analysis

Session Information