Climate Change Data Analysis

Executive Summary

This report analyses the distributional characteristics of a global climate dataset using probability-based anomaly detection. I grouped the data along several categorical dimensions and calculated group probabilities to identify rare combinations that may represent unusual environmental states or reporting patterns.

1. Data Loading and Preprocessing

library(tidyverse)
library(knitr)
library(scales)

theme_set(theme_minimal(base_size = 12))

df <- read.csv("climate_change_dataset.csv")

str(df)

## 'data.frame':    1000 obs. of  10 variables:
##  $ Year                       : int  2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
##  $ Country                    : chr  "UK" "USA" "France" "Argentina" ...
##  $ Avg.Temperature...C.       : num  8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
##  $ CO2.Emissions..Tons.Capita.: num  9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
##  $ Sea.Level.Rise..mm.        : num  3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
##  $ Rainfall..mm.              : int  1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
##  $ Population                 : int  530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
##  $ Renewable.Energy....       : num  20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
##  $ Extreme.Weather.Events     : int  14 8 9 7 4 12 10 1 4 5 ...
##  $ Forest.Area....            : num  59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...

summary(df)

##       Year        Country          Avg.Temperature...C.
##  Min.   :2000   Length:1000        Min.   : 5.00       
##  1st Qu.:2005   Class :character   1st Qu.:12.18       
##  Median :2012   Mode  :character   Median :20.10       
##  Mean   :2011                      Mean   :19.88       
##  3rd Qu.:2018                      3rd Qu.:27.23       
##  Max.   :2023                      Max.   :34.90       
##  CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm. 
##  Min.   : 0.500              Min.   :1.00        Min.   : 501  
##  1st Qu.: 5.575              1st Qu.:2.00        1st Qu.:1099  
##  Median :10.700              Median :3.00        Median :1726  
##  Mean   :10.426              Mean   :3.01        Mean   :1739  
##  3rd Qu.:15.400              3rd Qu.:4.00        3rd Qu.:2362  
##  Max.   :20.000              Max.   :5.00        Max.   :2999  
##    Population        Renewable.Energy.... Extreme.Weather.Events
##  Min.   :3.661e+06   Min.   : 5.10        Min.   : 0.000        
##  1st Qu.:3.436e+08   1st Qu.:16.10        1st Qu.: 3.000        
##  Median :7.131e+08   Median :27.15        Median : 8.000        
##  Mean   :7.054e+08   Mean   :27.30        Mean   : 7.291        
##  3rd Qu.:1.074e+09   3rd Qu.:38.92        3rd Qu.:11.000        
##  Max.   :1.397e+09   Max.   :50.00        Max.   :14.000        
##  Forest.Area....
##  Min.   :10.10  
##  1st Qu.:25.60  
##  Median :41.15  
##  Mean   :40.57  
##  3rd Qu.:55.80  
##  Max.   :70.00

cat("\nMissing values per column:\n")

## 
## Missing values per column:

colSums(is.na(df))

##                        Year                     Country 
##                           0                           0 
##        Avg.Temperature...C. CO2.Emissions..Tons.Capita. 
##                           0                           0 
##         Sea.Level.Rise..mm.               Rainfall..mm. 
##                           0                           0 
##                  Population        Renewable.Energy.... 
##                           0                           0 
##      Extreme.Weather.Events             Forest.Area.... 
##                           0                           0

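Continuous variables were discretized into three categorical bins with the `cut()` function so that group probabilities could be calculated. With `breaks = 3`, `cut()` creates equal-width intervals over each variable's range, not equal-frequency groups.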

df <- df %>%
  mutate(
    Rainfall_Cat = cut(Rainfall..mm., 
                       breaks = 3, 
                       labels = c("Low", "Medium", "High")),
    
    Renewable_Cat = cut(Renewable.Energy...., 
                        breaks = 3, 
                        labels = c("Early-Stage", "Transitioning", "Advanced")),
    
    Temp_Cat = cut(Avg.Temperature...C., 
                   breaks = 3,
                   labels = c("Cool", "Moderate", "Warm")),
    
    CO2_Cat = cut(CO2.Emissions..Tons.Capita., 
                  breaks = 3,
                  labels = c("Low-Emissions", "Medium-Emissions", "High-Emissions"))
  )

head(df, 10) %>% kable(caption = "Sample of Preprocessed Data")

Sample of Preprocessed Data
Year	Country	Avg.Temperature...C.	CO2.Emissions..Tons.Capita.	Sea.Level.Rise..mm.	Rainfall..mm.	Population	Renewable.Energy....	Extreme.Weather.Events	Forest.Area....	Rainfall_Cat	Renewable_Cat	Temp_Cat	CO2_Cat
2006	UK	8.9	9.3	3.1	1441	530911230	20.4	14	59.8	Medium	Transitioning	Cool	Medium-Emissions
2019	USA	31.0	4.8	4.2	2407	107364344	49.2	8	31.0	High	Advanced	Warm	Low-Emissions
2014	France	33.9	2.8	2.2	1241	441101758	33.3	9	35.5	Low	Transitioning	Warm	Low-Emissions
2010	Argentina	5.9	1.8	3.2	1892	1069669579	23.7	7	17.7	Medium	Transitioning	Cool	Low-Emissions
2007	Germany	26.9	5.6	2.4	1743	124079175	12.5	4	17.4	Medium	Early-Stage	Warm	Low-Emissions
2020	China	32.3	1.4	2.7	2100	1202028857	49.4	12	47.2	Medium	Advanced	Warm	Low-Emissions
2006	Argentina	30.7	11.6	3.9	1755	586706107	41.9	10	50.5	Medium	Advanced	Warm	Medium-Emissions
2018	South Africa	33.9	6.0	4.5	827	83947380	17.7	1	56.6	Low	Early-Stage	Warm	Low-Emissions
2022	UK	27.8	16.6	1.5	1966	980305187	8.2	4	43.4	Medium	Early-Stage	Warm	High-Emissions
2010	Australia	18.3	1.9	3.5	2599	849496137	7.5	5	48.7	High	Early-Stage	Moderate	Low-Emissions
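Because `cut(x, breaks = 3)` splits the range into equal-width intervals, the share of observations in each bin depends on how the variable is distributed. The sketch below contrasts equal-width bins with equal-frequency (tertile) bins on a small, made-up skewed vector. The values in `x` and the name `bin_labels` are illustrative assumptions and are not taken from the climate dataset.

```r
# Hypothetical, deliberately skewed values (not from the climate dataset)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
bin_labels <- c("Low", "Medium", "High")

# Equal-width bins: the approach used in the report
equal_width <- cut(x, breaks = 3, labels = bin_labels)

# Equal-frequency bins: break points placed at the tertiles
tertiles <- quantile(x, probs = c(0, 1/3, 2/3, 1))
equal_freq <- cut(x, breaks = tertiles, labels = bin_labels,
                  include.lowest = TRUE)

print(table(equal_width))  # most values land in "Low"
print(table(equal_freq))   # values are spread roughly evenly
```

For variables that are close to uniformly distributed, which the evenly spaced quartiles in the summary output suggest may be the case here, the two approaches give similar bins, and each category ends up with roughly a third of the rows.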

2. Geographic Distribution and Thermal Profiles

Objective

The aim is to identify which countries are most and least represented in the dataset and to examine their average temperature profiles for signs of geographic sampling bias.

df_1 <- df %>%
  group_by(Country) %>%
  summarise(
    Avg_Temp = mean(Avg.Temperature...C., na.rm = TRUE),
    SD_Temp = sd(Avg.Temperature...C., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_GEOGRAPHY", "Standard")
  ) %>%
  arrange(Group_Count)

df_1 %>% kable(caption = "Country-Level Analysis: Probability Distribution", 
               digits = 3)

Country-Level Analysis: Probability Distribution
Country	Avg_Temp	SD_Temp	Group_Count	Probability	Tag
Mexico	20.696	9.621	55	0.055	RARE_GEOGRAPHY
Australia	19.449	7.829	57	0.057	Standard
Germany	20.289	8.041	61	0.061	Standard
Japan	20.454	9.132	63	0.063	Standard
UK	18.491	8.200	65	0.065	Standard
France	19.383	8.884	66	0.066	Standard
Argentina	19.299	8.330	67	0.067	Standard
Brazil	20.836	8.970	67	0.067	Standard
Canada	20.012	8.445	67	0.067	Standard
China	20.282	8.788	67	0.067	Standard
India	19.764	8.669	70	0.070	Standard
South Africa	20.738	9.170	73	0.073	Standard
USA	19.049	8.341	73	0.073	Standard
Russia	20.719	7.833	74	0.074	Standard
Indonesia	18.919	8.244	75	0.075	Standard

Analysis

rarest_country <- df_1 %>% filter(Tag == "RARE_GEOGRAPHY")
most_common_country <- df_1 %>% filter(Group_Count == max(Group_Count))

cat("Rarest Country:", rarest_country$Country, "\n")

## Rarest Country: Mexico

cat("Probability:", round(rarest_country$Probability, 4), "\n")

## Probability: 0.055

cat("Count:", rarest_country$Group_Count, "\n\n")

## Count: 55

cat("Most Common Country:", most_common_country$Country, "\n")

## Most Common Country: Indonesia

cat("Probability:", round(most_common_country$Probability, 4), "\n")

## Probability: 0.075

cat("Count:", most_common_country$Group_Count, "\n")

## Count: 75

The rarest country has a probability of 0.055: a row selected at random from the dataset has only a 5.5% chance of belonging to Mexico. Under a uniform distribution across the 15 countries, each would be expected at about 6.7%, so this shortfall may point to a geographic sampling bias, although the gap is small. Countries with lower representation might also have higher temperature variance, which could suggest limited monitoring infrastructure in certain regions. This is speculation at this stage.

Testable Hypothesis: Countries with lower representation in the dataset have higher variance in their temperature measurements than more frequently sampled countries.

df_1 <- df_1 %>%
  mutate(Temp_CV = SD_Temp / abs(Avg_Temp))

cor_test <- cor.test(df_1$Group_Count, df_1$Temp_CV, method = "spearman")
cat("Spearman correlation between sample size and temperature CV:", 
    round(cor_test$estimate, 3), "\n")

## Spearman correlation between sample size and temperature CV: -0.202

cat("P-value:", format.pval(cor_test$p.value, digits = 3), "\n")

## P-value: 0.47
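The comparison with a uniform 6.7% share per country can also be checked formally. A minimal sketch, assuming a chi-square goodness-of-fit test is an acceptable check here: the counts are copied from the country-level table, and the object names `country_counts` and `gof` are introduced only for illustration.

```r
# Observation counts per country, copied from the country-level table
country_counts <- c(Mexico = 55, Australia = 57, Germany = 61, Japan = 63,
                    UK = 65, France = 66, Argentina = 67, Brazil = 67,
                    Canada = 67, China = 67, India = 70,
                    "South Africa" = 73, USA = 73, Russia = 74,
                    Indonesia = 75)

# Goodness-of-fit test; with no `p` argument, chisq.test() assumes
# equal probabilities for every category
gof <- chisq.test(country_counts)
print(gof)
```

The statistic comes to 7.4 on 14 degrees of freedom. A value below the degrees of freedom gives no evidence against a uniform distribution, so the spread of 55 to 75 rows per country is compatible with ordinary sampling variation.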

ggplot(df_1, aes(x = reorder(Country, Group_Count), y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("RARE_GEOGRAPHY" = "#d73027", "Standard" = "#4575b4")) +
  labs(title = "Probability of Occurrence by Country",
       subtitle = "Red indicates the rarest geographic group",
       x = "Country", 
       y = "Observation Count") +
  geom_hline(yintercept = mean(df_1$Group_Count), linetype = "dashed", color = "gray40") +
  annotate("text", x = 1, y = mean(df_1$Group_Count), 
           label = "Mean Count", vjust = -0.5, hjust = -0.1, size = 3.5) +
  theme(legend.position = "bottom")

This bar chart shows the observation counts per country with the rarest geography highlighted in red.

The rarest country sits below the mean count (dashed line), possibly because of data collection gaps, and the counts vary from 55 to 75 across countries. This uneven distribution may bias climate trend analyses toward over-represented regions.

3. Hydrological States and Carbon Footprints

Objective

To examine the relationship between rainfall categories and CO2 emissions to identify which hydrological states are most anomalous.

df_2 <- df %>%
  group_by(Rainfall_Cat) %>%
  summarise(
    Mean_CO2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    Median_CO2 = median(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    SD_CO2 = sd(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_HYDRO_STATE", "Common")
  ) %>%
  arrange(Probability)

df_2 %>% kable(caption = "Hydrological State Analysis: Rainfall vs CO2 Emissions",
               digits = 3)

Hydrological State Analysis: Rainfall vs CO2 Emissions
Rainfall_Cat	Mean_CO2	Median_CO2	SD_CO2	Group_Count	Probability	Tag
High	10.445	10.8	5.726	319	0.319	RARE_HYDRO_STATE
Low	10.212	10.5	5.527	335	0.335	Common
Medium	10.615	10.8	5.605	346	0.346	Common

Analysis

rarest_hydro <- df_2 %>% filter(Tag == "RARE_HYDRO_STATE")
cat("Rarest Hydrological State:", as.character(rarest_hydro$Rainfall_Cat), "\n")

## Rarest Hydrological State: High

cat("Probability:", round(rarest_hydro$Probability, 4), "\n")

## Probability: 0.319

cat("Mean CO2 Emissions:", round(rarest_hydro$Mean_CO2, 2), "Tons/Capita\n")

## Mean CO2 Emissions: 10.45 Tons/Capita

The rarest hydrological state has a probability of 0.319, meaning 31.9% of observations fall into the High rainfall category. This is only slightly below the one-third share expected if the three bins were equally populated, so High rainfall is the least common category rather than a truly rare one. Mean CO2 emissions are similar across the three categories (10.2 to 10.6 tons per capita), so any link between hydrological extremes and industrial activity or energy consumption remains to be tested.
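Tagging the group with the minimum count always marks at least one group as rare, even when the groups are almost equally sized. The sketch below shows one possible alternative, flagging a group only when its count falls well below the count expected under equal probabilities. The helper `tag_rare()` and its `threshold` argument are hypothetical and not part of the original analysis. The rainfall counts come from the table above, and the second set of counts is made up.

```r
# Hypothetical helper: flag a group as rare only when its count falls
# well below the count expected under equal group probabilities
tag_rare <- function(counts, threshold = -2) {
  stdres <- chisq.test(counts)$stdres  # standardized residuals
  ifelse(stdres < threshold, "RARE", "Common")
}

# Counts from the hydrological state table
rainfall_counts <- c(High = 319, Low = 335, Medium = 346)
print(tag_rare(rainfall_counts))

# Made-up counts where one group really is scarce
print(tag_rare(c(A = 40, B = 480, C = 480)))
```

Under this rule none of the three rainfall categories is flagged, while group `A` in the made-up example is.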

Testable Hypothesis: High rainfall states have significantly different CO2 emission levels from low and medium rainfall states, which can be checked with a t-test comparing group means.

high_rainfall <- df %>% filter(Rainfall_Cat == "High")
other_rainfall <- df %>% filter(Rainfall_Cat != "High")

t_test_result <- t.test(high_rainfall$CO2.Emissions..Tons.Capita., 
                        other_rainfall$CO2.Emissions..Tons.Capita.)

cat("T-test comparing High vs Other Rainfall States:\n")

## T-test comparing High vs Other Rainfall States:

cat("Mean CO2 (High):", round(mean(high_rainfall$CO2.Emissions..Tons.Capita., na.rm = TRUE), 2), "\n")

## Mean CO2 (High): 10.45

cat("Mean CO2 (Other):", round(mean(other_rainfall$CO2.Emissions..Tons.Capita., na.rm = TRUE), 2), "\n")

## Mean CO2 (Other): 10.42

cat("P-value:", format.pval(t_test_result$p.value, digits = 3), "\n")

## P-value: 0.941

ggplot(df_2, aes(x = Rainfall_Cat, y = Group_Count, fill = Tag)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("RARE_HYDRO_STATE" = "#d73027", "Common" = "#91bfdb")) +
  labs(title = "Hydrological State Distribution",
       subtitle = "Frequency of rainfall categories in the dataset",
       x = "Rainfall Category",
       y = "Number of Observations") +
  geom_text(aes(label = paste0("P=", round(Probability, 3))), 
            vjust = -0.5, size = 4) +
  theme(legend.position = "bottom")

This bar chart shows the distribution of observations across rainfall categories with probabilities labeled.

This visualization shows that the High rainfall category occurs slightly less often than the Low and Medium categories. The rarest hydrological state (marked in red) may still be worth monitoring as a possible indicator of climatological extremes, but the differences in frequency are small.

4. Energy Transition and Weather Extremes

Objective

To examine the relationship between renewable energy adoption stages and extreme weather frequency in order to understand energy-climate interactions.

df_3 <- df %>%
  group_by(Renewable_Cat) %>%
  summarise(
    Max_Temp = max(Avg.Temperature...C., na.rm = TRUE),
    Max_Rainfall = max(Rainfall..mm., na.rm = TRUE),
    Extreme_Weather_Score = max(Avg.Temperature...C., na.rm = TRUE) + 
                            max(Rainfall..mm., na.rm = TRUE),
    Group_Count = n()
  ) %>%
  mutate(
    Probability = Group_Count / sum(Group_Count),
    Tag = ifelse(Group_Count == min(Group_Count), "RARE_ENERGY_STATE", "Common")
  ) %>%
  arrange(Probability)

df_3 %>% kable(caption = "Energy Transition Analysis: Renewable Adoption vs Weather Extremes",
               digits = 3)

Energy Transition Analysis: Renewable Adoption vs Weather Extremes
Renewable_Cat	Max_Temp	Max_Rainfall	Extreme_Weather_Score	Group_Count	Probability	Tag
Advanced	34.6	2989	3023.6	323	0.323	RARE_ENERGY_STATE
Early-Stage	34.9	2999	3033.9	334	0.334	Common
Transitioning	34.9	2995	3029.9	343	0.343	Common

Analysis

rarest_energy <- df_3 %>% filter(Tag == "RARE_ENERGY_STATE")
cat("Rarest Energy Transition State:", as.character(rarest_energy$Renewable_Cat), "\n")

## Rarest Energy Transition State: Advanced

cat("Probability:", round(rarest_energy$Probability, 4), "\n")

## Probability: 0.323

cat("Extreme Weather Score:", round(rarest_energy$Extreme_Weather_Score, 2), "\n")

## Extreme Weather Score: 3023.6

The rarest energy state has a probability of 0.323, representing 32.3% of observations. Advanced renewable adoption is the least common category, though only slightly below the one-third share of an even split. The extreme weather score may hint at relationships between energy infrastructure development and climate vulnerability, though causality requires further investigation.
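The extreme weather score adds a temperature in °C to a rainfall total in mm, so the two components contribute on very different scales. The sketch below uses the three rows of the energy transition table to show how much of the score comes from rainfall. It then shows one possible alternative, standardizing each component before adding. That alternative is an assumption for illustration and not the report's method.

```r
# Values copied from the energy transition table
energy <- data.frame(
  Renewable_Cat = c("Advanced", "Early-Stage", "Transitioning"),
  Max_Temp      = c(34.6, 34.9, 34.9),
  Max_Rainfall  = c(2989, 2999, 2995)
)

# Share of the additive score that comes from rainfall alone
energy$Score <- energy$Max_Temp + energy$Max_Rainfall
energy$Rainfall_Share <- energy$Max_Rainfall / energy$Score

# One possible alternative (an assumption, not the report's method):
# put both components on a z-score scale before adding them
energy$Score_Std <- as.numeric(scale(energy$Max_Temp)) +
                    as.numeric(scale(energy$Max_Rainfall))

print(energy)
```

Rainfall accounts for more than 98% of the additive score in every category, so the score mostly tracks the maximum rainfall value.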

Testable Hypothesis: Regions in advanced renewable energy adoption stages experience more extreme weather events compared to early-stage regions (quantifiable by comparing extreme weather scores across categories using ANOVA).

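# Note: df_3 holds one aggregated row per category, so this model has no
# residual degrees of freedom and the summary cannot report an F statistic
anova_result <- aov(Extreme_Weather_Score ~ Renewable_Cat, data = df_3)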
cat("ANOVA testing weather extremes across energy transition states:\n")

## ANOVA testing weather extremes across energy transition states:

print(summary(anova_result))

##               Df Sum Sq Mean Sq
## Renewable_Cat  2  53.93   26.96
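The ANOVA table above has no residual row, F value, or p-value because the model was fitted to three aggregated rows, one per category. The sketch below reproduces that situation with the three scores from the table. It then shows how the same call behaves on row-level data, using a simulated data frame `sim` whose values are hypothetical. In the real analysis the row-level `df` and a per-row measure such as `Extreme.Weather.Events` would presumably take its place.

```r
# The three aggregated scores from the energy transition table:
# one value per group leaves no residual degrees of freedom
agg <- data.frame(
  Renewable_Cat = factor(c("Advanced", "Early-Stage", "Transitioning")),
  Extreme_Weather_Score = c(3023.6, 3033.9, 3029.9)
)
fit_agg <- aov(Extreme_Weather_Score ~ Renewable_Cat, data = agg)
print(df.residual(fit_agg))  # 0, so no F statistic can be formed

# Simulated row-level data (hypothetical values, for illustration only)
set.seed(42)
sim <- data.frame(
  Renewable_Cat = factor(sample(c("Early-Stage", "Transitioning", "Advanced"),
                                size = 300, replace = TRUE)),
  Extreme.Weather.Events = sample(0:14, size = 300, replace = TRUE)
)
fit_rows <- aov(Extreme.Weather.Events ~ Renewable_Cat, data = sim)
print(summary(fit_rows))  # now includes an F value and Pr(>F)
```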

ggplot(df_3, aes(x = Renewable_Cat, y = Extreme_Weather_Score, fill = Tag)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("RARE_ENERGY_STATE" = "#d73027", "Common" = "#1a9850")) +
  labs(title = "Energy Transition Stages and Weather Extremes",
       subtitle = "Relationship between renewable adoption and extreme weather",
       x = "Renewable Energy Adoption Stage",
       y = "Extreme Weather Score (Max Temp + Max Rainfall)") +
  geom_text(aes(label = paste0("n=", Group_Count)), 
            vjust = -0.5, size = 4) +
  theme(legend.position = "bottom")

This bar chart displays extreme weather scores across renewable energy adoption stages.

The scores are nearly identical across stages: Early-Stage is highest (3033.9), followed by Transitioning (3029.9) and Advanced (3023.6). If a real difference exists, it could be because climate vulnerability drove energy transitions or because renewable infrastructure faced deployment challenges in extreme conditions. Advanced adoption is the least common stage (marked in red). Whether climate impacts accelerate or hinder clean energy deployment warrants further investigation.

5. Bivariate Categorical Analysis: Country and Rainfall Combinations

Objective

The aim is to enumerate all possible combinations of country and rainfall category, detect missing combinations, and analyze the most and least common pairings.

all_countries <- unique(df$Country)
all_rainfall <- c("Low", "Medium", "High")

all_combinations <- expand.grid(
  Country = all_countries,
  Rainfall_Cat = all_rainfall,
  stringsAsFactors = FALSE
)

combinations_df <- df %>%
  group_by(Country, Rainfall_Cat) %>%
  summarise(Count = n(), .groups = "drop") %>%
  right_join(all_combinations, by = c("Country", "Rainfall_Cat")) %>%
  mutate(Count = replace_na(Count, 0)) %>%
  mutate(Probability = Count / sum(Count[Count > 0]))

combinations_df %>% 
  arrange(desc(Count)) %>% 
  head(10) %>%
  kable(caption = "Top 10 Most Common Country-Rainfall Combinations", digits = 4)

Top 10 Most Common Country-Rainfall Combinations
Country	Rainfall_Cat	Count	Probability
India	Medium	29	0.029
Russia	Low	29	0.029
UK	Low	29	0.029
Indonesia	Medium	28	0.028
South Africa	Low	28	0.028
France	High	26	0.026
Indonesia	High	26	0.026
Russia	Medium	26	0.026
USA	Medium	26	0.026
Brazil	High	25	0.025

Missing Combinations Analysis

missing_combos <- combinations_df %>%
  filter(Count == 0) %>%
  select(Country, Rainfall_Cat)

cat("Number of missing combinations:", nrow(missing_combos), "\n\n")

## Number of missing combinations: 0

if(nrow(missing_combos) > 0) {
  cat("Missing combinations:\n")
  print(missing_combos)
}
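The same check for missing combinations can be done without building the grid by hand. A minimal base R sketch, assuming the rainfall category is stored as a factor: `table()` cross-tabulates every level, so unobserved pairs show up with a count of zero. The data frame `toy` is made up for illustration.

```r
# Toy data (hypothetical): country B never appears with Medium or High rainfall
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  Rainfall_Cat = factor(c("Low", "Medium", "High", "Low", "Low"),
                        levels = c("Low", "Medium", "High"))
)

# table() cross-tabulates every factor level, so unobserved
# combinations appear with a count of zero
combo_counts <- as.data.frame(table(Country = toy$Country,
                                    Rainfall_Cat = toy$Rainfall_Cat),
                              responseName = "Count")

missing_toy <- subset(combo_counts, Count == 0)
print(missing_toy)
```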

Most and Least Common Combinations

most_common <- combinations_df %>%
  filter(Count > 0) %>%
  arrange(desc(Count)) %>%
  head(10)

least_common <- combinations_df %>%
  filter(Count > 0) %>%
  arrange(Count) %>%
  head(10)

cat("\nMost Common Combinations:\n")

## 
## Most Common Combinations:

most_common %>% 
  select(Country, Rainfall_Cat, Count, Probability) %>%
  kable(digits = 4)

Country	Rainfall_Cat	Count	Probability
India	Medium	29	0.029
Russia	Low	29	0.029
UK	Low	29	0.029
Indonesia	Medium	28	0.028
South Africa	Low	28	0.028
France	High	26	0.026
Indonesia	High	26	0.026
Russia	Medium	26	0.026
USA	Medium	26	0.026
Brazil	High	25	0.025

cat("\nLeast Common Combinations:\n")

## 
## Least Common Combinations:

least_common %>% 
  select(Country, Rainfall_Cat, Count, Probability) %>%
  kable(digits = 4)

Country	Rainfall_Cat	Count	Probability
UK	High	14	0.014
France	Medium	16	0.016
India	High	16	0.016
Australia	Low	17	0.017
Mexico	Low	17	0.017
Germany	Low	18	0.018
Australia	High	19	0.019
Germany	High	19	0.019
Japan	Low	19	0.019
Mexico	Medium	19	0.019

Interpretation of Combinations

if(nrow(missing_combos) > 0) {
  cat("The dataset is missing", nrow(missing_combos), "country-rainfall combinations.\n")
  cat("This suggests incomplete temporal coverage, climate constraints preventing certain rainfall states, or data collection gaps during specific hydrological conditions.\n")
} else {
  cat("Every country experienced all three rainfall categories, suggesting comprehensive temporal coverage or synthetic data balancing.\n")
}

## Every country experienced all three rainfall categories, suggesting comprehensive temporal coverage or synthetic data balancing.

The most frequent combinations may represent countries with stable rainfall patterns and extensive data collection. These are probably regions where certain rainfall levels are expected and monitoring infrastructure is well established. The most common pairings may reflect typical climate zones with consistent hydrological patterns across multiple observation periods.

The rarest combinations are candidates for anomalies that are unlikely under typical climate patterns. They may include extreme weather events that occur infrequently, transitions between climate states during unusual atmospheric conditions, or geographic outliers experiencing atypical precipitation.
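To judge whether the rarest and most common pairings are unusual, it helps to know how much the counts would vary by chance. The sketch below works through the arithmetic under a simplifying assumption that all 45 country-rainfall cells are equally likely, which the report does not establish. The observed counts are the smallest and largest values from the tables above.

```r
n_obs   <- 1000      # rows in the dataset
n_cells <- 15 * 3    # countries x rainfall categories
p_cell  <- 1 / n_cells

# Expected count and binomial standard deviation for a single cell,
# assuming (as a simplification) that all cells are equally likely
expected_count <- n_obs * p_cell
sd_count <- sqrt(n_obs * p_cell * (1 - p_cell))

# Smallest and largest observed counts from the tables above
observed <- c(UK_High = 14, India_Medium = 29)
z_scores <- (observed - expected_count) / sd_count
print(round(z_scores, 2))
```

Both extremes lie within two standard deviations of the expected count of about 22. With 45 cells, a range of this size is plausible from sampling variation alone.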

Visualization of Combinations

ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Count)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_gradient(low = "#f7fbff", high = "#08306b",
                      name = "Count") +
  labs(title = "Heatmap of Country-Rainfall Combinations",
       subtitle = "Darker tiles represent higher observation counts",
       x = "Country",
       y = "Rainfall Category") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

This heatmap visualizes the frequency of all country-rainfall combinations with darker tiles showing higher occurrence.

The heatmap shows how often each country falls into each rainfall category. Very light tiles mark the rarest combinations; no tile is empty, since every combination occurs at least once. Darker tiles may reflect stable climate-country relationships, while lighter ones may reflect high interannual precipitation variability or sparser sampling.

ggplot(combinations_df, aes(x = Country, y = Rainfall_Cat, fill = Probability)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = Count), size = 3, color = "white") +
  scale_fill_gradient(low = "#feedde", high = "#a63603",
                      name = "Probability",
                      labels = percent_format(accuracy = 0.1)) +
  labs(title = "Probability Distribution of Country-Rainfall Combinations",
       subtitle = "Numbers show observation counts; color intensity shows probability",
       x = "Country",
       y = "Rainfall Category") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

This probability-weighted heatmap shows both the count and the relative likelihood of each combination.

By overlaying counts on a probability gradient, this visualization highlights which combinations dominate the dataset and which are statistical outliers. High-probability combinations (the darkest tiles) merit investigation for their typicality, while low-probability ones (the palest tiles) may signal emerging climate trends or data quality issues that need validation.

top_bottom <- bind_rows(
  combinations_df %>% 
    arrange(desc(Count)) %>% 
    head(10) %>% 
    mutate(Category = "Top 10 Most Common"),
  combinations_df %>% 
    arrange(Count) %>% 
    filter(Count > 0) %>%
    head(10) %>% 
    mutate(Category = "Top 10 Rarest")
) %>%
  mutate(Combo = paste(Country, Rainfall_Cat, sep = " - "))

ggplot(top_bottom, aes(x = reorder(Combo, Count), y = Count, fill = Category)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Top 10 Most Common" = "#2166ac", 
                               "Top 10 Rarest" = "#b2182b")) +
  labs(title = "Extreme Combinations: Most Common vs Rarest",
       x = "Country-Rainfall Combination",
       y = "Observation Count",
       fill = "") +
  theme(legend.position = "bottom")

This bar plot contrasts the top 10 most common combinations against the 10 rarest non-zero combinations.

The contrast between common and rare combinations shows a modest distributional skew, with counts ranging from 14 to 29. The most common combinations probably represent baseline climate conditions, while the rarest may flag anomalies that need deeper investigation.

6. Synthesis and Conclusions

Summary of Key Findings

1. Geographic Sampling Bias: Mexico is the rarest country (P = 0.055), which may indicate under-sampling. The Spearman correlation between sample size and the temperature coefficient of variation was weak and not significant (rho = -0.202, p = 0.47), so the data do not support the hypothesis that less-sampled countries have more variable temperature measurements.

2. Hydrological State Distribution: High rainfall is the rarest state (P = 0.319). CO2 emissions show no significant difference between high and other rainfall states (p = 0.941), so the data give no evidence of a link between precipitation patterns and emissions.

3. Energy Transition Patterns: Advanced renewable adoption is the rarest stage (P = 0.323). Extreme weather scores are nearly identical across stages (3023.6 to 3033.9), and the ANOVA on the aggregated table could not produce an F-test, so any relationship between climate vulnerability and energy infrastructure deployment remains an open question.

4. Combination Analysis: All 45 country-rainfall pairs occur at least once, which suggests comprehensive temporal sampling or synthetic data balancing. The rarest combinations are candidates for climatological anomalies; the most common may reflect stable climate patterns.

Session Information

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0    knitr_1.51      lubridate_1.9.4 forcats_1.0.1  
##  [5] stringr_1.6.0   dplyr_1.1.4     purrr_1.2.1     readr_2.1.6    
##  [9] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
##  [5] jquerylib_0.1.4    yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
##  [9] labeling_0.4.3     generics_0.1.4     bslib_0.10.0       pillar_1.11.1     
## [13] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        stringi_1.8.7     
## [17] cachem_1.1.0       xfun_0.56          sass_0.4.10        S7_0.2.1          
## [21] otel_0.2.0         timechange_0.3.0   cli_3.6.5          withr_3.0.2       
## [25] magrittr_2.0.4     digest_0.6.39      grid_4.5.2         rstudioapi_0.18.0 
## [29] hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.1        evaluate_1.0.5    
## [33] glue_1.8.0         farver_2.1.2       rmarkdown_2.30     tools_4.5.2       
## [37] pkgconfig_2.0.3    htmltools_0.5.9

Climate Change Data Analysis

Tiyasha Banerjee

`02-02-2026`
