Data Dive – Climate Dataset Week 6

Introduction

This data dive uses a global climate dataset covering countries from 2000 to 2023. The columns include average temperature, CO2 emissions per capita, sea level rise, rainfall, population, renewable energy usage, extreme weather events, and forest area percentage.

Rather than going after the most obvious angle, such as CO2 vs. temperature, this analysis takes a slightly weirder path. The two pairs explored here ask: does rainfall suppress a country’s heat-exposed barren land score, and does CO2 per capita relate to how resilient (or not) a country is in terms of green energy relative to the disasters it faces? Both constructed variables are explained below.

Loading and Preparing the Data

library(dplyr)    # %>%, mutate(), glimpse()
library(ggplot2)  # ggplot() for the scatter plots

climate <- read.csv("C:/Users/IU Student/Downloads/climate_change_dataset.csv", check.names = FALSE)

colnames(climate) <- c("Year", "Country", "Avg_Temp", "CO2_Capita", 
                       "Sea_Level_Rise", "Rainfall", "Population", 
                       "Renewable_Pct", "Extreme_Weather", "Forest_Area")

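# heat_barren: share of non-forested land (100 - Forest_Area), scaled by average temperature
climate <- climate %>%
  mutate(heat_barren = (100 - Forest_Area) * Avg_Temp / 100)

# green_resilience: renewable share per extreme weather event (+1 avoids division by zero)
climate <- climate %>%
  mutate(green_resilience = Renewable_Pct / (Extreme_Weather + 1))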

glimpse(climate)

## Rows: 1,000
## Columns: 12
## $ Year             <int> 2006, 2019, 2014, 2010, 2007, 2020, 2006, 2018, 2022,…
## $ Country          <chr> "UK", "USA", "France", "Argentina", "Germany", "China…
## $ Avg_Temp         <dbl> 8.9, 31.0, 33.9, 5.9, 26.9, 32.3, 30.7, 33.9, 27.8, 1…
## $ CO2_Capita       <dbl> 9.3, 4.8, 2.8, 1.8, 5.6, 1.4, 11.6, 6.0, 16.6, 1.9, 4…
## $ Sea_Level_Rise   <dbl> 3.1, 4.2, 2.2, 3.2, 2.4, 2.7, 3.9, 4.5, 1.5, 3.5, 3.3…
## $ Rainfall         <int> 1441, 2407, 1241, 1892, 1743, 2100, 1755, 827, 1966, …
## $ Population       <int> 530911230, 107364344, 441101758, 1069669579, 12407917…
## $ Renewable_Pct    <dbl> 20.4, 49.2, 33.3, 23.7, 12.5, 49.4, 41.9, 17.7, 8.2, …
## $ Extreme_Weather  <int> 14, 8, 9, 7, 4, 12, 10, 1, 4, 5, 13, 14, 0, 10, 3, 8,…
## $ Forest_Area      <dbl> 59.8, 31.0, 35.5, 17.7, 17.4, 47.2, 50.5, 56.6, 43.4,…
## $ heat_barren      <dbl> 3.5778, 21.3900, 21.8655, 4.8557, 22.2194, 17.0544, 1…
## $ green_resilience <dbl> 1.3600000, 5.4666667, 3.3300000, 2.9625000, 2.5000000…
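
As a quick sanity check on the two constructed variables, the formulas can be applied by hand to the first row shown in the glimpse() output. This is only a sketch: the variable names below are illustrative, and the input values are copied from the first entry of each column above.

```r
# Values copied from the first row of the glimpse() output (UK, 2006).
forest_area     <- 59.8  # Forest_Area
avg_temp        <- 8.9   # Avg_Temp
renewable_pct   <- 20.4  # Renewable_Pct
extreme_weather <- 14    # Extreme_Weather

# Same formulas as the mutate() calls in the preparation step.
heat_barren_row1      <- (100 - forest_area) * avg_temp / 100
green_resilience_row1 <- renewable_pct / (extreme_weather + 1)

print(heat_barren_row1)       # matches the first heat_barren value, 3.5778
print(green_resilience_row1)  # matches the first green_resilience value, 1.36
```

Both results agree with the first values of heat_barren and green_resilience in the output, which suggests the mutate() steps did what they were meant to do.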

Pair 1: Rainfall vs. Heat-Exposed Barren Land Index

Original variable: Rainfall (mm) — the explanatory variable
Mutated variable: heat_barren — the response variable

The idea here is that rainfall supports vegetation and tree cover, so countries with more rain should have more forests, which in turn should lower the heat-barren index. I am just testing whether that chain of logic actually shows up in the data.

Visualization

ggplot(climate, aes(x = Rainfall, y = heat_barren)) +
  geom_point(alpha = 0.3, color = "#d97c2b", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, color = "#333333", linewidth = 0.8) +
  labs(
    title = "Does More Rainfall Mean Less Heat-Exposed Bare Land?",
    x = "Rainfall (mm)",
    y = "Heat-Barren Index\n(Unforested Land × Avg Temp / 100)",
    caption = "Each point is one country-year observation"
  ) +
  theme_minimal(base_size = 13)

The scatter is quite wide, which already tells me this relationship isn’t clean or simple. The fitted line tilts only slightly downward, so higher rainfall is at best loosely associated with a lower heat-barren index, and there is a lot of noise around the regression line. Several high-rainfall observations still have a surprisingly high heat-barren score, which might point to hot, humid countries with significant deforestation. I wanted to look past my initial intuition and let the data challenge what I thought I knew; here the data is telling me that rainfall alone doesn’t explain the heat exposure picture.

Correlation

cor_pair1 <- cor.test(climate$Rainfall, climate$heat_barren, method = "pearson")
print(cor_pair1)

## 
##  Pearson's product-moment correlation
## 
## data:  climate$Rainfall and climate$heat_barren
## t = -0.34152, df = 998, p-value = 0.7328
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07275440  0.05121734
## sample estimates:
##         cor 
## -0.01081007

The Pearson correlation here is about -0.011, which is effectively zero. The sign is negative, matching the direction I expected, but with a p-value of 0.73 and a 95% confidence interval of roughly -0.073 to 0.051 (which includes zero), the data give no real evidence of a linear relationship between rainfall and the heat-barren index. The wide scatter we saw visually is reflected numerically: whatever drives heat-exposed bare land (deforestation rates, climate zone, land use), this single variable doesn’t capture it. The correlation value makes sense given what the plot shows, a nearly flat line through a cloud of points.
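
To see where the numbers in the cor.test() output come from, the t statistic, p-value, and confidence interval can be re-derived from just the correlation and the sample size. The sketch below uses the standard t transformation and the Fisher z interval for a Pearson correlation; the variable names are illustrative, and r is the rounded value printed above, so the results should only match to a few decimals.

```r
r <- -0.01081007  # sample correlation, copied from the cor.test() output
n <- 1000         # number of rows, so df = n - 2 = 998

# t statistic for testing "true correlation is 0"
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z transformation
z          <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci         <- tanh(c(z - half_width, z + half_width))

print(t_stat)   # should land close to the reported t = -0.34152
print(p_value)  # should land close to the reported p-value = 0.7328
print(ci)       # should land close to the reported interval
```

Because the interval straddles zero and the t statistic is tiny relative to its reference distribution, the test cannot distinguish this correlation from no correlation at all.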

Confidence Interval for `heat_barren` (Response Variable)

n1 <- sum(!is.na(climate$heat_barren))  # count non-missing values, consistent with na.rm = TRUE below
mean1 <- mean(climate$heat_barren, na.rm = TRUE)
se1   <- sd(climate$heat_barren, na.rm = TRUE) / sqrt(n1)
t_crit1 <- qt(0.975, df = n1 - 1)

ci1_lower <- mean1 - t_crit1 * se1
ci1_upper <- mean1 + t_crit1 * se1

cat("Sample mean of heat_barren:", round(mean1, 3), "\n")

## Sample mean of heat_barren: 11.841

cat("95% Confidence Interval: [", round(ci1_lower, 3), ",", round(ci1_upper, 3), "]\n")

## 95% Confidence Interval: [ 11.445 , 12.238 ]

We are 95% confident that the true mean of the heat-barren index across the population of country-year observations falls within this interval. Because the index combines temperature with the share of non-forested land, a mean in this range (about 11.4 to 12.2) describes the typical level of heat-exposed non-forested land carried by countries in this dataset. The interval is relatively narrow given the large sample (1,000 observations), which gives us good confidence in where the population mean sits. A meaningful follow-up question is whether this interval shifts noticeably when broken out by region or decade; we’d expect countries in tropical zones or with ongoing deforestation to pull the mean upward.
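
Since the same t-based interval is computed for both response variables, the steps can be wrapped in a small helper. This is a sketch only: mean_ci() is a hypothetical name rather than an existing function, and the toy vector is made-up data used purely to exercise it.

```r
# Hypothetical helper: t-based confidence interval for a mean, ignoring NAs.
mean_ci <- function(x, conf_level = 0.95) {
  x      <- x[!is.na(x)]          # drop missing values before counting
  n      <- length(x)
  m      <- mean(x)
  se     <- sd(x) / sqrt(n)       # standard error of the mean
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  c(mean = m, lower = m - t_crit * se, upper = m + t_crit * se)
}

# Toy input (not from the climate dataset), including one missing value.
toy <- c(2.1, 3.4, NA, 5.0, 4.2, 3.9)
res <- mean_ci(toy)
print(res)

# Base R's t.test() reports the same interval, which makes a handy cross-check.
print(t.test(toy)$conf.int)
```

With the dataset loaded, mean_ci(climate$heat_barren) should reproduce the interval reported above, and the same call works for green_resilience.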

Pair 2: CO2 Emissions per Capita vs. Green Resilience Ratio

Original variable: CO2_Capita (tons per capita), used as the explanatory variable
Mutated variable: green_resilience, used as the response variable

The question here is a bit offbeat: do countries that emit more CO2 per person also tend to score lower on the green resilience ratio? In other words, are high emitters also the ones with low renewable energy relative to the disasters they face?

Visualization

ggplot(climate, aes(x = CO2_Capita, y = green_resilience)) +
  geom_point(alpha = 0.3, color = "#3b82b4", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, color = "#333333", linewidth = 0.8) +
  labs(
    title = "CO2 Per Capita vs. Green Resilience Ratio",
    x = "CO2 Emissions (Tons per Capita)",
    y = "Green Resilience\n(Renewable % / (Extreme Weather + 1))",
    caption = "Each point is one country-year observation"
  ) +
  theme_minimal(base_size = 13)

The plot shows a fairly flat, at most slightly negative, relationship; there’s no strong linear story here. The vertical spread is substantial at every level of CO2 emissions, meaning countries with the same emission levels have wildly different green resilience scores. A few notable outliers appear in the upper portion of the plot, such as countries with low CO2 and high green resilience, and these are worth flagging. Part of working with data empirically is logging what you see and letting the observations push back on your assumptions. Here, the assumption that high CO2 = low green investment doesn’t hold up cleanly.

Correlation

cor_pair2 <- cor.test(climate$CO2_Capita, climate$green_resilience, method = "spearman")
print(cor_pair2)

## 
##  Spearman's rank correlation rho
## 
## data:  climate$CO2_Capita and climate$green_resilience
## S = 170126768, p-value = 0.512
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.02076163

Spearman’s rho is used here because green_resilience is a ratio variable and its distribution is likely skewed, so a rank-based correlation is a more honest choice than assuming normality. The estimate is about -0.021 with a p-value of 0.512, so it is statistically indistinguishable from zero, which matches the plot: CO2 per capita doesn’t do a great job of predicting how much renewable energy a country has relative to its extreme weather burden. This could mean that some high-emitting countries are still investing heavily in renewables (perhaps as a transition strategy), or that extreme weather events are distributed somewhat independently of emissions profiles.
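
The S statistic in the output is tied directly to rho: R reports S = (n^3 - n) * (1 - rho) / 6, so rho can be recovered from S and the sample size. The sketch below does that arithmetic with the printed values; the variable names are illustrative.

```r
S <- 170126768  # test statistic, copied from the cor.test() output
n <- 1000       # number of country-year observations

# Invert S = (n^3 - n) * (1 - rho) / 6 to recover rho.
rho_from_S <- 1 - 6 * S / (n^3 - n)

print(rho_from_S)  # agrees with the reported rho of -0.02076163
```

An S just above (n^3 - n) / 6 corresponds to a rho just below zero, which is another way of seeing that the ranks of CO2_Capita and green_resilience are close to unrelated.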

Confidence Interval for `green_resilience` (Response Variable)

n2 <- sum(!is.na(climate$green_resilience))
mean2 <- mean(climate$green_resilience, na.rm = TRUE)
se2   <- sd(climate$green_resilience, na.rm = TRUE) / sqrt(n2)
t_crit2 <- qt(0.975, df = n2 - 1)

ci2_lower <- mean2 - t_crit2 * se2
ci2_upper <- mean2 + t_crit2 * se2

cat("Sample mean of green_resilience:", round(mean2, 3), "\n")

## Sample mean of green_resilience: 5.971

cat("95% Confidence Interval: [", round(ci2_lower, 3), ",", round(ci2_upper, 3), "]\n")

## 95% Confidence Interval: [ 5.483 , 6.459 ]

I am 95% confident that the true mean green resilience ratio for the population of country-year observations lies within this interval. The mean here reflects the typical renewable energy percentage relative to the number of extreme weather events (plus one) across all countries and years in the dataset. Because extreme weather events range from 0 to 14 in this data, the denominator can swing a lot: countries that have both high renewable energy and few extreme weather events push this mean up, while disaster-prone countries with low renewables drag it down. The relatively narrow confidence interval (thanks again to the large sample) means we’re capturing the population mean well, but the more interesting follow-up question is whether this ratio has been improving over the 2000–2023 time window covered in the data.
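
To make the denominator effect concrete, here is a small illustration that holds the renewable share fixed and varies only the event count. The renewable share of 30 is a hypothetical value chosen for easy arithmetic; 0 and 14 are the endpoints of the event range mentioned above.

```r
renewable_pct <- 30              # hypothetical renewable share, held fixed
events        <- c(0, 1, 4, 14)  # extreme weather event counts to compare

# Same formula as green_resilience in the preparation step.
ratio <- renewable_pct / (events + 1)

print(data.frame(events = events, green_resilience = ratio))
```

With identical renewable energy, the ratio falls from 30 at zero events to 2 at fourteen events, so much of the spread in green_resilience can come from the event count alone.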

Summary

As mentioned in this week’s lecture video, documenting what your variables mean and where they come from is just as important as the analysis itself. Both mutated variables here, heat_barren and green_resilience, are constructed quantities that wouldn’t mean anything to someone reading this without an explanation. Naming them clearly and stating their formulas upfront is what makes the analysis replicable and trustworthy.

To summarize the key findings: rainfall has essentially no linear relationship with the heat-barren index (r of about -0.011, p = 0.73). The sign is directionally sensible, but rainfall is not a useful predictor on its own. CO2 per capita likewise has almost no relationship with the green resilience ratio (rho of about -0.021, p = 0.51), which is actually the more surprising and interesting result, as it suggests that emissions levels and clean energy investment don’t travel together as neatly as I expected in global climate data.

Further questions worth investigating: Does the heat-barren index show a time trend over the 2000–2023 period in this data? Are specific countries consistently outliers in the CO2 vs. green resilience plot? And does the answer change when you look at wealthier versus lower-income countries separately?
