This data dive uses a global climate dataset covering countries from 2000 to 2023. The columns include average temperature, CO2 emissions per capita, sea level rise, rainfall, population, renewable energy usage, extreme weather events, and forest area percentage.
Rather than going after the most obvious angle, such as CO2 vs. temperature, this analysis takes a slightly weirder path. The two pairs explored here ask: does rainfall suppress a country’s heat-exposed barren land score, and does CO2 per capita relate to how resilient (or not) a country is in terms of green energy relative to the disasters it faces? Both constructed variables are explained below.
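library(dplyr)    # mutate(), glimpse(), and the %>% pipe
library(ggplot2)  # scatter plots further down
climate <- read.csv("C:/Users/IU Student/Downloads/climate_change_dataset.csv", check.names = FALSE)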
colnames(climate) <- c("Year", "Country", "Avg_Temp", "CO2_Capita",
"Sea_Level_Rise", "Rainfall", "Population",
"Renewable_Pct", "Extreme_Weather", "Forest_Area")
climate <- climate %>%
mutate(heat_barren = (100 - Forest_Area) * Avg_Temp / 100)
climate <- climate %>%
mutate(green_resilience = Renewable_Pct / (Extreme_Weather + 1))
glimpse(climate)
## Rows: 1,000
## Columns: 12
## $ Year <int> 2006, 2019, 2014, 2010, 2007, 2020, 2006, 2018, 2022,…
## $ Country <chr> "UK", "USA", "France", "Argentina", "Germany", "China…
## $ Avg_Temp <dbl> 8.9, 31.0, 33.9, 5.9, 26.9, 32.3, 30.7, 33.9, 27.8, 1…
## $ CO2_Capita <dbl> 9.3, 4.8, 2.8, 1.8, 5.6, 1.4, 11.6, 6.0, 16.6, 1.9, 4…
## $ Sea_Level_Rise <dbl> 3.1, 4.2, 2.2, 3.2, 2.4, 2.7, 3.9, 4.5, 1.5, 3.5, 3.3…
## $ Rainfall <int> 1441, 2407, 1241, 1892, 1743, 2100, 1755, 827, 1966, …
## $ Population <int> 530911230, 107364344, 441101758, 1069669579, 12407917…
## $ Renewable_Pct <dbl> 20.4, 49.2, 33.3, 23.7, 12.5, 49.4, 41.9, 17.7, 8.2, …
## $ Extreme_Weather <int> 14, 8, 9, 7, 4, 12, 10, 1, 4, 5, 13, 14, 0, 10, 3, 8,…
## $ Forest_Area <dbl> 59.8, 31.0, 35.5, 17.7, 17.4, 47.2, 50.5, 56.6, 43.4,…
## $ heat_barren <dbl> 3.5778, 21.3900, 21.8655, 4.8557, 22.2194, 17.0544, 1…
## $ green_resilience <dbl> 1.3600000, 5.4666667, 3.3300000, 2.9625000, 2.5000000…
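As a quick sanity check on the two formulas, the sketch below recomputes both constructed variables by hand for the first two rows shown in the `glimpse()` preview (UK and USA). The small `preview` data frame is a hypothetical stand-in typed from that output, not a reload of the CSV.

```r
# First two rows as they appear in the glimpse() output (UK, USA).
# `preview` is a hand-typed stand-in, not the full dataset.
preview <- data.frame(
  Country         = c("UK", "USA"),
  Avg_Temp        = c(8.9, 31.0),
  Forest_Area     = c(59.8, 31.0),
  Renewable_Pct   = c(20.4, 49.2),
  Extreme_Weather = c(14, 8)
)

# heat_barren: percentage of land that is not forest, scaled by average temperature.
preview$heat_barren <- (100 - preview$Forest_Area) * preview$Avg_Temp / 100

# green_resilience: renewable share per extreme weather event.
# The + 1 keeps the ratio defined in years with zero events.
preview$green_resilience <- preview$Renewable_Pct / (preview$Extreme_Weather + 1)

print(preview[, c("Country", "heat_barren", "green_resilience")])
```

The results (3.5778 and 21.39 for `heat_barren`, 1.36 and about 5.467 for `green_resilience`) match the first two values in the preview.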
Original variable: Rainfall (mm), the explanatory variable
Mutated variable: heat_barren, the response variable
The idea here is that rainfall supports vegetation and tree cover, so countries with more rain should have more forests, which in turn should lower the heat-barren index. I am just testing whether that chain of logic actually shows up in the data.
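One way to test that chain link by link, rather than only end to end, is to look at each correlation separately. The sketch below is a minimal illustration: `chain_links()` is a hypothetical helper, and `preview` holds only the first nine rows visible in the `glimpse()` output, so the numbers it prints say nothing reliable about the full 1,000-row dataset.

```r
# First nine rows typed from the glimpse() preview; illustration only.
preview <- data.frame(
  Rainfall    = c(1441, 2407, 1241, 1892, 1743, 2100, 1755, 827, 1966),
  Avg_Temp    = c(8.9, 31.0, 33.9, 5.9, 26.9, 32.3, 30.7, 33.9, 27.8),
  Forest_Area = c(59.8, 31.0, 35.5, 17.7, 17.4, 47.2, 50.5, 56.6, 43.4)
)
preview$heat_barren <- (100 - preview$Forest_Area) * preview$Avg_Temp / 100

# Hypothetical helper: Pearson correlation for each link in the chain
# rainfall -> forest cover -> heat-barren index, plus the end-to-end pair.
chain_links <- function(df) {
  c(
    rain_to_forest = cor(df$Rainfall, df$Forest_Area),
    forest_to_heat = cor(df$Forest_Area, df$heat_barren),
    rain_to_heat   = cor(df$Rainfall, df$heat_barren)
  )
}

round(chain_links(preview), 3)
```

Running `chain_links(climate)` on the full data would show whether the chain breaks at the first link (rain does not track forest cover) or the second (forest cover does not dominate the index because temperature also drives it).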
ggplot(climate, aes(x = Rainfall, y = heat_barren)) +
geom_point(alpha = 0.3, color = "#d97c2b", size = 1.5) +
geom_smooth(method = "lm", se = TRUE, color = "#333333", linewidth = 0.8) +
labs(
title = "Does More Rainfall Mean Less Heat-Exposed Bare Land?",
x = "Rainfall (mm)",
y = "Heat-Barren Index\n(Unforested Land × Avg Temp / 100)",
caption = "Each point is one country-year observation"
) +
theme_minimal(base_size = 13)
The scatter is quite wide, which already tells me this relationship isn’t clean or simple. The fitted line tilts slightly downward, so higher rainfall is associated with a marginally lower heat-barren index, but the trend is weak and there is a lot of noise around the regression line. Several high-rainfall observations still have a surprisingly high heat-barren score, which might point to hot, humid countries with little forest cover. I wanted to look past my initial intuition and let the data challenge what I thought I knew; here, the data is telling me that rainfall alone doesn’t explain the heat exposure picture.
cor_pair1 <- cor.test(climate$Rainfall, climate$heat_barren, method = "pearson")
print(cor_pair1)
##
## Pearson's product-moment correlation
##
## data: climate$Rainfall and climate$heat_barren
## t = -0.34152, df = 998, p-value = 0.7328
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07275440 0.05121734
## sample estimates:
## cor
## -0.01081007
The Pearson correlation is r = -0.011, which is negative in sign but effectively zero. The p-value of 0.73 is far above 0.05 and the 95% confidence interval (-0.073 to 0.051) includes zero, so the data give no evidence of a linear relationship between rainfall and the heat-barren index. That matches the wide scatter in the plot: whatever slight downward tilt the fitted line shows is indistinguishable from noise. The chain of logic (more rain, more forest, lower index) does not show up in this data, and other forces that this single variable can’t capture (deforestation rates, climate zone, land use) likely matter far more.
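To see where the numbers in the `cor.test()` output come from, the sketch below rebuilds the t statistic, p-value, and confidence interval from just the reported `r` and the sample size. It uses the standard formulas for a Pearson test (a t statistic on n - 2 degrees of freedom and a Fisher z interval); the variable names are my own.

```r
r <- -0.01081007   # sample correlation reported by cor.test()
n <- 1000          # rows in the dataset, so df = n - 2 = 998

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z-transformation
z          <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci         <- tanh(c(z - half_width, z + half_width))

# Share of the variance in heat_barren that a straight line in Rainfall explains
r_squared <- r^2

round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2], r2 = r_squared), 5)
```

The reconstructed values agree with the printed output, and `r_squared` comes out near 0.0001, meaning rainfall linearly accounts for roughly 0.01% of the variation in the index.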
heat_barren (Response Variable)
n1 <- sum(!is.na(climate$heat_barren))  # count non-missing values, consistent with na.rm = TRUE below
mean1 <- mean(climate$heat_barren, na.rm = TRUE)
se1 <- sd(climate$heat_barren, na.rm = TRUE) / sqrt(n1)
t_crit1 <- qt(0.975, df = n1 - 1)
ci1_lower <- mean1 - t_crit1 * se1
ci1_upper <- mean1 + t_crit1 * se1
cat("Sample mean of heat_barren:", round(mean1, 3), "\n")
## Sample mean of heat_barren: 11.841
cat("95% Confidence Interval: [", round(ci1_lower, 3), ",", round(ci1_upper, 3), "]\n")
## 95% Confidence Interval: [ 11.445 , 12.238 ]
We are 95% confident that the true mean of the heat-barren index across the population of country-year observations falls between 11.445 and 12.238. Because the index combines temperature with the share of land that is not forested, a mean in this range tells us that, on average, countries in this dataset carry a moderate level of heat-exposed non-forested land. The interval is relatively narrow thanks to the large sample (1,000 observations), which gives us good confidence in where the population mean sits. A meaningful follow-up question is whether this interval shifts when broken out by region or decade; we’d expect countries in tropical zones or with ongoing deforestation to pull the mean upward.
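Since the same interval calculation appears twice in this write-up and would be needed again for any by-group breakdown, it could be wrapped in a small function. The sketch below is one way to do that; `mean_ci()` is a hypothetical helper, and the test vector is the first six `heat_barren` values from the preview plus one `NA` I added to show the missing-value handling.

```r
# Hypothetical helper: t-based confidence interval for a mean.
# Missing values are dropped first so that n matches the values actually used.
mean_ci <- function(x, level = 0.95) {
  x      <- x[!is.na(x)]
  n      <- length(x)
  se     <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(mean = mean(x), lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# First six heat_barren values from the glimpse() preview, plus an artificial NA
x <- c(3.5778, 21.3900, 21.8655, 4.8557, 22.2194, 17.0544, NA)

mean_ci(x)
t.test(x)$conf.int  # built-in one-sample interval, for comparison
```

With the full data loaded, a per-decade breakdown could then look something like `tapply(climate$heat_barren, climate$Year %/% 10 * 10, mean_ci)`, which returns one interval per decade.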
Original variable: CO2_Capita (Tons/Capita), the original numeric variable
Mutated variable: green_resilience, paired with CO2_Capita
The question here is a bit offbeat: do countries that emit more CO2 per person also tend to score lower on the green resilience ratio? In other words, are high emitters also the ones with low renewable energy relative to the disasters they face?
ggplot(climate, aes(x = CO2_Capita, y = green_resilience)) +
geom_point(alpha = 0.3, color = "#3b82b4", size = 1.5) +
geom_smooth(method = "lm", se = TRUE, color = "#333333", linewidth = 0.8) +
labs(
title = "CO2 Per Capita vs. Green Resilience Ratio",
x = "CO2 Emissions (Tons per Capita)",
y = "Green Resilience\n(Renewable % / (Extreme Weather + 1))",
caption = "Each point is one country-year observation"
) +
theme_minimal(base_size = 13)
The plot shows a flat or at most slightly negative relationship; there’s no strong linear story here. The vertical spread is substantial at every level of CO2 emissions, meaning countries with the same emission levels have wildly different green resilience scores. A few notable outliers appear in the upper portion of the plot, such as countries with low CO2 and high green resilience, and these are worth flagging. Part of working with data empirically is logging what you see and letting the observations push back on your assumptions. Here, the assumption that high CO2 means low green investment doesn’t hold up cleanly.
cor_pair2 <- cor.test(climate$CO2_Capita, climate$green_resilience, method = "spearman")
print(cor_pair2)
##
## Spearman's rank correlation rho
##
## data: climate$CO2_Capita and climate$green_resilience
## S = 170126768, p-value = 0.512
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.02076163
Spearman’s rho is used here because green_resilience is
a ratio variable and its distribution is likely skewed, so a
rank-based correlation is a more honest choice than assuming normality.
The estimate is rho = -0.021 with a p-value of 0.512, so it is
effectively zero and not statistically significant. That matches
the plot: CO2 per capita doesn’t do a great job of predicting how much
renewable energy a country has relative to its extreme weather burden.
This could mean that some high-emitting countries are still investing
heavily in renewables (perhaps as a transition strategy), or that
extreme weather events are distributed somewhat independently of
emissions profiles.
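To make the rank-based choice concrete, the sketch below shows that Spearman’s rho is simply Pearson’s correlation computed on ranks, and that it does not change under a monotone transformation such as `log()`, which is why skew in the ratio matters less here. It uses only the first nine rows from the `glimpse()` preview, so the value itself is illustrative and not the full-data estimate.

```r
# First nine rows typed from the glimpse() preview; illustration only.
co2       <- c(9.3, 4.8, 2.8, 1.8, 5.6, 1.4, 11.6, 6.0, 16.6)
renewable <- c(20.4, 49.2, 33.3, 23.7, 12.5, 49.4, 41.9, 17.7, 8.2)
events    <- c(14, 8, 9, 7, 4, 12, 10, 1, 4)
green_resilience <- renewable / (events + 1)

# Built-in Spearman correlation
rho_builtin <- cor(co2, green_resilience, method = "spearman")

# Same quantity computed as Pearson correlation on the ranks
rho_by_rank <- cor(rank(co2), rank(green_resilience))

# A monotone transform changes the values but not their ranks, so rho is unchanged
rho_logged <- cor(co2, log(green_resilience), method = "spearman")

c(builtin = rho_builtin, by_rank = rho_by_rank, logged = rho_logged)
```

On the full dataset, tied values are likely, and `cor.test()` then warns that it cannot compute an exact p-value; passing `exact = FALSE` requests the approximation explicitly.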
green_resilience (Response Variable)
n2 <- sum(!is.na(climate$green_resilience))
mean2 <- mean(climate$green_resilience, na.rm = TRUE)
se2 <- sd(climate$green_resilience, na.rm = TRUE) / sqrt(n2)
t_crit2 <- qt(0.975, df = n2 - 1)
ci2_lower <- mean2 - t_crit2 * se2
ci2_upper <- mean2 + t_crit2 * se2
cat("Sample mean of green_resilience:", round(mean2, 3), "\n")
## Sample mean of green_resilience: 5.971
cat("95% Confidence Interval: [", round(ci2_lower, 3), ",", round(ci2_upper, 3), "]\n")
## 95% Confidence Interval: [ 5.483 , 6.459 ]
Based on this, I am 95% confident that the true mean green resilience ratio for the population of country-year observations lies between 5.483 and 6.459. The mean reflects the typical renewable energy percentage per extreme weather event (with one added to the event count) across all countries and years in the dataset. Because extreme weather events range from 0 to 14 in this data, the denominator can swing a lot: countries with both high renewable energy and few extreme weather events push the mean up, while disaster-prone countries with low renewables drag it down. The fairly narrow confidence interval (thanks again to the large sample) means we’re estimating the population mean precisely, but the more interesting follow-up question is whether this ratio has been improving over the 2000–2023 window covered in the data.
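A small worked example makes the denominator effect easier to see. The sketch below holds the renewable share fixed at a hypothetical 30% and varies only the event count across the 0 to 14 range mentioned above.

```r
renewable_pct <- 30    # hypothetical renewable share, held fixed
events        <- 0:14  # range of extreme weather counts cited in the text

# Same formula as green_resilience
ratio <- renewable_pct / (events + 1)

data.frame(events, green_resilience = round(ratio, 2))
```

With identical renewable energy usage, the score runs from 30 at zero events down to 2 at fourteen events, and going from zero events to one already halves it. Much of the spread in `green_resilience` can therefore come from the event count alone.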
As mentioned in this week’s lecture video, documenting what your
variables mean and where they come from is just as important as the
analysis itself. Both mutated variables here, heat_barren
and green_resilience, are constructed quantities that
wouldn’t mean anything to someone reading this without an explanation.
Naming them clearly and stating their formulas upfront is what makes the
analysis replicable and trustworthy.
To summarize the key findings: rainfall has essentially no linear relationship with the heat-barren index (r = -0.011, p = 0.73); the sign is directionally sensible, but rainfall is not a useful predictor on its own. CO2 per capita likewise has almost no relationship with the green resilience ratio (rho = -0.021, p = 0.51), which is the more surprising and interesting result, as it suggests that emissions levels and clean energy investment don’t travel together as neatly as I expected in global climate data.
Further questions worth investigating: does the heat-barren index show a time trend over the 2000–2023 period covered by this data? Are specific countries consistently outliers in the CO2 vs. green resilience plot? And does the answer change when you look at wealthier versus lower-income countries separately?
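The time-trend question could be approached by averaging the index per year and fitting a simple linear model to those yearly means. The sketch below shows the mechanics in base R on the first nine rows from the `glimpse()` preview; with so few rows the slope is meaningless, and the same two calls would need to be run on the full `climate` data frame to say anything real.

```r
# First nine rows typed from the glimpse() preview; illustration only.
preview <- data.frame(
  Year        = c(2006, 2019, 2014, 2010, 2007, 2020, 2006, 2018, 2022),
  Avg_Temp    = c(8.9, 31.0, 33.9, 5.9, 26.9, 32.3, 30.7, 33.9, 27.8),
  Forest_Area = c(59.8, 31.0, 35.5, 17.7, 17.4, 47.2, 50.5, 56.6, 43.4)
)
preview$heat_barren <- (100 - preview$Forest_Area) * preview$Avg_Temp / 100

# One mean per year, so years with more rows do not dominate the fit
yearly <- aggregate(heat_barren ~ Year, data = preview, FUN = mean)

# Slope = average change in the yearly mean index per calendar year
trend <- lm(heat_barren ~ Year, data = yearly)
coef(trend)["Year"]
```

Calling `summary(trend)` on the full-data version would give the standard error and p-value for the slope, which is what would indicate whether any trend is distinguishable from noise.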