The target audience for this work is climatologists, paleoclimatologists, and professionals in adjacent atmospheric science or meteorology fields who may not have a background in statistics or data science.
Background Info
Throughout history, whether by horseback across formidable mountain ranges and arid deserts, or by vessel contending with the ferocity of the restless seas, transmitting a message across the globe was a monumental undertaking. It is difficult to argue that any inherent benefit existed in the arduous trials of that era. However, the advent of modern technology has brought a different challenge: a relentless surge of misinformation and propaganda that saturates our collective consciousness. In this modern “storm” of disregard for objective truth, we must anchor ourselves in the fundamental principles of scrutiny and rigorous critical thinking. No discipline demands this more than the field of science. It is imperative that we scrutinize every facet of our own findings and the claims of others before arriving at our conclusion. As the public looks to us for guidance amidst this noise, failing to uphold these scientific standards risks misleading far more than just ourselves.
A primary vehicle for the dissemination of such misinformation is the manipulation of statistics and numbers. Within the context of climate change, this practice is particularly pervasive among those who wish to deceive. Consequently, it becomes the climatologist’s moral obligation to combine a deep understanding of our Earth and its complex behaviors with robust statistical and data science techniques to find and present the truth hidden within the numbers. We must recognize that, whether intentional or not, inherent biases and inconsistent collection methodologies can render a dataset an unreliable foundation for comprehensive conclusions. Even a dataset free from such malice or bias may remain restricted in scope. Therefore, any extrapolations we draw must be carefully calibrated to reflect those specific limitations.
It is in this light that we focus on our core objective: the vital importance of understanding the nature, structure, and constraints of a dataset. We aim to explore how we can accurately determine these boundaries and, crucially, how to curtail our conclusions to ensure they remain in strict accordance with the data’s actual scope.
Problem Statement
The core objective is to evaluate the structural framework of the dataset while acknowledging its constraints. This understanding ensures that any subsequent analysis remains grounded in the data’s reality when approaching the dataset from differing angles, and allows for critique and scrutiny of any results that may be obtained.
Such an approach is realistic and pragmatic, as most real-world datasets contain some bias, whether intentional or unintentional.
This is a pressing concern, especially when dealing with topics as important as the health and well-being of our globe, something that will define the lives of generations to come.
Initial Exploratory Data Analysis (EDA)
Climate Dataset:
# basic initial setup for now
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pwr)
library(broom)
library(readr)
# this broke again for some reason
library(zoo)
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
# for the multicollinearity check, since the car package is not working
library(performance)
library(knitr)
# load the raw dataset
df_main <- read.csv("climate_change_dataset.csv")
# output a clean table to show the overall structure of the dataset
df_summary <- data.frame(
  Column = colnames(df_main),
  Type = sapply(df_main, class),
  Missing_Values = sapply(df_main, function(x) sum(is.na(x)))
)
kable(df_summary, caption = "Dataset Structure Overview")
Dataset Structure Overview
Column                        Type       Missing_Values
Year                          integer    0
Country                       character  0
Avg.Temperature...C.          numeric    0
CO2.Emissions..Tons.Capita.   numeric    0
Sea.Level.Rise..mm.           numeric    0
Rainfall..mm.                 integer    0
Population                    integer    0
Renewable.Energy....          numeric    0
Extreme.Weather.Events        integer    0
Forest.Area....               numeric    0
# summary statistics
summary(df_main)
Year Country Avg.Temperature...C.
Min. :2000 Length:1000 Min. : 5.00
1st Qu.:2005 Class :character 1st Qu.:12.18
Median :2012 Mode :character Median :20.10
Mean :2011 Mean :19.88
3rd Qu.:2018 3rd Qu.:27.23
Max. :2023 Max. :34.90
CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm.
Min. : 0.500 Min. :1.00 Min. : 501
1st Qu.: 5.575 1st Qu.:2.00 1st Qu.:1099
Median :10.700 Median :3.00 Median :1726
Mean :10.426 Mean :3.01 Mean :1739
3rd Qu.:15.400 3rd Qu.:4.00 3rd Qu.:2362
Max. :20.000 Max. :5.00 Max. :2999
Population Renewable.Energy.... Extreme.Weather.Events
Min. :3.661e+06 Min. : 5.10 Min. : 0.000
1st Qu.:3.436e+08 1st Qu.:16.10 1st Qu.: 3.000
Median :7.131e+08 Median :27.15 Median : 8.000
Mean :7.054e+08 Mean :27.30 Mean : 7.291
3rd Qu.:1.074e+09 3rd Qu.:38.92 3rd Qu.:11.000
Max. :1.397e+09 Max. :50.00 Max. :14.000
Forest.Area....
Min. :10.10
1st Qu.:25.60
Median :41.15
Mean :40.57
3rd Qu.:55.80
Max. :70.00
# missing values per column
df_main |>
  summarise(across(everything(), ~ sum(is.na(.))))
Year Country Avg.Temperature...C. CO2.Emissions..Tons.Capita.
1 0 0 0 0
Sea.Level.Rise..mm. Rainfall..mm. Population Renewable.Energy....
1 0 0 0 0
Extreme.Weather.Events Forest.Area....
1 0 0
This table provides a structural overview of the dataset, including variable types and missing values.
Several important observations can immediately be made:
The dataset contains a mix of numerical and categorical variables
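No missing values are reported, but we must be cautious: a count of zero NA entries does not rule out gaps that were filled or encoded in some other way before the data reached us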
Variables represent aggregated national level indicators rather than controlled experimental measurements
This reinforces a key principle of what we seek to learn with this exploration:
Understanding the structure of a dataset is a prerequisite to any valid statistical analysis.
Failure to understand the dataset’s structure and limitations may result in applying inappropriate methods or drawing conclusions that exceed the dataset’s scope.
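The caution about missing values can be made concrete. The following is a minimal sketch, not part of the original analysis, of two extra structural checks: whether any country-year pair appears more than once, and whether any numeric column contains placeholder codes that is.na() would not flag. The helper name audit_structure and the placeholder values -999 and 9999 are assumptions chosen for illustration; the real codes, if any, would have to come from the dataset's documentation.

```r
# Hypothetical helper: structural checks beyond counting NA values.
# `key_cols` and `sentinels` are illustrative assumptions, not dataset facts.
audit_structure <- function(df, key_cols = c("Country", "Year"),
                            sentinels = c(-999, 9999)) {
  # Rows whose key combination has already appeared earlier in the table
  n_duplicate_keys <- sum(duplicated(df[, key_cols, drop = FALSE]))

  # Per numeric column: how many values equal one of the placeholder codes
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  sentinel_hits <- sapply(
    df[numeric_cols],
    function(x) sum(x %in% sentinels)
  )

  list(n_duplicate_keys = n_duplicate_keys, sentinel_hits = sentinel_hits)
}
```

Calling audit_structure(df_main) would report both counts for the climate dataset. A nonzero duplicate count would also bear directly on the independence assumption, since repeated country-year rows are not separate observations.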
Assumptions Made and Why?
In order to conduct statistical analysis on this dataset, several assumptions must be made. These assumptions are not guarantees of truth, but rather controlled simplifications that allow us to apply statistical tools responsibly. More granular assumptions may have been made at various points of the dataset’s creation (such as tabulation and collection methods), but we cannot possibly know every one of them. For simplicity, we constrain ourselves to the assumptions we can make now:
1. Independence of Observations
We assume that each row (country-year observation) is independent
In reality, this may not strictly hold, as geopolitical, economic, and environmental relationships can introduce dependence between countries
Corrective Action: We can interpret results as associative and avoid overgeneralizing relationships across countries
2. Linearity (for regression models)
We assume a linear relationship between predictors (e.g., renewable energy %) and response (CO2 emissions)
Corrective Action: We visually inspect scatterplots before modeling, and we can validate this assumption using residual diagnostics
3. Homoscedasticity (constant variance of errors)
We assume that the variance of residuals is constant across all levels of the predictor
Corrective Action: Residual vs fitted plots are used to detect violations (may not be added yet)
4. Normality of residuals
We assume residuals follow a normal distribution (important for inference)
Corrective Action: QQ plots are used to evaluate this assumption (may not be added yet)
5. Awareness of Scope
The dataset represents a snapshot across countries and years, and should not be considered a controlled experiment
Many relevant variables (policy, industrialization, population density, etc.) are not included
Corrective Action: All conclusions are framed strictly within the scope of observed variables. We will avoid causal claims and instead focus on statistical relationships and limitations. Our objective is to highlight the limitations that are imposed by dataset scope and structure. We also show how this impacts our approach to model selection before running statistical tests.
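The residual checks named under assumptions 2 to 4 can be sketched in base R as follows. This is an illustrative helper, not the authors' implementation: the function name check_lm_assumptions and its draw argument are assumptions introduced here. Column names are passed as strings so that names such as "CO2.Emissions..Tons.Capita." can be used directly.

```r
# Hypothetical helper: fit a simple linear model and inspect its residuals.
check_lm_assumptions <- function(df, response, predictor, draw = TRUE) {
  fit <- lm(reformulate(predictor, response = response), data = df)
  res <- residuals(fit)

  if (draw) {
    old_par <- par(mfrow = c(1, 2))
    on.exit(par(old_par))
    # Linearity and constant variance: look for curvature or a funnel shape
    plot(fitted(fit), res,
         xlab = "Fitted values", ylab = "Residuals",
         main = "Residuals vs fitted")
    abline(h = 0, lty = 2)
    # Normality: points should follow the reference line
    qqnorm(res)
    qqline(res)
  }

  # Shapiro-Wilk test as a numeric companion to the QQ plot (needs 3 to 5000 rows)
  list(model = fit, shapiro_p = shapiro.test(res)$p.value)
}
```

For the relationship studied in this report, the call would be check_lm_assumptions(df_main, "CO2.Emissions..Tons.Capita.", "Renewable.Energy...."). With around 1000 rows, the Shapiro-Wilk test flags even small departures from normality, so the plots should carry more weight than the p-value alone.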
Analyses and Support
This section focuses on extracting meaningful patterns while continuously validating whether the chosen statistical tools are appropriate for the dataset structure. We will probe the dataset from various angles using differing tests to see what conclusions we can draw. From that, we will scrutinize our conclusions and attempt to give reasoning behind their validity with respect to the test used and the scope of the dataset.
We will demonstrate this by attempting analysis, validating assumptions, and determining if a method is not appropriate.
Renewable Energy vs CO2 Emissions
# scatterplot with a linear fit (red) and a LOESS fit (blue) on the same data
ggplot(df_main, aes(x = Renewable.Energy...., y = CO2.Emissions..Tons.Capita.)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  theme_minimal() +
  labs(
    title = "CO2 vs Renewable Energy: Linear vs Nonlinear Fit",
    subtitle = "Different models suggest different relationships",
    x = "Renewable Energy (%)",
    y = "CO2 Emissions (tons per capita)"
  )
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Analysis
This visualization compares two different modeling approaches applied to the same data. We use a linear regression model and a nonparametric “Locally Estimated Scatterplot Smoothing” (LOESS) curve.
At first glance, both lines attempt to capture the relationship between renewable energy usage and CO2 emissions. However, upon closer inspection the two models diverge in shape around the beginning and middle portions of the data and merge together more tightly near the end.
This highlights a few important things to note:
The trend observed in a dataset is not absolute; it depends on the model used to estimate it.
Our intuition or “gut feeling” can sometimes guide us (incorrectly) to look for the information we expect, which can shape the tests we choose to run and influence the conclusions we draw.
We would expect CO2 emission levels to drop as renewable energy percentages increase. However, this is not what we see: the fitted trend is essentially flat.
The linear model imposes a strict assumption that the relationship is linear across the entire range of the sample. LOESS, on the other hand, fits the data locally and is therefore less restrictive in its assumptions.
If these two models suggest different patterns, it indicates that:
The relationship is weak or unstable
A single global model may not be appropriate
In our case, the patterns are close but still different enough that we cannot confidently claim a strong relationship between renewable energy share and a drop in CO2 emission levels, at least from the tests we ran. It is still possible that a single global model is too coarse to capture some fine-grained relationship that exists but is hidden here.
Thus, without validating our assumptions, different modeling choices can lead to different and potentially conflicting results, which may taint our interpretation of the data and, by extension, the conclusions we draw.
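The visual comparison of the two fits can be backed by a number. The sketch below is an illustration under stated assumptions (the helper name compare_lm_loess and the grid size are invented here): it evaluates both fits on a common grid and reports the largest gap between them, alongside the standard deviation of the response as a scale for judging that gap.

```r
# Hypothetical helper: how far apart are the linear and LOESS fits?
compare_lm_loess <- function(df, response, predictor, n_grid = 100) {
  model_formula <- reformulate(predictor, response = response)
  linear_fit <- lm(model_formula, data = df)
  loess_fit <- loess(model_formula, data = df)

  # Evaluate both fits on an even grid spanning the observed predictor range
  grid <- data.frame(seq(min(df[[predictor]]), max(df[[predictor]]),
                         length.out = n_grid))
  names(grid) <- predictor
  gap <- predict(loess_fit, newdata = grid) - predict(linear_fit, newdata = grid)

  list(
    max_abs_gap = max(abs(gap), na.rm = TRUE),
    response_sd = sd(df[[response]]) # scale for judging the size of the gap
  )
}
```

A maximum gap that is small relative to response_sd suggests the two models tell essentially the same story; a large one suggests the linear assumption is hiding local structure.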
Raw Data vs Country-Averaged Data
# aggregated version: one mean per country
df_country_avg <- df_main |>
  group_by(Country) |>
  summarise(
    avg_co2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    avg_renew = mean(Renewable.Energy...., na.rm = TRUE)
  )
# raw observations in gray, country averages in red
ggplot() +
  geom_point(
    data = df_main,
    aes(x = Renewable.Energy...., y = CO2.Emissions..Tons.Capita.),
    alpha = 0.2, color = "gray"
  ) +
  geom_point(
    data = df_country_avg,
    aes(x = avg_renew, y = avg_co2),
    color = "red", size = 3
  ) +
  theme_minimal() +
  labs(
    title = "Raw Data vs Country-Averaged Data",
    subtitle = "Aggregation can artificially simplify relationships",
    x = "Renewable Energy (%)",
    y = "CO2 Emissions"
  )
Analysis
This visualization overlays raw observations with country-level averages. We can see that the gray points represent individual observations, while the red points represent averages aggregated by country.
Notice that:
The aggregated data, represented by the red dots, appears more structured and less noisy, with a much tighter spread than the surrounding data points.
The raw data, represented by the gray dots, shows substantially more variability, with points scattered across the graph.
This scenario gives rise to potential illusions:
Aggregation of data can make relationships appear artificially stronger and more stable than is likely the case in reality
Resultant data is often skewed as a result of a loss of information
Heterogeneity within groups is also lost due to the earlier loss of information (we lose granularity)
The way data is grouped or summarized can significantly alter the final interpretation and conclusion of the data
This is not to say that aggregation is always bad. By averaging values, we trade variability within a country for an emphasized display of differences between countries. However, if conclusions are based solely on aggregated data, they may overstate the strength of relationships and ignore important variability.
Thus, this demonstrates the importance of analyzing both raw and aggregated forms of data before moving to draw conclusions.
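One way to see the effect of aggregation numerically is to compute the same correlation twice, once on the raw rows and once on the group means. The helper below is a minimal sketch under that idea; the name raw_vs_group_cor is an assumption introduced for illustration.

```r
# Hypothetical helper: Pearson correlation on raw rows vs on group means.
raw_vs_group_cor <- function(df, x, y, group) {
  # tapply returns the means in the same group order for both variables
  mean_x <- tapply(df[[x]], df[[group]], mean)
  mean_y <- tapply(df[[y]], df[[group]], mean)

  c(
    raw = cor(df[[x]], df[[y]]),
    grouped = cor(mean_x, mean_y),
    n_groups = length(mean_x)
  )
}
```

For this report the call would be raw_vs_group_cor(df_main, "Renewable.Energy....", "CO2.Emissions..Tons.Capita.", "Country"). The grouped value rests on only n_groups points, so a large grouped correlation next to a small raw one is a warning sign rather than a finding.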
Number of Observations Collected
# number of rows contributed by each country
df_counts <- df_main |>
  group_by(Country) |>
  summarise(count = n())
# horizontal bar chart, ordered by number of observations
ggplot(df_counts, aes(x = reorder(Country, count), y = count)) +
  geom_col(fill = "orange") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Number of Observations per Country",
    subtitle = "Uneven sampling introduces bias",
    x = "Country",
    y = "Number of Observations"
  )
Analysis
This chart shows the number of observations collected for each country. Notice that the distribution is uneven, with some countries having noticeably more observations than others. Indonesia has the highest number of recorded observations, while Mexico has the fewest. This could be due to various reasons such as testing/tabulation quality, country-specific environmental policies, or dataset scope.
Regardless of the reason, this introduces some fundamental issues:
The dataset does not represent all countries equally
Countries with more observations have a greater impact on statistical models derived from the dataset
Countries with fewer observations have higher levels of inconsistency/uncertainty and are more susceptible to random variation. The confidence of the observed trends for these countries is also reduced greatly as a result
This directly impacts the validity of conclusions that can be drawn from the dataset. Statistical results may reflect the structure of the dataset rather than true global patterns, and thus be skewed such that they no longer represent reality. This also violates an implicit assumption often made in analysis: that all groups are equally represented.
As a result of this structural shortcoming, any global conclusion drawn from this dataset must be interpreted cautiously as findings may be biased toward overrepresented countries.
Again, this highlights the importance of scrutinizing the structure and scope of a dataset before applying statistical methods, as even if the statistical methods are carried out correctly, they may still not truly represent the real world trends.
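How uneven the sampling is can also be quantified rather than judged from the bar chart alone. The sketch below, an illustration rather than part of the original analysis, reports the ratio between the most and least represented groups and runs a chi-square goodness-of-fit test against perfectly equal representation. The helper name sampling_balance is an assumption.

```r
# Hypothetical helper: summarise how evenly rows are spread across groups.
sampling_balance <- function(df, group) {
  counts <- table(df[[group]])
  # A one-dimensional table makes chisq.test compare against equal proportions
  equal_test <- chisq.test(counts)

  list(
    counts = sort(counts, decreasing = TRUE),
    max_to_min = max(counts) / min(counts),
    p_value = equal_test$p.value
  )
}
```

Running sampling_balance(df_main, "Country") would show whether the differences between countries are larger than what random variation in row counts alone would produce. A large p-value would mean the imbalance is mild, although small per-country counts still limit what can be said about any single country.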
Population vs Renewable Energy
# scatterplot of population against renewable share, with a linear fit
ggplot(df_main, aes(x = Population, y = Renewable.Energy....)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Population vs Renewable Energy",
    subtitle = "A relationship that may not be meaningful",
    x = "Population",
    y = "Renewable Energy (%)"
  )
`geom_smooth()` using formula = 'y ~ x'
Analysis
This plot examines the relationship between population size and renewable energy usage. The graph shows that the relationship between population and renewable energy share is almost nonexistent: the fitted line is flat across the entire range of the data. Even if a trend did appear between the two variables, interpreting it would require caution.
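The flat line can be complemented with an explicit estimate of the correlation and its uncertainty. A minimal sketch using base R's cor.test follows; the wrapper name association_summary is an assumption introduced here.

```r
# Hypothetical helper: Pearson correlation with a 95% confidence interval.
association_summary <- function(df, x, y) {
  test <- cor.test(df[[x]], df[[y]]) # Pearson correlation by default

  c(
    r = unname(test$estimate),
    ci_low = test$conf.int[1],
    ci_high = test$conf.int[2],
    p_value = test$p.value
  )
}
```

Calling association_summary(df_main, "Population", "Renewable.Energy....") would give the estimate for this plot. A confidence interval that is narrow and centered near zero supports the reading of "no linear association" more firmly than the plot alone.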
Intuitively, we know that there is no direct mechanism linking population size to renewable energy percentage. Instead, both variables are likely influenced by other underlying factors such as:
Economic development
Geographic conditions
Government policy
This creates the risk of a spurious correlation where:
Two variables appear related, but are not connected
Even if there is the appearance of a relationship (in our example there is not), that relationship may have no real world meaning
Association does not imply causation and the perceived relationship may arise from hidden variables that were not taken into account when running the model or variables that were omitted from the dataset upon its creation
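The hidden-variable mechanism described above can be demonstrated with simulated data. Everything in the sketch below is invented for illustration and has no connection to the climate dataset: a simulated driver (named development here) influences two otherwise unrelated variables, which makes them appear correlated until the driver is accounted for.

```r
# Simulated illustration only: none of these values come from the dataset.
set.seed(1)
n <- 500
development <- rnorm(n)                    # hidden common driver
var_a <- development + rnorm(n, sd = 0.5)  # depends only on the driver
var_b <- development + rnorm(n, sd = 0.5)  # depends only on the driver

# Raw correlation: var_a and var_b look strongly related
raw_cor <- cor(var_a, var_b)

# Partial correlation: correlate what is left after removing the driver
partial_cor <- cor(
  residuals(lm(var_a ~ development)),
  residuals(lm(var_b ~ development))
)

print(c(raw = raw_cor, partial = partial_cor))
```

The raw correlation is large while the partial correlation falls close to zero, because the two variables have no link other than the shared driver. In a real dataset the driver is often unmeasured, so this adjustment cannot be made, which is why the interpretation has to stay cautious.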
This example reinforces the idea that not all relationships are meaningful. In our example, we can clearly see that there is no association, but even if there were one, we would need to be very careful with our interpretation of it, as the variables in question may not be directly related and may be driven by some hidden variable that is not within our current consideration.
This can lead to a bad situation where we end up concluding that a correlation exists between two entities that in reality does not, and thus our final conclusion will be very misleading and in certain cases quite harmful.
Again, this demonstrates the importance of relevant domain knowledge when determining which models and variables to investigate, and of understanding the underlying dataset structure so that hidden or implied variables can be accounted for. Without these considerations, it is easy to draw misleading conclusions from otherwise mathematically sound statistical results.
Conclusions
This analysis demonstrates that statistical modeling is not simply about applying tools, but about understanding when those tools are appropriate. We hope we have delivered a compelling argument that highlights the importance of ensuring that the tools we do apply to our analysis are the correct ones, and that the conclusions we derive from the use of those tools acknowledge the limits and scope of our dataset.
In our dataset we found that:
Uneven sample sizes across countries introduced bias risk
Evidence of a weak correlation between renewable energy and CO2 emissions (violates our intuition)
Regression model explains almost none of the variability (extremely low R²)
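The slope and R² referred to in these findings can be extracted from a fitted model as in the sketch below. This is an illustrative helper (the name summarise_linear_fit is an assumption), not a record of how the figures were originally obtained.

```r
# Hypothetical helper: slope, its 95% confidence interval, and R-squared.
summarise_linear_fit <- function(df, response, predictor) {
  fit <- lm(reformulate(predictor, response = response), data = df)
  slope_ci <- confint(fit)[predictor, ]

  c(
    slope = unname(coef(fit)[predictor]),
    ci_low = unname(slope_ci[1]),
    ci_high = unname(slope_ci[2]),
    r_squared = summary(fit)$r.squared
  )
}
```

For the renewable energy model the call would be summarise_linear_fit(df_main, "CO2.Emissions..Tons.Capita.", "Renewable.Energy...."). An R² near zero together with a slope interval that includes zero is what "explains almost none of the variability" looks like in numbers.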
Our approach of conducting varying tests and analyses from various angles on the dataset has allowed us to discover many structural and scope limitations. It is imperative to understand that scrutinizing the scope, structure, and limitations of any dataset is just as important as the results derived from it. In the case of our climate dataset we know that:
Data can be technically correct/mathematically sound but contextually misleading
Statistical rigor is essential to avoid misinformation
Our conclusions must be constrained by:
sample size
data structure
missing variables
Actionable Conclusions with Recommendations
Based on the analysis conducted on our dataset, the following recommendations are made. Specific recommendations will vary for every dataset, but these recommendations are general enough to be easily modified to apply to most common datasets:
1. Improve Data Collection
Ensure more balanced sampling across countries
Increase longitudinal depth (track changes over time)
2. Include Additional Variables
The current dataset omits critical variables such as industrial output, population density, energy policy, and economic development level
3. Avoid Overreliance on Simple Models
Linear regression is insufficient for complex environmental systems
Consider:
multivariate regression
time-series models
causal inference methods
4. Always Validate Model Assumptions
Before interpreting results, check residuals, evaluate fit quality, and question appropriateness of the model.
5. Communicate Uncertainty Clearly
As climatologists, people will view you as the symbol of authority on matters relating to climate change and global warming. It is better to present a cautious, limited conclusion than a confident but misleading one. It is also very important to disclose any uncertainties or gaps in data.
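As a starting point for the multivariate regression mentioned under recommendation 3, the sketch below fits a model with several predictors and computes variance inflation factors (VIFs) by hand as a multicollinearity check. The setup chunk loads the performance package for that purpose; the base-R calculation here is a swapped-in alternative so the example has no package dependencies. The helper name fit_with_vif is an assumption, and the sketch expects at least two numeric predictors.

```r
# Hypothetical helper: multiple regression plus hand-computed VIFs.
fit_with_vif <- function(df, response, predictors) {
  fit <- lm(reformulate(predictors, response = response), data = df)

  # VIF for a predictor: 1 / (1 - R^2), where R^2 comes from regressing
  # that predictor on all of the other predictors
  vif <- sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  })

  list(model = fit, vif = vif)
}
```

An example call on this dataset would be fit_with_vif(df_main, "CO2.Emissions..Tons.Capita.", c("Renewable.Energy....", "Population", "Forest.Area....")). VIF values close to 1 indicate that a predictor shares little information with the others; large values mean the individual coefficients are hard to interpret separately.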
It is best practice to treat statistical results as evidence under a set of given constraints, rather than as absolute truth. This mindset is fundamental for maintaining scientific integrity in all fields of science, including climate research.