Stats Final Project Draft 1

Author

Mohsin Khan

Target Audience

  • The target audience for this work is climatologists, paleoclimatologists, and professionals in adjacent atmospheric science or meteorology fields who may not have a background in statistics or data science.

Background Info

  • Throughout history, whether by horseback across formidable mountain ranges and arid deserts, or by vessel contending with the ferocity of the restless seas, transmitting a message across the globe was a monumental undertaking. It is difficult to argue that any inherent benefit existed in the arduous trials of that era. However, the advent of modern technology has brought a different challenge: a relentless surge of misinformation and propaganda that saturates our collective consciousness. In this modern “storm” of disregard for objective truth, we must anchor ourselves in the fundamental principles of scrutiny and rigorous critical thinking. No discipline demands this more than the field of science. It is imperative that we scrutinize every facet of our own findings and the claims of others before arriving at our conclusion. As the public looks to us for guidance amidst this noise, failing to uphold these scientific standards risks misleading far more than just ourselves.
  • A primary vehicle for the dissemination of such misinformation is the manipulation of statistics and numbers. Within the context of climate change, this practice is particularly pervasive among those who wish to deceive. Consequently, it becomes the climatologist’s moral obligation to leverage their deep understanding of our Earth and its complex behaviors and combine it with robust statistical and data science techniques to find and present the truth hidden within the numbers. We must recognize that, whether intentional or not, inherent biases and inconsistent collection methodologies can render a dataset an unreliable foundation for comprehensive conclusions. Even a dataset free from such malice or bias may remain restricted in scope. Therefore, any extrapolations we draw must be carefully calibrated to reflect those specific limitations.
  • It is in this light that we focus on our core objective: understanding the nature, structure, and constraints of a dataset. We aim to explore how we can accurately determine these boundaries and, crucially, how to constrain our conclusions so that they remain in strict accordance with the data’s actual scope.

Problem Statement

  • The core objective is to evaluate the structural framework of the dataset while acknowledging its constraints. This understanding ensures that any subsequent analysis remains grounded in the data’s reality when the dataset is approached from different angles, and it allows for critique and scrutiny of any results obtained.

  • Such an approach is realistic and pragmatic, since most real-world datasets contain some bias, whether intentional or unintentional.

  • This concern is especially pressing for a topic as important as the health and well-being of our planet, which will define the lives of generations to come.


Initial Exploratory Data Analysis (EDA)

Climate Dataset:

# load the packages used throughout the analysis
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pwr)
library(broom)
library(readr)
# note: package loading has failed intermittently here (cause not yet identified)
library(zoo)

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
# performance is used for the multicollinearity check since the car package is not working
library(performance)
library(knitr)

df_main <- read.csv("climate_change_dataset.csv")

#output clean table to show overall structure of the dataset
df_summary <- data.frame(
  Column = colnames(df_main),
  Type = sapply(df_main, class),
  Missing_Values = sapply(df_main, function(x) sum(is.na(x)))
)
kable(df_summary, caption = "Dataset Structure Overview")
Dataset Structure Overview

Column                        Type       Missing_Values
Year                          integer    0
Country                       character  0
Avg.Temperature...C.          numeric    0
CO2.Emissions..Tons.Capita.   numeric    0
Sea.Level.Rise..mm.           numeric    0
Rainfall..mm.                 integer    0
Population                    integer    0
Renewable.Energy....          numeric    0
Extreme.Weather.Events        integer    0
Forest.Area....               numeric    0
# summary statistics
summary(df_main)
      Year        Country          Avg.Temperature...C.
 Min.   :2000   Length:1000        Min.   : 5.00       
 1st Qu.:2005   Class :character   1st Qu.:12.18       
 Median :2012   Mode  :character   Median :20.10       
 Mean   :2011                      Mean   :19.88       
 3rd Qu.:2018                      3rd Qu.:27.23       
 Max.   :2023                      Max.   :34.90       
 CO2.Emissions..Tons.Capita. Sea.Level.Rise..mm. Rainfall..mm. 
 Min.   : 0.500              Min.   :1.00        Min.   : 501  
 1st Qu.: 5.575              1st Qu.:2.00        1st Qu.:1099  
 Median :10.700              Median :3.00        Median :1726  
 Mean   :10.426              Mean   :3.01        Mean   :1739  
 3rd Qu.:15.400              3rd Qu.:4.00        3rd Qu.:2362  
 Max.   :20.000              Max.   :5.00        Max.   :2999  
   Population        Renewable.Energy.... Extreme.Weather.Events
 Min.   :3.661e+06   Min.   : 5.10        Min.   : 0.000        
 1st Qu.:3.436e+08   1st Qu.:16.10        1st Qu.: 3.000        
 Median :7.131e+08   Median :27.15        Median : 8.000        
 Mean   :7.054e+08   Mean   :27.30        Mean   : 7.291        
 3rd Qu.:1.074e+09   3rd Qu.:38.92        3rd Qu.:11.000        
 Max.   :1.397e+09   Max.   :50.00        Max.   :14.000        
 Forest.Area....
 Min.   :10.10  
 1st Qu.:25.60  
 Median :41.15  
 Mean   :40.57  
 3rd Qu.:55.80  
 Max.   :70.00  
# missing values per column
df_main |> 
  summarise(across(everything(), ~ sum(is.na(.))))
  Year Country Avg.Temperature...C. CO2.Emissions..Tons.Capita.
1    0       0                    0                           0
  Sea.Level.Rise..mm. Rainfall..mm. Population Renewable.Energy....
1                   0             0          0                    0
  Extreme.Weather.Events Forest.Area....
1                      0               0

This table provides a structural overview of the dataset, including variable types and missing values.

Several important observations can immediately be made:

  • The dataset contains a mix of numerical and categorical variables
  • No missing values (NA) are reported, but we must remain cautious, since missing data can also be encoded as placeholder values or empty strings
  • Variables represent aggregated national-level indicators rather than controlled experimental measurements

This reinforces a key principle of what we seek to learn with this exploration:

  • Understanding the structure of a dataset is a prerequisite to any valid statistical analysis.

  • Failure to understand the dataset’s structure and limitations may result in applying inappropriate methods or drawing conclusions that exceed the dataset’s scope.
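
The caution about missing values can be made concrete. The sketch below uses a small made-up data frame (`df_demo`, not the climate dataset) and a hypothetical helper, `count_disguised_na()`, to count entries that are not `NA` but may still stand in for missing data. The list of sentinel codes is an assumption and would need to be adapted to how a given dataset was compiled.

```r
# Hypothetical helper: count values that may encode missingness without being NA.
# The default sentinel codes are assumptions, not properties of the climate dataset.
count_disguised_na <- function(x, sentinels = c(-999, -99, 9999)) {
  if (is.character(x)) {
    sum(!is.na(x) & trimws(x) %in% c("", "NA", "N/A", "null"))
  } else if (is.numeric(x)) {
    sum(!is.na(x) & x %in% sentinels)
  } else {
    0L
  }
}

# Made-up example data, for illustration only
df_demo <- data.frame(
  Country  = c("A", "B", "", "N/A"),
  Rainfall = c(1200, -999, 1500, 900),
  stringsAsFactors = FALSE
)

# One count per column, in the same spirit as the is.na() check above
disguised <- sapply(df_demo, count_disguised_na)
print(disguised)
```

A count of zero from `is.na()` together with a nonzero count here would suggest that missing data were recoded before the dataset was published.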


Assumptions Made and Why?

In order to conduct statistical analysis on this dataset, several assumptions must be made. These assumptions are not guarantees of truth, but rather controlled simplifications that allow us to apply statistical tools responsibly. It is important to note that more granular assumptions may have been made at various points of the dataset’s creation (such as tabulation, methods, etc.), but we cannot possibly know every one of them. For simplicity, we constrain ourselves to the assumptions we can make now:

1. Independence of Observations

  • We assume that each row (country-year observation) is independent
  • In reality, this may not strictly hold, as geopolitical, economic, and environmental relationships can introduce dependence between countries

Corrective Action: We can interpret results as associative and avoid overgeneralizing relationships across countries


2. Linearity (for regression models)

  • We assume a linear relationship between predictors (e.g., renewable energy %) and response (CO2 emissions)

Corrective Action: We visually inspect scatterplots before modeling and validate this assumption using residual diagnostics
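
One possible way to check linearity numerically, in addition to the visual inspection, is to compare a straight-line fit against a fit with an added quadratic term. The sketch below uses simulated data (the variables `x` and `y` are made up and curved by construction), so it only illustrates the mechanics of the comparison and says nothing about the climate dataset.

```r
set.seed(1)

# Simulated data for illustration only: y is curved in x by construction
x <- runif(200, 5, 50)
y <- 0.02 * (x - 27)^2 + rnorm(200, sd = 1)

fit_linear <- lm(y ~ x)            # straight-line model
fit_quad   <- lm(y ~ x + I(x^2))   # same model plus a curvature term

# Nested-model F-test: does the quadratic term improve on the straight line?
cmp <- anova(fit_linear, fit_quad)
print(cmp)

# A small p-value suggests the linearity assumption is questionable
p_curvature <- cmp[["Pr(>F)"]][2]
```

A large p-value would not prove linearity; it would only mean that this particular form of curvature was not detected.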


3. Homoscedasticity (constant variance of errors)

  • We assume that the variance of residuals is constant across all levels of the predictor

Corrective Action: Residual vs fitted plots are used to detect violations (may not be added yet)
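
Since the residuals-vs-fitted plot may not be in this draft yet, the following is a minimal sketch of what that check could look like. It uses simulated data whose error spread grows with the predictor by construction, and it adds a hand-computed studentized Breusch-Pagan-style statistic (n times the R² of squared residuals regressed on the predictor). All variable names are illustrative assumptions.

```r
set.seed(2)
n <- 300

# Simulated data for illustration only: error spread grows with x by construction
x <- runif(n, 5, 50)
y <- 10 + 0.1 * x + rnorm(n, sd = 0.1 * x)

fit <- lm(y ~ x)

# Visual check: a funnel shape indicates non-constant variance
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Numerical check: regress squared residuals on the predictor;
# n * R^2 is compared against a chi-squared distribution with 1 degree of freedom
aux     <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```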


4. Normality of residuals

  • We assume residuals follow a normal distribution (important for inference)

Corrective Action: QQ plots are used to evaluate this assumption (may not be added yet)
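
A minimal sketch of the QQ-plot check, again on simulated data rather than the climate dataset: the errors below are right-skewed by construction, so both the plot and the accompanying Shapiro-Wilk test should flag a departure from normality. The variable names are illustrative assumptions.

```r
set.seed(8)
n <- 200

# Simulated data for illustration only: errors are right-skewed by construction
x <- runif(n, 5, 50)
y <- 10 + 0.1 * x + (rexp(n, rate = 1) - 1)

fit <- lm(y ~ x)
res <- resid(fit)

# Visual check: points bending away from the reference line indicate non-normality
qqnorm(res)
qqline(res)

# Formal check: Shapiro-Wilk test on the residuals
sw <- shapiro.test(res)
print(sw)
```

With large samples, formal tests can reject normality for departures too small to matter, so the plot and the test are best read together.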


5. Awareness of Scope

  • The dataset represents a snapshot across countries and years, and should not be considered a controlled experiment
  • Many relevant variables (policy, industrialization, population density, etc.) are not included

Corrective Action: All conclusions are framed strictly within the scope of observed variables. We will avoid causal claims and instead focus on statistical relationships and limitations. Our objective is to highlight the limitations that are imposed by dataset scope and structure. We also show how this impacts our approach to model selection before running statistical tests.


Analyses and Support

  • This section focuses on extracting meaningful patterns while continuously validating whether the chosen statistical tools are appropriate for the dataset structure. We will probe the dataset from various angles using differing tests to see what conclusions we can draw. From that, we will scrutinize our conclusions and attempt to give reasoning behind their validity with respect to the test used and the scope of the dataset.

  • We will demonstrate this by attempting analysis, validating assumptions, and determining if a method is not appropriate.


Renewable Energy vs CO2 Emissions

ggplot(df_main, aes(x = Renewable.Energy...., y = CO2.Emissions..Tons.Capita.)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  theme_minimal() +
  labs(
    title = "CO2 vs Renewable Energy: Linear vs Nonlinear Fit",
    subtitle = "Different models suggest different relationships",
    x = "Renewable Energy (%)",
    y = "CO2 Emissions"
  )
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Analysis

This visualization compares two different modeling approaches applied to the same data. We use a linear regression model and a nonparametric “Locally Estimated Scatterplot Smoothing” (LOESS) curve.

At first glance, both lines attempt to capture the relationship between renewable energy usage and CO2 emissions. However, upon closer inspection, the two models diverge in shape around the beginning and middle portions of the data and converge more tightly near the end.

This highlights a few important things to note:

  • The trend observed in a dataset is not absolute and rather depends on the model used to estimate it.

  • Our intuition or “gut feeling” can sometimes guide us (incorrectly) to look for information we expect, which can shape the tests we choose to run and influence the conclusions we draw.

  • We would expect CO2 emission levels to drop as renewable energy percentages increase. However, this is not what we see: the fitted relationship is roughly constant.

The linear model imposes a strict assumption that the relationship is linear across the entire range of the sample. LOESS, on the other hand, fits the data locally and is therefore less restrictive in its assumptions.

If these two models suggest different patterns, it indicates that:

  • The relationship is weak or unstable
  • A single global model may not be appropriate

In our case, the patterns are close but still different enough that we cannot confidently say there is a strong relationship between renewable energy usage and a drop in CO2 emission levels, at least on the basis of the models we fit. It is still possible that a single global model is too coarse to capture some fine-grained relationship that exists but is hidden here.

Thus, without validating our assumptions, differing models can lead to different and potentially conflicting results, which may taint our interpretation of the data and, by extension, the conclusions we draw.
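
The comparison between the two fits can also be put into numbers. The sketch below is an illustration on simulated data, not the climate dataset: `renew` and `co2` are made-up variables that are independent by construction, with ranges loosely mimicking the summary table. It computes the R² of the linear model and the largest gap between the linear and LOESS predictions.

```r
set.seed(3)
n <- 500

# Simulated data for illustration only: co2 does not depend on renew
renew <- runif(n, 5, 50)
co2   <- runif(n, 0.5, 20)

fit_lm <- lm(co2 ~ renew)      # global straight-line fit
fit_lo <- loess(co2 ~ renew)   # local, flexible fit

# Evaluate both fits on a common grid inside the observed range
grid    <- data.frame(renew = seq(min(renew), max(renew), length.out = 100))
pred_lm <- predict(fit_lm, newdata = grid)
pred_lo <- predict(fit_lo, newdata = grid)

max_gap <- max(abs(pred_lm - pred_lo))   # largest disagreement between the fits
r2      <- summary(fit_lm)$r.squared     # share of variance the line explains
print(c(max_gap = max_gap, r_squared = r2))
```

An R² near zero combined with a small gap between the two curves is what a flat, weak relationship would look like under both models.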


Raw Data vs Country-Averaged Data

# aggregated version
df_country_avg <- df_main |>
  group_by(Country) |>
  summarise(
    avg_co2 = mean(CO2.Emissions..Tons.Capita., na.rm = TRUE),
    avg_renew = mean(Renewable.Energy...., na.rm = TRUE)
  )

ggplot() +
  geom_point(data = df_main,
             aes(x = Renewable.Energy...., y = CO2.Emissions..Tons.Capita.),
             alpha = 0.2, color = "gray") +
  geom_point(data = df_country_avg,
             aes(x = avg_renew, y = avg_co2),
             color = "red", size = 3) +
  theme_minimal() +
  labs(
    title = "Raw Data vs Country-Averaged Data",
    subtitle = "Aggregation can artificially simplify relationships",
    x = "Renewable Energy (%)",
    y = "CO2 Emissions"
  )

Analysis

This visualization overlays raw observations with country-level averages. We can see that the gray points represent individual observations, while the red points represent averages aggregated by country.

Notice that:

  • The aggregated data, represented by the red dots, appear more structured and less noisy, with a tight spread relative to the surrounding data points.
  • The raw data, represented by the gray dots, show substantially more variability, as the points are scattered across the graph.

This scenario gives rise to potential illusions:

  • Aggregation of data can make relationships appear artificially stronger and more stable than they likely are in reality

  • The resulting data can be skewed by the loss of information

  • Heterogeneity within groups is also lost due to the earlier loss of information (we lose granularity)

  • The way data is grouped or summarized can significantly alter the final interpretation and conclusion of the data

This is not to say that aggregation is always bad. By averaging values, we trade variability within a country for an emphasized display of differences between countries. However, if conclusions are based solely on aggregated data, they may overstate the strength of relationships and ignore important variability.

Thus, this demonstrates the importance of analyzing both raw and aggregated forms of data before moving to draw conclusions.
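
The effect of aggregation on apparent strength can be illustrated with a small simulation. The sketch below uses made-up groups (standing in for countries) with a weak group-level trend and large within-group noise, then compares the correlation computed on raw rows with the correlation computed on group means. All names and parameter values are assumptions chosen for illustration.

```r
set.seed(4)
n_groups <- 15
n_per    <- 60

# Simulated data for illustration only
group   <- rep(seq_len(n_groups), each = n_per)
group_x <- runif(n_groups, 10, 40)   # group-level predictor value

# Weak group-level slope, large row-level noise
x <- group_x[group] + rnorm(n_groups * n_per, sd = 10)
y <- 15 - 0.1 * group_x[group] + rnorm(n_groups * n_per, sd = 5)
sim <- data.frame(group, x, y)

# Correlation on the raw rows
cor_raw <- cor(sim$x, sim$y)

# Correlation on group means (the aggregated view)
agg     <- aggregate(cbind(x, y) ~ group, data = sim, FUN = mean)
cor_agg <- cor(agg$x, agg$y)

print(c(raw = cor_raw, aggregated = cor_agg))
```

Averaging removes most of the row-level noise, so the same underlying trend looks much stronger after aggregation than it does in the raw observations.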


Number of Observations Collected

df_counts <- df_main |>
  group_by(Country) |>
  summarise(count = n())

ggplot(df_counts, aes(x = reorder(Country, count), y = count)) +
  geom_col(fill = "orange") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Number of Observations per Country",
    subtitle = "Uneven sampling introduces bias",
    x = "Country",
    y = "Number of Observations"
  )

Analysis

This chart shows the number of observations collected for each country. Notice that the distribution is uneven, with some countries having significantly more observations than others. Indonesia has the highest number of recorded observations, while Mexico has the fewest. This could be due to various reasons such as testing/tabulation quality, country-specific environmental policies, or dataset scope.

Regardless of the reason, this introduces some fundamental issues:

  • The dataset does not represent all countries equally

  • Countries with more observations have a greater impact on statistical models derived from the dataset

  • Countries with fewer observations have higher uncertainty and are more susceptible to random variation, so confidence in the observed trends for these countries is greatly reduced

This directly impacts the validity of conclusions that can be drawn from the dataset. Statistical results may reflect the structure of the dataset rather than true global patterns, and may be skewed to the point that they no longer represent reality. This also violates an implicit assumption often made in analysis: that all groups are equally represented.

As a result of this structural shortcoming, any global conclusion drawn from this dataset must be interpreted cautiously as findings may be biased toward overrepresented countries.

Again, this highlights the importance of scrutinizing the structure and scope of a dataset before applying statistical methods: even when the methods are carried out correctly, the results may still not represent real-world trends.
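
To make the link between group size and uncertainty concrete, the sketch below uses three made-up countries with deliberately unequal sample sizes. It reports the standard error of each group mean (standard deviation divided by the square root of the count) and runs a chi-squared goodness-of-fit test against equal representation. The data and names are simulated assumptions, not values from the climate dataset.

```r
set.seed(5)

# Simulated data for illustration only: three groups with unequal sizes
sizes <- c(A = 80, B = 40, C = 10)
sim <- data.frame(
  country = rep(names(sizes), times = sizes),
  co2     = rnorm(sum(sizes), mean = 10, sd = 5)
)

# Per-group count, standard deviation, and standard error of the mean
n_by  <- tapply(sim$co2, sim$country, length)
sd_by <- tapply(sim$co2, sim$country, sd)
se_by <- sd_by / sqrt(n_by)
print(data.frame(n = n_by, se = round(se_by, 2)))

# Are the groups represented equally? Chi-squared test against uniform counts
gof <- chisq.test(table(sim$country))
print(gof)
```

Because the standard error scales with one over the square root of the count, estimates for sparsely sampled groups carry wider uncertainty even when the underlying variability is the same.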


Population vs Renewable Energy

ggplot(df_main, aes(x = Population, y = Renewable.Energy....)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Population vs Renewable Energy",
    subtitle = "A relationship that may not be meaningful",
    x = "Population",
    y = "Renewable Energy (%)"
  )
`geom_smooth()` using formula = 'y ~ x'

Analysis

This plot examines the relationship between population size and renewable energy usage. The graph shows that the relationship between the two is almost nonexistent: the fitted line is essentially flat across the entire range of the data. Even if a trend did appear between the two variables, interpreting the relationship would require caution.

Intuitively, we would not expect a direct mechanism linking population size to renewable energy percentage. Instead, both variables are likely influenced by other underlying factors such as:

  • Economic development
  • Geographic conditions
  • Government policy

This creates the risk of a spurious correlation where:

  • Two variables appear related, but are not causally connected

  • Even if there is the appearance of a relationship (in our example there is not), that relationship may have no real world meaning

  • Association does not imply causation and the perceived relationship may arise from hidden variables that were not taken into account when running the model or variables that were omitted from the dataset upon its creation

This example reinforces the idea that not all relationships are meaningful. In our example, we can clearly see that there is no association, but even if there were one, we must be very careful with our interpretation of it, as the variables in question may not be directly related and may be impacted by some hidden variable that is not within our current consideration.

This can lead to a bad situation where we end up concluding that a correlation exists between two entities that in reality does not, and thus our final conclusion will be very misleading and in certain cases quite harmful.

Again, this demonstrates the importance of relevant domain knowledge when determining which models and variables to investigate, and of understanding the underlying dataset structure so that hidden or implied variables can be accounted for. Without these considerations, it is easy to draw misleading conclusions from otherwise valid, mathematically sound statistical results.
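
The hidden-variable scenario can be demonstrated with a short simulation. In the sketch below, a made-up variable `z` (a stand-in for something like economic development) drives both `x` and `y`, which have no direct link to each other. The raw correlation is strong, while the partial correlation, computed by correlating the residuals after regressing each variable on `z`, is close to zero. All variables are simulated assumptions.

```r
set.seed(6)
n <- 500

# Simulated data for illustration only
z <- rnorm(n)               # hidden common driver
x <- 2 * z + rnorm(n)       # depends on z only
y <- -1.5 * z + rnorm(n)    # depends on z only, never on x

# Raw correlation: looks like a strong x-y relationship
cor_xy <- cor(x, y)

# Partial correlation given z: remove z's contribution from both, then correlate
partial_xy <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

print(c(raw = cor_xy, partial_given_z = partial_xy))
```

This adjustment is only possible when the common driver is actually recorded, which is why omitted variables limit what can be concluded from a dataset.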


Conclusions

This analysis demonstrates that statistical modeling is not simply about applying tools, but about understanding when those tools are appropriate. We hope we have delivered a compelling argument that highlights the importance of ensuring that the tools we do apply to our analysis are the correct ones, and that the conclusions we derive from the use of those tools acknowledge the limits and scope of our dataset.

In our dataset we found that:

  • Uneven sample sizes across countries introduce a risk of bias
  • There is at most a weak correlation between renewable energy and CO2 emissions (contrary to our intuition)
  • The regression model explains almost none of the variability (extremely low R²)

Our approach of conducting varied tests and analyses on the dataset from multiple angles has allowed us to discover many structural and scope limitations. It is imperative to understand that scrutinizing the scope, structure, and limitations of any dataset is just as important as the results derived from it. In the case of our climate dataset we know that:

  • Data can be technically correct/mathematically sound but contextually misleading
  • Statistical rigor is essential to avoid misinformation
  • Our conclusions must be constrained by:
    • sample size
    • data structure
    • missing variables

Actionable Conclusions with Recommendations

Based on the analysis conducted on our dataset, the following recommendations are made. Specific recommendations will vary for every dataset, but these recommendations are general enough to be easily modified to apply to most common datasets:


1. Improve Data Collection

  • Ensure more balanced sampling across countries
  • Increase longitudinal depth (track changes over time)

2. Include Additional Variables

The current dataset omits critical variables such as industrial output, population density, energy policy, and economic development level.


3. Avoid Overreliance on Simple Models

  • Simple linear regression is often insufficient for complex environmental systems
  • Consider:
    • multivariate regression
    • time-series models
    • causal inference methods
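
As one example of moving beyond a single predictor, the sketch below fits a multivariate regression on simulated data and computes a variance inflation factor (VIF) for each predictor by hand, as 1 / (1 - R²) from regressing that predictor on the others. The variables (`renew`, `forest`, `rain`, `co2`) are made up, and `rain` is correlated with `forest` by construction to show what collinearity looks like.

```r
set.seed(7)
n <- 300

# Simulated data for illustration only
renew  <- runif(n, 5, 50)
forest <- runif(n, 10, 70)
rain   <- 500 + 20 * forest + rnorm(n, sd = 300)   # tied to forest by construction
co2    <- 12 - 0.05 * renew + rnorm(n, sd = 4)

# Multivariate regression with three predictors
fit_multi <- lm(co2 ~ renew + forest + rain)
print(summary(fit_multi)$coefficients)

# Manual VIF: regress each predictor on the remaining ones
vif_manual <- c(
  renew  = 1 / (1 - summary(lm(renew ~ forest + rain))$r.squared),
  forest = 1 / (1 - summary(lm(forest ~ renew + rain))$r.squared),
  rain   = 1 / (1 - summary(lm(rain ~ renew + forest))$r.squared)
)
print(round(vif_manual, 2))
```

A VIF close to 1 indicates a predictor that is nearly unrelated to the others, while larger values indicate overlap that inflates the uncertainty of the corresponding coefficient.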

4. Always Validate Model Assumptions

Before interpreting results, check residuals, evaluate fit quality, and question appropriateness of the model.


5. Communicate Uncertainty Clearly

As a climatologist, you will be viewed by the public as an authority on matters relating to climate change and global warming. It is better to present a cautious, limited conclusion than a confident but misleading one. It is also very important to disclose any uncertainties or gaps in the data.

It is best practice to treat statistical results as evidence under a given set of constraints, rather than as absolute truth. This mindset is fundamental for maintaining scientific integrity in all fields of science, including climate research.


PDF Embed


Video Recording