This data dive studies my climate change dataset to evaluate the risks of undocumented or poorly documented data. As mentioned in Leon’s lecture, understanding data is crucial for making valid inferences: the more you know about your data, the better your inferences will be.
library(tidyverse)  # dplyr, tidyr, and ggplot2 are used throughout
library(scales)     # comma() for axis labels
climate_data <- read.csv("C:/Users/IU Student/Downloads/climate_change_dataset.csv")
str(climate_data)
## 'data.frame': 1000 obs. of 10 variables:
## $ Year : int 2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
## $ Country : chr "UK" "USA" "France" "Argentina" ...
## $ Avg.Temperature...C. : num 8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
## $ CO2.Emissions..Tons.Capita.: num 9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
## $ Sea.Level.Rise..mm. : num 3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
## $ Rainfall..mm. : int 1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
## $ Population : int 530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
## $ Renewable.Energy.... : num 20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
## $ Extreme.Weather.Events : int 14 8 9 7 4 12 10 1 4 5 ...
## $ Forest.Area.... : num 59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
Explanation of what this code does: First I load the climate change dataset into R and examine its structure. This shows the column names and data types and gives a first glimpse of what we’re working with. The str() output confirms whether numeric variables are actually stored as numbers and text variables as text.
When I first looked at this column, I wasn’t sure if “Extreme Weather Events” represented:

- A count of events (14 hurricanes, 8 floods, etc.)
- A severity scale (0 = none, 14 = catastrophic)
- A categorical ranking system
climate_data %>%
group_by(Extreme.Weather.Events) %>%
summarise(count = n()) %>%
arrange(Extreme.Weather.Events)
## # A tibble: 15 × 2
## Extreme.Weather.Events count
## <int> <int>
## 1 0 74
## 2 1 62
## 3 2 58
## 4 3 67
## 5 4 50
## 6 5 63
## 7 6 52
## 8 7 57
## 9 8 83
## 10 9 76
## 11 10 68
## 12 11 61
## 13 12 71
## 14 13 77
## 15 14 81
Explanation of what this code does: Here I group the data by the Extreme Weather Events values and count how many times each value appears.
Why this matters: If this is a count, then 0 is perfectly reasonable (no extreme events that year). If it’s a severity scale, 0 might mean data not collected rather than no events. Without documentation clarifying this is a COUNT of discrete events, I might have incorrectly treated zeros as missing data and removed them from analysis. This would have changed any correlation analysis between climate factors and extreme weather frequency.
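To make that risk concrete, here is a minimal sketch (assuming the count interpretation, and using the column names printed by str() above; the object names are just for illustration) comparing a correlation computed on the full data with one computed after wrongly dropping the 74 zero-event rows as if they were missing:

# Correlation between CO2 emissions and extreme weather events using all rows
cor_all <- with(climate_data,
                cor(CO2.Emissions..Tons.Capita., Extreme.Weather.Events))

# The same correlation after (incorrectly) treating 0 events as missing data
cor_no_zeros <- with(subset(climate_data, Extreme.Weather.Events > 0),
                     cor(CO2.Emissions..Tons.Capita., Extreme.Weather.Events))

c(all_rows = cor_all, zeros_dropped = cor_no_zeros)

The exact values aren’t the point; the point is that the two estimates can differ, and only documentation tells me which one is valid.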
The Population column presents values in the hundreds of millions without any indication of:

- Units (is this in thousands? millions? an actual count?)
- Measurement timing (census data? estimate? projection?)
- Geographic scope (metropolitan area? entire country?)
climate_data %>%
group_by(Country) %>%
summarise(
min_pop = min(Population),
max_pop = max(Population),
pop_range = max(Population) - min(Population)
) %>%
arrange(desc(pop_range))
## # A tibble: 15 × 4
## Country min_pop max_pop pop_range
## <chr> <int> <int> <int>
## 1 Russia 11186886 1397016073 1385829187
## 2 Brazil 9355425 1393981934 1384626509
## 3 Japan 5467801 1385374585 1379906784
## 4 UK 20916998 1387457528 1366540530
## 5 Mexico 22094509 1388289771 1366195262
## 6 Australia 26451115 1388230143 1361779028
## 7 Germany 16901724 1377569278 1360667554
## 8 India 9918562 1366390185 1356471623
## 9 France 24043592 1379671819 1355628227
## 10 China 3660891 1358197397 1354536506
## 11 Canada 45925925 1395185778 1349259853
## 12 Indonesia 10978954 1358606331 1347627377
## 13 USA 41988408 1380798693 1338810285
## 14 South Africa 46184187 1383289354 1337105167
## 15 Argentina 46153952 1356785799 1310631847
Explanation of what this code does: For each country in my dataset, I will calculate the minimum population, maximum population, and the difference between them. Then I sort by the range to see which countries show the most variation. This helps spot potential data quality issues.
Why this matters: Looking at the UK with populations ranging from ~21 million to ~1.4 billion raises immediate red flags. The actual UK population is around 67 million. If I assumed these were real population figures and calculated per-capita emissions or resource usage, my conclusions would be completely wrong. Documentation should specify whether these are synthetic values, scaled values, or the result of measurement errors.
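As a quick illustration, here is a minimal sketch (assuming the CO2 column really is tons per capita and Population is a head count, which is exactly what is in doubt; Implied_Total_Emissions_Mt is a name I made up for this check) of what a naive total-emissions calculation would imply for the UK:

# Implied total emissions (million tons) if Population were taken at face value;
# for a country of ~67 million people these totals would be absurd
climate_data %>%
  filter(Country == "UK") %>%
  mutate(Implied_Total_Emissions_Mt =
           CO2.Emissions..Tons.Capita. * Population / 1e6) %>%
  select(Year, Population, Implied_Total_Emissions_Mt) %>%
  arrange(desc(Implied_Total_Emissions_Mt)) %>%
  head()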
Even if documentation exists for this dataset, the population values appear to be fundamentally inconsistent with reality. Let me show why this is concerning:
actual_populations <- data.frame(
Country = c("UK", "USA", "China", "India", "Brazil", "Germany",
"France", "Canada", "Australia", "Mexico", "Russia",
"South Africa", "Argentina", "Indonesia", "Japan"),
Actual_Pop_Millions = c(67, 331, 1439, 1380, 213, 83,
65, 38, 26, 129, 146,
60, 45, 274, 126)
)
climate_data %>%
filter(Year == 2020) %>%
select(Country, Population) %>%
mutate(Dataset_Pop_Millions = Population / 1000000) %>%
left_join(actual_populations, by = "Country") %>%
mutate(Difference_Millions = Dataset_Pop_Millions - Actual_Pop_Millions) %>%
arrange(desc(abs(Difference_Millions)))
## Country Population Dataset_Pop_Millions Actual_Pop_Millions
## 1 Australia 1332348555 1332.34856 26
## 2 Argentina 1225881222 1225.88122 45
## 3 Brazil 1391047250 1391.04725 213
## 4 South Africa 1142896221 1142.89622 60
## 5 Australia 1103288537 1103.28854 26
## 6 Brazil 1246729629 1246.72963 213
## 7 Russia 1128364538 1128.36454 146
## 8 Australia 977900597 977.90060 26
## 9 China 491426196 491.42620 1439
## 10 Brazil 1106863791 1106.86379 213
## 11 France 943264951 943.26495 65
## 12 USA 1199856255 1199.85625 331
## 13 China 573451526 573.45153 1439
## 14 Brazil 1042719729 1042.71973 213
## 15 Argentina 817572573 817.57257 45
## 16 South Africa 805165392 805.16539 60
## 17 USA 1072790036 1072.79004 331
## 18 Brazil 937363773 937.36377 213
## 19 Russia 863711591 863.71159 146
## 20 China 738275078 738.27508 1439
## 21 Japan 733726265 733.72627 126
## 22 Russia 655879251 655.87925 146
## 23 South Africa 491644777 491.64478 60
## 24 India 988849880 988.84988 1380
## 25 Russia 536210430 536.21043 146
## 26 Russia 521663999 521.66400 146
## 27 China 1074542897 1074.54290 1439
## 28 Indonesia 633829632 633.82963 274
## 29 Japan 399126588 399.12659 126
## 30 China 1202028857 1202.02886 1439
## 31 USA 515313114 515.31311 331
## 32 Indonesia 445421988 445.42199 274
## 33 Argentina 210449747 210.44975 45
## 34 Russia 11186886 11.18689 146
## 35 Russia 278429462 278.42946 146
## 36 Indonesia 378091478 378.09148 274
## 37 Mexico 30190953 30.19095 129
## 38 South Africa 148881368 148.88137 60
## 39 Canada 117312606 117.31261 38
## 40 South Africa 121780956 121.78096 60
## 41 UK 31063151 31.06315 67
## 42 USA 320264723 320.26472 331
## 43 Australia 26451115 26.45112 26
## Difference_Millions
## 1 1306.348555
## 2 1180.881222
## 3 1178.047250
## 4 1082.896221
## 5 1077.288537
## 6 1033.729629
## 7 982.364538
## 8 951.900597
## 9 -947.573804
## 10 893.863791
## 11 878.264951
## 12 868.856255
## 13 -865.548474
## 14 829.719729
## 15 772.572573
## 16 745.165392
## 17 741.790036
## 18 724.363773
## 19 717.711591
## 20 -700.724922
## 21 607.726265
## 22 509.879251
## 23 431.644777
## 24 -391.150120
## 25 390.210430
## 26 375.663999
## 27 -364.457103
## 28 359.829632
## 29 273.126588
## 30 -236.971143
## 31 184.313114
## 32 171.421988
## 33 165.449747
## 34 -134.813114
## 35 132.429462
## 36 104.091478
## 37 -98.809047
## 38 88.881368
## 39 79.312606
## 40 61.780956
## 41 -35.936849
## 42 -10.735277
## 43 0.451115
Explanation of what this code does: I create a table of real-world 2020 population figures, then compare it to what’s in my dataset for the same year. By calculating the difference in millions, we can see how far the dataset values are from reality. filter() keeps only the 2020 rows, select() picks the relevant columns, and mutate() creates the new calculated columns.
The Documentation Gap: Even with perfect documentation, if that documentation doesn’t address why the UK appears to have a population of 980 million in 2022 (roughly 15 times its actual size, I checked), I would have a critical problem. This could indicate:

- Synthetic/simulated data with unrealistic parameters
- Encoding errors (perhaps mixing population with another metric; a quick check of this follows below)
- Corrupted data during collection or transfer
- Intentional obfuscation for privacy (but then why keep country names?)
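One quick, limited check of the encoding-error hypothesis: if Population were a mislabeled copy of another variable in this file, it should correlate strongly with that column. This only tests variables inside the dataset (not external metrics like GDP), and the column names are taken from the str() output above.

# Correlate Population with the other numeric columns; values near zero would
# suggest it isn't simply a mislabeled copy of another metric in this file
climate_data %>%
  summarise(
    cor_temperature = cor(Population, Avg.Temperature...C.),
    cor_co2         = cor(Population, CO2.Emissions..Tons.Capita.),
    cor_sea_level   = cor(Population, Sea.Level.Rise..mm.),
    cor_rainfall    = cor(Population, Rainfall..mm.)
  )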
Again, as mentioned in Leon’s lecture about data with multiple subsets, when you see something like this it probably means you’re dealing with multiple subsets of data, and you shouldn’t be mixing apples and oranges. Let’s explore the population issue visually.
pop_volatility <- climate_data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
mutate(
pop_change_pct = (Population - lag(Population)) / lag(Population) * 100
) %>%
filter(!is.na(pop_change_pct))
ggplot(pop_volatility, aes(x = Year, y = pop_change_pct, color = Country)) +
geom_line(alpha = 0.6) +
geom_hline(yintercept = c(-5, 5), linetype = "dashed", color = "red") +
labs(
title = "Year-over-Year Population Change: Biologically Impossible Volatility",
subtitle = "Red lines show ±5% threshold - no real country changes this fast",
x = "Year",
y = "Population Change (%)",
color = "Country"
) +
theme_minimal() +
theme(legend.position = "bottom")
Explanation of what this does: I calculated the percentage change in population from one year to the next for each country. The lag() function gets the previous year’s value, letting me compute growth rates. Then I plot these changes over time with lines for each country, adding red reference lines at ±5% to show what would be extreme but still possible changes.
What’s unclear and why it’s concerning: Real-world population changes rarely exceed 3% annually (even accounting for migration). This dataset shows countries experiencing 50–100%+ population swings year-to-year, which is biologically and socially impossible. This suggests the “Population” column might encode something else entirely (GDP? energy consumption?) or might be completely synthetic; I can’t tell which from the data alone.
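To put a rough number on that claim, here is a small follow-up check reusing the pop_volatility data frame built above (just a sketch; the column names are the ones I created there):

# Quantify the volatility: how many year-over-year changes exceed +/-5% or +/-50%?
pop_volatility %>%
  ungroup() %>%
  summarise(
    n_changes    = n(),
    over_5_pct   = sum(abs(pop_change_pct) > 5),
    over_50_pct  = sum(abs(pop_change_pct) > 50),
    share_over_5 = mean(abs(pop_change_pct) > 5)
  )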
Risk assessment: If a researcher used this data to model climate impact on population density, resource allocation, or per-capita emissions without noticing this issue, their conclusions would be badly wrong. Policy recommendations based on such analysis could misallocate billions in climate adaptation funding.
ggplot(climate_data, aes(x = reorder(Country, Population, median),
y = Population / 1000000,
fill = Country)) +
geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.size = 2) +
coord_flip() +
scale_y_continuous(labels = comma) +
labs(
title = "Population Distribution by Country: Suspicious Overlaps",
subtitle = "Countries with vastly different actual populations show similar ranges",
x = "Country",
y = "Population (Millions)",
caption = "Red dots indicate statistical outliers within each country"
) +
theme_minimal() +
theme(legend.position = "none")
Explanation of what this does: I made a boxplot for each country showing how its population values are distributed across all years in the dataset. The reorder() function sorts countries by their median population for easier comparison. Converting to millions makes the y-axis more readable, and coord_flip() turns the plot sideways so country names don’t overlap.
What’s unclear and why it’s a concern: Notice how countries like Australia (actual population: 26 million) and China (actual population: 1.4 billion) show overlapping population ranges in this dataset. This is impossible in reality, since they differ by a factor of 50+. The fact that small and large countries have similar statistical distributions suggests this data is either:

1. Randomly generated without real-world constraints
2. Measuring something other than human population
3. Affected by a systematic encoding/scaling error
I’m not sure which of these is most likely, but all three are possible.
Risk assessment: As mentioned in the lecture, if we calculated the covariance between population and CO2 emissions using this data, we’d be combining what should be distinct subsets. Any inference about population’s relationship with climate variables would be meaningless.
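A minimal sketch of the apples-and-oranges point: compare the pooled correlation between Population and CO2 emissions with the per-country correlations. If the pooled number looks nothing like the within-country ones, it is mixing subsets. (pooled_cor and within_country are illustrative names.)

# Pooled correlation mixes all countries into one pot ...
pooled_cor <- with(climate_data,
                   cor(Population, CO2.Emissions..Tons.Capita.))

# ... while within-country correlations respect the subsets
within_country <- climate_data %>%
  group_by(Country) %>%
  summarise(cor_pop_co2 = cor(Population, CO2.Emissions..Tons.Capita.),
            .groups = "drop")

pooled_cor
summary(within_country$cor_pop_co2)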
For this analysis, I’ll treat Country as a categorical variable and create a categorical version of Extreme Weather Events.
sum(is.na(climate_data$Country))
## [1] 0
sum(climate_data$Country == "")
## [1] 0
expected_g20 <- c("Argentina", "Australia", "Brazil", "Canada", "China",
"France", "Germany", "India", "Indonesia", "Italy",
"Japan", "Mexico", "Russia", "Saudi Arabia", "South Africa",
"South Korea", "Turkey", "UK", "USA")
actual_countries <- unique(climate_data$Country)
missing_g20 <- setdiff(expected_g20, actual_countries)
cat("Explicitly missing (NA) countries:", sum(is.na(climate_data$Country)), "\n")
## Explicitly missing (NA) countries: 0
cat("Empty string countries:", sum(climate_data$Country == ""), "\n")
## Empty string countries: 0
cat("Implicitly missing G20 countries:", paste(missing_g20, collapse = ", "), "\n")
## Implicitly missing G20 countries: Italy, Saudi Arabia, South Korea, Turkey
year_country_combos <- climate_data %>%
group_by(Country, Year) %>%
summarise(n = n(), .groups = "drop")
years <- unique(climate_data$Year)
countries <- unique(climate_data$Country)
all_combos <- expand.grid(Year = years, Country = countries)
missing_combos <- all_combos %>%
anti_join(year_country_combos, by = c("Year", "Country"))
cat("Empty groups (year-country combinations with no data):", nrow(missing_combos), "\n")
## Empty groups (year-country combinations with no data): 20
Explanation of what this code does: Here I am checking three types of missing data - explicitly missing (NA values), implicitly missing (countries I would expect but aren’t in the dataset), and empty groups (combinations of year and country that should exist but don’t). The expand.grid() function creates all possible year-country pairs, then anti_join() finds which combinations exist in our complete list but not in our actual data.
Findings:

- Explicitly missing rows: None found (no NA values in the Country column)
- Implicitly missing rows: Several major G20 economies are missing (Italy, Saudi Arabia, South Korea, Turkey), which creates selection bias if this dataset is meant to represent global climate patterns
- Empty groups: The dataset has some year-country combinations but not all (20 are missing), suggesting non-random sampling
Why this matters: If I were trying to make global inferences about climate change while missing major economies and polluters, the conclusions would be systematically biased toward the countries that are included.
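An alternative way to surface the same empty groups is tidyr::complete(), which turns implicitly missing combinations into explicit NA rows. This sketch is equivalent in spirit to the expand.grid()/anti_join() approach above:

# Make the implicit missingness explicit: complete() adds a row for every
# possible Country x Year pair, with NA in n where no observations exist
climate_data %>%
  count(Country, Year) %>%
  complete(Country, Year) %>%
  filter(is.na(n)) %>%
  head()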
climate_data_cat <- climate_data %>%
mutate(
Weather_Severity = cut(Extreme.Weather.Events,
breaks = c(-1, 0, 4, 9, 14),
labels = c("None", "Low", "Moderate", "High"))
)
cat("Explicitly missing Weather_Severity:",
sum(is.na(climate_data_cat$Weather_Severity)), "\n")
## Explicitly missing Weather_Severity: 0
climate_data_cat %>%
group_by(Weather_Severity) %>%
summarise(count = n()) %>%
print()
## # A tibble: 4 × 2
## Weather_Severity count
## <fct> <int>
## 1 None 74
## 2 Low 237
## 3 Moderate 331
## 4 High 358
country_severity <- climate_data_cat %>%
group_by(Country, Weather_Severity) %>%
summarise(n = n(), .groups = "drop") %>%
spread(Weather_Severity, n, fill = 0)
cat("\nCountries missing certain severity levels:\n")
##
## Countries missing certain severity levels:
print(country_severity)
## # A tibble: 15 × 5
## Country None Low Moderate High
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Argentina 7 19 20 21
## 2 Australia 2 17 24 14
## 3 Brazil 6 14 28 19
## 4 Canada 5 17 21 24
## 5 China 8 12 22 25
## 6 France 1 16 15 34
## 7 Germany 7 14 18 22
## 8 India 6 11 30 23
## 9 Indonesia 4 20 22 29
## 10 Japan 2 17 17 27
## 11 Mexico 9 11 19 16
## 12 Russia 1 23 21 29
## 13 South Africa 5 16 32 20
## 14 UK 5 13 17 30
## 15 USA 6 17 25 25
Explanation of what this does: I convert the numeric extreme weather events into categories (None, Low, Moderate, High) using cut() with specific breakpoints, then check how many observations fall into each category and whether certain countries are missing certain severity levels. The spread() function pivots the data to show which combinations exist and which don’t.
Findings:

- Explicitly missing rows: None; the breakpoints cover the full 0–14 range, so cut() leaves no values uncategorized (see the small demonstration below)
- Implicitly missing rows: In this dataset every country records all four severity levels, though some (France, Russia) show “None” only once; if a country never showed “None” or “High”, that could reflect either real climate patterns or sampling bias
- Empty groups: If certain countries never experienced extreme weather in this dataset but do in reality, that would indicate incomplete data collection
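A quick base-R demonstration of why those breakpoints leave nothing uncategorized (this is what the first finding above refers to):

# cut() uses right-closed intervals by default: (-1,0], (0,4], (4,9], (9,14],
# so 0 falls into "None", 14 into "High", and no value is dropped as NA
cut(c(0, 1, 4, 5, 9, 10, 14),
    breaks = c(-1, 0, 4, 9, 14),
    labels = c("None", "Low", "Moderate", "High"))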
For temperature data, I’ll use a somewhat unconventional approach: context-aware outlier detection based on geographic reality rather than pure statistics.
temp_stats <- climate_data %>%
summarise(
Q1 = quantile(Avg.Temperature...C., 0.25),
Q3 = quantile(Avg.Temperature...C., 0.75),
IQR = Q3 - Q1,
Lower_Fence = Q1 - 1.5 * IQR,
Upper_Fence = Q3 + 1.5 * IQR
)
print(temp_stats)
## Q1 Q3 IQR Lower_Fence Upper_Fence
## 1 12.175 27.225 15.05 -10.4 49.8
statistical_outliers <- climate_data %>%
filter(Avg.Temperature...C. < temp_stats$Lower_Fence |
Avg.Temperature...C. > temp_stats$Upper_Fence) %>%
select(Year, Country, Avg.Temperature...C.) %>%
arrange(Avg.Temperature...C.)
cat("Statistical outliers (IQR method):", nrow(statistical_outliers), "\n")
## Statistical outliers (IQR method): 0
Explanation of what this code does: I first calculate the first quartile (Q1), third quartile (Q3), and interquartile range (IQR) for temperature. The standard outlier definition is any value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR. We then filter our data to find which observations fall outside these “fences” and count how many statistical outliers exist.
My outlier definition: Instead of relying solely on statistical measures, I define outliers as temperatures that are geographically implausible for a given country. For example:

- UK showing an average temperature of 33.5°C (realistic for a heat wave day, but not an annual average)
- Russia showing 5.3°C (their actual average is around -5°C)
geographic_norms <- data.frame(
Country = c("UK", "USA", "China", "India", "Brazil", "Germany",
"France", "Canada", "Australia", "Mexico", "Russia",
"South Africa", "Argentina", "Indonesia", "Japan"),
Expected_Temp = c(9, 12, 14, 25, 25, 9,
12, -5, 22, 21, -5,
17, 14, 26, 11),
Tolerance = c(3, 4, 4, 3, 3, 3,
3, 4, 3, 3, 4,
3, 3, 2, 3)
)
geographic_outliers <- climate_data %>%
left_join(geographic_norms, by = "Country") %>%
mutate(
Temp_Deviation = abs(Avg.Temperature...C. - Expected_Temp),
Is_Geographic_Outlier = Temp_Deviation > Tolerance
) %>%
filter(Is_Geographic_Outlier == TRUE) %>%
select(Year, Country, Avg.Temperature...C., Expected_Temp, Temp_Deviation)
cat("Geographic outliers:", nrow(geographic_outliers), "\n")
## Geographic outliers: 811
cat("Percentage of data that are geographic outliers:",
round(nrow(geographic_outliers) / nrow(climate_data) * 100, 1), "%\n")
## Percentage of data that are geographic outliers: 81.1 %
Explanation of what this does: I create a reference table with expected average temperatures for each country based on real climate data. Then I join this with the dataset and calculate how far each observation deviates from what’s expected. If the deviation exceeds the tolerance threshold (which varies by country), I flag it as a geographic outlier.
climate_data_outliers <- climate_data %>%
left_join(geographic_norms, by = "Country") %>%
mutate(
Is_Statistical_Outlier = Avg.Temperature...C. < temp_stats$Lower_Fence |
Avg.Temperature...C. > temp_stats$Upper_Fence,
Temp_Deviation = abs(Avg.Temperature...C. - Expected_Temp),
Is_Geographic_Outlier = Temp_Deviation > Tolerance,
Outlier_Type = case_when(
Is_Statistical_Outlier & Is_Geographic_Outlier ~ "Both",
Is_Statistical_Outlier ~ "Statistical Only",
Is_Geographic_Outlier ~ "Geographic Only",
TRUE ~ "Not Outlier"
)
)
ggplot(climate_data_outliers,
aes(x = Year, y = Avg.Temperature...C., color = Outlier_Type)) +
geom_point(alpha = 0.6, size = 2) +
facet_wrap(~ Country, scales = "free_y") +
scale_color_manual(values = c("Both" = "red",
"Statistical Only" = "orange",
"Geographic Only" = "purple",
"Not Outlier" = "gray70")) +
labs(
title = "Temperature Outliers: Statistical vs. Geographic Context",
subtitle = "Red points violate both statistical and geographic norms",
x = "Year",
y = "Average Temperature (°C)",
color = "Outlier Type"
) +
theme_minimal() +
theme(legend.position = "bottom")
Explanation of what this does: I combine the statistical and geographic outlier detection into one dataset, creating a new variable that categorizes each point. The facet_wrap() creates separate small plots for each country. Different colors show which points are outliers by which criteria: red for both types, orange for statistical only, purple for geographic only, and gray for normal values.
Why this matters: Looking at the visualization, you can see that many data points are geographically impossible, suggesting the dataset may be synthetic or may contain serious measurement errors. Fully 81.1% of temperature readings are geographic outliers by my definition.
Significance: If I were to use this data to model temperature trends or climate change impacts, I would need to either:

1. Filter out geographic outliers entirely (a sketch of this step appears below)
2. Investigate whether “Average Temperature” means something different from an annual mean
3. Treat this as synthetic data for methods testing only, not real-world inference
The risk of not catching these outliers is publishing false conclusions about climate trends or making incorrect predictions about future warming patterns.
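If I went with option 1, a minimal sketch of the filtering step (reusing the climate_data_outliers table built above, and assuming my geographic norms are roughly right; climate_data_clean is an illustrative name) might look like this:

# Keep only rows whose temperature is plausible for the country in question,
# then drop the helper columns added for outlier detection
climate_data_clean <- climate_data_outliers %>%
  filter(!Is_Geographic_Outlier) %>%
  select(-Expected_Temp, -Tolerance, -Temp_Deviation,
         -Is_Statistical_Outlier, -Is_Geographic_Outlier, -Outlier_Type)

nrow(climate_data_clean)  # how many of the 1000 rows survive the filter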
This analysis reveals several documentation and data quality issues: an ambiguously defined Extreme Weather Events variable, population values that are inconsistent with reality, missing G20 countries, and temperatures that are geographically implausible for most rows.
These issues compound when we try to make inferences. The covariance between population and emissions would be meaningless. Confidence intervals calculated from this data might be mathematically correct but practically useless.
Some recommendations:

- Cross-validate against external sources before trusting any variable
- Use domain knowledge (geography, biology) alongside statistical methods for outlier detection
- Be transparent about data limitations in any analysis or reporting
Without addressing these issues, no amount of sophisticated statistical analysis can produce valid insights from this dataset.