This data dive studies my climate change dataset to evaluate the risks of undocumented or poorly documented data. As mentioned in Leon’s lecture, understanding data is crucial for making valid inferences: the more you know about your data, the better your inferences will be.
library(tidyverse)  # dplyr, tidyr, and ggplot2 are used throughout
library(scales)     # comma() for axis labels
climate_data <- read.csv("C:/Users/IU Student/Downloads/climate_change_dataset.csv")
str(climate_data)
## 'data.frame': 1000 obs. of 10 variables:
## $ Year : int 2006 2019 2014 2010 2007 2020 2006 2018 2022 2010 ...
## $ Country : chr "UK" "USA" "France" "Argentina" ...
## $ Avg.Temperature...C. : num 8.9 31 33.9 5.9 26.9 32.3 30.7 33.9 27.8 18.3 ...
## $ CO2.Emissions..Tons.Capita.: num 9.3 4.8 2.8 1.8 5.6 1.4 11.6 6 16.6 1.9 ...
## $ Sea.Level.Rise..mm. : num 3.1 4.2 2.2 3.2 2.4 2.7 3.9 4.5 1.5 3.5 ...
## $ Rainfall..mm. : int 1441 2407 1241 1892 1743 2100 1755 827 1966 2599 ...
## $ Population : int 530911230 107364344 441101758 1069669579 124079175 1202028857 586706107 83947380 980305187 849496137 ...
## $ Renewable.Energy.... : num 20.4 49.2 33.3 23.7 12.5 49.4 41.9 17.7 8.2 7.5 ...
## $ Extreme.Weather.Events : int 14 8 9 7 4 12 10 1 4 5 ...
## $ Forest.Area.... : num 59.8 31 35.5 17.7 17.4 47.2 50.5 56.6 43.4 48.7 ...
Explanation of what this code does: First I load the climate change dataset into R and examine its structure. This shows the column names and data types and gives a first glimpse of what we’re working with. The str() output confirms whether numeric variables are actually stored as numbers and text variables as text.
When I first looked at this column, I wasn’t sure if “Extreme Weather Events” represented:

- A count of events (14 hurricanes, 8 floods, etc.)
- A severity scale (0 = none, 14 = catastrophic)
- A categorical ranking system
climate_data %>%
group_by(Extreme.Weather.Events) %>%
summarise(count = n()) %>%
arrange(Extreme.Weather.Events)
## # A tibble: 15 × 2
## Extreme.Weather.Events count
## <int> <int>
## 1 0 74
## 2 1 62
## 3 2 58
## 4 3 67
## 5 4 50
## 6 5 63
## 7 6 52
## 8 7 57
## 9 8 83
## 10 9 76
## 11 10 68
## 12 11 61
## 13 12 71
## 14 13 77
## 15 14 81
Explanation of what this code does: Here I group the data by the Extreme Weather Events values and count how many times each value appears.
Why this matters: If this is a count, then 0 is perfectly reasonable (no extreme events that year). If it’s a severity scale, 0 might mean data not collected rather than no events. Without documentation clarifying this is a COUNT of discrete events, I might have incorrectly treated zeros as missing data and removed them from analysis. This would have changed any correlation analysis between climate factors and extreme weather frequency.
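To make that risk concrete, here is a minimal sketch (assuming the count interpretation, and using the column names printed by str() above; the object names are just for illustration) comparing a correlation computed on the full data with one computed after wrongly dropping the 74 zero-event rows as if they were missing:

# Correlation between CO2 emissions and extreme weather events using all rows
cor_all <- with(climate_data,
                cor(CO2.Emissions..Tons.Capita., Extreme.Weather.Events))

# The same correlation after (incorrectly) treating 0 events as missing data
cor_no_zeros <- with(subset(climate_data, Extreme.Weather.Events > 0),
                     cor(CO2.Emissions..Tons.Capita., Extreme.Weather.Events))

c(all_rows = cor_all, zeros_dropped = cor_no_zeros)

The exact values aren’t the point; the point is that the two estimates can differ, and only documentation tells me which one is valid.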
The Population column presents values in the hundreds of millions without any indication of:

- Units (is this in thousands? millions? an actual count?)
- Measurement timing (census data? estimate? projection?)
- Geographic scope (metropolitan area? entire country?)
climate_data %>%
group_by(Country) %>%
summarise(
min_pop = min(Population),
max_pop = max(Population),
pop_range = max(Population) - min(Population)
) %>%
arrange(desc(pop_range))
## # A tibble: 15 × 4
## Country min_pop max_pop pop_range
## <chr> <int> <int> <int>
## 1 Russia 11186886 1397016073 1385829187
## 2 Brazil 9355425 1393981934 1384626509
## 3 Japan 5467801 1385374585 1379906784
## 4 UK 20916998 1387457528 1366540530
## 5 Mexico 22094509 1388289771 1366195262
## 6 Australia 26451115 1388230143 1361779028
## 7 Germany 16901724 1377569278 1360667554
## 8 India 9918562 1366390185 1356471623
## 9 France 24043592 1379671819 1355628227
## 10 China 3660891 1358197397 1354536506
## 11 Canada 45925925 1395185778 1349259853
## 12 Indonesia 10978954 1358606331 1347627377
## 13 USA 41988408 1380798693 1338810285
## 14 South Africa 46184187 1383289354 1337105167
## 15 Argentina 46153952 1356785799 1310631847
Explanation of what this code does: For each country in my dataset, I will calculate the minimum population, maximum population, and the difference between them. Then I sort by the range to see which countries show the most variation. This helps spot potential data quality issues.
Why this matters: Looking at the UK with populations ranging from ~21 million to ~1.4 billion raises immediate red flags. The actual UK population is around 67 million. If I assumed these were real population figures and calculated per-capita emissions or resource usage, my conclusions would be completely wrong. Documentation should specify whether these are synthetic values, scaled values, or the result of measurement errors.
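As a quick illustration, here is a minimal sketch (assuming the CO2 column really is tons per capita and Population is a head count, which is exactly what is in doubt; Implied_Total_Emissions_Mt is a name I made up for this check) of what a naive total-emissions calculation would imply for the UK:

# Implied total emissions (million tons) if Population were taken at face value;
# for a country of ~67 million people these totals would be absurd
climate_data %>%
  filter(Country == "UK") %>%
  mutate(Implied_Total_Emissions_Mt =
           CO2.Emissions..Tons.Capita. * Population / 1e6) %>%
  select(Year, Population, Implied_Total_Emissions_Mt) %>%
  arrange(desc(Implied_Total_Emissions_Mt)) %>%
  head()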
Even if documentation exists for this dataset, the population values appear to be fundamentally inconsistent with reality. Let me show why this is concerning:
actual_populations <- data.frame(
Country = c("UK", "USA", "China", "India", "Brazil", "Germany",
"France", "Canada", "Australia", "Mexico", "Russia",
"South Africa", "Argentina", "Indonesia", "Japan"),
Actual_Pop_Millions = c(67, 331, 1439, 1380, 213, 83,
65, 38, 26, 129, 146,
60, 45, 274, 126)
)
climate_data %>%
filter(Year == 2020) %>%
select(Country, Population) %>%
mutate(Dataset_Pop_Millions = Population / 1000000) %>%
left_join(actual_populations, by = "Country") %>%
mutate(Difference_Millions = Dataset_Pop_Millions - Actual_Pop_Millions) %>%
arrange(desc(abs(Difference_Millions)))
## Country Population Dataset_Pop_Millions Actual_Pop_Millions
## 1 Australia 1332348555 1332.34856 26
## 2 Argentina 1225881222 1225.88122 45
## 3 Brazil 1391047250 1391.04725 213
## 4 South Africa 1142896221 1142.89622 60
## 5 Australia 1103288537 1103.28854 26
## 6 Brazil 1246729629 1246.72963 213
## 7 Russia 1128364538 1128.36454 146
## 8 Australia 977900597 977.90060 26
## 9 China 491426196 491.42620 1439
## 10 Brazil 1106863791 1106.86379 213
## 11 France 943264951 943.26495 65
## 12 USA 1199856255 1199.85625 331
## 13 China 573451526 573.45153 1439
## 14 Brazil 1042719729 1042.71973 213
## 15 Argentina 817572573 817.57257 45
## 16 South Africa 805165392 805.16539 60
## 17 USA 1072790036 1072.79004 331
## 18 Brazil 937363773 937.36377 213
## 19 Russia 863711591 863.71159 146
## 20 China 738275078 738.27508 1439
## 21 Japan 733726265 733.72627 126
## 22 Russia 655879251 655.87925 146
## 23 South Africa 491644777 491.64478 60
## 24 India 988849880 988.84988 1380
## 25 Russia 536210430 536.21043 146
## 26 Russia 521663999 521.66400 146
## 27 China 1074542897 1074.54290 1439
## 28 Indonesia 633829632 633.82963 274
## 29 Japan 399126588 399.12659 126
## 30 China 1202028857 1202.02886 1439
## 31 USA 515313114 515.31311 331
## 32 Indonesia 445421988 445.42199 274
## 33 Argentina 210449747 210.44975 45
## 34 Russia 11186886 11.18689 146
## 35 Russia 278429462 278.42946 146
## 36 Indonesia 378091478 378.09148 274
## 37 Mexico 30190953 30.19095 129
## 38 South Africa 148881368 148.88137 60
## 39 Canada 117312606 117.31261 38
## 40 South Africa 121780956 121.78096 60
## 41 UK 31063151 31.06315 67
## 42 USA 320264723 320.26472 331
## 43 Australia 26451115 26.45112 26
## Difference_Millions
## 1 1306.348555
## 2 1180.881222
## 3 1178.047250
## 4 1082.896221
## 5 1077.288537
## 6 1033.729629
## 7 982.364538
## 8 951.900597
## 9 -947.573804
## 10 893.863791
## 11 878.264951
## 12 868.856255
## 13 -865.548474
## 14 829.719729
## 15 772.572573
## 16 745.165392
## 17 741.790036
## 18 724.363773
## 19 717.711591
## 20 -700.724922
## 21 607.726265
## 22 509.879251
## 23 431.644777
## 24 -391.150120
## 25 390.210430
## 26 375.663999
## 27 -364.457103
## 28 359.829632
## 29 273.126588
## 30 -236.971143
## 31 184.313114
## 32 171.421988
## 33 165.449747
## 34 -134.813114
## 35 132.429462
## 36 104.091478
## 37 -98.809047
## 38 88.881368
## 39 79.312606
## 40 61.780956
## 41 -35.936849
## 42 -10.735277
## 43 0.451115
Explanation of what this code does: I create a table of real-world 2020 population figures, then compare it to what’s in my dataset for the same year. By calculating the difference in millions, we can see how far the dataset values are from reality. filter() keeps only the 2020 rows, select() picks the relevant columns, and mutate() creates the new calculated columns.
The Documentation Gap: Even with perfect documentation, if that documentation doesn’t address why the UK appears to have a population of 980 million in 2022 (roughly 15 times its actual size, I checked), I would have a critical problem. This could indicate:

- Synthetic/simulated data with unrealistic parameters
- Encoding errors (perhaps mixing population with another metric; a quick check of this follows below)
- Corrupted data during collection or transfer
- Intentional obfuscation for privacy (but then why keep country names?)
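One quick, limited check of the encoding-error hypothesis: if Population were a mislabeled copy of another variable in this file, it should correlate strongly with that column. This only tests variables inside the dataset (not external metrics like GDP), and the column names are taken from the str() output above.

# Correlate Population with the other numeric columns; values near zero would
# suggest it isn't simply a mislabeled copy of another metric in this file
climate_data %>%
  summarise(
    cor_temperature = cor(Population, Avg.Temperature...C.),
    cor_co2         = cor(Population, CO2.Emissions..Tons.Capita.),
    cor_sea_level   = cor(Population, Sea.Level.Rise..mm.),
    cor_rainfall    = cor(Population, Rainfall..mm.)
  )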
Again, as mentioned in Leon’s lecture about data with multiple subsets, when you see something like this it probably means you’re dealing with multiple subsets of data, and you shouldn’t be mixing apples and oranges. Let’s explore the population issue visually.
pop_volatility <- climate_data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
mutate(
pop_change_pct = (Population - lag(Population)) / lag(Population) * 100
) %>%
filter(!is.na(pop_change_pct))
ggplot(pop_volatility, aes(x = Year, y = pop_change_pct, color = Country)) +
geom_line(alpha = 0.6) +
geom_hline(yintercept = c(-5, 5), linetype = "dashed", color = "red") +
labs(
title = "Year-over-Year Population Change: Biologically Impossible Volatility",
subtitle = "Red lines show ±5% threshold - no real country changes this fast",
x = "Year",
y = "Population Change (%)",
color = "Country"
) +
theme_minimal() +
theme(legend.position = "bottom")
Explanation of what this does: I calculated the percentage change in population from one year to the next for each country. The lag() function gets the previous year’s value, letting me compute growth rates. Then I plot these changes over time with lines for each country, adding red reference lines at ±5% to show what would be extreme but still possible changes.
What’s unclear and why it’s concerning: Real-world population changes rarely exceed 3% annually (even accounting for migration). This dataset shows countries experiencing 50–100%+ population swings year-to-year, which is biologically and socially impossible. This suggests the “Population” column might encode something else entirely (GDP? energy consumption?) or might be completely synthetic; I can’t tell which from the data alone.
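To put a rough number on that claim, here is a small follow-up check reusing the pop_volatility data frame built above (just a sketch; the column names are the ones I created there):

# Quantify the volatility: how many year-over-year changes exceed +/-5% or +/-50%?
pop_volatility %>%
  ungroup() %>%
  summarise(
    n_changes    = n(),
    over_5_pct   = sum(abs(pop_change_pct) > 5),
    over_50_pct  = sum(abs(pop_change_pct) > 50),
    share_over_5 = mean(abs(pop_change_pct) > 5)
  )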
Risk assessment: If a researcher used this data to model climate impact on population density, resource allocation, or per-capita emissions without noticing this issue, their conclusions would be badly wrong. Policy recommendations based on such analysis could misallocate billions in climate adaptation funding.
ggplot(climate_data, aes(x = reorder(Country, Population, median),
y = Population / 1000000,
fill = Country)) +
geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.size = 2) +
coord_flip() +
scale_y_continuous(labels = comma) +
labs(
title = "Population Distribution by Country: Suspicious Overlaps",
subtitle = "Countries with vastly different actual populations show similar ranges",
x = "Country",
y = "Population (Millions)",
caption = "Red dots indicate statistical outliers within each country"
) +
theme_minimal() +
theme(legend.position = "none")
Explanation of what this does: I made a boxplot for each country showing how its population values are distributed across all years in the dataset. The reorder() function sorts countries by their median population for easier comparison. Converting to millions makes the y-axis more readable, and coord_flip() turns the plot sideways so country names don’t overlap.
What’s unclear and why it’s a concern: Notice how countries like Australia (actual population: 26 million) and China (actual population: 1.4 billion) show overlapping population ranges in this dataset. This is impossible in reality, since they differ by a factor of 50+. The fact that small and large countries have similar statistical distributions suggests this data is either:

1. Randomly generated without real-world constraints
2. Measuring something other than human population
3. Affected by a systematic encoding/scaling error
I’m not sure which of these is most likely, but all three are possible.
Risk assessment: As mentioned in the lecture, if we calculated the covariance between population and CO2 emissions using this data, we’d be combining what should be distinct subsets. Any inference about population’s relationship with climate variables would be meaningless.
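A minimal sketch of the apples-and-oranges point: compare the pooled correlation between Population and CO2 emissions with the per-country correlations. If the pooled number looks nothing like the within-country ones, it is mixing subsets. (pooled_cor and within_country are illustrative names.)

# Pooled correlation mixes all countries into one pot ...
pooled_cor <- with(climate_data,
                   cor(Population, CO2.Emissions..Tons.Capita.))

# ... while within-country correlations respect the subsets
within_country <- climate_data %>%
  group_by(Country) %>%
  summarise(cor_pop_co2 = cor(Population, CO2.Emissions..Tons.Capita.),
            .groups = "drop")

pooled_cor
summary(within_country$cor_pop_co2)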
For this analysis, I’ll treat Country as a categorical variable and create a categorical version of Extreme Weather Events.
sum(is.na(climate_data$Country))
## [1] 0
sum(climate_data$Country == "")
## [1] 0
expected_g20 <- c("Argentina", "Australia", "Brazil", "Canada", "China",
"France", "Germany", "India", "Indonesia", "Italy",
"Japan", "Mexico", "Russia", "Saudi Arabia", "South Africa",
"South Korea", "Turkey", "UK", "USA")
actual_countries <- unique(climate_data$Country)
missing_g20 <- setdiff(expected_g20, actual_countries)
cat("Explicitly missing (NA) countries:", sum(is.na(climate_data$Country)), "\n")
## Explicitly missing (NA) countries: 0
cat("Empty string countries:", sum(climate_data$Country == ""), "\n")
## Empty string countries: 0
cat("Implicitly missing G20 countries:", paste(missing_g20, collapse = ", "), "\n")
## Implicitly missing G20 countries: Italy, Saudi Arabia, South Korea, Turkey
year_country_combos <- climate_data %>%
group_by(Country, Year) %>%
summarise(n = n(), .groups = "drop")
years <- unique(climate_data$Year)
countries <- unique(climate_data$Country)
all_combos <- expand.grid(Year = years, Country = countries)
missing_combos <- all_combos %>%
anti_join(year_country_combos, by = c("Year", "Country"))
cat("Empty groups (year-country combinations with no data):", nrow(missing_combos), "\n")
## Empty groups (year-country combinations with no data): 20
Explanation of what this code does: Here I am checking three types of missing data - explicitly missing (NA values), implicitly missing (countries I would expect but aren’t in the dataset), and empty groups (combinations of year and country that should exist but don’t). The expand.grid() function creates all possible year-country pairs, then anti_join() finds which combinations exist in our complete list but not in our actual data.
Findings:

- Explicitly missing rows: None found (no NA values in the Country column)
- Implicitly missing rows: Several major G20 economies are missing (Italy, Saudi Arabia, South Korea, Turkey), which creates selection bias if this dataset is meant to represent global climate patterns
- Empty groups: The dataset has some year-country combinations but not all (20 are missing), suggesting non-random sampling
Why this matters: If I were trying to make global inferences about climate change while missing major economies and polluters, the conclusions would be systematically biased toward the countries that are included.
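An alternative way to surface the same empty groups is tidyr::complete(), which turns implicitly missing combinations into explicit NA rows. This sketch is equivalent in spirit to the expand.grid()/anti_join() approach above:

# Make the implicit missingness explicit: complete() adds a row for every
# possible Country x Year pair, with NA in n where no observations exist
climate_data %>%
  count(Country, Year) %>%
  complete(Country, Year) %>%
  filter(is.na(n)) %>%
  head()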
climate_data_cat <- climate_data %>%
mutate(
Weather_Severity = cut(Extreme.Weather.Events,
breaks = c(-1, 0, 4, 9, 14),
labels = c("None", "Low", "Moderate", "High"))
)
cat("Explicitly missing Weather_Severity:",
sum(is.na(climate_data_cat$Weather_Severity)), "\n")
## Explicitly missing Weather_Severity: 0
climate_data_cat %>%
group_by(Weather_Severity) %>%
summarise(count = n()) %>%
print()
## # A tibble: 4 × 2
## Weather_Severity count
## <fct> <int>
## 1 None 74
## 2 Low 237
## 3 Moderate 331
## 4 High 358
country_severity <- climate_data_cat %>%
group_by(Country, Weather_Severity) %>%
summarise(n = n(), .groups = "drop") %>%
spread(Weather_Severity, n, fill = 0)
cat("\nCountries missing certain severity levels:\n")
##
## Countries missing certain severity levels:
print(country_severity)
## # A tibble: 15 × 5
## Country None Low Moderate High
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Argentina 7 19 20 21
## 2 Australia 2 17 24 14
## 3 Brazil 6 14 28 19
## 4 Canada 5 17 21 24
## 5 China 8 12 22 25
## 6 France 1 16 15 34
## 7 Germany 7 14 18 22
## 8 India 6 11 30 23
## 9 Indonesia 4 20 22 29
## 10 Japan 2 17 17 27
## 11 Mexico 9 11 19 16
## 12 Russia 1 23 21 29
## 13 South Africa 5 16 32 20
## 14 UK 5 13 17 30
## 15 USA 6 17 25 25
Explanation of what this does: I convert the numeric extreme weather events into categories (None, Low, Moderate, High) using cut() with specific breakpoints, then check how many observations fall into each category and whether certain countries are missing certain severity levels. The spread() function pivots the data to show which combinations exist and which don’t.
Findings:

- Explicitly missing rows: None; the breakpoints cover the full 0–14 range, so cut() leaves no values uncategorized (see the small demonstration below)
- Implicitly missing rows: In this dataset every country records all four severity levels, though some (France, Russia) show “None” only once; if a country never showed “None” or “High”, that could reflect either real climate patterns or sampling bias
- Empty groups: If certain countries never experienced extreme weather in this dataset but do in reality, that would indicate incomplete data collection
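A quick base-R demonstration of why those breakpoints leave nothing uncategorized (this is what the first finding above refers to):

# cut() uses right-closed intervals by default: (-1,0], (0,4], (4,9], (9,14],
# so 0 falls into "None", 14 into "High", and no value is dropped as NA
cut(c(0, 1, 4, 5, 9, 10, 14),
    breaks = c(-1, 0, 4, 9, 14),
    labels = c("None", "Low", "Moderate", "High"))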
For temperature data, I’ll use a somewhat unconventional approach: context-aware outlier detection based on geographic reality rather than pure statistics.
temp_stats <- climate_data %>%
summarise(
Q1 = quantile(Avg.Temperature...C., 0.25),
Q3 = quantile(Avg.Temperature...C., 0.75),
IQR = Q3 - Q1,
Lower_Fence = Q1 - 1.5 * IQR,
Upper_Fence = Q3 + 1.5 * IQR
)
print(temp_stats)
## Q1 Q3 IQR Lower_Fence Upper_Fence
## 1 12.175 27.225 15.05 -10.4 49.8
statistical_outliers <- climate_data %>%
filter(Avg.Temperature...C. < temp_stats$Lower_Fence |
Avg.Temperature...C. > temp_stats$Upper_Fence) %>%
select(Year, Country, Avg.Temperature...C.) %>%
arrange(Avg.Temperature...C.)
cat("Statistical outliers (IQR method):", nrow(statistical_outliers), "\n")
## Statistical outliers (IQR method): 0
Explanation of what this code does: I first calculate the first quartile (Q1), third quartile (Q3), and interquartile range (IQR) for temperature. The standard outlier definition is any value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR. We then filter our data to find which observations fall outside these “fences” and count how many statistical outliers exist.
My outlier definition: Instead of relying solely on statistical measures, I define outliers as temperatures that are geographically implausible for a given country. For example:

- UK showing an average temperature of 33.5°C (realistic for a heat wave day, but not an annual average)
- Russia showing 5.3°C (their actual average is around -5°C)
geographic_norms <- data.frame(
Country = c("UK", "USA", "China", "India", "Brazil", "Germany",
"France", "Canada", "Australia", "Mexico", "Russia",
"South Africa", "Argentina", "Indonesia", "Japan"),
Expected_Temp = c(9, 12, 14, 25, 25, 9,
12, -5, 22, 21, -5,
17, 14, 26, 11),
Tolerance = c(3, 4, 4, 3, 3, 3,
3, 4, 3, 3, 4,
3, 3, 2, 3)
)
geographic_outliers <- climate_data %>%
left_join(geographic_norms, by = "Country") %>%
mutate(
Temp_Deviation = abs(Avg.Temperature...C. - Expected_Temp),
Is_Geographic_Outlier = Temp_Deviation > Tolerance
) %>%
filter(Is_Geographic_Outlier == TRUE) %>%
select(Year, Country, Avg.Temperature...C., Expected_Temp, Temp_Deviation)
cat("Geographic outliers:", nrow(geographic_outliers), "\n")
## Geographic outliers: 811
cat("Percentage of data that are geographic outliers:",
round(nrow(geographic_outliers) / nrow(climate_data) * 100, 1), "%\n")
## Percentage of data that are geographic outliers: 81.1 %
Explanation of what this does: I create a reference table with expected average temperatures for each country based on real climate data. Then I join this with the dataset and calculate how far each observation deviates from what’s expected. If the deviation exceeds the tolerance threshold (which varies by country), I flag it as a geographic outlier.
climate_data_outliers <- climate_data %>%
left_join(geographic_norms, by = "Country") %>%
mutate(
Is_Statistical_Outlier = Avg.Temperature...C. < temp_stats$Lower_Fence |
Avg.Temperature...C. > temp_stats$Upper_Fence,
Temp_Deviation = abs(Avg.Temperature...C. - Expected_Temp),
Is_Geographic_Outlier = Temp_Deviation > Tolerance,
Outlier_Type = case_when(
Is_Statistical_Outlier & Is_Geographic_Outlier ~ "Both",
Is_Statistical_Outlier ~ "Statistical Only",
Is_Geographic_Outlier ~ "Geographic Only",
TRUE ~ "Not Outlier"
)
)
ggplot(climate_data_outliers,
aes(x = Year, y = Avg.Temperature...C., color = Outlier_Type)) +
geom_point(alpha = 0.6, size = 2) +
facet_wrap(~ Country, scales = "free_y") +
scale_color_manual(values = c("Both" = "red",
"Statistical Only" = "orange",
"Geographic Only" = "purple",
"Not Outlier" = "gray70")) +
labs(
title = "Temperature Outliers: Statistical vs. Geographic Context",
subtitle = "Red points violate both statistical and geographic norms",
x = "Year",
y = "Average Temperature (°C)",
color = "Outlier Type"
) +
theme_minimal() +
theme(legend.position = "bottom")
Explanation of what this does: I combine the statistical and geographic outlier detection into one dataset, creating a new variable that categorizes each point. The facet_wrap() creates separate small plots for each country. Different colors show which points are outliers by which criteria: red for both types, orange for statistical only, purple for geographic only, and gray for normal values.
Why this matters: Looking at the visualization, you can see that many data points are geographically impossible, suggesting the dataset may be synthetic or may contain serious measurement errors. Fully 81.1% of temperature readings are geographic outliers by my definition.
Significance: If I were to use this data to model temperature trends or climate change impacts, I would need to either:

1. Filter out geographic outliers entirely (a sketch of this step appears below)
2. Investigate whether “Average Temperature” means something different from an annual mean
3. Treat this as synthetic data for methods testing only, not real-world inference
The risk of not catching these outliers is publishing false conclusions about climate trends or making incorrect predictions about future warming patterns.
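If I went with option 1, a minimal sketch of the filtering step (reusing the climate_data_outliers table built above, and assuming my geographic norms are roughly right; climate_data_clean is an illustrative name) might look like this:

# Keep only rows whose temperature is plausible for the country in question,
# then drop the helper columns added for outlier detection
climate_data_clean <- climate_data_outliers %>%
  filter(!Is_Geographic_Outlier) %>%
  select(-Expected_Temp, -Tolerance, -Temp_Deviation,
         -Is_Statistical_Outlier, -Is_Geographic_Outlier, -Outlier_Type)

nrow(climate_data_clean)  # how many of the 1000 rows survive the filter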
This analysis reveals several documentation and data quality issues: an ambiguously defined Extreme Weather Events variable, population values that are inconsistent with reality, missing G20 countries, and temperatures that are geographically implausible for most rows.
These issues compound when we try to make inferences. The covariance between population and emissions would be meaningless. Confidence intervals calculated from this data might be mathematically correct but practically useless.
Some recommendations:

- Cross-validate against external sources before trusting any variable
- Use domain knowledge (geography, biology) alongside statistical methods for outlier detection
- Be transparent about data limitations in any analysis or reporting
Without addressing these issues, no amount of sophisticated statistical analysis can produce valid insights from this dataset.