Week 6: gapminder dataset

library(tidyverse)
library(gapminder)
library(ggthemes)
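# Keep an untouched copy of the unfiltered data for reference
gapminder_raw <- gapminder_unfiltered

# Preview the data with snake_case column names and the years elapsed since
# each observation. year() and now() come from lubridate, which recent
# tidyverse versions attach. The result is printed, not assigned.
gapminder_unfiltered |>
  rename(life_exp = lifeExp,
         population = pop,
         gdp_per_cap = gdpPercap) |>
  mutate(years_since = year(now()) - year)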
## # A tibble: 3,313 × 7
##    country     continent  year life_exp population gdp_per_cap years_since
##    <fct>       <fct>     <int>    <dbl>      <int>       <dbl>       <dbl>
##  1 Afghanistan Asia       1952     28.8    8425333        779.          73
##  2 Afghanistan Asia       1957     30.3    9240934        821.          68
##  3 Afghanistan Asia       1962     32.0   10267083        853.          63
##  4 Afghanistan Asia       1967     34.0   11537966        836.          58
##  5 Afghanistan Asia       1972     36.1   13079460        740.          53
##  6 Afghanistan Asia       1977     38.4   14880372        786.          48
##  7 Afghanistan Asia       1982     39.9   12881816        978.          43
##  8 Afghanistan Asia       1987     40.8   13867957        852.          38
##  9 Afghanistan Asia       1992     41.7   16317921        649.          33
## 10 Afghanistan Asia       1997     41.8   22227415        635.          28
## # ℹ 3,303 more rows
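# Use the unfiltered data under the name `gapminder` for the rest of the
# analysis (this masks the filtered `gapminder` object from the package)
gapminder <- gapminder_unfiltered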

Goal 1: Business Scenario

Customer or Audience

The primary audience for this analysis is international health and economic bodies, such as those at the World Health Organization (WHO), United Nations (UN), and the World Bank. These stakeholders are interested in understanding the drivers of health and economic outcomes to inform global development policies.

Problem Statement

To inform development policy, global organizations want to understand which economic and demographic indicators most significantly influence life expectancy across countries and over time.

Scope

The analysis uses the gapminder dataset, which includes lifeExp (life expectancy), gdpPercap (GDP per capita), and pop (population), along with continent, country, and year for geographic and temporal context. We would use exploratory data analysis, regression modelling, and time series analysis, among other techniques, to understand how these variables depend on one another.

There are a few assumptions that we need to consider for the analysis:

  • Country-level data is representative of regional/global trends.

  • The dataset provides a valid sample for inferential conclusions.

Objective

Identify and quantify the key predictors that influence life expectancy and examine how these relationships vary across regions and time.

Goal 2: Model Critique

For the business scenario above, the following analyses would yield stronger inferences than the exploration performed in the lab notebook.

1. Multivariate Regression Models

Critique: The lab primarily focuses on simple comparisons and correlation metrics, which limits its ability to account for confounding variables.

Improvement: Fit a multivariate linear regression model with predictors like gdpPercap, pop, continent, and year.

  1. Life expectancy is influenced by multiple factors simultaneously:
       • Multivariate regression helps isolate the effect of each variable while controlling for the others.
       • Adding categorical variables (like continent) allows exploration of regional disparities.
  2. The fitted model gives us three things to take away: the direction and magnitude of each variable's influence (e.g., does GDP per capita have a stronger effect than population size?), the statistical significance of each predictor (using p-values and confidence intervals), and the overall model fit (e.g., R-squared), which shows how well the predictors explain life expectancy.

Example:

model <- lm(lifeExp ~ gdpPercap + pop + continent + year, data = gapminder)
summary(model)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop + continent + year, data = gapminder)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.0477  -2.8077   0.2937   3.3614  21.1087 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.527e+02  1.337e+01 -26.374   <2e-16 ***
## gdpPercap          2.750e-04  1.126e-05  24.414   <2e-16 ***
## pop                1.573e-09  1.053e-09   1.494    0.135    
## continentAmericas  1.550e+01  3.813e-01  40.643   <2e-16 ***
## continentAsia      1.094e+01  3.701e-01  29.544   <2e-16 ***
## continentEurope    1.971e+01  3.342e-01  58.997   <2e-16 ***
## continentFSU       1.572e+01  5.725e-01  27.462   <2e-16 ***
## continentOceania   1.699e+01  5.188e-01  32.756   <2e-16 ***
## year               2.027e-01  6.760e-03  29.978   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.042 on 3304 degrees of freedom
## Multiple R-squared:  0.7372, Adjusted R-squared:  0.7366 
## F-statistic:  1159 on 8 and 3304 DF,  p-value: < 2.2e-16
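
The confidence intervals mentioned above are not printed by summary(). The following is a minimal sketch of how they could be extracted, assuming the same formula and the unfiltered data; the object names gap and ci are illustrative and not part of the lab.

```r
library(gapminder)

# Assumption: same data and formula as the example above
gap <- gapminder_unfiltered
model <- lm(lifeExp ~ gdpPercap + pop + continent + year, data = gap)

# 95% t-based confidence intervals, one row per coefficient
ci <- confint(model, level = 0.95)
ci

# Adjusted R-squared, extracted programmatically rather than read off the printout
summary(model)$adj.r.squared
```

An interval that contains zero (as the p-value for pop in the printout suggests it will) indicates that the data cannot distinguish that predictor's effect from no effect once the other variables are controlled for.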

2. Bootstrapped Confidence Intervals for Regression Coefficients

Critique: While confidence intervals are mentioned, the lab doesn’t utilize bootstrapping to empirically validate parameter uncertainty.

Improvement: Apply bootstrapping to regression coefficients by repeatedly resampling the data.

  1. Bootstrapping does not rely on strict normality assumptions. It’s especially useful when the sample size is small or the distribution of residuals is skewed. It also provides a more robust sense of variability in the estimates.
  2. This exercise provides us with confidence intervals based on empirical distributions rather than theoretical ones, and with more reliable inferences about the stability of the regression coefficients.

Example:

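library(boot)

# Statistic for boot(): refit the model on the resampled rows and return its coefficients
boot_fn <- function(data, index) {
  coef(lm(lifeExp ~ gdpPercap, data = data, subset = index))
}

# 1,000 bootstrap replicates (no seed is set, so results vary slightly between runs)
boot_results <- boot(gapminder, boot_fn, R = 1000)
boot_results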
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = gapminder, statistic = boot_fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 5.781577e+01 -7.764607e-03 4.153905e-01
## t2* 6.562489e-04  1.375730e-06 3.668699e-05
plot(boot_results)
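
The printed bootstrap statistics give a standard error but not an interval. As a hedged sketch, a percentile confidence interval for the gdpPercap slope could be obtained with boot.ci(). The seed value and the name slope_ci are assumptions added for illustration, and the rows are resampled with data[index, ] instead of the subset argument.

```r
library(boot)
library(gapminder)

set.seed(123)  # assumed seed, used only to make the sketch reproducible

# Statistic: coefficients of the simple regression refit on resampled rows
boot_fn <- function(data, index) {
  coef(lm(lifeExp ~ gdpPercap, data = data[index, ]))
}

boot_results <- boot(gapminder_unfiltered, boot_fn, R = 1000)

# Percentile interval for the second coefficient (the gdpPercap slope)
slope_ci <- boot.ci(boot_results, type = "perc", index = 2)
slope_ci
```

Resampling individual rows treats the observations as independent. Because each country contributes many rows, resampling whole countries instead may reflect the uncertainty more faithfully; that variant is not shown here.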

3. Time-Series and Panel Data Exploration

Critique: The original lab does not take advantage of the temporal nature of the Gapminder dataset.

Improvement: Use a grouped analysis to explore how life expectancy has changed over time within countries.

  1. Life expectancy trends over time can vary by country, and these trends may reflect unique historical, social, or political influences. Time-based analysis also helps us distinguish between global and local effects. Additionally, this approach reveals patterns that static models (e.g., for one year only) might miss.
  2. The key output is the rate of change in life expectancy by country (lifeExp_slope): the estimated number of years by which life expectancy increased or decreased per calendar year, on average. This helps us identify outlier countries where life expectancy improved or worsened rapidly, which in turn supports evidence-based policy recommendations tailored by region and time frame.

Example:

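# Fit a linear trend per country and keep the slope
# (years of life expectancy gained per calendar year)
country_slopes <- gapminder |>
  group_by(country) |>
  summarise(lifeExp_slope = coef(lm(lifeExp ~ year))[2])

# Keep the 10 largest and 10 smallest slopes.
# Note: arrange() sorts NA slopes (countries with a single observation, if any) last.
top_bottom_countries <- country_slopes |>
  arrange(desc(lifeExp_slope)) |>
  slice(c(1:10, (n() - 9):n()))

# Create a flag for visualization
top_bottom_countries <- top_bottom_countries |>
  mutate(
    type = ifelse(row_number() <= 10, "Top 10", "Bottom 10")
  )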

# Plot
ggplot(top_bottom_countries, aes(x = reorder(country, lifeExp_slope), y = lifeExp_slope, fill = type)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top and Bottom 10 Countries by Life Expectancy Slope",
    x = "Country",
    y = "Life Expectancy Slope (Years per Year)"
  ) +
  scale_fill_manual(values = c("Top 10" = "steelblue", "Bottom 10" = "firebrick")) +
  theme(
    axis.text = element_text(size = 25),
    axis.title.x = element_text(size = 20),
    axis.title.y = element_text(size = 20),
    plot.title = element_text(size = 20),
    legend.key.size = unit(2, "cm"),
    legend.text = element_text(size = 18),
    legend.title = element_text(size = 14),
    panel.background = element_rect(fill = "white"),
    panel.grid.major = element_line(color = "grey")
  )

Goal 3: Ethical and Epistemological Concerns

  • Bias and Representation: The gapminder dataset, while comprehensive, may underrepresent certain countries, especially those with limited data infrastructure. This underrepresentation can lead to biased analyses and conclusions.

    Example: During the COVID-19 pandemic, discrepancies in data reporting were observed between authoritarian and democratic regimes. A study highlighted that authoritarian governments were more likely to manipulate COVID-19 data, leading to underreporting of cases and deaths. This manipulation can skew global health analyses and policy decisions. Analyses based on such data may overlook the true impact of health crises in underrepresented regions, leading to misinformed policy decisions and resource allocations.

  • Misinterpretation and Misuse of Data: Misinterpretation or misuse of data can lead to flawed policy decisions, eroding public trust and potentially causing harm.

    Example: In 2020, a study published in The Lancet, based on data from the company Surgisphere, reported that hydroxychloroquine was associated with increased mortality in COVID-19 patients; a related paper in The New England Journal of Medicine relied on the same data source. The Lancet study influenced global health policy, including the World Health Organization's decision to pause the hydroxychloroquine arm of its clinical trial. The validity of the data was later questioned, and both papers were retracted. This incident underscores the critical need for data transparency and rigorous peer review to prevent the dissemination of misleading information that can have global health implications.

  • Uncertainty and Overgeneralization: Statistical models often rely on assumptions that may not hold true across different contexts, leading to overgeneralizations.

    Example: The Institute for Health Metrics and Evaluation (IHME) released COVID-19 mortality projections that significantly influenced U.S. policy decisions. However, these models faced criticism for being overly optimistic and for not accounting for various uncertainties, which may have led to underestimation of the pandemic’s impact. Overreliance on models without considering their limitations can result in policies that are ill-prepared for worst-case scenarios, which is why model predictions need cautious interpretation.

  • Stakeholder Impact: Data-driven decisions can have profound effects on populations, especially when data is misrepresented or misinterpreted.

    Example: In 2013, the World Health Organization erroneously reported that half of new HIV cases in Greece were self-inflicted to obtain financial benefits. This claim was later retracted, but not before causing significant stigma and misinformation. People living with HIV were stigmatized further, and public health efforts were undermined by shifting attention from structural issues to individual blame. Such instances highlight the ethical responsibility of organizations to ensure data accuracy and the potential harm caused by misinformation.