library(tidyverse)
library(gapminder)
library(ggthemes)
gapminder_raw <- gapminder_unfiltered
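# Preview a tidied copy: snake_case column names plus years elapsed since each observation.
# The result is printed, not saved; later code keeps the original column names.
# year() and now() come from lubridate, which tidyverse 2.0+ attaches.
gapminder_unfiltered |>
  rename(life_exp = "lifeExp",
         population = "pop",
         gdp_per_cap = "gdpPercap") |>
  mutate(years_since = year(now()) - year)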
## # A tibble: 3,313 × 7
## country continent year life_exp population gdp_per_cap years_since
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 73
## 2 Afghanistan Asia 1957 30.3 9240934 821. 68
## 3 Afghanistan Asia 1962 32.0 10267083 853. 63
## 4 Afghanistan Asia 1967 34.0 11537966 836. 58
## 5 Afghanistan Asia 1972 36.1 13079460 740. 53
## 6 Afghanistan Asia 1977 38.4 14880372 786. 48
## 7 Afghanistan Asia 1982 39.9 12881816 978. 43
## 8 Afghanistan Asia 1987 40.8 13867957 852. 38
## 9 Afghanistan Asia 1992 41.7 16317921 649. 33
## 10 Afghanistan Asia 1997 41.8 22227415 635. 28
## # ℹ 3,303 more rows
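# Work with the unfiltered data under the shorter name `gapminder`
# (this masks the package's filtered `gapminder` dataset).
gapminder <- gapminder_unfiltered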
The primary audience for this analysis is international health and economic bodies, such as those at the World Health Organization (WHO), United Nations (UN), and the World Bank. These stakeholders are interested in understanding the drivers of health and economic outcomes to inform global development policies.
To inform development policy, global organizations want to understand which economic and demographic indicators most significantly influence life expectancy across countries and over time.
The analysis uses the gapminder dataset, which includes lifeExp (life expectancy), gdpPercap (GDP per capita), pop (population), and continent, country, and year (geographic and temporal context). We would use exploratory data analysis, regression modelling, and time series analysis, among other techniques, to understand the dependencies among these variables.
There are a few assumptions that we need to consider for the analysis:
Country-level data is representative of regional/global trends.
The dataset provides a valid sample for inferential conclusions.
Identify and quantify the key predictors that influence life expectancy and examine how these relationships vary across regions and time.
For this business proposal, the following analyses would extend the exploration performed in the lab notebook and support stronger inferences.
Critique: The lab primarily focuses on simple comparisons and correlation metrics, which limits the ability to account for confounding variables.
Improvement: Fit a multivariate linear regression model with predictors like gdpPercap, pop, continent, and year.
Example:
model <- lm(lifeExp ~ gdpPercap + pop + continent + year, data = gapminder)
summary(model)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop + continent + year, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.0477 -2.8077 0.2937 3.3614 21.1087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.527e+02 1.337e+01 -26.374 <2e-16 ***
## gdpPercap 2.750e-04 1.126e-05 24.414 <2e-16 ***
## pop 1.573e-09 1.053e-09 1.494 0.135
## continentAmericas 1.550e+01 3.813e-01 40.643 <2e-16 ***
## continentAsia 1.094e+01 3.701e-01 29.544 <2e-16 ***
## continentEurope 1.971e+01 3.342e-01 58.997 <2e-16 ***
## continentFSU 1.572e+01 5.725e-01 27.462 <2e-16 ***
## continentOceania 1.699e+01 5.188e-01 32.756 <2e-16 ***
## year 2.027e-01 6.760e-03 29.978 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.042 on 3304 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.7366
## F-statistic: 1159 on 8 and 3304 DF, p-value: < 2.2e-16
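One way to read the coefficient table above is through interval estimates rather than p-values alone. The following is a minimal sketch, assuming the gapminder package is installed and that the same specification is refit on gapminder_unfiltered; the names `fit`, `ci`, and `excludes_zero` are illustrative and do not come from the lab.

```r
library(gapminder)

# Refit the same specification as above on the unfiltered data
# (assumption: the package's original column names are used).
fit <- lm(lifeExp ~ gdpPercap + pop + continent + year,
          data = gapminder_unfiltered)

# 95% t-based confidence intervals for every coefficient
ci <- confint(fit, level = 0.95)

# Flag coefficients whose interval excludes zero
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
print(data.frame(lower = signif(ci[, 1], 3),
                 upper = signif(ci[, 2], 3),
                 excludes_zero = excludes_zero))
```

Given the p-value of 0.135 reported for pop above, its 95% interval should span zero, which is a more direct way of saying that the data do not pin down the sign of that effect.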
Critique: While confidence intervals are mentioned, the lab doesn’t utilize bootstrapping to empirically validate parameter uncertainty.
Improvement: Apply bootstrapping to regression coefficients by repeatedly resampling the data.
Example:
library(boot)
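# Statistic for boot(): refit the regression on the resampled rows
# and return its coefficients (intercept and gdpPercap slope).
boot_fn <- function(data, index) {
  coef(lm(lifeExp ~ gdpPercap, data = data, subset = index))
}
# Draw 1000 bootstrap resamples of the rows
boot_results <- boot(gapminder, boot_fn, R = 1000)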
boot_results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = gapminder, statistic = boot_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 5.781577e+01 -7.764607e-03 4.153905e-01
## t2* 6.562489e-04 1.375730e-06 3.668699e-05
plot(boot_results)
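The bootstrap standard errors above can also be turned into an interval. This is a minimal sketch, assuming the gapminder package is installed; it uses `boot.ci()` from the boot package to compute a percentile interval for the gdpPercap slope. The seed value and the names `coef_fn`, `boot_out`, and `slope_ci` are arbitrary choices for illustration, not part of the lab.

```r
library(boot)
library(gapminder)

set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Statistic: refit the simple regression on the resampled rows
coef_fn <- function(data, index) {
  coef(lm(lifeExp ~ gdpPercap, data = data[index, ]))
}

boot_out <- boot(gapminder_unfiltered, coef_fn, R = 1000)

# Percentile interval for the second statistic (the gdpPercap slope)
slope_ci <- boot.ci(boot_out, conf = 0.95, type = "perc", index = 2)
print(slope_ci)
```

The percentile method is only one option; `boot.ci()` also supports other interval types (for example `"bca"`), which may be preferable when the bootstrap distribution is skewed.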
Critique: The original lab does not take advantage of the temporal nature of the Gapminder dataset.
Improvement: Use a grouped analysis to explore how life expectancy has changed over time within countries.
Example:
# Estimate each country's linear trend in life expectancy (years gained per calendar year)
country_slopes <- gapminder |>
  group_by(country) |>
  summarise(lifeExp_slope = coef(lm(lifeExp ~ year))[2])
# Keep the 10 steepest and 10 shallowest slopes and flag them for visualization
top_bottom_countries <- country_slopes |>
  arrange(desc(lifeExp_slope)) |>
  slice(c(1:10, (n() - 9):n())) |>
  mutate(type = ifelse(row_number() <= 10, "Top 10", "Bottom 10"))
# Plot
ggplot(top_bottom_countries,
       aes(x = reorder(country, lifeExp_slope), y = lifeExp_slope, fill = type)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top and Bottom 10 Countries by Life Expectancy Slope",
    x = "Country",
    y = "Life Expectancy Slope (Years per Year)"
  ) +
  scale_fill_manual(values = c("Top 10" = "steelblue", "Bottom 10" = "firebrick")) +
  theme(axis.text = element_text(size = 25),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        plot.title = element_text(size = 20),
        legend.key.size = unit(2, "cm"),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 14),
        panel.background = element_rect(fill = "white"),
        panel.grid.major = element_line(color = "grey"))
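Per-country slopes are only as reliable as the number of observations behind them, and countries in gapminder_unfiltered may not all have the same number of years recorded. The sketch below, which assumes dplyr and the gapminder package are installed, restricts the slope calculation to countries with a minimum number of observations. The threshold `min_obs` is a hypothetical value chosen for illustration, not one taken from the lab.

```r
library(dplyr)
library(gapminder)

min_obs <- 5  # hypothetical threshold, not from the lab

country_slopes <- gapminder_unfiltered |>
  group_by(country) |>
  filter(n() >= min_obs) |>   # drop countries with too few observations
  summarise(
    n_obs = n(),
    lifeExp_slope = coef(lm(lifeExp ~ year))[["year"]]
  )

# Inspect the countries retained with the fewest observations
country_slopes |>
  arrange(n_obs) |>
  head()
```

Reporting `n_obs` alongside each slope lets a reader judge how much weight a given country's trend deserves before it appears in a top or bottom ranking.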
Bias and Representation: The gapminder dataset, while comprehensive, may underrepresent certain countries, especially those with limited data infrastructure. This underrepresentation can lead to biased analyses and conclusions.
Example: During the COVID-19 pandemic, discrepancies in data reporting were observed between authoritarian and democratic regimes. A study highlighted that authoritarian governments were more likely to manipulate COVID-19 data, leading to underreporting of cases and deaths. This manipulation can skew global health analyses and policy decisions. Analyses based on such data may overlook the true impact of health crises in underrepresented regions, leading to misinformed policy decisions and resource allocations.
Misinterpretation and Misuse of Data: Misinterpretation or misuse of data can lead to flawed policy decisions, eroding public trust and potentially causing harm.
Example: In 2020, two studies based on data from the company Surgisphere were published in The Lancet and The New England Journal of Medicine. The Lancet study claimed that hydroxychloroquine was associated with increased mortality in COVID-19 patients, and it influenced global health policy, including the World Health Organization's decision to pause the hydroxychloroquine arm of a clinical trial. The data's validity was later questioned, and both studies were retracted. This incident underscores the critical need for data transparency and rigorous peer review to prevent the dissemination of misleading information that can have global health implications.
Uncertainty and Overgeneralization: Statistical models often rely on assumptions that may not hold true across different contexts, leading to overgeneralizations.
Example: The Institute for Health Metrics and Evaluation (IHME) released COVID-19 mortality projections that significantly influenced U.S. policy decisions. However, these models faced criticism for being overly optimistic and not accounting for various uncertainties, leading to potential underestimation of the pandemic's impact. Overreliance on models without considering their limitations can result in policies that are ill-prepared for worst-case scenarios, emphasizing the need for cautious interpretation of model predictions.
Stakeholder Impact: Data-driven decisions can have profound effects on populations, especially when data is misrepresented or misinterpreted.
Example: In 2013, the World Health Organization erroneously reported that half of new HIV cases in Greece were self-inflicted to obtain financial benefits. This claim was later retracted, but not before causing significant stigma and misinformation. People living with HIV were stigmatized further, and public health efforts were undermined by shifting attention from structural issues to individual blame. Such instances highlight the ethical responsibility of organizations to ensure data accuracy and the potential harm caused by misinformation.