I chose week 6 lab notes for this week’s Model Critique assignment.
For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
Audience: Public health agencies, governmental policymakers, or international organizations like the WHO (World Health Organization) that focus on improving health outcomes based on demographic trends.
Problem statement: The health department needs to know which demographic factors most significantly influence life expectancy in order to design targeted interventions for improving public health.
Relevant Variables:
Life expectancy (life_exp) – dependent variable.
Population size (population), GDP per capita (gdp_per_cap)– independent variables.
Analysis Methods:
Descriptive statistics (for initial exploration).
Correlation analysis: To understand how strongly life expectancy correlates with other demographic factors like population size and GDP.
Regression analysis: To assess how population and GDP (and other variables) affect life expectancy.
Bootstrapping: For confidence interval estimation, especially if data is non-normally distributed.
Visualization: Scatter plots and regression plots for visual insights.
Assumptions:
Data is representative of the regions/countries of interest.
Assumes no major outliers or missing data for key variables.
The goal is to identify the factors that most strongly influence life expectancy across different countries.
We will consider the effect of population size and GDP per capita (and possibly other factors) on life expectancy.
The analysis will help determine what policy interventions (e.g., increasing healthcare access or education) might improve life expectancy.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(ggthemes)
# let's start by saving the original dataset as "raw"
gapminder_raw <- gapminder_unfiltered
# we choose to overwrite the "filtered" gm data
gapminder <- gapminder_unfiltered |>
# rename columns
rename(life_exp = "lifeExp",
population = "pop",
gdp_per_cap = "gdpPercap") |>
# get deviation column
mutate(years_since = year(now()) - year)
gapminder <-
gapminder |>
group_by(year) |>
mutate(pop_rank = rank(population),
pop_rank_desc = rank(desc(population))) |>
ungroup()
gapminder <-
gapminder |>
group_by(continent, year) |>
mutate(pop_median = median(population),
pop_half = ifelse(population >= median(population),
"upper 50%",
"lower 50%")) |>
ungroup()
# plotting just for a randomly selected year
plot_year <- sample(gapminder$year, 1)
gapminder |>
filter(year == plot_year) |>
ggplot() +
geom_point(mapping = aes(x = gdp_per_cap,
y = continent,
colour = pop_half)) +
scale_x_log10(labels=\(x) paste("$", x/1e3, "K", sep = "")) +
annotation_logticks(sides = 'b') +
scale_colour_brewer(palette = "Dark2") +
theme_hc() +
labs(title = paste("GDP Per Capita Across Countries in",
plot_year),
subtitle = "Colored by population (above/below the median)",
x = 'GDP Per Capita',
y = 'Continent',
colour = 'Country Population')
# breaks need to cover lowest and highest values
oom_breaks <- 10 ^ c(4, 5, 6, 7, 8, 9, 10)
# there should be one LESS label (these are "in between breaks")
oom_labels <- c("10-100K", "100K-1M", "1-10M", "10-100M", "100M-1B", ">1B")
gapminder <-
gapminder |>
mutate(pop_oom = cut(population, breaks = oom_breaks,
labels = oom_labels, right = TRUE))
gapminder <-
gapminder |>
group_by(country) |>
mutate(gdp_diff = gdp_per_cap - lag(gdp_per_cap, n = 1,
order_by = year),
year_diff = year - lag(year, n = 1,
order_by = year)) |>
ungroup()
gapminder <-
gapminder |>
mutate(is_us = country == "United States")
# here, we apply the `length` function to each split string
ct <- map_int(str_split(gapminder$country, " "), length)
gapminder <- gapminder |>
mutate(country_name_ct = str_count(country, " ") + 1)
# saving this vector for a bootstrapping example
life_exp_2000s <-
gapminder |>
filter(year >= 2000) |>
pluck("life_exp")
ggplot() +
geom_histogram(mapping = aes(x = life_exp_2000s),
colour='white') +
theme_hc()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
bootstrap <- function (x, func=mean, n_iter=10^4) {
# empty vector to be filled with values from each iteration
func_values <- c(NULL)
# we simulate sampling `n_iter` times
for (i in 1:n_iter) {
# pull the sample
x_sample <- sample(x, size = length(x), replace = TRUE)
# add on this iteration's value to the collection
func_values <- c(func_values, func(x_sample))
}
return(func_values)
}
life_exp_means <- bootstrap(life_exp_2000s)
ggplot() +
geom_histogram(mapping = aes(x = life_exp_means),
color='white') +
labs(title = "10K Bootstrapped Sample Means of `life_exp_2000s`",
subtitle = "A *simulation* of the true sampling distribution",
x = "Bootstrapped Sample Mean",
y = "Number of Samples") +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
life_exp_se_boot <- sd(life_exp_means)
life_exp_se_boot
## [1] 0.4426836
life_exp_se_est <- sd(life_exp_2000s) / sqrt(length(life_exp_2000s))
life_exp_se_est
## [1] 0.4439835
avg_life_exp <- mean(life_exp_2000s)
avg_life_exp
## [1] 70.37521
# *rounded* mean and std dev from our sample of data
f_life_exp <- \(x) dnorm(x, mean = 70, sd = 0.4)
ggplot() +
geom_function(xlim = c(68.5, 71.5),
fun = f_life_exp) +
geom_segment(mapping = aes(x = avg_life_exp - 0.4,
y = f_life_exp(avg_life_exp),
xend = avg_life_exp + 0.4,
yend = f_life_exp(avg_life_exp),
linetype = "proposed interval"),
color = "gray") +
geom_point(mapping = aes(x = avg_life_exp,
y = f_life_exp(avg_life_exp),
color = "our sample"), size = 2) +
labs(title = "*Possible* Sampling Distribution for `life_exp_2000s`",
subtitle ='In this case, the "true" mean is 70 years',
x = "Sample Mean",
y = "Probability Density",
color = "",
linetype = "") +
theme_minimal()
# we could pick anything here ...
mu_ <- 2
se_ <- 4
z_ <- 1 # number of stdevs from the mean
# the percentage of values within z_ stdevs from the mean
pnorm(mu_ + z_ * se_, mu_, se_) - pnorm(mu_ - z_ * se_, mu_, se_)
## [1] 0.6826895
P <- 0.95 # % confidence
# z score (we need (1 - P)/2 of data to the right/upper tail)
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
ggplot() +
geom_function(xlim = c(-4, 4), fun = function(x) dnorm(x)) +
geom_vline(mapping = aes(xintercept = c(-z_score, z_score)),
color = "orange") +
geom_text(mapping = aes(x = -z_score - 0.1, y = .25),
label = "z-score (2.5%)", color = "orange", hjust = "right") +
geom_text(mapping = aes(x = z_score + 0.1, y = .25),
label = "z-score (97.5%)", color = "orange", hjust = "left") +
scale_x_continuous(breaks = -4:4) +
labs(title = "Standard Normal Curve",
x = "Standard Deviations from Mean 0",
y = "Probability Density") +
theme_minimal() +
theme(legend.position="none")
P <- 0.95 # % confidence
# z score (we need (1 - P)/2 of data to the right/upper tail)
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
ggplot() +
geom_function(xlim = c(-4, 4), fun = function(x) dnorm(x)) +
geom_vline(mapping = aes(xintercept = c(-z_score, z_score)),
color = "orange") +
geom_text(mapping = aes(x = -z_score - 0.1, y = .25),
label = "z-score (2.5%)", color = "orange", hjust = "right") +
geom_text(mapping = aes(x = z_score + 0.1, y = .25),
label = "z-score (97.5%)", color = "orange", hjust = "left") +
scale_x_continuous(breaks = -4:4) +
labs(title = "Standard Normal Curve",
x = "Standard Deviations from Mean 0",
y = "Probability Density") +
theme_minimal() +
theme(legend.position="none")
P <- .95 # % confidence
n <- length(life_exp_2000s)
# t-statistic with n-1 degrees of freedom
t_star <- qt(p = (1 - P)/2, df=n - 1, lower.tail=FALSE)
# t half-width
CI_t <- t_star * life_exp_se_boot
# z-score
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
# z half-width
CI_z <- z_score * life_exp_se_boot
c(avg_life_exp - CI_t, avg_life_exp + CI_t) # t-distribution
## [1] 69.50565 71.24478
c(avg_life_exp - CI_z, avg_life_exp + CI_z) # normal distribution
## [1] 69.50757 71.24286
# for the sake of example, get 100 means and their standard error
boot_n <- 100
boot_means <- bootstrap(life_exp_2000s, n_iter = boot_n)
boot_se <- sd(boot_means)
true_mean <- mean(life_exp_2000s)
t_star <- qt(p = (1 - P)/2, df=boot_n - 1, lower.tail=FALSE)
lowers <- boot_means - t_star * boot_se
uppers <- boot_means + t_star * boot_se
mask <- (lowers < true_mean) & (true_mean < uppers)
ggplot() +
geom_segment(mapping = aes(x = lowers, xend = uppers,
y = 1:length(boot_means),
yend = 1:length(boot_means),
color = mask)) +
geom_vline(xintercept = true_mean, color = "orange") +
scale_color_manual(values = c("darkgray", "red"),
breaks = c(TRUE, FALSE)) +
theme_minimal() +
theme(legend.position="none") +
labs(title = "100 Confidence Intervals for `life_exp`",
x = "CI boundaries",
y = "Iteration")
library(boot)
# we use a "_" so we don't overwrite the original function
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 1000) {
# the `boot` library needs the function in this format
boot_func <- \(x, i) func(x[i], na.rm=TRUE)
b <- boot(v, boot_func, R = n_iter)
boot.ci(b, conf = conf, type = "perc")
}
boot_ci(life_exp_2000s, mean, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (69.52, 71.26 )
## Calculations and Intervals on Original Scale
boot_ci(life_exp_2000s, median, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (72.40, 74.19 )
## Calculations and Intervals on Original Scale
w <- rep(1, 10)
x <- 1:10
y <- 10:1
z <- rep(c(1, 2), 5)
# explore different combinations of these
cov(w, x)
## [1] 0
plot_data <-
gapminder |>
filter(country == "Portugal")
plot_data |>
ggplot() +
geom_point(mapping = aes(x = population, y = life_exp)) +
labs(x = "Population", y = "Life Expectancy",
title = "Life Expectancy as Population Increases",
subtitle = paste("Covariance:",
round(cov(plot_data$population,
plot_data$life_exp), 2))) +
theme_minimal()
round(cov(plot_data$life_exp, plot_data$pop_rank), 2)
## [1] 18.56
round(cor(plot_data$population, plot_data$life_exp), 2)
## [1] 0.97
round(cor(plot_data$years_since, plot_data$pop_rank), 2)
## [1] -0.08
s_rho <- cor(x = as.numeric(gapminder$pop_oom),
y = gapminder$country_name_ct,
method = "spearman")
s_rho
## [1] -0.06403513
gapminder |>
group_by(pop_oom, country_name_ct) |>
count() |>
ggplot() +
geom_point(mapping = aes(x = pop_oom, y = country_name_ct, size=n)) +
labs(x = "Population Order of Magnitude",
y = "Country Name Length (in words)",
title = "Length of Country Name vs. Population",
subtitle = paste("Spearman's Rho = ", round(s_rho, 2)),
size = "Number of Rows") +
theme_minimal()
plot_data <-
gapminder |>
filter(country == "Portugal")
plot_data |>
ggplot() +
geom_line(mapping = aes(x = year, y = population), color='lightblue') +
geom_point(mapping = aes(x = year, y = population), color='darkblue') +
scale_y_continuous(labels = \(x) paste(x / 10^6, "M")) +
labs(x = "Year", y = "Population (in millions)",
title = "Population in Portugal, Over Time",
subtitle = paste("Correlation:",
round(cor(plot_data$population,
plot_data$life_exp), 2),
" Observations:",
count(plot_data))) +
theme_hc()
Issue: The lab uses regression models to predict life expectancy based on population and GDP. However, these variables might be highly correlated (multicollinearity), which can lead to inaccurate estimates of model parameters.
Solution: Use the Variance Inflation Factor (VIF) to detect multicollinearity between predictors (population and GDP). If multicollinearity exists, consider removing or combining correlated predictors.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
# Check for VIF values
vif(lm(life_exp ~ population + gdp_per_cap, data = plot_data))
## population gdp_per_cap
## 12.96098 12.96098
Issue: The assumption of a linear relationship between life expectancy and predictors might be inappropriate. Life expectancy might not increase linearly with GDP or population size.
Solution: Test for non-linear relationships by transforming variables or fitting polynomial regression models.
# Example of fitting a polynomial model
lm_poly <- lm(life_exp ~ poly(gdp_per_cap, 2) + poly(population, 2), data = plot_data)
summary(lm_poly)
##
## Call:
## lm(formula = life_exp ~ poly(gdp_per_cap, 2) + poly(population,
## 2), data = plot_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.40615 -0.40999 -0.03885 0.39137 1.66683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.08410 0.08303 844.088 < 2e-16 ***
## poly(gdp_per_cap, 2)1 26.30986 2.56959 10.239 3.66e-14 ***
## poly(gdp_per_cap, 2)2 -6.11034 1.37528 -4.443 4.55e-05 ***
## poly(population, 2)1 17.42440 2.52709 6.895 6.65e-09 ***
## poly(population, 2)2 -0.96729 1.45190 -0.666 0.508
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6323 on 53 degrees of freedom
## Multiple R-squared: 0.9891, Adjusted R-squared: 0.9883
## F-statistic: 1208 on 4 and 53 DF, p-value: < 2.2e-16
Issue: In the lab, confidence intervals for parameters are not calculated using bootstrapping.
Solution: Implement bootstrapping to estimate the confidence intervals for model parameters, especially when the assumptions for traditional methods (like t-distribution) may not hold.
# Bootstrapping to estimate confidence intervals for model coefficients
set.seed(123)
boot_fn <- function(data, indices) {
model <- lm(life_exp ~ population + gdp_per_cap, data = data[indices, ])
return(coef(model))
}
boot_results <- boot(data = plot_data, statistic = boot_fn, R = 1000)
boot.ci(boot_results, type = "bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_results, type = "bca")
##
## Intervals :
## Level BCa
## 95% (-2.19, 38.74 )
## Calculations and Intervals on Original Scale
Bootstrapping provides a robust way to estimate confidence intervals and can be used when the data does not follow a normal distribution.
The bootstrap confidence interval for our parameters suggests that there is uncertainty in the estimate, and the true value of the parameter could include no effect at all (zero). Further investigations are required but not extending for this assignment purpose.
Potential Bias in Data: If the dataset includes countries with significantly different healthcare systems or economic statuses, the relationship between life expectancy and demographic factors might not be generalizable across all regions.
Solution: We can consider controlling for factors such as healthcare spending, education, or urbanization to account for these biases. Also, evaluate how underrepresented countries or regions might impact the analysis.
Misinterpretation of Results: If the model suggests that population size or GDP directly influences life expectancy, policymakers may take inappropriate actions without considering the broader social, political, or economic context.
Solution: Use caution when interpreting regression coefficients. For example, a strong positive correlation between GDP and life expectancy might not imply causation, and other unmeasured factors might be at play.
Resource Allocation: If the model incorrectly emphasizes certain factors (e.g., GDP) while underestimating others (e.g., education or healthcare), it could lead to skewed policy recommendations that disproportionately favor wealthier countries, ignoring the needs of poorer populations.
Solution: Ensure that all relevant variables (like healthcare access, education level, and income inequality) are considered in the model to avoid reinforcing existing inequalities.
Data Quality: The assumptions made about the data’s quality (e.g., missing data, sampling biases) could influence the model’s validity. What is measured, how it’s measured, and who measures it can all affect conclusions drawn.
Solution: Carefully check the data for missing values and assess the sampling method. Conduct sensitivity analyses to understand how results change with different assumptions.
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.