library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(ggplot2)
gapminder_raw <- gapminder_unfiltered
gapminder <- gapminder_unfiltered |>
  rename(life_exp = "lifeExp",
         population = "pop",
         gdp_per_cap = "gdpPercap") |>
  mutate(years_since = year(now()) - year)

Goal 1: Business Scenario

Goal 2: Model Critique

Improvement 1

The data uses mean life expectancy, which can be heavily skewed by high infant mortality rates. The plot below shows the difference in life expectancy over time for a country with high infant mortality (Afghanistan) and one with low infant mortality (United Kingdom). We propose adding a variable for infant mortality rate to help control for this effect.

    gapminder$infant <- ifelse(gapminder$country == "United Kingdom", 3.9, NA)
    gapminder$infant <- ifelse(gapminder$country == "Afghanistan", 50, gapminder$infant)

    gapminder_filter <- gapminder %>%
      filter(country %in% c("United Kingdom", "Afghanistan"))

    ggplot(data = gapminder_filter, aes(x = year, y = life_exp)) +
      geom_point(aes(color = country)) +
      labs(x = "Year",
           y = "Life Expectancy",
           title = "Life Expectancy for Countries with Low and High Infant Mortality")
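
One way to attach an infant mortality variable is to keep the rates in a separate lookup table and join it onto the main data, rather than chaining `ifelse()` calls. The sketch below is illustrative only: `infant_lookup` and `gapminder_infant` are hypothetical names, the two rates are the placeholder values already used above, and a real analysis would need country-by-year rates from an external source.

```r
library(dplyr)
library(gapminder)

# Hypothetical lookup table. The two rates are the placeholder values used
# above, not figures taken from the gapminder data.
infant_lookup <- data.frame(
  country = c("United Kingdom", "Afghanistan"),
  infant  = c(3.9, 50)
)

gapminder_infant <- gapminder_unfiltered |>
  mutate(country = as.character(country)) |>  # match the lookup's key type
  left_join(infant_lookup, by = "country")    # unmatched countries get NA

# Inspect the countries that received an infant mortality value
gapminder_infant |>
  filter(!is.na(infant)) |>
  count(country, infant)
```

With a lookup table, adding more countries means adding rows to the table rather than writing another `ifelse()` line.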

Improvement 2

Additionally, we suggest fitting a generalized linear model (GLM) instead of bootstrapping. A GLM estimates the effect of each independent variable, which helps us make better-informed decisions in the future. We use a Poisson GLM on the assumption that the life expectancy variable more closely approximates a Poisson distribution. Because the Poisson family expects integer counts, life expectancy is rounded to the nearest year before fitting.

model <- glm(round(life_exp) ~ year + gdp_per_cap + continent, data = gapminder, family = "poisson")
summary(model)
## 
## Call:
## glm(formula = round(life_exp) ~ year + gdp_per_cap + continent, 
##     family = "poisson", data = gapminder)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.458e+00  2.768e-01  -8.882   <2e-16 ***
## year               3.203e-03  1.398e-04  22.913   <2e-16 ***
## gdp_per_cap        3.675e-06  2.157e-07  17.038   <2e-16 ***
## continentAmericas  2.785e-01  8.210e-03  33.919   <2e-16 ***
## continentAsia      2.076e-01  7.960e-03  26.077   <2e-16 ***
## continentEurope    3.393e-01  7.279e-03  46.617   <2e-16 ***
## continentFSU       2.794e-01  1.180e-02  23.666   <2e-16 ***
## continentOceania   3.006e-01  1.073e-02  28.012   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7605.1  on 3312  degrees of freedom
## Residual deviance: 2260.4  on 3305  degrees of freedom
## AIC: 22152
## 
## Number of Fisher Scoring iterations: 4
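
Because the Poisson family uses a log link by default, each coefficient is an effect on the log of expected life expectancy, so exponentiating it gives a multiplicative effect. As a rough illustration, the sketch below copies three estimates from the summary above (rather than refitting the model) and converts them to approximate percentage changes. The names `est` and `pct_change`, and the choice to rescale GDP per capita to units of 1,000, are assumptions made here for readability.

```r
# Estimates copied from the summary output above (log scale)
est <- c(
  year            = 3.203e-03,
  gdp_per_1000    = 3.675e-06 * 1000,  # rescaled: 1,000 more GDP per capita
  continentEurope = 3.393e-01          # relative to the reference level, Africa
)

# exp(beta) is the multiplicative change in expected life expectancy;
# (exp(beta) - 1) * 100 expresses it as a percentage change
pct_change <- (exp(est) - 1) * 100
round(pct_change, 2)
```

Read this way, the model implies roughly a 0.3% increase in expected life expectancy per calendar year, and about 40% higher expected life expectancy in Europe than in Africa, holding the other predictors fixed.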

Improvement 3

Finally, the data is imbalanced in the number of records per country. European countries have records for more years than countries in Africa or Asia, so European data is overrepresented. We would need either to collect data from the other countries to balance the data set, or to condense the data into one record per decade.

# Count records per country, keeping continent as a grouping key so that
# summarise() returns exactly one row per group
gap_group <- gapminder %>%
  group_by(country, continent) %>%
  summarise(n_records = n(), .groups = "drop")
ggplot(data = gap_group, aes(x = reorder(country, n_records), y = n_records)) +
  geom_point(aes(color = continent)) +
  theme(axis.text.x = element_blank()) +
  labs(x = "",
       y = "Survey Records",
       title = "Number of Observations per Country, by Continent")
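
Condensing to one record per country per decade could be done by deriving a decade from `year` and averaging within it. This is a minimal sketch that assumes a simple mean is an acceptable summary. The names `decade`, `n_years`, and `gapminder_decade` are introduced here for illustration, and population or GDP might warrant a different aggregate.

```r
library(dplyr)
library(gapminder)

gapminder_decade <- gapminder_unfiltered |>
  mutate(decade = floor(year / 10) * 10) |>   # e.g. 1957 -> 1950
  group_by(country, continent, decade) |>
  summarise(
    life_exp    = mean(lifeExp),
    gdp_per_cap = mean(gdpPercap),
    n_years     = n(),                        # raw records averaged into this row
    .groups     = "drop"
  )

# Each country now contributes at most one row per decade
gapminder_decade |>
  count(country, decade) |>
  summarise(max_rows = max(n))
```

Countries that only have data for some decades would still contribute fewer rows, so this reduces the imbalance without removing it.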

Goal 3: Ethical and Epistemological Concerns