Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.

Group:

Kael, Kaitlyn, Chin-yu, Chandler, Connor

Description

First, review the materials from the Lesson on Ethics and Epistemology (week 5?). This includes lecture slides, the lecture video, or the reading. You can use these as reference materials for this lab. You may even consider the reading for the week associated with the lab, or even supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).

For the lab your group has been assigned, consider issues with models, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and possible solutions (even if you would need to request more data or resources to accomplish those solutions).

Share your model critique in this notebook as your data dive submission for the week.

As a start, think about the context of the lab and consider the following:

  • Analytical issues, such as model assumptions

  • Overcoming biases (existing or potential)

  • Possible risks or societal implications

  • Crucial issues which might not be measurable

Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).

Lab 6 Critique

Load Data and Packages

# let's start by saving the original dataset as "raw"
gapminder_raw <- gapminder_unfiltered

gapminder <- gapminder_unfiltered |>
  # rename columns
  rename(life_exp = "lifeExp",
         population = "pop",
         gdp_per_cap = "gdpPercap") |>
  # get deviation column
  mutate(years_since = year(now()) - year)

# it's best to test things out before setting variables
gapminder <- gapminder |> 
  group_by(continent, year) |>
  mutate(life_exp_dev = life_exp - mean(life_exp),  # deviation
         life_exp_avg = mean(life_exp)) |>       # group average
  arrange(continent, year)

gapminder <- ungroup(gapminder)

gapminder <- gapminder |> 
  group_by(continent, year) |>
  mutate(perc_of_continent = population/sum(population)) |>
  ungroup()

gapminder <- 
  gapminder |>
  group_by(year) |>                       # within each year ...
  mutate(pop_rank = rank(population),
         pop_rank_desc = rank(desc(population))) |>  # rank order populations
  ungroup()

gapminder <-
  gapminder |>
    group_by(continent, year) |>
    mutate(pop_median = median(population),
           pop_half = ifelse(population >= median(population),
                                 "upper half",
                                 "lower half")) |>
    ungroup()

First Critique -

#original code
plot_data <- 
  gapminder %>%
    filter(country == "Portugal")

plot_data %>%
  ggplot(mapping = aes(x = population, y = life_exp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Population", y = "Life Expectancy",
       title = "Life Expectancy as Population Increases",
       subtitle = paste("Covariance:", 
                        round(cov(plot_data$population, 
                                  plot_data$life_exp), 2),
                        "\nPearson Correlation:",
                        round(cor(plot_data$population, 
                                  plot_data$life_exp), 2))) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The key to looking at this chart is to understand that correlation, even if the value is extremely high as it is here, does not equal causation. For example, simply increasing your population will not guarantee an increase the countries life expectancy. There are likely a lot of confounding variables that are not modeled here that are playing a factor.

So even though simple linear regression appears to be a good fit to model this data there should be some more research done. We had more ethical issues than technical with this chart. We also thought that this was an unusual chart to create because the life expectancy of a country is dependent on more than just the population. It might have been better to use multivariate regression to consider other variables that might affect life expectancy, such as GDP, obesity rate, healthcare availability, and access to technology. We would need more information to perform a well rounded multivariate regression.

#altered code
plot_data <-gapminder
  

plot_data |>
  ggplot(aes(x = population, y = life_exp)) +
  geom_point(aes(color = country)) +
  scale_color_manual(values = c ("United States" = "blue", "China" = "red", "India" = "gold", "Portugal" = 'green')) +
  #geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Population", y = "Life Expectancy",
    title = "Life Expectancy as Population Increases",
    subtitle = paste("Covariance:", round(cov(plot_data$population,
    plot_data$life_exp), 2))) +
  theme_minimal()

This is an extension of the first plot. with certain countries highlighted. This shows that we would have to treat each country a little bit different or segment the data to include some binary variable that splits the super big countries from the average sized countries.

Second Critique -

# plotting just for 2001
plot_year <- 2001

gapminder |>
  filter(year == plot_year) |>
  ggplot() +
  geom_point(mapping = aes(x = population, 
                           y = continent,
                           colour = pop_half)) +
  scale_x_log10() +
  scale_colour_brewer(palette = "Dark2") +
  theme_hc() +
  labs(title = paste("Population 50th percentiles for the year", plot_year))

Here we can see the data is very sparse and in some continents there are only two countries represented for the year, which is obviously not the case.

# plotting just for 2002
plot_year <- 2002

gapminder |>
  filter(year == plot_year) |>
  ggplot() +
  geom_point(mapping = aes(x = population, 
                           y = continent,
                           colour = pop_half)) +
  scale_x_log10() +
  scale_colour_brewer(palette = "Dark2") +
  theme_hc() +
  labs(title = paste("Population 50th percentiles for the year", plot_year))

Now looking at just one year later we can see that there is now a lot of observations present for all continents. Another interesting thing to notice is that Africa as a continent was not present in the 2001 data and has a lot of observations for 2002.

gap_by_year <- gapminder %>%
  group_by(year) %>%
  summarise(year_count = n())
gap_by_year %>%
  ggplot() +
  geom_line(mapping = aes(x = year,y = year_count)) +
  geom_text_repel(
    data = subset(gap_by_year, year_count > 100),
    aes(x=year, y=year_count, label = year),
    size = 3,
    point.padding = 0.2, 
    nudge_x = .3,
    segment.curvature = -1e-20,
    arrow = arrow(length = unit(0.015, "npc"))
  ) +
  theme_hc() +
  labs(title = "Observation Count by Year",
       x = "Year", 
       y = "Observation Count")

This second chart displays how there are varying amounts of observations reported each year, supporting the idea of misrepresentation.

There is bias introduced in the original method for plotting the population data percentiles because most countries are not reporting their information each year to a census, unlike larger countries like the USA and Canada. For each year, the average for each continent will vary because not all countries are submitting population data.

If you were to consider the average population for the Americas in 2001 and Brazil did not have data fror that year, your average will be inherently bias towards the countries that did submit their data. When using this model to make inferences, we may make the wrong inferences.

The ethical implications or missing this problem could be huge as the countries that are able to consistently gather data on a year to year basis appear to be the more developed countries. So by randomly looking at the data we are more likely to get information with the developed countries data, which could lead to under representing these other countries.

It would be beneficial to try and fill in the missing data for countries. One way to do this would be by taking the average of the two years that surround the desired year statistics. For example: Uganda has not submitted 1999 data, only 1995 and 2000. We could then use the average of these two numbers to compute a more realistic average between all countries in a continent. It is important to recognize that these data will not be entirely correct, but may follow the correct trend.