Overview

This series of posts is intended to get the reader up speed on how to import, format, and use the economic data of Thomas Piketty, Gabriel Zucman, and Emmanuel Saez. Piketty is most known in the US for his seminal 2014 work Capital in the Twenty-First Century, and Saez and Zucman recently released The Triumph of Injustice: How the Rich Dodge Taxes and How to Make Them Pay.

Summary

This is the section that determines where everyone fits on the income & wealth scale. Am I in the Top 1%, the bottom 50%, or somewhere in the middle? In the machine learning section the distributions will be used as a proxy for an actual dollar amount of income & wealth.

All of our calculations to this point have been row-wise so there was no need to treat each year separately. In this chapter we will create income & wealth distributions by year, or group-wise. I will show you a very powerful pattern for doing bulk processing like this that uses few lines of code and no loops (at least no explicit loops).

Code

Notes

Two new variables are created to hold the income and wealth classes. The first splits everyone into bottom_50, middle_class, and top_10 percent. The second splits the top 1% into bottom_90, top_10, top_1, top_01, and gt_top_01 (gt = greater than top 0.1%; these are the highest earners and wealthiest people in America).

This initial dataset contains six years

library(tidyverse) 
dina_proportioned <- readRDS("temp/dina_proportioned.RDS")

unique(dina_proportioned$year)
## [1] 1968 1978 1988 1998 2008 2018
## Levels: 1968 1978 1988 1998 2008 2018

Dimensions of dataset

dim(dina_proportioned)
## [1] 321530    241

Create a base function that can work with one year

I want to be very explicit about this calculation. Remember that we are interested in where someone falls on the income or wealth spectrum in relation to other people. Each record has a population_weight that indicates how many people it represents.

Steps:
1. Get the total adult population for the given year
2. Sort income/wealth from smallest to largest
3. Calculate running total on population_weight
4. Divide the running total for each record by the total population to indicate where cohort is on the distribution.
5. Assign categories depending on where cohort fits on the distribution.
6. Assign categories for those in the top 1%.

create_distributions_base <- function(df, years = NULL) {
  adult_population <- df %>%
    filter(year == years) %>%
    summarize(adult_population = sum(population_weight)) %>%
    pull()
  
  income_wealth_dist <- df %>%
    filter(year == years) %>%
    arrange(ttl_income_national_pretax) %>%
    mutate(cumm_population_income = cumsum(population_weight),
           dist_income_national_pretax = cumm_population_income / adult_population, 
           cumm_income_national_pretax = cumsum(ttl_income_national_pretax)) %>%
    arrange(ttl_wealth_net) %>%
    mutate(cumm_population_wealth = cumsum(population_weight),
           dist_wealth = cumm_population_wealth / adult_population, 
           cumm_wealth = cumsum(ttl_wealth_net)) %>%
    mutate(income_class = if_else(dist_income_national_pretax < .50, "bottom_50",
                                  if_else(dist_income_national_pretax >= .50 &
                                            dist_income_national_pretax < .90, "middle_class",
                                          "top_ten")),
           income_class = factor(income_class, ordered = TRUE,
                                 levels = c("bottom_50", "middle_class", "top_ten")),
           income_class_t10 = if_else(dist_income_national_pretax < .90, "bottom_90",
                                      if_else(dist_income_national_pretax >= .90 &
                                                dist_income_national_pretax < .99, 
                                              "top_10",
                                              if_else(dist_income_national_pretax >= .99 &
                                                        dist_income_national_pretax < .999, 
                                                      "top_1", 
                                                      if_else(dist_income_national_pretax >= .999 &
                                                                dist_income_national_pretax < .9999, 
                                                              "top_01",
                                                              "gt_top_01")))),
           income_class_t10 = factor(income_class_t10, ordered = TRUE, 
                                     levels = c("bottom_90", "top_10", "top_1", 
                                                "top_01", "gt_top_01")), 
           
           wealth_class = if_else(dist_wealth < .50, "bottom_50",
                                  if_else(dist_wealth >= .50 &
                                            dist_wealth < .90, "middle_class",
                                          "top_ten")), 
           wealth_class = factor(wealth_class, ordered = TRUE, 
                                 levels = c("bottom_50", "middle_class", "top_ten")),
           wealth_class_t10 = if_else(dist_wealth < .90, "bottom_90",
                                      if_else(dist_wealth >= .90 &
                                                dist_wealth < .99, 
                                              "top_10",
                                              if_else(dist_wealth >= .99 &
                                                        dist_wealth < .999, 
                                                      "top_1", 
                                                      if_else(dist_wealth >= .999 &
                                                                dist_wealth < .9999, 
                                                              "top_01", 
                                                              "gt_top_01")))),
           wealth_class_t10 = factor(wealth_class_t10,
                                     ordered = TRUE, 
                                     levels = c("bottom_90", "top_10", "top_1", 
                                                "top_01", "gt_top_01"))
    )
  
  return(income_wealth_dist)
}

dina_df7 <- create_distributions_base(dina_proportioned, 
                                      years = 1968)
dim(dina_df7)
## [1] 32794   251

If I try to run this with all of the years, it fails.

dina_df7 <- create_distributions_base(dina_proportioned, 
                                      years = unique(dina_proportioned$year))
longer object length is not a multiple of shorter object lengthlonger object length is not a multiple of shorter object lengthlonger object length is not a multiple of shorter object lengthlonger object length is not a multiple of shorter object length

We need to generalize the function so that it can iterate over the individual years.

Generalize the function

This is what the new multi-year function call looks like

dina_df7 <- map_dfr(unique(dina_proportioned$year), 
                   ~ create_distributions_base(dina_proportioned, .x))

Wrapper function

Most of the time you just want to create distributions on whatever years you are working with, but not always, so a function that can handle several scenarios is needed.
It should allow the user to simply pass the dataframe and have the function work it’s magic.

It should also allow the user to specify one or more years from their data that they want processed.

This is more of an ease-of-use feature so you don’t have to change which function to use depending on the circumstances.
* If no years are specified then the function will iterate over all years in the dataset
* If the user wants specific years they can specify one or more in the years argument

Code

create_distributions <- function(df, years = NULL) {
  if(length(years) == 1) {
    dist_df <- create_distributions_base(df, years)
    print(table(dist_df$year, dist_df$income_class))
    return(dist_df)
  } 
  if(length(years) > 1){
    dist_df <- map_dfr(unique(years), 
                       ~ create_distributions_base(df, .x))
    print(table(dist_df$year, dist_df$income_class))
    return(dist_df)
  }
  
  data_years <- unique(df$year)
  if(length(data_years) == 1) {
    dist_df <- create_distributions_base(df, data_years)
    print(table(dist_df$year, dist_df$income_class))
    return(dist_df)
  } 
  if(length(data_years) > 1){
    dist_df <- map_dfr(unique(data_years), 
                       ~ create_distributions_base(df, .x))
    print(table(dist_df$year, dist_df$income_class))
    return(dist_df)
  }
  
  print("Could not process your request. Check that your data has at least one year")
}

Test default mode

dina_df7 <- create_distributions(dina_proportioned) 
##       
##        bottom_50 middle_class top_ten
##   1968     11334         5813   15647
##   1978     26440        17388   23160
##   1988     19000         8270   16800
##   1998     17204         9378   23643
##   2008     22023        12166   24504
##   2018     27530        15716   25514

Test with a single year

dina_df7 <- create_distributions(dina_proportioned, years = 1968) 
##       
##        bottom_50 middle_class top_ten
##   1968     11334         5813   15647
##   1978         0            0       0
##   1988         0            0       0
##   1998         0            0       0
##   2008         0            0       0
##   2018         0            0       0

Test with multiple years

dina_df7 <- create_distributions(dina_proportioned, years = c(1968, 2018)) 
##       
##        bottom_50 middle_class top_ten
##   1968     11334         5813   15647
##   1978         0            0       0
##   1988         0            0       0
##   1998         0            0       0
##   2008         0            0       0
##   2018     27530        15716   25514

End notes

This ends the formatting section. In the next chapter I’ll provide a single function that will fully process every year in your subset folder so that we’re ready to move forward to the machine learning chapter.

Next up: One Big Formatting Function

Appendix: Final distribution functions & tests

Copy & paste this

# Base function
create_distributions_base <- function(df, years = NULL) {
  adult_population <- df %>%
    filter(year == years) %>%
    summarize(adult_population = sum(population_weight)) %>%
    pull()
  
  income_wealth_dist <- df %>%
    filter(year == years) %>%
    arrange(ttl_income_national_pretax) %>%
    mutate(cumm_population_income = cumsum(population_weight),
           dist_income_national_pretax = cumm_population_income / adult_population, 
           cumm_income_national_pretax = cumsum(ttl_income_national_pretax)) %>%
    arrange(ttl_wealth_net) %>%
    mutate(cumm_population_wealth = cumsum(population_weight),
           dist_wealth = cumm_population_wealth / adult_population, 
           cumm_wealth = cumsum(ttl_wealth_net)) %>%
    mutate(income_class = if_else(dist_income_national_pretax < .50, "bottom_50",
                                  if_else(dist_income_national_pretax >= .50 &
                                            dist_income_national_pretax < .90, "middle_class",
                                          "top_ten")),
           income_class = factor(income_class, ordered = TRUE,
                                 levels = c("bottom_50", "middle_class", "top_ten")),
           income_class_t10 = if_else(dist_income_national_pretax < .90, "bottom_90",
                                      if_else(dist_income_national_pretax >= .90 &
                                                dist_income_national_pretax < .99, 
                                              "top_10",
                                              if_else(dist_income_national_pretax >= .99 &
                                                        dist_income_national_pretax < .999, 
                                                      "top_1", 
                                                      if_else(dist_income_national_pretax >= .999 &
                                                                dist_income_national_pretax < .9999, 
                                                              "top_01",
                                                              "gt_top_01")))),
           income_class_t10 = factor(income_class_t10, ordered = TRUE, 
                                     levels = c("bottom_90", "top_10", "top_1", 
                                                "top_01", "gt_top_01")), 
           
           wealth_class = if_else(dist_wealth < .50, "bottom_50",
                                  if_else(dist_wealth >= .50 &
                                            dist_wealth < .90, "middle_class",
                                          "top_ten")), 
           wealth_class = factor(wealth_class, ordered = TRUE, 
                                 levels = c("bottom_50", "middle_class", "top_ten")),
           wealth_class_t10 = if_else(dist_wealth < .90, "bottom_90",
                                      if_else(dist_wealth >= .90 &
                                                dist_wealth < .99, 
                                              "top_10",
                                              if_else(dist_wealth >= .99 &
                                                        dist_wealth < .999, 
                                                      "top_1", 
                                                      if_else(dist_wealth >= .999 &
                                                                dist_wealth < .9999, 
                                                              "top_01", 
                                                              "gt_top_01")))),
           wealth_class_t10 = factor(wealth_class_t10,
                                     ordered = TRUE, 
                                     levels = c("bottom_90", "top_10", "top_1", 
                                                "top_01", "gt_top_01"))
    )
  
  return(income_wealth_dist)
}


# Wrapper function - This is the one you call, not base  
create_distributions <- function(df, years = NULL) {
  if(length(years) == 1) {
    dist_df <- create_distributions_base(df, years)
    print(table(dist_df$year, dist_df$income_class))
    return(dist_df)
  } 
  if(length(years) > 1){
    dist_df <- map_dfr(unique(years), 
                       ~ create_distributions_base(df, .x))
    print(table(dist_df$year, dist_df$income_class))
    return(dist_df)
  }
  
  data_years <- unique(df$year)
  if(length(data_years) == 1) {
    dist_df <- create_distributions_base(df, data_years)
    print(table(dist_df$year, dist_df$income_class))
    return(dist_df)
  } 
  if(length(data_years) > 1){
    dist_df <- map_dfr(unique(data_years), 
                       ~ create_distributions_base(df, .x))
    print(table(dist_df$year, dist_df$income_class))
    return(dist_df)
  }
  
  print("Could not process your request. Check that your data has at least one year")
}


# Test 1
dina_df7 <- create_distributions(dina_proportioned) 
dim(dina_df7)

# Test 2
dina_df7 <- create_distributions(dina_proportioned, years = 1968) 
dim(dina_df7)

# Test 3
dina_df7 <- create_distributions(dina_proportioned, years = c(1968, 2018)) 
dim(dina_df7)

END