Overview
This series of posts is intended to get the reader up speed on how to import, format, and use the economic data of Thomas Piketty, Gabriel Zucman, and Emmanuel Saez. Piketty is most known in the US for his seminal 2014 work Capital in the Twenty-First Century, and Saez and Zucman recently released The Triumph of Injustice: How the Rich Dodge Taxes and How to Make Them Pay.
Summary
This is the section that determines where everyone fits on the income & wealth scale. Am I in the Top 1%, the bottom 50%, or somewhere in the middle? In the machine learning section the distributions will be used as a proxy for an actual dollar amount of income & wealth.
All of our calculations to this point have been row-wise so there was no need to treat each year separately. In this chapter we will create income & wealth distributions by year, or group-wise. I will show you a very powerful pattern for doing bulk processing like this that uses few lines of code and no loops (at least no explicit loops).
Code
Notes
Two new variables are created to hold the income and wealth classes. The first splits everyone into bottom_50, middle_class, and top_10 percent. The second splits the top 1% into bottom_90, top_10, top_1, top_01, and gt_top_01 (gt = greater than top 0.1%; these are the highest earners and wealthiest people in America).
This initial dataset contains six years
library(tidyverse)
dina_proportioned <- readRDS("temp/dina_proportioned.RDS")
unique(dina_proportioned$year)## [1] 1968 1978 1988 1998 2008 2018
## Levels: 1968 1978 1988 1998 2008 2018
Dimensions of dataset
dim(dina_proportioned)## [1] 321530 241
Create a base function that can work with one year
I want to be very explicit about this calculation. Remember that we are interested in where someone falls on the income or wealth spectrum in relation to other people. Each record has a population_weight that indicates how many people it represents.
Steps:
1. Get the total adult population for the given year
2. Sort income/wealth from smallest to largest
3. Calculate running total on population_weight
4. Divide the running total for each record by the total population to indicate where cohort is on the distribution.
5. Assign categories depending on where cohort fits on the distribution.
6. Assign categories for those in the top 1%.
create_distributions_base <- function(df, years = NULL) {
adult_population <- df %>%
filter(year == years) %>%
summarize(adult_population = sum(population_weight)) %>%
pull()
income_wealth_dist <- df %>%
filter(year == years) %>%
arrange(ttl_income_national_pretax) %>%
mutate(cumm_population_income = cumsum(population_weight),
dist_income_national_pretax = cumm_population_income / adult_population,
cumm_income_national_pretax = cumsum(ttl_income_national_pretax)) %>%
arrange(ttl_wealth_net) %>%
mutate(cumm_population_wealth = cumsum(population_weight),
dist_wealth = cumm_population_wealth / adult_population,
cumm_wealth = cumsum(ttl_wealth_net)) %>%
mutate(income_class = if_else(dist_income_national_pretax < .50, "bottom_50",
if_else(dist_income_national_pretax >= .50 &
dist_income_national_pretax < .90, "middle_class",
"top_ten")),
income_class = factor(income_class, ordered = TRUE,
levels = c("bottom_50", "middle_class", "top_ten")),
income_class_t10 = if_else(dist_income_national_pretax < .90, "bottom_90",
if_else(dist_income_national_pretax >= .90 &
dist_income_national_pretax < .99,
"top_10",
if_else(dist_income_national_pretax >= .99 &
dist_income_national_pretax < .999,
"top_1",
if_else(dist_income_national_pretax >= .999 &
dist_income_national_pretax < .9999,
"top_01",
"gt_top_01")))),
income_class_t10 = factor(income_class_t10, ordered = TRUE,
levels = c("bottom_90", "top_10", "top_1",
"top_01", "gt_top_01")),
wealth_class = if_else(dist_wealth < .50, "bottom_50",
if_else(dist_wealth >= .50 &
dist_wealth < .90, "middle_class",
"top_ten")),
wealth_class = factor(wealth_class, ordered = TRUE,
levels = c("bottom_50", "middle_class", "top_ten")),
wealth_class_t10 = if_else(dist_wealth < .90, "bottom_90",
if_else(dist_wealth >= .90 &
dist_wealth < .99,
"top_10",
if_else(dist_wealth >= .99 &
dist_wealth < .999,
"top_1",
if_else(dist_wealth >= .999 &
dist_wealth < .9999,
"top_01",
"gt_top_01")))),
wealth_class_t10 = factor(wealth_class_t10,
ordered = TRUE,
levels = c("bottom_90", "top_10", "top_1",
"top_01", "gt_top_01"))
)
return(income_wealth_dist)
}
dina_df7 <- create_distributions_base(dina_proportioned,
years = 1968)
dim(dina_df7)## [1] 32794 251
If I try to run this with all of the years, it fails.
dina_df7 <- create_distributions_base(dina_proportioned,
years = unique(dina_proportioned$year))longer object length is not a multiple of shorter object lengthlonger object length is not a multiple of shorter object lengthlonger object length is not a multiple of shorter object lengthlonger object length is not a multiple of shorter object lengthWe need to generalize the function so that it can iterate over the individual years.
Generalize the function
This is what the new multi-year function call looks like
dina_df7 <- map_dfr(unique(dina_proportioned$year),
~ create_distributions_base(dina_proportioned, .x))- Since we want a single data frame returned we’ll use the
map_dfr()function
- The
~tells R that this function is going to be used iteratively over a vector that is passed in from the left side
- Identify all of the unique years in the dataset and map them to the
.xparameter of the function.
- The function will itereate over each
yearand return a single dataframe. The.xrepresents the changing parameter passed in from the left side.
Note: This will not work on a dataset with a single year. Below I create a wrapper to call the proper function regardless of how many years are being passed.
Wrapper function
Most of the time you just want to create distributions on whatever years you are working with, but not always, so a function that can handle several scenarios is needed.
It should allow the user to simply pass the dataframe and have the function work it’s magic.
It should also allow the user to specify one or more years from their data that they want processed.
This is more of an ease-of-use feature so you don’t have to change which function to use depending on the circumstances.
* If no years are specified then the function will iterate over all years in the dataset
* If the user wants specific years they can specify one or more in the years argument
Code
create_distributions <- function(df, years = NULL) {
if(length(years) == 1) {
dist_df <- create_distributions_base(df, years)
print(table(dist_df$year, dist_df$income_class))
return(dist_df)
}
if(length(years) > 1){
dist_df <- map_dfr(unique(years),
~ create_distributions_base(df, .x))
print(table(dist_df$year, dist_df$income_class))
return(dist_df)
}
data_years <- unique(df$year)
if(length(data_years) == 1) {
dist_df <- create_distributions_base(df, data_years)
print(table(dist_df$year, dist_df$income_class))
return(dist_df)
}
if(length(data_years) > 1){
dist_df <- map_dfr(unique(data_years),
~ create_distributions_base(df, .x))
print(table(dist_df$year, dist_df$income_class))
return(dist_df)
}
print("Could not process your request. Check that your data has at least one year")
}Test default mode
dina_df7 <- create_distributions(dina_proportioned) ##
## bottom_50 middle_class top_ten
## 1968 11334 5813 15647
## 1978 26440 17388 23160
## 1988 19000 8270 16800
## 1998 17204 9378 23643
## 2008 22023 12166 24504
## 2018 27530 15716 25514
Test with a single year
dina_df7 <- create_distributions(dina_proportioned, years = 1968) ##
## bottom_50 middle_class top_ten
## 1968 11334 5813 15647
## 1978 0 0 0
## 1988 0 0 0
## 1998 0 0 0
## 2008 0 0 0
## 2018 0 0 0
Test with multiple years
dina_df7 <- create_distributions(dina_proportioned, years = c(1968, 2018)) ##
## bottom_50 middle_class top_ten
## 1968 11334 5813 15647
## 1978 0 0 0
## 1988 0 0 0
## 1998 0 0 0
## 2008 0 0 0
## 2018 27530 15716 25514
End notes
This ends the formatting section. In the next chapter I’ll provide a single function that will fully process every year in your subset folder so that we’re ready to move forward to the machine learning chapter.
Next up: One Big Formatting Function
Appendix: Final distribution functions & tests
Copy & paste this
# Base function
create_distributions_base <- function(df, years = NULL) {
adult_population <- df %>%
filter(year == years) %>%
summarize(adult_population = sum(population_weight)) %>%
pull()
income_wealth_dist <- df %>%
filter(year == years) %>%
arrange(ttl_income_national_pretax) %>%
mutate(cumm_population_income = cumsum(population_weight),
dist_income_national_pretax = cumm_population_income / adult_population,
cumm_income_national_pretax = cumsum(ttl_income_national_pretax)) %>%
arrange(ttl_wealth_net) %>%
mutate(cumm_population_wealth = cumsum(population_weight),
dist_wealth = cumm_population_wealth / adult_population,
cumm_wealth = cumsum(ttl_wealth_net)) %>%
mutate(income_class = if_else(dist_income_national_pretax < .50, "bottom_50",
if_else(dist_income_national_pretax >= .50 &
dist_income_national_pretax < .90, "middle_class",
"top_ten")),
income_class = factor(income_class, ordered = TRUE,
levels = c("bottom_50", "middle_class", "top_ten")),
income_class_t10 = if_else(dist_income_national_pretax < .90, "bottom_90",
if_else(dist_income_national_pretax >= .90 &
dist_income_national_pretax < .99,
"top_10",
if_else(dist_income_national_pretax >= .99 &
dist_income_national_pretax < .999,
"top_1",
if_else(dist_income_national_pretax >= .999 &
dist_income_national_pretax < .9999,
"top_01",
"gt_top_01")))),
income_class_t10 = factor(income_class_t10, ordered = TRUE,
levels = c("bottom_90", "top_10", "top_1",
"top_01", "gt_top_01")),
wealth_class = if_else(dist_wealth < .50, "bottom_50",
if_else(dist_wealth >= .50 &
dist_wealth < .90, "middle_class",
"top_ten")),
wealth_class = factor(wealth_class, ordered = TRUE,
levels = c("bottom_50", "middle_class", "top_ten")),
wealth_class_t10 = if_else(dist_wealth < .90, "bottom_90",
if_else(dist_wealth >= .90 &
dist_wealth < .99,
"top_10",
if_else(dist_wealth >= .99 &
dist_wealth < .999,
"top_1",
if_else(dist_wealth >= .999 &
dist_wealth < .9999,
"top_01",
"gt_top_01")))),
wealth_class_t10 = factor(wealth_class_t10,
ordered = TRUE,
levels = c("bottom_90", "top_10", "top_1",
"top_01", "gt_top_01"))
)
return(income_wealth_dist)
}
# Wrapper function - This is the one you call, not base
create_distributions <- function(df, years = NULL) {
if(length(years) == 1) {
dist_df <- create_distributions_base(df, years)
print(table(dist_df$year, dist_df$income_class))
return(dist_df)
}
if(length(years) > 1){
dist_df <- map_dfr(unique(years),
~ create_distributions_base(df, .x))
print(table(dist_df$year, dist_df$income_class))
return(dist_df)
}
data_years <- unique(df$year)
if(length(data_years) == 1) {
dist_df <- create_distributions_base(df, data_years)
print(table(dist_df$year, dist_df$income_class))
return(dist_df)
}
if(length(data_years) > 1){
dist_df <- map_dfr(unique(data_years),
~ create_distributions_base(df, .x))
print(table(dist_df$year, dist_df$income_class))
return(dist_df)
}
print("Could not process your request. Check that your data has at least one year")
}
# Test 1
dina_df7 <- create_distributions(dina_proportioned)
dim(dina_df7)
# Test 2
dina_df7 <- create_distributions(dina_proportioned, years = 1968)
dim(dina_df7)
# Test 3
dina_df7 <- create_distributions(dina_proportioned, years = c(1968, 2018))
dim(dina_df7)