Overview
This series of posts is intended to get the reader up to speed on how to import, format, and use the economic data of Thomas Piketty, Gabriel Zucman, and Emmanuel Saez. Piketty is best known in the US for his seminal 2014 work Capital in the Twenty-First Century, and Saez and Zucman recently released The Triumph of Injustice: How the Rich Dodge Taxes and How to Make Them Pay.
Summary
This chapter creates proportions. For instance, we can look at a person’s wealth and see how much of it comes from each bucket. These proportions are needed later, when we use machine learning to infer the tax code of 1968 and apply it to 2018. Net wealth, for example, is made up of the following buckets:
ttl_wealth_net =
assets_equity +
assets_currency +
assets_housing +
assets_business +
assets_pension_lifeins +
liabilities_household
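To make the idea concrete, here is a minimal sketch of a single share calculation on invented data. The column names follow the formula above, but the two rows, their values, and the assumption that liabilities are stored as negative numbers are illustrative only and are not taken from the DINA files.

```r
library(dplyr)

# Hypothetical two-person data set; values are invented for illustration.
wealth <- tibble(
  assets_equity          = c(50, 0),
  assets_currency        = c(10, 5),
  assets_housing         = c(100, 0),
  assets_business        = c(0, 0),
  assets_pension_lifeins = c(60, 0),
  liabilities_household  = c(-20, -5)  # assumption: liabilities are negative
) %>%
  mutate(
    # Rebuild the total from its buckets, as in the formula above.
    ttl_wealth_net = assets_equity + assets_currency + assets_housing +
      assets_business + assets_pension_lifeins + liabilities_household,
    # Share of net wealth held in housing; 0 when net wealth is 0.
    prop_assets_housing = if_else(ttl_wealth_net != 0,
                                  assets_housing / ttl_wealth_net, 0)
  )

wealth %>% select(ttl_wealth_net, prop_assets_housing)
```

The first person has net wealth of 200, half of it in housing. The second person has net wealth of exactly 0, so the share falls back to 0 instead of being undefined.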
Notes
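In this chapter I introduce the prop_ prefix. Any variable with that prefix holds the result of a proportion calculation: a component divided by its total. If you want all the proportion columns, they can be selected like this: select(contains("prop_")). Note that this also matches the recon_prop_ check columns; select(starts_with("prop_")) leaves them out.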
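A quick way to see which columns each selection helper picks up is to try both on a tiny frame. The data frame below is a hypothetical stand-in with one share column and one check column; only the naming pattern matters.

```r
library(dplyr)

# Hypothetical one-row frame: a total, one share, and one recon check column.
demo <- tibble(
  ttl_wealth_net            = 200,
  prop_assets_housing       = 0.5,
  recon_prop_ttl_wealth_net = 1
)

# contains() matches the text anywhere in the column name.
names_contains <- demo %>% select(contains("prop_")) %>% names()

# starts_with() matches only at the beginning of the column name.
names_starts_with <- demo %>% select(starts_with("prop_")) %>% names()

names_contains
names_starts_with
```

Which helper is the right one depends on whether the check columns should travel along with the shares.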
This code looks a lot more complicated than it really is. The reason is that some totals are zero, and dividing by them would produce NaN or Inf values rather than usable proportions. I added an if_else() to almost all lines to handle this, returning 0 whenever the total is 0. In the recon this leads to some zeros instead of ones. As long as you aren’t getting anything besides zeros and ones, you can be confident that everything is adding up correctly. One recon (recon_prop_ttl_income_national_posttax) returns numbers very close to 1 but not exactly 1. I’ll have to take a closer look to see if there is anything to be concerned about.
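The guard can be seen in isolation with a short sketch. Here safe_prop() is a hypothetical helper introduced only for this illustration (the chapter's code writes the if_else() out on every line), and the three rows are invented.

```r
library(dplyr)

# Dividing doubles by zero in R gives special values, not an error.
0 / 0  # NaN
5 / 0  # Inf

# Hypothetical helper: same guard as the if_else() calls in this chapter.
safe_prop <- function(part, total) {
  if_else(total != 0, part / total, 0)
}

shares <- tibble(part = c(30, 0, 5), total = c(120, 0, 0)) %>%
  mutate(
    unguarded = part / total,           # contains NaN and Inf
    guarded   = safe_prop(part, total)  # falls back to 0 when total is 0
  )

shares
```

A helper like this would shorten the long mutate() calls, at the cost of hiding the rule one level down. Writing the guard out in full keeps each line self-explanatory.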
Also, some elements are used in more than one proportion. The ttl_income_factor_labor field, for example, is a component of both ttl_income_factor and ttl_income_pretax_labor. In these cases I added a suffix so that there wasn’t a name clash when assigning the proportion, e.g. prop_ttl_income_factor_labor_if and prop_ttl_income_factor_labor_ilpr.
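The reason a suffix is needed is that mutate() silently replaces a column when the same name is assigned twice. The sketch below uses shortened, hypothetical column names (labor, ttl_factor, ttl_pretax_labor) and invented values to show both outcomes.

```r
library(dplyr)

base <- tibble(labor = 40, ttl_factor = 100, ttl_pretax_labor = 50)

# Same name assigned twice: the second value overwrites the first.
clash <- base %>%
  mutate(prop_labor = labor / ttl_factor) %>%
  mutate(prop_labor = labor / ttl_pretax_labor)

# A suffix per denominator keeps both shares.
suffixed <- base %>%
  mutate(
    prop_labor_if   = labor / ttl_factor,
    prop_labor_ilpr = labor / ttl_pretax_labor
  )

names(clash)
names(suffixed)
```

In the first version only the share of pretax labor income survives, and no warning is raised. In the second, both shares are available under distinct names.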
I’m not going to go step-by-step like in the last chapter. This is very similar and not as long. Copy & paste the code into RStudio and look at the output. After this we’ll be up to 243 columns.
library(tidyverse)
dina_reconciled <- readRDS("temp/dina_reconciled.RDS")
dina_proportioned <- dina_reconciled %>%
mutate( # ttl_income_factor_labor
prop_income_factor_labor_wages = if_else(ttl_income_factor_labor != 0,
income_factor_labor_wages / ttl_income_factor_labor, 0),
prop_income_factor_labor_mixed = if_else(ttl_income_factor_labor != 0,
income_factor_labor_mixed / ttl_income_factor_labor, 0),
prop_income_factor_labor_sales_taxes = if_else(ttl_income_factor_labor != 0,
income_factor_labor_sales_taxes / ttl_income_factor_labor, 0),
recon_prop_ttl_income_factor_labor =
prop_income_factor_labor_wages +
prop_income_factor_labor_mixed +
prop_income_factor_labor_sales_taxes
) %>%
mutate( # ttl_income_factor_capital
prop_income_factor_capital_housing = if_else(ttl_income_factor_capital != 0,
income_factor_capital_housing / ttl_income_factor_capital, 0),
prop_income_factor_capital_equity = if_else(ttl_income_factor_capital != 0,
income_factor_capital_equity / ttl_income_factor_capital, 0),
prop_income_factor_capital_interest = if_else(ttl_income_factor_capital != 0,
income_factor_capital_interest / ttl_income_factor_capital, 0),
prop_income_factor_capital_business = if_else(ttl_income_factor_capital != 0,
income_factor_capital_business / ttl_income_factor_capital, 0),
prop_income_factor_capital_pension_benefits = if_else(ttl_income_factor_capital != 0,
income_factor_capital_pension_benefits / ttl_income_factor_capital, 0),
prop_payments_interest = if_else(ttl_income_factor_capital != 0,
payments_interest / ttl_income_factor_capital, 0),
recon_prop_ttl_income_factor_capital =
prop_income_factor_capital_housing +
prop_income_factor_capital_equity +
prop_income_factor_capital_interest +
prop_income_factor_capital_business +
prop_income_factor_capital_pension_benefits +
prop_payments_interest
) %>%
mutate( # ttl_income_factor
prop_ttl_income_factor_labor_if = if_else(ttl_income_factor != 0,
ttl_income_factor_labor / ttl_income_factor, 0),
prop_ttl_income_factor_capital_if = if_else(ttl_income_factor != 0,
ttl_income_factor_capital / ttl_income_factor, 0),
recon_prop_ttl_income_factor =
prop_ttl_income_factor_labor_if +
prop_ttl_income_factor_capital_if
) %>%
mutate( # ttl_contributions_social_insurance
prop_contributions_social_insurance_pensions = if_else(ttl_contributions_social_insurance != 0,
contributions_social_insurance_pensions / ttl_contributions_social_insurance, 0),
prop_contributions_social_insurance_di_ui = if_else(ttl_contributions_social_insurance != 0,
contributions_social_insurance_di_ui / ttl_contributions_social_insurance, 0),
recon_prop_ttl_contributions_social_insurance =
prop_contributions_social_insurance_pensions +
prop_contributions_social_insurance_di_ui
) %>%
mutate( # ttl_income_pretax_labor
prop_ttl_income_factor_labor_ilpr = if_else(ttl_income_pretax_labor != 0,
ttl_income_factor_labor / ttl_income_pretax_labor, 0),
prop_ttl_contributions_social_insurance = if_else(ttl_income_pretax_labor != 0,
ttl_contributions_social_insurance / ttl_income_pretax_labor, 0),
prop_income_social_share_labor = if_else(ttl_income_pretax_labor != 0,
income_social_share_labor / ttl_income_pretax_labor, 0),
recon_prop_ttl_income_pretax_labor =
prop_ttl_income_factor_labor_ilpr +
prop_ttl_contributions_social_insurance +
prop_income_social_share_labor
) %>%
mutate( # ttl_income_pretax_capital
prop_ttl_income_factor_capital_icpr = if_else(ttl_income_pretax_capital != 0,
ttl_income_factor_capital / ttl_income_pretax_capital, 0),
prop_income_investment_payable_pensions = if_else(ttl_income_pretax_capital != 0,
income_investment_payable_pensions / ttl_income_pretax_capital, 0),
prop_income_social_share_capital = if_else(ttl_income_pretax_capital != 0,
income_social_share_capital / ttl_income_pretax_capital, 0),
recon_prop_ttl_income_pretax_capital =
prop_ttl_income_factor_capital_icpr +
prop_income_investment_payable_pensions +
prop_income_social_share_capital
) %>%
mutate( # ttl_income_pretax
prop_ttl_income_pretax_labor = if_else(ttl_income_pretax != 0,
ttl_income_pretax_labor / ttl_income_pretax, 0),
prop_ttl_income_pretax_capital = if_else(ttl_income_pretax != 0,
ttl_income_pretax_capital / ttl_income_pretax, 0),
recon_prop_ttl_income_pretax =
prop_ttl_income_pretax_labor +
prop_ttl_income_pretax_capital
) %>%
mutate( # ttl_income_national_factor
prop_ttl_income_factor_inf = if_else(ttl_income_national_factor != 0,
ttl_income_factor / ttl_income_national_factor, 0),
prop_income_social_collective_property_paid_by_govt = if_else(ttl_income_national_factor != 0,
income_social_collective_property_paid_by_govt / ttl_income_national_factor, 0),
prop_income_social_collective_non_profit = if_else(ttl_income_national_factor != 0,
income_social_collective_non_profit / ttl_income_national_factor, 0),
recon_prop_ttl_income_national_factor =
prop_ttl_income_factor_inf +
prop_income_social_collective_property_paid_by_govt +
prop_income_social_collective_non_profit
) %>%
mutate( # ttl_income_national_pretax
prop_ttl_income_pretax = if_else(ttl_income_national_pretax != 0,
ttl_income_pretax / ttl_income_national_pretax, 0),
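# NOTE: the next two names were already assigned in the ttl_income_national_factor
# block above, so these assignments overwrite those columns with the pretax shares
prop_income_social_collective_property_paid_by_govt = if_else(ttl_income_national_pretax != 0,
income_social_collective_property_paid_by_govt / ttl_income_national_pretax, 0),
prop_income_social_collective_non_profit = if_else(ttl_income_national_pretax != 0,
income_social_collective_non_profit / ttl_income_national_pretax, 0),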
prop_surplus_primary_public_pension_system = if_else(ttl_income_national_pretax != 0,
surplus_primary_public_pension_system / ttl_income_national_pretax, 0),
prop_income_investment_pensions_payable = if_else(ttl_income_national_pretax != 0,
income_investment_pensions_payable / ttl_income_national_pretax, 0),
recon_prop_ttl_income_national_pretax =
prop_ttl_income_pretax +
prop_income_social_collective_property_paid_by_govt +
prop_income_social_collective_non_profit +
prop_surplus_primary_public_pension_system +
prop_income_investment_pensions_payable
) %>%
mutate( # ttl_income_national_posttax
prop_ttl_income_disposable_extended_inpo = if_else(ttl_income_national_posttax != 0,
ttl_income_disposable_extended / ttl_income_national_posttax, 0),
prop_income_social_collective_property_paid_by_govt_inpo = if_else(ttl_income_national_posttax != 0,
income_social_collective_property_paid_by_govt / ttl_income_national_posttax, 0),
prop_income_social_collective_non_profit_inpo = if_else(ttl_income_national_posttax != 0,
income_social_collective_non_profit / ttl_income_national_posttax, 0),
prop_surplus_primary_public_pension_system_inpo = if_else(ttl_income_national_posttax != 0,
surplus_primary_public_pension_system / ttl_income_national_posttax, 0),
prop_income_investment_pensions_payable_inpo = if_else(ttl_income_national_posttax != 0,
income_investment_pensions_payable / ttl_income_national_posttax, 0),
prop_surplus_primary_government_inpo = if_else(ttl_income_national_posttax != 0,
surplus_primary_government / ttl_income_national_posttax, 0),
recon_prop_ttl_income_national_posttax =
prop_ttl_income_disposable_extended_inpo +
prop_income_social_collective_property_paid_by_govt_inpo +
prop_income_social_collective_non_profit_inpo +
prop_surplus_primary_public_pension_system_inpo +
prop_income_investment_pensions_payable_inpo +
prop_surplus_primary_government_inpo
) %>%
mutate( # ttl_wealth_net
prop_assets_equity = if_else(ttl_wealth_net != 0,
assets_equity / ttl_wealth_net, 0),
prop_assets_currency = if_else(ttl_wealth_net != 0,
assets_currency / ttl_wealth_net, 0),
prop_assets_housing = if_else(ttl_wealth_net != 0,
assets_housing / ttl_wealth_net, 0),
prop_assets_business = if_else(ttl_wealth_net != 0,
assets_business / ttl_wealth_net, 0),
prop_assets_pension_lifeins = if_else(ttl_wealth_net != 0,
assets_pension_lifeins / ttl_wealth_net, 0),
prop_liabilities_household = if_else(ttl_wealth_net != 0,
liabilities_household / ttl_wealth_net, 0),
recon_prop_ttl_wealth_net =
prop_assets_equity +
prop_assets_currency +
prop_assets_housing +
prop_assets_business +
prop_assets_pension_lifeins +
prop_liabilities_household
)
dina_proportioned %>%
select(contains("recon_prop_")) %>%
glimpse()
## Observations: 321,530
## Variables: 11
## $ recon_prop_ttl_income_factor_labor <dbl> 1, 0, 1, 1, 1, 0, 1, 0,…
## $ recon_prop_ttl_income_factor_capital <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ recon_prop_ttl_income_factor <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ recon_prop_ttl_contributions_social_insurance <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ recon_prop_ttl_income_pretax_labor <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ recon_prop_ttl_income_pretax_capital <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ recon_prop_ttl_income_pretax <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ recon_prop_ttl_income_national_factor <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ recon_prop_ttl_income_national_pretax <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ recon_prop_ttl_income_national_posttax <dbl> 1.002893, 1.003051, 1.0…
## $ recon_prop_ttl_wealth_net <dbl> 1, 1, 1, 1, 1, 1, 1, 1,…
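Because glimpse() shows only the first few values of each column, it can help to count how many rows in each recon column are something other than 0 or 1. The following is a rough sketch on invented data: recon_demo is a hypothetical stand-in for dina_proportioned, and the tolerance of 1e-8 is an assumption chosen to absorb floating-point noise, not a value from the original analysis.

```r
library(dplyr)

# Hypothetical stand-in for dina_proportioned with two check columns.
recon_demo <- tibble(
  recon_prop_a = c(1, 0, 1, 1),
  recon_prop_b = c(1.002893, 1, 0.9999999999, 1)
)

tol <- 1e-8  # assumed tolerance for floating-point noise

# For every check column, count rows that are neither ~0 nor ~1.
recon_summary <- recon_demo %>%
  summarise(across(
    starts_with("recon_prop_"),
    ~ sum(!(near(.x, 0, tol = tol) | near(.x, 1, tol = tol)))
  ))

recon_summary
```

Swapping recon_demo for dina_proportioned should give one count per check column. A non-zero count, as for the second column here, points to a real gap rather than rounding.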
End Notes
In the recon section we created effective tax rates for each cohort, which will serve as the result variable in the machine learning chapter. The proportion data created here will serve as the bulk of the numeric predictors. We will not be using national income at all; I’ll explain why when we get there. The last pieces needed are the distributions.
Next up: Creating Income & Wealth Distributions
saveRDS(dina_proportioned, "temp/dina_proportioned.RDS")