Overview

This series of posts is intended to get the reader up speed on how to import, format, and use the economic data of Thomas Piketty, Gabriel Zucman, and Emmanuel Saez. Piketty is most known in the US for his seminal 2014 work Capital in the Twenty-First Century, and Saez and Zucman recently released The Triumph of Injustice: How the Rich Dodge Taxes and How to Make Them Pay.

Summary

As we saw in the previous chapter the variable names are cryptic. I’ve worked with this data a bit and there is a naming convention but it isn’t easy to pick up or very consistent.

There are two types of variables in the source files:
1. Categorical variables for grouping. They will be converted to factors and have their levels recoded to something more meaningful
2. Other numeric variables that need clearer names

In this part we will:

Download the Data appendix from the economists’ site

The Data appendix contains a lot of useful information and can answer many of the questions that you have as you work with the data. The appendix guided my decisions in renaming variables.
Data appendix

Load libraries and read data

library(tidyverse)

dina_df <- readRDS("temp/Dina_df.RDS")

A sampling of the source variable names

If you want to work with these then more power to you. I prefer variable names that are clear and can be selected easily.

sample(names(dina_df), 10)
##  [1] "ploco"      "hwequ"      "fainc"      "dicao"      "prisupgov" 
##  [6] "rentalmort" "dicsh"      "fkhou"      "waghealth"  "fninc"

Convert grouping variables to factors and rename and/or recode their levels

After doing a little exploration I found that these variables have fixed values and should be considered factors. They have ambiguous codes that will be converted to words.

The authors do provide a household id so that values can be aggregated at the household level which is more traditional approach. They instead put their focus on the individual so I will largely ignore the household id grouping variable.

By the way, here are how many household types data indicates for the selected years.

dina_df %>%
  group_by(year) %>%
  summarize(household_types = n_distinct(id))
## # A tibble: 6 x 2
##   year  household_types
##   <chr>           <int>
## 1 1968            19378
## 2 1978            40882
## 3 1988            29413
## 4 1998            32790
## 5 2008            39298
## 6 2018            47682

Let’s get those categorical variables looking better

I’ll use if_else() if there are two levels and recode() if there are more.

group_names <- names(dina_df[5:15])

dina_df2 <- dina_df %>%
  mutate_at(group_names, as.character) %>%
  mutate(gender = if_else(female == "1", "Female", "Male", "Unknown"),
         agegroup_primary = recode(ageprim, "0" = "20-64", "20" = "20-44", 
                                   "45" = "45-64", "65" = "65 Plus"),
         agegroup_secondary = recode(agesec, "0" = "20-64", "20" = "20-44", 
                                     "45" = "45-64", "65" = "65 Plus"), 
         agegroup_imputed = recode(age, "0" = "20-64", "20" = "20-44", 
                                   "45" = "45-64", "65" = "65 Plus"), 
         labor_status_primary = if_else(oldexm == "1", "Retired", "Working", "Unknown"),
         labor_status_secondary = if_else(oldexf == "1", "Retired", "Working", "Unknown"),
         labor_status_imputed = if_else(oldexf == "1", "Retired", "Working", "Unknown"), 
         filing_status = if_else(married == "1", "Married", "Single", "Unknown"), 
         earner_status = if_else(second == "0", "Primary", "Secondary", "Unknown"), 
         num_kids = xkidspop, 
         filer_status = if_else(filer == "1", "Filer", "Not filer", "Unknown")
  ) %>%
  select(-group_names) %>%
  select(year, everything())

grouping_vars <- names(dina_df2[136:146])

dina_df3 <- dina_df2 %>%
  mutate_at(grouping_vars, as.factor)

glimpse(sample_n(dina_df3[136:146], 10))
## Observations: 10
## Variables: 11
## $ gender                 <fct> Male, Female, Female, Female, Female, Female, …
## $ agegroup_primary       <fct> 20-44, 20-64, 65 Plus, 65 Plus, 45-64, 45-64, …
## $ agegroup_secondary     <fct> 20-44, 20-64, 65 Plus, 65 Plus, 45-64, 45-64, …
## $ agegroup_imputed       <fct> 20-44, 20-64, 65 Plus, 65 Plus, 45-64, 45-64, …
## $ labor_status_primary   <fct> Working, Working, Retired, Retired, Working, W…
## $ labor_status_secondary <fct> Working, Working, Working, Retired, Working, W…
## $ labor_status_imputed   <fct> Working, Working, Working, Retired, Working, W…
## $ filing_status          <fct> Married, Married, Single, Married, Married, Si…
## $ earner_status          <fct> Primary, Secondary, Primary, Secondary, Second…
## $ num_kids               <fct> 2, 3, 0, 0, 0, 0, 0, 0, 4, 0
## $ filer_status           <fct> Filer, Filer, Not filer, Filer, Filer, Not fil…

Renaming the numeric variables

All of the other variables are numeric but they would be helped by more descriptive names. This part is long because all of the remaining 135 variables need their names changed and I want to make it as readable as possible. Note that the ttl_ vars are generally close to the vars that make it up.

I have taken multiple passes at these names and I think the current naming convention is pretty stable. I hope so because changing a name now means changing the reconciliation section I’ll show in a future post. I’ll also have a future post on how to create a simple data dictionary for trouble-shooting. Once created you can use it to manage your column names and any notes you want to make.

The naming convention will allow you to use some dplyr functions like contains(), starts_with(), and ends_with() to get the specific columns you’re looking for.

For instance, if you wanted all the columns regarding wealth you would use the following code

dina_final %>%
  select(contains("wealth"))

Rename the remaining variables

Names that begin with ttl_ are aggregations provided in the source data of other variables. In the next chapter we reconcile all of the authors’ aggregations with our own calculations to ensure that they match. All other numeric data should be considered granular and can’t be decomposed any further within this data set.

grouping_vars <- names(dina_df2[136:146])

dina_df4 <- dina_df3 %>%
  mutate_at(grouping_vars, as.factor) %>%
  rename(
    # Socio-demographic
    tax_unit_id = id,
    population_weight = dweght,
    population_weight_ptu = dweghttaxu,
    
    # Core Income & Wealth
    ttl_income_fiscal_excl_capgains = fiinc,
    income_fiscal_wages_pensions = fiwag,
    income_fiscal_business = fibus,
    income_fiscal_rents = firen,
    income_fiscal_interest = fiint,
    income_fiscal_dividends = fidiv,
    income_fiscal_nonfiler_default = fnps,
    
    ttl_income_fiscal_incl_capgains = fninc,
    income_fiscal_capital_gains = fikgi,
    
    ttl_income_factor = fainc,
    ttl_income_factor_labor = flinc,
    income_factor_labor_wages = flemp,
    income_factor_labor_mixed = flmil,
    income_factor_labor_sales_taxes = flprl,
    ttl_income_factor_capital = fkinc,
    income_factor_capital_housing = fkhou,
    income_factor_capital_equity = fkequ,
    income_factor_capital_interest = fkfix,
    income_factor_capital_business = fkbus,
    income_factor_capital_pension_benefits = fkpen,
    
    ttl_income_pretax_labor = plinc,
    ttl_contributions_social_insurance = plcon,
    contributions_social_insurance_pensions = plpco,
    income_social_share_labor = plbel,
    ttl_income_pretax_capital = pkinc,
    income_investment_payable_pensions = pkpen,
    income_social_share_capital = pkbek,
    
    ttl_income_disposable_extended = diinc,
    income_cash_disposable = dicsh,
    income_social_inkind_transfers = inkindinc, 
    ttl_income_social_collective = colexp,
    income_social_collective_property_paid_by_govt = govin, 
    income_social_collective_non_profit = npinc, 
    income_social_collective_education = educ, 
    
    ttl_income_national_factor = princ,
    
    ttl_income_national_pretax = peinc,
    ttl_income_pretax = ptinc,
    surplus_primary_public_pension_system = prisupen,
    income_investment_pensions_payable = invpen,
    
    ttl_income_national_posttax = poinc,
    surplus_primary_private_pension_system = prisupenprivate,
    surplus_primary_government = prisupgov,
    
    ttl_wealth_net = hweal,
    assets_equity = hwequ,
    assets_currency = hwfix,
    assets_housing = hwhou,
    assets_business = hwbus,
    assets_pension_lifeins = hwpen,
    liabilities_household = hwdeb,
    
    # Detailed income & wealth
    ttl_income_national_posttax_2 = poinc2,
    ttl_income_social_benefits = ben,
    ttl_income_social_othercash = dicao,
    ttl_income_social_inkind = otherkin,
    ttl_income_social_share = plben,
    ttl_income_social_collective_2 = colexp2,
    ttl_taxes_payments_contributions = tax,
    
    ttl_income_pretax_pension = ptnin, 
    income_pretax_pension_labor = plnin,
    income_pretax_pension_capital = pknin,
    
    ttl_contributions_social_insurance_govt = govcontrib, 
    contributions_social_insurance_govt_pensions_ui_di = ssuicontrib,
    contributions_social_insurance_govt_other = othercontrib,
    
    income_pension_taxable = peninc,
    income_schedule_net = schcinc,
    income_s_corp_net = scorinc,
    income_partnership_net = partinc,
    income_rental_net = rentinc,
    income_estate_trust_net = estinc,
    income_royalty_net = rylinc,
    income_other_in_agi = othinc,
    income_capital_main_house_asset = fkhoumain,
    income_capital_rental_house = fkhourent,
    income_social_insurance_retirement = ssinc_oa,
    contributions_social_pension = wagpen,
    contributions_social_insurance_di_ui = ploco,
    income_social_insurance_disability = ssinc_di,
    income_social_insurance_unemployment = uiinc,
    income_social_cash_supplemental = disup,
    income_social_cash = dicab, 
    income_social_cash_veterans = divet,
    income_social_taxcredit = dicred,
    income_social_othercash_tanf = tanfinc,
    income_social_othercash_cashlocalstate = othben,
    income_social_health_medicare = medicare,
    income_social_health_medicaid = medicaid,
    contributions_social_health_wages = waghealth,
    income_social_inkind_pell = pell,
    income_social_inkind_vethealth = vethealth,

    wages_all_filers_taxable = flwag,
    wages_all_filers_taxable_supplements = flsup,
    
    benefits_pension = plpbe,
    benefits_pension_capital_share = pkpbk, 
    benefits_pension_labor_share = plpbl,
    benefits_di_ui = plobe,
    
    taxes_capital_sales_excise = fkprk,
    taxes_property_housing = proprestax,
    taxes_property_business = propbustax,
    taxes_income_wealth_current = ditax,
    taxes_federal_income = ditaf,
    taxes_state_income = ditas,
    taxes_sales_excise = salestax, 
    taxes_corporate = corptax,
    taxes_estate = estatetax,
    
    payments_interest = fkdeb,
    payments_interest_mortgage = fkmor,
    payments_interest_nonmortgage = fknmo,
    
    ttl_wealth_personal_net = hwealnokg,
    wealth_rental_housing_gross = rentalhome,
    wealth_rental_housing_mortgages = rentalmort,
    wealth_rental_housing_net = rental,
    wealth_main_housing_gross = ownerhome,
    wealth_main_housing_mortgages = ownermort, 
    wealth_main_housing_net = housing,
    wealth_partnership = partw, 
    wealth_sole_proprietor = soleprop, 
    wealth_s_corp = scorw,
    wealth_equity = equity, 
    wealth_taxable_bond = taxbond, 
    wealth_muni_bond = muni,
    wealth_currency = currency, 
    wealth_non_mortgage_debt = nonmort,
    wealth_household_financial_assets = hwfin,
    wealth_household_nonfinancial_assets = hwnfa
  ) 

sort(sample(names(dina_df4), 10))
##  [1] "agegroup_secondary"              "benefits_pension"               
##  [3] "diwco"                           "income_factor_labor_mixed"      
##  [5] "income_fiscal_dividends"         "income_other_in_agi"            
##  [7] "ttl_income_factor_labor"         "ttl_income_fiscal_excl_capgains"
##  [9] "ttl_wealth_personal_net"         "wages_all_filers_taxable"

Some final modifications

For some reason the authors multipled the population by 100,000 to create the population weight. After dividing that back out you should get the official adult population (age 20+) of the US for each year.

dina_df5 <- dina_df4 %>%
  select(tax_unit_id, grouping_vars, everything()) %>%
  mutate(
    year = factor(year),
    population_weight = population_weight / 100000,
    population_weight_ptu = population_weight_ptu / 100000)

dina_df5 %>%
  group_by(year) %>%
  summarize(total_population = sum(population_weight)) 
## # A tibble: 6 x 2
##   year  total_population
##   <fct>            <dbl>
## 1 1968        123507000.
## 2 1978        149142000.
## 3 1988        174344000.
## 4 1998        196795000.
## 5 2008        220884000.
## 6 2018        243014623.

A nearly perfect doubling of the population in the 50 years between 1968 and 2018.

Next Up: Reconciling the professors’ calculations

saveRDS(dina_df5, "temp/dina_renamed.RDS")

END