Overview
This series of posts is intended to get the reader up speed on how to import, format, and use the economic data of Thomas Piketty, Gabriel Zucman, and Emmanuel Saez. Piketty is most known in the US for his seminal 2014 work Capital in the Twenty-First Century, and Saez and Zucman recently released The Triumph of Injustice: How the Rich Dodge Taxes and How to Make Them Pay.
Summary
As we saw in the previous chapter the variable names are cryptic. I’ve worked with this data a bit and there is a naming convention but it isn’t easy to pick up or very consistent.
There are two types of variables in the source files:
1. Categorical variables for grouping. They will be converted to factors and have their levels recoded to something more meaningful
2. Other numeric variables that need clearer names
In this part we will:
- Download the Data Appendix
- Give factors better level names
- Give other variables more descriptive names
- Correct the population weighting
Download the Data appendix from the economists’ site
The Data appendix contains a lot of useful information and can answer many of the questions that you have as you work with the data. The appendix guided my decisions in renaming variables.
Load libraries and read data
library(tidyverse)
dina_df <- readRDS("temp/Dina_df.RDS")A sampling of the source variable names
If you want to work with these then more power to you. I prefer variable names that are clear and can be selected easily.
sample(names(dina_df), 10)## [1] "ploco" "hwequ" "fainc" "dicao" "prisupgov"
## [6] "rentalmort" "dicsh" "fkhou" "waghealth" "fninc"
Convert grouping variables to factors and rename and/or recode their levels
After doing a little exploration I found that these variables have fixed values and should be considered factors. They have ambiguous codes that will be converted to words.
The authors do provide a household id so that values can be aggregated at the household level which is more traditional approach. They instead put their focus on the individual so I will largely ignore the household id grouping variable.
By the way, here are how many household types data indicates for the selected years.
dina_df %>%
group_by(year) %>%
summarize(household_types = n_distinct(id))## # A tibble: 6 x 2
## year household_types
## <chr> <int>
## 1 1968 19378
## 2 1978 40882
## 3 1988 29413
## 4 1998 32790
## 5 2008 39298
## 6 2018 47682
Let’s get those categorical variables looking better
I’ll use if_else() if there are two levels and recode() if there are more.
group_names <- names(dina_df[5:15])
dina_df2 <- dina_df %>%
mutate_at(group_names, as.character) %>%
mutate(gender = if_else(female == "1", "Female", "Male", "Unknown"),
agegroup_primary = recode(ageprim, "0" = "20-64", "20" = "20-44",
"45" = "45-64", "65" = "65 Plus"),
agegroup_secondary = recode(agesec, "0" = "20-64", "20" = "20-44",
"45" = "45-64", "65" = "65 Plus"),
agegroup_imputed = recode(age, "0" = "20-64", "20" = "20-44",
"45" = "45-64", "65" = "65 Plus"),
labor_status_primary = if_else(oldexm == "1", "Retired", "Working", "Unknown"),
labor_status_secondary = if_else(oldexf == "1", "Retired", "Working", "Unknown"),
labor_status_imputed = if_else(oldexf == "1", "Retired", "Working", "Unknown"),
filing_status = if_else(married == "1", "Married", "Single", "Unknown"),
earner_status = if_else(second == "0", "Primary", "Secondary", "Unknown"),
num_kids = xkidspop,
filer_status = if_else(filer == "1", "Filer", "Not filer", "Unknown")
) %>%
select(-group_names) %>%
select(year, everything())
grouping_vars <- names(dina_df2[136:146])
dina_df3 <- dina_df2 %>%
mutate_at(grouping_vars, as.factor)
glimpse(sample_n(dina_df3[136:146], 10))## Observations: 10
## Variables: 11
## $ gender <fct> Male, Female, Female, Female, Female, Female, …
## $ agegroup_primary <fct> 20-44, 20-64, 65 Plus, 65 Plus, 45-64, 45-64, …
## $ agegroup_secondary <fct> 20-44, 20-64, 65 Plus, 65 Plus, 45-64, 45-64, …
## $ agegroup_imputed <fct> 20-44, 20-64, 65 Plus, 65 Plus, 45-64, 45-64, …
## $ labor_status_primary <fct> Working, Working, Retired, Retired, Working, W…
## $ labor_status_secondary <fct> Working, Working, Working, Retired, Working, W…
## $ labor_status_imputed <fct> Working, Working, Working, Retired, Working, W…
## $ filing_status <fct> Married, Married, Single, Married, Married, Si…
## $ earner_status <fct> Primary, Secondary, Primary, Secondary, Second…
## $ num_kids <fct> 2, 3, 0, 0, 0, 0, 0, 0, 4, 0
## $ filer_status <fct> Filer, Filer, Not filer, Filer, Filer, Not fil…
Renaming the numeric variables
All of the other variables are numeric but they would be helped by more descriptive names. This part is long because all of the remaining 135 variables need their names changed and I want to make it as readable as possible. Note that the ttl_ vars are generally close to the vars that make it up.
I have taken multiple passes at these names and I think the current naming convention is pretty stable. I hope so because changing a name now means changing the reconciliation section I’ll show in a future post. I’ll also have a future post on how to create a simple data dictionary for trouble-shooting. Once created you can use it to manage your column names and any notes you want to make.
The naming convention will allow you to use some dplyr functions like contains(), starts_with(), and ends_with() to get the specific columns you’re looking for.
For instance, if you wanted all the columns regarding wealth you would use the following code
dina_final %>%
select(contains("wealth"))Rename the remaining variables
Names that begin with ttl_ are aggregations provided in the source data of other variables. In the next chapter we reconcile all of the authors’ aggregations with our own calculations to ensure that they match. All other numeric data should be considered granular and can’t be decomposed any further within this data set.
grouping_vars <- names(dina_df2[136:146])
dina_df4 <- dina_df3 %>%
mutate_at(grouping_vars, as.factor) %>%
rename(
# Socio-demographic
tax_unit_id = id,
population_weight = dweght,
population_weight_ptu = dweghttaxu,
# Core Income & Wealth
ttl_income_fiscal_excl_capgains = fiinc,
income_fiscal_wages_pensions = fiwag,
income_fiscal_business = fibus,
income_fiscal_rents = firen,
income_fiscal_interest = fiint,
income_fiscal_dividends = fidiv,
income_fiscal_nonfiler_default = fnps,
ttl_income_fiscal_incl_capgains = fninc,
income_fiscal_capital_gains = fikgi,
ttl_income_factor = fainc,
ttl_income_factor_labor = flinc,
income_factor_labor_wages = flemp,
income_factor_labor_mixed = flmil,
income_factor_labor_sales_taxes = flprl,
ttl_income_factor_capital = fkinc,
income_factor_capital_housing = fkhou,
income_factor_capital_equity = fkequ,
income_factor_capital_interest = fkfix,
income_factor_capital_business = fkbus,
income_factor_capital_pension_benefits = fkpen,
ttl_income_pretax_labor = plinc,
ttl_contributions_social_insurance = plcon,
contributions_social_insurance_pensions = plpco,
income_social_share_labor = plbel,
ttl_income_pretax_capital = pkinc,
income_investment_payable_pensions = pkpen,
income_social_share_capital = pkbek,
ttl_income_disposable_extended = diinc,
income_cash_disposable = dicsh,
income_social_inkind_transfers = inkindinc,
ttl_income_social_collective = colexp,
income_social_collective_property_paid_by_govt = govin,
income_social_collective_non_profit = npinc,
income_social_collective_education = educ,
ttl_income_national_factor = princ,
ttl_income_national_pretax = peinc,
ttl_income_pretax = ptinc,
surplus_primary_public_pension_system = prisupen,
income_investment_pensions_payable = invpen,
ttl_income_national_posttax = poinc,
surplus_primary_private_pension_system = prisupenprivate,
surplus_primary_government = prisupgov,
ttl_wealth_net = hweal,
assets_equity = hwequ,
assets_currency = hwfix,
assets_housing = hwhou,
assets_business = hwbus,
assets_pension_lifeins = hwpen,
liabilities_household = hwdeb,
# Detailed income & wealth
ttl_income_national_posttax_2 = poinc2,
ttl_income_social_benefits = ben,
ttl_income_social_othercash = dicao,
ttl_income_social_inkind = otherkin,
ttl_income_social_share = plben,
ttl_income_social_collective_2 = colexp2,
ttl_taxes_payments_contributions = tax,
ttl_income_pretax_pension = ptnin,
income_pretax_pension_labor = plnin,
income_pretax_pension_capital = pknin,
ttl_contributions_social_insurance_govt = govcontrib,
contributions_social_insurance_govt_pensions_ui_di = ssuicontrib,
contributions_social_insurance_govt_other = othercontrib,
income_pension_taxable = peninc,
income_schedule_net = schcinc,
income_s_corp_net = scorinc,
income_partnership_net = partinc,
income_rental_net = rentinc,
income_estate_trust_net = estinc,
income_royalty_net = rylinc,
income_other_in_agi = othinc,
income_capital_main_house_asset = fkhoumain,
income_capital_rental_house = fkhourent,
income_social_insurance_retirement = ssinc_oa,
contributions_social_pension = wagpen,
contributions_social_insurance_di_ui = ploco,
income_social_insurance_disability = ssinc_di,
income_social_insurance_unemployment = uiinc,
income_social_cash_supplemental = disup,
income_social_cash = dicab,
income_social_cash_veterans = divet,
income_social_taxcredit = dicred,
income_social_othercash_tanf = tanfinc,
income_social_othercash_cashlocalstate = othben,
income_social_health_medicare = medicare,
income_social_health_medicaid = medicaid,
contributions_social_health_wages = waghealth,
income_social_inkind_pell = pell,
income_social_inkind_vethealth = vethealth,
wages_all_filers_taxable = flwag,
wages_all_filers_taxable_supplements = flsup,
benefits_pension = plpbe,
benefits_pension_capital_share = pkpbk,
benefits_pension_labor_share = plpbl,
benefits_di_ui = plobe,
taxes_capital_sales_excise = fkprk,
taxes_property_housing = proprestax,
taxes_property_business = propbustax,
taxes_income_wealth_current = ditax,
taxes_federal_income = ditaf,
taxes_state_income = ditas,
taxes_sales_excise = salestax,
taxes_corporate = corptax,
taxes_estate = estatetax,
payments_interest = fkdeb,
payments_interest_mortgage = fkmor,
payments_interest_nonmortgage = fknmo,
ttl_wealth_personal_net = hwealnokg,
wealth_rental_housing_gross = rentalhome,
wealth_rental_housing_mortgages = rentalmort,
wealth_rental_housing_net = rental,
wealth_main_housing_gross = ownerhome,
wealth_main_housing_mortgages = ownermort,
wealth_main_housing_net = housing,
wealth_partnership = partw,
wealth_sole_proprietor = soleprop,
wealth_s_corp = scorw,
wealth_equity = equity,
wealth_taxable_bond = taxbond,
wealth_muni_bond = muni,
wealth_currency = currency,
wealth_non_mortgage_debt = nonmort,
wealth_household_financial_assets = hwfin,
wealth_household_nonfinancial_assets = hwnfa
)
sort(sample(names(dina_df4), 10))## [1] "agegroup_secondary" "benefits_pension"
## [3] "diwco" "income_factor_labor_mixed"
## [5] "income_fiscal_dividends" "income_other_in_agi"
## [7] "ttl_income_factor_labor" "ttl_income_fiscal_excl_capgains"
## [9] "ttl_wealth_personal_net" "wages_all_filers_taxable"
Some final modifications
For some reason the authors multipled the population by 100,000 to create the population weight. After dividing that back out you should get the official adult population (age 20+) of the US for each year.
dina_df5 <- dina_df4 %>%
select(tax_unit_id, grouping_vars, everything()) %>%
mutate(
year = factor(year),
population_weight = population_weight / 100000,
population_weight_ptu = population_weight_ptu / 100000)
dina_df5 %>%
group_by(year) %>%
summarize(total_population = sum(population_weight)) ## # A tibble: 6 x 2
## year total_population
## <fct> <dbl>
## 1 1968 123507000.
## 2 1978 149142000.
## 3 1988 174344000.
## 4 1998 196795000.
## 5 2008 220884000.
## 6 2018 243014623.
A nearly perfect doubling of the population in the 50 years between 1968 and 2018.
Next Up: Reconciling the professors’ calculations
saveRDS(dina_df5, "temp/dina_renamed.RDS")