Overview
This series of posts is intended to get the reader up to speed on how to import, format, and use the economic data of Thomas Piketty, Gabriel Zucman, and Emmanuel Saez. Piketty is best known in the US for his seminal 2014 work Capital in the Twenty-First Century, and Saez and Zucman recently released The Triumph of Injustice: How the Rich Dodge Taxes and How to Make Them Pay.
In this chapter you will:
- Download the micro-files used in downstream analysis
- Load a subset of the files
- Look at the variable names
Locating the files on the web
The micro-files are currently hosted on Gabriel Zucman’s Distributional National Accounts page: http://gabriel-zucman.eu/usdina/
After clicking on the circled link you will be able to download a zip file of all of the current yearly files. There will probably be a new set released every year and the zip file may change over time.
Unzip the source files to a folder that you can reach from R. I keep mine in a sub-directory of the main R project.
Create a new folder called Dina_subset at the same level as the folder you unzipped to, and copy into it the yearly files you want to work with. You can change the code you’ll see later if you want a different folder structure.
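If you would rather set up the subset folder from R than by hand, something like the following sketch may help. It uses only base R. The helper name copy_dina_subset() and the source folder name Dina_source are assumptions made for illustration, and the usdina<year>.dta pattern follows the names of the yearly files.

```r
# Hypothetical helper (not part of any package): copy the yearly files for
# the requested years from the unzipped folder into the subset folder.
copy_dina_subset <- function(src_dir, dest_dir, years) {
  # Create the destination folder if it does not exist yet
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

  # Build the expected file names, e.g. "usdina1968.dta"
  files <- file.path(src_dir, paste0("usdina", years, ".dta"))

  # Copy only the files that are actually present
  found <- files[file.exists(files)]
  file.copy(found, dest_dir, overwrite = TRUE)

  # Return the names of the copied files without printing them
  invisible(basename(found))
}

# Example call; "Dina_source" is an assumed name for the unzipped folder.
# copy_dina_subset("Dina_source", "Dina_subset", seq(1968, 2018, by = 10))
```

Years with no matching file are skipped silently, so check the returned names if you expected more files than you got.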
These files are large: every year has almost 69,000 records, each of which represents a generic US individual type. For instance, one record could be for a married working man aged 20-64 with x amount of income from various sources and y amount of wealth.
Finally some code
We will use these libraries in this section
library(tidyverse)
library(fs)
library(haven)
Get the filenames in Dina_subset
This uses the fs package’s dir_ls() function
paths <- dir_ls("Dina_subset/")
paths
## Dina_subset/usdina1968.dta Dina_subset/usdina1978.dta
## Dina_subset/usdina1988.dta Dina_subset/usdina1998.dta
## Dina_subset/usdina2008.dta Dina_subset/usdina2018.dta
Import the files into a single dataframe
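This part does a lot in two lines of code. It maps the file paths to the haven package’s read_dta() function, which imports native Stata data files, and stacks the results into one data frame. The .id argument stores each file’s path in a column called filename; extract() then pulls the four-digit year out of that column and replaces it with a column called year. The year is stored as text; pass convert = TRUE to extract() if you want it as a number.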
dina_df <- map_dfr(paths, ~ read_dta(.x), .id = "filename") %>%
  extract(filename, "year", "(\\d{4})")
dim(dina_df)
## [1] 321530 146
The combined data frame has 321,530 records and 146 variables per record, counting the year column added during the import.
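To see what the import step does without loading the large files, it can be tried on toy data. The sketch below writes two tiny stand-in Stata files to a temporary folder and runs the same kind of pipeline on them. The folder name toy_dina, the toy values, and the object names are assumptions for illustration only. Because a temporary path can itself contain digits, the pattern here is tied to the usdina prefix rather than matching any four digits.

```r
library(tidyverse)
library(fs)
library(haven)

# Write two tiny stand-in Stata files to a temporary folder (toy data only)
toy_dir <- path(tempdir(), "toy_dina")
dir_create(toy_dir)
write_dta(tibble(id = 1:2, fiinc = c(10, 20)), path(toy_dir, "usdina1968.dta"))
write_dta(tibble(id = 1:3, fiinc = c(30, 40, 50)), path(toy_dir, "usdina2018.dta"))

# dir_ls() returns a named vector whose names are the paths themselves,
# which is what lets .id record the source file of every row
toy_paths <- dir_ls(toy_dir, glob = "*.dta")

toy_df <- map_dfr(toy_paths, ~ read_dta(.x), .id = "filename") %>%
  # Replace the filename column with the year captured from the file name;
  # convert = TRUE turns the captured text into a number
  extract(filename, "year", "usdina(\\d{4})", convert = TRUE)

toy_df
```

Each row of toy_df carries the year of the file it came from, which is the same shape as dina_df on a much smaller scale.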
Look at the variable names
names(dina_df)
## [1] "year" "id" "dweght" "dweghttaxu"
## [5] "female" "ageprim" "agesec" "age"
## [9] "oldexm" "oldexf" "old" "oldmar"
## [13] "married" "second" "xkidspop" "filer"
## [17] "fiinc" "fninc" "fainc" "flinc"
## [21] "fkinc" "ptinc" "plinc" "pkinc"
## [25] "diinc" "princ" "peinc" "poinc"
## [29] "hweal" "fiwag" "fibus" "firen"
## [33] "fiint" "fidiv" "fikgi" "fnps"
## [37] "peninc" "schcinc" "scorinc" "partinc"
## [41] "rentinc" "estinc" "rylinc" "othinc"
## [45] "flemp" "flmil" "flprl" "fkhou"
## [49] "fkequ" "fkfix" "fkbus" "fkpen"
## [53] "fkdeb" "plcon" "plbel" "pkpen"
## [57] "pkbek" "hwequ" "hwfix" "hwhou"
## [61] "hwbus" "hwpen" "hwdeb" "flwag"
## [65] "flsup" "waghealth" "wagpen" "fkhoumain"
## [69] "fkhourent" "fkmor" "fknmo" "fkprk"
## [73] "proprestax" "propbustax" "rental" "rentalhome"
## [77] "rentalmort" "ownerhome" "ownermort" "housing"
## [81] "partw" "soleprop" "scorw" "equity"
## [85] "taxbond" "muni" "currency" "nonmort"
## [89] "hwealnokg" "hwfin" "hwnfa" "plpco"
## [93] "ploco" "plpbe" "plobe" "plben"
## [97] "plpbl" "plnin" "pkpbk" "pknin"
## [101] "ptnin" "dicsh" "inkindinc" "colexp"
## [105] "govin" "npinc" "prisupen" "invpen"
## [109] "peinck" "peincl" "prisupenprivate" "prisupgov"
## [113] "educ" "colexp2" "poinc2" "tax"
## [117] "ditax" "ditaf" "ditas" "salestax"
## [121] "corptax" "estatetax" "govcontrib" "ssuicontrib"
## [125] "othercontrib" "ssinc_oa" "ssinc_di" "uiinc"
## [129] "ben" "dicab" "dicred" "difoo"
## [133] "disup" "divet" "diwco" "dicao"
## [137] "tanfinc" "othben" "medicare" "medicaid"
## [141] "otherkin" "pell" "vethealth" "corptax0"
## [145] "corptax60" "corptax100"
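With this many variables it can help to search the names rather than read the whole list. A minimal sketch using base R’s grep() on a handful of names copied from the listing above; with the real data you would pass names(dina_df) instead of the short var_names vector, which is only a stand-in.

```r
# A few names copied from the listing; with the real data use names(dina_df)
var_names <- c("fiinc", "hweal", "hwequ", "hwfix", "hwhou", "ptinc", "pkinc")

# Names that start with "hw"
hw_vars <- grep("^hw", var_names, value = TRUE)
hw_vars

# Names that end in "inc"
inc_vars <- grep("inc$", var_names, value = TRUE)
inc_vars
```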
Save RDS for next chapter
saveRDS(dina_df, "Dina_df.RDS")
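A file saved this way can be read back with readRDS(). The round trip is sketched below on a toy data frame and a temporary file, both of which are stand-ins; for the real data the path would presumably be the Dina_df.RDS file saved above.

```r
# Toy stand-in for dina_df
toy_df <- data.frame(year = c(1968L, 2018L), fiinc = c(10, 20))

# Temporary file used only for this demonstration
rds_path <- tempfile(fileext = ".RDS")
saveRDS(toy_df, rds_path)

# Read the object back and check that nothing changed on the way
restored <- readRDS(rds_path)
identical(toy_df, restored)
```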