IPUMS with PostgreSQL, R, and dplyr

This is a brief write-up to document how I am using IPUMS data with PostgreSQL, R, and dplyr. The Integrated Public Use Microdata Series (IPUMS) is a project from the Minnesota Population Center to provide users with access to formatted microdata. IPUMS is a fantastic resource for demographers as it includes many different datasets, including microdata samples from the US and several international censuses, as well as complete-count historical census data for a few countries. I’ve been using IPUMS data for many years, but I have always done so with proprietary statistical software; first SPSS, then Stata. However, I’m trying to open-source my workflow to enhance reproducibility, so I wanted to figure out how to use IPUMS data with open-source software.

I’m presently working on a series of projects that look at internal migration trends in the United States over the past several years. I’m drawing data from a large extract I produced from the IPUMS system, that includes 1 percent samples from the American Community Surveys between 2005 and 2011.

IPUMS provides the data in fixed-width format along with scripts to import the data into SAS, SPSS, and Stata out of the box. However, they also give you the option of downloading the data as a CSV. My CSV, which has around 21 million rows, is just under 2.5 GB in size. Now I could read this CSV directly into R; the fread function from the data.table package takes care of this in 33 seconds. However, I’d prefer not to read the whole file into memory, which is why I’ve turned to the database solution.

The import scripts and codebook provided by IPUMS provide information about the variables in the dataset. I’ve copied and pasted this information into Excel and used Excel to massage it into a SQL query. There are ways to parse these files programmatically but I wanted to go through this process myself. I’m not going to deal with value labels here - just the values themselves. I created the following query, which I ran using pgAdmin III, to create a new table in my database and copy the CSV to it.


create table migration1 (
  year  int,
  datanum   int,
  serial    double precision,
  hhwt  double precision,
  region    int,
  statefip  int,
  metarea   int,
  metaread  int,
  puma  int,
  pumares2mig   int,
  gq    int,
  hhincome  int,
  pernum    int,
  perwt double precision,
  nchild    int,
  nchlt5    int,
  age   int,
  sex   int,
  marst int,
  race  int,
  raced int,
  bpl   int,
  bpld  int,
  hispan    int,
  hispand   int,
  educ  int,
  educd int,
  empstat   int,
  empstatd  int,
  labforce  int,
  occ   int,
  inctot    int,
  migrate1  int,
  migrate1d int,
  migplac1  int,
  migmet1   int,
  migpuma1  int,
  pwstate2  int,
  pwcity    int,
  pwpuma00  int,
  tranwork  int
);

copy migration1 from 'usa_00009.csv' delimiter ',' CSV header;

The import to PostgreSQL took about 207 seconds to complete on my machine.

I now have a PostgreSQL table ready to go for use in R. When I am using R, my go-to package for data preparation is dplyr. If you are unfamiliar with dplyr, it is a new project from Hadley Wickham and his collaborators that includes specialized functions for working with data frames as well as data stored in remote databases. As such, I can use dplyr to connect to my database, work with the data remotely in R, and then pull it into R once I am ready for it.

To get started, I first make a database connection, then create a table in which I retain only those records who have migrated in the past year. Migration geographies in the IPUMS refer to Public Use Microdata Areas, or PUMAs, the finest level of geographic detail available in the Census microdata samples. For reference, a PUMA is a geographic entity built from Census tracts that contains 100,000 to 200,000 people, in general. In New York City, PUMAs are very similar to the city’s “community districts;” whereas in rural areas PUMAs may contain multiple counties.

library(dplyr)

mig_db <- src_postgres(dbname = 'ipums_migration', user = 'postgres', password = 'postgres')

migrants <- tbl(mig_db, sql('SELECT *
                            FROM migration1 
                            WHERE migrate1d > 10'))

migrants

## Source: postgres 9.3.4 [postgres@localhost:5432/ipums_migration]
## From: <derived table> [?? x 41]
## 
##    year datanum  serial hhwt region statefip metarea metaread puma
## 1  2010       1 1397564   62     41       56       0        0  400
## 2  2010       1 1397564   62     41       56       0        0  400
## 3  2010       1 1397566   44     41       56       0        0  200
## 4  2010       1 1397568   36     41       56       0        0  200
## 5  2010       1 1397568   36     41       56       0        0  200
## 6  2010       1 1397572  147     41       56       0        0  200
## 7  2010       1 1397572  147     41       56       0        0  200
## 8  2010       1 1397584  142     41       56       0        0  400
## 9  2010       1 1397589  123     41       56       0        0  300
## 10 2010       1 1397605  163     41       56       0        0  100
## ..  ...     ...     ...  ...    ...      ...     ...      ...  ...
## Variables not shown: pumares2mig (int), gq (int), hhincome (int), pernum
##   (int), perwt (dbl), nchild (int), nchlt5 (int), age (int), sex (int),
##   marst (int), race (int), raced (int), bpl (int), bpld (int), hispan
##   (int), hispand (int), educ (int), educd (int), empstat (int), empstatd
##   (int), labforce (int), occ (int), inctot (int), migrate1 (int),
##   migrate1d (int), migplac1 (int), migmet1 (int), migpuma1 (int), pwstate2
##   (int), pwcity (int), pwpuma00 (int), tranwork (int)

With dplyr’s database connection, when I ask R for the table, it doesn’t return the whole table (which I wouldn’t want) but rather a summary of that table, which is seen above. This is very useful for understanding what your data looks like coming from the database.

Now, I can use dplyr to perform some operations on my table. For this example, I want to know how internal migrants’ destinations vary by educational level. First, I want to further subset my data, as I am only going to look at internal migrants from the 2011 sample aged 25 and up.

migrant_subset <- filter(migrants, year == 2011 & age >= 25 & migrate1d != 40)

migrant_subset

## Source: postgres 9.3.4 [postgres@localhost:5432/ipums_migration]
## From: <derived table> [?? x 41]
## Filter: year == 2011 & age >= 25 & migrate1d != 40 
## 
##    year datanum serial hhwt region statefip metarea metaread puma
## 1  2011       1     15   16     32        1     516     5160 2400
## 2  2011       1     15   16     32        1     516     5160 2400
## 3  2011       1     17   26     32        1     516     5160 2300
## 4  2011       1     35   25     32        1       0        0  700
## 5  2011       1     52   49     32        1       0        0 1100
## 6  2011       1     52   49     32        1       0        0 1100
## 7  2011       1     56   27     32        1       0        0 2100
## 8  2011       1     57    8     32        1       0        0 2500
## 9  2011       1     60   33     32        1       0        0  500
## 10 2011       1     62  106     32        1     516     5160 2300
## ..  ...     ...    ...  ...    ...      ...     ...      ...  ...
## Variables not shown: pumares2mig (int), gq (int), hhincome (int), pernum
##   (int), perwt (dbl), nchild (int), nchlt5 (int), age (int), sex (int),
##   marst (int), race (int), raced (int), bpl (int), bpld (int), hispan
##   (int), hispand (int), educ (int), educd (int), empstat (int), empstatd
##   (int), labforce (int), occ (int), inctot (int), migrate1 (int),
##   migrate1d (int), migplac1 (int), migmet1 (int), migpuma1 (int), pwstate2
##   (int), pwcity (int), pwpuma00 (int), tranwork (int)

Now that I have my subsetted table, I’m going to pull the data into R with the collect function. I want to do this as I’m having issues performing operations on derived columns while the data are in the database - it looks like this is something the dplyr developers are working on, as per this GitHub issue.

migrant_df <- collect(migrant_subset)

head(migrant_df)

## Source: local data frame [6 x 41]
## 
##   year datanum serial hhwt region statefip metarea metaread puma
## 1 2011       1     15   16     32        1     516     5160 2400
## 2 2011       1     15   16     32        1     516     5160 2400
## 3 2011       1     17   26     32        1     516     5160 2300
## 4 2011       1     35   25     32        1       0        0  700
## 5 2011       1     52   49     32        1       0        0 1100
## 6 2011       1     52   49     32        1       0        0 1100
## Variables not shown: pumares2mig (int), gq (int), hhincome (int), pernum
##   (int), perwt (dbl), nchild (int), nchlt5 (int), age (int), sex (int),
##   marst (int), race (int), raced (int), bpl (int), bpld (int), hispan
##   (int), hispand (int), educ (int), educd (int), empstat (int), empstatd
##   (int), labforce (int), occ (int), inctot (int), migrate1 (int),
##   migrate1d (int), migplac1 (int), migmet1 (int), migpuma1 (int), pwstate2
##   (int), pwcity (int), pwpuma00 (int), tranwork (int)

Now, I want to perform some operations on my data frame to do my analysis. dplyr makes this straightforward with chaining using the %>% operator, which allows me to perform multiple operations on the data frame at once.

In the code that follows, I recode education into a four-category variable using mutate, I group by my newly recoded education variable and migration type, I sum up the perwt (person weight) column to get representative counts, calculate group percentages and define a couple ordered factors, then select the columns that I want. The results are as follows.

library(xtable)

migration_by_educ <- migrant_df %>%
  mutate(
    educ_general = ifelse(educd > 113, "Graduate degree", 
                      ifelse(educd > 100 & educd < 114, "BA/BS degree", 
                         ifelse(educd > 61 & educd < 101, "High school", 
                              ifelse(educd < 62, "Less than high school", "other"))))) %>%
  group_by(educ_general, migrate1d) %>%
  summarize(totnum = sum(perwt)) %>%
  mutate(percent = totnum / sum(totnum),  
         migfactor = factor(migrate1d, 
                            levels = c(23, 24, 31, 32), 
                            labels = c("Within PUMA", 
                                       "Between PUMAs", 
                                       "Contiguous states", 
                                       "Non-contiguous states")), 
         educfactor = factor(educ_general, 
                             levels = c("Less than high school", 
                                        "High school", 
                                        "BA/BS degree", 
                                        "Graduate degree"))) %>%
  arrange(educfactor, migfactor) %>%
  ungroup() %>%
  select(educfactor, migfactor, percent)

print(xtable(migration_by_educ), type = "html")

	educfactor	migfactor	percent
1	Less than high school	Within PUMA	0.71
2	Less than high school	Between PUMAs	0.19
3	Less than high school	Contiguous states	0.03
4	Less than high school	Non-contiguous states	0.06
5	High school	Within PUMA	0.66
6	High school	Between PUMAs	0.19
7	High school	Contiguous states	0.05
8	High school	Non-contiguous states	0.10
9	BA/BS degree	Within PUMA	0.58
10	BA/BS degree	Between PUMAs	0.20
11	BA/BS degree	Contiguous states	0.06
12	BA/BS degree	Non-contiguous states	0.15
13	Graduate degree	Within PUMA	0.52
14	Graduate degree	Between PUMAs	0.20
15	Graduate degree	Contiguous states	0.08
16	Graduate degree	Non-contiguous states	0.20

I can also visualize the comparisons with ggplot2:

library(ggplot2)
library(scales)

ggplot(migration_by_educ, aes(x = educfactor, y = percent, fill = migfactor)) + 
  geom_bar(position = "stack", stat = "identity") + 
  scale_fill_brewer(palette = "Set1") + 
  labs(list(x = "Education category", 
            y = "Percent", 
            fill = "Type of migration")) + 
  theme(legend.position = "bottom") + 
  scale_y_continuous(labels = percent)

plot of chunk unnamed-chunk-5

The general results are unsurprising; migrants with more education tend to move longer distances. This is likely due to the higher incomes of educated migrants, the different types of labor markets that they participate in, or many other factors that have to do with education (“cosmopolitanism,” perhaps).

I do think some of the numeric divides are noteworthy, however. We see that the share of migrants at each education level that moved “Between PUMAs” within the same state is nearly identical - right around 19 or 20 percent. The divides, then, are most visible when it comes to the shortest or longest moves. 71 percent of migrants with less than a high school education moved within their PUMA of residence; less than 10 percent left their state. Conversely, only 52 percent of graduate-educated migrants stayed within their PUMA, whereas 28 percent moved between states. This reflects recent studies (like here and here) that suggest that the American population is increasingly sorting itself by education and occupation.

Certainly, there are many more questions to be answered. For example, how do these trends vary by occupation? By race? Where, specifically, are these migrants going… and what places are they leaving? The great thing about IPUMS is that answers to these questions are found in the dataset, which is made easier with the help of dplyr. I should note that any more complicated calculations (e.g. anything that requires the calculation of standard errors) should be done with R’s survey package, which can handle IPUMS’ complex sample design. For data munging and exploratory analysis, however, dplyr is the way to go.

I’d welcome questions, comments, or feedback; I’m available by email or on Twitter.

IPUMS with PostgreSQL, R, and dplyr

Kyle Walker

July 22, 2014