1 Introduction

1.1 Migration

To understand the value of the IRS Migration data, it’s helpful to understand how migration is measured and why migration data are so notoriously difficult to collect. So, a quick primer on measuring migration:


1.1.1 Why is migration difficult to collect?

  • Migration involves two locations, an origin (i) and destination (j).
    • Migration can be measured or reported at either destination or origin and these efforts are usually not coordinated.
    • Measuring migration is longitudinal in nature: it involves at least two time points, to observe a migrant in the origin at time T and in the destination at time T+1.
    • Origins and destinations are interested in measuring migration for different reasons.
    • Following migrants to new locations is difficult.
  • Defining migration is often political when it involves crossing borders, and there are many different definitions.
  • Defining migration often depends on how long a migrant has been in the destination.


1.1.2 Ways to measure migration

Two commonly used measures are migrant stocks and migrant flows.

1.1.2.1 Migrant Stocks

Migrant stocks are typically a cross-sectional measure of the count of migrants or foreign-born population at a given time. They are reported by destination country. They are easy to measure, define, and compare across locations and are more readily available.


1.1.2.2 Migrant Flows

Migrant flows try to capture the dynamic nature of migration as it involves two physical locations - an origin (i) and a destination (j). Measuring flows is difficult for three reasons: migrants are hard to follow across locations, the flow could be reported at either location, and origin and destination countries may have differing definitions of what counts as a flow. Together, these make flows difficult to collect and compare.

Migration Flows, source: Mathew Hauer publication
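As a rough illustration of how flows are usually organized, they can be stored as an origin-by-destination matrix, where row totals give outflows and column totals give inflows. The sketch below uses invented numbers for three hypothetical places (A, B, C); none of these values come from the IRS data.

```r
# Toy origin-by-destination flow matrix (all values are invented).
# Rows are origins (i), columns are destinations (j).
places <- c("A", "B", "C")
flows <- matrix(c( 0, 10,  5,
                  20,  0, 15,
                   8,  2,  0),
                nrow = 3, byrow = TRUE,
                dimnames = list(origin = places, destination = places))

outflows <- rowSums(flows)      # everyone leaving each origin
inflows  <- colSums(flows)      # everyone arriving at each destination
net      <- inflows - outflows  # positive means a net gain of migrants

print(flows)
print(data.frame(outflows, inflows, net))
```

Because every move leaves one place and arrives at another, the net values across all places sum to zero.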


1.2 IRS Migration Data

The Internal Revenue Service’s (IRS) county-to-county migration data provide annual estimates of county-to-county migration, available since 1990 (tax filing year 1991). They are one of the few sources of county-to-county migration data other than the decennial U.S. Census and the American Community Survey (ACS).

Migration flows are estimated from year-to-year address changes reported on individual income tax returns filed with the IRS.


These data contain:

  • Number of returns filed -> which approximates the number of households that migrated
    • Migration is estimated as outflows and inflows.
  • Number of personal exemptions claimed -> which approximates the number of individuals
  • Total adjusted gross income (starting in filing year 1995)
  • Approximately 95-98% of the tax filing universe is represented in this dataset, making it one of the largest and most regularly produced sources of internal migration data in the United States, with one of the longest time series available.
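As a rough sketch of how the return and exemption counts relate, dividing exemptions by returns gives an approximate number of persons per migrating household. The data frame below is hypothetical: the column names returns and exemptions and all of the values are assumptions made for illustration, not the layout of the IRS files.

```r
# Hypothetical flow records: column names and values are invented.
toy <- data.frame(
  origin      = c(1001, 1001),
  destination = c(1021, 1047),
  returns     = c(40, 25),    # proxy for migrating households
  exemptions  = c(101, 68)    # proxy for migrating individuals
)

# Exemptions per return gives a rough average household size per flow.
toy$persons_per_household <- toy$exemptions / toy$returns
print(toy)
```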


1.2.1 Limitations

  • The data cover only people who appear on U.S. federal income tax returns. They therefore underrepresent the very poor and older populations, who are less likely to file income tax returns or to be included as dependents on others’ tax returns. They also miss the small percentage of tax returns filed after late September of the filing year.

  • IRS data do not contain sociodemographic information, unlike the ACS whose migration flow data contain important covariates.

  • Starting in 2011, the IRS changed the way it produces these data. Not only does this create a break in time-series comparability, but researchers have found that the post-2010 data should not be used until the IRS clarifies the changes that have been made. Read more about this issue here.

  • VERY challenging to use! The raw data are spread across 2,000+ data files in seven different legacy formats, which represents “a serious burden for migration scholars” who need to process them into a usable format.


1.2.2 Why not just use ACS?

Below, Hauer and Byars compare the IRS data with the ACS data for studying migration, showing that the coverage and availability of the IRS data make them very advantageous to use.

Table from Hauer & Byars 2019, IRS county-to-county migration data, 1990‒2010

Bonus material: If you want to see how these data are used in research, here’s an example of a paper that used these data to understand where people went after Hurricane Katrina.



1.3 Hauer & Byars county-to-county flows

Mathew Hauer and James Byars processed these data for 1990-2010 to produce a single, flat data file containing county-to-county migration flows, with the aim of encouraging and facilitating the use of these data.




2 How to use

2.1 Download data

To use these data, download the data file directly from GitHub.


# download migration data from Github
url <- "https://raw.githubusercontent.com/mathewhauer/IRS-migration-data/master/DATA-PROCESSED/county_migration_data.txt"
download.file(url, "irs_migration_data.txt")

# load in data
df <- read.table(file = "irs_migration_data.txt", # reference the file name
                 header = TRUE, # header tells R the data have column names
                 sep = "") # sep = "" splits on any whitespace, including tabs
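To see what the sep argument is doing, here is a small self-contained sketch that writes a tiny tab-separated file to a temporary location and reads it back two ways. The file contents are toy values and the object names (tmp, toy_ws, toy_tab) are made up for this example.

```r
# Write a tiny tab-separated file to a temporary location (toy values).
tmp <- tempfile(fileext = ".txt")
writeLines(c("origin\tdestination\tX1990",
             "1001\t1001\t26703",
             "1001\t1021\t101"),
           tmp)

# sep = "" tells read.table() to split on any run of whitespace,
# which includes tabs; sep = "\t" would split on tabs only.
toy_ws  <- read.table(tmp, header = TRUE, sep = "")
toy_tab <- read.table(tmp, header = TRUE, sep = "\t")

# Inspect the parsed structure and compare the two results.
str(toy_ws)
print(all.equal(toy_ws, toy_tab))
unlink(tmp)
```

The two settings would differ for a file whose fields contain spaces (for example, county names), in which case sep = "\t" is likely the safer choice.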


2.1.1 What do the data look like?

Each row of the dataset represents one origin-destination county combination (n > 160,000).

The variables are:

  • ORIGIN - Refers to the 5-digit FIPS code for the origin county of the migrants. (Leading zeros are dropped when the codes are read as numbers, so 01001 appears as 1001.)

  • DESTINATION - Refers to the 5-digit FIPS code for the destination county of the migrants.

  • YEAR COLUMN (X1990:X2010) - Refers to the number of migrants who moved from ORIGIN to DESTINATION in a given year. (note: counties were coded 99999 if all migration flows contained fewer than 10 tax filers.)
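If the codes are read as integers, a code such as 01001 is stored as 1001. When a zero-padded code is needed (for example, to join with other FIPS tables), one possible approach is sketched below. The vector of codes is a toy example, and the names fips_int, fips_chr, and is_suppressed are made up for illustration.

```r
# Toy codes as they might appear after reading the file as integers.
fips_int <- c(1001, 1021, 53033, 99999)

# Restore the 5-digit, zero-padded form used in most FIPS lookups.
fips_chr <- sprintf("%05d", as.integer(fips_int))

# The first two digits identify the state, the last three the county.
state_fips  <- substr(fips_chr, 1, 2)
county_fips <- substr(fips_chr, 3, 5)

# 99999 is the suppression code used in these data, not a real county.
is_suppressed <- fips_int == 99999

print(data.frame(fips_chr, state_fips, county_fips, is_suppressed))
```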


# show the first few rows of data
head(df[1:10]) #selected first 10 columns for easy viewing
##   origin destination X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997
## 1   1001        1001 26703 27278 27704 28677 29118 31910 32952 33446
## 2   1001        1003     0     0    27     0     0    23     0     0
## 3   1001        1013     0     0     0     0     0     0     0     0
## 4   1001        1021   101    94   112    79    77   104   103   133
## 5   1001        1041     0     0     0     0     0     0     0     0
## 6   1001        1047    68   100    96    82    65    78   115   118


In the glimpse of the data above, we can see that in 1990, 101 people migrated to 1021 (Chilton County, AL) from 1001 (Autauga County, AL). Notice that the first row represents people who did not migrate (where origin and destination are both 1001).
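To make the distinction between stayers and movers concrete, the sketch below re-types the six rows printed above (1990 column only) and separates the two groups. It covers only those six rows, so the mover count is not the county's full out-migration total.

```r
# The six rows printed above, typed in by hand (1990 column only).
toy <- data.frame(
  origin      = c(1001, 1001, 1001, 1001, 1001, 1001),
  destination = c(1001, 1003, 1013, 1021, 1041, 1047),
  X1990       = c(26703, 0, 0, 101, 0, 68)
)

# Stayers have the same origin and destination code.
stayers <- toy$origin == toy$destination

n_stayed <- sum(toy$X1990[stayers])
n_moved  <- sum(toy$X1990[!stayers])  # movers among these six rows only

print(c(stayed = n_stayed, moved = n_moved))
```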


2.1.2 Making FIPS codes more useful

Next, it’s helpful to get the county and state names for the FIPS codes. This will make it easier to interpret the data. Learn more about this helpful FIPS data here.


# dplyr provides %>%, mutate(), the joins, select(), rename(), and case_when()
library(dplyr)

# download helpful fips table for county and state names
url <- "https://raw.githubusercontent.com/ChuckConnell/articles/master/fips2county.tsv"
download.file(url, "fips_table.tsv")

# load in data
df_fips <- read.csv("fips_table.tsv", 
                       sep='\t', 
                       header=TRUE, 
                       encoding='latin-1')

# create two new columns with fips code for joining with IRS data 
df_fips <- df_fips %>%
  mutate(destination = CountyFIPS,
         origin = CountyFIPS)

# create dataframe with origin and destination county and state names
df_names <- df %>% 
  right_join(df_fips, by="origin") %>%  # merge data matching on origin
  select(-c(STATE_COUNTY, destination.y, 
            CountyFIPS, StateFIPS, CountyFIPS_3)) %>% # don't keep all vars
  rename(origin_county = CountyName,  #  rename origin county var
         origin_state = StateName,    #  rename origin state var
         origin_abbr = StateAbbr,     #  rename origin state abbr var
         destination = destination.x)  %>% 
  right_join(df_fips, by="destination") %>%  # repeat the process for destination!
  select(-c(STATE_COUNTY, origin.y,
            CountyFIPS, StateFIPS, CountyFIPS_3)) %>% 
  rename(destination_county = CountyName, 
         destination_state = StateName,
         destination_abbr = StateAbbr,     
         origin = origin.x) %>%
  mutate(migrant = case_when(
    origin == destination ~ "no",
    TRUE ~ "yes"
  ))


Now the data have additional columns for county and state names, which we can use to understand migration flows.


# show the first few rows of data
df_names %>% select(origin, destination, X1990,
              origin_county, origin_state, origin_abbr,
              destination_county, destination_state, destination_abbr) %>%
              head() 
##   origin destination X1990 origin_county origin_state origin_abbr
## 1   1001        1001 26703       Autauga      Alabama          AL
## 2   1001        1003     0       Autauga      Alabama          AL
## 3   1001        1013     0       Autauga      Alabama          AL
## 4   1001        1021   101       Autauga      Alabama          AL
## 5   1001        1041     0       Autauga      Alabama          AL
## 6   1001        1047    68       Autauga      Alabama          AL
##   destination_county destination_state destination_abbr
## 1            Autauga           Alabama               AL
## 2            Baldwin           Alabama               AL
## 3             Butler           Alabama               AL
## 4            Chilton           Alabama               AL
## 5           Crenshaw           Alabama               AL
## 6             Dallas           Alabama               AL
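For readers who prefer base R, the same kind of name lookup can be sketched with match(), which also makes it easy to spot codes that have no entry in the lookup table. This is an alternative illustration on toy data, not the code used above; the code 88888 is a made-up value that deliberately has no match.

```r
# Toy flow rows and a toy lookup table (county names taken from the output above).
toy_flows <- data.frame(
  origin      = c(1001, 1001, 1001),
  destination = c(1003, 1021, 88888),   # 88888 is a made-up, unmatched code
  X1990       = c(0, 101, 5)
)
toy_fips <- data.frame(
  CountyFIPS = c(1001, 1003, 1021),
  CountyName = c("Autauga", "Baldwin", "Chilton"),
  stringsAsFactors = FALSE
)

# match() returns the row of toy_fips for each code, or NA when there is none.
toy_flows$origin_county <-
  toy_fips$CountyName[match(toy_flows$origin, toy_fips$CountyFIPS)]
toy_flows$destination_county <-
  toy_fips$CountyName[match(toy_flows$destination, toy_fips$CountyFIPS)]

# Rows whose destination code was not found in the lookup table.
unmatched <- toy_flows[is.na(toy_flows$destination_county), ]
print(toy_flows)
print(unmatched)
```

Checking for unmatched rows like this is a cheap way to catch codes such as the 99999 suppression value before plotting.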


2.2 Visualize the data


Chord plots are an effective way to visualize migration flows. But there are so many origin-destination combinations that we’ll need to pare down the data.


2.2.1 County flow in Washington

First let’s look at county-to-county flows within Washington in 1990. To do this, we’ll need to filter for rows where both the origin and the destination are Washington counties. This is where it is handy that we have added columns for the state abbreviation, so we can filter on those.


# this package is from Guy Abel and does lots of migration stuff
library(migest)


# create chord plot
# NOTE chord plots look at the first three columns for: 
    # origin, destination, migration flow estimate

# select only counties in WA
df_WA_1990 <- df_names %>% 
  select(origin_county, destination_county, 
         X1990, 
         origin_abbr, destination_abbr, migrant) %>%
  filter(origin_abbr=="WA" & destination_abbr=="WA")

df_WA_2010 <- df_names %>% 
  select(origin_county, destination_county, 
         X2010, 
         origin_abbr, destination_abbr, migrant) %>%
  filter(origin_abbr=="WA" & destination_abbr=="WA")

# plot the 1990 flows within WA
df_WA_1990 %>% mig_chord()


There is too much information in this figure to understand it well. Additionally, non-migrants are still in the dataset. If we take out the non-migrants, we’ll be able to see migrant flows more clearly.


# take out non-migrants
df_WA_1990 %>%
  filter(migrant=="yes") %>%
  mig_chord()
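When a chord plot is hard to read, it can help to look at the same numbers as an origin-by-destination table. The sketch below uses base R's xtabs() on a toy data frame with the same three leading columns the chord plot expects. The flow values are invented; only the county names are real.

```r
# Toy long-format flows between three Washington counties (invented values).
toy <- data.frame(
  origin_county      = c("King", "King", "Pierce", "Pierce", "Snohomish", "Snohomish"),
  destination_county = c("Pierce", "Snohomish", "King", "Snohomish", "King", "Pierce"),
  X1990              = c(900, 1200, 800, 150, 1100, 100)
)

# xtabs() sums the flow column within each origin-destination cell.
od_matrix <- xtabs(X1990 ~ origin_county + destination_county, data = toy)
print(od_matrix)

# Largest flows first, a quick way to spot the dominant pairs.
print(toy[order(-toy$X1990), ])
```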


This is still a little messy because there are so many counties, so let’s look at high-flow origin-destination pairs and compare across years. To do this, I filter out any flows where fewer than 1,000 people moved. Also, if we sort the flows, the figure becomes easier to read.


# only keep flows >= 1,000

#first for 1990
df_WA_1990 %>%
  filter(migrant=="yes",
         X1990>=1000) %>%
  arrange(X1990) %>%
  mig_chord(., 
            axis_size = 0.5) # control axis text size

# repeat for 2010
df_WA_2010 %>%
  filter(migrant=="yes",
         X2010>=1000) %>%
  arrange(X2010) %>%
  mig_chord(., 
            axis_size = 0.5)
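Alongside the two plots, a small table of changes can make the year-to-year comparison more precise. The sketch below flags which pairs pass the 1,000-mover threshold in each year and computes the change between years. The values are invented, and the column names big_1990, big_2010, change, and pct_change are made up for this example.

```r
# Toy wide-format flows with both years (invented values).
toy <- data.frame(
  origin_county      = c("King", "Pierce", "Snohomish", "King"),
  destination_county = c("Pierce", "King", "King", "Snohomish"),
  X1990              = c(900, 1500, 1100, 2000),
  X2010              = c(1300, 1400, 950, 2600)
)

# Flag pairs that pass the 1,000-mover threshold in each year.
toy$big_1990 <- toy$X1990 >= 1000
toy$big_2010 <- toy$X2010 >= 1000

# Absolute and percentage change between the two years.
toy$change     <- toy$X2010 - toy$X1990
toy$pct_change <- round(100 * toy$change / toy$X1990, 1)

print(toy)
```

A pair can cross the threshold in one year but not the other, which is worth keeping in mind when comparing two filtered plots side by side.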


2.2.2 State flows


We can also aggregate these data for state flows. To do this, I use the aggregate function in R to sum the 1990 and 2010 columns (X1990 and X2010) by two variables: origin_state and destination_state.

Below, I plot the flows in and out of Washington state. I also left out state-to-state flows of 300 or fewer people to or from Washington (for example, no one moved between West Virginia and Washington in 1990 or 2010…).


# aggregate data for summary of flows by state, keeping only flows to or from WA
df_state <- df_names %>%
  filter(origin_state!=destination_state, # drop stayers and within-state moves
         origin_abbr=="WA" | destination_abbr=="WA") %>% # keep flows to or from WA
  select(origin_state, destination_state, X1990:X2010) %>%
  aggregate(cbind(X1990, X2010) ~ origin_state + destination_state,FUN=sum)

# plot 1990
df_state %>% 
  arrange(X1990) %>% # sort data in order of flow estimates
  filter(X1990>300) %>% # keep larger flows
  mig_chord(., axis_size = .5)

# plot 2010 (do this by making 2010 the third column)
df_state %>% 
  arrange(X2010) %>% # sort data in order of flow estimates
  select(-X1990) %>%
  filter(X2010>300) %>%
  mig_chord(., axis_size = .5)
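The state-level table also lends itself to a net migration summary, that is, inflows minus outflows for each partner state. Below is a minimal base R sketch on toy data. The flow values are invented, and the names partner, outflow, inflow, and net_to_WA are made up for illustration.

```r
# Toy state-to-state flows involving Washington (invented values).
toy <- data.frame(
  origin_state      = c("Washington", "Washington", "Oregon", "California"),
  destination_state = c("Oregon", "California", "Washington", "Washington"),
  X1990             = c(5000, 7000, 6000, 9500)
)

# Split into flows leaving and entering Washington, keyed by partner state.
out <- toy[toy$origin_state == "Washington", ]
inn <- toy[toy$destination_state == "Washington", ]

net <- merge(
  data.frame(partner = out$destination_state, outflow = out$X1990),
  data.frame(partner = inn$origin_state,      inflow  = inn$X1990),
  by = "partner", all = TRUE
)

# Partners that appear in only one direction get a zero for the other.
net$outflow[is.na(net$outflow)] <- 0
net$inflow[is.na(net$inflow)]   <- 0

net$net_to_WA <- net$inflow - net$outflow  # positive means WA gained
print(net)
```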


3 Final thoughts

  • The unreliability of the 2011-present data is discouraging. Additionally, Mathew Hauer’s GitHub repository does not process those years, so they remain fairly inaccessible.
  • These data can be mapped! Although the ‘flow’ itself cannot be visualized on a map, you could choose to map just outflows or inflows.
  • International migration flow data are relatively rare and often incomparable across sources.
  • The UN Department of Economic and Social Affairs (UN DESA) produces estimates, but currently only has data for 45 countries.
  • Read more about migration flows!
  • Guy Abel has figured out clever ways to compute flows based on migration stock data.
  • Don’t forget to check out the main source of this tutorial: Mathew Hauer’s GitHub, website, and paper describing this work.
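On the mapping idea above: one way to prepare the data is to collapse the flows into a single row per county, with total outflows and inflows, keyed by a zero-padded FIPS code that could then be joined to county boundary data. The sketch below uses invented flow values for three Washington county codes; the object names out_tot, in_tot, and county_totals are made up for this example, and the boundary data itself is not shown.

```r
# Toy county-to-county flows (invented values; stayers already removed).
toy <- data.frame(
  origin      = c(53033, 53033, 53053, 53061),
  destination = c(53053, 53061, 53033, 53033),
  X1990       = c(900, 1200, 800, 1100)
)

# One row per county: total out-migrants and total in-migrants.
out_tot <- aggregate(X1990 ~ origin, data = toy, FUN = sum)
in_tot  <- aggregate(X1990 ~ destination, data = toy, FUN = sum)
names(out_tot) <- c("fips", "outflow")
names(in_tot)  <- c("fips", "inflow")

county_totals <- merge(out_tot, in_tot, by = "fips", all = TRUE)

# Zero-pad the code so it matches the usual 5-digit FIPS format.
county_totals$fips <- sprintf("%05d", as.integer(county_totals$fips))
print(county_totals)
```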