Download Raw PUMS Files

Rather than using data.census.gov, which provides pretabulated values and tables from the Census Bureau, you might want to work with individual-level data.

1. Accessing the Data

The US Census Bureau releases American Community Survey (ACS) Public Use Microdata Sample (PUMS) files for free so that researchers can use the data in a rawer form (i.e., without being pretabulated). These data contain records at both the household and individual levels; however, the smallest geography provided is the Public Use Microdata Area (PUMA), so households and individuals cannot be placed in any smaller geography.

2. Types of Estimates

PUMS data come in 1-year and 5-year estimates. For this example, we will be using the 1-year estimates for 2019. Data can be obtained from data.census.gov or from the Census FTP site. We use 1-year estimates here because they let us examine trends year by year rather than pooled over five years.

3. File Names

Because the URLs are so similar to one another, we can build the URL strings programmatically and automate downloading data from the Census FTP site.

3.0 Zip File Naming Structure

The naming structure of the zip files is as follows:

\[ \text{csv\_} + \text{h (household) or p (person)} + \text{state abbreviation} + \text{.zip} \]

3.1 State Records

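R ships with built-in constants for state names and abbreviations (state.name and state.abb). state.abb is a character vector ordered alphabetically by state name. It covers the 50 states only, so the District of Columbia and Puerto Rico are not included.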

## STATE ABBREVIATIONS
state.abb[1:5]
## [1] "AL" "AK" "AZ" "AR" "CA"
## LOWER CASE
tolower(state.abb[1:5])
## [1] "al" "ak" "az" "ar" "ca"
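
As a small sketch of what state.abb does and does not contain, the lines below check its length, extend it by hand, and look up a full state name. The name all_abb is hypothetical, and adding "DC" and "PR" assumes the Census site also publishes files under those abbreviations, which is worth confirming before relying on it.

```r
## state.abb HOLDS THE 50 STATES ONLY
length(state.abb)
## [1] 50
"DC" %in% state.abb
## [1] FALSE

## HYPOTHETICAL EXTENSION: ADD DC AND PUERTO RICO BY HAND
all_abb <- tolower(c(state.abb, "DC", "PR"))
length(all_abb)
## [1] 52

## LOOK UP THE FULL NAME FOR AN ABBREVIATION
state.name[match("CA", state.abb)]
## [1] "California"
```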

3.2 Household and Person Records

There are two types of PUMS records: household level and person level. People are nested within households, so SERIALNO is used to identify households and to link the two files.

We start by writing functions to create strings for file names. File names are distinguished with an “h” or a “p” for household and person level data records, respectively.

## FUNCTION FOR FILE NAMES 

## HOUSEHOLD LEVEL
house<-function(state_abb){
  paste("csv_h", state_abb, ".zip", sep="")
}

house("ca")
## [1] "csv_hca.zip"
## PERSON LEVEL
person<-function(state_abb){
  paste("csv_p", state_abb, ".zip", sep="")
}

person("ca")
## [1] "csv_pca.zip"
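
Once both files for a state have been read into R, the link through SERIALNO described above might look like the following minimal sketch. It uses toy data frames rather than real PUMS files; only SERIALNO comes from the text, while hh_size and age are hypothetical column names chosen for illustration.

```r
## TOY STAND-INS FOR A HOUSEHOLD FILE AND A PERSON FILE
## (hh_size AND age ARE HYPOTHETICAL COLUMNS)
households <- data.frame(
  SERIALNO = c("A1", "A2"),
  hh_size  = c(2, 1)
)
persons <- data.frame(
  SERIALNO = c("A1", "A1", "A2"),
  age      = c(34, 36, 71)
)

## ONE ROW PER PERSON, WITH HOUSEHOLD COLUMNS REPEATED FOR EACH MEMBER
linked <- merge(persons, households, by = "SERIALNO")
nrow(linked)
## [1] 3
```

Merging from the person side keeps one row per person, which is usually the shape needed for individual-level analysis.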

3.3 Yearly Folders

The URLs and folder structure have changed over time. The following function builds the folder portion of the URL for a given year.

## FUNCTION TO CREATE FOLDER STRING
folder<-function(year){
  ## SAME ACROSS ALL YEARS
  base<-"https://www2.census.gov/programs-surveys/acs/"
  
  ## DIFFERENT BY YEAR
  if(year==2020){
    ## 2020 1-YEAR DATA ARE IN AN EXPERIMENTAL FOLDER
    out<-paste(base, "experimental/2020/data/pums/1-Year/", sep="")
  } else if(year>2006 && year<2020){
    ## 2007-2019 HAVE A 1-YEAR SUBFOLDER
    out<-paste(base, "data/pums/", year, "/1-Year/", sep="")
  } else if(year<=2006){
    ## 2006 AND EARLIER HAVE NO 1-YEAR SUBFOLDER
    out<-paste(base, "data/pums/", year, "/", sep="")
  } else {
    ## YEARS AFTER 2020 ARE NOT COVERED BY THIS FUNCTION
    stop("No folder rule defined for year ", year)
  }
  out
}
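
To see how the folder rules and the file-name rules fit together, here is a sketch of a single helper that returns the complete URL for one year, state, and record type. The name pums_url and its arguments are hypothetical and not part of the original code; the folder logic simply mirrors the function above.

```r
## HYPOTHETICAL HELPER: FULL URL FOR ONE YEAR, STATE AND RECORD TYPE
pums_url <- function(year, state_abb, type = c("h", "p")) {
  type <- match.arg(type)  # "h" = HOUSEHOLD, "p" = PERSON
  base <- "https://www2.census.gov/programs-surveys/acs/"
  if (year == 2020) {
    dir_part <- "experimental/2020/data/pums/1-Year/"
  } else if (year > 2006 && year < 2020) {
    dir_part <- paste0("data/pums/", year, "/1-Year/")
  } else if (year <= 2006) {
    dir_part <- paste0("data/pums/", year, "/")
  } else {
    stop("No folder rule defined for year ", year)
  }
  paste0(base, dir_part, "csv_", type, tolower(state_abb), ".zip")
}

pums_url(2019, "CA", "h")
## [1] "https://www2.census.gov/programs-surveys/acs/data/pums/2019/1-Year/csv_hca.zip"
pums_url(2005, "ak", "p")
## [1] "https://www2.census.gov/programs-surveys/acs/data/pums/2005/csv_pak.zip"
```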

4. Automating File Download and Extraction

4.1 Creating Folders

  • Start by creating a folder for your PUMS data.

  • Then create folders for each year. This is important because files have similar names across years and would otherwise be indistinguishable.

# CREATE DIRECTORY
pums_folder<-"~/Desktop/PUMS/"
for(i in 2005:2020){
  # recursive=TRUE ALSO CREATES THE PUMS FOLDER IF IT IS MISSING
  dir.create(paste(pums_folder, i, sep=""), recursive=TRUE, showWarnings=FALSE)
}

4.2 Download and Extract ALL Years and ALL States

st_abb<-tolower(state.abb[1:50])
pums_folder<-"~/Desktop/PUMS/"

for(i in 2005:2020){ # LOOP FOR YEARS
  year_dir<-path.expand(paste(pums_folder, i, sep=""))
  this_folder<-folder(i)
  
  for(j in 1:50){ # LOOP FOR STATES
    for(k in 1:2){ # 1 = HOUSE, 2 = PERSON
      if(k==1){
        zip.name<-house(st_abb[j])
      } else {
        zip.name<-person(st_abb[j])
      }
      zip.url<-paste(this_folder, zip.name, sep="")
      #print(zip.url)
      
      # FULL LOCAL PATH OF THE ZIP FILE
      zip.path<-file.path(year_dir, zip.name)
      
      # mode="wb" KEEPS THE ZIP FILE BINARY ON WINDOWS
      download.file(zip.url, destfile=zip.path, mode="wb")
      
      # EXTRACT INTO THE SAME YEAR FOLDER
      unzip(zip.path, exdir=year_dir)
    }
  }
}
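
A loop this long can stop partway through if a single download fails. One possible way to make it more forgiving is to wrap download.file() in a small helper that skips files already on disk and returns FALSE on an error instead of halting. The helper name safe_download is hypothetical; this is a sketch of the idea, not part of the original workflow.

```r
## HYPOTHETICAL HELPER: DOWNLOAD ONE FILE WITHOUT STOPPING THE LOOP ON FAILURE
safe_download <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(TRUE)  # ALREADY DOWNLOADED ON AN EARLIER RUN
  }
  ok <- tryCatch({
    status <- download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
    status == 0   # download.file RETURNS 0 ON SUCCESS
  }, error = function(e) FALSE)
  if (!ok && file.exists(destfile)) {
    file.remove(destfile)  # DROP PARTIAL FILES SO A RERUN RETRIES THEM
  }
  ok
}
```

Inside the loop, the call to download.file() could then be replaced by something like if(safe_download(zip.url, zip.path)) unzip(zip.path, exdir=year_dir). For large state files it may also help to raise R's download timeout first, for example with options(timeout = 600).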

Note that the extracted files are named with the state FIPS code rather than the state abbreviation.
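
To check the extracted file names for yourself, a short sketch like the one below lists the CSV files in one year's folder. The helper name extracted_csvs is hypothetical, and the example path assumes the folder layout used in this walkthrough.

```r
## HYPOTHETICAL HELPER: LIST THE CSV FILES EXTRACTED INTO ONE YEAR'S FOLDER
extracted_csvs <- function(year_dir) {
  list.files(year_dir, pattern = "\\.csv$", ignore.case = TRUE)
}

## EXAMPLE CALL, ASSUMING THE FOLDER LAYOUT USED ABOVE:
## extracted_csvs("~/Desktop/PUMS/2019")
```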