Rather than using data.census.gov, which provides cross-tabulated values and tables from the Census Bureau, you might want to work with individual-level data.
The US Census Bureau releases ACS Public Use Microdata Sample (PUMS) files for free so that researchers can use the data in a rawer form (i.e., without being pretabulated). These data contain records at both the household and individual levels; however, the smallest level of geography given is the Public Use Microdata Area (PUMA), so households and individuals cannot be tied to smaller geographies.
Information about PUMS: https://www.census.gov/programs-surveys/acs/microdata.html
Information about PUMAs: https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html
PUMS data come in 1-year and 5-year estimates. For this example, we will be using the 1-year estimates for 2019. Data can be obtained from data.census.gov or an FTP site. We are using 1-year estimates here in order to examine trends at a more granular level.
Please read about the difference between 1-year and 5-year estimates here: https://www.census.gov/programs-surveys/acs/guidance/estimates.html
Data can be downloaded from the FTP site here: https://www.census.gov/programs-surveys/acs/microdata/access.2019.html
Data dictionary: https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2019.pdf
Because the URLs are so similar, we can build the URL strings programmatically and automate downloading data from the Census FTP site.
The naming structure of the zip files is as follows:
\[ \texttt{csv\_} + \text{h (household) or p (person)} + \text{state abbreviation} + \texttt{.zip} \]
R has built-in constants for state names and abbreviations. Note that state.abb is a character vector of the 50 state abbreviations, ordered alphabetically by state name (not by abbreviation).
## STATE ABBREVIATIONS
state.abb[1:5]
## [1] "AL" "AK" "AZ" "AR" "CA"
## LOWER CASE
tolower(state.abb[1:5])
## [1] "al" "ak" "az" "ar" "ca"
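Because state.abb and state.name are parallel vectors, match() can translate a full state name into its abbreviation. The following is a minimal sketch; the helper name abb_for() is made up for illustration and is not part of base R.

```r
# Hypothetical helper: look up lower-case abbreviations from full state names.
abb_for <- function(state_names) {
  idx <- match(state_names, state.name)  # position of each name in state.name
  tolower(state.abb[idx])                # same position in state.abb
}

abb_for(c("California", "Texas"))
## [1] "ca" "tx"
```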
There are two types of PUMS records: household level and person level. People are nested within households, so SERIALNO is used to identify households and link records across these files.
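To make the linkage concrete, here is a small sketch that joins person records to household records on SERIALNO. The two data frames are invented toy data, not real PUMS records; NP, SPORDER, and AGEP are used as example column names.

```r
# Toy household records: one row per household (values are invented).
households <- data.frame(
  SERIALNO = c("H001", "H002"),
  NP       = c(2, 1),            # number of persons in the household
  stringsAsFactors = FALSE
)

# Toy person records: one row per person, keyed to a household by SERIALNO.
persons <- data.frame(
  SERIALNO = c("H001", "H001", "H002"),
  SPORDER  = c(1, 2, 1),         # person number within the household
  AGEP     = c(34, 36, 71),      # age
  stringsAsFactors = FALSE
)

# Attach household attributes to every person in that household.
linked <- merge(persons, households, by = "SERIALNO")
linked
```

Each person row now carries the attributes of its household, so the result has one row per person.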
We start by writing functions to create strings for file names. File names are distinguished with an “h” or a “p” for household and person level data records, respectively.
## FUNCTION FOR FILE NAMES
## HOUSEHOLD LEVEL
house <- function(state_abb) {
  paste("csv_h", state_abb, ".zip", sep = "")
}
house("ca")
## [1] "csv_hca.zip"
## PERSON LEVEL
person <- function(state_abb) {
  paste("csv_p", state_abb, ".zip", sep = "")
}
person("ca")
## [1] "csv_pca.zip"
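Since the two functions differ only in one letter, they could also be folded into a single function with a level argument. This is just one possible sketch; pums_file() is a hypothetical name, and the rest of this example does not rely on it.

```r
# Hypothetical generalisation of house() and person():
# level is "h" for household records or "p" for person records.
pums_file <- function(state_abb, level = c("h", "p")) {
  level <- match.arg(level)  # rejects anything other than "h" or "p"
  paste0("csv_", level, tolower(state_abb), ".zip")
}

pums_file("CA", "p")
## [1] "csv_pca.zip"
pums_file(state.abb[1:3], "h")
## [1] "csv_hal.zip" "csv_hak.zip" "csv_haz.zip"
```

Because paste0() is vectorized, a whole vector of abbreviations can be passed at once.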
URLs and naming structure have changed over time. The following function builds the folder portion of the URL for a given year.
## FUNCTION TO CREATE FOLDER STRING
folder <- function(year) {
  ## SAME ACROSS ALL YEARS
  base <- "https://www2.census.gov/programs-surveys/acs/"
  ## DIFFERENT BY YEAR
  if (year == 2020) {
    ## 2020: 1-YEAR FILES ARE IN THE EXPERIMENTAL FOLDER
    out <- paste(base, "experimental/2020/data/pums/1-Year/", sep = "")
  } else if (year > 2006 && year < 2020) {
    ## 2007-2019: YEAR FOLDER WITH A 1-Year SUBFOLDER
    out <- paste(base, "data/pums/", year, "/1-Year/", sep = "")
  } else if (year <= 2006) {
    ## 2006 AND EARLIER: YEAR FOLDER ONLY
    out <- paste(base, "data/pums/", year, "/", sep = "")
  } else {
    ## YEARS AFTER 2020 ARE NOT COVERED BY THIS FUNCTION
    stop("folder() only handles years up to 2020")
  }
  out
}
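As a quick check of how the folder string and the file name fit together, the sketch below assembles a complete download URL in one step. The helper pums_url() is hypothetical; it copies the year rules from folder() so that the block runs on its own.

```r
# Hypothetical helper combining the folder rule and the file-name rule.
# The year rules are copied from folder() so this block is self-contained.
pums_url <- function(year, state_abb, level = "h") {
  base <- "https://www2.census.gov/programs-surveys/acs/"
  sub <- if (year == 2020) {
    "experimental/2020/data/pums/1-Year/"
  } else if (year > 2006 && year < 2020) {
    paste0("data/pums/", year, "/1-Year/")
  } else if (year <= 2006) {
    paste0("data/pums/", year, "/")
  } else {
    stop("year not covered by this sketch")
  }
  paste0(base, sub, "csv_", level, state_abb, ".zip")
}

pums_url(2019, "ca", "p")
## [1] "https://www2.census.gov/programs-surveys/acs/data/pums/2019/1-Year/csv_pca.zip"
pums_url(2006, "ca", "h")
## [1] "https://www2.census.gov/programs-surveys/acs/data/pums/2006/csv_hca.zip"
```

Printing a few URLs like this and opening them in a browser is a cheap way to confirm the pattern before starting a long download loop.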
Start by creating a folder for your PUMS data. Then create a folder for each year. This is important because files have similar names across years and would otherwise be indistinguishable.
# CREATE DIRECTORY
pums_folder <- "~/Desktop/PUMS/"
for (i in 2005:2020) {
  ## recursive = TRUE ALSO CREATES THE PARENT PUMS FOLDER IF IT IS MISSING
  dir.create(paste(pums_folder, i, sep = ""), recursive = TRUE)
}
st_abb <- tolower(state.abb[1:50])
pums_folder <- "~/Desktop/PUMS/"
for (i in 2005:2020) { # LOOP FOR YEARS
  ## FOLDER ON DISK AND FOLDER ON THE CENSUS SITE FOR THIS YEAR
  year_dir <- path.expand(paste(pums_folder, i, sep = ""))
  this_folder <- folder(i)
  for (j in 1:50) { # LOOP FOR STATES
    for (k in 1:2) { # 1 = HOUSE, 2 = PERSON
      if (k == 1) {
        zip.file <- house(st_abb[j])
      } else {
        zip.file <- person(st_abb[j])
      }
      zip.url <- paste(this_folder, zip.file, sep = "")
      # print(zip.url)
      zip.combine <- file.path(year_dir, zip.file)
      ## mode = "wb" KEEPS THE ZIP FILE INTACT ON WINDOWS
      download.file(zip.url, destfile = zip.combine, mode = "wb")
      ## EXTRACT INTO THE SAME YEAR FOLDER
      unzip(zip.combine, exdir = year_dir)
    }
  }
}
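A loop like this makes many large requests, and a single failed download stops it with an error. One possible safeguard, sketched below, is a small wrapper that skips files already on disk, raises the download timeout (the base R default is 60 seconds), and returns FALSE instead of stopping when a download fails. The name safe_download() and the 600-second default are assumptions chosen for illustration.

```r
# Hypothetical wrapper around download.file():
# returns TRUE if destfile is available afterwards, FALSE if the download failed.
safe_download <- function(url, destfile, timeout_sec = 600) {
  if (file.exists(destfile)) {
    return(TRUE)                          # already downloaded, skip
  }
  old <- options(timeout = timeout_sec)   # raise the timeout for large files
  on.exit(options(old), add = TRUE)       # restore the previous setting
  ok <- tryCatch({
    download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
    TRUE
  }, error = function(e) FALSE, warning = function(w) FALSE)
  if (!ok && file.exists(destfile)) {
    file.remove(destfile)                 # drop any partial file
  }
  ok
}
```

Inside the loop, the call to download.file() could then be replaced with a check such as if (safe_download(zip.url, zip.combine)) before unzipping, so that a missing file for one state or year does not interrupt the rest.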
Note that when files are extracted they are labeled with the state FIPS code rather than the abbreviation.
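Because the extracted names differ from the zip names, it can help to list what actually landed in a year's folder rather than guess the file names. A minimal sketch, with extracted_csvs() as a made-up helper name:

```r
# Hypothetical helper: list the CSV files found in one year's folder.
extracted_csvs <- function(year_dir) {
  list.files(year_dir, pattern = "\\.csv$", ignore.case = TRUE,
             full.names = TRUE)
}
```

For example, extracted_csvs("~/Desktop/PUMS/2019") would return the full paths of the CSV files extracted for 2019, assuming the folder layout used above.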