QCEW Open Data Access

Summary

This is a short write-up on using R to download and parse data from the Bureau of Labor Statistics (BLS) QCEW Open Data Access data sources. This summary was originally requested by Drake Gibson, an economist at the Bureau of Labor Statistics, as an idea of how one might download and visualize data from the QCEW Open Data Access files.

Downlaod and Unzip Data

Let’s begin by downloading and unzipping one of the zip files from the QCEW NIACS-Based files.

# Set your working directory, this may be different on your computer.
setwd("~/Documents/R")

# Downloads 2015 Quarterly by Area zip file.
download.file("http://www.bls.gov/cew/data/files/2015/csv/2015_qtrly_by_area.zip", "my_tempfile.zip")

# Unzip file and set new working directory to the newly unzipped folder.
unzip("my_tempfile.zip")
setwd("~/Documents/R/2015.q1-q4.by_area")

Decide what Data to Use

The data packed into this particular zip file is in excess of 8 GB. Since the R language leverages a computer’s internal memory to store data, we’ll have to subset some of the .csv files before loading them into R. If you have access to a server-class machine with > 32 GB. of RAM, you may be able to load the entire data set.

For our puroses here, I’m assuming these analysis will be done on a normal workstation with 8 to 16 GB. of RAM. This isn’t normally a problem since most analysis are done on only a single vector of the data.

In this case, I want to look at the QCEW statistics for each county in Alabama. To save machine memory, we want to subset the files to load only county data for the state of Alabama.

# Get a list of filenames then search that list for files with "Alabama" in the file name.
fileNames <- list.files()
fileNames <- fileNames[grep("\\Alabama\\b", fileNames)]

# Get rid of state-wide data in our file list.
fileNames <- fileNames[grep("\\Statewide\\b", fileNames, invert=TRUE)]

# Use an apply function to iterate over list and load all .csv files to data frames.
# Can also use a for-loop, but lapply is faster in R.
my_data <- lapply(fileNames, read.csv)

Combine and Aggregate Data

The above R code leaves us with a large list of data frames in our R environment. In order to do any useful analysis, we will have to combine these data into a single data frame.

library(dplyr)
# Use dplyr to bind the data frames.
# Can also use do.call("rbind", my_data) but dplyr is faster.
alabama.df <- dplyr::bind_rows(my_data)

Now we can run a couple of diagnostics on the newly created data frame.

# Check the memory usage. We're only using 63.5 MB. as opposed to 8.6 GB.!
object.size(alabama.df)

## 63557280 bytes

# Check out the first few rows of the new data frame.
names(alabama.df)

##  [1] "area_fips"                      "own_code"                      
##  [3] "industry_code"                  "agglvl_code"                   
##  [5] "size_code"                      "year"                          
##  [7] "qtr"                            "disclosure_code"               
##  [9] "area_title"                     "own_title"                     
## [11] "industry_title"                 "agglvl_title"                  
## [13] "size_title"                     "qtrly_estabs_count"            
## [15] "month1_emplvl"                  "month2_emplvl"                 
## [17] "month3_emplvl"                  "total_qtrly_wages"             
## [19] "taxable_qtrly_wages"            "qtrly_contributions"           
## [21] "avg_wkly_wage"                  "lq_disclosure_code"            
## [23] "lq_qtrly_estabs_count"          "lq_month1_emplvl"              
## [25] "lq_month2_emplvl"               "lq_month3_emplvl"              
## [27] "lq_total_qtrly_wages"           "lq_taxable_qtrly_wages"        
## [29] "lq_qtrly_contributions"         "lq_avg_wkly_wage"              
## [31] "oty_disclosure_code"            "oty_qtrly_estabs_count_chg"    
## [33] "oty_qtrly_estabs_count_pct_chg" "oty_month1_emplvl_chg"         
## [35] "oty_month1_emplvl_pct"          "oty_month2_emplvl_chg"         
## [37] "oty_month2_emplvl_pct"          "oty_month3_emplvl_chg"         
## [39] "oty_month3_emplvl_pct"          "oty_total_qtrly_wages_chg"     
## [41] "oty_total_qtrly_wages_pct"      "oty_taxable_qtrly_wages_chg"   
## [43] "oty_taxable_qtrly_wages_chg.1"  "oty_qtrly_contributions_chg"   
## [45] "oty_qtrly_contributions_pct"    "oty_avg_wkly_wage_chg"         
## [47] "oty_avg_wkly_wage_pct"

Visualizing the Data

Now that the data are in a single data frame, we can subset and visualize as we would normally in R. For example, if we want to look at the over-the-year average weekly wage percentage for all industries in Alabama for Q4 2015, we could run the following subset.

The end goal of the following code will be making a county map of Alabama.

library(dplyr, warn.conflicts = FALSE)
map.df <- subset(alabama.df, year==2015 & qtr==4 & industry_title=="Total, all industries" 
       & own_title=="Total Covered",
       select=c("area_fips", "oty_avg_wkly_wage_pct")) %>%
       rename(fips = area_fips)

A little data manipulation is in order. The original .csv files are missing the leading zero in the FIPS codes. This is a common occurrence when importing to R. To solve, we have to add the leading zeros back, so the FIPS codes will match with our map data.

# Add leading zeros to fips codes. R stips out any leading zeros when it imports the .csv.
map.df$fips <- formatC(map.df$fips, width = 5, format = "d", flag = "0")

At this point we can map using the map_county() function from the blscrapeR package, which contains 2015 FIPS codes for each U.S. county.

library(blscrapeR)
blscrapeR::bls_map_county(map_data=map.df, fill_rate = "oty_avg_wkly_wage_pct", stateName = "Alabama")

Another method would be to map using the ggplot2 package to allow for a more customizable solution.

library(ggplot2)
library(blscrapeR) # Using blascrapR here just for the shape file that has 2015 FIPS codes.

# Get the shape file U.S. counties
us_map <- blscrapeR::county_map_data

# Subset the us_map by FIPS codes to only Alabama, State FIPS 01.
# Note: Fips codes are listed as "id" in this data set.
us_map <- subset(us_map, substr(us_map$id, 1,2)=="01")

# Insert some color breaks for the values.
map.df$rate_d <- cut(map.df$oty_avg_wkly_wage_pct, breaks =  c(seq(0, 10, by = 2), 35))

# Plot
ggplot() +
    geom_map(data=us_map, map=us_map,
             aes(x=long, y=lat, map_id=id, group=group),
             fill="#ffffff", color="#0e0e0e", size=0.15) +
    geom_map(data=map.df, map=us_map, aes_string(map_id="fips", fill="rate_d"),
             color="#0e0e0e", size=0.15) +
    scale_fill_brewer()+
    coord_equal() +
    theme(axis.line=element_blank(),
          axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks=element_blank(),
          axis.title.x=element_blank(),
          axis.title.y=element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(),
          legend.title=element_blank())