1 Overview: Federal Reserve Industrial Production

The United States Federal Reserve System (“the Fed”) publishes statistical data about the economy as part of its overall mission to supervise and regulate monetary policy, the banking system and monitor economic conditions. One such statistical publication is the G17 statistical release about Industrial Production and Capacity Utilization. Released monthly since 1947 by the Fed, the G17 data set is a critical information source for global financial markets, central banks and economists.

This project component identifies how to source and prepare one key table of the G17 release for data analysis and modeling.

The document proceeds as follows. We first explain the data set and its intrinsic features and complexities. Then, we propose a data model and wrangle the input file to populate the database table. Third, we run basic exploratory data analysis to glean basic facts.

1.1 Overview of the Industrial Production Data

The G17 statistical release has a vast amount of data. This project’s focus is the first table: the Industrial Production Summary table. An example is illustrated below in a screenshot.

G17 Industrial Production Summary table

G17 Industrial Production Summary table

The above image has been marked up to identify the data element we wish to import. In particular, we are choosing to exclude previous estimates and percent changes in values. Previous estimates and percent changes are excluded because they are redundant. Only data circled in red is in scope for the project.

The actual data will be parsed from an ASCII file named “g17.txt” that was downloaded from:

The data source of the ASCII file is [https://www.federalreserve.gov/releases/g17/Current/g17.txt]

There are several complications worth noting below

1.1.1 Data and Format Is Revised Over Time

The G17 data is published monthly with a 6 month lookback period. During this lookback period, the same prior month’s data may be revised up to 6 times. This is because industrial production data is revised due to variation in survey response time by participating firms. For each production figure for a given period, the first observation is called preliminary while later ones are called revised. The final revision occurs 6 months after the preliminary observation.

In the example below, we will see that the Oct 2018 Industrial Production figure is revised three times between Dec 2018 release to Feb 2019 release. The value changes from 108.7 to 109.1 to 109.3. Moreover, notice the columm position of the Oct 2018 figure shifts each month by one column to the left since data is displayed on a six month rolling window.

Here is an example of the revision process for 3 consecutive months:

Dec 2018-Feb 2019 G17 SummaryDec 2018-Feb 2019 G17 SummaryDec 2018-Feb 2019 G17 Summary

1.1.2 File Contains Extraneous Data

The file contains numerous tables and extraneous text. Moreover, the rows of the file containing the relevant data and show the month-over-month changes in the Industrial production values. For clarity, in the Feb 2019 release, the Jan change is calculated by comparing the latest available Dec 2018 data to the latest Jan 2019 to get a relative return.

Some rows contain previous estimates from the prior month. These datum are ignored.

Moreover, the release date of the file is provided on line 4 whereas the data table is much later.

1.1.3 Additional Details are grouped Hierarchically

The industrial production data is released for a variety of industries within the economy. These include:

  1. Total index
  2. Final Products
  3. Consumer goods
  4. Business equipment
  5. Nonindustrial supplies
  6. Construction
  7. Materials
  8. Manufacturing
  9. Mining
  10. Utilities

We will store the time series data for all of these industries.

1.2 Defining Tidy Data

Tidy data requires clearly defining an observation. It requires ensuring each column is a variable and each row is an observation.

In this case, we define each observation as being comprised of:

  • an industry group (either Total Index, Final Products, Consumer goods, etc.)
  • production period (the month/year for which production data is surveyed)
  • release date (the date on which a given version of the value is released)
  • index level ( a value representing the level of industrial production relative to the baseline year 2012)

1.3 Location of the Data Files

We will only parse and import the G17 file released as of February 15, 2019.

library(readr)

urlFile = "https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/g17.txt"
lines <- readLines(urlFile)


str(lines)
##  chr [1:1345] "                                 FEDERAL RESERVE STATISTICAL RELEASE                                 " ...

1.4 Wrangling the Data File

The useful information in the file has two parts: the release date near the top and the data table near line 70.

First, the release date is located on line 4 as the text string: February 15, 2019 We use it to identify the production dates. These would be the month-end dates of the production periods. Since the statistical releases are revised 6 times for a given production period, we see 6 prior month ends. For the February release, the production months range start on August 31, 2018 and end on Jan 31, 2019.

# The release date is found on the fourth row  %B = full name month  %d 2 digit day  %Y 4 digit year
(release_date = parse_date(str_trim(lines[4] ), "%B %d, %Y" ) )
## [1] "2019-02-15"
# Last 6 prior month-ends:  rollback provides the month-end.  
# lubridate does the normal date arithmetics
# since release dates are mid-month.   
# rollback is vectorized and gives the month-end prior to the input dates.
# rev does a reverse sort to put dates in increasing chronological order.
#
production_date = rev( rollback( release_date - months(0:5 )) )

The data table can be identified by its title which exists on a line alone. Its position varies with each release but the relative position and format of the data table is fixed relative to the title. Moreover, the industry groups for which data is required are constant. We hard-code their offset positions relative to the title as we parse the relevant lines.

# Row where the title of the table is found.
( first_match = which(str_detect( lines, "Industrial Production and Capacity Utilization:  Summary" ) ) )
## [1] 77
offsets = c( 9, 13, 14, 15, 16, 17, 18, 21, 23, 24 )

# Allocate a matrix of zeros of the size to hold the industrial production index levels.
# ----------------------------------------------------------------------------------------
mm = matrix( rep(0, length(offsets)* 6), nrow=length(offsets), ncol=6)

# Allocate the industry names to a character vector.
industries = as.character(length(offsets))

idx = 0

for( offset in offsets )
{
  idx=idx+1

  # Useful row index is defined by first_match + offset
  index_row = first_match + offset

  # Split the industry text and the six months of industrial figures by pipe delimiter
  lines_parts = str_trim( unlist(  str_split( lines[ index_row ], "\\|")  ) )

  # extract the industry name
  industries[idx] = unlist( lines_parts[1] )
  
  # strip off the parenthetical note on one industry name
  industries[idx] = str_replace( industries[idx], "\\s*\\(.*\\)", "" )

  # get the vector of numerical figures
  figures = as.double( unlist( str_extract_all( lines_parts[2], "\\d+\\.\\d+") ) ) 

  # store a row into the matrix
  mm[idx,] = figures
}

# Transpose the matrix to have industries as columns.
# There are 6 rows.
df = as_tibble(t(mm))
## Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
## This warning is displayed once per session.
colnames(df) <- industries

# Add the production dates and the release date into the dataframe.
df = add_column( df, production_date = production_date)
df = add_column( df, revision_date = rep(release_date, 6))

knitr::kable( df)
Total index Final Products Consumer goods Business equipment Nonindustrial supplies Construction Materials Manufacturing Mining Utilities production_date revision_date
108.8 103.2 105.9 101.2 107.1 114.5 114.0 104.9 126.5 105.4 2018-08-31 2019-02-15
109.0 103.7 106.2 102.2 106.9 114.3 114.0 105.2 127.6 104.1 2018-09-30 2019-02-15
109.3 104.3 107.0 102.5 107.5 114.6 113.9 105.0 127.9 107.5 2018-10-31 2019-02-15
110.0 104.6 107.3 102.8 107.5 115.3 115.1 105.3 129.3 109.9 2018-11-30 2019-02-15
110.1 104.3 106.3 103.8 108.0 117.6 115.4 106.1 131.2 102.4 2018-12-31 2019-02-15
109.4 103.4 105.6 102.2 107.8 117.1 114.9 105.2 131.3 102.8 2019-01-31 2019-02-15

The last step is to make the data tidy and consistent with the format we are going to use to do analysis. Gather the staged dataframe and store the industrial production index levels into a column called: production. Store the industry group names under a column called IndustryGroup.

tidydata = gather(df, key="industry_group", value="production", `Total index`:`Utilities`)

knitr::kable(head(tidydata,10))
production_date revision_date industry_group production
2018-08-31 2019-02-15 Total index 108.8
2018-09-30 2019-02-15 Total index 109.0
2018-10-31 2019-02-15 Total index 109.3
2018-11-30 2019-02-15 Total index 110.0
2018-12-31 2019-02-15 Total index 110.1
2019-01-31 2019-02-15 Total index 109.4
2018-08-31 2019-02-15 Final Products 103.2
2018-09-30 2019-02-15 Final Products 103.7
2018-10-31 2019-02-15 Final Products 104.3
2018-11-30 2019-02-15 Final Products 104.6

1.5 Exploratory Data Analysis

ggplot(tidydata) + 
  geom_line( mapping=aes(x=production_date, y = production, color=industry_group), size=1.5) + 
  facet_wrap(~industry_group) + theme(legend.position = "none") +
  ggtitle("Industrial Production for Feb 2019")

Let us explore the monthly returns of each industry group’s industrial production function. Changes in Industrial Production are important for economists to detect early signs of recession or boom in the US economy. To do this, we calculate monthly lagged returns using mutate with the lag function.

Note that we filter out the August production figure because we cannot plot returns of the earliest production period. However, since our goal is to visualize the recent production months, that should be fine.

The below bar plots show returns by IndustryGroup. We see that industrial production decline in the most recent period for 7 industry groups and rose for 3.

tidydata %>% 
  group_by(industry_group) %>% 
  mutate( ret = production / lag(production) - 1.0 ) %>% 
  filter( !is.na(ret)) -> trends
## Warning: package 'bindrcpp' was built under R version 3.4.4
ggplot(trends) + 
  geom_bar( stat="identity", mapping=aes(x=production_date, y = ret, 
                                         fill=industry_group), size=1.0) + 
  facet_wrap(~industry_group) + theme(legend.position = "none") +
  ggtitle("Industrial Production Returns as of Feb 2019")

1.6 Exporting the Data to SQL

We created a database table for the production figures in MySQL called industrialproduction. However, we will not store the returns because this requires access to the previous month’s data to calculate returns for the earliest period. Instead, returns should be calculated within MySQL after production figures are imported.

First, we write the tidydata to csv.

write_csv(tidydata, path="./Industrial_Production_Feb2019.csv" )

Second, we import the local file with MySQL import wizard and display the results from MySQL Workbench.

knitr::include_graphics("Industrial_Production_MySQL.png")

1.7 Conclusion

This study illustrates the use of tidy data principles for a well known financial data set. The source data is clean but highly non-tidy. There are a few files to be reviewed:

  • RMD file is on Github
  • HTML report in on RPUBS
  • Raw G17 File in on Github
  • Generated Flat File for Import to database is on Github
  • MySQL scheme for the target table in on Github.