Your final project is to create a public visualization (static or interactive) using data relevant to a current policy, business, or justice issue. You may use any dataset you can find for this assignment, as long as it is either public or you have permission from the data’s owner/administrator to work with it and share it. Recommended data sources are: governmental data, data provided by non-profit/non-governmental organizations, and data available from large, semi-structured data sets (e.g., social networks, company financials). You must document each step of your data analysis process (excluding data acquisition) in code: this includes changing the format of the data and the creation of any images or interactive displays. You must also include a short (2-3 paragraph) write-up on the visualization. This write-up must include the following: the data source; what the parameters of the data set are (geography, timeframe, what the data points are, etc.); what the data shows; and why it is important… Note: the type of deliverable you provide will depend on the strategy you use for this project. If you put together an interactive visualization, you should be able to provide code that I will be able to run and host locally. If you are choosing static visualizations, your write-up will be more important to your overall grade, and it may be useful to think about how you’re presenting these visualizations (in a formatted R Markdown document, for example).
You must submit a proposal for your project. This proposal must include: a link to the data source, an explanation of what you want to show, why this is relevant to a current policy, business, or justice issue, and which technologies you plan to use. Your instructor must approve this proposal; you may have to refine it somewhat. You will present your final project. If you are not able to attend the lecture on [the proposal or final due dates], you must write up a status report with screenshots of your current progress and any issues you are experiencing.
The IRS has a dataset that contains U.S. Population Migration Data from 1990 to 2016. The data are “based on year-to-year address changes reported on individual income tax returns filed with the IRS.” The data show inflows to and outflows from every county in the United States, as defined by US Census Bureau FIPS codes. The flows are in terms of number of returns, number of exemptions, and aggregate income. These data are useful in business for gauging demand and human capital. The proposed project is an interactive visualization in which a year, destination state, and destination county are selected, and a map of inflows or outflows by origin state and origin county is displayed. An example would be selecting 2015 and Kings County in New York State to see a heatmap of the United States showing where people migrated from, by return count. Below are the top 15 rows of the raw data for such a selection; a rough sketch of how this selection could be filtered in code follows the table.
Dest. State FIPS | Dest. County FIPS | Origin State FIPS | Origin County FIPS | Origin State | Origin County Name | Returns | Exemptions | AGI |
---|---|---|---|---|---|---|---|---|
36 | 47 | 36 | 61 | NY | New York County | 12253 | 18042 | 803554000 |
36 | 47 | 36 | 81 | NY | Queens County | 8881 | 16332 | 305235000 |
36 | 47 | 36 | 5 | NY | Bronx County | 2783 | 5215 | 74961000 |
36 | 47 | 36 | 85 | NY | Richmond County | 1527 | 2809 | 64619000 |
36 | 47 | 36 | 59 | NY | Nassau County | 1317 | 2108 | 60445000 |
36 | 47 | 36 | 103 | NY | Suffolk County | 974 | 1502 | 42653000 |
36 | 47 | 34 | 17 | NJ | Hudson County | 784 | 1150 | 39661000 |
36 | 47 | 6 | 37 | CA | Los Angeles County | 770 | 956 | 31287000 |
36 | 47 | 36 | 119 | NY | Westchester County | 677 | 982 | 32442000 |
36 | 47 | 42 | 101 | PA | Philadelphia County | 568 | 821 | 20239000 |
36 | 47 | 17 | 31 | IL | Cook County | 535 | 668 | 34080000 |
36 | 47 | 6 | 75 | CA | San Francisco County | 484 | 557 | 25770000 |
36 | 47 | 34 | 13 | NJ | Essex County | 454 | 757 | 16920000 |
36 | 47 | 12 | 86 | FL | Miami Dade County | 436 | 648 | 11926000 |
36 | 47 | 34 | 3 | NJ | Bergen County | 403 | 605 | 17831000 |
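As a rough sketch of that selection in R, the raw rows could be filtered and sorted as below. This is hypothetical: the file name and column names are placeholders inferred from the table headers above, not the IRS’s actual names.

# Hypothetical example: filter raw inflow rows for destination Kings County, NY
# (FIPS 36-047). File and column names are placeholders, not the IRS's actual names.
raw <- read.csv("inflow_2015.csv")
kings <- raw[raw$Dest_State_FIPS == 36 & raw$Dest_County_FIPS == 47, ]
kings <- kings[order(-kings$Returns), ]   # largest origin flows first
head(kings[, c("Origin_State", "Origin_County", "Returns", "Exemptions", "AGI")], 15)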
I plan on using R with Shiny unless there is too much data for R to handle. In that case, I will try to use Python with plotly. The data for this analysis are available in CSV format here: IRS U.S. Population Migration Data, US Census Bureau FIPS Codes.
This is a very interesting dataset, but you still need to flesh out exactly what you’re looking to show. For assignment 6, think about what kind of app you want to make (exploratory? purpose-driven?), and if it is more narrative or purpose-driven, think about what you want your thesis to be (highlighting the characteristics of the largest in-/outflows? saying something general about how Americans move?)
I have given all of you feedback on your final project proposals, and this assignment is meant to be a first step in service of your final project, and an (optional) chance to build your skills in the JavaScript libraries we’ve covered: D3 or Plotly. Follow the directions in my feedback and create some preliminary graphs for your final project. I encourage all students to attempt to use JavaScript. Even if you are not planning on using these tools for your final project, the exercise can be used to build your skills beyond our basic assignments in Project 5. Since this is in service of your final project, however, projects submitted in either Dash or Shiny will be accepted if that’s what you’ll be using… If you are not using D3 or Plotly.js, then Project 6 would be acceptable in whatever format you’re using for your final project. If you’re using Dash, then you can put some preliminary graphs in a notebook, or if you’re using Shiny, you can put some preliminary graphs in an R Markdown file.
I am looking to make an exploratory app that shows how Americans move in general. I was planning to cover the whole country, but the millions of rows of data make the visualization unresponsive. I have culled it down to inflows and outflows from NY, which is only about a quarter-million rows of data, and it still lags a little (depending on RAM), but it is functional. The app will be more purpose-driven than narrative-based, so I plan to focus more on improving the visualization than on telling a story about the data. I was a little disappointed to see that county-level choropleth maps do not have zoom capabilities, but I will add sliders for years and drop-downs for both direction (inflow, outflow) and scope (NY to/from where).
sudo apt-get update
pip install --upgrade pip
sudo apt-get install libudunits2-dev # for reticulate
sudo apt-get install libffi-dev g++ libssl-dev
pip install plotly
pip install geopandas # for choropleths
pip install pyshp # for choropleths
pip install shapely # for choropleths
Brace expansions with wget are useful for downloading several files via the command line. When downloading these annual data, however, the naming conventions (e.g., *1993to1994*.csv) call for multiple brace expansions (e.g., *{1993..2003}to{1994..2004}*.csv), which produce repeated, non-critical errors for nonexistent files. This is because multiple brace expansions produce a cross (Cartesian) product in which every possible combination is returned. Using for loops is an alternative, but can require more code than wget with brace expansions. Brace expansions also run into zero-padding issues: looping over a value like “08” can return a “value too great for base” error, because numbers with a leading 0 are interpreted as base-8 (octal) in Bash arithmetic. A loop-based alternative is sketched below.
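As a sketch of the loop alternative in R (the base URL and file-name pattern below are placeholders, not the exact IRS paths), each annual file name can be built explicitly, which avoids both the Cartesian-product errors and the octal pitfall:

# Loop alternative: build each year-to-year file name explicitly.
# The base URL and naming pattern are placeholders, not the real IRS paths.
base_url <- "https://www.example.gov/migration/"
for (year in 1990:2003) {
  file <- sprintf("%dto%d.csv", year, year + 1)          # e.g. "1993to1994.csv"
  # sprintf("%02d", 8) yields "08" as text, so zero-padding never hits
  # Bash's base-8 interpretation of leading zeros.
  download.file(paste0(base_url, file), destfile = file, mode = "wb")
}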
Most data files were cleaned using Python, but the data from 1990 to 1992 are in text files with a hierarchical format that is not easily converted to CSV. These files were manually converted using Microsoft Excel.
To run Python code from R Markdown using a specific engine, navigate to the desired location and get the engine path with the code below. The output shown is only an example.
python -c "import os; print(os.environ['_'])"
## /usr/bin/python
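The path can then be registered in the R Markdown setup chunk so that Python chunks use that interpreter. This is a sketch assuming knitr’s engine.path chunk option; the path shown is only the example output above.

# Point knitr's python engine at the interpreter found above
# (assumes knitr's engine.path option; the path is the example output above).
knitr::opts_chunk$set(engine.path = list(python = "/usr/bin/python"))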
import numpy as np
import pandas as pd
github = 'https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20608/'
# Federal Information Processing Standards (FIPS) County Codes
df_fips = pd.read_csv(github + 'FIPS_Codes_2016.csv', encoding ='latin1')
# IRS Statistics of Income Division (SOI) Tax Stats - Migration Data
col_right = ['State_Name', 'County_Name', 'Returns', 'Exemptions', 'Income']
col_fips, df_list = (['State_Code', 'County_Code'], [])
for year, io in [(x, y) for x in range(1990, 2016) for y in ['i', 'o']]:
    # One inflow ('i') and one outflow ('o') file per year-to-year pair
    file = github + 'IRS_NYc' + str(year) + 'to' + str(year + 1) + io
    col_basis = [('I_' if io == 'i' else 'O_') + c for c in col_fips]
    col_delta = [('I_' if io == 'o' else 'O_') + c for c in col_fips]
    cols = col_basis + col_delta + col_right
    # The 1990-1992 data (first two year pairs) were converted to CSV by hand;
    # later years are Excel files with varying header rows
    if year <= 1991:
        df_temp = pd.read_csv(file + '.csv', skiprows=1, names=cols)
    else:
        df_temp = pd.ExcelFile(file + '.xls').parse(skiprows=(7 if year <= 2008 else 5), names=cols)
    df_temp.drop(['State_Name', 'County_Name'], axis=1, inplace=True)
    df_temp['Direction'] = 'Inflow' if (io == 'i') else 'Outflow'
    df_temp['Year_From'], df_temp['Year_To'] = (year, year + 1)
    # Join FIPS reference data onto both the basis and the counterpart columns
    for col_irs in [col_basis, col_delta]:
        df_temp = pd.merge(df_fips, df_temp, left_on=col_fips, right_on=col_irs, how='inner')
        dic = dict(zip(df_fips.columns[2:], [col_irs[0][:2] + l for l in ['State', 'Abbr', 'County']]))
        df_temp.rename(columns=dic, inplace=True)
        # Zero-pad state (2 digits) and county (3 digits) codes into a 5-digit FIPS string
        fips_left = df_temp[col_irs[0]].astype(int).astype(str).str.zfill(2)
        fips_right = df_temp[col_irs[1]].astype(int).astype(str).str.zfill(3)
        df_temp[col_irs[0][:2] + 'FIPS'] = fips_left + fips_right
        df_temp.drop(col_fips + col_irs, axis=1, inplace=True)
    df_list.append(df_temp)
df_irs = pd.concat(df_list, ignore_index=False).iloc[:,[12,13,0,4,2,5,3,9,7,10,8,11,1,6]]
df_irs.to_csv('IRS_NYc1990to2016io.csv', index=False)
This code is not actually run in this document; a runnable Jupyter Notebook with the actual code can be found here.
The IRS has a spatial and longitudinal dataset that contains U.S. Population Migration Data from 1990 to 2016. The data are “based on year-to-year address changes reported on individual income tax returns filed with the IRS.” These data contain geographic information on inflows to and outflows from every county in the United States over time, along with the number of returns, number of exemptions, and aggregate income associated with those flows. These data are important to business for gauging demand in sectors such as marketing and real estate, as well as the supply of labor for human capital management.
These data are easy to find, but not easy to obtain or clean. The dataset is huge, with inconsistent file formats, file naming conventions, file structures, and data values. On top of that, to pull one file, the data store for the entire year must be downloaded. Data acquisition and cleansing were not trivial. Files were downloaded with Bash, sifted manually, renamed with Bulk Rename Utility, reshaped with Excel, and then cleaned with Python before being saved as an aggregated spatial and longitudinal CSV file that was uploaded to GitHub.
With these data, an exploratory Shiny application was created that shows how people in the United States move in general. However, the size of the full dataset (millions of rows) makes the visualization unresponsive, so a subset with inflows and outflows from New York (about a quarter-million rows) is used. The resulting visualization shows some interesting features. In general, New York residents tend to move between neighboring counties, but New York City residents also show a large amount of movement to and from other major cities. Albany residents also have a wider dispersion than other New York residents, although not as wide as New York City residents. Without diving into a full quantitative analysis, at first glance this dataset appears to show that people tend to move between areas that are relatively similar in density. The interactive visualization produced through this Shiny application incorporates the following data visualization concepts:
library(tigris)
library(leaflet)
github = 'https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20608/'
df_irs <- read.csv(paste0(github, "IRS_NYc1990to2016io.csv"))
df_irs$I_FIPS <- sprintf("%05d", df_irs$I_FIPS)
df_irs$O_FIPS <- sprintf("%05d", df_irs$O_FIPS)
shapefile <- counties(cb=TRUE, year=2016)
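# Simulate the Shiny input values (direction, year, county, scope) so the
# map-building logic below can be run as a static example outside the app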
input <- data.frame(radio1 = "Inflow", slider1 = 2015,
select1 = "Kings County", select2 = "United States")
delta_col <- if (substr(input$radio1, 1, 1) == "I") "O_State" else "I_State"
delta_fips <- if (substr(input$radio1, 1, 1) == "I") "O_FIPS" else "I_FIPS"
basis_col <- paste0(substr(input$radio1, 1, 1), "_County")
delta <- if (input$select2 %in% df_irs[, delta_col]) input$select2 else levels(df_irs[, delta_col])
condition1 <- df_irs[, "Year_To"] == input$slider1
condition2 <- df_irs[, "Direction"] == as.vector(input$radio1)
condition3 <- df_irs[,basis_col] == as.vector(input$select1)
condition4 <- df_irs[, delta_col] %in% delta
rows <- condition1 & condition2 & condition3 & condition4
map_df <- geo_join(shapefile, df_irs[rows, ], by_sp="GEOID", by_df=delta_fips, how='inner')
delta_st <- if (substr(input$radio1, 1, 1) == "I") "O_Abbr" else "I_Abbr"
delta_ct <- if (substr(input$radio1, 1, 1) == "I") "O_County" else "I_County"
labels <- sprintf("<strong>%s, %s</strong><br/>%g Returns<br/>%g Income<br/>%g Exemptions",
map_df[, delta_ct][[1]], map_df[, delta_st][[1]], map_df$Returns, map_df$Income,
map_df$Exemptions ) %>% lapply(htmltools::HTML)
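# Bin the return counts at sample quantiles (at most eight bins) for the choropleth color scale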
breaks <- 1 / min(length(unique(map_df$Returns)), 8)
bins <- unique(floor(quantile(map_df$Returns, seq(0, 1, breaks))))
pal <- colorBin("Blues", domain = map_df$Returns, bins = bins, right = F)
abbr <- c("US", state.abb)[c("United States", state.name) == input$select2]
leg_title <- paste0(input$radio1, if (substr(input$radio1, 1, 1) == "I") "s from " else "s to ", abbr)
m <- leaflet(width = "100%") %>% addProviderTiles(providers$CartoDB.Positron) %>%
addMiniMap(toggleDisplay=T, position="bottomleft", minimized=T) %>%
addPolygons(data = map_df, fillColor = ~pal(Returns), weight = 1,
opacity = 1, color = "Black", dashArray = 3, fillOpacity = 0.7, label = labels,
highlight = highlightOptions(weight = 3, color = "Grey", dashArray = NULL,
fillOpacity = 0.9, bringToFront = TRUE)) %>%
addLegend(pal = pal, values = NULL, opacity = 0.7,
position = "bottomleft", title = leg_title) %>%
addEasyButton(easyButton(icon="fa-crosshairs", title="Zoom to State Level",
onClick=JS("function(btn, map){ map.setZoom(7); }")))
The actual Shiny application can be found here. The application depends on this CSS file and this Markdown file.
library(tigris)
library(leaflet)
github = 'https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20608/'
df_irs <- read.csv(paste0(github, "IRS_NYc1990to2016io.csv"))
df_irs$I_FIPS <- sprintf("%05d", df_irs$I_FIPS)
df_irs$O_FIPS <- sprintf("%05d", df_irs$O_FIPS)
shapefile <- counties(cb=TRUE, year=2016)
# Define UI for the IRS migration application
ui <- bootstrapPage(
navbarPage("IRS Migration Data", id="nav",
tabPanel("Overview", includeMarkdown("DATA608_Final.md")),
tabPanel("Interactive Map",
div(class="outer",
tags$head(includeCSS("DATA608_Final.css")),
leafletOutput("map", width="100%", height="100%"),
absolutePanel(id = "controls", class = "panel panel-default", fixed = TRUE,
draggable = TRUE, top = 60, left = "auto", right = 20, bottom = "auto",
width = 330, height = "auto", style="z-index:500;",
h2("New York Migration"),
uiOutput('basis1'),
radioButtons("radio1", label = "Migration Direction", selected = "Inflow", inline = T,
choices = list("Inflow" = "Inflow", "Outflow" = "Outflow")),
sliderInput("slider1", label = "Migration Year", min = min(df_irs$Year_To),
max = max(df_irs$Year_To), value = max(df_irs$Year_To), sep = "",
step = 1, animate = animationOptions(interval = 1000, loop = T)),
uiOutput('scope1')
),
tags$div(id="cite", style="z-index:500;", "IRS New York Migration Flows by Jose Zuniga (May 2018)")
)
),
tabPanel("Data Explorer",
h2("New York Migration"),
hr(),
fluidRow(
column(3, uiOutput('basis2')),
column(3, radioButtons("radio2", label = "Migration Direction", selected = "Inflow", inline = T,
choices = list("Inflow" = "Inflow", "Outflow" = "Outflow"))),
column(3, sliderInput("slider2", label = "Migration Year", min = min(df_irs$Year_To),
max = max(df_irs$Year_To), value = max(df_irs$Year_To), sep = "", step = 1)),
column(3, uiOutput('scope2'))),
hr(),
dataTableOutput('table')
)
)
)
# Define server logic for the interactive map and data explorer
server <- function(input, output, session) {
# Interactive Map
reactive_basis1 <- reactive({
basis_col <- paste0(substr(input$radio1, 1, 1), "_County")
condition1 <- df_irs[, "Year_To"] == input$slider1
condition2 <- df_irs[, "Direction"] == input$radio1
rows <- condition1 & condition2
df <- sort(df_irs[rows, basis_col])
return(df) })
output$basis1 <- renderUI({
selectInput('select1', label = 'Migration County', width = "auto",
choices=reactive_basis1(), selected = 1) })
# Data Explorer
reactive_basis2 <- reactive({
basis_col <- paste0(substr(input$radio2, 1, 1), "_County")
condition1 <- df_irs[, "Year_To"] == input$slider2
condition2 <- df_irs[, "Direction"] == input$radio2
rows <- condition1 & condition2
df <- sort(df_irs[rows, basis_col])
return(df) })
output$basis2 <- renderUI({
selectInput('select3', label = 'Migration County', width = "auto",
choices=reactive_basis2(), selected = 1) })
# Interactive Map
reactive_delta1 <- reactive({
delta_col <- if (substr(input$radio1, 1, 1) == "I") "O_State" else "I_State"
basis_col <- paste0(substr(input$radio1, 1, 1), "_County")
condition1 <- df_irs[, "Year_To"] == input$slider1
condition2 <- df_irs[, "Direction"] == input$radio1
condition3 <- df_irs[, basis_col] == input$select1
rows <- condition1 & condition2 & condition3
df <- rbind("United States", as.vector(sort(df_irs[rows, delta_col])))
return(df) })
output$scope1 <- renderUI({
selectInput("select2", label = "Migration Location", width = "auto",
choices=reactive_delta1(), selected = 1) })
# Data Explorer
reactive_delta2 <- reactive({
delta_col <- if (substr(input$radio2, 1, 1) == "I") "O_State" else "I_State"
basis_col <- paste0(substr(input$radio2, 1, 1), "_County")
condition1 <- df_irs[, "Year_To"] == input$slider2
condition2 <- df_irs[, "Direction"] == input$radio2
condition3 <- df_irs[, basis_col] == input$select3
rows <- condition1 & condition2 & condition3
df <- rbind("United States", as.vector(sort(df_irs[rows, delta_col])))
return(df) })
output$scope2 <- renderUI({
selectInput("select4", label = "Migration Location", width = "auto",
choices=reactive_delta2(), selected = 1) })
# Interactive Map
output$map <- renderLeaflet({
delta_col <- if (substr(input$radio1, 1, 1) == "I") "O_State" else "I_State"
delta_fips <- if (substr(input$radio1, 1, 1) == "I") "O_FIPS" else "I_FIPS"
basis_col <- paste0(substr(input$radio1, 1, 1), "_County")
delta <- if (input$select2 %in% df_irs[, delta_col]) input$select2 else levels(df_irs[, delta_col])
condition1 <- df_irs[, "Year_To"] == input$slider1
condition2 <- df_irs[, "Direction"] == input$radio1
condition3 <- df_irs[,basis_col] == input$select1
condition4 <- df_irs[, delta_col] %in% delta
rows <- condition1 & condition2 & condition3 & condition4
map_df <- geo_join(shapefile, df_irs[rows, ], by_sp="GEOID", by_df=delta_fips, how='inner')
delta_st <- if (substr(input$radio1, 1, 1) == "I") "O_Abbr" else "I_Abbr"
delta_ct <- if (substr(input$radio1, 1, 1) == "I") "O_County" else "I_County"
labels <- sprintf("<strong>%s, %s</strong><br/>%g Returns<br/>%g Income<br/>%g Exemptions",
map_df[, delta_ct][[1]], map_df[, delta_st][[1]], map_df$Returns, map_df$Income,
map_df$Exemptions ) %>% lapply(htmltools::HTML)
breaks <- 1 / min(length(unique(map_df$Returns)), 8)
bins <- unique(floor(quantile(map_df$Returns, seq(0, 1, breaks))))
pal <- colorBin("Blues", domain = map_df$Returns, bins = bins, right = F)
abbr <- c("US", state.abb)[c("United States", state.name) == input$select2]
leg_title <- paste0(input$radio1, if (substr(input$radio1, 1, 1) == "I") "s from " else "s to ", abbr)
leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>%
addMiniMap(toggleDisplay=T, position="bottomleft", minimized=T) %>%
addPolygons(data = map_df, fillColor = ~pal(Returns), weight = 1,
opacity = 1, color = "Black", dashArray = 3, fillOpacity = 0.7, label = labels,
highlight = highlightOptions(weight = 3, color = "Grey", dashArray = NULL,
fillOpacity = 0.9, bringToFront = TRUE)) %>%
addLegend(pal = pal, values = NULL, opacity = 0.7,
position = "bottomleft", title = leg_title) %>%
addEasyButton(easyButton(icon="fa-crosshairs", title="Zoom to State Level",
onClick=JS("function(btn, map){ map.setZoom(7); }"))) })
# Data Explorer
output$table <- renderDataTable({
delta_col <- if (substr(input$radio2, 1, 1) == "I") "O_State" else "I_State"
delta_fips <- if (substr(input$radio2, 1, 1) == "I") "O_FIPS" else "I_FIPS"
basis_col <- paste0(substr(input$radio2, 1, 1), "_County")
delta <- if (input$select4 %in% df_irs[, delta_col]) input$select4 else levels(df_irs[, delta_col])
condition1 <- df_irs[, "Year_To"] == input$slider2
condition2 <- df_irs[, "Direction"] == input$radio2
condition3 <- df_irs[,basis_col] == input$select3
condition4 <- df_irs[, delta_col] %in% delta
rows <- condition1 & condition2 & condition3 & condition4
map_df <- df_irs[rows, ] })
# Stop the app when the session ends, to minimize active hours on the hosting server
session$onSessionEnded(function() {
stopApp()
})
}
# Run the application
shinyApp(ui = ui, server = server)
https://rstudio.github.io/leaflet
https://rpubs.com/HunterRatliff1/152226
https://ehbick01.github.io/reported_crime
https://www.shinyapps.io/admin/#/dashboard
https://shiny.rstudio.com/gallery/superzip-example.html
https://cran.r-project.org/web/packages/tigris/tigris.pdf
https://rmarkdown.rstudio.com/authoring_knitr_engines.html
https://www.irs.gov/statistics/soi-tax-stats-migration-data
https://github.com/jzuniga123/SPS/blob/master/DATA%20608/DATA608_Final.md
https://github.com/jzuniga123/SPS/blob/master/DATA%20608/DATA608_Final.css
https://github.com/jzuniga123/SPS/blob/master/DATA%20608/DATA608_HW06.ipynb
https://github.com/jzuniga123/SPS/blob/master/DATA%20608/IRS_NYc1990to2016io.csv
https://github.com/rstudio/shiny-examples/blob/master/063-superzip-example/ui.R
https://www.census.gov/geographies/reference-files/2016/demo/popest/2016-fips.html