Your final project is to create a public visualization (static or interactive) using data relevant to a current policy, business, or justice issue. You may use any dataset you can find for this assignment, as long as it is either public or you have permission from the data’s owner/administrator to work with it and share it. Recommended data sources are: governmental data, data provided by non-profit/non-governmental organizations, and data available from large, semi-structured data sets (e.g., social networks, company financials). You must document each step of your data analysis process (excluding data acquisition) in code: this includes changing the format of the data and the creation of any images or interactive displays. You must also include a short (2-3 paragraph) write-up on the visualization. This write-up must include the following: the data source; what the parameters of the data set are (geography, timeframe, what the data points are, etc.); what the data shows; and why it is important… Note: the type of deliverable you provide will depend on the strategy you use for this project. If you put together an interactive visualization, you should be able to provide code that I will be able to run and host locally. If you are choosing static visualizations, your write-up will be more important to your overall grade, and it may be useful to think about how you’re presenting these visualizations (in a formatted R Markdown document, for example).
You must submit a proposal for your project. This proposal must include: a link to the data source, an explanation of what you want to show, why this is relevant to a current policy, business, or justice issue, and which technologies you plan to use. Your instructor must approve this proposal; you may have to refine it somewhat. You will present your final project. If you are not able to attend the lecture on [the proposal or final due dates], you must write up a status report with screenshots of your current progress and any issues you are experiencing.
The IRS has a dataset that contains U.S. Population Migration Data from 1990 to 2016. The data are “based on year-to-year address changes reported on individual income tax returns filed with the IRS.” The data show inflows to and outflows from every county in the United States, as defined by US Census Bureau FIPS codes. The flows are in terms of number of returns, number of exemptions, and aggregate income. These data are useful in business for gauging demand and human capital. The proposed project is an interactive visualization in which a year, destination state, and destination county are selected, and a map of inflows or outflows by origin state and origin county is displayed. An example would be selecting 2015 and Kings County in New York State to see a heatmap of the United States showing where people migrated from, by return count. Below are the top 15 rows of the raw data for such a selection; a rough sketch of how this selection could be filtered in code follows the table.
Dest. State FIPS | Dest. County FIPS | Origin State FIPS | Origin County FIPS | Origin State | Origin County Name | Returns | Exemptions | AGI |
---|---|---|---|---|---|---|---|---|
36 | 47 | 36 | 61 | NY | New York County | 12253 | 18042 | 803554000 |
36 | 47 | 36 | 81 | NY | Queens County | 8881 | 16332 | 305235000 |
36 | 47 | 36 | 5 | NY | Bronx County | 2783 | 5215 | 74961000 |
36 | 47 | 36 | 85 | NY | Richmond County | 1527 | 2809 | 64619000 |
36 | 47 | 36 | 59 | NY | Nassau County | 1317 | 2108 | 60445000 |
36 | 47 | 36 | 103 | NY | Suffolk County | 974 | 1502 | 42653000 |
36 | 47 | 34 | 17 | NJ | Hudson County | 784 | 1150 | 39661000 |
36 | 47 | 6 | 37 | CA | Los Angeles County | 770 | 956 | 31287000 |
36 | 47 | 36 | 119 | NY | Westchester County | 677 | 982 | 32442000 |
36 | 47 | 42 | 101 | PA | Philadelphia County | 568 | 821 | 20239000 |
36 | 47 | 17 | 31 | IL | Cook County | 535 | 668 | 34080000 |
36 | 47 | 6 | 75 | CA | San Francisco County | 484 | 557 | 25770000 |
36 | 47 | 34 | 13 | NJ | Essex County | 454 | 757 | 16920000 |
36 | 47 | 12 | 86 | FL | Miami Dade County | 436 | 648 | 11926000 |
36 | 47 | 34 | 3 | NJ | Bergen County | 403 | 605 | 17831000 |
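As a rough sketch of that selection in R, the raw rows could be filtered and sorted as below. This is hypothetical: the file name and column names are placeholders inferred from the table headers above, not the IRS’s actual names.

# Hypothetical example: filter raw inflow rows for destination Kings County, NY
# (FIPS 36-047). File and column names are placeholders, not the IRS's actual names.
raw <- read.csv("inflow_2015.csv")
kings <- raw[raw$Dest_State_FIPS == 36 & raw$Dest_County_FIPS == 47, ]
kings <- kings[order(-kings$Returns), ]   # largest origin flows first
head(kings[, c("Origin_State", "Origin_County", "Returns", "Exemptions", "AGI")], 15)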
I plan on using R with Shiny unless there is too much data for R to handle. In that case, I will try to use Python with plotly. The data for this analysis are available in CSV format here: IRS U.S. Population Migration Data, US Census Bureau FIPS Codes.
This is a very interesting dataset, but you still need to flesh out exactly what you’re looking to show. For assignment 6, think about what kind of app you want to make (exploratory? purpose-driven?), and if it is more narrative or purpose-driven, think about what you want your thesis to be (highlighting the characteristics of the largest in-/outflows? saying something general about how Americans move?)
I have given all of you feedback on your final project proposals, and this assignment is meant to be a first step in service of your final project, and an (optional) chance to build your skills in the JavaScript libraries we’ve covered: D3 or Plotly. Follow the directions in my feedback and create some preliminary graphs for your final project. I encourage all students to attempt to use JavaScript. Even if you are not planning on using these tools for your final project, the exercise can be used to build your skills beyond our basic assignments in Project 5. Since this is in service of your final project, however, projects submitted in either Dash or Shiny will be accepted if that’s what you’ll be using… If you are not using D3 or Plotly.js, then Project 6 would be acceptable in whatever format you’re using for your final project. If you’re using Dash, then you can put some preliminary graphs in a notebook, or if you’re using Shiny, you can put some preliminary graphs in an R Markdown file.
I am looking to make an exploratory app that shows how Americans move in general. I was planning to cover the whole country, but the millions of rows of data make the visualization unresponsive. I have culled it down to inflows and outflows from NY, which is only about a quarter-million rows of data, and it still lags a little (depending on RAM), but it is functional. The app will be more purpose-driven than narrative-based, so I plan to focus more on improving the visualization than on telling a story about the data. I was a little disappointed to see that county-level choropleth maps do not have zoom capabilities, but I will add sliders for years and drop-downs for both direction (inflow, outflow) and scope (NY to/from where).
sudo apt-get update
pip install --upgrade pip
sudo apt-get install libudunits2-dev # for reticulate
sudo apt-get install libffi-dev g++ libssl-dev
pip install plotly
pip install geopandas # for choropleths
pip install pyshp # for choropleths
pip install shapely # for choropleths
Brace expansions with wget are useful for downloading several files via the command line. When downloading these annual data, however, the naming conventions (e.g., *1993to1994*.csv) call for multiple brace expansions (e.g., *{1993..2003}to{1994..2004}*.csv), which produce repeated, non-critical errors for nonexistent files. This is because multiple brace expansions produce a cross (Cartesian) product in which every possible combination is returned. Using for loops is an alternative, but can require more code than wget with brace expansions. Brace expansions also run into zero-padding issues: looping over a value like “08” can return a “value too great for base” error, because numbers with a leading 0 are interpreted as base-8 (octal) in Bash arithmetic. A loop-based alternative is sketched below.
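As a sketch of the loop alternative in R (the base URL and file-name pattern below are placeholders, not the exact IRS paths), each annual file name can be built explicitly, which avoids both the Cartesian-product errors and the octal pitfall:

# Loop alternative: build each year-to-year file name explicitly.
# The base URL and naming pattern are placeholders, not the real IRS paths.
base_url <- "https://www.example.gov/migration/"
for (year in 1990:2003) {
  file <- sprintf("%dto%d.csv", year, year + 1)          # e.g. "1993to1994.csv"
  # sprintf("%02d", 8) yields "08" as text, so zero-padding never hits
  # Bash's base-8 interpretation of leading zeros.
  download.file(paste0(base_url, file), destfile = file, mode = "wb")
}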
Most data files were cleaned using Python, but the data from 1990 to 1992 are in text files with a hierarchical format that is not easily converted to CSV. These files were manually converted using Microsoft Excel.
To run Python code from R Markdown using a specific engine, navigate to the desired location and get the engine path with the code below. The output shown is only an example.
python -c "import os; print(os.environ['_'])"
## /usr/bin/python
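The path can then be registered in the R Markdown setup chunk so that Python chunks use that interpreter. This is a sketch assuming knitr’s engine.path chunk option; the path shown is only the example output above.

# Point knitr's python engine at the interpreter found above
# (assumes knitr's engine.path option; the path is the example output above).
knitr::opts_chunk$set(engine.path = list(python = "/usr/bin/python"))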
import numpy as np
import pandas as pd
github = 'https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20608/'
# Federal Information Processing Standards (FIPS) County Codes
df_fips = pd.read_csv(github + 'FIPS_Codes_2016.csv', encoding ='latin1')
# IRS Statistics of Income Division (SOI) Tax Stats - Migration Data
col_right = ['State_Name', 'County_Name', 'Returns', 'Exemptions', 'Income']
col_fips, df_list = (['State_Code', 'County_Code'], [])
for year, io in [(x, y) for x in range(1990, 2016) for y in ['i', 'o']]:
    # One inflow ('i') and one outflow ('o') file per year-to-year pair
    file = github + 'IRS_NYc' + str(year) + 'to' + str(year + 1) + io
    col_basis = [('I_' if io == 'i' else 'O_') + c for c in col_fips]
    col_delta = [('I_' if io == 'o' else 'O_') + c for c in col_fips]
    cols = col_basis + col_delta + col_right
    # The 1990-1992 data (first two year pairs) were converted to CSV by hand;
    # later years are Excel files with varying header rows
    if year <= 1991:
        df_temp = pd.read_csv(file + '.csv', skiprows=1, names=cols)
    else:
        df_temp = pd.ExcelFile(file + '.xls').parse(skiprows=(7 if year <= 2008 else 5), names=cols)
    df_temp.drop(['State_Name', 'County_Name'], axis=1, inplace=True)
    df_temp['Direction'] = 'Inflow' if (io == 'i') else 'Outflow'
    df_temp['Year_From'], df_temp['Year_To'] = (year, year + 1)
    # Join FIPS reference data onto both the basis and the counterpart columns
    for col_irs in [col_basis, col_delta]:
        df_temp = pd.merge(df_fips, df_temp, left_on=col_fips, right_on=col_irs, how='inner')
        dic = dict(zip(df_fips.columns[2:], [col_irs[0][:2] + l for l in ['State', 'Abbr', 'County']]))
        df_temp.rename(columns=dic, inplace=True)
        # Zero-pad state (2 digits) and county (3 digits) codes into a 5-digit FIPS string
        fips_left = df_temp[col_irs[0]].astype(int).astype(str).str.zfill(2)
        fips_right = df_temp[col_irs[1]].astype(int).astype(str).str.zfill(3)
        df_temp[col_irs[0][:2] + 'FIPS'] = fips_left + fips_right
        df_temp.drop(col_fips + col_irs, axis=1, inplace=True)
    df_list.append(df_temp)
df_irs = pd.concat(df_list, ignore_index=False).iloc[:,[12,13,0,4,2,5,3,9,7,10,8,11,1,6]]
df_irs.to_csv('IRS_NYc1990to2016io.csv', index=False)
This code is not actually run in this document; a runnable Jupyter Notebook with the actual code can be found here.
The IRS has a spatial and longitudinal dataset that contains U.S. Population Migration Data from 1990 to 2016. The data are “based on year-to-year address changes reported on individual income tax returns filed with the IRS.” These data contain geographic information on inflows to and outflows from every county in the United States over time, along with the number of returns, number of exemptions, and aggregate income associated with those flows. These data are important to business for gauging demand in sectors such as marketing and real estate, as well as the supply of labor for human capital management.
These data are easy to find, but not easy to obtain or clean. The dataset is huge, with inconsistent file formats, file naming conventions, file structures, and data values. On top of that, to pull one file, the data store for the entire year must be downloaded. Data acquisition and cleansing were not trivial. Files were downloaded with Bash, sifted manually, renamed with Bulk Rename Utility, reshaped with Excel, and then cleaned with Python before being saved as an aggregated spatial and longitudinal CSV file that was uploaded to GitHub.
With these data, an exploratory Shiny application was created that shows how people in the United States move in general. However, the size of the full dataset (millions of rows) makes the visualization unresponsive, so a subset with inflows and outflows from New York (about a quarter-million rows) is used. The resulting visualization shows some interesting features. In general, New York residents tend to move between neighboring counties, but New York City residents also show a large amount of movement to and from other major cities. Albany residents also have a wider dispersion than other New York residents, although not as wide as New York City residents. Without diving into a full quantitative analysis, at first glance this dataset appears to show that people tend to move between areas that are relatively similar in density. The interactive visualization produced through this Shiny application incorporates the following data visualization concepts:
library(tigris)
library(leaflet)
github = 'https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20608/'
df_irs <- read.csv(paste0(github, "IRS_NYc1990to2016io.csv"))
df_irs$I_FIPS <- sprintf("%05d", df_irs$I_FIPS)
df_irs$O_FIPS <- sprintf("%05d", df_irs$O_FIPS)
shapefile <- counties(cb=TRUE, year=2016)
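# Simulate the Shiny input values (direction, year, county, scope) so the
# map-building logic below can be run as a static example outside the app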
input <- data.frame(radio1 = "Inflow", slider1 = 2015,
select1 = "Kings County", select2 = "United States")
delta_col <- if (substr(input$radio1, 1, 1) == "I") "O_State" else "I_State"
delta_fips <- if (substr(input$radio1, 1, 1) == "I") "O_FIPS" else "I_FIPS"
basis_col <- paste0(substr(input$radio1, 1, 1), "_County")
delta <- if (input$select2 %in% df_irs[, delta_col]) input$select2 else levels(df_irs[, delta_col])
condition1 <- df_irs[, "Year_To"] == input$slider1
condition2 <- df_irs[, "Direction"] == as.vector(input$radio1)
condition3 <- df_irs[,basis_col] == as.vector(input$select1)
condition4 <- df_irs[, delta_col] %in% delta
rows <- condition1 & condition2 & condition3 & condition4
map_df <- geo_join(shapefile, df_irs[rows, ], by_sp="GEOID", by_df=delta_fips, how='inner')
delta_st <- if (substr(input$radio1, 1, 1) == "I") "O_Abbr" else "I_Abbr"
delta_ct <- if (substr(input$radio1, 1, 1) == "I") "O_County" else "I_County"
labels <- sprintf("<strong>%s, %s</strong><br/>%g Returns<br/>%g Income<br/>%g Exemptions",
map_df[, delta_ct][[1]], map_df[, delta_st][[1]], map_df$Returns, map_df$Income,
map_df$Exemptions ) %>% lapply(htmltools::HTML)
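# Bin the return counts at sample quantiles (at most eight bins) for the choropleth color scale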
breaks <- 1 / min(length(unique(map_df$Returns)), 8)
bins <- unique(floor(quantile(map_df$Returns, seq(0, 1, breaks))))
pal <- colorBin("Blues", domain = map_df$Returns, bins = bins, right = F)
abbr <- c("US", state.abb)[c("United States", state.name) == input$select2]
leg_title <- paste0(input$radio1, if (substr(input$radio1, 1, 1) == "I") "s from " else "s to ", abbr)
m <- leaflet(width = "100%") %>% addProviderTiles(providers$CartoDB.Positron) %>%
addMiniMap(toggleDisplay=T, position="bottomleft", minimized=T) %>%
addPolygons(data = map_df, fillColor = ~pal(Returns), weight = 1,
opacity = 1, color = "Black", dashArray = 3, fillOpacity = 0.7, label = labels,
highlight = highlightOptions(weight = 3, color = "Grey", dashArray = NULL,
fillOpacity = 0.9, bringToFront = TRUE)) %>%
addLegend(pal = pal, values = NULL, opacity = 0.7,
position = "bottomleft", title = leg_title) %>%
addEasyButton(easyButton(icon="fa-crosshairs", title="Zoom to State Level",
onClick=JS("function(btn, map){ map.setZoom(7); }")))
The actual Shiny application can be found here. The application depends on this CSS file and this Markdown file.
library(tigris)
library(leaflet)
github = 'https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20608/'
df_irs <- read.csv(paste0(github, "IRS_NYc1990to2016io.csv"))
df_irs$I_FIPS <- sprintf("%05d", df_irs$I_FIPS)
df_irs$O_FIPS <- sprintf("%05d", df_irs$O_FIPS)
shapefile <- counties(cb=TRUE, year=2016)
# Define UI for the IRS migration application
ui <- bootstrapPage(
navbarPage("IRS Migration Data", id="nav",
tabPanel("Overview", includeMarkdown("DATA608_Final.md")),
tabPanel("Interactive Map",
div(class="outer",
tags$head(includeCSS("DATA608_Final.css")),
leafletOutput("map", width="100%", height="100%"),
absolutePanel(id = "controls", class = "panel panel-default", fixed = TRUE,
draggable = TRUE, top = 60, left = "auto", right = 20, bottom = "auto",
width = 330, height = "auto", style="z-index:500;",
h2("New York Migration"),
uiOutput('basis1'),
radioButtons("radio1", label = "Migration Direction", selected = "Inflow", inline = T,
choices = list("Inflow" = "Inflow", "Outflow" = "Outflow")),
sliderInput("slider1", label = "Migration Year", min = min(df_irs$Year_To),
max = max(df_irs$Year_To), value = max(df_irs$Year_To), sep = "",
step = 1, animate = animationOptions(interval = 1000, loop = T)),
uiOutput('scope1')
),
tags$div(id="cite", style="z-index:500;", "IRS New York Migration Flows by Jose Zuniga (May 2018)")
)
),
tabPanel("Data Explorer",
h2("New York Migration"),
hr(),
fluidRow(
column(3, uiOutput('basis2')),
column(3, radioButtons("radio2", label = "Migration Direction", selected = "Inflow", inline = T,
choices = list("Inflow" = "Inflow", "Outflow" = "Outflow"))),
column(3, sliderInput("slider2", label = "Migration Year", min = min(df_irs$Year_To),
max = max(df_irs$Year_To), value = max(df_irs$Year_To), sep = "", step = 1)),
column(3, uiOutput('scope2'))),
hr(),
dataTableOutput('table')
)
)
)
# Define server logic for the interactive map and data explorer
server <- function(input, output, session) {
# Interactive Map
reactive_basis1 <- reactive({
basis_col <- paste0(substr(input$radio1, 1, 1), "_County")
condition1 <- df_irs[, "Year_To"] == input$slider1
condition2 <- df_irs[, "Direction"] == input$radio1
rows <- condition1 & condition2
df <- sort(df_irs[rows, basis_col])
return(df) })
output$basis1 <- renderUI({
selectInput('select1', label = 'Migration County', width = "auto",
choices=reactive_basis1(), selected = 1) })
# Data Explorer
reactive_basis2 <- reactive({
basis_col <- paste0(substr(input$radio2, 1, 1), "_County")
condition1 <- df_irs[, "Year_To"] == input$slider2
condition2 <- df_irs[, "Direction"] == input$radio2
rows <- condition1 & condition2
df <- sort(df_irs[rows, basis_col])
return(df) })
output$basis2 <- renderUI({
selectInput('select3', label = 'Migration County', width = "auto",
choices=reactive_basis2(), selected = 1) })
# Interactive Map
reactive_delta1 <- reactive({
delta_col <- if (substr(input$radio1, 1, 1) == "I") "O_State" else "I_State"
basis_col <- paste0(substr(input$radio1, 1, 1), "_County")
condition1 <- df_irs[, "Year_To"] == input$slider1
condition2 <- df_irs[, "Direction"] == input$radio1
condition3 <- df_irs[, basis_col] == input$select1
rows <- condition1 & condition2 & condition3
df <- rbind("United States", as.vector(sort(df_irs[rows, delta_col])))
return(df) })
output$scope1 <- renderUI({
selectInput("select2", label = "Migration Location", width = "auto",
choices=reactive_delta1(), selected = 1) })
# Data Explorer
reactive_delta2 <- reactive({
delta_col <- if (substr(input$radio2, 1, 1) == "I") "O_State" else "I_State"
basis_col <- paste0(substr(input$radio2, 1, 1), "_County")
condition1 <- df_irs[, "Year_To"] == input$slider2
condition2 <- df_irs[, "Direction"] == input$radio2
condition3 <- df_irs[, basis_col] == input$select3
rows <- condition1 & condition2 & condition3
df <- rbind("United States", as.vector(sort(df_irs[rows, delta_col])))
return(df) })
output$scope2 <- renderUI({
selectInput("select4", label = "Migration Location", width = "auto",
choices=reactive_delta2(), selected = 1) })
# Interactive Map
output$map <- renderLeaflet({
delta_col <- if (substr(input$radio1, 1, 1) == "I") "O_State" else "I_State"
delta_fips <- if (substr(input$radio1, 1, 1) == "I") "O_FIPS" else "I_FIPS"
basis_col <- paste0(substr(input$radio1, 1, 1), "_County")
delta <- if (input$select2 %in% df_irs[, delta_col]) input$select2 else levels(df_irs[, delta_col])
condition1 <- df_irs[, "Year_To"] == input$slider1
condition2 <- df_irs[, "Direction"] == input$radio1
condition3 <- df_irs[,basis_col] == input$select1
condition4 <- df_irs[, delta_col] %in% delta
rows <- condition1 & condition2 & condition3 & condition4
map_df <- geo_join(shapefile, df_irs[rows, ], by_sp="GEOID", by_df=delta_fips, how='inner')
delta_st <- if (substr(input$radio1, 1, 1) == "I") "O_Abbr" else "I_Abbr"
delta_ct <- if (substr(input$radio1, 1, 1) == "I") "O_County" else "I_County"
labels <- sprintf("<strong>%s, %s</strong><br/>%g Returns<br/>%g Income<br/>%g Exemptions",
map_df[, delta_ct][[1]], map_df[, delta_st][[1]], map_df$Returns, map_df$Income,
map_df$Exemptions ) %>% lapply(htmltools::HTML)
breaks <- 1 / min(length(unique(map_df$Returns)), 8)
bins <- unique(floor(quantile(map_df$Returns, seq(0, 1, breaks))))
pal <- colorBin("Blues", domain = map_df$Returns, bins = bins, right = F)
abbr <- c("US", state.abb)[c("United States", state.name) == input$select2]
leg_title <- paste0(input$radio1, if (substr(input$radio1, 1, 1) == "I") "s from " else "s to ", abbr)
leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>%
addMiniMap(toggleDisplay=T, position="bottomleft", minimized=T) %>%
addPolygons(data = map_df, fillColor = ~pal(Returns), weight = 1,
opacity = 1, color = "Black", dashArray = 3, fillOpacity = 0.7, label = labels,
highlight = highlightOptions(weight = 3, color = "Grey", dashArray = NULL,
fillOpacity = 0.9, bringToFront = TRUE)) %>%
addLegend(pal = pal, values = NULL, opacity = 0.7,
position = "bottomleft", title = leg_title) %>%
addEasyButton(easyButton(icon="fa-crosshairs", title="Zoom to State Level",
onClick=JS("function(btn, map){ map.setZoom(7); }"))) })
# Data Explorer
output$table <- renderDataTable({
delta_col <- if (substr(input$radio2, 1, 1) == "I") "O_State" else "I_State"
delta_fips <- if (substr(input$radio2, 1, 1) == "I") "O_FIPS" else "I_FIPS"
basis_col <- paste0(substr(input$radio2, 1, 1), "_County")
delta <- if (input$select4 %in% df_irs[, delta_col]) input$select4 else levels(df_irs[, delta_col])
condition1 <- df_irs[, "Year_To"] == input$slider2
condition2 <- df_irs[, "Direction"] == input$radio2
condition3 <- df_irs[,basis_col] == input$select3
condition4 <- df_irs[, delta_col] %in% delta
rows <- condition1 & condition2 & condition3 & condition4
map_df <- df_irs[rows, ] })
# Stop the app when the session ends, to minimize active hours on the hosting server
session$onSessionEnded(function() {
stopApp()
})
}
# Run the application
shinyApp(ui = ui, server = server)
https://rstudio.github.io/leaflet
https://rpubs.com/HunterRatliff1/152226
https://ehbick01.github.io/reported_crime
https://www.shinyapps.io/admin/#/dashboard
https://shiny.rstudio.com/gallery/superzip-example.html
https://cran.r-project.org/web/packages/tigris/tigris.pdf
https://rmarkdown.rstudio.com/authoring_knitr_engines.html
https://www.irs.gov/statistics/soi-tax-stats-migration-data
https://github.com/jzuniga123/SPS/blob/master/DATA%20608/DATA608_Final.md
https://github.com/jzuniga123/SPS/blob/master/DATA%20608/DATA608_Final.css
https://github.com/jzuniga123/SPS/blob/master/DATA%20608/DATA608_HW06.ipynb
https://github.com/jzuniga123/SPS/blob/master/DATA%20608/IRS_NYc1990to2016io.csv
https://github.com/rstudio/shiny-examples/blob/master/063-superzip-example/ui.R
https://www.census.gov/geographies/reference-files/2016/demo/popest/2016-fips.html