I want to look at trends in median income by borough in NYC over 10+ years. Since the American Community Survey (ACS) conducts their surveys in 1-year, 3-year, and 5-year increments, I will investigate two 5-year measures: (1) data from the 2013-2017 survey and (2) data from the 2005-2009 survey.
I am particularly interested in looking at which boroughs have a significantly higher median income over time. I am also interested in exploring what this might suggest about that borough, the people moving in or out of it, etc.
Note: tidycensus is needed to extract data from the census
library(tidycensus)
library(tidyverse)
library(viridis)
In order to access census data, you must retrieve and use an API key. I retrieved a key by going to the following website: http://api.census.gov/data/key_signup.html.
Citation: Adapted code from the github page on tidycensus
census_api_key("4ce84da33024270337d2c7a57c4b2cbf6bc78f6c")
install = TRUE
To access any census data, you need to choose a variable ID which requires first searching through the data to determine the variable you want to focus on.
Use View(v17) at the end of the code to view the dataset
Citation: Adapted code from github
v17 <- load_variables(2017, "acs5", cache = TRUE)
The two main types of types of information I can access from the census are:
“The primary purpose of the American Community Survey (ACS) is to measure the changing social and economic characteristics of the U.S. population — our education, housing, jobs, and more.” - U.S. Census Bureau. We are looking for the median income, so I will use ACS data.
Now, I need to find the the ACS code that represents an estimation of median incomes by borough; this code is: B06011_001. Please Note: for the code below, the year defaults to 2017 so there is no need to specify.
Citation: Used code adapted from github
ny17 <- get_acs(geography = "county",
variables = c(medincome = "B06011_001"),
state = "NY")
ny17
## # A tibble: 62 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 36001 Albany County, New York medincome 32452 615
## 2 36003 Allegany County, New York medincome 21607 527
## 3 36005 Bronx County, New York medincome 20576 204
## 4 36007 Broome County, New York medincome 25008 530
## 5 36009 Cattaraugus County, New York medincome 24740 459
## 6 36011 Cayuga County, New York medincome 28586 1103
## 7 36013 Chautauqua County, New York medincome 23202 423
## 8 36015 Chemung County, New York medincome 27930 899
## 9 36017 Chenango County, New York medincome 25398 815
## 10 36019 Clinton County, New York medincome 26915 781
## # … with 52 more rows
The following code will work to filter out everything but the NYC boroughs.
Citation: Adapted code from online resources and Maxwell Austensen
nyc17_boroughs <- filter(ny17, GEOID %in% c(36005, 36047, 36081, 36085, 36061))
nyc17_boroughs
## # A tibble: 5 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 36005 Bronx County, New York medincome 20576 204
## 2 36047 Kings County, New York medincome 28238 281
## 3 36061 New York County, New York medincome 46152 605
## 4 36081 Queens County, New York medincome 29706 306
## 5 36085 Richmond County, New York medincome 36459 488
Next, I need to order the dataset by greatest median income to least median income by borough. This value is represented by the “estimate” column. With this in mind, we can use arrange(des(column)) to single out the factor we want to order by. Then, we use %>% with select(c(desired_column1, desired_column2, desired_column3)) to list all the columns you want to view within this new order.
Citation: Adapted code from stackoverflow - link:https://stackoverflow.com/questions/51255688/sort-a-data-frame-in-r-largest-to-smallest-from-value-in-a-column
boroughs17_ordered <- nyc17_boroughs %>%
# desc orders from greatest to least
arrange(desc(estimate)) %>%
# select the columns you want
select(c("NAME", "estimate", "moe"))
boroughs17_ordered
## # A tibble: 5 x 3
## NAME estimate moe
## <chr> <dbl> <dbl>
## 1 New York County, New York 46152 605
## 2 Richmond County, New York 36459 488
## 3 Queens County, New York 29706 306
## 4 Kings County, New York 28238 281
## 5 Bronx County, New York 20576 204
Then, I need to shorten borough names. For example, I want to shorten “Queens County, New York” to simply “Queens”. I do this by using mutate() and factor() to create a new column titled “labels” in the dataset.
Citation: Adapted a mixture of code from stackoverflow and Maxwell Austensen via Slack
boroughs17_names <- boroughs17_ordered %>%
mutate(borough_names = factor(NAME, levels = c('NYCNY', 'RCNY', 'QCNY', 'KCNY', 'BCNY')),
labels = c('Manhattan', 'Staten Island', 'Queens', 'Brooklyn', 'Bronx'))
boroughs17_names
## # A tibble: 5 x 5
## NAME estimate moe borough_names labels
## <chr> <dbl> <dbl> <fct> <chr>
## 1 New York County, New York 46152 605 <NA> Manhattan
## 2 Richmond County, New York 36459 488 <NA> Staten Island
## 3 Queens County, New York 29706 306 <NA> Queens
## 4 Kings County, New York 28238 281 <NA> Brooklyn
## 5 Bronx County, New York 20576 204 <NA> Bronx
Below is the bar graph for Median Incomes by Borough in NYC, 2017.
Citations: For color - used UCSB’s R color cheat sheet For ggplot - from R for Data Science Manual and adapted from Maxwell Austensen’s class demo
bgraph17_NYC <- ggplot(boroughs17_names) +
aes(x = reorder(labels, -estimate), y = estimate) +
geom_bar(stat = "identity", fill = 'seagreen4') +
theme(axis.text.x = element_text(angle = 15, hjust = 1,
vjust = 0.5)) +
labs(
title = "Income by Borough",
subtitle = "NYC, 2017",
y = "Median Income",
x = "Borough",
caption = "Source: American Community Survey, 2013-2017"
)
bgraph17_NYC
Download ggrepel which will help us create labels for the column values
library(ggrepel)
Citations: Adapted code from cran r project, github, and r statistics; see links below:
bargraph17_labels <- bgraph17_NYC + geom_label_repel(aes(label=estimate),
size=4, data=boroughs17_names,
colour = 'darkgreen',
box.padding = unit(0.35, "lines"),
segment.color = 'grey50',
nudge_y = 36)
bargraph17_labels
I touched up the bar graph w/ 2017 data to add clarity.
Citation: Used code from r-statistics.co
bgraph17_final <- bargraph17_labels +
theme(plot.title=element_text(size=18,
face="bold",
family="American Typewriter",
color="black",
hjust=0.5,
lineheight=1.2), # title
plot.subtitle=element_text(size=15,
family="American Typewriter",
hjust=0.5),
plot.caption=element_text(size=10,
family="American Typewriter",
color="grey36"),
axis.title.x=element_text(vjust=0.5,
size=15), # X axis title
axis.title.y=element_text(size=15,
vjust=3), # Y axis title)
axis.text.x=element_text(size=12,
vjust=1), # X axis text
axis.text.y=element_text(size=12,
angle=0)) # Y axis text
bgraph17_final
Below is the plot graph for Median Incomes by Borough in NYC, 2017.
For each ACS calculation, there is a margin of error (moe). The moe for ACS is set at a default of 90% accuracy with a 10% moe. I kept this default setting so this is the moe for the data represented in both 2017 graphs. Since the moe is valuable to know in order to distinguish the potential range of values for each borough, I want to create a plot graph that can show the moe for each borough calculation.
plotgraph17_NYC <- ggplot(boroughs17_names) +
aes(x = estimate, y = reorder(labels, -estimate)) +
geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe,
y = reorder(labels, -estimate))) +
geom_point(color = "seagreen4", size = 3) +
theme(axis.text.y = element_text(angle = 15, hjust = 1,
vjust = 0.5)) +
labs(
title = "Income by Borough",
subtitle = "NYC, 2017",
y = "Borough",
x = "Median Income",
caption = "Source: American Community Survey, 2013-2017"
)
plotgraph17_NYC
I touched up the plot graph w/ 2017 data to add clarity.
Citation: Used code from r-statistics.co
plotgraph17_final <- plotgraph17_NYC +
theme(plot.title=element_text(size=18,
face="bold",
family="American Typewriter",
color="black",
hjust=0.5,
lineheight=1.2), # title
plot.subtitle=element_text(size=15,
family="American Typewriter",
hjust=0.5),
plot.caption=element_text(size=10,
family="American Typewriter",
color="grey36"),
axis.title.x=element_text(size=15,
vjust=1), # Y axis title)
axis.title.y=element_text(vjust=1,
size=15), # X axis title
axis.text.x=element_text(size=12,
vjust=1), # X axis text
axis.text.y=element_text(size=12,
angle=10)) # Y axis text
plotgraph17_final
Using the same steps as above, I will now pull the 2012 data from ACS.
Please Note: for the code below, the year defaults to 2017 so I adjusted it to 2012.
v12 <- load_variables(2012, "acs5", cache = TRUE)
Similar to above, I need to find the the ACS code that represents an estimation of median incomes by borough; this code is still: B06011_001.
Citation: Used code adapted from github
ny12 <- get_acs(geography = "county", year = 2012,
variables = c(medincome = "B06011_001"),
state = "NY")
ny12
## # A tibble: 62 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 36001 Albany County, New York medincome 31095 465
## 2 36003 Allegany County, New York medincome 19942 701
## 3 36005 Bronx County, New York medincome 19756 252
## 4 36007 Broome County, New York medincome 23007 583
## 5 36009 Cattaraugus County, New York medincome 22346 484
## 6 36011 Cayuga County, New York medincome 26004 541
## 7 36013 Chautauqua County, New York medincome 21513 414
## 8 36015 Chemung County, New York medincome 24876 614
## 9 36017 Chenango County, New York medincome 23131 892
## 10 36019 Clinton County, New York medincome 25252 923
## # … with 52 more rows
Filter 2012 info to only the NYC boroughs
Citation: Adapted code from stackoverflow and Maxwell Austensen via Slack
nyc12_boroughs <- filter(ny12, GEOID %in% c(36005, 36047, 36081, 36085, 36061))
nyc12_boroughs
## # A tibble: 5 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 36005 Bronx County, New York medincome 19756 252
## 2 36047 Kings County, New York medincome 25670 133
## 3 36061 New York County, New York medincome 40414 470
## 4 36081 Queens County, New York medincome 27626 272
## 5 36085 Richmond County, New York medincome 35112 569
Order 2012 boroughs from greatest to least based on median income, using the same code as used above for 2017 data
Citation: Adapted code from stackoverflow and Maxwell Austensen via Slack
boroughs12_ordered <- nyc12_boroughs %>%
# desc orders from largest to smallest
arrange(desc(estimate)) %>%
# select subsets the columns you want
select(c("NAME", "estimate", "moe"))
boroughs12_ordered
## # A tibble: 5 x 3
## NAME estimate moe
## <chr> <dbl> <dbl>
## 1 New York County, New York 40414 470
## 2 Richmond County, New York 35112 569
## 3 Queens County, New York 27626 272
## 4 Kings County, New York 25670 133
## 5 Bronx County, New York 19756 252
Change column names to shorter borough names by creating a new column titled “labels” using mutate() and factor()
Citation: Used a mixture of code from stackoverflow and Maxwell Austensen via Slack
boroughs12_names <- boroughs12_ordered %>%
mutate(borough_names = factor(NAME, levels = c('NYCNY', 'RCNY', 'QCNY', 'KCNY', 'BCNY')),
labels = c('Manhattan', 'Staten Island', 'Queens', 'Brooklyn', 'Bronx'))
boroughs12_names
## # A tibble: 5 x 5
## NAME estimate moe borough_names labels
## <chr> <dbl> <dbl> <fct> <chr>
## 1 New York County, New York 40414 470 <NA> Manhattan
## 2 Richmond County, New York 35112 569 <NA> Staten Island
## 3 Queens County, New York 27626 272 <NA> Queens
## 4 Kings County, New York 25670 133 <NA> Brooklyn
## 5 Bronx County, New York 19756 252 <NA> Bronx
Inflation
The 2003-2017 data has been inflation-adjusted to 2017 dollar estimates. However, the 2008-2012 data has not. These calculations can be done by multiplying " the 2008-2012 dollar estimates by 1.06962963 (CPI-U-RS)" according to the U.S. Census Bureau.
Please see the chart below for the inflation-adjusted values for 2008-2012. The column name for this value is new_vals. On average the 2012 median incomes, after inflation-adjusted, increased by approx. 2,000-3,000 dollars.
Citation: U.S. Census Bureau - https://www.census.gov/programs-surveys/acs/guidance/comparing-acs-data/2017/5-year-comparison.html
boroughs12_newvals <- boroughs12_names %>%
mutate(old_vals = factor(estimate),
new_vals = c(estimate*1.06962963))
boroughs12_newvals
## # A tibble: 5 x 7
## NAME estimate moe borough_names labels old_vals new_vals
## <chr> <dbl> <dbl> <fct> <chr> <fct> <dbl>
## 1 New York County… 40414 470 <NA> Manhattan 40414 43228.
## 2 Richmond County… 35112 569 <NA> Staten I… 35112 37557.
## 3 Queens County, … 27626 272 <NA> Queens 27626 29550.
## 4 Kings County, N… 25670 133 <NA> Brooklyn 25670 27457.
## 5 Bronx County, N… 19756 252 <NA> Bronx 19756 21132.
Cleaned up dataset by filtering out everything except for the needed columns
boroughs12_columns <- boroughs12_newvals %>%
select(c("labels", "new_vals", "moe"))
boroughs12_columns
## # A tibble: 5 x 3
## labels new_vals moe
## <chr> <dbl> <dbl>
## 1 Manhattan 43228. 470
## 2 Staten Island 37557. 569
## 3 Queens 29550. 272
## 4 Brooklyn 27457. 133
## 5 Bronx 21132. 252
Below is the bar graph for Median Incomes by Borough in NYC, 2012.
Citations: For color - used UCSB’s R color cheat sheet
boroughs12_graph <- ggplot(boroughs12_columns) +
aes(x = reorder(labels,-new_vals), y = new_vals) +
geom_bar(stat = "identity", fill = 'palegreen3') +
theme(axis.text.x = element_text(angle = 15, hjust = 1,
vjust = 0.5)) +
labs(
title = "Income by Borough",
subtitle = "NYC, 2012",
y = "Median Income",
x = "Borough",
caption = "Source: American Community Survey, 2008-2012"
)
boroughs12_graph
bargraph12_labels <- boroughs12_graph + geom_label_repel(aes(label=round(new_vals, digits=0)
),
size=4, data=boroughs12_columns,
colour = 'darkgreen',
box.padding = unit(0.35, "lines"),
segment.color = 'grey50',
nudge_y = 36)
bargraph12_labels
I touched up the bar graph w/ 2012 data to add clarity.
Citation: Used code from r-statistics.co
graph12_final <- bargraph12_labels + theme(plot.title=element_text(size=18,
face="bold",
family="American Typewriter",
color="black",
hjust=0.5,
lineheight=1.2), # title
plot.subtitle=element_text(size=15,
family="American Typewriter",
hjust=0.5),
plot.caption=element_text(size=10,
family="American Typewriter",
color="grey36"),
axis.title.x=element_text(vjust=0.5,
size=15), # X axis title
axis.title.y=element_text(size=15,
vjust=3), # Y axis title)
axis.text.x=element_text(size=12,
vjust=1), # X axis text
axis.text.y=element_text(size=12,
angle=0)) # Y axis text
graph12_final
Below is the plot graph for Median Incomes by Borough in NYC, 2012.
As explained in the 2017 Plot Graph section - for each ACS calculation, there is a margin of error (moe). The moe for ACS is set at a default of 90% accuracy with a 10% moe. I kept this default setting for the 2012 data represented in both graphs.
plotgraph12_NYC <- ggplot(boroughs12_columns) +
aes(x = new_vals, y = reorder(labels, -new_vals)) +
geom_errorbarh(aes(xmin = new_vals - moe, xmax = new_vals + moe,
y = reorder(labels, -new_vals))) +
geom_point(color = "palegreen3", size = 3) +
theme(axis.text.y = element_text(angle = 15, hjust = 1,
vjust = 0.5)) +
labs(
title = "Income by Borough",
subtitle = "NYC, 2012",
y = "Borough",
x = "Median Income",
caption = "Source: American Community Survey, 2008-2012"
)
plotgraph12_NYC
I touched up the plot graph w/ 2012 data to add clarity.
Citation: Used code from r-statistics.co
plotgraph12_final <- plotgraph12_NYC +
theme(plot.title=element_text(size=18,
face="bold",
family="American Typewriter",
color="black",
hjust=0.5,
lineheight=1.2), # title
plot.subtitle=element_text(size=15,
family="American Typewriter",
hjust=0.5),
plot.caption=element_text(size=10,
family="American Typewriter",
color="grey36"),
axis.title.x=element_text(size=15,
vjust=1), # Y axis title)
axis.title.y=element_text(vjust=1,
size=15), # X axis title
axis.text.x=element_text(size=12,
vjust=1), # X axis text
axis.text.y=element_text(size=12,
angle=10)) # Y axis text
plotgraph12_final
First, you need to download the necessary packages.
library(grid)
library(gridExtra)
library(downloader)
library(grDevices)
Citation: To place graphs side by side, I adapted the code below from github - https://sebastiansauer.github.io/two-plots-rmd/
grid.arrange(graph12_final, bgraph17_final, ncol = 2)
Join the two datasets using left_join()
acs_12_17 <- left_join(boroughs12_columns, boroughs17_names, by = "labels")
acs_12_17
## # A tibble: 5 x 7
## labels new_vals moe.x NAME estimate moe.y borough_names
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <fct>
## 1 Manhattan 43228. 470 New York County,… 46152 605 <NA>
## 2 Staten Isl… 37557. 569 Richmond County,… 36459 488 <NA>
## 3 Queens 29550. 272 Queens County, N… 29706 306 <NA>
## 4 Brooklyn 27457. 133 Kings County, Ne… 28238 281 <NA>
## 5 Bronx 21132. 252 Bronx County, Ne… 20576 204 <NA>
Clean dataset by filtering down to only the needed columns for graphing.
acsCLEAN_12_17 <- acs_12_17 %>%
select(c("labels", "new_vals", "estimate"))
acsCLEAN_12_17
## # A tibble: 5 x 3
## labels new_vals estimate
## <chr> <dbl> <dbl>
## 1 Manhattan 43228. 46152
## 2 Staten Island 37557. 36459
## 3 Queens 29550. 29706
## 4 Brooklyn 27457. 28238
## 5 Bronx 21132. 20576
Rename columns before pivoting in order to keep data clean, organized, and easy to read.
Citations: Code adapted from Sam Raby and from R Documentation: https://www.rdocumentation.org/packages/plyr/versions/1.8.4/topics/rename
proper_names <- plyr::rename(acsCLEAN_12_17, c("new_vals" = "income2012",
"estimate" = "income2017"))
names(proper_names)
## [1] "labels" "income2012" "income2017"
proper_names
## # A tibble: 5 x 3
## labels income2012 income2017
## <chr> <dbl> <dbl>
## 1 Manhattan 43228. 46152
## 2 Staten Island 37557. 36459
## 3 Queens 29550. 29706
## 4 Brooklyn 27457. 28238
## 5 Bronx 21132. 20576
Use pivot_longer() to add columns - need to repeat Boroughs and differentiate which are in 2012 and which are in 2017 - repeat labels and add medincome columns and then distinguish values by year - woohoo
Use the following to pivot the two new coloumns
acs_long <- pivot_longer(proper_names, c("income2012","income2017"))
acs_long
## # A tibble: 10 x 3
## labels name value
## <chr> <chr> <dbl>
## 1 Manhattan income2012 43228.
## 2 Manhattan income2017 46152
## 3 Staten Island income2012 37557.
## 4 Staten Island income2017 36459
## 5 Queens income2012 29550.
## 6 Queens income2017 29706
## 7 Brooklyn income2012 27457.
## 8 Brooklyn income2017 28238
## 9 Bronx income2012 21132.
## 10 Bronx income2017 20576
proper_namesAgain <- plyr::rename(acs_long, c("labels"="Boroughs", "name" = "Year",
"value" = "Median_Income"))
names(proper_names)
## [1] "labels" "income2012" "income2017"
proper_namesAgain
## # A tibble: 10 x 3
## Boroughs Year Median_Income
## <chr> <chr> <dbl>
## 1 Manhattan income2012 43228.
## 2 Manhattan income2017 46152
## 3 Staten Island income2012 37557.
## 4 Staten Island income2017 36459
## 5 Queens income2012 29550.
## 6 Queens income2017 29706
## 7 Brooklyn income2012 27457.
## 8 Brooklyn income2017 28238
## 9 Bronx income2012 21132.
## 10 Bronx income2017 20576
Combine and order x-axis (Boroughs) based on the value of y (Median Incomes)
Citation: Github - https://sebastiansauer.github.io/ordering-bars/
combinedGraph <- ggplot(proper_namesAgain, aes(x = reorder(Boroughs, -Median_Income),
y = Median_Income, fill=Year)) +
geom_bar(stat = "identity", position = "dodge") +
theme(axis.text.x = element_text(angle = 15, hjust = 1,
vjust = 0.5)) +
labs(
title = "Income by Borough",
subtitle = "NYC, 2012 & 2017",
y = "Median Income",
x = "Borough",
caption = "Source: American Community Survey, 2008-2012 & 2013-2017"
)
combinedGraph
combinedGraph_labels <- combinedGraph + geom_label_repel(aes(label=round(Median_Income,
digits=0)),
size=3, data=proper_namesAgain,
colour = 'black',
box.padding = unit(0.35, "lines"),
segment.color = 'grey50',
nudge_x = 0,
nudge_y = 0)
combinedGraph_labels
Citation: http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually
combinedGraph_colors <- combinedGraph_labels + scale_fill_manual(values=c('palegreen3', 'seagreen4'))
combinedGraph_colors
combinedGraph_final <- combinedGraph_colors +
theme(plot.title=element_text(size=18,
face="bold",
family="American Typewriter",
color="black",
hjust=0.5,
lineheight=1.2), # title
plot.subtitle=element_text(size=15,
family="American Typewriter",
hjust=0.5),
plot.caption=element_text(size=10,
family="American Typewriter",
color="grey36"),
axis.title.x=element_text(vjust=0.5,
size=15), # X axis title
axis.title.y=element_text(size=15,
vjust=3), # Y axis title)
axis.text.x=element_text(size=12,
vjust=1), # X axis text
axis.text.y=element_text(size=12,
angle=0)) # Y axis text
combinedGraph_final
Given the above observations, to some degree, we can infer that boroughs with higher median incomes are increasing and boroughs with lower median incomes are either remaining the same or decreasing over time. Understanding why will take further research. With more time, I would be interested in exploring changes in population within each borough from 2012 to 2017. I would also be interested in exploring changes in the cost of rent in each borough. I believe these pieces of information would aid in figuring out what may be causing the overarching trends mentioned above.
Citation: U.S. Census Bureau - https://www.census.gov/programs-surveys/acs/guidance/comparing-acs-data/2017/5-year-comparison.html
grid.arrange(plotgraph12_final, plotgraph17_final, ncol = 2)
The above inference may be a result of greater job diversity and potentially even greater opportunity for employment or growth within pre-existing employment. One would need to pursue further research to be sure.