Welcome! This site presents all the materials for replicating our team’s project analyzing the Boston BlueBikes network. You can start your own RStudio Cloud project with this data using the following link.
This RStudio Cloud project is our repository for sharing all code,
data, and new techniques for our research project. To see our code for
building the dataset, please see project_builder_code.Rmd
in our RStudio Cloud project. To see our code for analyzing the dataset,
see below (also labelled guide.Rmd in the project)!
# First, let's load packages
# For data wrangling and spatial analysis
library(tidyverse)
library(sf)
library(tigris)
library(lubridate)
# For better visuals
library(viridis)
library(ggtext)
library(ggpubr)
library(shadowtext)
library(ggspatial)
# For parallel processing
library(furrr)
library(future)
There’s a lot of it! Here’s a small sample of the data that you can use to explore.
Check out our test data (testdata.rds), which contains BlueBikes trips from
December 2021. It’s a HUGE file, so a good way to start is to use
sample_n() to take a random sample of it, rather than
using the whole thing. Here’s an example.
For example, here’s the total size of the file.
read_rds("testdata.rds") %>%
summarize(rows = n())
## # A tibble: 1 × 1
## rows
## <int>
## 1 143300
Here’s how we take a random sample of 10,000 rides during 12/2021.
dat <- read_rds("testdata.rds") %>%
sample_n(size = 10000)
Explore the data!
dat %>% head()
## # A tibble: 6 × 15
## tripduration starttime stoptime start_station_id
## <dbl> <dttm> <dttm> <dbl>
## 1 1002 2021-12-31 21:45:31 2021-12-31 22:02:13 44
## 2 637 2021-12-11 17:59:00 2021-12-11 18:09:38 76
## 3 178 2021-12-03 18:43:46 2021-12-03 18:46:44 218
## 4 1145 2021-12-01 23:50:49 2021-12-02 00:09:55 318
## 5 2318 2021-12-03 19:17:23 2021-12-03 19:56:02 338
## 6 960 2021-12-05 22:06:03 2021-12-05 22:22:04 14
## # … with 11 more variables: start_station_name <chr>, from_y <dbl>,
## # from_x <dbl>, end_station_id <dbl>, end_station_name <chr>, to_y <dbl>,
## # to_x <dbl>, bikeid <dbl>, usertype <chr>, postal_code <chr>, tripid <int>
Try out using lubridate, the tidyverse package that
helps us work with date data.
library(lubridate)
dat %>%
head() %>%
select(starttime) %>%
# We can use ceiling date to consolidate our dates...
mutate(
seconds = ceiling_date(starttime, unit = "seconds"),
minutes = ceiling_date(starttime, unit = "minutes"),
hours = ceiling_date(starttime, unit = "hours"),
days = ceiling_date(starttime, unit = "days"),
weeks = ceiling_date(starttime, unit = "weeks"),
months = ceiling_date(starttime, unit = "months"))
## # A tibble: 6 × 7
## starttime seconds minutes
## <dttm> <dttm> <dttm>
## 1 2021-12-31 21:45:31 2021-12-31 21:45:32 2021-12-31 21:46:00
## 2 2021-12-11 17:59:00 2021-12-11 17:59:01 2021-12-11 18:00:00
## 3 2021-12-03 18:43:46 2021-12-03 18:43:47 2021-12-03 18:44:00
## 4 2021-12-01 23:50:49 2021-12-01 23:50:50 2021-12-01 23:51:00
## 5 2021-12-03 19:17:23 2021-12-03 19:17:24 2021-12-03 19:18:00
## 6 2021-12-05 22:06:03 2021-12-05 22:06:04 2021-12-05 22:07:00
## # … with 4 more variables: hours <dttm>, days <dttm>, weeks <dttm>,
## # months <dttm>
This lets us analyze change over time.
result <- dat %>%
# We can use ceiling date to consolidate our dates...
mutate(days = ceiling_date(starttime, unit = "days")) %>%
# Count trips per day
group_by(days) %>%
summarize(count = n())
result
## # A tibble: 31 × 2
## days count
## <dttm> <int>
## 1 2021-12-02 00:00:00 534
## 2 2021-12-03 00:00:00 520
## 3 2021-12-04 00:00:00 519
## 4 2021-12-05 00:00:00 473
## 5 2021-12-06 00:00:00 459
## 6 2021-12-07 00:00:00 443
## 7 2021-12-08 00:00:00 390
## 8 2021-12-09 00:00:00 451
## 9 2021-12-10 00:00:00 393
## 10 2021-12-11 00:00:00 458
## # … with 21 more rows
And we can visualize them.
result %>%
ggplot(mapping = aes(x = days, y = count)) +
geom_col() +
coord_flip()
Several other interesting traits are in this dataset, including station name and zip code. Can you put together some descriptive statistics and visuals to examine how BlueBikes ridership varied over the last month? This will be a practice cycle for us as we get going.
# Bonus: we can turn each trip into a spatial line running from its start station to its end station.
make_line <- function(from_x, from_y, to_x, to_y, ...){
  # Build a LINESTRING from the start point (from_x, from_y) to the end point (to_x, to_y)
  st_linestring(matrix(c(from_x, from_y, to_x, to_y), nrow = 2, byrow = TRUE))
}
library(sf)
test <- dat %>%
  # pmap() feeds each row's columns to make_line(), returning one linestring per trip
  pmap(make_line) %>%
  # bundle the linestrings into a simple feature geometry column in WGS 84 (EPSG:4326)
  st_sfc(crs = 4326) %>%
  # attach the trip IDs and convert to an sf data frame
  {tibble(tripid = dat$tripid, geometry = .)} %>%
  st_sf()
Congrats! We’ve made it to the next stage of the project. Now that we have (1) reviewed the literature and (2) explored the raw data, we’re going to analyze the (3) refined data. What does this mean?
We’re going to use this SQLite database to access a refined version
of our data. SQLite is really handy because we can ask it to do lots of
number-crunching without loading all the millions of rows into our R
environment, which would normally cause it to crash. Instead, we can
feed SQLite basic dplyr functions, like select(),
mutate(), filter(), group_by(),
and summarize(), and then ask the SQLite database to
collect() the resulting data and give it to us in R. This
output should be much, much smaller, at a size R can handle.
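Here’s a minimal sketch of that workflow (the full, working version appears further below); the table name tally_rush and the count column match the data described next, but treat this as illustrative.
# Connect, run dplyr verbs inside SQLite, then collect() only the small result
library(tidyverse)
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "our_data/bluebikes.sqlite")
con %>%
  tbl("tally_rush") %>%                        # dplyr verbs get translated to SQL...
  filter(rush == "am") %>%                     # ...and run inside SQLite, not in R
  summarize(total = sum(count, na.rm = TRUE)) %>%
  collect()                                    # only the tiny summary table comes back into R
dbDisconnect(con)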
Our data is saved in our_data. Please use this folder
for all your data needs. I’d recommend not looking elsewhere, as it gets
messy real fast :) Here’s our data:
our_data/bluebikes.sqlite: a HUGE compendium of
datasets. These files are a little too big to access on their own, so we
will access them via the SQLite database. Contains:
tally_rush (in bluebikes.sqlite): a
dataset tallying the number of rides each day during morning and evening
rush hour.
tally_rush_edges (in bluebikes.sqlite):
a dataset tallying the number of rides each day during morning and evening
rush hour, for EACH START AND END STATION PAIR. That’s a LOT!
our_data/stationbg_dataset.rds: an sf
dataset of geolocated points for all bluebike stations present in our
data. Contains the geoid of the census block group each
station is located in. Also contains all the traits of that block
group!
Other files you might run into, but don’t need to think about as much:
our_data/dates.rds: a dataset of dates and days of
the week for the past 10 years. Useful for filtering, but not strictly
necessary.
our_data/bgdataset.rds: an sf dataset
of all census block group polygons in Boston. Contains all the traits of
each block group.
Here’s a quick summary of all variable names you will run into.
code: unique ID for each bluebikes station.
geoid: unique ID for each census block
group.
count: total rides occurring during that time,
between those places.
Example Demographics Variable Set:
pop_density_2020: population density per square
kilometer in 2020. Some places are missing data.
pop_density_2020_smooth5: population density, with
missing data filled in by taking the median of their 5 nearest
neighboring census block groups.
pop_density_2020_smooth10: population density, with
missing data filled in by taking the median of their 10 nearest
neighboring census block groups. Either is fine to use.
I generated many other variables too, saved in
our_data/stationbg_dataset.rds. Let’s look at them real
quick.
read_rds("our_data/stationbg_dataset.rds") %>%
select(-contains("smooth")) %>%
names()
## [1] "code" "geometry" "geoid"
## [4] "area" "pop_0_40000_2019" "pop_100000_plus_2019"
## [7] "pop_2020" "pop_40001_60000_2019" "pop_60001_100000_2019"
## [10] "pop_asian_2020" "pop_black_2020" "pop_density_2020"
## [13] "pop_employed_2019" "pop_hisplat_2020" "pop_natam_2020"
## [16] "pop_over_65_2019" "pop_some_college" "pop_white_2020"
## [19] "pop_women_2019"
Demographics
pop_white_2020: % of white residents (you can make
pop_nonwhite_2020 by taking 1 - pop_white_2020; see the quick sketch after this list)
pop_black_2020: % of Black residents
pop_hisplat_2020: % of Hispanic/Latino
residents
pop_asian_2020: % of Asian residents
pop_natam_2020: % of Native American
residents
Socioeconomics
pop_0_40000_2019: % of families earning 0-40K a
year.
pop_40001_60000_2019: % of families earning 40-60K a
year.
pop_60001_100000_2019: % of families earning 60-100K
a year.
pop_100000_plus_2019: % of families earning 100K+ a
year.
pop_some_college: % of residents with some college
education or more.
pop_employed_2019: % of residents employed, out of
total population.
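As promised above, here’s a quick sketch of deriving a nonwhite share from the white share (variable names come from stationbg_dataset.rds; the new pop_nonwhite_2020 column name is just illustrative).
read_rds("our_data/stationbg_dataset.rds") %>%
  # nonwhite share = 1 minus the white share
  mutate(pop_nonwhite_2020 = 1 - pop_white_2020) %>%
  as_tibble() %>%
  select(code, pop_white_2020, pop_nonwhite_2020) %>%
  head()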
# Load Packages
library(tidyverse)
library(RSQLite)
library(DBI)
# Tell R to hire a 'SQL translator' object, which we'll name 'mydat',
# sourced from our bluebikes data
mydat <- dbConnect(RSQLite::SQLite(), "our_data/bluebikes.sqlite")
# Whenever you're finished, disconnect the SQL dataset, like so:
# dbDisconnect(mydat)
mydat %>%
# Tell R to look in the tally_rush dataset in our SQLite database
# It will return the first 1000 rows
tbl("tally_rush") %>%
# Grab first 6 rows
head()
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.37.0 [/cloud/project/our_data/bluebikes.sqlite]
## day rush count
## <chr> <chr> <int>
## 1 2011-07-28 am 1
## 2 2011-07-28 pm 168
## 3 2011-07-29 am 85
## 4 2011-07-29 pm 188
## 5 2011-08-01 am 91
## 6 2011-08-01 pm 282
mine <- mydat %>%
# Tell R to look in the tally_rush dataset in our SQLite database
# It will return the first 1000 rows
tbl("tally_rush") %>%
# Grab first 6 rows
head() %>%
# extract those six rows to be used in R
collect()
# Check it out!
mine
## # A tibble: 6 × 3
## day rush count
## <chr> <chr> <int>
## 1 2011-07-28 am 1
## 2 2011-07-28 pm 168
## 3 2011-07-29 am 85
## 4 2011-07-29 pm 188
## 5 2011-08-01 am 91
## 6 2011-08-01 pm 282
str_sub()
str_sub(), also known as string-sub, is an amazing
function loaded with the tidyverse package. It allows you
to extract part of a string, based on character positions. It’s
great for extracting the year, month, day, census code, etc. out of
structured text. Here’s a quick example.
# Let's make an example dataset
mydates <- data.frame(day = c("2020-10-11", "2021-05-02", "2021-02-03"))
mydates
## day
## 1 2020-10-11
## 2 2021-05-02
## 3 2021-02-03
We can extract the first four letters by setting
start = 1 and end = 4.
mydates$day %>% str_sub(start = 1, end = 4)
## [1] "2020" "2021" "2021"
# Though you could also write it like this
# str_sub(mydates$day, start = 1, end = 4)
You could also write it like this, within mutate().
mydates %>%
mutate(day = str_sub(day, start = 1, end = 4))
## day
## 1 2020
## 2 2021
## 3 2021
Finally, you could systematize a bunch of info in multiple columns!
mydates %>%
mutate(
# let's get the year
y = str_sub(day, start = 1, end = 4),
# then the month
m = str_sub(day, start = 6, end = 7),
# then the day!
d = str_sub(day, start = 9, end = 10))
## day y m d
## 1 2020-10-11 2020 10 11
## 2 2021-05-02 2021 05 02
## 3 2021-02-03 2021 02 03
Handy, right?
mydat %>%
tbl("tally_rush") %>%
# Zoom into October, using the 6th and 7th characters in the day vector
filter(str_sub(day, start = 6, end = 7) == "10") %>%
# Zoom into just am rush hour traffic
filter(rush == "am") %>%
# Count how many rows (days) is in that?
summarize(count = n()) %>%
# collect response - 243 days!
collect()
## # A tibble: 1 × 1
## count
## <int>
## 1 243
myviz <- mydat %>%
tbl("tally_rush") %>%
# Zoom into October, using the 6th and 7th characters in the day vector
filter(str_sub(day, start = 6, end = 7) == "10") %>%
# Zoom into just am rush hour traffic
filter(rush == "am") %>%
# collect response - 243 days!
collect() %>%
# If you ever work with date data,
# you WILL have to transform it from character into date format
mutate(day = as.POSIXct(day))
myviz %>%
ggplot(mapping = aes(x = day, y = count)) +
geom_jitter(size = 3, color = "purple", alpha = 0.5) +
labs(subtitle = "Yay October!")
The big one we need to worry about file size with is
tally_rush_edges. It’s also the coolest!
myedges <- mydat %>%
tbl("tally_rush_edges") %>%
# zoom into just 2021
filter(str_sub(day, 1,4) == "2021") %>%
# zoom into just morning
filter(rush == "am") %>%
# It's only about 27,000 rows, so let's collect it. I'd avoid going above 50K.
collect() %>%
as_tibble()
# How many rows?
myedges %>%
summarize(count = n())
## # A tibble: 1 × 1
## count
## <int>
## 1 27114
# Wow! That's a lot of data! Let's look at just a few rows.
myedges %>% head()
## # A tibble: 6 × 5
## start_code end_code day rush count
## <chr> <chr> <chr> <chr> <int>
## 1 A32001 B32018 2021-01-01 am 4
## 2 A32001 S32034 2021-01-01 am 2
## 3 A32002 A32008 2021-01-28 am 4
## 4 A32002 A32010 2021-01-04 am 8
## 5 A32002 B32006 2021-01-01 am 4
## 6 A32002 M32009 2021-01-01 am 2
So, in start_code, we’ve got the unique identifier for
each BlueBikes station where a person checked out a bike.
end_code shows the station they returned the bike to
afterwards. day shows the date it all occurred on.
rush indicates whether it happened during rush hour in the
"am" or "pm". And count is the number of rides
(roughly = people).
mystations <- read_rds("our_data/stationbg_dataset.rds") %>%
# Let's get the block group population of African Americans where each station is,
# using our spatially smoothed estimate, 'pop_black_2020_smooth5'
select(code, pop_black_2020_smooth5) %>%
# And let's classify it as
# above 50% or below 50%
mutate(maj_black = if_else(pop_black_2020_smooth5 > 0.5, "yes", "no")) %>%
# convert to tibble,
as_tibble() %>%
select(code, maj_black)
mystations %>% head()
## # A tibble: 6 × 2
## code maj_black
## <chr> <chr>
## 1 A32000 no
## 2 A32001 no
## 3 A32002 no
## 4 A32003 no
## 5 A32004 no
## 6 A32005 no
Finally, let’s join in our demographic traits to get a glimpse of what this mobility network looks like in terms of which neighborhoods are getting connected (e.g., who rides?). We will classify each station based on its block group’s traits!
mydemo <- myedges %>%
# We can join in the source traits,
# Eg. whether the station is in a majority Black neighborhood
left_join(by = c("start_code" = "code"),
y = mystations %>% select(code, start_black = maj_black)) %>%
# We can join in the destination traits,
# Eg. whether that station is in a majority Black neighborhood
left_join(by = c("end_code" = "code"),
y = mystations %>% select(code, end_black = maj_black))
# Handy side benefit: if start_black or end_black == NA, that means the start_code or end_code station is not in Boston proper (e.g., maybe in Cambridge or beyond).
# I'd like to zoom into Boston, for the time being.
mydemo %>% head()
## # A tibble: 6 × 7
## start_code end_code day rush count start_black end_black
## <chr> <chr> <chr> <chr> <int> <chr> <chr>
## 1 A32001 B32018 2021-01-01 am 4 no no
## 2 A32001 S32034 2021-01-01 am 2 no no
## 3 A32002 A32008 2021-01-28 am 4 no no
## 4 A32002 A32010 2021-01-04 am 8 no no
## 5 A32002 B32006 2021-01-01 am 4 no no
## 6 A32002 M32009 2021-01-01 am 2 no no
You can repeat this process with any of the demographic
variables in stationbg_dataset.rds, as long as it makes
sense (rule of thumb: simple analyses are usually better and closer to
the truth). A short sketch using income follows.
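Here’s a minimal sketch of the same workflow with income instead of race, assuming the smoothed income-share columns in stationbg_dataset.rds (e.g., pop_0_40000_2019_smooth10); the maj_lower_income name is just illustrative.
mystations_income <- read_rds("our_data/stationbg_dataset.rds") %>%
  # Share of families earning under 60K, using spatially smoothed estimates
  mutate(pop_lower_income = pop_0_40000_2019_smooth10 + pop_40001_60000_2019_smooth10) %>%
  # Classify each station's block group as majority lower-income or not
  mutate(maj_lower_income = if_else(pop_lower_income > 0.5, "yes", "no")) %>%
  as_tibble() %>%
  select(code, maj_lower_income)
# Then join it into myedges by start_code and end_code, exactly like maj_black above.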
How often do people bike between predominantly Black neighborhoods?
Let’s find out, using group_by() and
summarize(). Let’s add up the total trips (stored in
count) for each subgroup.
mydemo %>%
group_by(start_black, end_black) %>%
summarize(trips = sum(count, na.rm = TRUE))
## # A tibble: 8 × 3
## # Groups: start_black [3]
## start_black end_black trips
## <chr> <chr> <int>
## 1 no no 69843
## 2 no yes 339
## 3 no <NA> 33
## 4 yes no 505
## 5 yes yes 51
## 6 yes <NA> 3
## 7 <NA> no 8
## 8 <NA> <NA> 14
This is very cool, but we should probably deal with those NAs. The
NAs refer to stations outside of Boston - to keep this manageable, we’re
just looking at movement within Boston. We can filter() out
"NA" observations like this:
mydemo %>%
# Remove rows where start_black == NA or end_black == NA
filter(start_black != "NA" & end_black != "NA") %>%
group_by(start_black, end_black) %>%
summarize(trips = sum(count, na.rm = TRUE))
## # A tibble: 4 × 3
## # Groups: start_black [2]
## start_black end_black trips
## <chr> <chr> <int>
## 1 no no 69843
## 2 no yes 339
## 3 yes no 505
## 4 yes yes 51
mytab <- mydemo %>%
# Remove rows where start_black == NA or end_black == NA
filter(start_black != "NA" & end_black != "NA") %>%
group_by(start_black, end_black) %>%
summarize(trips = sum(count, na.rm = TRUE)) %>%
ungroup() %>%
mutate(total = sum(trips),
percent = trips / total) %>%
# Clean up the percentages
mutate(percent = round(percent*100, digits = 1))
# view it!
mytab
## # A tibble: 4 × 5
## start_black end_black trips total percent
## <chr> <chr> <int> <int> <dbl>
## 1 no no 69843 70738 98.7
## 2 no yes 339 70738 0.5
## 3 yes no 505 70738 0.7
## 4 yes yes 51 70738 0.1
So that’s… disquieting. Yikes. I mean, it’s an individual choice to ride or not, but when a public program sees rates of participation from neighborhoods of color as low as that, it’s not really an equitable policy outcome.
It’s bar-chart time! So many bar charts!
mytab %>%
ggplot(mapping = aes(x = start_black, y = trips, fill = end_black)) +
geom_col(position = "fill") +
labs(subtitle = "What % of Rides that started out in Black Neighborhoods\nended in Black Neighborhoods?",
y = "% of Rides",
x = "Starting Station\nin Majority Black Neighborhood?",
fill = "Ending Station\nin Majority\nBlack\nNeighborhood?")Sometimes filtered bar charts are more intuitive to read.
mytab %>%
mutate(type = paste(start_black, "->", end_black)) %>%
ggplot(mapping = aes(x = type, y = percent)) +
geom_col() +
geom_text(mapping = aes(label = percent), vjust = 0, nudge_y = 2) +
labs(subtitle = "What % of Rides that started out in Black Neighborhoods\nended in Black Neighborhoods?",
y = "% of Rides",
x = "Starting Neighborhood >50% Black? -> Ending Station >50% Black?")Another great option is geom_tile(), also known as
heatmaps.
library(viridis)
mytab %>%
select(start_black, end_black, percent) %>%
ggplot(mapping = aes(x = start_black, y = end_black, fill = percent, label = percent)) +
geom_tile(color = "white") +
geom_text(size = 8) +
# Viridis is back!!!
scale_fill_viridis(option = 'mako', begin = 0.2, end = 0.8) +
theme_classic(base_size = 14) +
labs(x = "Starting Station\nin Majority Black Neighborhood?",
y = 'Ending Station\nin Majority Black Neighborhood?',
subtitle = "% of BlueBikes Ridership between Black/White Neighborhoods")Oh hang on, didn’t we have this data for every place over time?
# Load Packages
library(tidyverse)
library(RSQLite)
library(DBI)
# Tell R to hire a 'SQL translator' object, which we'll name 'mydat',
# sourced from our bluebikes data
mydat <- dbConnect(RSQLite::SQLite(), "our_data/bluebikes.sqlite")
mytime <- mydat %>%
tbl("tally_rush_edges") %>%
# zoom into just morning
filter(rush == "am") %>%
# drop unnecessary variables, to keep this dataset as small as possible
select(-rush) %>%
collect()
# WOW! that's a huge dataset. We hit 2 million rows!!!
# This is all the station-pairs from 2011 to 2021.
mytime %>% head()
## # A tibble: 6 × 4
## start_code end_code day count
## <chr> <chr> <chr> <int>
## 1 A32000 A32000 2011-10-11 1
## 2 A32000 A32000 2011-10-14 1
## 3 A32000 A32010 2011-09-26 2
## 4 A32000 A32010 2011-10-06 1
## 5 A32000 A32010 2011-10-18 1
## 6 A32000 A32010 2011-11-04 1
Okay, this is going to be rough, but here we go!
We’re going to join in demographics, filter to valid cases, and get our percentages for every year.
mytimedemo <- mytime %>%
left_join(by = c("start_code" = "code"),
y = mystations %>%
select(code, start_black = maj_black)) %>%
# We can join in the destination traits,
# Eg. whether that station is in a majority Black neighborhood
left_join(by = c("end_code" = "code"),
y = mystations %>%
select(code, end_black = maj_black)) %>%
# We can join in the source traits,
# Remove rows where start_black == NA or end_black == NA
filter(start_black != "NA" & end_black != "NA") %>%
# Now count up by YEAR
group_by(year = str_sub(day, 1, 4),
# and by start and end type
start_black, end_black) %>%
summarize(trips = sum(count, na.rm = TRUE)) %>%
ungroup() %>%
# Now for each year, calculate percentage
group_by(year) %>%
mutate(total = sum(trips),
percent = trips / total) %>%
# Clean up the percentages
mutate(percent = round(percent*100, digits = 1))
# and let's quickly get rid of that 2 million row dataset so our computer can breathe easy.
remove(mytime)
Check it out!!!!
mytimedemo
## # A tibble: 44 × 6
## # Groups: year [11]
## year start_black end_black trips total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 2011 no no 21816 22177 98.4
## 2 2011 no yes 131 22177 0.6
## 3 2011 yes no 223 22177 1
## 4 2011 yes yes 7 22177 0
## 5 2012 no no 91248 91571 99.6
## 6 2012 no yes 106 91571 0.1
## 7 2012 yes no 213 91571 0.2
## 8 2012 yes yes 4 91571 0
## 9 2013 no no 162388 162946 99.7
## 10 2013 no yes 131 162946 0.1
## # … with 34 more rows
Let’s finish up by visualizing some equity questions over time!
mytimedemo %>%
# Zoom into just majority non-black-non-black pairs
filter(start_black == "no", end_black == "no") %>%
ggplot(mapping = aes(x = year, y = percent, group = 1)) +
geom_point(size = 5) +
geom_line() +
theme_classic(base_size = 14) +
labs(y = "% Trips between Majority Non-Black Neighborhoods,\nout of Total Trips",
x = "Year",
subtitle = "Very tiny changes in Demographic Mobility\nin BlueBikes Program over Time")Using these basic tools, we constructed a series of preliminary evaluations to explore the demographic and socioeconomic breakdown of Bluebikes ridership, with a focus on equity. Is the program reaching a broad slice of the population? Or not so much? When and where is it succeeding vs. struggling?
Using a chosen demographic variable, each team member set an expectation/hypothesis and then tested it (e.g., “I expect things to look pretty even” or “I expect things to look pretty uneven”).
Finally, we compiled results and refined analyses.
This study calculated a series of network statistics, all of them proportions representing how homophilous BlueBikes users’ movement patterns were.
We measured homophily by race, income, education, and population density.
Some amount of homophily, however, was guaranteed, based on the structure of the network. For example, a city with more BlueBikes stations in white neighborhoods may by default see more homophilous mobility. So, we want to estimate how much mobility occurs by default between neighborhoods of the same type, subtract that out, and see if the differences between groups remain considerably large.
We’ll control by design by zooming just into rush hour time periods. In fact, it makes sense to just zoom into AM or PM traffic, so that we’re capturing just one direction of movement.
We also need to adjust for time by calculating each statistic year by year: the total number of riders and stations available changed dramatically from year to year, so we want the denominator to differ each year. We want to permute movement only among stations that were actually available/in use each year. Since the records of when each site opened were a little fuzzy, I’d like to use a data-driven approach to estimate this, e.g., only permute edges among the stations where at least 1 (or maybe 10 or 100) users actually moved during a year. This is a control for whether the stations were in use.
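To make the permutation logic concrete, here’s a minimal sketch on toy data (hypothetical edge classifications and ride counts, not our real pipeline): shuffle edge weights, recompute the homophily share each time, and ask how often chance alone produces as much homophily as we observed.
library(tidyverse)
toy <- tibble(
  same_group = c("yes", "no", "yes", "no", "no", "yes"),  # hypothetical: do both stations share a neighborhood type?
  weight     = c(500, 120, 80, 300, 60, 40))              # hypothetical ride counts per station pair
# Observed homophily: share of rides on same-group edges
observed <- with(toy, sum(weight[same_group == "yes"]) / sum(weight))
# Permute the weights 1000 times and recompute the share each time
permuted <- map_dbl(1:1000, function(i){
  shuffled <- sample(toy$weight)
  sum(shuffled[toy$same_group == "yes"]) / sum(shuffled)
})
# One-tailed p-value: how often is permuted homophily at least as large as observed?
mean(permuted >= observed)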
# Load Packages
library(tidyverse)
library(RSQLite)
library(DBI)
# Tell R to hire a 'SQL translator' object, which we'll name 'mydat',
# sourced from our bluebikes data
mydat <- dbConnect(RSQLite::SQLite(), "our_data/bluebikes.sqlite")
mydat %>%
tbl("tally_rush_edges") %>%
# zoom into just morning
filter(rush == "am") %>%
# drop unnecessary variables, to keep this dataset as small as possible
select(-rush) %>%
collect() %>%
saveRDS("processing/mytime.rds")library(tidyverse)
library(sf)
library(tidygraph)
read_rds("our_data/stationbg_dataset.rds") %>%
# Split stations into two groups (above/below median) by population density
mutate(density = ntile(pop_density_2020_smooth10, 2)) %>%
# Split into two groups by age (share over 65)
mutate(age = ntile(pop_over_65_2019_smooth10, 2)) %>%
# Split into two groups by education
mutate(edu = ntile(pop_some_college_smooth10, 2)) %>%
# Calculate total (smoothed) share of residents making UNDER the US median household income
mutate(pop_lower_income = pop_0_40000_2019_smooth10 + pop_40001_60000_2019_smooth10) %>%
# Now build a threshold describing income
mutate(income = ntile(pop_lower_income, 2)) %>%
# Calculate total (smoothed) nonwhite population per block group
mutate(pop_nonwhite = 1 - pop_white_2020_smooth10) %>%
# Split into two groups by race (nonwhite share)
mutate(nonwhite = ntile(pop_nonwhite, 2)) %>%
bind_cols(st_coordinates(.$geometry) %>% as_tibble() %>% select(x = 1, y = 2)) %>%
# convert to tibble,
as_tibble() %>%
# Create our homophily variables for testing
mutate(
# Reverse code income, so that Low = More lower income families and High = more upper income families
bin_income = if_else(pop_lower_income > 0.50, true = "low", false = "high"),
bin_nonwhite = if_else(pop_nonwhite > 0.50, true = "high", false = "low"),
bin_edu = if_else(pop_some_college_smooth10 > median(pop_some_college_smooth10, na.rm = TRUE),
true = "high", false = "low"),
bin_dense = if_else(pop_density_2020_smooth10 > median(pop_density_2020_smooth10, na.rm = TRUE),
true = "high", false = "low")) %>%
# Save
write_rds("our_data/nodes.rds")
mystations <- read_rds("our_data/nodes.rds")
read_rds("processing/mytime.rds") %>%
# Filter to valid stations
filter(start_code %in% mystations$code &
end_code %in% mystations$code) %>%
#filter(!is.na(start_code) & !is.na(end_code)) %>%
group_by(start_code, end_code, year = str_sub(day, 1,4)) %>%
summarize(weight = sum(count, na.rm = TRUE)) %>%
ungroup() %>%
# THRESHOLD BY USAGE
# Design-based control for low usage levels
# Threshold: filter to just pairs where at least 100 BlueBikes trips were ridden in an entire year
filter(weight >= 100) %>%
# BLOCK BY USAGE
# classify edgeweights into quintiles
mutate(traffic = ntile(weight, 5)) %>%
# BLOCK BY DISTANCE
left_join(by = c("start_code" = "code"), y = mystations %>% select(code, x1 = x, y1 = y)) %>%
left_join(by = c("end_code" = "code"), y = mystations %>% select(code, x2 = x, y2 = y)) %>%
mutate(geometry = sprintf("LINESTRING(%s %s, %s %s)", x1, y1, x2, y2)) %>%
st_as_sf(wkt = "geometry", crs = 4326) %>%
select(-x1, -x2, -y1, -y2) %>%
mutate(dist = as.numeric(st_length(geometry)) / 1000) %>%
mutate(distcat = ntile(dist, 5)) %>%
as_tibble() %>%
# Join in income terciles
left_join(by = c("start_code" = "code"), y = mystations %>%
select(code,
start_density = density, start_edu = edu, start_age = age,
start_income = income, start_nonwhite = nonwhite,
start_pop_density = pop_density_2020_smooth10)) %>%
left_join(by = c("end_code" = "code"), y = mystations %>%
select(code, end_density = density, end_edu = edu, end_age = age,
end_income = income, end_nonwhite = nonwhite,
end_pop_density = pop_density_2020_smooth10)) %>%
# Let's also create a population density normalizing weight for our edges,
# representing the product between the start and end population density.
# Also, divide by 1000, because the numbers get really big otherwise
mutate(weight_pop_density = start_pop_density/1000 * end_pop_density/1000) %>%
write_rds("our_data/edges_annual.rds")Here’s our justification for block permuting by edge weight deciles. There’s a second dimension here. Really, we don’t want to allow edge weights to switch between very different pairs; eg. an edge of 500 from Downtown Crossing to Back Bay should not get switched with 3 from JP to Forest Hills. However, there are so many different numbers of people who traveled that we need to create a few categories indicating which edges get exchanged. We’ll call this variable volume, and it’s just a categorical variable describing their location on the edge weight distribution.
myedges <- read_rds("our_data/edges_annual.rds")
myrange <- myedges %>%
as_tibble() %>%
group_by(traffic) %>%
summarize(
count = n(),
min = min(weight, na.rm = TRUE),
max = max(weight, na.rm = TRUE))
myedges %>%
as_tibble() %>%
ggplot(mapping = aes(x = weight)) +
geom_rect(data = myrange, mapping = aes(x = NULL, xmin = min, xmax = max, ymin = 0, ymax = Inf, fill = traffic), color = "#373737") +
geom_histogram(mapping = aes(y = ..density..), color = "white", fill = "#373737", bins = 50) +
geom_density(color = "white", size= 3) +
geom_density(color = "blue", size = 1) +
scale_x_log10(breaks = c(100, 300, 1000, 3000, 10000)) +
labs(y = "Frequency of Rides by Station Pairs\nDensity Distribution of Edge-Weights",
x = "# of Rides per Year per Station-Pair\n(Edge-Weights for Station-Pair-Year Observations)") +
theme_classic(base_size = 14) +
theme(legend.position = "bottom",
legend.margin = margin(0,0,0,0, "cm")) +
scale_fill_gradient(low = "white", high = "#648FFF",
breaks = 0:10, name = "Quintiles by Rides") +
guides(fill = guide_colorbar(barwidth = 20, barheight = 1))
myrange <- myedges %>%
as_tibble() %>%
mutate(distcat = ntile(dist, 5)) %>%
group_by(distcat) %>%
summarize(
count = n(),
min = min(dist, na.rm = TRUE),
max = max(dist, na.rm = TRUE))
myedges %>%
as_tibble() %>%
ggplot(mapping = aes(x = dist)) +
geom_rect(data = myrange, mapping = aes(x = NULL, xmin = min, xmax = max, ymin = 0, ymax = Inf, fill = distcat), color = "#373737") +
geom_histogram(mapping = aes(y = ..density..), color = "white", fill = "#373737", bins = 50) +
geom_density(color = "white", size= 3) +
geom_density(color = "blue", size = 1) +
scale_x_log10(breaks = c(0.1, .3, 1, 3, 10, 30, 100)) +
labs(y = "Frequency of Distances\nby Station Pairs",
x = "Distance Travelled per Year per Station-Pair\n(Distances (km) for Station-Pair-Year Observations)") +
theme_classic(base_size = 14) +
theme(legend.position = "bottom",
legend.margin = margin(0,0,0,0, "cm")) +
scale_fill_gradient(low = "white", high = "#648FFF",
breaks = 0:10, name = "Quintiles by Distance") +
guides(fill = guide_colorbar(barwidth = 20, barheight = 1))
# Check the tercile cut points for population density
mystations %>%
group_by(density = ntile(pop_density_2020_smooth10, 3)) %>%
summarize(n = n(),
min = min(pop_density_2020_smooth10, na.rm = TRUE),
max = max(pop_density_2020_smooth10, na.rm = TRUE))
## # A tibble: 3 × 4
## density n min max
## <int> <int> <dbl> <dbl>
## 1 1 139 1886. 8803.
## 2 2 139 8806. 13483.
## 3 3 138 13541. 31486.
# Classify block groups by their share of nonwhite residents
mystations %>%
# Calculate total (smoothed) nonwhite population per block group
mutate(pop_nonwhite = 1 - pop_white_2020_smooth10) %>%
group_by(nonwhite = case_when(
pop_nonwhite > 0 & pop_nonwhite < 0.25 ~ "mostly_white",
pop_nonwhite >= 0.25 & pop_nonwhite < 0.50 ~ "somewhat_white",
pop_nonwhite >= 0.50 & pop_nonwhite < 0.75 ~ "somewhat_nonwhite",
pop_nonwhite >= 0.75 & pop_nonwhite < 1.00 ~ "mostly_nonwhite",
TRUE ~ NA_character_)) %>%
count()
## # A tibble: 4 × 2
## # Groups: nonwhite [4]
## nonwhite n
## <chr> <int>
## 1 mostly_nonwhite 40
## 2 mostly_white 61
## 3 somewhat_nonwhite 68
## 4 somewhat_white 247
mystations %>%
# Calculate total (smoothed) share of residents making UNDER the US median household income
mutate(pop_lower_income = pop_0_40000_2019_smooth10 + pop_40001_60000_2019_smooth10) %>%
select(code, pop_lower_income) %>%
group_by(income = case_when(
pop_lower_income < 0.20 ~ "highest_income",
pop_lower_income >= 0.20 & pop_lower_income < 0.40 ~ "higher_income",
pop_lower_income >= 0.40 & pop_lower_income < 0.60 ~ "lower_income",
pop_lower_income >= 0.60 & pop_lower_income < 0.80 ~ "lowest_income",
TRUE ~ NA_character_)) %>%
count()
## # A tibble: 4 × 2
## # Groups: income [4]
## income n
## <chr> <int>
## 1 higher_income 195
## 2 highest_income 115
## 3 lower_income 89
## 4 lowest_income 17
mystations %>%
group_by(edu = ntile(pop_some_college_smooth10, 3)) %>%
summarize(count = n(),
min = min(pop_some_college_smooth10, na.rm = TRUE),
max = max(pop_some_college_smooth10, na.rm = TRUE))
## # A tibble: 3 × 4
## edu count min max
## <int> <int> <dbl> <dbl>
## 1 1 139 0.0193 0.130
## 2 2 139 0.131 0.217
## 3 3 138 0.217 0.496
# Check the tercile cut points for the share of residents over 65
mystations %>%
group_by(age = ntile(pop_over_65_2019_smooth10, 3)) %>%
summarize(count = n(),
min = min(pop_over_65_2019_smooth10, na.rm = TRUE),
max = max(pop_over_65_2019_smooth10, na.rm = TRUE))
## # A tibble: 3 × 4
## age count min max
## <int> <int> <dbl> <dbl>
## 1 1 139 0.0166 0.0957
## 2 2 139 0.0962 0.135
## 3 3 138 0.135 0.238
rm(list = ls())
net <- read_rds("our_data/edges_annual.rds")
net %>%
group_by(year,
traffic, distcat,
start_density, end_density,
#start_age, end_age,
start_edu, end_edu,
start_income, end_income,
start_nonwhite, end_nonwhite) %>%
count() %>%
ungroup() %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 5810
# 7998 blocks
ex <- data.frame(
  weight = c(100, 200, 300, 200,
             500, 300, 200, 100) %>% as.numeric(),
  block = c("a", "b", "b", "b",
            "c", "c", "c", "b"))
get_sample = function(x){
if(length(x) == 1){
return(x)
}else{
sample(x, size = length(x), replace = FALSE) %>%
return()}
}
ex %>%
split(.$block) %>%
map(~get_sample(.$weight))
## $a
## [1] 100
##
## $b
## [1] 200 300 100 200
##
## $c
## [1] 500 300 200
ex %>%
group_by(block) %>%
summarize(perm = get_sample(weight) %>% unlist())
## # A tibble: 8 × 2
## # Groups: block [3]
## block perm
## <chr> <dbl>
## 1 a 100
## 2 b 200
## 3 b 300
## 4 b 200
## 5 b 100
## 6 c 500
## 7 c 300
## 8 c 200
rm(.Random.seed, envir=globalenv())
library(tidyverse)
library(sf)
library(tidygraph)
read_rds("our_data/edges_annual.rds") %>%
# Join in key traits of interest
left_join(by = c("start_code" = "code"), y = read_rds("our_data/nodes.rds") %>%
select(code, start_bin_income = bin_income,
start_bin_nonwhite = bin_nonwhite,
start_bin_edu = bin_edu, start_bin_dense = bin_dense)) %>%
left_join(by = c("end_code" = "code"), y = read_rds("our_data/nodes.rds") %>%
select(code, end_bin_income = bin_income,
end_bin_nonwhite = bin_nonwhite,
end_bin_edu = bin_edu, end_bin_dense = bin_dense)) %>%
mutate(bin_nonwhite = paste(start_bin_nonwhite, end_bin_nonwhite, sep = "-"),
bin_income = paste(start_bin_income, end_bin_income, sep = "-"),
bin_edu = paste(start_bin_edu, end_bin_edu, sep = "-"),
bin_dense = paste(start_bin_dense, end_bin_dense, sep = "-")) %>%
write_rds("our_data/edges_annual_prepped.rds")# Repeat our sampling function
get_sample = function(x){
if(length(x) == 1){
return(x)
}else{
sample(x, size = length(x), replace = FALSE) %>%
return()}
}
# Let's write a quick wrapper function to permute the data
get_perm = function(nreps){
print(nreps)
net %>%
group_by(block) %>%
mutate(perm = get_sample(weight) %>% unlist()) %>%
ungroup() %>%
return()
}
# Let's write a second function to calculate summary statistics from the data
get_stat = function(data){
data %>%
# Normalize by the product of source * destination pop density per km2
mutate(perm = perm / weight_pop_density) %>%
# Get homophily by year (since we can just aggregate the final stats later)
group_by(year) %>%
summarize(
# Race
nonwhite_ll = sum(perm[bin_nonwhite == "low-low"], na.rm = TRUE),
nonwhite_lh = sum(perm[bin_nonwhite == "low-high"], na.rm = TRUE),
nonwhite_hl = sum(perm[bin_nonwhite == "high-low"], na.rm = TRUE),
nonwhite_hh = sum(perm[bin_nonwhite == "high-high"], na.rm = TRUE),
# Income
income_ll = sum(perm[bin_income == "low-low"], na.rm = TRUE),
income_lh = sum(perm[bin_income == "low-high"], na.rm = TRUE),
income_hl = sum(perm[bin_income == "high-low"], na.rm = TRUE),
income_hh = sum(perm[bin_income == "high-high"], na.rm = TRUE),
# Education
edu_ll = sum(perm[bin_edu == "low-low"], na.rm = TRUE),
edu_lh = sum(perm[bin_edu == "low-high"], na.rm = TRUE),
edu_hl = sum(perm[bin_edu == "high-low"], na.rm = TRUE),
edu_hh = sum(perm[bin_edu == "high-high"], na.rm = TRUE),
# Dense
dense_ll = sum(perm[bin_dense == "low-low"], na.rm = TRUE),
dense_lh = sum(perm[bin_dense == "low-high"], na.rm = TRUE),
dense_hl = sum(perm[bin_dense == "high-low"], na.rm = TRUE),
dense_hh = sum(perm[bin_dense == "high-high"], na.rm = TRUE)) %>%
return()
}
save(get_sample, get_perm, get_stat, file = "our_data/functions.RData")
load("our_data/functions.RData")
# Compare against observed
read_rds("our_data/edges_annual_prepped.rds") %>%
# Quickly rename observed weight to 'perm', so we can reuse function
rename(perm = weight) %>%
# But rest assured, we are indeed calculating the observed statistics, not permuted
get_stat() %>%
# Combine all 4 categories of any one of the variables to get the total
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/obs.rds")
rm(list= ls())
# Describe the thresholded network (to compare against the original network below)
net <- read_rds("our_data/edges_annual_prepped.rds")
# Unique stations
c(net$start_code,
net$end_code) %>% unique() %>% length()
## [1] 283
net$year %>% unique() %>% length()
## [1] 11
net %>%
select(start_code, end_code) %>%
distinct() %>%
dim()
## [1] 5300 2
net$weight %>% sum()
## [1] 3570019
# Compare against original network
mystations <- read_rds("our_data/nodes.rds")
first <- read_rds("processing/mytime.rds") %>%
# Filter to valid stations
filter(start_code %in% mystations$code &
end_code %in% mystations$code) %>%
#filter(!is.na(start_code) & !is.na(end_code)) %>%
group_by(start_code, end_code, year = str_sub(day, 1,4)) %>%
summarize(weight = sum(count, na.rm = TRUE)) %>%
ungroup()
c(first$start_code, first$end_code) %>% unique() %>% length()
## [1] 411
first$weight %>% sum()
## [1] 5716582
first %>%
select(start_code, end_code) %>%
distinct() %>%
dim()
## [1] 49793 2
remove(net, mystations, first)
load("our_data/functions.RData")
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
mutate(block = paste(year,
#traffic, distcat,
#start_density, end_density,
#start_age, end_age,
#start_edu, end_edu,
#start_income, end_income,
#start_nonwhite, end_nonwhite,
sep = "-") %>% factor() %>% as.numeric()) %>%
select(start_code, end_code, year, block, weight,
bin_nonwhite, bin_income, bin_edu, bin_dense, weight_pop_density) %>%
ungroup()
1:1000 %>%
map_dfr(~get_perm(.) %>% get_stat(), .id = "replicate") %>%
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/perms_1.rds")
rm(list= ls())
load("our_data/functions.RData")
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
mutate(block = paste(year, traffic,
#distcat,
#start_density, end_density,
#start_age, end_age,
#start_edu, end_edu,
#start_income, end_income,
#start_nonwhite, end_nonwhite,
sep = "-") %>% factor() %>% as.numeric()) %>%
select(start_code, end_code, year, block, weight,
bin_nonwhite, bin_income, bin_edu, bin_dense, weight_pop_density) %>%
ungroup()
1:1000 %>%
map_dfr(~get_perm(.) %>% get_stat(), .id = "replicate") %>%
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/perms_2.rds")
rm(list= ls())
load("our_data/functions.RData")
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
mutate(block = paste(year, traffic,
distcat,
#start_density, end_density,
#start_age, end_age,
#start_edu, end_edu,
#start_income, end_income,
#start_nonwhite, end_nonwhite,
sep = "-") %>% factor() %>% as.numeric()) %>%
select(start_code, end_code, year, block, weight,
bin_nonwhite, bin_income, bin_edu, bin_dense, weight_pop_density) %>%
ungroup()
1:1000 %>%
map_dfr(~get_perm(.) %>% get_stat(), .id = "replicate") %>%
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/perms_3.rds")
rm(list= ls())
load("our_data/functions.RData")
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
mutate(block = paste(year, traffic,
distcat,
start_density, end_density,
sep = "-") %>% factor() %>% as.numeric()) %>%
select(start_code, end_code, year, block, weight,
bin_nonwhite, bin_income, bin_edu, bin_dense, weight_pop_density) %>%
ungroup()
1:1000 %>%
map_dfr(~get_perm(.) %>% get_stat(), .id = "replicate") %>%
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/perms_4.rds")
rm(list= ls())
load("our_data/functions.RData")
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
mutate(block = paste(year, traffic,
distcat,
start_density, end_density,
start_edu, end_edu,
sep = "-") %>% factor() %>% as.numeric()) %>%
select(start_code, end_code, year, block, weight,
bin_nonwhite, bin_income, bin_edu, bin_dense, weight_pop_density) %>%
ungroup()
1:1000 %>%
map_dfr(~get_perm(.) %>% get_stat(), .id = "replicate") %>%
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/perms_5.rds")
rm(list= ls())
load("our_data/functions.RData")
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
mutate(block = paste(year, traffic,
distcat,
start_density, end_density,
start_edu, end_edu,
start_age, end_age,
#start_nonwhite, end_nonwhite,
#start_income, end_income,
sep = "-") %>% factor() %>% as.numeric()) %>%
select(start_code, end_code, year, block, weight,
bin_nonwhite, bin_income, bin_edu, bin_dense, weight_pop_density) %>%
ungroup()
1:1000 %>%
map_dfr(~get_perm(.) %>% get_stat(), .id = "replicate") %>%
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/perms_6.rds")
rm(list= ls())
load("our_data/functions.RData")
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
mutate(block = paste(year, traffic,
distcat,
start_density, end_density,
start_edu, end_edu,
start_age, end_age,
start_income, end_income,
#start_nonwhite, end_nonwhite,
sep = "-") %>% factor() %>% as.numeric()) %>%
select(start_code, end_code, year, block, weight,
bin_nonwhite, bin_income, bin_edu, bin_dense, weight_pop_density) %>%
ungroup()
1:1000 %>%
map_dfr(~get_perm(.) %>% get_stat(), .id = "replicate") %>%
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/perms_7.rds")
rm(list= ls())
load("our_data/functions.RData")
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
mutate(block = paste(year, traffic,
distcat,
start_density, end_density,
start_edu, end_edu,
start_age, end_age,
start_income, end_income,
start_nonwhite, end_nonwhite,
sep = "-") %>% factor() %>% as.numeric()) %>%
select(start_code, end_code, year, block, weight,
bin_nonwhite, bin_income, bin_edu, bin_dense, weight_pop_density) %>%
ungroup()
1:1000 %>%
map_dfr(~get_perm(.) %>% get_stat(), .id = "replicate") %>%
mutate(total = nonwhite_ll + nonwhite_lh + nonwhite_hl + nonwhite_hh) %>%
write_rds("our_data/perms_8.rds")
rm(list= ls())
myperms <- bind_rows(
read_rds("our_data/perms_1.rds"),
read_rds("our_data/perms_2.rds"),
read_rds("our_data/perms_3.rds"),
read_rds("our_data/perms_4.rds"),
read_rds("our_data/perms_5.rds"),
read_rds("our_data/perms_6.rds"),
read_rds("our_data/perms_7.rds"),
read_rds("our_data/perms_8.rds"),
.id = "controls") %>%
group_by(controls, replicate) %>%
summarize_at(vars(-c("year", "total")), list(~sum(.) / sum(total) )) %>%
pivot_longer(cols = -c(controls, replicate), names_to = "variable", values_to = "perm")
myobs <- read_rds("our_data/obs.rds") %>%
summarize_at(vars(-c("year", "total")), list(~sum(.) / sum(total) )) %>%
pivot_longer(cols = -c(), names_to = "variable", values_to = "obs")
mycompare <- myobs %>%
left_join(by = c("variable"),
y = myperms %>% select( controls, variable, perm, replicate)) %>%
separate(col = "variable", into = c("variable", "level"), sep = "_") %>%
mutate(variable = variable %>% recode_factor(
"nonwhite" = "<b>Race</b><br>(High = >50% Nonwhite)",
"dense" = "<b>Density</b><br>(High = >Median)",
"income" = "<b>Income</b><br>(High = >50% 60K+)",
"edu" = "<b>Education</b><br>(High = >Median % Some College)"),
level = level %>% recode_factor(
"ll" = "Low->Low",
"lh" = "Low->High",
"hl" = "High->Low",
"hh" = "High->High"),
controls = controls %>% recode_factor(
"1" = "1\nYear",
"2" = "2\nWith\nTraffic",
"3" = "3\nWith\nDistance",
"4" = "4\nWith\nDensity",
"5" = "5\nWith\nEducation",
"6" = "6\nWith\nAge",
"7" = "7\nWith\nIncome",
"8" = "8\nWith\nRace")) %>%
# Filter to income or race, as our main independent variables
filter(str_detect(variable, "Income|Race"))
mytext <- mycompare %>%
group_by(controls, variable, level) %>%
# Standardize the test statistics around 0.
mutate(se = sd(perm, na.rm = TRUE),
mu = mean(perm, na.rm = TRUE),
perm_scaled = (perm - mu) / se,
obs_scaled = (obs - mu) / se) %>%
summarize(
obs = unique(obs),
obs_scaled = unique(obs_scaled),
med = median(perm),
lower = quantile(perm, probs = 0.025),
upper = quantile(perm, probs = 0.975),
# You can either calculate it using scaled measures
p = sum(abs(perm_scaled) >= abs(obs_scaled)) / n(),
# Or calculate it directly, then multiply by 2 for two-tailed test
p_value = case_when(
obs > med ~ sum(perm >= obs) / n(),
obs < med ~ sum(perm <= obs) / n())*2,
# Run a one-tailed test, since we're testing for homophily, not both
# How often do permuted values exceed this level of homophily?
p_one = sum(perm >= obs) / n(),
p_label = paste("p = ", p_one, sep = ""),
# Calculate residual statistic
res = obs - med) %>%
# Update such that NAs become p = 1 (since this just means that
# the permuted and observed values were the same)
mutate(p_label = case_when(p_label == "p = NA" ~ "p = 1",
p_label == "p = 0" ~ "p < 0.001",
TRUE ~ p_label),
obs_scaled = if_else(is.na(obs_scaled), 0, obs_scaled))
# You can test it out here.
#mycompare %>%
# filter(str_detect(variable, "Education") ) %>%
# filter(str_detect(controls, "Education") )%>%
# head()
remove(myperms, myobs)
g1 <- mytext %>%
mutate(label = paste(round(obs, 3), p_label, sep = "\n")) %>%
ggplot(mapping = aes(x = controls, y = level, fill = obs_scaled, label = label,
color = p_one < 0.10)) +
geom_tile() +
geom_tile(data = . %>% filter(p_one < 0.10), color = "black", fill = NA, size = 0.5) +
geom_text() +
scale_color_manual(breaks = c("TRUE", "FALSE"),
labels = c("p < 0.10", "not significant"),
values = c("#373737", "darkgrey")) +
facet_grid( rows = vars(variable), scales = "free") +
theme_classic(base_size = 14) +
theme(strip.background = element_blank(),
panel.border = element_rect(fill = NA, color = "#373737"),
axis.line = element_blank(),
axis.ticks = element_blank(),
legend.position = "bottom",
strip.text.y = ggtext::element_markdown(hjust = 0, angle = 0),
axis.text.y = ggtext::element_markdown(hjust = 1),
legend.margin = margin(0,0,0,0, "cm"),
plot.caption = element_text(hjust = 0)) +
scale_x_discrete(expand = expansion(add = c(0,0))) +
scale_y_discrete(expand = expansion(add = c(0,0))) +
scale_fill_gradient2(mid = "white", midpoint = 0, high = "#648FFF", low = "#DC267F") +
labs(x = "Level of Controls, from fewest (1) to most (8)", y = NULL, fill = "Standardized Test Statistic",
color = NULL,
subtitle = "Is this amount of homophily more than we would expect due to chance?",
caption = "Numbers show observed homophily (%); p-values show statistical significance of homophily.\nSignificant statistics (p < 0.10) in black text; insignificant statistics in grey.")
ggsave(g1, filename = "our_data/tiles.png", dpi = 500, width = 10, height = 9.5)
Let’s use Moro et al.’s formula to get at that.
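Before the code, here’s a minimal sketch of the similarity index used below (the function name is just for illustration). It takes the four shares of rides (low-low, low-high, high-low, high-high, which sum to 1) and measures how far they sit from a perfectly mixed 25/25/25/25 split: 0 = complete mixing, 1 = maximum division.
similarity_index <- function(ll, lh, hl, hh){
  2/3 * (abs(ll - 0.25) + abs(lh - 0.25) + abs(hl - 0.25) + abs(hh - 0.25))
}
similarity_index(ll = 0.25, lh = 0.25, hl = 0.25, hh = 0.25)  # 0: perfectly mixed
similarity_index(ll = 1,    lh = 0,    hl = 0,    hh = 0)     # 1: all rides in one cell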
myobs <- read_rds("our_data/obs.rds") %>%
summarize_at(vars(-c("year", "total")), list(~sum(.) / sum(total) )) %>%
pivot_longer(cols = -c(), names_to = "variable", values_to = "obs") %>%
separate(col = "variable", into = c("variable", "level"), sep = "_") %>%
pivot_wider(id_cols = c(variable), names_from = level, values_from = obs) %>%
mutate(index = 2/3*( abs(ll - .25) + abs(lh - .25) + abs(hl - .25) + abs(hh - .25))) %>%
select(variable, obs = index)
#0 = complete mixing 1 = maximum division
myperms <- bind_rows(
read_rds("our_data/perms_1.rds"),
read_rds("our_data/perms_2.rds"),
read_rds("our_data/perms_3.rds"),
read_rds("our_data/perms_4.rds"),
read_rds("our_data/perms_5.rds"),
read_rds("our_data/perms_6.rds"),
read_rds("our_data/perms_7.rds"),
read_rds("our_data/perms_8.rds"),
.id = "controls") %>%
group_by(controls, replicate) %>%
summarize_at(vars(-c("year", "total")), list(~sum(.) / sum(total) )) %>%
pivot_longer(cols = -c(controls, replicate), names_to = "variable", values_to = "perm") %>%
separate(col = "variable", into = c("variable", "level"), sep = "_") %>%
pivot_wider(id_cols = c(replicate, controls, variable),
names_from = level, values_from = perm) %>%
mutate(index = 2/3*( abs(ll - .25) + abs(lh - .25) + abs(hl - .25) + abs(hh - .25))) %>%
select(replicate, controls, variable, perm = index)
mycompare <- myobs %>%
left_join(by = c("variable"),
y = myperms %>% select(controls, variable, perm, replicate)) %>%
separate(col = "variable", into = c("variable", "level"), sep = "_") %>%
mutate(variable = variable %>% recode_factor(
"income" = "<b>Income</b><br>High: >50% 60K+",
"nonwhite" = "<b>Race</b><br>High: >50% Nonwhite",
"dense" = "<b>Density</b><br>High: >Median",
"edu" = "<b>Education</b><br>High: >Median % Some College"),
controls = controls %>% recode_factor(
"1" = "1\nYear",
"2" = "2\nWith\nTraffic",
"3" = "3\nWith\nDistance",
"4" = "4\nWith\nDensity",
"5" = "5\nWith\nEducation",
"6" = "6\nWith\nAge",
"7" = "7\nWith\nIncome",
"8" = "8\nWith\nRace")) %>%
filter(str_detect(variable, "Income|Race"))
mytext <- mycompare %>%
group_by(controls, variable, level) %>%
# Standardize the test statistics around 0.
mutate(se = sd(perm, na.rm = TRUE),
mu = mean(perm, na.rm = TRUE),
perm_scaled = (perm - mu) / se,
obs_scaled = (obs - mu) / se) %>%
summarize(
obs = unique(obs),
obs_scaled = unique(obs_scaled),
med = median(perm),
lower = quantile(perm, probs = 0.025),
upper = quantile(perm, probs = 0.975),
# You can either calculate it using scaled measures
p = sum(abs(perm_scaled) >= abs(obs_scaled)) / n(),
# Or calculate it directly, then multiply by 2 for two-tailed test
p_value = case_when(
obs > med ~ sum(perm >= obs) / n(),
obs < med ~ sum(perm <= obs) / n())*2,
# One-tailed test
p_one = sum( perm >= obs) / n(),
p_label = paste("p = ", p_one, sep = ""),
# Calculate residual statistic
res = obs - med) %>%
# Update such that NAs become p = 1 (since this just means that
# the permuted and observed values were the same)
mutate(p_label = case_when(p_label == "p = NA" ~ "p = 1",
p_label == "p = 0" ~ "p < 0.001",
TRUE ~ p_label),
obs_scaled = if_else(is.na(obs_scaled), 0, obs_scaled))
remove(myperms, myobs)
g2 <- mytext %>%
mutate(label = paste(round(obs, 3), p_label, sep = "\n")) %>%
ggplot(mapping = aes(x = controls, y = variable, fill = obs_scaled, label = label,
color = p_one < 0.10)) +
geom_tile(fill = NA) +
geom_tile(data = . %>% filter(p_one < 0.10), color = "black", fill = NA, size = 0.5) +
shadowtext::geom_shadowtext(bg.color = "white", bg.r = 0.1, color = "black") +
scale_color_manual(breaks = c("TRUE", "FALSE"),
labels = c("p < 0.10", "less significant"),
values = c("#373737", "darkgrey"),
guide = guide_legend(override.aes = list(fill = c(NA, NA), size = c(2, 0.5)))
) +
theme_classic(base_size = 14) +
theme(strip.background = element_blank(),
panel.border = element_rect(fill = NA, color = "#373737"),
axis.line = element_blank(),
axis.ticks = element_blank(),
legend.position = "bottom",
strip.text.y = element_text(hjust = 0, angle = 0),
axis.text.y = ggtext::element_markdown(hjust = 1),
legend.margin = margin(0,0,0,0, "cm"),
plot.caption = element_text(hjust = 0),
plot.caption.position = "plot") +
scale_x_discrete(expand = expansion(add = c(0,0))) +
scale_y_discrete(expand = expansion(add = c(0,0))) +
scale_fill_gradient2(mid = "white", midpoint = 0, high = "#648FFF", low = "#DC267F") +
labs(x = "Level of Controls, from fewest (1) to most (8)", y = NULL, fill = "Standardized Test Statistic\n(Less than or Greater than Expected)",
color = NULL,
subtitle = "Is this amount of similarity more than we would expect due to chance?",
caption = "Numbers show observed Similarity Index (0 = complete mixing, 1 = complete similarity); p-values show statistical significance of similarity.\nSignificant statistics (p < 0.10) in black text; insignificant statistics in grey.")
ggsave(g2, filename = "our_data/tiles_index.png", dpi = 500, width = 10, height = 3)
myobs <- read_rds("our_data/obs.rds") %>%
summarize_at(vars(-c("year", "total")), list(~sum(.) / sum(total) )) %>%
pivot_longer(cols = -c(), names_to = "variable", values_to = "obs") %>%
separate(col = "variable", into = c("variable", "level"), sep = "_") %>%
mutate(varlabel = variable %>% recode_factor(
"nonwhite" = "<b>Race</b><br>High: >50% Nonwhite<br>Residents",
"income" = "<b>Income</b><br>High: >50% 60K+<br>Earning Households",
"dense" = "<b>Density</b><br>High: >Median<br>Population Density",
"edu" = "<b>Education</b><br>High: >Median<br>% Some College"),
level = level %>% recode_factor(
"hh" = "High<br>▼<br>High",
"hl" = "High<br>▼<br>Low",
"lh" = "Low<br>▼<br>High",
"ll" = "Low<br>▼<br>Low"))
myblues <- c("#2A0B49",
"#250F5A",
"#1A136A",
"#18247A",
"#1d428a",
"#3A6D9B",
"#5793AC",
"#74B5BC",
"#92CCC7",
"#B1DBD0",
"#D0EADE")
g5 <- myobs %>%
ggplot(mapping = aes(x = level, y = obs, fill = level, label = paste(round(obs, 3)*100, "%", sep = "" ))) +
geom_hline(yintercept = 0.25, linetype = "dashed", color = "#373737", alpha = 0.5, size = 1) +
geom_col(color = "#373737", size = 0.2) +
geom_text(nudge_y = 0.04, color = "#373737") +
facet_grid(~varlabel) +
theme_classic(base_size = 14) +
theme(axis.text.x = ggtext::element_markdown(size = 11, hjust = 0.5),
strip.text = ggtext::element_markdown(size = 12, hjust = 0.5),
axis.title.x = ggtext::element_markdown(size = 12, hjust = 0.5),
axis.title.y = ggtext::element_markdown(size = 12, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.background = element_blank(),
axis.line = element_line(color = "#373737", size = 0.5),
axis.ticks = element_blank(),
panel.border = element_rect(fill = NA, color = "#373737", size = 0.5),
panel.spacing = unit(0.5, "cm")) +
scale_fill_manual(values = myblues[c(1,4,6,8)], guide = "none") +
scale_y_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1), limits = c(0, 1),
labels = c("0%", "25%", "50%", "75%", "100%"), expand = expansion(add = c(0,0))) +
labs(y = "<b>Percentage of Ridership</b><br><i>(Ridership Normed by Population Density)</i>",
x = "<b>Traits of Station Pairs by Level of Variables</b><br><i>Type of Source Station → Type of Destination Station</i>",
subtitle = "Neighborhood Mixing Overall (2011-2021)")
ggsave(g5, filename = "our_data/bars.png", dpi = 500, width = 9, height = 5)
myobs <- read_rds("our_data/obs.rds") %>%
mutate_at(vars(contains("nonwhite"), contains("income"),
contains("dense"), contains("edu")),
list(~. / total )) %>%
select(year, contains("nonwhite"), contains("income"),
contains("dense"), contains("edu")) %>%
pivot_longer(cols = -c(year), names_to = "variable", values_to = "obs") %>%
separate(col = "variable", into = c("variable", "level"), sep = "_")
mystat <- myobs %>%
pivot_wider(id_cols = c(year, variable), names_from = level, values_from = obs) %>%
mutate(index = 2/3*( abs(ll - .25) + abs(lh - .25) + abs(hl - .25) + abs(hh - .25))) %>%
select(variable, year, obs = index) %>%
mutate(label = round(obs, 2),
label = str_pad(label, width = 4, side = "right", pad = "0"),
label = str_sub(label, 2, 4),
label = if_else(obs == 1, "1.0", label))
myblues <- c("#2A0B49",
"#250F5A",
"#1A136A",
"#18247A",
"#1d428a",
"#3A6D9B",
"#5793AC",
"#74B5BC",
"#92CCC7",
"#B1DBD0",
"#D0EADE")
mystat %>%
ggplot(mapping = aes(x = year, y = obs, group = variable,
color = variable, label = label)) +
geom_line() +
geom_point(data = . %>% filter(variable == "nonwhite"), size = 10) +
geom_text(data = . %>% filter(variable == "nonwhite"), color = "white") +
geom_point(data = . %>% filter(variable == "income"), size = 10) +
geom_text(data = . %>% filter(variable == "income"), color = "white") +
geom_point(data = . %>% filter(variable == "edu"), size = 10) +
geom_text(data = . %>% filter(variable == "edu"), color = "white") +
geom_point(data = . %>% filter(variable == "dense"), size = 10) +
geom_text(data = . %>% filter(variable == "dense"), color = "white") +
theme_classic(base_size = 14) +
scale_color_manual(values = myblues[c(1,4,6,8)],
breaks = c("nonwhite", "income", "edu", "dense"),
labels = c("Nonwhite", "Income", "Education", "Pop. Density")) +
labs(title = "Change in Neighborhood Similarity Over Time",
x = "Year",
y = "Similarity Index\n(0 = All Mixed; 1 = All Same)",
color = "Type of Similarity") facet_wrap(~variable)## <ggproto object: Class FacetWrap, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetWrap, Facet, gg>
g3 <- bind_rows(
myobs %>%
filter(variable == "nonwhite" & level == "hh"),
myobs %>%
filter(variable == "income" & level == "lh"),
myobs %>%
filter(variable == "edu" & level == "ll"),
myobs %>%
filter(variable == "dense" & level == "hh")
) %>%
mutate(varlabel = variable %>% recode_factor(
"nonwhite" = "<b>% Nonwhite</b><br>(High → High)",
"dense" = "<b>Population Dense</b><br>(High → High)",
"edu" = "<b>Education</b><br>(Low → Low)",
"income" = "<b>Income</b><br>(Low → High)")) %>%
mutate(label = round(obs, 2),
label = str_pad(label, width = 4, side = "right", pad = "0"),
label = str_sub(label, 2, 4),
label = case_when(obs == 1 ~ "1.0",
obs == 0 ~ "0",
TRUE ~ label)) %>%
ggplot(mapping = aes(x = as.numeric(year), y = obs, group = level,
color = level, label = label)) +
geom_line() +
geom_point(data = . %>% filter(variable == "nonwhite"), size = 10) +
geom_text(data = . %>% filter(variable == "nonwhite"), color = "white") +
geom_point(data = . %>% filter(variable == "income"), size = 10) +
geom_text(data = . %>% filter(variable == "income"), color = "white") +
geom_point(data = . %>% filter(variable == "edu"), size = 10) +
geom_text(data = . %>% filter(variable == "edu"), color = "white") +
geom_point(data = . %>% filter(variable == "dense"), size = 10) +
geom_text(data = . %>% filter(variable == "dense"), color = "white") +
theme_classic(base_size = 14) +
scale_color_manual(values = myblues[c(1,4,6,8)]
# breaks = c("nonwhite", "income", "edu", "dense"),
# labels = c("Nonwhite", "Income", "Education", "Pop. Density")
) +
scale_x_continuous(breaks = seq(from = 2011, to = 2021, by = 2), expand = c(0.05, 0.05)) +
scale_y_continuous(expand = c(.025, .025)) +
labs(title = "Notable Changes in Neighborhood Similarity Over Time",
x = "Year",
y = "Percentage of Rides\n(population density-normed ridership)",
color = "Type of Similarity") +
facet_wrap(~varlabel, scales = "free_y") +
theme(strip.text = ggtext::element_markdown(size = 12, hjust = 0.5),
strip.background = element_blank(),
legend.position = "none",
axis.line = element_blank(),
panel.border = element_rect(fill = NA, color = "#373737"))
ggsave(g3, filename = "our_data/lines_notable.png", dpi = 500, width = 9, height = 6)
# Compare against original network
mystations <- read_rds("our_data/nodes.rds") %>%
# Get percentage of households earning OVER $60K a year
mutate(pop_upper_income = 1 - pop_lower_income)
# Get results binned into 20 categories
bins <- 20
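# floor(share * bins) maps a share like 0.37 to bin 7 (the 35-40% interval);
# a share of exactly 1 lands in bin 20, which is why the grids below run 0:bins.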
wealth <- read_rds("our_data/edges_annual_prepped.rds") %>%
select(start_code, end_code, year, weight, weight_pop_density) %>%
# Join in wealth
left_join(by = c("start_code" = "code"), y = mystations %>% select(code, start_wealthy = pop_upper_income)) %>%
left_join(by = c("end_code" = "code"), y = mystations %>% select(code, end_wealthy = pop_upper_income)) %>%
# Norm ride counts by population density
mutate(weight = weight * weight_pop_density) %>%
# Now get count within 20 bins
group_by(start = floor(start_wealthy*bins),
end = floor(end_wealthy*bins)) %>%
summarize(count = sum(weight, na.rm = TRUE)) %>%
# Now join these tallies into the following grid
right_join(by = c("start", "end"),
y = expand_grid(start = 0:bins, end = 0:bins)) %>%
mutate(count = if_else(is.na(count), 0, as.numeric(count)))
race <- read_rds("our_data/edges_annual_prepped.rds") %>%
select(start_code, end_code, year, weight, weight_pop_density) %>%
# Join in race
left_join(by = c("start_code" = "code"), y = mystations %>% select(code, start_nonwhite = pop_nonwhite)) %>%
left_join(by = c("end_code" = "code"), y = mystations %>% select(code, end_nonwhite = pop_nonwhite)) %>%
# Norm ride counts by population density
mutate(weight = weight * weight_pop_density) %>%
# Now get count within 20 bins
group_by(start = floor(start_nonwhite*bins),
end = floor(end_nonwhite*bins)) %>%
summarize(count = sum(weight, na.rm = TRUE)) %>%
# Now join these tallies into the following grid
right_join(by = c("start", "end"),
y = expand_grid(start = 0:bins, end = 0:bins)) %>%
mutate(count = if_else(is.na(count), 0, as.numeric(count)))
net <- bind_rows(wealth %>% mutate(type = "Wealth (% Household Income >60K)"),
race %>% mutate(type = "Race (% Nonwhite)"))
g4 <- net %>%
ggplot(aes(x = start, y = end, fill = count / 1000)) +
geom_tile(color = "grey", size = 0.05) +
geom_tile(data = . %>% filter(count > 0), color = "#373737", size = 0.1) +
geom_vline(mapping = aes(xintercept = 10), alpha = 0.5, size = 2) +
geom_hline(mapping = aes(yintercept = 10), alpha = 0.5, size = 2) +
scale_fill_gradient(
low = myblues[4], high = "white",
# Zeros become NA on a log scale, but we don't want to lose them,
# Since they were actually there,
# So we'll color them with the bottom of the scale
na.value = myblues[2],
limits = NULL, trans = "log",
breaks = c(0.01, 0.1, 1, 10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000),
# Format the larger break labels with thousands separators for readability
labels = c(0.01, 0.1, 1, 10, 30, 100, 300, "1,000", "3,000", "10,000", "30,000", "100,000"),
guide = guide_colorbar(barheight = 13, ticks.colour = "#373737", frame.colour = "#373737", ticks.linewidth = 2)) +
scale_x_continuous(breaks = (0:bins), labels = ((0:bins)/bins)*100,
expand = expansion(add = c(0,0))) +
scale_y_continuous(breaks = (0:bins), labels = ((0:bins)/bins)*100,
expand = expansion(add = c(0,0))) +
labs(
title = "Is Ridership linked to Race and Wealth?",
x = "Source Neighborhood \n % of Residents/Households",
y = "Destination Neighborhood \n % of Residents/Households",
subtitle = "White, wealthy neighborhoods are more often a beginning and/or a destination.",
caption = "Data from 2011-2021 (routes with 100 or more users per year)\nMedian US household income is ~$67,000 USD. Dark blue shows 0 rides.",
fill = "<b>Ridership Rate</b>
<br>per station pair
<br>per 1000 residents
<br>per km<sup>2</sup>"
) +
theme_classic(base_size = 14) +
theme(plot.caption = element_text(hjust = 0),
plot.title = element_text(hjust = 0.5, size = 16),
plot.subtitle = element_text(hjust = 0.5, size = 14),
legend.title = ggtext::element_markdown(size = 12, hjust = 0),
axis.ticks = element_blank()) +
coord_fixed(ratio = 1) +
facet_wrap(~type)
ggsave(g4, filename = "our_data/tiles_wealth.png", dpi = 500, height = 7, width = 12.5)
rm(list = ls())
library(tidyverse)
library(tidygraph)
library(sf)
library(ggnewscale)
library(ggtext)
# Get a 15 km buffer around the center of Suffolk county
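# (Assumes sf with s2 enabled, where st_buffer() on lon/lat coordinates takes dist in meters.)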
mybuffer <- tigris::counties(state = "MA", cb = TRUE, year = 2019) %>%
st_as_sf(crs = 4326) %>%
st_transform(crs = 4326) %>%
select(geoid = GEOID, name = NAME, geometry) %>%
filter(name == "Suffolk") %>%
summarize(geometry = st_centroid(geometry)) %>%
st_buffer(dist = 15000)
mycounties <- tigris::counties(state = "MA", cb = TRUE, year = 2019) %>%
st_as_sf(crs = 4326)
myboston <- tigris::county_subdivisions(state = "MA", county = "Suffolk", cb = TRUE, year = 2019) %>%
st_as_sf(crs = 4326) %>%
filter(NAME == "Boston") %>%
st_transform(crs = 4326)
net <- read_rds("our_data/edges_annual_prepped.rds") %>%
st_as_sf(crs = 4326) %>%
mutate(weight = weight / weight_pop_density)
nodes <- read_rds("our_data/nodes.rds") %>%
st_as_sf(crs = 4326) %>%
# Join in total out-degree (people leaving each source)
left_join(by = "code", y = net %>%
as_tibble() %>%
group_by(code = start_code) %>%
summarize(outdegree = sum(weight, na.rm = TRUE))) %>%
# Join in total in-degree (people arriving at each destination)
left_join(by = "code", y = net %>%
as_tibble() %>%
group_by(code = end_code) %>%
summarize(indegree = sum(weight, na.rm = TRUE))) %>%
# Now filter to just nodes that were either sources or destinations in the threshold network
# (meaning at least 100 rides flowed to or from that station in some year)
filter(indegree > 0 | outdegree > 0)
mybox <- nodes %>% st_bbox()
bg <- read_rds("our_data/bgdataset.rds") %>%
st_as_sf(crs = 4326)
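# st_union() dissolves all block groups into a single outline; only the commented-out
# basemap layers below make use of it.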
background <- bg %>% summarize(geometry = st_union(geometry))
library(ggnewscale)
myblues <- c("#2A0B49",
"#250F5A",
"#1A136A",
"#18247A",
"#1d428a",
"#3A6D9B",
"#5793AC",
"#74B5BC",
"#92CCC7",
"#B1DBD0",
"#D0EADE")
g1 <- ggplot() +
# Get background layer
#geom_sf(data = background, fill = "steelblue", color = "steelblue", size = 3, alpha = 0.2) +
geom_sf(data = myboston, fill = "white", color = "#373737", size = 3) +
# Get block group population density
geom_sf(data = bg %>% filter(area > 0), mapping = aes(fill = pop_density_2020_smooth10), color = NA) +
scale_fill_gradient(na.value = "black", low = "black", high = "white", trans = "log",
breaks = bg$pop_density_2020_smooth10 %>%
quantile(probs = c(0.01, .25, .50, .75, 0.99), na.rm = TRUE) %>%
unname() %>% round(0),
name = "Population Density (per km<sup>2</sup>)<br>by Block Group",
guide = guide_legend(order = 4)) +
# geom_sf(data = background, color = "darkgrey", size = 1, fill = NA) +
geom_sf(data = myboston, color = "#373737", size = 1, fill = NA) +
# geom_sf(data = mycounties, color = "black", size = 1, fill = NA) +
# Add a white outline layer, then the real (colored) layer
geom_sf(data = net %>% filter(bin_nonwhite == "low-low"),
mapping = aes(size = weight + 2), color = "white") +
geom_sf(data = net %>% filter(bin_nonwhite == "low-low"),
mapping = aes(size = weight, color = bin_nonwhite)) +
# Add a white outline layer, then the real (colored) layer
geom_sf(data = net %>% filter(bin_nonwhite == "low-high"),
mapping = aes(size = weight + 2), color = "white") +
geom_sf(data = net %>% filter(bin_nonwhite == "low-high"),
mapping = aes(size = weight, color = bin_nonwhite)) +
# Add a white outline layer, then the real (colored) layer
geom_sf(data = net %>% filter(bin_nonwhite == "high-low"),
mapping = aes(size = weight + 2), color = "white") +
geom_sf(data = net %>% filter(bin_nonwhite == "high-low"),
mapping = aes(size = weight, color = bin_nonwhite)) +
# Add a white outline layer, then the real (colored) layer
geom_sf(data = net %>% filter(bin_nonwhite == "high-high"),
mapping = aes(size = weight + 2), color = "white") +
geom_sf(data = net %>% filter(bin_nonwhite == "high-high"),
mapping = aes(size = weight, color = bin_nonwhite)) +
scale_size_continuous(range = c(0.1, 2), trans = "log",
breaks = net$weight %>%
quantile(probs = c(0.1, .3, .6, 0.99), na.rm = TRUE) %>%
unname() %>% round(1),
name = "Ridership Rate",
guide = guide_legend(order = 2)) +
scale_color_manual(values = myblues[c(1,4,6,8)],
breaks = c("high-high", "high-low", "low-high", "low-low"),
labels = c("High ▶ High",
"High ▶ Low",
"Low ▶ High",
"Low ▶ Low"),
name = "Demographics<br>of Start/Destination",
guide = guide_legend(order = 1, override.aes = list(size = 3))) +
#####################
# Add Node Positions
#####################
ggnewscale::new_scale("size") +
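# new_scale("size") starts a fresh size scale, so the station nodes get their own
# size legend instead of sharing the edge-width scale defined above.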
geom_sf(data = nodes, mapping = aes(size = outdegree),
shape = 21, color = "white", fill = "#696880", stroke = 0.5) +
scale_size_continuous(range = c(0.1, 5), trans = "log",
breaks = nodes$outdegree %>%
quantile(probs = c(0.1, .3, .6, 0.99), na.rm = TRUE) %>%
unname() %>% round(1),
name = "Cumulative Ridership Rate<br>Leaving Starting Station<br>(Weighted Outdegree)",
guide = guide_legend(order = 3)) +
coord_sf(xlim = c(mybox["xmin"], mybox["xmax"]),
ylim = c(mybox["ymin"], mybox["ymax"])) +
theme_void(base_size = 14) +
theme(legend.position = "right",
legend.title = ggtext::element_markdown(size = 12, hjust = 0),
legend.text = ggtext::element_markdown(size = 11, hjust = 0)) +
theme(plot.subtitle = ggtext::element_markdown(size = 14, hjust = 0.5),
panel.border = element_rect(fill = NA, color = "#373737", size = 1)) +
labs(subtitle = "<b>Ridership by Race</b><br>High: Block Groups where >50% Nonwhite Residents") +
ggspatial::annotation_north_arrow(width = unit(0.5, "cm"), height = unit(0.5, "cm")) +
ggspatial::annotation_scale(pad_x = unit(1, "cm"))
g2 <- ggplot() +
# Get background layer
geom_sf(data = myboston, fill = "white", color = "#373737", size = 3) +
# Get block group population density
geom_sf(data = bg %>% filter(area > 0), mapping = aes(fill = pop_density_2020_smooth10), color = NA) +
scale_fill_gradient(na.value = "black", low = "black", high = "white", trans = "log",
breaks = bg$pop_density_2020_smooth10 %>%
quantile(probs = c(0.01, .25, .50, .75, 0.99), na.rm = TRUE) %>%
unname() %>% round(0),
name = "Population Density (per km<sup>2</sup>)<br>by Block Group",
guide = guide_legend(order = 4)) +
# geom_sf(data = background, color = "darkgrey", size = 1, fill = NA) +
geom_sf(data = myboston, color = "#373737", size = 1, fill = NA) +
# geom_sf(data = mycounties, color = "black", size = 1, fill = NA) +
# Add a white outline layer, then the real (colored) layer
geom_sf(data = net %>% filter(bin_nonwhite == "low-low"),
mapping = aes(size = weight + 2), color = "white") +
geom_sf(data = net %>% filter(bin_nonwhite == "low-low"),
mapping = aes(size = weight, color = bin_income)) +
# Add a white outline layer, then the real (colored) layer
geom_sf(data = net %>% filter(bin_nonwhite == "low-high"),
mapping = aes(size = weight + 2), color = "white") +
geom_sf(data = net %>% filter(bin_nonwhite == "low-high"),
mapping = aes(size = weight, color = bin_income)) +
# Add a white outline layer, then the real (colored) layer
geom_sf(data = net %>% filter(bin_nonwhite == "high-low"),
mapping = aes(size = weight + 2), color = "white") +
geom_sf(data = net %>% filter(bin_nonwhite == "high-low"),
mapping = aes(size = weight, color = bin_income)) +
# Add a white outline layer, then the real (colored) layer
geom_sf(data = net %>% filter(bin_nonwhite == "high-high"),
mapping = aes(size = weight + 2), color = "white") +
geom_sf(data = net %>% filter(bin_nonwhite == "high-high"),
mapping = aes(size = weight, color = bin_income)) +
scale_size_continuous(range = c(0.1, 2), trans = "log",
breaks = net$weight %>%
quantile(probs = c(0.1, .3, .6, 0.99), na.rm = TRUE) %>%
unname() %>% round(1),
name = "Ridership Rate",
guide = guide_legend(order = 2)) +
scale_color_manual(values = myblues[c(1,4,6,8)],
breaks = c("high-high", "high-low", "low-high", "low-low"),
labels = c("High ▶ High",
"High ▶ Low",
"Low ▶ High",
"Low ▶ Low"),
name = "Demographics<br>of Start/Destination",
guide = guide_legend(order = 1, override.aes = list(size = 3))) +
#####################
# Add Node Positions
#####################
ggnewscale::new_scale("size") +
geom_sf(data = nodes, mapping = aes(size = outdegree),
shape = 21, color = "white", fill = "#696880", stroke = 0.5) +
scale_size_continuous(range = c(0.1, 5), trans = "log",
breaks = nodes$outdegree %>%
quantile(probs = c(0.1, .3, .6, 0.99), na.rm = TRUE) %>%
unname() %>% round(1),
name = "Cumulative Ridership Rate<br>Leaving Starting Station<br>(Weighted Outdegree)",
guide = guide_legend(order = 3)) +
coord_sf(xlim = c(mybox["xmin"], mybox["xmax"]),
ylim = c(mybox["ymin"], mybox["ymax"])) +
theme_void(base_size = 14) +
theme(legend.position = "right",
legend.title = ggtext::element_markdown(size = 12, hjust = 0),
legend.text = ggtext::element_markdown(size = 11, hjust = 0)) +
theme(plot.subtitle = ggtext::element_markdown(size = 14, hjust = 0.5),
panel.border = element_rect(fill = NA, color = "#373737", size = 1)) +
labs(subtitle = "<b>Ridership by Income</b><br>High: Block Groups where >50% 60K+ Earning Households")
combo <- ggpubr::ggarrange(g1,g2, nrow = 1, legend = "right", common.legend = TRUE)
#ggsave(g1, filename = "our_data/network_race.png", dpi = 500, width = 8, height = 6.5)
ggsave(combo, filename = "our_data/network_race.png", dpi = 500, width = 13.5, height = 6.5)
knitr::include_graphics("our_data/network_race.png")