Potential uses of Replica data for travel analyses

Introduction

Replica provides us with simulated trip records for a synthetically created population at various levels of geographic aggregation across the United States. This dataset is valuable because rich, consistent, and location-specific travel survey data is often hard to find or very expensive. Currently at Renaissance, most of the modeling we do that relies on trip record data uses the National Household Travel Survey (NHTS), but this dataset has limited geographic information and is only published every eight years (the most recent being 2017). As a result, models informed by NHTS are highly general. Replica allows us to cater models to particular study areas of interest, and gives us access to modeling approaches previously unavailable to us due to a lack of quality data.

Essentially, Replica data amounts to highly location-specific travel survey data. Thus, the opportunities for applying this data mirror those presented by any travel survey data. The primary benefits of Replica data come from its consistent structure, its geographic scope, and the ability to acquire records for project-specific zones of interest.

What comes in Replica data?

Like most travel survey data, Replica provides us with trip record tables and person identification tables. Each row in a trip record table defines a single trip; it is tagged with an origin/destination zone, mode, purpose, start/end time, travel time, and travel distance. Each trip is also tagged with a person ID, identifying the person taking the trip. This person ID can be matched to records in the person identification table to obtain a person’s age, employment status, work-from-home status, race/ethnicity, household income, household vehicles, household size, and home/work locations.
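The person ID match described above can be sketched as a simple table join. This is a minimal illustration on toy data, not a real Replica extract: the table name `persons` and the columns `age` and `household_income` are assumptions made for the example, and actual field names may differ.

```r
library(data.table)

# Toy stand-ins for the two Replica tables. Column names are assumed for
# illustration and may differ from a real Replica export.
trips <- data.table(
  person_id        = c(1, 1, 2, 3),
  origin_bgrp      = c(101, 102, 101, 103),
  destination_bgrp = c(102, 101, 103, 103),
  mode             = c("WALKING", "WALKING", "PRIVATE_AUTO", "PRIVATE_AUTO"),
  duration_seconds = c(600, 660, 900, 120)
)
persons <- data.table(
  person_id        = c(1, 2, 3),
  age              = c(34, 61, 19),
  household_income = c(52000, 118000, 23500)
)

# Attach each traveler's attributes to their trips via the shared person ID.
# all.x = TRUE keeps trips even when no matching person record exists.
trips_with_people <- merge(trips, persons, by = "person_id", all.x = TRUE)

# Example summary: mean trip duration (minutes) per person, alongside age.
trips_with_people[, .(mean_mins = mean(duration_seconds) / 60, age = age[1]),
                  by = person_id]
```

Keeping unmatched trips (rather than dropping them) makes it easier to see how many trip records lack person information before any modeling is done.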

Unlike other travel survey data, Replica also provides information on land use with regard to tripmaking. Trip records are tagged with origin and destination land uses, and a separate table is available totaling building area and land area by land use.

Though this is not a comprehensive list, the features of Replica that are both unique and helpful to our analytical efforts are:

Age and income are defined on continuous scales: More often than not, these two variables are defined in bins (e.g. Age <18, 18-24, …; Income <$10,000, $10k-$20k, …); this is especially the case with income, because it is a sensitive variable. When the data is binned, models can only consider these variables as factors, so a great deal of information about variability within bins is potentially lost; this is especially problematic with income, which often features larger bin sizes at high incomes. Continuous income and age data will allow for greater richness in demographic models of travel behavior.
Work from home status is available: Due to the relatively recent rise of working from home, this variable is often ignored in demographic definitions. Understanding whether a person is working from home or in person can be crucial in defining their typical daily trip decisions (e.g. someone working from home, despite being employed, is unlikely to take a work trip).
Trips are tagged with origin/destination zones: Arguably the most important feature of the Replica data for our modeling. By understanding zone-to-zone flows, we can gain a great deal more insight into link-level usage and the association between zone-level statistics and travel behavior. Though many city-specific travel surveys contain information on trip start/end locations, we do not often work with these types of travel surveys; for the most part, we use NHTS, which again offers very limited location data.
Land use information is delivered with the trip records: Replica’s trip origin/destination land use attribute, as well as its zone-level land use aggregation tables, are unique in the travel survey space. This information allows us to understand how land use influences trip generation, and affords the opportunity to bring land use into various zone-level tripmaking models.

Some of the drawbacks of Replica data include:

Demographic data does not feature sex or education: Both sex and education have featured in previous demographically-informed trip behavior models we have created. While the loss of education is not particularly limiting, sex has been found to be a key contributor to travel decision making, so the absence of this variable is notable.
Travel time and distance are not exactly continuous: Travel time is reported in minutes, and travel distance is reported in tenths of miles; though this is likely “good enough” for most modeling applications, it can complicate some estimations. In particular, the presence of 0-minute and 0-mile trips is problematic. Simulation-based approaches could mitigate the effects of this drawback, but can be time-consuming.
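One simulation-based approach of the kind mentioned above can be sketched as follows. This is only an illustration under a stated assumption: that reported minutes are rounded to the nearest whole minute, which has not been confirmed for Replica. The values in `reported_mins` are made up.

```r
set.seed(1)

# Hypothetical reported travel times in whole minutes, including 0-minute trips.
reported_mins <- c(0, 0, 1, 3, 12)

# Assumption: each value was rounded to the nearest minute, so the true time
# lies within half a minute of it (and cannot be negative).
lower <- pmax(reported_mins - 0.5, 0)
upper <- reported_mins + 0.5

# One simulated "continuous" version of the data. Repeating this draw and
# refitting a model each time is what makes the approach time-consuming.
simulated_mins <- runif(length(reported_mins), min = lower, max = upper)
simulated_mins
```

Every simulated value is strictly positive, so transformations such as `log()` no longer fail on the 0-minute records.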

What can be done with Replica?

Opportunities for applying the Replica data can be split into two categories: theoretical explorations and equity analyses. Note that the lists below are not exhaustive and that the ideas presented are not necessarily complete – instead, they are intended to give an idea of what opportunities may be available from the Replica data, and inspire other avenues for analysis.

Theoretical explorations

The depth of the Replica data presents interesting opportunities for statistical development in the travel modeling space. Below, a few of these opportunities are detailed:

1. Within-zone travel time modeling: When we skim networks, we run into the problem that the travel time and distance within a zone is 0; this is a result of zones being converted into network nodes for routing, and the fact that the distance between a node and itself is 0. This, of course, is unrealistic, but we do not currently have a great method for estimating within-zone travel times. However, with the Replica data, we can create a modeling frame for estimating within-zone travel times as a function of zone characteristics by looking specifically at trips that start and end in the same zone. We can fit mathematical models to estimate an average within-zone travel time, or to predict parameters of a distribution of within-zone travel times. We would be able to apply the results of such a model to our travel skimming to produce more accurate estimates of travel time along a network.
2. Variability in travel times around shortest paths: When we skim networks, we generally assume (for convenience) that all users take the shortest path between zones for their mode of choice. However, we know this is not always the case. Because Replica presents a range of travel times for each pair of zones, we can observe how frequently the shortest path is used. In analyses focusing on just Replica data, we could compare observed travel times to the shortest path travel time to understand where there are high deviations from shortest path use. This could help identify areas where there are barriers to travel. For use in non-Replica studies, we could attempt to model parameters of a distribution of between-zone travel times as a function of the shortest path time. We could then apply the results of these models to produce more realistic expectations of travel time. One use case of this might be to produce bounded accessibilities, where our accessibility estimates would reflect a confidence interval of true accessibility given observed travel time patterns in the region.
3. Improvements in mode share modeling: In general, our mode share models, whether time-based, demographic-based, or a combination thereof, are “global”, or generalized to an entire region. This, however, ignores the “local” variability observed on a zone-to-zone basis. Local features that encourage or discourage travel by a particular mode between a pair of zones are rarely quantifiable in mode choice models (e.g. one may expect a lot of walking between two adjacent zones, but if there are no sidewalks, no walking may be observed). However, because between-zone mode shares can be calculated from the Replica data, we can bridge the gap between an “expected” mode share (based on global characteristics) and an “observed” mode share (based on local characteristics). This would allow for a more nuanced definition of mode choice, and would thus improve estimates that rely on between-zone travel information (e.g. accessibility).
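As a rough illustration of the “expected” versus “observed” mode share idea, the sketch below compares a region-wide walk share to the walk share observed for each origin-destination pair. The data is a toy table, and the region-wide share is only a crude stand-in for what a fitted global mode share model would predict.

```r
library(data.table)

# Toy trip table; zone IDs and modes are made up for illustration.
trips <- data.table(
  origin_bgrp      = c(1, 1, 1, 1, 2, 2, 2, 2),
  destination_bgrp = c(2, 2, 2, 2, 3, 3, 3, 3),
  mode = c("WALKING", "WALKING", "WALKING", "PRIVATE_AUTO",
           "PRIVATE_AUTO", "PRIVATE_AUTO", "PRIVATE_AUTO", "WALKING")
)

# "Global" walk share across all trips (stand-in for a regional model's
# expected share).
global_walk_share <- trips[, mean(mode == "WALKING")]

# "Local" observed walk share for each origin-destination pair.
od_share <- trips[, .(n_trips = .N, walk_share = mean(mode == "WALKING")),
                  by = .(origin_bgrp, destination_bgrp)]

# Gap between observed and expected shares; large gaps flag pairs where
# local conditions may encourage or discourage walking.
od_share[, gap := walk_share - global_walk_share]
od_share
```

In practice the gap would only be meaningful for pairs with enough trips, so a minimum `n_trips` filter would likely be needed before interpreting it.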

Code example: For (1), consider a simple example of estimating within-zone travel time as a function of zone size. The approach rests on the hypothesis that larger zones should have not only higher average within-zone travel times, but also higher variability. A minimal code example for private auto trips in Miami-Dade County is given below:

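# Packages assumed by this example; `trips` is the Replica trip table, already loaded
library(data.table)
library(magrittr)   # provides %>%
library(sf)
library(tigris)     # block_groups()
library(ggplot2)

# Mean within-zone travel time (minutes) for private auto trips, by block group
wz = trips[origin_bgrp == destination_bgrp & mode == "PRIVATE_AUTO",
           .(origin_bgrp, duration_seconds)] %>%
  .[, .(MEAN_MINS=mean(duration_seconds)/60), by="origin_bgrp"] %>%
  setnames(old="origin_bgrp", new="GEOID")
# Block group areas in square miles (EPSG:2236 is a Florida state plane CRS in feet)
bg = block_groups(state="FL", progress_bar=FALSE) %>%
  dplyr::filter(GEOID %in% wz$GEOID) %>%
  st_transform(2236) %>%
  dplyr::mutate(SQMI = as.numeric(st_area(.))/(5280^2)) %>%
  st_drop_geometry() %>%
  data.table()
bg$GEOID = as.numeric(bg$GEOID)  # match the numeric zone IDs in the trip table
# merge() for data.table joins on `by`, not `on`
wz = merge(wz, bg, by="GEOID") %>%
  .[, .(GEOID, MEAN_MINS, SQMI)]
# Log-log scatterplot; zones with a 0-minute mean are dropped before taking logs
ggplot(data=wz[MEAN_MINS > 0]) +
  geom_point(aes(x=log(SQMI), y=log(MEAN_MINS))) +
  labs(x="log(square miles)", y="log(mean travel time in minutes)",
       title="Within-zone travel time by zone area") +
  theme_bw()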

Though the trend is not perfect, a clear positive correlation provides proof of concept for this sort of approach. More robust mathematical methods, as well as the introduction of additional explanatory variables, could improve modeling in this context.
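A natural next step would be to fit a simple log-log regression to the relationship plotted above. The sketch below does this on simulated data standing in for the `wz` table; the intercept, slope, and noise level used to generate the data are arbitrary assumptions and say nothing about the actual Miami-Dade relationship.

```r
set.seed(42)

# Simulated stand-in for the `wz` table above (one row per zone). The
# coefficients used to generate MEAN_MINS are arbitrary, for illustration only.
n <- 200
wz_toy <- data.frame(SQMI = exp(rnorm(n, mean = -1, sd = 1)))
wz_toy$MEAN_MINS <- exp(1 + 0.4 * log(wz_toy$SQMI) + rnorm(n, sd = 0.3))

# Log-log linear model: the slope is the approximate percent change in mean
# within-zone travel time per percent change in zone area.
fit <- lm(log(MEAN_MINS) ~ log(SQMI), data = wz_toy)
coef(fit)

# Predicted within-zone time (minutes) for a hypothetical 0.5 square mile zone.
# Note that exp() of a log-scale prediction estimates a median, not a mean.
pred_mins <- exp(predict(fit, newdata = data.frame(SQMI = 0.5)))
pred_mins
```

A prediction of this kind could replace the 0-minute within-zone entries in a skim, though a model of the variance (not only the mean) would be needed to test the hypothesis that larger zones also show more variability.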

Equity analyses

A particular point of interest for Renaissance’s use of the Replica data is applying it to equity analyses. The demographic information provided for each traveler coupled with the zone-to-zone level definition of trips makes this data uniquely suited to pursue these sorts of analyses. Below, a few of these opportunities are detailed:

1. Mode utilization by race/income: This includes questions such as “Do certain groups use certain modes more frequently? To what extent?”. We’ve observed in the past that minority and low-income travelers more frequently rely on non-auto modes. The size of the Replica data would allow us to estimate the effect sizes with great precision, as well as cater them to individual areas of interest.
2. Travel budget by race/income: This includes questions such as “Are minority and/or poorer populations stuck with longer travel times?”. We’ve observed in the past that minority and low-income travelers are more frequently forced to travel greater distances. As is the case in (1) above, the size of the Replica data would allow us to estimate the effect sizes with great precision. Additionally, with the cross classification of race and income, we could explore interaction effects in a regression setting.
3. Spatial correlation between populations in demographic subsets and land use types: This includes questions such as “Are certain groups more likely to be proximate to mixed use development?” and “Are land use breakdowns related to demographic breakdowns?”. We’ve never had the ability to answer these questions before, due to difficulty in aligning our common demographic datasets (e.g. ACS) with our common land use datasets (e.g. parcels). Replica provides a clean, cohesive source where this data is available at the same geographic level. By identifying relationships between land use and demographics, we can identify areas where interventions should be made to modernize development.
4. Vehicle ownership by race/income: This includes questions such as “How much more likely is a person of race X than race Y to be in a household of N vehicles?” and “What is the distribution of household incomes across classes of vehicle ownership?”. These could be computed directly from the Replica population data, and are notable because cross tabulations of vehicle ownership by race and/or income are not available in the US Census. This could help contextualize some of the findings in a question like (1), given how crucial vehicle availability is to the mode choice decision. It could also be helpful in establishing general rules of thumb for the relationship between these variables, given that other data sources are not available.
5. Accessibilities to various land uses by race/income: This includes questions such as “Do any demographic subgroups have disproportionate access (or lack of access) to any land use type?”. Such a question could be important in establishing where land use patterns are potentially limiting opportunities for a group of people.
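The vehicle ownership cross tabulation described above could be sketched as follows. The population table here is a toy example: the group labels "A" and "B" are placeholders, and the column name `household_vehicles` is an assumption about how the person table stores this attribute.

```r
library(data.table)

# Toy person table; values are made up for illustration.
pop <- data.table(
  race_ethnicity     = c("A", "A", "A", "B", "B", "B", "B", "A"),
  household_vehicles = c(0, 1, 2, 0, 0, 1, 1, 1)
)

# Count persons in each race/ethnicity by vehicle-ownership class.
counts <- pop[, .N, by = .(race_ethnicity, household_vehicles)]

# Share of each group living in a household with N vehicles.
counts[, share := N / sum(N), by = race_ethnicity]

# Wide layout: one row per group, one column per vehicle count.
# fill = 0 records combinations that never occur as a zero share.
wide <- dcast(counts, race_ethnicity ~ household_vehicles,
              value.var = "share", fill = 0)
wide
```

Ratios of the shares in any column then give the "how much more likely" comparison between two groups for a given vehicle count.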

Code example: For (2), consider measuring whether minority populations take longer walk trips than non-minority populations. Cases where minority walking travel budgets are longer could indicate a need for improved accessibility in high minority population areas. A minimal code example for Miami-Dade County is given below:

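# Packages assumed by this example; `trips` and `pop` are the Replica trip and person tables
library(data.table)
library(magrittr)   # provides %>%
library(ggplot2)
library(stringr)    # str_to_title()

MODE = "WALKING"
# Attach race/ethnicity to each trip; "\\N" (a literal \N) is treated as a missing-value marker
rd = trips[person_id!="\\N", 
           .(person_id, origin_bgrp, destination_bgrp, mode, time=duration_seconds)] %>%
  .[, person_id := as.numeric(person_id)] %>%
  merge(pop[!is.na(person_id), .(person_id, race_ethnicity)], by="person_id") %>%
  .[race_ethnicity != "\\N"] %>%
  .[, minority := fcase(race_ethnicity=="white_not_hispanic_or_latino", "Non-minority",
                        default="Minority")]
# Density of trip times by minority status, limited to trips of an hour or less.
# geom_density() has no binwidth/center arguments, so they are not passed here.
rd %>%
  .[mode == MODE] %>%
  ggplot() +
  geom_density(aes(x=time, y=after_stat(density), color=minority)) +
  scale_x_continuous(name="Travel time (seconds)",
                     limits=c(0,3600)) +
  scale_y_continuous(name="Density") +
  labs(title=paste(str_to_title(MODE), "trip times by minority status")) +
  scale_color_discrete(name="Minority status") +
  theme_bw()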

This density plot indicates that, on the scale of the region, minority populations do tend to take slightly longer walk trips than non-minority populations. However, we would like to know where specifically this is occurring. Because the Replica data gives us information on trip origin and destination, we can identify problematic connections by running a one-sided Welch's t-test on walk times for each zone pair.

# One-sided Welch's t-test per zone pair: are minority walk times longer?
tt = rd %>%
  .[mode == MODE] %>%
  # mean, standard deviation, and trip count by zone pair and minority status
  .[, .(xbar=mean(time), s=sd(time), n=.N), by=c("origin_bgrp","destination_bgrp","minority")] %>%
  dcast(origin_bgrp + destination_bgrp ~ minority, value.var=c("xbar","s","n")) %>%
  .[n_Minority > 15 & `n_Non-minority` > 15] %>% #rule of thumb for power
  # t statistic: difference in means over its standard error
  .[, t_num := xbar_Minority - `xbar_Non-minority`] %>%
  .[, t_den := sqrt(s_Minority^2/n_Minority + `s_Non-minority`^2/`n_Non-minority`)] %>%
  .[, t := t_num / t_den] %>%
  # Welch-Satterthwaite approximation of the degrees of freedom
  .[, df_num := (s_Minority^2/n_Minority + `s_Non-minority`^2/`n_Non-minority`)^2] %>%
  .[, df_den := s_Minority^4/(n_Minority^2*(n_Minority-1)) + `s_Non-minority`^4/(`n_Non-minority`^2*(`n_Non-minority`-1))] %>%
  .[, df := df_num / df_den] %>%
  # upper-tail p-value (equivalent to 1-pt(t, df))
  .[, p := pt(t, df, lower.tail=FALSE)] %>%
  .[order(p)]
# tt[p < 0.05] #Bonferroni correction for controlling error rate?
# These are the pairs for which we can confidently say that minorities are
# taking longer walk trips than non-minorities. This may suggest lack of access
# to more optimal travel modes between the two zones for minority populations
# (600 seconds = 10 minutes, the "long" walk trip threshold used here)
tt[p < 0.05 & xbar_Minority > 600] %>%
  .[, .(orig=origin_bgrp, dest=destination_bgrp,
        mean_min_tt=xbar_Minority, mean_nm_tt=`xbar_Non-minority`,
        n_min_trips=n_Minority, n_nm_trips=`n_Non-minority`)]

##             orig         dest mean_min_tt mean_nm_tt n_min_trips n_nm_trips
##  1: 120869803001 120860076031    740.1493   403.9024         134         82
##  2: 120860076031 120869803001    869.6703   481.7143          91         35
##  3: 120860103001 120860103001    827.7064   530.2703         327         37
##  4: 120860114015 120860114015    635.2800   511.1881        1000        202
##  5: 120860114041 120860114041    690.6897   558.5047         348        107
##  6: 120860084094 120860081022    748.4746   543.1579          59         19
##  7: 120860090103 120860090101    752.4324   505.7143         111         21
##  8: 120869803001 120860079011    760.8791   653.2653         182         98
##  9: 120860115001 120860115001    868.1633   487.8261          49         23
## 10: 120860125001 120860140001   1804.3636  1391.2500         165         16
## 11: 120860089041 120860146002   1399.6721  1273.8462         183         26

There are 11 zone pairs where not only is walking time for minority populations significantly higher than walking time for non-minority populations, but the mean walking time for minority populations is greater than 10 minutes (considered for this example to be a “long” walking trip). This could be an indication that minority populations traveling between these zones face a barrier to auto or transit access that non-minority populations in the area do not face. As such, this could serve as a starting point for deeper exploration into the specific network attributes, land uses, and development defining these pairs.
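Because one test is run per zone pair, some pairs will fall below 0.05 by chance alone, which is the concern behind the Bonferroni note in the code comments above. A minimal sketch of adjusting for this with base R's `p.adjust()` is shown below; the p-values are hypothetical and are not taken from the Miami-Dade results.

```r
# Hypothetical one-sided p-values, one per zone pair (not real results).
p <- c(0.0001, 0.004, 0.012, 0.03, 0.2)

# Bonferroni controls the chance of any false positive but is conservative.
p_bonf <- p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg controls the expected share of false positives among
# the flagged pairs, and is typically less strict.
p_bh <- p.adjust(p, method = "BH")

# Indices of zone pairs still flagged at the 0.05 level under each method.
flagged_bonf <- which(p_bonf < 0.05)
flagged_bh   <- which(p_bh < 0.05)
```

Which adjustment is appropriate depends on the purpose: for a screening exercise that feeds into deeper exploration, a false discovery rate approach may be a reasonable choice, but that is a judgment call rather than something the data settles.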

Conclusion

Replica’s precise, zone-tagged travel record data, coupled with its detailed traveler demographic information, presents Renaissance with the unique opportunity to cater travel modeling to individual areas of interest. Though slightly incomplete, the demographic data provides sufficient opportunity to explore questions of equity that previous datasets did not support. Additionally, Replica could serve as a useful tool for building generalized rule sets to apply in other travel and land use modeling contexts.