How can data quality issues be handled – specifically, an oversaturation of facilities indicators? For example, what if a city provides its GTFS data for the “transit” facilities indicator, or provides the list of all businesses accepting SNAP benefits for its grocery stores? This could result in an unusually large number of instances of a facilities indicator in a service area, which consequently implies an unusually large amount of routing to complete. If it is ever decided to route using ESRI network tools, routing to all instances would be very credit cost intensive, and would be very time intensive regardless of software used.
A challenge becomes accounting for breadth of facilities indicators accessible from a given point without the constraint of routing to every point. This update to the walk score function suggests routing to only the first \(z_k + 1\) instances of facilities indicator \(k\), where \(z_k\) is the desired number of facilities indicator \(k\). Furthermore, it suggests accounting for breadth in terms of the percentage of a facilities indicator occurring in the service area of interest.
The walkability score at sample point \(s\) is calculated as:
\(W_{s} = min\bigg\{\sum\limits_{k=1}^{K} \Big[100c_k \cdot \min\Big\{\Phi_{s}, 1\Big\}\Big] + sign(\Psi_s) \cdot min\Big\{\Big|\Psi_s\Big|, \Omega_s\Big\} \ \ , \ \ 100\bigg\}\)
Where:
\(\Phi_s = (\frac{min\{r_k, z_k\}}{z_k}) \cdot \alpha\Big((y_k + 1) - \sum\limits_{i=1}^{r_k} (y_k + 1 - M_{s,k,i}) \cdot u(rank(M_{s,k,i}), M_{s,k,i}, z_k)\Big) \cdot (1 + \frac{n_k}{o_k})\)
And:
\(\Psi_{s} = \sum\limits_{q=1}^{Q} f_{q} \cdot 100T_{s,q}\)
Model notation is defined as follows:
\(W_s\): the walkability score for sample point \(s\)
\(s\): identifier for a sample point
\(k\): identifier for a facilities indicator
\(K\): the total number of facilities indicators considered; for raster creation, four are considered: grocery stores, health centers, schools, and transit stops
\(c_k\): the category weight associated with facilities indicator \(k\), subject to \(\sum_{k=1}^{K} c_k = 1\); for raster creation, it is assumed that \(c_k = 0.25, \ \forall k\)
\(z_k\): the desired number of facilities indicator \(k\) within a walkable distance of \(s\); for raster creation, it is assumed that \(z_k = 1, \ \forall k\)
\(r_k\): the number of instances of facilities indicator \(k\) achieved in searching for \(z_k + 1\) instances
\(y_{k}\): the maximum walk-time distance to an instance of facilities indicator \(k\) (in minutes). Also used to define the walk-time extent of \(A\), the set of all service area polygons; for raster creation, it is assumed that \(y_k = 10, \ \forall k\)
\(i\): identifier for a walkable instance of a facilities indicator (where “walkable” means within \(A_s\), the service area of sample point \(s\))
\(M_{s,k,i}\): the walk-time distance from sample point \(s\) to the \(i^{th}\) walkable instance of the \(k^{th}\) facilities indicator
\(rank(M_{s,k,i})\): the rank of \(i\) in terms of its closeness to \(s\) (i.e. the closest instance will have rank 1, second closest will have rank 2…)
\(u(\cdot, \cdot, \cdot)\): the function used to weight \(i\) based on how likely a user is to choose \(i\) from among the \(r_k\) routed instances, subject to \(\sum u(\cdot, \cdot, \cdot) = 1\). The number of weights produced by \(u\) is \(z_k\). Ideally, this will be based on both the proximity rank and relative size of walk-time distances. Not yet defined.
\(\alpha(\cdot)\): the decay function to weight \(i\) based on \(M_{s,k,i}\)
\(\alpha(M_{s,k,i}) = 1 - \frac{M_{s,k,i} - 1}{2 \cdot y_{k} - 2}\)
Defines a linear decay of “distance weight” as walk-time distance increases. Instances of a facilities indicator within 1 minute of a sample point will always have distance weight 1, while instances within \(y_k-1\) to \(y_k\) minutes will always have weight 0.5 (i.e., assume that walking 1 minute is always considered twice as advantageous as walking the maximum walk-time distance)
\(n_{k}\): the total number of instances of facilities indicator \(k\) in \(A_s\)
\(o_k\): the total number of instances of facilities indicator \(k\) in the AOI
\(q\): identifier for an area indicator
\(Q\): the total number of area indicators considered; for raster creation, four are considered: incidences of crime, vehicle crashes involving pedestrians, historical sites designated by the National Registry of Historical Places, and street trees
\(f_{q}\): the sign associated with area indicator \(q\), indicating whether it is considered to have a negative or positive impact on walkability (for example, crime would have \(f_{q} = -1\), because it is considered to make an area less walkable)
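The decay function \(\alpha(\cdot)\) from the notation above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function and argument names are assumptions introduced here.

```python
def alpha(walk_time: float, y_k: float) -> float:
    """Linear distance decay: weight 1.0 at 1 minute, 0.5 at y_k minutes.

    walk_time and y_k are both in minutes; the names are illustrative.
    """
    if y_k <= 1:
        raise ValueError("y_k must exceed 1 minute for the decay to be defined")
    return 1.0 - (walk_time - 1.0) / (2.0 * y_k - 2.0)


# With the raster-creation assumption y_k = 10:
for minutes in (1, 5.5, 10):
    print(minutes, alpha(minutes, y_k=10))
```

With \(y_k = 10\), the weight falls from 1 at one minute to 0.5 at ten minutes, passing through 0.75 at the midpoint of 5.5 minutes.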
Changes are manifest in the \(\Phi_s\) component of the model (the \(\Psi_s\) function is the same as before – i.e., area indicators are assumed to be added to the model in the same manner as previously). Each of the terms is detailed below. Descriptions will assume that the score at point \(s\) (with service area \(A_s\)) is being considered.
\(\frac{min\{r_k, z_k\}}{z_k}\)
Assuming a user desires \(z_k\) instances of facilities indicator \(k\) within a walkable distance, routing will be completed to the closest \(z_k + 1\) instances of \(k\) in \(A_s\). Of course, there is no guarantee that there are actually \(z_k + 1\) instances of \(k\) in \(A_s\) – a user could say he wants 3 grocery stores, for example, but only has 1 in his service area. It also could be the case that a user has more instances of \(k\) than he wants – for example, 5 grocery stores when only 3 are wanted. The variable \(r_k\) indicates the number of instances of \(k\) that were successfully routed to from \(s\); so, \(r_k \in \{0,1,...,z_k+1\}\). This model term scales the score for a particular facilities indicator by the ratio of present instances to the desired number of instances of \(k\). It guarantees that a “perfect score” for \(k\) cannot be achieved if \(z_k\) instances of \(k\) are not present.
To provide an example: assume that at point \(s\), a user – who wants to walk no more than 10 minutes to transit stops and desires 4 walkable transit stops – has only 3 transit stops in \(A_s\), which are 1, 4, and 8 minutes away. Then:
\(\frac{min\{r_k, z_k\}}{z_k} = \frac{min\{3, 4\}}{4} = \frac{3}{4} = 0.75\)
\(\alpha\Big((y_k + 1) - \sum\limits_{i=1}^{r_k} (y_k + 1 - M_{s,k,i}) \cdot u(rank(M_{s,k,i}), M_{s,k,i}, z_k)\Big)\)
This term is a reconfiguration of the distance-utility interaction for instances \(i\) of facilities indicator \(k\). First, consider the sum. The sum is a basic weighted average over the \(r_k\) routed instances \(i\), where the weights are meant to represent the likelihood of going to \(i\). The walk-time distance \(M_{s,k,i}\) is “flipped” relative to \(y_k + 1\), such that if \(r_k < z_k\), the absence of instances of \(k\) to which it was desired to route may appropriately be treated as 0 terms. The utility weight function \(u\) is a fixed function producing \(z_k\) weights such that \(u(1, M_{s,k,i}, z_k) + ... + u(z_k, M_{s,k,i}, z_k) = 1\). Note that \(u\) is not currently defined. The assumption underlying the use of a weighted average of walk-time distances is that a user will not always go to the closest instance of \(k\), and that there is some way to model how often a user will go to particular instances of \(k\), manifest in \(u\). When “flipped back” relative to \(y_k + 1\), the result of this process is an average walk-time distance to indicator \(k\) in \(A_s\). It is then applied to the existing linear distance decay function \(\alpha(\cdot)\), which is unchanged from before.
Consider the same example from above. Heuristically, assume that for \(z_k = 4\), \(u(1:4,M_{s,k,i},4) = 0.7, \ 0.1875, \ 0.075, \ 0.0375\). Then:
\(\alpha\Big((y_k + 1) - \sum\limits_{i=1}^{r_k} (y_k + 1 - M_{s,k,i}) \cdot u(rank(M_{s,k,i}), M_{s,k,i}, z_k)\Big) =\)
\(\alpha\Big(11 - ((11-1)*0.7 + (11-4)*0.1875 + (11-8)*0.075)\Big) =\)
\(\alpha(11 - 8.5375) =\)
\(\alpha(2.4625) = 0.91875\)
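The arithmetic above can be reproduced with a short sketch. It follows the worked example, which flips each walk time against \(y_k + 1 = 11\), and it takes the heuristic weights as given because \(u\) is not yet defined. The function names are illustrative, and the treatment of ranks beyond the supplied weights (they simply drop out) is an assumption.

```python
def alpha(walk_time, y_k):
    """Linear distance decay: 1.0 at 1 minute, 0.5 at y_k minutes."""
    return 1.0 - (walk_time - 1.0) / (2.0 * y_k - 2.0)


def distance_term(walk_times, weights, y_k):
    """Apply alpha to the utility-weighted average walk time.

    walk_times: minutes to each routed instance of the indicator.
    weights:    utility weights by proximity rank (assumed given; a
                stand-in for the undefined function u).
    Instances ranked beyond len(weights) are ignored (assumption).
    """
    ranked = sorted(walk_times)  # rank 1 = closest
    flipped_sum = sum((y_k + 1 - m) * w for m, w in zip(ranked, weights))
    # Missing instances contribute 0 to flipped_sum, so "flipping back"
    # places them at y_k + 1 minutes in the weighted average.
    return alpha((y_k + 1) - flipped_sum, y_k)


term = distance_term([1, 4, 8], [0.7, 0.1875, 0.075, 0.0375], y_k=10)
print(round(term, 5))
```

Because the fourth transit stop is missing, its weight of 0.0375 multiplies a zero term, which is equivalent to treating that stop as 11 minutes away.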
\(1 + \frac{n_k}{o_k}\)
By only routing to the closest \(z_k + 1\) instances of \(k\) in \(A_s\), there is a potential that the remaining instances (those ranked \(z_k + 2, ..., n_k\) by closeness) would be ignored, and, consequently, that breadth altogether would not contribute to the model. Take, for example, \(A_{s_1}\) with schools 1, 3, 5, and 7 minutes from \(s_1\) and \(A_{s_2}\) with schools 1 and 3 minutes from \(s_2\): if \(z_k = 1\), \(s_1\) and \(s_2\) would get the same score for schools if breadth were not a factor! This model term accounts for breadth by scaling the score for a particular facilities indicator by the percent of instances in \(A_s\) relative to the total instances of \(k\) in the AOI. It also will increase the score for \(s\) if it is walkable to a rarer facilities indicator, which is an advantageous consequence of including this term.
Consider the same example from above. Further, assume there are 30 total transit stops in the AOI. Then:
\(1 + \frac{n_k}{o_k} = 1 + \frac{3}{30} = 1 + \frac{1}{10} = 1.1\)
Following through for the whole example, and assuming a user assigns transit weight \(c_k = 0.25\), then:
\(Transit \ Score \ = \ 100*0.25*0.75*0.91875*1.1 \ = \ 18.9492\)
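The full per-indicator calculation can be sketched end to end as follows. This is a minimal illustration under the same assumptions as the worked example: the utility weights are supplied by hand because \(u\) is undefined, walk times are flipped against \(y_k + 1\), and all function and parameter names are hypothetical.

```python
def alpha(walk_time, y_k):
    """Linear distance decay: 1.0 at 1 minute, 0.5 at y_k minutes."""
    return 1.0 - (walk_time - 1.0) / (2.0 * y_k - 2.0)


def facility_score(c_k, z_k, y_k, walk_times, weights, n_k, o_k):
    """Score contribution of one facilities indicator at one sample point.

    walk_times: minutes to the routed instances (at most z_k + 1 of them).
    weights:    utility weights by proximity rank (assumed given).
    n_k, o_k:   instance counts in the service area and in the AOI.
    """
    ranked = sorted(walk_times)
    r_k = len(ranked)
    availability = min(r_k, z_k) / z_k
    flipped_sum = sum((y_k + 1 - m) * w for m, w in zip(ranked, weights))
    decay = alpha((y_k + 1) - flipped_sum, y_k)
    breadth = 1 + n_k / o_k
    phi = availability * decay * breadth
    # The cap at 1 mirrors the inner min{Phi_s, 1} in the W_s formula.
    return 100 * c_k * min(phi, 1.0)


transit = facility_score(
    c_k=0.25, z_k=4, y_k=10,
    walk_times=[1, 4, 8],
    weights=[0.7, 0.1875, 0.075, 0.0375],
    n_k=3, o_k=30,
)
print(round(transit, 4))
```

Summing this quantity over all \(K\) indicators, adding the area-indicator adjustment, and capping the total at 100 gives \(W_s\).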
PROS:
Allows for limiting routing from point \(s\) to instances \(i\) of facilities indicator \(k\) in service area \(A_s\) by requesting routes to only the \(z_k + 1\) closest \(i\). This is the primary benefit of using this model. Furthermore, in the case that \(z_k + 1\) instances \(i\) are not present in \(A_s\), the model is equipped to deal with missing data.
Because of the above, this model is robust to data oversaturation – if a Hub customer provides facilities indicator data with too many points, there will be no cost or time concerns associated with over-routing.
Desired number is entered as a more interpretable term in this model. Now, it constrains how many routes will be attempted, and scales the score relative to the percent of those routes achieved – this is more sensible, as it is a better reflection of the user’s specifications. Previously, it was only used to derive the normalization term for the distance-utility interaction.
CONS:
Would require the use of ESRI network tools to route, as the current graph approximation offers no method by which to only route to the closest \(z_k + 1\) points (i.e., routing along the graph is not robust to data oversaturation). This would certainly see a credit cost increase, as service areas will still be created. This is the primary disadvantage of this model.
Using Origin-Destination Cost Matrix (ODCM) would not entail as large a credit cost increase (as each possible pair of routes costs only 0.0005 credits), but would result in a lot of “dead credits” – credits going to routes that are not ultimately used. ODCM charges credits on the number of possible pairs of points, so even if a cutoff was specified such that routes would not be found outside of \(y_k\) minutes or outside of the \(z_k\) closest destinations, credits would still be charged based on the full possible matrix.
Using Closest Facilities would not entail any “dead credits” as described above, but would entail a potentially large credit cost increase. Closest Facilities charges 0.5 credits per route, which is a very costly service.
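To make the trade-off between the two services concrete, the sketch below compares credit costs using the per-pair and per-route rates quoted above. The sample-point and destination counts are hypothetical, and the sketch assumes every requested Closest Facilities route is found and charged.

```python
ODCM_CREDITS_PER_PAIR = 0.0005               # rate quoted in the text
CLOSEST_FACILITY_CREDITS_PER_ROUTE = 0.5     # rate quoted in the text


def odcm_credits(n_origins, n_destinations):
    """ODCM is charged on the full origin-destination matrix."""
    return n_origins * n_destinations * ODCM_CREDITS_PER_PAIR


def closest_facility_credits(n_origins, routes_per_origin):
    """Closest Facilities is charged per route actually requested."""
    return n_origins * routes_per_origin * CLOSEST_FACILITY_CREDITS_PER_ROUTE


# Hypothetical inputs: 1,000 sample points, 500 transit stops, z_k = 1.
n_sample_points = 1000
n_transit_stops = 500
z_k = 1

odcm_cost = odcm_credits(n_sample_points, n_transit_stops)
cf_cost = closest_facility_credits(n_sample_points, z_k + 1)
print(f"ODCM: {odcm_cost:.1f} credits, Closest Facilities: {cf_cost:.1f} credits")
```

At these rates, ODCM remains the cheaper option until the number of destinations exceeds \(1000 \cdot (z_k + 1)\), since each Closest Facilities route costs as much as 1,000 matrix pairs. Only a heavily oversaturated indicator would reverse the comparison.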
The \(u(\cdot,\cdot,\cdot)\) function may be difficult to define. Ideally, while remaining unique to each \(z_k\), the weights assigned by \(u\) would be a result of the relative differences between the observed \(M_{s,k,i}\).
For example, assume \(z_k = 2\) for grocery stores, and two service areas: \(A_{s_1}\) with grocery stores X and Y 1 and 2 minutes away (respectively) and \(A_{s_2}\) with grocery stores Z and W 1 and 8 minutes away (respectively). At \(s_1\), a user will likely go to stores X and Y almost interchangeably, while at \(s_2\) a user will likely go to Z much more often than W. So, despite having the same \(z_k\) and consequently \(u(\cdot,\cdot,2)\) including only 2 weights, different weights would be desirable for points \(s_1\) and \(s_2\).
The problem with this approach, however, is that \(r_k\) is not always greater than or equal to \(z_k\), so 0-distances can be introduced into the model. How can relative weights be assigned to different distances when one or more of the distances does not exist? What would likely have to be done is to assign a “range” of possible weights associated with the rank of each distance, with the actual value of the weight from that range determined by a function of the distance itself. This \(u(\cdot,\cdot,\cdot)\) function would need to be developed and formalized before implementing this model.
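Purely as an illustration of distance-sensitive weights, and not the range-by-rank scheme suggested above, the sketch below uses inverse walk time as a hypothetical stand-in for \(u\). Two choices here are assumptions rather than part of the model: missing rank slots are padded at \(y_k + 1\) minutes (where the flipped distance is 0), and walk times under 1 minute are clamped to 1 minute.

```python
def inverse_time_weights(walk_times, z_k, y_k):
    """Hypothetical u: z_k weights proportional to 1 / walk time.

    Missing instances are padded at y_k + 1 minutes, the distance at
    which the flipped term (y_k + 1 - M) equals 0 (assumption).
    """
    ranked = sorted(walk_times)[:z_k]
    padded = ranked + [y_k + 1] * (z_k - len(ranked))
    # Clamp to 1 minute so very short walks cannot dominate or divide by 0.
    inverse = [1.0 / max(m, 1.0) for m in padded]
    total = sum(inverse)
    return [v / total for v in inverse]


w_s1 = inverse_time_weights([1, 2], z_k=2, y_k=10)   # stores X and Y
w_s2 = inverse_time_weights([1, 8], z_k=2, y_k=10)   # stores Z and W
w_one = inverse_time_weights([1], z_k=2, y_k=10)     # only one store present

for label, w in (("s1", w_s1), ("s2", w_s2), ("one store", w_one)):
    print(label, [round(x, 4) for x in w])
```

Under this stand-in, the closer store receives about two thirds of the weight at \(s_1\) and about eight ninths at \(s_2\), and a missing second store still yields a valid weight vector. Whether inverse walk time reflects actual user behavior is untested.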