Intro

In addition to being a formidable physical challenge for the participants, the NYC Marathon is a logistical Everest for its organizing body, the New York Road Runners (NYRR). The course takes some 50,000+ runners through all 5 boroughs. Starting on secluded Staten Island and ending in Central Park, the participants are escorted by approximately 10,000 volunteers along the 26.2 mile course.

In its current form, runners are divided into 4 different starting times (called waves) and 3 starting lines, which follow unique courses over the first 3 miles. While there is an additional layer of sub-division, the corral, this only seeds a runner at their start line. Runners effectively start the race in groups of roughly 4,000, growing to nearly 13,000 after the third mile when the courses merge.

Objective

With the advent of COVID-19, the structure of many events will need to be revisited to limit large groups. My ultimate goal for this exercise is to increase the distance between each corral, providing an opportunity for greater space between runners. By manipulating the starting times for each corral, we can find to what degree congestion on course could be alleviated. I will also explore how course congestion is a function of time, and how increasing the duration the marathon course is open relates to lower congestion rates throughout the course. As permits and road closures are undoubtedly among NYRR’s biggest race-day expenditures, I imagine optimizing in this way would be of particular interest to race promoters.

In order to accomplish the above, I will first model the 2019 NYC Marathon by way of probability distributions based on actual runner performances. From here, I can then measure and visualize congestion on course, before optimizing starting times to minimize runner traffic.

Process

  • Web scrape results
  • Check the strength of correlation between results and wave/corral/starting line assignments
  • Describe the distribution within each group for various splits on course (10K, Half, 30K, Full Marathon)
  • Simulate the race with various starting schemes
  • Visualize congestion on course
  • Validate model: compare actual 2019 marathon results to simulated 2019 results
  • Build a function to describe congestion on course numerically
  • Optimize the function: find optimal starting times for each group to minimize traffic
  • Verify model: compare runner congestion with optimal starting times to congestion with actual 2019 starting times

Limitations in Scope

In general, the wave and corral each runner is issued is correlated with their expected finishing time. I will be classifying and describing the performance of the runners in each group; however, I will not be attempting to re-engineer the NYRR race prediction process. This project assumes that the NYRR race prediction methods will be consistent in any applications of these findings, and that the performance of runners in future years is relatively similar to that of runners in 2019. While I suspect there is some variation in performances year to year, I would imagine much of that variation could be attributed to things like weather, which impact all runners, so the relative differences in performance might actually be minimal.

Gather Data

We will begin by aggregating results from the 2019 NYC Marathon. I will be scraping 36 pieces of information about every finisher in the 2019 race. We have “splits” (the time the runner reached a designated mile or km marker) for miles 3-26 and for every 5 kilometers, plus their ultimate finishing time (34 splits in total). Each of these splits, along with the official marathon finishing time, is a variation of a metric referred to as “chip time” - the “clock” for each runner does not start until the runner passes the starting line.

We also have a metric called “gun time” for the full marathon, which we can think of as a proxy for time of day. Gun time differs from the official marathon finishing time (chip time) in that the “clock” for gun time starts when the course opens and the first runner starts their race (when the starting pistol or “gun” goes off). With each runner’s gun time and official marathon finishing time, we can find when they crossed the starting line relative to other runners, as well as when they passed the various mile markers on course.
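As a quick sketch of that calculation (the column names are taken from the table shown later; the data frame and Start_Offset names are assumptions for illustration), a runner’s start-line crossing time, in seconds after the course opened, is simply gun time minus chip time:

#Hypothetical sketch: gun time includes the wait before the start line; chip time (MAR) does not,
#so their difference is when the runner crossed the start line relative to the first starter.
Clean_Results3$Start_Offset <- as.numeric(Clean_Results3$`gun time`) - as.numeric(Clean_Results3$MAR)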

Finally, we will also be pulling what bib number each runner was issued, as this is correlated with their wave, and by extension their expected finishing time.

We will be pulling the data highlighted in yellow and green above. Bib number will be pulled from the URL.

Clean Data

Pull Variable names

##  [1] "X"   "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11"
## [13] "V12" "V13" "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23"
## [25] "V24" "V25" "V26" "V27" "V28" "V29" "V30" "V31" "V32" "V33" "V34" "V35"
## [37] "V36"

As we see above, our data does not have useful variable names.

Reformat times

R has not interpreted our splits as durations, so we need to use lubridate to convert them accordingly. We will be using seconds as our unit moving forward, though we will ultimately measure congestion over one-minute intervals.
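A minimal sketch of the conversion, assuming the raw splits were read in as “H:MM:SS” character strings and that the data frame is named Raw_Results (both assumptions for illustration):

library(lubridate)

#hms() parses an "H:MM:SS" string; period_to_seconds() converts it to numeric seconds.
Raw_Results$`10K` <- period_to_seconds(hms(Raw_Results$`10K`))
Raw_Results$HALF  <- period_to_seconds(hms(Raw_Results$HALF))
Raw_Results$`30K` <- period_to_seconds(hms(Raw_Results$`30K`))
Raw_Results$MAR   <- period_to_seconds(hms(Raw_Results$MAR))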

Categorize

We will now sort our runners into 72 different categories. As mentioned above, the NYC Marathon has 3 different starting lines, denoted by blue, green, and orange colored bib numbers. Additionally, there are 4 different start times for the race, and within each wave there are 6 different starting “corrals” which determine where within each wave and each starting line the runners stand. Luckily, we can determine the start time and starting line of each runner from their issued bib numbers. Corrals, unfortunately, are not listed in this document, which complicates things. That being said, each starting time/line combination consists of 6,000 potential numbers spanning 6 corrals. I will work off the assumption that each corral corresponds to a block of 1,000 bib numbers.
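As a hypothetical illustration of that assumption (the block boundaries below are invented for the example, not NYRR’s actual bib assignments), mapping a bib number to a corral within its 6,000-number block is just integer division:

#Hypothetical mapping for illustration only: within each 6,000-number block
#(one wave/start-line combination), each run of 1,000 bibs is treated as a corral.
corral_from_bib <- function(bib){
  ((bib - 1) %% 6000) %/% 1000 + 1
}
corral_from_bib(c(150, 1500, 5999))   #corrals 1, 2, 6 under this assumption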

Regarding professional runners - there are a number of professional fields in the NYC Marathon: the pros are broken into “professional,” “semi-pro,” and “local elite.” As this group is an outlier and relatively very small, we will be removing them from our analysis.

Finally - as we will likely be sorting by corral in the future, we will use “one hot encoding” for this variable at this stage, effectively creating a binary variable for each of the 72 corrals. I have hidden this block of code (230 lines in total) but will share the head of the resulting table.
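The hidden block amounts to one-hot encoding the corral label; a much shorter sketch of the same idea (assuming the W1C5B-style roll-up label has already been derived into a column named CorWrap, and that the data frame is Clean_Results3 - both names assumed here) might be:

#Create one logical column per unique corral label (TRUE if the runner is in
#that corral, FALSE otherwise). CorWrap holds labels like "W1C5B".
for(label in unique(Clean_Results3$CorWrap)){
  Clean_Results3[[label]] <- Clean_Results3$CorWrap == label
}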

With each of our corrals listed as their own variable, we can pull wave, corral and start color.

Here is a list of all our variables with some example data from the first corral. Note how wave, corral, and starting line are denoted with a W1C5B scheme (Wave 1, Corral 5, Blue starting line).

##   bib num gun time  3M  5K   4M   5M   6M  10K   7M   8M   9M  15K  10M  11M
## 1       2     7839 888 918 1183 1476 1770 1833 2062 2352 2651 2744 2946 3250
## 2       3     7693 885 916 1182 1476 1770 1835 2065 2354 2654 2745 2950 3257
## 3       4     7760 889 921 1184 1477 1768 1834 2060 2353 2655 2745 2950 3257
## 4       5     7716 887 918 1183 1476 1768 1833 2063 2353 2653 2744 2950 3256
## 5       6     7845 886 919 1183 1477 1770 1834 2066 2355 2654 2746 2950 3257
## 6       7     8706 888 918 1184 1477 1769 1836 2065 2357 2655 2747 2951 3258
##    12M  20K  13M HALF  14M  15M  25K  16M  17M  18M  30K  19M  20M  21M  35K
## 1 3545 3680 3855 3891 4157 4462 4625 4757 5056 5350 5546 5653 5941 6248 6457
## 2 3547 3680 3855 3890 4156 4462 4627 4757 5056 5350 5545 5652 5939 6247 6454
## 3 3548 3680 3855 3890 4157 4461 4625 4757 5056 5350 5546 5653 5940 6247 6455
## 4 3548 3679 3854 3889 4157 4461 4626 4756 5057 5350 5545 5652 5940 6248 6455
## 5 3548 3680 3855 3890 4157 4463 4630 4762 5057 5350 5547 5654 5953 6255 6477
## 6 3551 3685 3865 3905 4180 4522 4722 4871 5193 5522 5746 5870 6236 6622 6904
##    22M  23M  24M  40K  25M  26M  MAR W1CPA W1C1B W1C2B W1C3B W1C4B W1C5B W1C6B
## 1 6533 6838 7161 7417 7464 7773 7839  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 6524 6800 7079 7308 7350 7631 7693  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 6523 6801 7105 7349 7395 7693 7760  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 6524 6800 7086 7320 7364 7652 7716  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 6552 6856 7175 7427 7474 7777 7845  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 7002 7406 7832 8166 8227 8621 8706  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##   W1C1O W1C2O W1C3O W1C4O W1C5O W1C6O W1C1G W1C2G W1C3G W1C4G W1C5G W1C6G W2C1B
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##   W2C2B W2C3B W2C4B W2C5B W2C6B W2C1O W2C2O W2C3O W2C4O W2C5O W2C6O W2C1G W2C2G
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##   W2C3G W2C4G W2C5G W2C6G W3C1B W3C2B W3C3B W3C4B W3C5B W3C6B W3C1O W3C2O W3C3O
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##   W3C4O W3C5O W3C6O W3C1G W3C2G W3C3G W3C4G W3C5G W3C6G W4C1B W4C2B W4C3B W4C4B
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##   W4C5B W4C6B W4C1O W4C2O W4C3O W4C4O W4C5O W4C6O W4C1G W4C2G W4C3G W4C4G W4C5G
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##   W4C6G CorWrap Wave Cor Col Gun_Start Gun_Start_Round
## 1 FALSE   W1CPA   W1  CP   A         0               0
## 2 FALSE   W1CPA   W1  CP   A         0               0
## 3 FALSE   W1CPA   W1  CP   A         0               0
## 4 FALSE   W1CPA   W1  CP   A         0               0
## 5 FALSE   W1CPA   W1  CP   A         0               0
## 6 FALSE   W1CPA   W1  CP   A         0               0

Explore Data

Predictive power of Waves, Start Line, Corrals

We will now measure if Wave, Corral and/or Starting Line assignment is truly indicative of performance.
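One way to produce this number (a sketch, assuming the cleaned results live in Clean_Results3 with the Wave column created above) is to regress finishing time on wave and read off the R-squared:

#Share of variance in finishing time explained by wave assignment alone.
summary(lm(as.numeric(MAR) ~ Wave, data = Clean_Results3))$r.squared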

## [1] 0.4539345

We see an R-squared of 0.45 when using wave alone to predict performance.

As our Corral designation contains Wave information (we used a W1C1B - Wave, Corral, Starting Line Color - numbering system), we need to predict performance by corral or starting line within a wave to learn if Corral contains additional useful information.

## [1] 0.01154825
## [1] 0.03633491

Above we see that, within a wave, neither corral nor starting line adds meaningful predictive power.

## [1] 0.04456881

Finally, controlling for wave and corral, we find that starting line alone does not appear to be correlated with performance.

Before moving forward, I do want to call out that it is very likely that NYRR can predict performance. I believe my assumption that corral can be inferred from bib number is false. As we do not know which runners are in which corral, we will be modeling runner performance per wave, and largely ignoring corral and starting line information.

Outline of New Starting Line Format

As starting line and corral designation (as labeled here) are not predictive, we will remove these columns from our data frame, but will keep the corral designation in the roll-up field (“CorWrap”). In our new race start format, we will work with the corral as our unit, as it is easier to stagger starts when working with groups of roughly 2,100 runners. The final plan will release one corral from each starting line at the same time (~700 runners per corral per line, 3 starting lines), effectively working in units of ~2,100 runners. As the 3 starting lines of the marathon run separate paths until mile 3, they will remain separated during the most congested parts of the race.

Build and Validate Model

Now that we know which of our fields are predictive of performance, we can build a model that predicts congestion on course for various starting times and formats.

Check Normality

While not perfect, it seems the full and half marathon times for the entire field are close to normally distributed.

Let’s check how waves impact the distribution of times at the half marathon distance…

…and how times are distributed at various splits for runners in Wave 2.

We can see above that the distributions are fairly normal until we get to the tails more than 2 SDs above the mean (marathons are hard, and NYC is a race one finishes even if they need to walk). Given that, we can use a normal distribution to roughly describe the performance of each wave of runners.

Simulate Splits via Distributions.

We will next compare our runners’ performance by wave at various splits (runner distance) to normal distributions. This can best be accomplished by writing a function that (1) captures the actual mean and SD for each of our sub-groups, (2) applies a delta to the SD for scaling purposes, and (3) generates a random sample of the same size as our sub-group population, so we can compare simulated times (normally distributed) to those of our actual runners with a visualization.

In this function, each runner will receive a theoretical time which, compared to their actual performance, is fairly arbitrary. In aggregate, however, the entire sub-group (corral/wave) should have comparable performances.

Simulate_Race_Norm <- function(DataSet,W1SD_Delta,W2SD_Delta,W3SD_Delta,W4SD_Delta){

#Simulates race from normal distributions of performances; exports SD and Mean for each Wave/Distance pairing. 
  
#Each runner receives a simulated time which, while not representative of their performance, is part of our normal distribution. This will be used for comparing theoretical vs actual times in graphing later. 
  
#Subset dataset by Wave 
Results_Sample_W1 <-subset(DataSet,WAVE=="W1")
Results_Sample_W2 <-subset(DataSet,WAVE=="W2")
Results_Sample_W3 <-subset(DataSet,WAVE=="W3")
Results_Sample_W4 <-subset(DataSet,WAVE=="W4")

#Define Sd and Mean per Distance and Wave 
W1_10K_Mean_Norm<-mean(as.numeric(Results_Sample_W1$`10K`))
W1_HALF_Mean_Norm<-mean(as.numeric(Results_Sample_W1$HALF))
W1_30K_Mean_Norm<-mean(as.numeric(Results_Sample_W1$`30K`))
W1_MAR_Mean_Norm<-mean(as.numeric(Results_Sample_W1$MAR))
W1_10K_SD_Norm<-sd(as.numeric(Results_Sample_W1$`10K`))*W1SD_Delta
W1_HALF_SD_Norm<-sd(as.numeric(Results_Sample_W1$HALF))*W1SD_Delta
W1_30K_SD_Norm<-sd(as.numeric(Results_Sample_W1$`30K`))*W1SD_Delta
W1_MAR_SD_Norm<-sd(as.numeric(Results_Sample_W1$MAR))*W1SD_Delta
  
W2_10K_Mean_Norm<-mean(as.numeric(Results_Sample_W2$`10K`))
W2_HALF_Mean_Norm<-mean(as.numeric(Results_Sample_W2$HALF))
W2_30K_Mean_Norm<-mean(as.numeric(Results_Sample_W2$`30K`))
W2_MAR_Mean_Norm<-mean(as.numeric(Results_Sample_W2$MAR))
W2_10K_SD_Norm<-sd(as.numeric(Results_Sample_W2$`10K`))*W2SD_Delta
W2_HALF_SD_Norm<-sd(as.numeric(Results_Sample_W2$HALF))*W2SD_Delta
W2_30K_SD_Norm<-sd(as.numeric(Results_Sample_W2$`30K`))*W2SD_Delta
W2_MAR_SD_Norm<-sd(as.numeric(Results_Sample_W2$MAR))*W2SD_Delta
  
W3_10K_Mean_Norm<-mean(as.numeric(Results_Sample_W3$`10K`))
W3_HALF_Mean_Norm<-mean(as.numeric(Results_Sample_W3$HALF))
W3_30K_Mean_Norm<-mean(as.numeric(Results_Sample_W3$`30K`))
W3_MAR_Mean_Norm<-mean(as.numeric(Results_Sample_W3$MAR))
W3_10K_SD_Norm<-sd(as.numeric(Results_Sample_W3$`10K`))*W3SD_Delta
W3_HALF_SD_Norm<-sd(as.numeric(Results_Sample_W3$HALF))*W3SD_Delta
W3_30K_SD_Norm<-sd(as.numeric(Results_Sample_W3$`30K`))*W3SD_Delta
W3_MAR_SD_Norm<-sd(as.numeric(Results_Sample_W3$MAR))*W3SD_Delta
  
W4_10K_Mean_Norm<-mean(as.numeric(Results_Sample_W4$`10K`))
W4_HALF_Mean_Norm<-mean(as.numeric(Results_Sample_W4$HALF))
W4_30K_Mean_Norm<-mean(as.numeric(Results_Sample_W4$`30K`))
W4_MAR_Mean_Norm<-mean(as.numeric(Results_Sample_W4$MAR))
W4_10K_SD_Norm<-sd(as.numeric(Results_Sample_W4$`10K`))*W4SD_Delta
W4_HALF_SD_Norm<-sd(as.numeric(Results_Sample_W4$HALF))*W4SD_Delta
W4_30K_SD_Norm<-sd(as.numeric(Results_Sample_W4$`30K`))*W4SD_Delta
W4_MAR_SD_Norm<-sd(as.numeric(Results_Sample_W4$MAR))*W4SD_Delta

#Simulate times for each of our distances 
Results_Sample_W1$`10KSim` <- rnorm(nrow(Results_Sample_W1),mean = W1_10K_Mean_Norm, sd = W1_10K_SD_Norm)
Results_Sample_W2$`10KSim` <- rnorm(nrow(Results_Sample_W2),mean = W2_10K_Mean_Norm, sd = W2_10K_SD_Norm)
Results_Sample_W3$`10KSim` <- rnorm(nrow(Results_Sample_W3),mean = W3_10K_Mean_Norm, sd = W3_10K_SD_Norm)
Results_Sample_W4$`10KSim` <- rnorm(nrow(Results_Sample_W4),mean = W4_10K_Mean_Norm, sd = W4_10K_SD_Norm)

Results_Sample_W1$HALFSim <- rnorm(nrow(Results_Sample_W1),mean = W1_HALF_Mean_Norm, sd = W1_HALF_SD_Norm)
Results_Sample_W2$HALFSim <- rnorm(nrow(Results_Sample_W2),mean = W2_HALF_Mean_Norm, sd = W2_HALF_SD_Norm)
Results_Sample_W3$HALFSim <- rnorm(nrow(Results_Sample_W3),mean = W3_HALF_Mean_Norm, sd = W3_HALF_SD_Norm)
Results_Sample_W4$HALFSim <- rnorm(nrow(Results_Sample_W4),mean = W4_HALF_Mean_Norm, sd = W4_HALF_SD_Norm)

Results_Sample_W1$`30KSim` <- rnorm(nrow(Results_Sample_W1),mean = W1_30K_Mean_Norm, sd = W1_30K_SD_Norm)
Results_Sample_W2$`30KSim` <- rnorm(nrow(Results_Sample_W2),mean = W2_30K_Mean_Norm, sd = W2_30K_SD_Norm)
Results_Sample_W3$`30KSim` <- rnorm(nrow(Results_Sample_W3),mean = W3_30K_Mean_Norm, sd = W3_30K_SD_Norm)
Results_Sample_W4$`30KSim` <- rnorm(nrow(Results_Sample_W4),mean = W4_30K_Mean_Norm, sd = W4_30K_SD_Norm)

Results_Sample_W1$MARSim <- rnorm(nrow(Results_Sample_W1),mean = W1_MAR_Mean_Norm, sd = W1_MAR_SD_Norm)
Results_Sample_W2$MARSim <- rnorm(nrow(Results_Sample_W2),mean = W2_MAR_Mean_Norm, sd = W2_MAR_SD_Norm)
Results_Sample_W3$MARSim <- rnorm(nrow(Results_Sample_W3),mean = W3_MAR_Mean_Norm, sd = W3_MAR_SD_Norm)
Results_Sample_W4$MARSim <- rnorm(nrow(Results_Sample_W4),mean = W4_MAR_Mean_Norm, sd = W4_MAR_SD_Norm)

#Compile Sim and Actual times per Wave and Distance 
Results_Sample_W1_10K <- data.frame(Results_Sample_W1$`10K`,Results_Sample_W1$`10KSim`) 
Results_Sample_W1_HALF <- data.frame(Results_Sample_W1$HALF,Results_Sample_W1$HALFSim)
Results_Sample_W1_30K <- data.frame(Results_Sample_W1$`30K`,Results_Sample_W1$`30KSim`)                          
Results_Sample_W1_MAR <- data.frame(Results_Sample_W1$MAR,Results_Sample_W1$MARSim)

Results_Sample_W2_10K <- data.frame(Results_Sample_W2$`10K`,Results_Sample_W2$`10KSim`)
Results_Sample_W2_HALF <- data.frame(Results_Sample_W2$HALF,Results_Sample_W2$HALFSim)
Results_Sample_W2_30K <- data.frame(Results_Sample_W2$`30K`,Results_Sample_W2$`30KSim`)                          
Results_Sample_W2_MAR <- data.frame(Results_Sample_W2$MAR,Results_Sample_W2$MARSim)
    
Results_Sample_W3_10K <- data.frame(Results_Sample_W3$`10K`,Results_Sample_W3$`10KSim`)
Results_Sample_W3_HALF <- data.frame(Results_Sample_W3$HALF,Results_Sample_W3$HALFSim)
Results_Sample_W3_30K <- data.frame(Results_Sample_W3$`30K`,Results_Sample_W3$`30KSim`)                        
Results_Sample_W3_MAR <- data.frame(Results_Sample_W3$MAR,Results_Sample_W3$MARSim)

Results_Sample_W4_10K <- data.frame(Results_Sample_W4$`10K`,Results_Sample_W4$`10KSim`)
Results_Sample_W4_HALF <- data.frame(Results_Sample_W4$HALF,Results_Sample_W4$HALFSim)
Results_Sample_W4_30K <- data.frame(Results_Sample_W4$`30K`,Results_Sample_W4$`30KSim`)                        
Results_Sample_W4_MAR <- data.frame(Results_Sample_W4$MAR,Results_Sample_W4$MARSim)

#Reformat to TidyData Format; standardize Col names
dflist2 <- c("Results_Sample_W1_10K","Results_Sample_W1_HALF","Results_Sample_W1_30K","Results_Sample_W1_MAR","Results_Sample_W2_10K","Results_Sample_W2_HALF","Results_Sample_W2_30K","Results_Sample_W2_MAR","Results_Sample_W3_10K","Results_Sample_W3_HALF","Results_Sample_W3_30K","Results_Sample_W3_MAR","Results_Sample_W4_10K","Results_Sample_W4_HALF","Results_Sample_W4_30K","Results_Sample_W4_MAR")

for(i in 1:16){
  j = get(dflist2[i])
  names(j) = c("actual","sim")
  j = pivot_longer(j,cols=c("actual","sim"),names_to = "type",values_to = "time")
  j$W_D=dflist2[i]
  assign(dflist2[i],j)
}


#Roll up Sim and Actual Distance/Time pairings to Df per Wave
Results_Sample_W1 <- rbind(Results_Sample_W1_10K,Results_Sample_W1_HALF,Results_Sample_W1_30K,Results_Sample_W1_MAR)
Results_Sample_W2 <- rbind(Results_Sample_W2_10K,Results_Sample_W2_HALF,Results_Sample_W2_30K,Results_Sample_W2_MAR)
Results_Sample_W3 <- rbind(Results_Sample_W3_10K,Results_Sample_W3_HALF,Results_Sample_W3_30K,Results_Sample_W3_MAR)
Results_Sample_W4 <- rbind(Results_Sample_W4_10K,Results_Sample_W4_HALF,Results_Sample_W4_30K,Results_Sample_W4_MAR)

#Label Waves
Results_Sample_W1$WAVE = "W1"
Results_Sample_W2$WAVE = "W2"
Results_Sample_W3$WAVE = "W3"
Results_Sample_W4$WAVE = "W4"

#Rollup Waves to one file. 
Results_Sample <- rbind(Results_Sample_W1, Results_Sample_W2, Results_Sample_W3, Results_Sample_W4)

#Export File, Export Variables. 
assign("Results_Sample",Results_Sample, envir = .GlobalEnv)

varlist <- c("W1_10K_Mean_Norm","W1_HALF_Mean_Norm","W1_30K_Mean_Norm","W1_MAR_Mean_Norm","W1_10K_SD_Norm","W1_HALF_SD_Norm","W1_30K_SD_Norm","W1_MAR_SD_Norm","W2_10K_Mean_Norm","W2_HALF_Mean_Norm","W2_30K_Mean_Norm","W2_MAR_Mean_Norm","W2_10K_SD_Norm","W2_HALF_SD_Norm","W2_30K_SD_Norm","W2_MAR_SD_Norm","W3_10K_Mean_Norm","W3_HALF_Mean_Norm","W3_30K_Mean_Norm","W3_MAR_Mean_Norm","W3_10K_SD_Norm","W3_HALF_SD_Norm","W3_30K_SD_Norm","W3_MAR_SD_Norm","W4_10K_Mean_Norm","W4_HALF_Mean_Norm","W4_30K_Mean_Norm","W4_MAR_Mean_Norm","W4_10K_SD_Norm","W4_HALF_SD_Norm","W4_30K_SD_Norm","W4_MAR_SD_Norm")

for(i in 1:length(varlist)){
  assign(varlist[i],get(varlist[i]), envir = .GlobalEnv)
}
}

With the above function built, let’s simulate with a delta of 1 (truly normal for all splits and waves, with means and SDs matching the data from each sub-group).
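The call itself is short (assuming the data frame passed in carries the wave label in a column named WAVE, as the function expects):

#A delta of 1 for every wave: simulated times share the mean and SD of the
#actual results within each wave.
Simulate_Race_Norm(Clean_Results3, 1, 1, 1, 1)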

Plotting various distances and waves. Each row corresponds to a wave (1-4) and each column to a distance (10K, half marathon ~21K, 30K, full marathon ~42K):

While not perfect, these distributions are a decent starting point. Through trial and error I found that the ideal SD delta decreases with wave: Waves 1 and 2 were best fit with an SD of 80% of our sample’s, while Waves 3 and 4 will use the same SD and mean as our sample.

As we see, the density at the mean for the first 2 waves is still a little low and the sample data has a skew to the right.

It’s worth noting that the above plots have free X and Y axes. Now is a good time to demonstrate how both the mean and SD decrease as the wave number increases, and to get a sense of the degree to which our simulation matches the actual data. Here are the above plots on the same axes; each wave is labeled, with all 4 distributions on the same graph.

Again, we can see how the 10K distributions for all waves are slightly more congested at the mean, as are all splits for Waves 1 and 2. Additionally, there is a slight shift in Waves 1 and 2’s distributions compared to normal. With that in mind, they are fairly close for the size of the sample and sufficient for our model.

Visualize Actual Race

Before we can compare our simulated race to the actual results, we need to visualize the marathon in a way that demonstrates traffic on course. The function below, Actual_Performance_Viz, takes a given column containing start times for each runner from our “Clean Results” data frame (we will later nest this in a function to apply alternate start times) and visualizes runner traffic at the start, 10K, half, 30K, and finish line.

To visualize the actual race, we will use the column we created with the actual start times; in future iterations, however, we can nest this in a function that generates new starting times depending on our use case.

#Filter out elite runners 
Clean_Results3 <- subset(Clean_Results3, `bib num` > 999)

Actual_Performance_Viz <- function(col, output.df.name, title){
  
#Generates a visualization of traffic on course using actual race performances; saves results as a DF.

  #Inputs:
    #col - column in the dataframe containing start times 
    #output.df.name - name under which the results DF is saved in the global environment
    #title - title of the output plots

#Find gun-time splits, round
Clean_Results3$`10KGun` <- floor(Clean_Results3$`10K`/60+ Clean_Results3[,col])
Clean_Results3$HalfGun <- floor(Clean_Results3$HALF/60+ Clean_Results3[,col])
Clean_Results3$`30KGun` <- floor(Clean_Results3$`30K`/60+ Clean_Results3[,col])
Clean_Results3$MarGun <- floor(Clean_Results3$MAR/60+ Clean_Results3[,col])

#Restructure Data
Clean_Results4 <- data.frame(Clean_Results3[,col],Clean_Results3$`10KGun`,Clean_Results3$HalfGun,Clean_Results3$`30KGun`,Clean_Results3$MarGun,Clean_Results3$CorWrap)

names(Clean_Results4) <- c("Start","10K","Half","30K","Mar","W_C")

#Pivot Data
Clean_Results5 <- pivot_longer(Clean_Results4,cols=c("Start","10K","Half","30K","Mar"),names_to = "Distance",values_to = "Split") %>%
  arrange(desc(Split))

#Save
assign(output.df.name, Clean_Results5, envir = .GlobalEnv)

Clean_Results5$Distance <- factor(Clean_Results5$Distance, levels = c("Start","10K", "Half", "30K", "Mar"))

#Plot
plot1 <- 
  ggplot(subset(Clean_Results5, Distance == "Start"), aes(x=Split, fill=W_C)) + 
  geom_histogram(alpha=.7, binwidth = 1) + theme(legend.position = "none") + geom_density(alpha=.5) + facet_wrap(~Distance, ncol = 1,scales = "free_x")+ ggtitle(title)

plot2 <-
ggplot(subset(Clean_Results5, Distance != "Start"), aes(x=Split, fill=W_C)) + 
  geom_histogram(alpha=.7, binwidth = 1) + theme(legend.position = "none") + geom_density(alpha=.5) + facet_wrap(~Distance, ncol = 1,scales = "free_x")+ ggtitle(title)

return(list(plot1,plot2))

}
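A call for the actual 2019 race might look like this (the start-time column and the output data frame name are assumptions, chosen to line up with the cleaned data above and the comparison step used later):

#Gun_Start_Round holds each runner's actual start time rounded to the minute;
#the resulting traffic data frame is saved as 2019_Actual_Df for later comparison.
Actual_Performance_Viz("Gun_Start_Round", "2019_Actual_Df", "2019 NYC Marathon - Actual")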
## [[1]]

## 
## [[2]]

Visualize Simulated Race

Our next step is to write a function to simulate, then visualize the marathon with our distributions. While we have already simulated individual performances (when measuring normality), we have yet to account for starting times and have not visualized congestion. As a result, we will use some of our previously defined variables in a new simulation that illustrates congestion as a function of starting time per wave and corral.

Corral Size

While we know that the corrals hold approximately 700 runners each, as we will be comparing simulated vs actual congestion, we want to be as accurate as possible when setting corral sizes. Let’s create a variable for each wave/corral pairing and use these in our later simulations.
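A sketch of how these size variables might be created (the W_C roll-up, pasted together from the Wave and Cor columns, and the "_Size" naming are assumptions chosen to match the simulation functions below):

library(dplyr)

#Count runners in each of the 24 wave/corral groups (ignoring start-line color)
#and save each count as a variable named like "W1C1_Size".
corral_sizes <- Clean_Results3 %>%
  mutate(W_C = paste0(Wave, Cor)) %>%
  group_by(W_C) %>%
  summarise(Count = n())

for(i in seq_len(nrow(corral_sizes))){
  assign(paste0(corral_sizes$W_C[i], "_Size"), corral_sizes$Count[i], envir = .GlobalEnv)
}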

Now is a good time to visualize the wave and corral sizes to ensure they are relatively even:

## # A tibble: 24 x 2
##    W_C   Count
##    <chr> <int>
##  1 W1C1   2170
##  2 W1C2   2204
##  3 W1C3   2240
##  4 W1C4   2232
##  5 W1C5   2330
##  6 W1C6   2385
##  7 W2C1   2165
##  8 W2C2   2199
##  9 W2C3   2232
## 10 W2C4   1774
## # ... with 14 more rows
## # A tibble: 72 x 2
##    W_C_L Count
##    <chr> <int>
##  1 W1C1B   725
##  2 W1C1G   725
##  3 W1C1O   720
##  4 W1C2B   742
##  5 W1C2G   715
##  6 W1C2O   747
##  7 W1C3B   700
##  8 W1C3G   730
##  9 W1C3O   810
## 10 W1C4B   730
## # ... with 62 more rows

Starting Line Capacity

Finally, we should take a look at our sample of runners and see at what rate they cross the starting line. In the block below, we will see how many runners cross the starting line in a given minute.
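A sketch of this count (assuming the Gun_Start_Round column from the Clean Data section, which rounds each runner’s start-line crossing down to the minute, and that dplyr is already loaded as elsewhere in this document):

#Runners crossing each of the 3 start lines (Col) in each minute of the start
#window, sorted so the busiest minutes appear first.
Clean_Results3 %>%
  group_by(Wave, Col, Gun_Start_Round) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))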

## # A tibble: 387 x 4
## # Groups:   Wave, Col [13]
##    Wave  Col   Gun_Start_Round Count
##    <chr> <chr>           <dbl> <int>
##  1 W1    O                   0  1209
##  2 W2    G                  30   886
##  3 W1    B                   0   884
##  4 W1    G                   0   882
##  5 W1    O                   1   864
##  6 W1    B                   1   860
##  7 W2    O                  30   848
##  8 W1    G                   1   831
##  9 W3    O                  55   800
## 10 W1    G                   2   789
## # ... with 377 more rows

We see above that there are plenty of examples of more than an average corral’s worth of runners crossing the starting line in a one-minute interval from each of the 3 starting lines. We can confidently build a model which allows ~1 minute for each corral.

Build Visualization Functions

With all of the logistics in place, we will start with a function that simulates congestion for a given split, and then nest this in another function for each of our relevant splits. With this information all in one Data Frame we can visualize via a ggplot and facet_wrap.

Our function for one split, Simulate_Race_Viz:

Simulate_Race_Viz <- function(Dist.List, Split.Mean.List, Split.SD.List, Distance,title){
#Charts distributions of times at a given split (split being distance on course). 
  #Inputs:
    #Dist.List: List of times each of the 24 corrals starts. 
    #Split.Mean.List: List (4) of Means for the distribution of times for each wave at the given split. 
    #Split.SD.List: List (4) of SDs for the distribution of times for each wave at the given split.   
  
  
df.list <- c("df.1","df.2","df.3","df.4","df.5","df.6","df.7","df.8","df.9","df.10","df.11","df.12","df.13","df.14","df.15","df.16","df.17","df.18","df.19","df.20","df.21","df.22","df.23","df.24")

wave.list <-c("W1C1","W1C2","W1C3","W1C4","W1C5","W1C6","W2C1","W2C2","W2C3","W2C4","W2C5","W2C6","W3C1","W3C2","W3C3","W3C4","W3C5","W3C6","W4C1","W4C2","W4C3","W4C4","W4C5","W4C6")

Split.Mean.W <- rep(Split.Mean.List, each = 6)

Split.SD.W <- rep(Split.SD.List, each = 6)

for(i in 1:24){  
  assign(df.list[i], data.frame(rnorm(eval(as.name(paste(wave.list[i],"_Size",sep=""))),mean = Split.Mean.W[i]/60, sd=Split.SD.W[i]/60)+Dist.List[i], wave.list[i]))
}
       
df.all <- data.frame()
df.all <- get(df.list[1])
names(df.all) <- c("Split","W_C")
x <- data.frame()
        
for(i in 2:24){
  x <- get(df.list[i])
  names(x)<-c("Split","W_C")
  df.all <- rbind(df.all,x)  
  
}

df.all$Distance <- Distance

assign("df.test",df.all, envir = .GlobalEnv)

#While our later Viz will use counts instead of density, we will use density in this Viz so we can see waves more easily when viewing one split. 

ggplot(df.all, aes(y = ..density.., x=Split, fill=W_C)) + 
  geom_histogram(alpha=.7, bins = 100) + theme(legend.position = "none") + geom_density(alpha=.5) + ggtitle(title)
}

The above function takes a list of means and a list of standard deviations. Each list consists of a mean or SD for each of the 4 waves. It makes sense to now store these as variables so we can better use the above function. As a quick reminder, the variables we are inserting into these lists have already been scaled as planned above (80% of the SD for Waves 1 and 2).
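These lists can be assembled from the variables exported by Simulate_Race_Norm; a sketch (the list names match what Simulate_Race_Viz_All expects below):

#One mean and one SD per wave, for each distance measured on course.
TenK_Mean    <- c(W1_10K_Mean_Norm,  W2_10K_Mean_Norm,  W3_10K_Mean_Norm,  W4_10K_Mean_Norm)
TenK_SD      <- c(W1_10K_SD_Norm,    W2_10K_SD_Norm,    W3_10K_SD_Norm,    W4_10K_SD_Norm)
Half_Mean    <- c(W1_HALF_Mean_Norm, W2_HALF_Mean_Norm, W3_HALF_Mean_Norm, W4_HALF_Mean_Norm)
Half_SD      <- c(W1_HALF_SD_Norm,   W2_HALF_SD_Norm,   W3_HALF_SD_Norm,   W4_HALF_SD_Norm)
ThirtyK_Mean <- c(W1_30K_Mean_Norm,  W2_30K_Mean_Norm,  W3_30K_Mean_Norm,  W4_30K_Mean_Norm)
ThirtyK_SD   <- c(W1_30K_SD_Norm,    W2_30K_SD_Norm,    W3_30K_SD_Norm,    W4_30K_SD_Norm)
Mar_Mean     <- c(W1_MAR_Mean_Norm,  W2_MAR_Mean_Norm,  W3_MAR_Mean_Norm,  W4_MAR_Mean_Norm)
Mar_SD       <- c(W1_MAR_SD_Norm,    W2_MAR_SD_Norm,    W3_MAR_SD_Norm,    W4_MAR_SD_Norm)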

Testing Simulate_Race_Viz for the 10K split and the start times from the 2019 Marathon:

This is largely what we would expect to see. The waves are still mostly isolated at only 6 miles into the race.

With this tested, we can now build a larger function titled Simulate_Race_Viz_All which calls our Simulate_Race_Viz function once for each of 4 points on course. Since we will only be using this for simulating the full race, we can “hard code” the above variables into the bigger function, and keep the start times (Dist.List), output file name and title as our only arguments.

Simulate_Race_Viz_All <- function(Dist.List,title, Output_Df_Name){
#Charts distributions of times at all splits (split being distance on course) - Start, 10K, half, 30K and the full race. 
  #Inputs:
    #Dist.List: List of times each of the 24 corrals starts. 
    #Title: Title of the ggplot output
    #Output_Df_Name: Name for the DF during export.    
  
  
df.list <- c("df.1","df.2","df.3","df.4","df.5","df.6","df.7","df.8","df.9","df.10","df.11","df.12","df.13","df.14","df.15","df.16","df.17","df.18","df.19","df.20","df.21","df.22","df.23","df.24")

wave.list <-c("W1C1","W1C2","W1C3","W1C4","W1C5","W1C6","W2C1","W2C2","W2C3","W2C4","W2C5","W2C6","W3C1","W3C2","W3C3","W3C4","W3C5","W3C6","W4C1","W4C2","W4C3","W4C4","W4C5","W4C6")

distance.list <- c("Start","10K","HALF","30K","MAR")

for(i in 1:24){  
  assign(df.list[i], data.frame(rnorm(eval(as.name(paste(wave.list[i],"_Size",sep=""))),mean = 0, sd= .3)+Dist.List[i], wave.list[i],distance.list[1]))
}
       
df.all <- data.frame()
df.all <- get(df.list[1])
names(df.all) <- c("Split","W_C","Distance")
x <- data.frame()
        
for(i in 2:24){
  x <- get(df.list[i])
  names(x)<-c("Split","W_C","Distance")
  df.all <- rbind(df.all,x)  
  
}

Simulate_Race_Viz(Dist.List,TenK_Mean,TenK_SD,"10K",title)
assign("df.10k",df.test)
df.all <- rbind(df.all,df.test)

Simulate_Race_Viz(Dist.List,Half_Mean,Half_SD,"Half",title)
df.all <- rbind(df.all,df.test)

Simulate_Race_Viz(Dist.List,ThirtyK_Mean,ThirtyK_SD,"30K",title)
df.all <- rbind(df.all,df.test)

Simulate_Race_Viz(Dist.List,Mar_Mean,Mar_SD,"Mar",title)
df.all <- rbind(df.all,df.test)


assign(Output_Df_Name,df.all, envir = .GlobalEnv)


plot1 <- 
  ggplot(subset(df.all, Distance == "Start"), aes(x=Split, fill=W_C)) + 
  geom_histogram(alpha=.7, binwidth = 1) + theme(legend.position = "none") + geom_density(alpha=.5) + facet_wrap(~Distance, ncol = 1,scales = "free_x")+ ggtitle(title)

plot2 <-
ggplot(subset(df.all, Distance != "Start"), aes(x=Split, fill=W_C)) + 
  geom_histogram(alpha=.7, binwidth = 1) + theme(legend.position = "none") + geom_density(alpha=.5) + facet_wrap(~Distance, ncol = 1,scales = "free_x")+ ggtitle(title)

return(list(plot1,plot2))
}

We can test with a simulation of the 2019 Marathon, using estimated start times and the sample Mean and SD per wave and distance.
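A sketch of that call (the start-time values below are placeholders for illustration, roughly matching the per-wave start minutes seen in the starting line data above, not the official 2019 schedule):

#Placeholder start times (minutes) for the 24 corrals: corrals within a wave
#share that wave's approximate start time.
Est_Start_2019 <- c(rep(0, 6), rep(30, 6), rep(55, 6), rep(80, 6))

Simulate_Race_Viz_All(Est_Start_2019, "Simulated 2019 NYC Marathon", "2019_Sim_Df")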

## [[1]]

## 
## [[2]]

While the plot we built for a single split above was a density histogram with the individual waves illustrated, for the purposes of this plot it makes more sense to use a frequency histogram. We can see how runner congestion decreases as the race progresses.

Validate Model

Now that we have functions to simulate and visualize both our simulated and the actual race, we can validate our model. A valid model will have our simulated race appear very similar to the actual race. While we do have some limitations - our distributions are only approximations of the actual race - if the simulation is mostly similar to the actual race, we can use our model to optimize starting times for minimal runner traffic.

In each of our visualization functions, Simulate_Race_Viz_All and Actual_Performance_Viz, we export our results as a df to the Global Environment under a variable name of our choosing. We can build a function titled Compare_Actual_Performances to use these data frames and compare these findings in the same plot.

Now we can see if our model is valid:

With our histograms on the same plot, we can really see the skew in the 10K interval. Additionally, the actual data at the half marathon split has more runners at the mean and fewer at the tails. For some purposes, we might want to explore other distributions to minimize the differences between the simulation and the actual data, but for our purposes these differences are small compared to the strength of the model as a whole.

We can also compare the highest points in our simulated and actual data to get a sense of scale:

## # A tibble: 16 x 4
## # Groups:   Distance, Wave [17]
##    Distance Wave  Actual   Sim
##    <fct>    <chr>  <int> <int>
##  1 10K      W1       748   734
##  2 10K      W2       829   888
##  3 10K      W3       670   779
##  4 10K      W4       541   540
##  5 Half     W1       459   361
##  6 Half     W2       473   434
##  7 Half     W3       344   349
##  8 Half     W4       263   249
##  9 30K      W1       320   241
## 10 30K      W2       321   278
## 11 30K      W3       252   228
## 12 30K      W4       191   178
## 13 Mar      W1       205   166
## 14 Mar      W2       211   195
## 15 Mar      W3       171   159
## 16 Mar      W4       138   119
## # A tibble: 5 x 3
## # Groups:   Distance [5]
##   Distance Actual   Sim
##   <fct>     <dbl> <dbl>
## 1 Start     1203. 1366.
## 2 10K        324.  312.
## 3 Half       220.  197.
## 4 30K        175.  150.
## 5 Mar        129.  111.
## # A tibble: 10 x 5
## # Groups:   Distance [5]
##    Distance Set    `Avg_Runners/minut~ `Range_of_Runners/min~ `SD_Runners/minut~
##    <fct>    <chr>                <dbl>                  <int>              <dbl>
##  1 Start    Actual               1203.                   2966              908. 
##  2 Start    Sim                  1366.                   2398              771. 
##  3 10K      Actual                324.                    893              244. 
##  4 10K      Sim                   312.                    887              252. 
##  5 Half     Actual                220.                    531              128. 
##  6 Half     Sim                   197.                    458              133. 
##  7 30K      Actual                175.                    372               95.5
##  8 30K      Sim                   150.                    357              107. 
##  9 Mar      Actual                129.                    261               77.8
## 10 Mar      Sim                   111.                    290               86.8

Again, the model is valid for our purposes.

Optimize Starting Times

Now that we are confident that our simulation approximates the actual behavior of runners, we can write a function to measure “traffic” or “congestion” on course and describe this with a numerical value. Once we have this “traffic score”, we can minimize it and optimize start times for minimal traffic.

The function below, titled Get.Traffic.Score, takes the standard deviation and mean of our previously described distributions (unique to each distance and wave) and calculates the number of runners passing a given point on the course each minute from when the race starts until the course closes (500 minutes; roughly 8 hours after the start). By squaring this value for each minute before summing across the duration the course is open, we more heavily weight crowded times on the course and can compare different race conditions.

Though the number itself is fairly arbitrary, for a fixed number of participants, a higher value indicates a race that is “more crowded” at various times. Minimizing this value results in a more evenly distributed race.

Optimize for one split

Let’s start by building our function and then optimizing the starting times for minimal runner traffic at the 10K mark.

Get.Traffic.Score <- function(Dist.List,Split.Mean.List,Split.SD.List){
#Formula to be used in the optimization function.
#Using distributions of the times each wave takes to reach a given split (distance on course), and the amount of time each corral waits at the starting line, calculates the density of runners through the split at each minute of the race. 
#The traffic score is calculated by summing, over each minute, the square of the total runner density (calculated via dnorm) through the split at that minute. 
  
  #Inputs:
    #Dist.List: List (24) of start times, one per corral. 
    #Split.Mean.List: List (4) of Means describing the distribution of runners from each wave through a given split (race distance)
    #Split.SD.List: List (4) of SDs describing the distribution of runners from each wave through a given split. 

Split.Mean.W <- rep(Split.Mean.List, each = 6)

Split.SD.W <- rep(Split.SD.List, each = 6)
  
  traffic.score = 0
  traffic.per.wave = vector()
  
  #Range of minutes:
  for(i in 1:500){
    for(j in 1:24){    
      traffic.per.wave[j] = dnorm(i-Dist.List[[j]],mean = Split.Mean.W[j]/60,sd = Split.SD.W[j]/60)
    }
    
    traffic.score.minute = sum(traffic.per.wave)
    traffic.score = traffic.score + traffic.score.minute^2  
  
  }
  return (traffic.score)
}

Testing our formula across 10 minute intervals:

## [1] 2.178082

We can use the optim function to minimize this and find ideal corral start times. For consistency’s sake, we will set the limits to the length of time the 2019 start line was open (80 minutes).
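A sketch of the call that produces the output below (the starting values here are placeholders; supplying bounds without a method makes optim fall back to L-BFGS-B, which is the warning we see):

#Minimize 10K traffic over the 24 corral start times, constrained to the
#80 minutes the 2019 start line was open.
optim(par = seq(0, 80, length.out = 24),
      fn  = Get.Traffic.Score,
      Split.Mean.List = TenK_Mean,
      Split.SD.List   = TenK_SD,
      lower = 0, upper = 80)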

## Warning in optim(par = c(6, 12, 18, 24, 30, 36, 37, 38, 39, 40, 41, 42, : bounds
## can only be used with method L-BFGS-B (or Brent)
## $par
##  [1]  0.00000  0.00000  0.00000  0.00000 20.32344 20.31377  7.05509 19.69908
##  [9] 26.68283 31.63481 36.84658 43.05306 45.02899 44.79701 53.17218 58.52408
## [17] 69.81812 69.81583 53.73852 80.00000 80.00000 80.00000 80.00000 80.00000
## 
## $value
## [1] 4.49873
## 
## $counts
## function gradient 
##       78       78 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Quick reminder, we have split our 50k+ runners into 24 groups. These are effectively the “old” wave and corral groupings. We can imagine around 700 runners starting from each of our 3 starting lines at the same time at various intervals. The optimal starting times are listed as the par values in the optim output.

If our goal is to minimize traffic on course, then starting the first several and final several groups of runners at the same time seems counterintuitive. To illustrate what is occurring, let’s use our Simulate_Race_Viz function and compare our ideal start times to evenly distributed start times.

By “front loading” and “back loading” the field, we allow the field to spread across a longer period of time than if the groups started at even intervals. Additionally, the ideal distribution is “out of order” in some cases - starting slower groups before quicker ones - to reduce the “peaks and valleys” we see in the evenly distributed visualization.

Optimize for Entire Race (4 splits)

We can write a function called Get.Traffic.Score.All which calls our Get.Traffic.Score for each of our 4 distances across the race (10K, Half, 30K and the full distance). While each of these is equally weighted, because we are squaring the value which represents congestion on course, we naturally weight the earlier splits more heavily, as the standard deviation is lower at that point in the race and congestion is higher.
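A minimal sketch of this wrapper (the author’s version may differ; the per-wave mean and SD vectors assembled earlier are assumed to be in scope):

Get.Traffic.Score.All <- function(Dist.List){
  #Sum the traffic score at each of the 4 measured points on course. Earlier
  #splits carry more weight naturally: their smaller SDs concentrate runners
  #into fewer minutes, which the squaring in Get.Traffic.Score amplifies.
  Get.Traffic.Score(Dist.List, TenK_Mean,    TenK_SD) +
  Get.Traffic.Score(Dist.List, Half_Mean,    Half_SD) +
  Get.Traffic.Score(Dist.List, ThirtyK_Mean, ThirtyK_SD) +
  Get.Traffic.Score(Dist.List, Mar_Mean,     Mar_SD)
}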

Again, testing for 10 minute intervals:

## [1] 5.148431

Let’s see what an optimal set of starting times might look like:

## Warning in optim(par = c(6, 12, 18, 24, 30, 36, 37, 38, 39, 40, 41, 42, : bounds
## can only be used with method L-BFGS-B (or Brent)
## $par
##  [1]  0.000000  0.000000  0.000000  0.000000  0.000000 16.787336  9.622321
##  [8] 18.246657 24.734903 31.445129 37.290091 45.261883 43.652083 52.447550
## [15] 52.439336 62.078056 68.627791 68.631991 80.000000 80.000000 80.000000
## [22] 80.000000 80.000000 80.000000
## 
## $value
## [1] 12.26042
## 
## $counts
## function gradient 
##       51       51 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Again we find that it is “optimal” to have multiple groups starting at the same time. When thinking about the use case of this information, it still seems illogical to have this much congestion in the first few miles of the race, even if it yields a more even flow of runners later on. As we reflect on our ultimate goal, one oversight comes into focus: we need to account for congestion at the starting line in addition to the points on course.

Build Starting Line Model

While we used distributions unique to each wave to model the volume of runners at each split on course, the starting line does not need to be as exact. That being said, we will need to get creative to have the optimization function work as intended. Ideally, we would use a uniform distribution to capture the start; however, the optim function does not behave well with discrete variables and will return an error if it does not see incremental changes in our objective value for incremental changes in the parameter variables. As a result, we will use a normal distribution for starting corrals even though it does not reflect runner behavior.

To best capture the full minute it takes each sub-group to cross the start line, we would expect to use an SD of ~0.3. Such a low SD, however, has extremely small values at its tails - as our traffic score function only takes readings at each minute interval, the optim function would likely return start times slightly higher or lower than whole numbers and receive an artificially small traffic score in response. An SD of 1 will create a distribution that yields better results in our optim function; however, it also returns comparatively lower values from dnorm than we would expect to see. We can fix this by changing our weighting.

Now is a good time to note that at the NYC Marathon, the 3 starting lines run separate courses until mile 3. As a result, we would normally divide the density by 3 before squaring so it is weighted correctly when compared to later points on course where the 3 start lines are joined. As we need to increase the weighting in this case to compensate for the larger SD, we will divide by 2 instead.

Finally, instead of the “correct” 0, we will use a normal distribution with a mean of 25 to model starting line traffic. Our reason for selecting a mean of 25 is that there is then no chance any volume of runners passes through the starting line “before minute 0,” which would otherwise result in the optimization function prioritizing lower starting times.
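A sketch of what this start-line scoring term might look like, following the parameters just described (the function name is an assumption; the author’s version may fold this directly into the full-race score):

Get.Traffic.Score.Start <- function(Dist.List){
  #Start-line traffic: each corral crosses as a narrow normal "pulse"
  #centered 25 minutes after its assigned start time (SD = 1 minute).
  traffic.score = 0
  for(i in 1:500){
    traffic.per.corral = vector()
    for(j in 1:24){
      traffic.per.corral[j] = dnorm(i - Dist.List[[j]], mean = 25, sd = 1)
    }
    #Divide by 2 (not 3) to offset the inflated SD while still reflecting
    #that the 3 start lines run separate courses until mile 3.
    traffic.score = traffic.score + (sum(traffic.per.corral) / 2)^2
  }
  return(traffic.score)
}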

Let’s run a test and optimize the corral start times to minimize traffic at the starting line alone. We would expect to see the groups relatively evenly distributed.

## Warning in optim(par = c(0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, :
## bounds can only be used with method L-BFGS-B (or Brent)
## $par
##  [1]  0.000000  3.478329  6.956411 10.435164 13.912799 17.391604 20.868860
##  [8] 24.347920 27.825110 31.304553 34.781719 38.261246 41.738209 45.217655
## [15] 48.694612 52.173987 55.651201 59.130477 62.607980 66.087138 69.564915
## [22] 73.043757 76.521785 80.000000
## 
## $value
## [1] 1.850186
## 
## $counts
## function gradient 
##       59       59 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

The above values are virtually exactly what we would expect.

Simulate Entire Race (4 splits + start)

With this working correctly, we can adjust our function and optimize traffic for the full race:

## Warning in optim(par = c(3.3, 6.6, 10, 13.3, 16.6, 20, 23.3, 26.6, 30, 33.3, :
## bounds can only be used with method L-BFGS-B (or Brent)
## $par
##  [1]  0.000000  0.000000  3.056025  5.977129  9.280706 12.704480 16.100396
##  [8] 20.073632 25.421721 31.277604 36.881556 41.063893 44.632091 48.395890
## [15] 52.599477 57.372850 61.644240 65.371054 68.764486 71.889771 74.711698
## [22] 77.169604 80.000000 80.000000
## 
## $value
## [1] 14.97431
## 
## $counts
## function gradient 
##       36       36 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Verify Model

With an optimal set of starting times, we can nest our Compare_Actual_Performances function in a new function called Alternate_Start_Viz, which calculates new splits based on the actual performances of our 2019 runners as if they had run the exact same race with a new starting time, then provides a visual comparing the race traffic. We will also have it return some statistics describing race traffic under either starting scheme.

As we will only be comparing 2019 actual performances to alternate 2019 performances with different starting times, we can hard-code a few of our variables.

Alternate_Start_Viz <- function(optim_results_name, label_name, final_plot_title){
  #Simulates the 2019 race as if our actual runners ran an identical race with different starting times. 
  #Inputs:
    #optim_results_name - optim output whose results determine the new starting times.
    #label_name - label for the plot; also used in naming the DF. 
    #final_plot_title - title of the final visualization. 

W_C_List <- c("W1C1","W1C2","W1C3","W1C4","W1C5","W1C6","W2C1","W2C2","W2C3","W2C4","W2C5","W2C6","W3C1","W3C2","W3C3","W3C4","W3C5","W3C6","W4C1","W4C2","W4C3","W4C4","W4C5","W4C6")

#Remove an existing "Start_Time_New" field should it exist. 
Clean_Results3 <- Clean_Results3[,!colnames(Clean_Results3) == "Start_Time_New"]

#Pair start times with corral names - inner join with results DF
Start_DF <- data.frame(W_C_List, unlist(optim_results_name$par))
names(Start_DF) <- c("W_C","Start_Time_New")
Clean_Results3 <- inner_join(Clean_Results3,Start_DF)

#Save as global variable. 
assign("Clean_Results3",Clean_Results3,.GlobalEnv)

#Run simulation - the visualization will not be generated here
Actual_Performance_Viz("Start_Time_New", paste(label_name,"_resultsDF", sep=""), "title")

#Save as results variable 
assign(paste(label_name,"_resultsDF2",sep=""),get(paste(label_name,"_resultsDF",sep="")),envir = .GlobalEnv)

#Generate visual comparing race traffic
Item1 = Compare_Actual_Performances(get(paste(label_name,"_resultsDF", sep="")),`2019_Actual_Df`,label_name,"Actual",final_plot_title) 

#Assign results DF to its own variable for ease of use 
new_data = get(paste(label_name,"_resultsDF", sep=""))

#Generate Statistics: 
new_data[,"Set"]<- label_name
`2019_Actual_Df`[,"Set"]<-"2019 Actual"

Both_Df <- rbind(new_data,`2019_Actual_Df`)

Both_Df$Distance <- factor(Both_Df$Distance, levels = c("Start","10K","Half","30K","Mar"))

Item2 = subset(Both_Df,Distance!="Start") %>%
  group_by(Split = floor(Split),Distance,Set) %>%
  summarise(Count = n())  %>%
  group_by(Distance,Set) %>%
  summarise(Avg_Traffic = mean(Count),Max_Traffic = max(Count), SD_Traffic = sd(Count)) %>%
  arrange(Distance,Set)

return(list(Item1,Item2))

}

Actual Start Times vs Optimal Start Times

With the function built, let’s compare optimal vs actual start times.
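A sketch of the call (Optim_80, the name of the saved optim result whose $par values hold the 80-minute optimal start times, is an assumption; the title is a placeholder):

#Compare the 2019 race against the optimal start times found over an 80-minute window.
Alternate_Start_Viz(Optim_80, "80 Minute Start, Optimal", "Optimal vs Actual Start Times, 80 Minute Window")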

## Joining, by = "W_C"
## Warning: Column `W_C` joining character vector and factor, coercing into
## character vector
## [[1]]

## 
## [[2]]
## # A tibble: 8 x 5
## # Groups:   Distance [4]
##   Distance Set                      Avg_Traffic Max_Traffic SD_Traffic
##   <fct>    <chr>                          <dbl>       <int>      <dbl>
## 1 10K      2019 Actual                     324.         894      244. 
## 2 10K      80 Minute Start, Optimal        328.         567      162. 
## 3 Half     2019 Actual                     220.         532      128. 
## 4 Half     80 Minute Start, Optimal        219.         375      125. 
## 5 30K      2019 Actual                     175.         373       95.5
## 6 30K      80 Minute Start, Optimal        170.         315      102. 
## 7 Mar      2019 Actual                     129.         262       77.8
## 8 Mar      80 Minute Start, Optimal        126.         259       82.4

… vs Optimal Start Times over 2 hours.

Since we have a function built, we can easily compare other starting schemes as well. Let’s optimize over a 2 hour period (vs the 80 minutes the 2019 race utilized).

## Warning in optim(par = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, : bounds
## can only be used with method L-BFGS-B (or Brent)
## Joining, by = "W_C"
## Warning: Column `W_C` joining character vector and factor, coercing into
## character vector
## [[1]]

## 
## [[2]]
## # A tibble: 8 x 5
## # Groups:   Distance [4]
##   Distance Set                Avg_Traffic Max_Traffic SD_Traffic
##   <fct>    <chr>                    <dbl>       <int>      <dbl>
## 1 10K      2019 Actual               324.         894      244. 
## 2 10K      2hr Start, Optimal        262.         457      112. 
## 3 Half     2019 Actual               220.         532      128. 
## 4 Half     2hr Start, Optimal        187.         294       92.0
## 5 30K      2019 Actual               175.         373       95.5
## 6 30K      2hr Start, Optimal        150.         253       79.8
## 7 Mar      2019 Actual               129.         262       77.8
## 8 Mar      2hr Start, Optimal        116.         227       68.2

… vs Optimal Start Times over 3 hours.

Finally, let’s see how race traffic is impacted when the starting line is open for 3 hours (more than twice as long as it was in 2019).

## Warning in optim(par = c(7.5, 15, 22.5, 30, 37.5, 45, 52.5, 60, 67.5, 75, :
## bounds can only be used with method L-BFGS-B (or Brent)
## Joining, by = "W_C"
## Warning: Column `W_C` joining character vector and factor, coercing into
## character vector
## [[1]]

## 
## [[2]]
## # A tibble: 8 x 5
## # Groups:   Distance [4]
##   Distance Set                Avg_Traffic Max_Traffic SD_Traffic
##   <fct>    <chr>                    <dbl>       <int>      <dbl>
## 1 10K      2019 Actual               324.         894      244. 
## 2 10K      3hr Start, Optimal        200.         372       81.3
## 3 Half     2019 Actual               220.         532      128. 
## 4 Half     3hr Start, Optimal        154.         268       66.1
## 5 30K      2019 Actual               175.         373       95.5
## 6 30K      3hr Start, Optimal        128.         214       59.8
## 7 Mar      2019 Actual               129.         262       77.8
## 8 Mar      3hr Start, Optimal        103.         200       54.1

Closing Thoughts

Findings:

  • Increasing the length of time the course is open provides marginal gains.
  • Optimizing starting times yields considerable benefits.

Final notes:

  • This model is specific to the 2019 marathon; I would want to measure distributions of times from waves across multiple years to assess the reliability of the model more broadly.
  • The ideal starting times do not account for race management challenges – times are not rounded to the nearest minute and do not account for runner staging.
  • The model ignores runners who dropped out of the race and did not finish.
  • Finally – I would be interested in exploring this model with better fidelity. Corral is almost certainly predictive, but I could not identify corrals from the available data.