Introduction

The Wharf to Wharf race happens every July in Santa Cruz. The mostly flat, 6-mile course hugs the coastline between Santa Cruz and Capitola. The race has grown in popularity since its beginning in 1973, with more than 13,000 registered participants in 2015.

Race results are available online. You can look up runners’ results by bib number, name, city and state. This is ok, but it’s not very handy for anyone wanting to do real data analysis.

I’ve written some code to scrape the data from the results page, clean it up, and plot some interesting things. The code is available on github, in case you want to improve it, <smiley emoticon>.

Summary of results

Data Processing

Library loading, routine definitions

Load a few required libraries, and source a utility file containing helper functions:

library(dplyr)
library(ggplot2)
library(gridExtra)
source("w2w_utils.R")

Loading the data

On the results site, you can grab results with 10, 25, 50 or 100 entries per page. With a bit of snooping I found the url query string, which I can use in a loop to grab all of the data. (I’m caching the data in a local file to avoid burdening the results page too much during debugging and exploration.)

allData <- getData(2015)

Exploring the data

With the data in hand, we can start exploring.

Basic quantities

Number of runners:

nrow(allData)
## [1] 13715

Age distribution (histogram):

qplot(allData$age, ylab = "", xlab = "age")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

So, it’s not exactly a young person’s race… the average age is 40.4110098.

Gender distribution (histogram):

qplot(allData$sex, ylab = "", xlab = "sex")

There are significantly more female paticipants than male (ratio F:M is 1.663624:1).

Geographical distribution - and some data cleaning:

Each participant’s city, state and country is recorded. Where are participants from?

table(allData$country)
## 
##         CAN   ETH   JPN   KEN   UMI   USA 
##     3     2     2     5     4     2 13697

Almost everyone is from the USA (not surprising). Besides 3 blank records, we have participants from the USA, Canada, Ethiopia, Japan, Kenya, and something called “UMI”. What’s “UMI”?

allData[allData$country == "UMI", c("firstname", "lastname", "bib", "city", "state", "country")]
##      firstname lastname   bib      city state country
## 2764      LEAH FRIEDMAN 16109    OXNARD    CA     UMI
## 9208   CYNTHIA    ROTAN  7832 RIVERBANK    CA     UMI

Looks like a typo. Might as well fix that.

allData[allData$country == "UMI", c("country")] <- c("USA")

Three records had no country assignment:

allData[allData$country == "", c("firstname", "lastname", "bib", "city", "state", "country")]
##    firstname lastname bib    city state country
## 2   SHADRACK   KOSGEI   2   KENYA              
## 4     NELSON    OYUGI   3   KENYA              
## 18      HAIS   WELDAY   7 ERITREA

It looks like they just mixed up the country and city assignment (and left state and country blank).

allData[allData$country == "", c("country")] <- allData[allData$country == "", c("city")]
allData[grepl("^KEN", allData$country), c("country")] <- c("KEN")
allData[grepl("^ERI", allData$country), c("country")] <- c("ERI")

… and the final, cleaned-up country distribution is:

table(allData$country)
## 
##   CAN   ERI   ETH   JPN   KEN   USA 
##     2     1     2     5     6 13699

Of entrants from the USA, most are from California:

table(allData[allData$country == "USA", c("state")])
## 
##          AK    AZ    CA    CO    CT    DC    DE    FL    GA    ID    IL 
##    11     2    17 13441    24     2     4     2     1    11    11     6 
##    IN    KS    MA    MD    MI    MN    MO    MT    NC    ND    NE    NJ 
##     2     2    10     3     1     3     4     5     1     3     1     4 
##    NM    NV    NY    OH    OK    OR    PA    TX    UT    VA    WA 
##     4    45     8     1     1    24     2    11     4     4    24

Outliers

There are a few runners who are surprisingly young:

allData[allData$age < 5, c("age", "firstname", "lastname", "country", "sex", "elapsedTime")]
##       age firstname     lastname country sex elapsedTime
## 1697    0    SOPHIA        DAVIS     USA   F  0:52:28.45
## 1870    0     BRIAN      DEDIEGO     USA   M  0:53:17.71
## 2153    0      DICK    PATTERSON     USA   M  0:54:28.96
## 2749    0      JILL        DAVIS     USA   F  0:56:58.52
## 2972    0     LARRY   DEGHETALDI     USA   M  0:57:49.79
## 4200    0      GARY    MCCONNELL     USA   M  1:02:02.33
## 4479    1     LAURA         DUNN     USA   F  1:03:02.96
## 5476    1      FINN     CALDEIRA     USA   M  1:06:24.84
## 5611    0   JEFFERY     KRUSKALL     USA   M  1:06:52.92
## 7433    0     MOLLY       ORDING     USA   F  1:14:10.16
## 7565    0   SUZANNE GOLDEN-RILEY     USA   F  1:14:43.42
## 8837    0    LYNELL        HANCK     USA   F   1:20:58.4
## 8838    0   PATRICK        HANCK     USA   M  1:20:58.59
## 8864    4  IZABELLA  GIERAT-KATZ     USA   F  1:21:07.29
## 8968    0     KELLI   VOLKSWAGEN     USA   F   1:21:41.5
## 8980    0     KELLI   VOLKSWAGEN     USA   F  1:21:45.08
## 9672    2     JORGE        RIVAS     USA   M  1:26:00.99
## 9915    3      JOSH       PARKER     USA   M  1:27:44.74
## 11040   3     KATIE       CAHILL     USA   F  1:35:44.82
## 11100   2     KATIE     PETERSEN     USA   F  1:36:17.42
## 11922   4     QUINN      SCHROMM     USA   F  1:44:14.74
## 13336   4     AIDAN       TORRES     USA   M   2:01:15.6

Since they represent such a small percentage of the total data, I suppose it’s ok just to drop them.

allData <- dplyr::filter(allData, age >= 5)

The last-place runner, Jaynee Caruso, finished after 3 hours, 30 minutes - more than an hour after the previous runner! Also, notice that Jaynee started the race at 6:00AM - something’s very odd there. I’m just dropping that record.

allData[allData$elapsed > 2.49 * 3600 * 1000, c("overall", "firstname", "lastname", "startTime", "elapsedTime")]
##       overall firstname  lastname  startTime elapsedTime
## 13692   13714     BRUCE HARTSOUGH 8:31:29.76  2:29:58.33
## 13693   13715    JAYNEE    CARUSO    6:00:00  3:30:41.13
allData <- dplyr::filter(allData, elapsed < 2.75 * 3600 * 1000)

Last cleanup step: a bunch of runners started before the official start time (8:30AM). I don’t see any obvious pattern for these runners; because their start times cause trouble with my corral start analysis (below), I’m dropping them all.

allData <- dplyr::filter(allData, start > 30090000)

Plotting!

Here’s a plot of runner age vs. race time. To see if there’s a correlation between race time and gender, I’ve colored each runner according to gender. (It seems pretty clear that males have an advantage.) I’ve marked a couple of Ferruccis with black dots, so we can see where we fit in the overall picture. (If you want to mark your own data in black, just provide a file called “friends_priv.R” which implements a function, “getFriends()”, which delivers the subset of data that you’re interested in.)

if (file.exists("friends_priv.R")) {
  source("friends_priv.R")
  friends <- getFriends(allData)
} else {
  friends <- subset(allData,
    lastname == "FERRUCCI"
  )
}
elapsed_ticks <- seq(0, max(allData$elapsed), 900000)
ggplot(allData, aes(x = age, y = elapsed, color=sex)) +
  scale_x_continuous(breaks = seq(0, 100, 10)) + 
  scale_y_continuous(breaks = elapsed_ticks, labels = timestr(elapsed_ticks), name = "elapsed time (hh:mm:ss)") + 
  geom_point() +
  expand_limits(y = 0.375 * 3600 * 1000) +
  stat_smooth(method = "gam", formula = y ~ s(x, bs="cs")) +
  geom_point(data=friends,aes(x = age, y = elapsed, shape=lastname), color = "black")

It’s a lot of data, with a huge range of run (elapsed) times. One feature of the wharf to wharf that’s worth mentioning: lots of people join the race just to walk. To help separate runners from walkers, the participants are organized into “corrals”, with an “elite” corral starting the race first, followed by corrals numbered 1 through 4. The idea is that if you expect to run faster, you’ll be in a lower-numbered corral (or the “elite” corral), and so won’t spend too much of your time dodging around slow-moving obstacles. So - at race day, the “elite” corral starts off, and everyone else moves forward, up to the start line. Then corral 1 starts, and corrals 2-4 move forward. This continues until corral 4 starts off. Each runner’s bib has an RFID tag in it, so that the time at which the runner crosses the start and finish lines can be recorded. Official run time is the difference between the two times.

Fortunately, the race data includes the start time for each runner. A plot of that data, start time vs. elapsed time, is interesting:

start_ticks <- seq(8.5 * 3600 * 1000, max(allData$start), 0.0625 * 3600 * 1000)
ggplot(allData, aes(x = elapsed, y = start, color = sex)) + 
  scale_y_continuous(breaks = start_ticks, labels = timestr(start_ticks)) + 
  scale_x_continuous(breaks = elapsed_ticks, labels = timestr(elapsed_ticks)) +
  expand_limits(x = 0.25 * 3600 * 1000, y = 8.5 * 3600 * 1000) +
  geom_point() +
  geom_point(data=friends,aes(x = elapsed, y = start, shape=lastname), color = "black")

We can clearly see 4 distinct groups here. It looks like the “elite” corral started off around 8:31AM, with almost everyone crossing the start line at the same time. After a delay, a larger group of runners got the go-ahead at about 8:32; this group was so large that they couldn’t all cross the start line at once. Eye-witness report! We were in that group, corral 1, and because we arrived a bit late, we were near the middle of the group, as shown by the black dots.

After corral 1 was off, there was another delay, and then (conjecture) corral 2 was released at about 8:35. Continuing a trend, corral 2 is a larger group, with even more dispersion than the previous.

Finally, another pause, and the rest of the runners are started at about 8:39. This is by far the largest group, with the most dispersion. I think this data shows that corrals 3 and 4 were let go at once (that is, without a pause between corrals 3 and 4). Can any eyewitnesses corroborate the story that the data seems to be telling?

Runners only!

It would be interesting to try to separate runners from walkers. Can we do this by using the start time? I’ll try separating out the first two corrals (“elite” and corral 1), by including only participants with a start time before 8:35:30:

runners <- allData[allData$start < (8 + 35.5/60) * 3600 * 1000,]
runners_ticks <- seq(0, max(runners$elapsed), 900000)
ggplot(runners, aes(x = age, y = elapsed, color=sex)) +
  scale_x_continuous(breaks = seq(0, 100, 10)) + 
  scale_y_continuous(breaks = elapsed_ticks, labels = timestr(elapsed_ticks), name = "elapsed time (hh:mm:ss)") + 
  geom_point() +
  expand_limits(y = 0.25 * 3600 * 1000) +
  stat_smooth(method = "gam", formula = y ~ s(x, bs="cs")) +
  geom_point(data=friends,aes(x = age, y = elapsed, shape=lastname), color = "black")

There’s still quite a wide range of run/walk times there. Looking back at the start time plot, I see that some of the “elite” corral runners finished after upwards of 2 hours. That’s probably a bit of walking. This raises the question: how do you get into that elite corral?

Top 100

How about the top 100, male and female? These are the real runners in this race.

top100 <- allData[allData$oversex <= 100,]
top100_ticks <- seq(0, max(top100$elapsed), (5/60) * 3600 * 1000)
ggplot(top100, aes(x = age, y = elapsed, color=sex)) +
  scale_x_continuous(breaks = seq(0, 100, 10)) + 
  scale_y_continuous(breaks = top100_ticks, labels = timestr(top100_ticks), name = "elapsed time (hh:mm:ss)") + 
  geom_point() +
  expand_limits(y = (27/60) * 3600 * 1000) +
  stat_smooth(method = "gam", formula = y ~ s(x, bs="cs"))

There is a much stronger separation between male and female results in this group. The males tend to be on the younger side, with the oldest entrant under 60. The females continue on into their 70s - that’s pretty impressive.

Future work

  • Rather than using corrals to separate runners from walkers (actually, using start time, which is an indirect way of determining the corrals), how about using plain old run time? Looking at the data for everyone who finished in an hour or less would be simple, and might be interesting. Why an hour? Wikipedia describes “jogging” as “running at a gentle pace”, and says one definition of jogging is “slower than 6 miles per hour”. So, the one-hour threshold is arbitrary, but has some foundation in reality.

  • It would be interesting to scrape data from earlier years, and see if participants tend to be faster or slower, year by year.

References

Github link: https://github.com/aaronferrucci/wharf2wharf
Wharf to Wharf main page: http://www.wharftowharf.com/
Wharf to Wharf results page: http://www.wharftowharf.com/results
Definition of jogging: https://en.wikipedia.org/wiki/Jogging