The Wharf to Wharf race happens every July in Santa Cruz. The mostly flat, 6-mile course hugs the coastline between Santa Cruz and Capitola. The race has grown in popularity since its beginning in 1973, with more than 13,000 registered participants in 2015.
Race results are available online. You can look up runners’ results by bib number, name, city and state. This is ok, but it’s not very handy for anyone wanting to do real data analysis.
I’ve written some code to scrape the data from the results page, clean it up, and plot some interesting things. The code is available on github, in case you want to improve it, <smiley emoticon>.
Load a few required libraries, and source a utility file containing helper functions:
library(dplyr)
library(ggplot2)
library(gridExtra)
source("w2w_utils.R")
On the results site, you can grab results with 10, 25, 50 or 100 entries per page. With a bit of snooping I found the url query string, which I can use in a loop to grab all of the data. (I’m caching the data in a local file to avoid burdening the results page too much during debugging and exploration.)
allData <- getData(2015)
With the data in hand, we can start exploring.
Number of runners:
nrow(allData)
## [1] 13715
Age distribution (histogram):
qplot(allData$age, ylab = "", xlab = "age")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
So, it’s not exactly a young person’s race… the average age is 40.4110098.
Gender distribution (histogram):
qplot(allData$sex, ylab = "", xlab = "sex")
There are significantly more female paticipants than male (ratio F:M is 1.663624:1).
Each participant’s city, state and country is recorded. Where are participants from?
table(allData$country)
##
## CAN ETH JPN KEN UMI USA
## 3 2 2 5 4 2 13697
Almost everyone is from the USA (not surprising). Besides 3 blank records, we have participants from the USA, Canada, Ethiopia, Japan, Kenya, and something called “UMI”. What’s “UMI”?
allData[allData$country == "UMI", c("firstname", "lastname", "bib", "city", "state", "country")]
## firstname lastname bib city state country
## 2764 LEAH FRIEDMAN 16109 OXNARD CA UMI
## 9208 CYNTHIA ROTAN 7832 RIVERBANK CA UMI
Looks like a typo. Might as well fix that.
allData[allData$country == "UMI", c("country")] <- c("USA")
Three records had no country assignment:
allData[allData$country == "", c("firstname", "lastname", "bib", "city", "state", "country")]
## firstname lastname bib city state country
## 2 SHADRACK KOSGEI 2 KENYA
## 4 NELSON OYUGI 3 KENYA
## 18 HAIS WELDAY 7 ERITREA
It looks like they just mixed up the country and city assignment (and left state and country blank).
allData[allData$country == "", c("country")] <- allData[allData$country == "", c("city")]
allData[grepl("^KEN", allData$country), c("country")] <- c("KEN")
allData[grepl("^ERI", allData$country), c("country")] <- c("ERI")
… and the final, cleaned-up country distribution is:
table(allData$country)
##
## CAN ERI ETH JPN KEN USA
## 2 1 2 5 6 13699
Of entrants from the USA, most are from California:
table(allData[allData$country == "USA", c("state")])
##
## AK AZ CA CO CT DC DE FL GA ID IL
## 11 2 17 13441 24 2 4 2 1 11 11 6
## IN KS MA MD MI MN MO MT NC ND NE NJ
## 2 2 10 3 1 3 4 5 1 3 1 4
## NM NV NY OH OK OR PA TX UT VA WA
## 4 45 8 1 1 24 2 11 4 4 24
There are a few runners who are surprisingly young:
allData[allData$age < 5, c("age", "firstname", "lastname", "country", "sex", "elapsedTime")]
## age firstname lastname country sex elapsedTime
## 1697 0 SOPHIA DAVIS USA F 0:52:28.45
## 1870 0 BRIAN DEDIEGO USA M 0:53:17.71
## 2153 0 DICK PATTERSON USA M 0:54:28.96
## 2749 0 JILL DAVIS USA F 0:56:58.52
## 2972 0 LARRY DEGHETALDI USA M 0:57:49.79
## 4200 0 GARY MCCONNELL USA M 1:02:02.33
## 4479 1 LAURA DUNN USA F 1:03:02.96
## 5476 1 FINN CALDEIRA USA M 1:06:24.84
## 5611 0 JEFFERY KRUSKALL USA M 1:06:52.92
## 7433 0 MOLLY ORDING USA F 1:14:10.16
## 7565 0 SUZANNE GOLDEN-RILEY USA F 1:14:43.42
## 8837 0 LYNELL HANCK USA F 1:20:58.4
## 8838 0 PATRICK HANCK USA M 1:20:58.59
## 8864 4 IZABELLA GIERAT-KATZ USA F 1:21:07.29
## 8968 0 KELLI VOLKSWAGEN USA F 1:21:41.5
## 8980 0 KELLI VOLKSWAGEN USA F 1:21:45.08
## 9672 2 JORGE RIVAS USA M 1:26:00.99
## 9915 3 JOSH PARKER USA M 1:27:44.74
## 11040 3 KATIE CAHILL USA F 1:35:44.82
## 11100 2 KATIE PETERSEN USA F 1:36:17.42
## 11922 4 QUINN SCHROMM USA F 1:44:14.74
## 13336 4 AIDAN TORRES USA M 2:01:15.6
Since they represent such a small percentage of the total data, I suppose it’s ok just to drop them.
allData <- dplyr::filter(allData, age >= 5)
The last-place runner, Jaynee Caruso, finished after 3 hours, 30 minutes - more than an hour after the previous runner! Also, notice that Jaynee started the race at 6:00AM - something’s very odd there. I’m just dropping that record.
allData[allData$elapsed > 2.49 * 3600 * 1000, c("overall", "firstname", "lastname", "startTime", "elapsedTime")]
## overall firstname lastname startTime elapsedTime
## 13692 13714 BRUCE HARTSOUGH 8:31:29.76 2:29:58.33
## 13693 13715 JAYNEE CARUSO 6:00:00 3:30:41.13
allData <- dplyr::filter(allData, elapsed < 2.75 * 3600 * 1000)
Last cleanup step: a bunch of runners started before the official start time (8:30AM). I don’t see any obvious pattern for these runners; because their start times cause trouble with my corral start analysis (below), I’m dropping them all.
allData <- dplyr::filter(allData, start > 30090000)
Here’s a plot of runner age vs. race time. To see if there’s a correlation between race time and gender, I’ve colored each runner according to gender. (It seems pretty clear that males have an advantage.) I’ve marked a couple of Ferruccis with black dots, so we can see where we fit in the overall picture. (If you want to mark your own data in black, just provide a file called “friends_priv.R” which implements a function, “getFriends()”, which delivers the subset of data that you’re interested in.)
if (file.exists("friends_priv.R")) {
source("friends_priv.R")
friends <- getFriends(allData)
} else {
friends <- subset(allData,
lastname == "FERRUCCI"
)
}
elapsed_ticks <- seq(0, max(allData$elapsed), 900000)
ggplot(allData, aes(x = age, y = elapsed, color=sex)) +
scale_x_continuous(breaks = seq(0, 100, 10)) +
scale_y_continuous(breaks = elapsed_ticks, labels = timestr(elapsed_ticks), name = "elapsed time (hh:mm:ss)") +
geom_point() +
expand_limits(y = 0.375 * 3600 * 1000) +
stat_smooth(method = "gam", formula = y ~ s(x, bs="cs")) +
geom_point(data=friends,aes(x = age, y = elapsed, shape=lastname), color = "black")
It’s a lot of data, with a huge range of run (elapsed) times. One feature of the wharf to wharf that’s worth mentioning: lots of people join the race just to walk. To help separate runners from walkers, the participants are organized into “corrals”, with an “elite” corral starting the race first, followed by corrals numbered 1 through 4. The idea is that if you expect to run faster, you’ll be in a lower-numbered corral (or the “elite” corral), and so won’t spend too much of your time dodging around slow-moving obstacles. So - at race day, the “elite” corral starts off, and everyone else moves forward, up to the start line. Then corral 1 starts, and corrals 2-4 move forward. This continues until corral 4 starts off. Each runner’s bib has an RFID tag in it, so that the time at which the runner crosses the start and finish lines can be recorded. Official run time is the difference between the two times.
Fortunately, the race data includes the start time for each runner. A plot of that data, start time vs. elapsed time, is interesting:
start_ticks <- seq(8.5 * 3600 * 1000, max(allData$start), 0.0625 * 3600 * 1000)
ggplot(allData, aes(x = elapsed, y = start, color = sex)) +
scale_y_continuous(breaks = start_ticks, labels = timestr(start_ticks)) +
scale_x_continuous(breaks = elapsed_ticks, labels = timestr(elapsed_ticks)) +
expand_limits(x = 0.25 * 3600 * 1000, y = 8.5 * 3600 * 1000) +
geom_point() +
geom_point(data=friends,aes(x = elapsed, y = start, shape=lastname), color = "black")
We can clearly see 4 distinct groups here. It looks like the “elite” corral started off around 8:31AM, with almost everyone crossing the start line at the same time. After a delay, a larger group of runners got the go-ahead at about 8:32; this group was so large that they couldn’t all cross the start line at once. Eye-witness report! We were in that group, corral 1, and because we arrived a bit late, we were near the middle of the group, as shown by the black dots.
After corral 1 was off, there was another delay, and then (conjecture) corral 2 was released at about 8:35. Continuing a trend, corral 2 is a larger group, with even more dispersion than the previous.
Finally, another pause, and the rest of the runners are started at about 8:39. This is by far the largest group, with the most dispersion. I think this data shows that corrals 3 and 4 were let go at once (that is, without a pause between corrals 3 and 4). Can any eyewitnesses corroborate the story that the data seems to be telling?
It would be interesting to try to separate runners from walkers. Can we do this by using the start time? I’ll try separating out the first two corrals (“elite” and corral 1), by including only participants with a start time before 8:35:30:
runners <- allData[allData$start < (8 + 35.5/60) * 3600 * 1000,]
runners_ticks <- seq(0, max(runners$elapsed), 900000)
ggplot(runners, aes(x = age, y = elapsed, color=sex)) +
scale_x_continuous(breaks = seq(0, 100, 10)) +
scale_y_continuous(breaks = elapsed_ticks, labels = timestr(elapsed_ticks), name = "elapsed time (hh:mm:ss)") +
geom_point() +
expand_limits(y = 0.25 * 3600 * 1000) +
stat_smooth(method = "gam", formula = y ~ s(x, bs="cs")) +
geom_point(data=friends,aes(x = age, y = elapsed, shape=lastname), color = "black")
There’s still quite a wide range of run/walk times there. Looking back at the start time plot, I see that some of the “elite” corral runners finished after upwards of 2 hours. That’s probably a bit of walking. This raises the question: how do you get into that elite corral?
How about the top 100, male and female? These are the real runners in this race.
top100 <- allData[allData$oversex <= 100,]
top100_ticks <- seq(0, max(top100$elapsed), (5/60) * 3600 * 1000)
ggplot(top100, aes(x = age, y = elapsed, color=sex)) +
scale_x_continuous(breaks = seq(0, 100, 10)) +
scale_y_continuous(breaks = top100_ticks, labels = timestr(top100_ticks), name = "elapsed time (hh:mm:ss)") +
geom_point() +
expand_limits(y = (27/60) * 3600 * 1000) +
stat_smooth(method = "gam", formula = y ~ s(x, bs="cs"))
There is a much stronger separation between male and female results in this group. The males tend to be on the younger side, with the oldest entrant under 60. The females continue on into their 70s - that’s pretty impressive.
Rather than using corrals to separate runners from walkers (actually, using start time, which is an indirect way of determining the corrals), how about using plain old run time? Looking at the data for everyone who finished in an hour or less would be simple, and might be interesting. Why an hour? Wikipedia describes “jogging” as “running at a gentle pace”, and says one definition of jogging is “slower than 6 miles per hour”. So, the one-hour threshold is arbitrary, but has some foundation in reality.
It would be interesting to scrape data from earlier years, and see if participants tend to be faster or slower, year by year.
Github link: https://github.com/aaronferrucci/wharf2wharf
Wharf to Wharf main page: http://www.wharftowharf.com/
Wharf to Wharf results page: http://www.wharftowharf.com/results
Definition of jogging: https://en.wikipedia.org/wiki/Jogging