Introduction:

The Cherry Blossom Ten Mile Run is a road race held annually in Washington, DC. It was founded in 1973 conceived as a warm up for runners of the Boston marathon, and a tourist friendly run for visitors during the Cherry Blossom season. The race has enjoyed growth in popularity and now has a lottery system for entry and visiting participants/competitors from other cities and countries. https://www.cherryblossom.org/wp-content/uploads/2021/03/Rite_of_Spring_1973_2020-1.pdf

I decided to explore the historical records of this road race using R, incorporating statistical and programming methods we explored during our course.

Motivation for choosing this project:

I chose this project for the challenge of creating and cleaning a dataset that was not in a finished format, as I wanted to encounter realistic challenges faced by practicing data analysts.

The following report is an exploratory data analysis of the finishers data, which can be found here: https://www.cballtimeresults.org/

Shortcomings in study design:

Despite the fact that this is a comprehensive collection of data, in terms of design of sample collection for inferential statistics, the study creates a very narrow population for which inferences can be created. The fact that the data exclusively represents the entire historical population of run participants for one organized event means that any inferences drawn are exclusively applicable to that event.

To increase the sample population of runners for which inferences could be made, one could randomly sample from multiple races of varying lengths, in multiple cities held at different times of day.

The data collected and any inferences and statistical analysis derived, are anecdotal and not transferable to other running events or populations larger than the participant pool of the Cherry Blossom race event.

Details of the data provided:

A peek at the data:

glimpse(runner_data)
## Rows: 364,841
## Columns: 14
## $ ID            <int> 1, 2, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 9, 10, 13, 14,…
## $ race_year     <int> 1998, 1984, 2009, 2011, 2012, 2013, 2012, 2013, 2017, 20…
## $ race_type     <fct> 10M, 10M, 5K, 10M, 10M, 10M, 10M, 10M, 10M, 10M, 10M, 10…
## $ minutes       <int> 103, 65, 36, 108, 105, 106, 104, 94, 95, 91, 99, 68, 72,…
## $ sex           <fct> M, M, M, F, F, F, M, M, M, M, M, M, M, M, M, M, M, F, F,…
## $ age           <int> 53, 37, 36, 37, 38, 39, 47, 48, 52, 53, 54, 47, 44, 42, …
## $ `PiS/TiS`     <chr> "3464/3750", "555/2573", "267/412", "6582/9030", "6577/9…
## $ city          <chr> "Mohegan Lake", "Bladensburg", "Upper Marlboro", "Washin…
## $ state         <chr> "NY", "MD", "MD", "DC", "DC", "DC", "DC", "DC", "DC", "V…
## $ country_1     <chr> "United States", "United States", "United States", "Unit…
## $ races_run     <int> 1, 1, 1, 3, 3, 3, 3, 3, 3, 2, 2, 1, 1, 3, 3, 1, 1, 2, 2,…
## $ latitude      <dbl> 41.3184, 38.9393, 38.8163, 38.8950, 38.8950, 38.8950, 38…
## $ longitude     <dbl> -73.8590, -76.9338, -76.7518, -77.0366, -77.0366, -77.03…
## $ miles_from_dc <dbl> 206.151503, 5.497844, 14.157472, 0.000000, 0.000000, 0.0…

Every record represents a runner’s participation in either the 5k run or the 10 Mile run, and it includes some information as described below. Runners that have participated more than one time will have multiple records associated by a unique ID.

-ID: is unique for every runner. Multiple runs by the same runner will use the same ID.
-race_year: the year of the event
-race_type: Either a 5 kilometer run or a 10 mile run
-Sex: The runner self registers as M(ale) or F(emale)
-age: the age of the runner at the time of the race
-minutes: the completion time of the runner’s race rounded to the nearest whole minute
-PiS/TiS: are the time position in sex (male or female)
-(city, state, country_1): runner provided hometown details–this was one input field that runners could use to enter their Hometowns and various levels of detail are provided for each runner. -longitude -latitude -miles_from_dc: geodesic distance (miles) from Washington, DC’s coordinates, provided by package Imap’s gdist function

Initial exploration

Age distribution of 10 Mile runners:
Most frequent (mode of) age of runners by sex

According to the IIRM’s (Intl. Inst. for Race Medicine) 2019 state of running publication, male and female participation peaks around ages 40 and 36 respectively. The data from the cherry blossom race have age modes of 32 and 27 for male and female runners respectively, indicating that the overall age distribution for this event skews younger than those of IIRM’s data.

tenMileRuns <- runner_data %>% filter(race_type == "10M")
fiveKRuns <- runner_data %>% filter(race_type == "5K")

tenMileRuns %>% 
  ggplot(aes(x = age, color=sex)) +
  geom_histogram(fill= 'white', binwidth = 1) +
  ggtitle("Stacked Histogram of Age of Run Participants Grouped by Sex")

There appears to be a right skew in the age distribution. There is no miniumum age requirement for the 10 mile run, so I think this skew is attributable to selection bias.

library(collapse)
## collapse 1.6.5, see ?`collapse-package` or ?`collapse-documentation`
## Note: stats::D  ->  D.expression, D.call, D.name
## 
## Attaching package: 'collapse'
## The following object is masked from 'package:lubridate':
## 
##     is.Date
## The following object is masked from 'package:stats':
## 
##     D
modeOfRunnerAgesBySex <- fmode(tenMileRuns$age, tenMileRuns$sex, ties = "max")
tibble( sex = names(modeOfRunnerAgesBySex), age_modes = modeOfRunnerAgesBySex)
## # A tibble: 2 x 2
##   sex   age_modes
##   <chr>     <int>
## 1 M            32
## 2 F            27

How has race participation has increased over time?

Sex proportions of total runner pool

The IIRM states that as of 2018, female runners make up the majority of race participants since their participation started increasing significantly in the 1980s. This is reflected in the Cherry Blossom run data as shown above.

tenMileRuns %>%
  group_by(race_year) %>%
  count(sex) %>%
  ggplot(aes(fill = sex, x=race_year, y = n)) +
   geom_bar(alpha=0.7 ,position = "stack", stat = "identity") +
  ggtitle("Bar Plot of Total Runners by Year Grouped by Sex")

The average finish time across both sexes has increased over the years at about the same rate
tenMileRuns %>%
  group_by(race_year, sex) %>%
  mutate(mean_completion_time_by_sex = mean(minutes)) %>%
  distinct(sex, mean_completion_time_by_sex) %>%
  arrange(race_year) %>%
  ggplot(aes(x = race_year, y = mean_completion_time_by_sex, color = sex)) +
    geom_line() +
  ggtitle("Line Chart of Average Completion Time by Year Grouped by Sex")

The change in average finishing time trended upwards as the running event matured and the number of records participant pool grew from 300 or fewer runners in the first handful of years to about 17,000 runners in recent years’ events. The line is very sensitive in the early years as it is using very few data points for the years 1973 to 1977.

tenMileRuns %>% count(race_year)
## # A tibble: 47 x 2
##    race_year     n
##        <int> <int>
##  1      1973     2
##  2      1974    72
##  3      1975   122
##  4      1976   264
##  5      1977   249
##  6      1978   306
##  7      1979  2973
##  8      1980  1617
##  9      1981  3329
## 10      1982  3180
## # … with 37 more rows
tenMileRuns %>%
  ggplot(aes(x = minutes, color= sex, fill=sex)) +
  geom_histogram(fill= 'white', binwidth = 2) +
  ggtitle("Stacked Histogram of Finishing Times (minutes) Grouped by Sex")

This appears to be relatively normally distributed with heavy tails

tenMileRuns %>% 
  ggplot(aes(sample = minutes)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q plot of Finishing Time (minutes)")

Bump around 10 races completed threshold?

This race is very popular and quite difficult to get into. If an individual has run it 10 times, their access to the race is guaranteed, meaning they do not have to enter the lottery. Let’s take a look at the distribution of the number of races run, but only for runners who have run it more than once.

My guess is that we will see people who are close to reaching 10 races completed push themselves past that milestone to get guaranteed entry, and therefore we will see an unexpectedly large number of runners who have completed the 10 mile run at least ten times.

tenMileRuns %>% 
  group_by(ID) %>% 
  summarise(num_10_milers_run = n()) %>%
  filter(num_10_milers_run >2) %>%
  ggplot(aes(x=num_10_milers_run)) +
  geom_bar(fill='lightblue', color= "black") +
  ggtitle("Bar Plot of Number of 10 Mile Events Completed per Runner")

As shown above, there is no departure from the decreasing trend of races completed around the 10 races mark. Possibly, the rule was introduced more recently and so the run data would not reflect the rule as I expected. More likely though, the idea of running in the race that many times is not as appealing to most people.

Geocoded data

I used the mapbox API to get the latitude and longitude points for the user-entered input for Hometowns. I then used the gdist function from Imap (R package) to calculate the geodesic distance of each hometown from Washington, DC’s lat long coordinates.

Explore a possible relationship with distance traveled and finish times.

According to the IIRM’s 2019 study, foreign participation in races is growing at an increasing rate, from 0.1% in 1994 to 1.9% in 2018 for half-marathons. 10 Mile run events are not nearly as widespread as other events, which could be an explanation as to why the Cherry Blossom’s foreign participation rates are much lower as shown in the output below.

tenMileRuns %>% group_by(race_year) %>% summarise(pcnt_not_from_US = mean(!(country_1 %in% "United States")),count_not_from_US = sum(!(country_1 %in% "United States")), num_in_group = n())
## # A tibble: 47 x 4
##    race_year pcnt_not_from_US count_not_from_US num_in_group
##        <int>            <dbl>             <int>        <int>
##  1      1973          0                       0            2
##  2      1974          0                       0           72
##  3      1975          0                       0          122
##  4      1976          0                       0          264
##  5      1977          0                       0          249
##  6      1978          0                       0          306
##  7      1979          0.00202                 6         2973
##  8      1980          0.00124                 2         1617
##  9      1981          0.00240                 8         3329
## 10      1982          0                       0         3180
## # … with 37 more rows

The idea was that there may be a significant relationship between distance traveled and finish times. I was guessing that if you traveled from further away, you were more likely to have a more competitive finish time.

tenMileRuns %>% 
  ggplot(aes(x = miles_from_dc)) +
  geom_histogram(fill= 'lightblue', color= "black", binwidth = 100) +
  ggtitle("Histogram of Miles Traveled to DC for All Runners")

After seeing the scale of the difference between the number of participants from the DC area and those traveling, it’s looking less likely that there is a relationship, but I am still hopeful.

Fit a regression line to: minutes using miles_from_dc as the input
tenMileRuns %>%
  ggplot(aes(x=miles_from_dc, y=minutes))+
  geom_point()+
  geom_smooth(method = "lm") +
  ggtitle("Plot of `miles_from_dc` and Race Completion Time for All Runners with Regression Line")
## `geom_smooth()` using formula 'y ~ x'

fit_finish_time_to_miles_traveled_unrounded <- lm(minutes ~ miles_from_dc, data = tenMileRuns)

summary(fit_finish_time_to_miles_traveled_unrounded)
## 
## Call:
## lm(formula = minutes ~ miles_from_dc, data = tenMileRuns)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.869 -12.256  -0.931  10.761 122.732 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.228e+01  3.037e-02 3038.89   <2e-16 ***
## miles_from_dc -2.872e-03  8.943e-05  -32.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.17 on 339622 degrees of freedom
## Multiple R-squared:  0.003027,   Adjusted R-squared:  0.003025 
## F-statistic:  1031 on 1 and 339622 DF,  p-value: < 2.2e-16

According to the adjusted R-squared of this simple fit of minutes ~ miles, only .3% of the variation in the data is explained by this model.
While the p-value indicates the correlation is of statistically significance, the measure of correlation is not strong.

What happens if I average finish times based on distance traveled (rounded to nearest thousand miles)
tenMileRuns %>%
  mutate(rounded_miles = round(miles_from_dc,-3)) %>%
  group_by(rounded_miles) %>%
  summarise(avg_finish_time_mins_for_dist_from_dc = mean(minutes))
## # A tibble: 10 x 2
##    rounded_miles avg_finish_time_mins_for_dist_from_dc
##            <dbl>                                 <dbl>
##  1             0                                  92.0
##  2          1000                                  95.2
##  3          2000                                  92.5
##  4          3000                                  76.0
##  5          4000                                  79.9
##  6          5000                                  54.7
##  7          6000                                  53.2
##  8          7000                                  51.0
##  9          8000                                  67.6
## 10          9000                                  71.1
tenMileRuns %>%
  mutate(rounded_miles = round(miles_from_dc,-3)) %>%
  group_by(rounded_miles) %>%
  summarise(avg_finish_time_mins_for_dist_from_dc = mean(minutes)) %>%
  ggplot(aes(x=rounded_miles, y=avg_finish_time_mins_for_dist_from_dc))+
  geom_point()+
  geom_smooth(method = "lm") +
  ggtitle("Plot of Average Completion Time for `miles_from_dc` (rounded to nearest thousand) with Regression Line")
## `geom_smooth()` using formula 'y ~ x'

distanceTraveled_vs_finishTime <- tenMileRuns %>%
  mutate(rounded_miles = round(miles_from_dc,-3)) %>%
  group_by(rounded_miles) %>%
  summarise(avg_finish_time_mins_for_dist_from_dc = mean(minutes))

fit_finish_time_to_miles_traveled <- lm(avg_finish_time_mins_for_dist_from_dc ~ rounded_miles, data = distanceTraveled_vs_finishTime)

summary(fit_finish_time_to_miles_traveled)
## 
## Call:
## lm(formula = avg_finish_time_mins_for_dist_from_dc ~ rounded_miles, 
##     data = distanceTraveled_vs_finishTime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.549  -9.861   2.301   8.395  16.420 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   91.955688   6.965663  13.201 1.03e-06 ***
## rounded_miles -0.004141   0.001305  -3.173   0.0131 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.85 on 8 degrees of freedom
## Multiple R-squared:  0.5573, Adjusted R-squared:  0.5019 
## F-statistic: 10.07 on 1 and 8 DF,  p-value: 0.01313

The adjusted r-squared increases to a satisfying 50%, meaning 50% of the variability in these (highly summarized) data are explained by the model. The trade-off was the increase in our p-value to greater than .01, indicating that these results are more likely to arise by chance than with the first model.

Multiple linear regression

Let’s see if we can get a better model if we include the following variables:
-race_year
-sex
-age
-miles_from_dc

multipleLRfit <- lm(minutes ~ race_year + sex + age + miles_from_dc, data = tenMileRuns)
summary(multipleLRfit)
## 
## Call:
## lm(formula = minutes ~ race_year + sex + age + miles_from_dc, 
##     data = tenMileRuns)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.391  -9.951  -0.798   8.944 127.258 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -9.755e+02  4.992e+00 -195.41   <2e-16 ***
## race_year      5.241e-01  2.492e-03  210.28   <2e-16 ***
## sexF           1.190e+01  5.277e-02  225.48   <2e-16 ***
## age            2.821e-01  2.466e-03  114.41   <2e-16 ***
## miles_from_dc -3.601e-03  7.582e-05  -47.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.55 on 339619 degrees of freedom
## Multiple R-squared:  0.2842, Adjusted R-squared:  0.2842 
## F-statistic: 3.371e+04 on 4 and 339619 DF,  p-value: < 2.2e-16
plot(multipleLRfit, which = c(1:3))

Surprisingly, holding all other variables constant, race_year is the most correlated variable to a runner’s finishing time. A one year increase, is correlated with a .52 minute (~31 second) increase in finishing time.

Formula for the regression line:
\[\widehat{minutes} = race\_year \times .5241 ~+ sexF \times .1190 ~+ age \times .2821 + miles\_from\_dc \times -.003601\]

Concluding thoughts

Regression:

This is an excellent example of why correlation is not causation. Clearly race_year is not causing an increase in runner finishing time. One possible explanation for race_year having the largest correlation coefficient, is that as the popularity of the run increased, the average participant was less competitive. The increase in the number of less competitive participants is being confounded with race_year in regard to it’s impact on finishing time.

The best predictor of finishing time (race_year) was the one I least expected. This was a nice lesson in the value of regression analysis–I was shown something about the data that was not immediately apparent to me.

Collecting and cleaning process:

I learned a lot about collecting and cleaning data and how long it can take, especially if you are learning the programming language and statistics at the same time. It was an iterative process, forcing me to learn new tools at every stage. Some of the more significant tools that were new to me include:

  • rvest: a web scraping package
  • mapbox: a company with a suite of mapping tools. I used the free tier of their geocoding api

Further steps with this dataset:

If I were to further explore this dataset, one thing I am curious about is if the runners’ finishing time trajectories follow a similar pattern. Do runners peak around a similar age? Does their performance decline consistently? At what rate?

Notes regarding the data:

-The 2015 running of the 10 mile race had to be cut short and had a total distance of 9.39 miles. The times have been adjusted for in the official results, and the adjusted times are reflected in this data.
-The 2019 running of the 10 mile race was in actuality 9.94 miles. The times were left unadjusted and these data likewise reflect the official results.
-The 2020 run was held as a virtual event, and those data are excluded from this analysis.

Sources cited:

Race records database: https://www.cballtimeresults.org/
Narrative of evolution of race event https://www.cherryblossom.org/wp-content/uploads/2021/03/Rite_of_Spring_1973_2020-1.pdf
IIRM ‘State of Running’ report: https://racemedicine.org/the-state-of-running-2019/

Highlighting some of the errors in user input for Hometown which leads to inaccuracies in the geocoding

There were many challenges in cleaning and parsing the user entered input. I spent many days cleaning this input, and as shown below I am still coming across some entries which got miscoded in the geocoding.

mistakes <- tenMileRuns %>% filter(miles_from_dc > 1875 & miles_from_dc<2500)  %>% filter(state %in% c("CT", "MD", "ME", "DC", "IL","MA","VA","WI"))
mistakes
## # A tibble: 12 x 14
##        ID race_year race_type minutes sex     age `PiS/TiS` city         state
##     <int>     <int> <fct>       <int> <fct> <int> <chr>     <chr>        <chr>
##  1   4277      2004 10M            89 F        23 1076/3900 Seattle      MD   
##  2   5281      2006 10M            83 M        35 1766/5236 Lusby        MA   
##  3  42483      2011 10M            92 M        25 4177/7009 Alexandria   IL   
##  4  45084      2015 10M           136 M        29 6763/6838 R            MD   
##  5  75587      2012 10M            86 M        27 3385/7195 Gulfport     ME   
##  6  89151      2004 10M            90 M        36 2554/4157 Temple Hills DC   
##  7 158474      1982 10M            91 F        44 337/533   Lanham       ME   
##  8 165160      2012 10M           130 F        45 9443/9729 Hay Market   VA   
##  9 169729      2013 10M           124 M        62 6980/7215 LaCrosse     WI   
## 10 169928      1998 10M            90 M        45 2527/3750 Kellingworth CT   
## 11 169928      2003 10M           103 M        50 3393/3923 Kellingworth CT   
## 12 173993      1984 10M            78 M        32 1697/2573 Watered      CT   
## # … with 5 more variables: country_1 <chr>, races_run <int>, latitude <dbl>,
## #   longitude <dbl>, miles_from_dc <dbl>