A Data Analysis of the 2016 Boston Marathon Finishers

By Susan Li

March 28, 2017

As a runner myself, this dataset from Kaggle naturally got my attention. The dataset contains the name, age, gender, country, city and state (where available), times at 9 different stages of the race, expected time, finish time and pace, overall place, gender place and division place of the 26,630 finishers of the 2016 Boston Marathon.

As always, I start by checking for missing values, then wrangle and clean up the data.

##  [1] Bib            Name           Age            Gender        
##  [5] City           Country        5K             10K           
##  [9] 15K            20K            Half           25K           
## [13] 30K            35K            40K            Pace          
## [17] Proj_Time      official_time  overall_place  Gender_place  
## [21] Division_place
## <0 rows> (or 0-length row.names)
## [1] 26630    21
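The cleaning step can be sketched roughly as follows. This is a Python illustration, not the code behind the output above. It assumes that times are stored as "H:MM:SS" strings and that a missing value shows up as an empty cell or a "-"; the helper names `to_minutes` and `count_missing` and the two sample rows are made up for the example.

```python
def to_minutes(clock: str) -> float:
    """Convert an 'H:MM:SS' clock string to minutes (assumed time format)."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 60 + minutes + seconds / 60


def count_missing(rows: list[dict], columns: list[str]) -> dict:
    """Count cells that are empty or '-' in each column (assumed missing markers)."""
    return {
        col: sum(1 for row in rows if (row.get(col) or "").strip() in ("", "-"))
        for col in columns
    }


# Two made-up rows for illustration only; they are not taken from the dataset.
rows = [
    {"Bib": "1", "Half": "1:45:00", "official_time": "3:35:00"},
    {"Bib": "2", "Half": "", "official_time": "4:10:30"},
]

print(count_missing(rows, ["Half", "official_time"]))
print(to_minutes("2:12:45"))  # 132.75
```

Working in minutes makes it easy to compare finish times against cut-offs and to subtract one stage time from another.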

Who were the finishers?

## marathon$Gender: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    31.0    40.0    39.9    47.0    83.0 
## -------------------------------------------------------- 
## marathon$Gender: M
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   36.00   45.00   44.72   53.00   83.00

The youngest finishers were 18 years old and the oldest were 83; this applies to both genders.

Among younger runners there are more female finishers than male, but this trend reverses for older runners, and 48 seems to be a good age to run for both men and women.

Female finishers are younger than male finishers at the first quartile, the median and the third quartile.

The split between men and women was 54% (14,463 runners) to 46% (12,167 runners), respectively, and over half (56%) of the finishers were older than 40. There were 9,196 men over 40 compared with just 5,263 under 40. The picture looks different for women: 5,781 were over 40 compared with 6,375 under 40.
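The percentages follow directly from the counts quoted in this post. A quick Python check, using only those counts (the variable names are mine):

```python
# Counts quoted in the text
total = 26630
men, women = 14463, 12167
men_over_40, women_over_40 = 9196, 5781

# The two gender counts should add up to the number of finishers
assert men + women == total

share_men = men / total
share_women = women / total
share_over_40 = (men_over_40 + women_over_40) / total

# prints: men 54%, women 46%, over 40: 56%
print(f"men {share_men:.0%}, women {share_women:.0%}, over 40: {share_over_40:.0%}")
```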

The difference in average finish time between age groups is not large. Women aged 26 to 40 had the best times in the female group. In the male group, there was almost no difference in average finish time between the ages of 18 and 40.

Age matters

We can see that young runners get faster as they mature; on average, their best times come at around 30 years old, and after that their finish times slowly increase. Of course there were outliers: several women aged between 70 and 74 were faster than 60-year-old men on average, and some men aged between 75 and 79 finished within 4 hours.
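A comparison like this comes down to grouping finish times by gender and age band and averaging within each group. Here is a minimal Python sketch of that idea, assuming five-year bands such as 70-74 and 75-79; the function names and the sample runners are made up for illustration and are not rows from the dataset.

```python
from collections import defaultdict
from statistics import mean


def age_band(age: int, width: int = 5) -> str:
    """Label the band an age falls into, e.g. 72 -> '70-74' (assumed band width)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def mean_time_by_band(runners):
    """Average finish time in minutes per (gender, age band)."""
    groups = defaultdict(list)
    for gender, age, minutes in runners:
        groups[(gender, age_band(age))].append(minutes)
    return {key: mean(times) for key, times in groups.items()}


# Made-up (gender, age, finish minutes) tuples, for illustration only
runners = [("F", 72, 250.0), ("F", 74, 270.0), ("M", 60, 265.0)]
print(mean_time_by_band(runners))
```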

Elite and the rest of us

According to the Mercedes Marathon, to qualify as an elite runner, the PR finishing time must be 2:35 (155 minutes) or faster for men and 3:05 (185 minutes) or faster for women. It makes sense to separate them from the rest of us.
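Separating the two groups amounts to comparing each finish time with a cut-off that depends on gender. A small Python sketch, assuming the cut-offs are applied to the finish time in minutes and that a time exactly on the cut-off counts as elite (both are my assumptions; `is_elite` is a hypothetical helper):

```python
# Cut-offs in minutes, from the standards quoted above
ELITE_CUTOFF_MIN = {"M": 155.0, "F": 185.0}


def is_elite(gender: str, finish_minutes: float) -> bool:
    """True if the finish time is at or under the cut-off for that gender."""
    return finish_minutes <= ELITE_CUTOFF_MIN[gender]


# Hypothetical finish times, for illustration only
print(is_elite("M", 150.0))  # True
print(is_elite("F", 200.0))  # False
```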

The male and female winners of the 2016 Boston Marathon, Lemi Berhanu Hayle and Atsede Baysa, crossed the line in 2:12:45 and 2:29:19, respectively. Most of the elite male runners crossed the finish line shortly after the 150-minute mark, and a large proportion of the elite female runners crossed it around the 180-minute mark.

## marathon_normal$Gender: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   185.2   220.0   237.2   246.7   265.0   630.4 
## -------------------------------------------------------- 
## marathon_normal$Gender: M
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   155.3   196.5   217.7   226.2   248.0   505.2

For the rest of us, the fastest man crossed the finish line in about 155 minutes (2:35) and the fastest woman in about 185 minutes (3:05). The peak finish time was around 205 minutes (3:25) for men and around 230 minutes (3:50) for women.

When it comes to running, there is a gender gap: men are, on average, faster than women. This gap is greater among the runners who finish last than among those who finish first.


In running, a negative split is a racing strategy that involves completing the second half of a race faster than the first half. In contrast, a positive split means running the second half slower than the first half. Even splits are where the two halves of the race are run in the same amount of time.

To add splits to my analysis, I need to do some calculation and add a column.
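One way to do this calculation, sketched in Python: take the second-half time as the finish time minus the half-marathon time, then subtract the first-half time from it. This is an assumption about how the split could be computed, not necessarily the formula behind the results discussed here, and both function names are made up.

```python
def split_minutes(half_minutes: float, finish_minutes: float) -> float:
    """Second-half time minus first-half time, in minutes (assumed definition)."""
    second_half = finish_minutes - half_minutes
    return second_half - half_minutes


def split_type(diff: float, tolerance: float = 0.0) -> str:
    """Classify a split difference as 'negative', 'positive' or 'even'."""
    if diff < -tolerance:
        return "negative"
    if diff > tolerance:
        return "positive"
    return "even"


# Hypothetical runner: half in 100 minutes, finish in 205 minutes
diff = split_minutes(100.0, 205.0)
print(diff, split_type(diff))  # 5.0 positive
```

The `tolerance` argument is there because two halves are almost never identical to the second; setting it to a few seconds' worth of minutes would treat near-equal halves as even splits.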

The majority of the runners ran a positive split: they ran the second half slower than the first half, but not by much. Only a very small percentage of runners ran a negative split. Are they elite runners?

Indeed, there seem to be more negative-split runners in the elite group, and even when they run a positive split, the difference is very small (0 to 0.2), which indicates that they were able to maintain a very steady pace.