Who’s the Fastest of All? Analyzing the 100 Metres Men’s Sprint Data

100 Metres Sprint Best Times, 1958 - Present

John Karuitha

2022-02-27

Background

I use data from World Athletics on the best times posted by male athletes in the 100 metres sprint from 1958 to the present. The data is available at https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior?regionType=world&timing=electronic&windReading=regular&page=21&bestResultsOnly=false&firstDay=1900-01-01&lastDay=2021-09-20.

Objectives

I examine the following questions in this article.

  1. Which countries have produced the most successful 100 meters male sprinters?
  2. Of the elite 100 meters sprinters, which sprinters have run the most races?
  3. Is there a relationship between the age of a sprinter and the best times posted in the 100 meters sprint?
  4. Is there a relationship between the number of races in a year and the time posted?

Summary of Results

  1. The United States has the most male athletes who have posted best times in the 100 meters sprint. Jamaica comes a distant second.
  2. Usain Bolt has run the three fastest times in the 100 meters sprint.
  3. Michael Rodgers from the United States has run the highest number of races while posting some of the best times.
  4. The peak performance age for male 100 meters sprinters is in the mid-20s.
  5. There appears to be a non-linear relationship between the number of races and times posted.
  6. Regression analysis (Linear Model and Generalized Additive Model, GAM) confirms that both age and number of races are significant predictors of athlete performance. However, the result may be affected by omitted-variable bias.

Exploring the Data

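The raw data consists of the following variables (after cleaning the column names): rank, mark, wind, competitor, dob, nat, pos, venue, date, and results_score.

Note that many athletes appear multiple times because the list records individual performances rather than athletes. For instance, Usain Bolt posted the three best times, each in a different race.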

read_chunk("code/sprints.R")

As noted, the data is available on the World Athletics website. It spans two hundred and twenty-four (224) web pages at the time of writing (2022-02-27). Copying and pasting that much data by hand would take ages. Hence, the first exercise is to scrape the data.

Please refer to my previous project on web scraping, available at https://rpubs.com/Karuitha/web_scrapping_1. I outline the steps in scraping the data next.

Web Scraping

The scraping process takes considerable time. For this reason, I have commented out the last two lines of code, which do the scraping and save the result.

NOTE: To repeat the scraping process, remove the # before the two lines of code.

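library(tidyverse) ## %>%, map_dfr(), write_csv()
library(rvest)     ## read_html(), html_nodes(), html_table()

## Page numbers to request
pages <- 1:225

## The page number goes between these two URL fragments
url <- "https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior?regionType=world&timing=electronic&windReading=regular&page="

url_2 <- "&bestResultsOnly=false&firstDay=1900-01-01&lastDay=2021-09-20"

################################################################################
## Scraping function: pause, fetch one page, and return its tables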

scrapper <- function(x){
        
        Sys.sleep(2)
        
        read_html(paste0(url, x, url_2)) %>% 
                
                html_nodes("table") %>% 
                
                html_table()
        
}

################################################################################
## Below is the code for web scraping. 
## I have commented it out as it takes time to run.
## I have saved the data in a .csv file.: my_100_dash_data.csv
## You can uncomment to rerun the harvesting of data from the web.
################################################################################

# my_100_dash_data <- pages %>% map_dfr(~ scrapper(.x))

#write_csv(my_100_dash_data, "my_100_dash_data.csv")

################################################################################
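The request URL for each page is just the two fragments joined around the page number. Here is a minimal sketch in base R. The names `base_url`, `suffix`, and `build_page_url` are illustrative and not part of the original script, and the base URL is a shortened placeholder rather than the real World Athletics address.

```r
# Hypothetical stand-ins for the `url` and `url_2` fragments defined above
base_url <- "https://example.org/toplist?page="
suffix   <- "&bestResultsOnly=false"

# Join the fragments around a page number, as scrapper() does with paste0()
build_page_url <- function(page) paste0(base_url, page, suffix)

# Build the URLs for the first three pages
page_urls <- vapply(1:3, build_page_url, character(1))
print(page_urls)
```

Checking a handful of assembled URLs in a browser before launching the full loop is a cheap way to catch a malformed query string.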

Feature Engineering

The resultant dataset has 22,400 rows and 14 columns. The data cleaning process involves converting the date of birth (dob) and the race date (date) to the date format. After this, I engineered the following additional variables.

  • Age of athletes at the time of the race in days.

  • Age of athletes at the time of the race in years. I divided the age in days by 365.25 to get years.

  • Venue country code: The code of the country where the race happened.

  • Venue country name: I used the countrycode package in R to convert the country codes into country names. Where missing, I used information available on https://www.olympiandatabase.com/index.php?id=1670&L=1 to fill in the country names.

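library(tidyverse)   ## read_csv(), mutate(), str_remove_all(), case_when(), ggplot()
library(lubridate)   ## dmy(), year()
library(janitor)     ## clean_names()
library(countrycode) ## countrycode()
library(kableExtra)  ## kbl(), kable_classic()
library(skimr)       ## skim_without_charts()

my_100_dash_data <- read_csv("data/my_100_dash_data.csv") %>% 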
        
        clean_names() %>% 
        
        select(-x8) %>% 
        
        mutate(dob = lubridate::dmy(dob), 
               
               date = lubridate::dmy(date), 
               
               age_days = (date - dob),
               
               age_years = as.numeric(age_days / 365.25),
               
               ## Take the first parenthesised upper-case code, e.g. "(NGR)"
               venue_country_code = str_extract(venue, "\\([A-Z]+\\)"), 
               
               venue_country_code = str_remove_all(venue_country_code, "\\(|\\)")) %>% 
        
        mutate(venue_country_name = countrycode(venue_country_code, 
                                                
                                                origin = "ioc", 
                                                
                                                destination = "country.name")) %>% 
        
        mutate(venue_country_name = case_when(venue_country_code == "AHO" ~ "Netherlands Antilles",
                                              
                                              venue_country_code == "FRG" ~ "Germany",
                                              
                                              venue_country_code == "GDR" ~ "Germany",
                                              
                                              venue_country_code == "MAC" ~ "Macau",
                                              
                                              venue_country_code == "TCH" ~ "Czechia",
                                              
                                              venue_country_code == "TKS" ~ "Turks and Caicos Islands",
                                              
                                              venue_country_code == "URS" ~ "Russia",
                                              
                                              TRUE ~ venue_country_name
                                              
                                              ))

################################################################################
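As a check on the feature engineering, the sketch below reproduces the age and venue-code calculations for a single row in base R. It uses the dates and venue of Chinedu ORIALA's entry, which appears later in this article. The base R calls (`as.Date`, `sub`) are stand-ins for the lubridate and stringr functions used in the pipeline above, not the original code.

```r
# One entry from the dataset: born 1981-12-17, raced 1997-03-21
dob       <- as.Date("1981-12-17")
race_date <- as.Date("1997-03-21")

age_days  <- as.numeric(race_date - dob)  # difference between dates, in days
age_years <- age_days / 365.25            # 365.25 averages in leap years

# Pull the upper-case code out of the trailing parentheses of a venue string
venue      <- "Benin City (NGR)"
venue_code <- sub(".*\\(([A-Z]+)\\)$", "\\1", venue)

print(age_days)             # 5573
print(round(age_years, 5))  # 15.25804
print(venue_code)           # "NGR"
```

Both figures agree with the age_days and age_years values reported for this athlete in the dataset.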

Missing Data

Five variables have missing data: wind, dob, age_days, age_years, and pos. However, the extent of missingness is small, at under 2 percent for every variable, as the table below shows.

Amelia::missmap(my_100_dash_data)

Visualisation of Missing Data

sapply(my_100_dash_data, is.na) %>% 
        
        colSums() %>% 
        
        tibble(variables = names(my_100_dash_data), missing = .) %>% 
        
        arrange(desc(missing)) %>% 
        
        mutate(prop_percent = missing / nrow(my_100_dash_data) * 100) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Missing Data") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Missing Data

variables missing prop_percent
wind 431 1.9241071
dob 181 0.8080357
age_days 181 0.8080357
age_years 181 0.8080357
pos 50 0.2232143
rank 0 0.0000000
mark 0 0.0000000
competitor 0 0.0000000
nat 0 0.0000000
venue 0 0.0000000
################################################################################
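The percentages in the table are the missing counts divided by the 22,400 rows. A short base R sketch of that arithmetic, using the counts reported above (the vector name `missing_counts` is illustrative):

```r
n_rows <- 22400

# Missing counts taken from the table above
missing_counts <- c(wind = 431, dob = 181, pos = 50)

# Share of rows with a missing value, in percent
prop_percent <- missing_counts / n_rows * 100
print(round(prop_percent, 7))
```

On a data frame `df`, the same proportions can be obtained directly with `colMeans(is.na(df)) * 100`.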

Exploratory Data Analysis

This section begins by examining the distribution of times posted by the athletes. Next, I discuss the countries with the most athletes who posted the best times in the 100 metres dash.

Distribution of 100 Meters Male Sprint Best Times

The graph below shows the distribution of the times posted by male athletes in the 100 metres sprint. It shows the razor-thin margins that separate world beaters like Usain Bolt from other athletes. The difference between the world record (9.58 seconds) and the slowest time in the dataset (10.30 seconds) is just 0.72 seconds.

my_100_dash_data %>% 
        
        ggplot(mapping = aes(x = mark)) + 
        
        geom_histogram(col = "black", fill = "skyblue", binwidth = 0.01) + 
        
        ggthemes::theme_economist() + 
        
        labs(x = "Mark", y = "Count", 
             
             title = "Histogram of Best Times of Top 100 Metres Male Sprinters")

################################################################################
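To put the margin in perspective, the sketch below expresses the spread between the fastest and slowest marks in the dataset both in seconds and as a share of the world record time. The variable names are illustrative.

```r
fastest <- 9.58   # world record, the minimum mark in the dataset
slowest <- 10.30  # the maximum mark in the dataset

spread <- slowest - fastest

print(round(spread, 2))                  # 0.72
print(round(spread / fastest * 100, 1))  # 7.5
```

In other words, every one of the 22,400 performances lies within about 7.5 percent of the world record.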

I also examine the trend in world-leading times. The figure below shows a gradual decline in the best times posted by male sprinters over 100 meters. There is no telling whether this trend will continue.

## Improvements over time- best times per year

my_100_dash_data %>% 
        
        group_by(year = year(date)) %>% 
        
        summarise(year_record = min(mark)) %>% 
        
        ggplot(mapping = aes(x = year, y = year_record)) +
        
        geom_line(col = "blue") +
        
        labs(x = "Year", y = "Mark/Best Time", 
             
             title = "Trend in 100 Meters Men Best Times") + 
        
        ggthemes::theme_economist()

Most Successful Countries

The table below counts the entries in the dataset by nationality. The United States leads. However, as noted earlier, athletes are repeated, given that one athlete may have posted multiple times. In the next table, I remove the duplicates to get the country with the most distinct athletes.

my_100_dash_data %>% 
        
        count(nat, sort = TRUE) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Number of Athletes by Nationality") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Number of Athletes by Nationality

nat n
USA 7086
JAM 2426
GBR 1648
NGR 1076
CAN 887
JPN 736
TTO 729
FRA 600
RSA 555
BRA 499
############
my_100_dash_data %>% 
        
        select(competitor, nat) %>% 
        
        filter(!duplicated(.)) %>% 
        
        count(nat) %>% 
        
        arrange(desc(n)) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Number of Distinct Athletes by Nationality") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Number of Distinct Athletes by Nationality

nat n
USA 691
JAM 132
GBR 90
JPN 90
NGR 71
CAN 55
RSA 52
CHN 47
FRA 44
GER 42
################################################################################
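The difference between the two tables is the difference between counting rows and counting distinct athlete-country pairs. A toy illustration in base R, with made-up athletes A, B, and C rather than real data:

```r
# Toy data: athlete A has three entries, B and C one each
toy <- data.frame(
  competitor = c("A", "A", "A", "B", "C"),
  nat        = c("USA", "USA", "USA", "USA", "JAM")
)

# Entries per country: every row counts
entries_by_nat <- table(toy$nat)

# Distinct athletes per country: drop repeated (competitor, nat) rows first
athletes_by_nat <- table(unique(toy)$nat)

print(entries_by_nat)
print(athletes_by_nat)
```

Here the USA has four entries but only two distinct athletes, which is the same effect that shrinks the USA count from 7,086 entries to 691 athletes above.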

World Record Holders

The table below shows the ten best times posted in the 100 metres dash. Note that Usain Bolt appears in this list four times.

my_100_dash_data %>% 
        
        select(competitor, mark) %>% 
        
        arrange(mark) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, 
            
            caption = "Top 10 Best Times in 100 Meters Dash") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Top 10 Best Times in 100 Meters Dash

competitor mark
Usain BOLT 9.58
Usain BOLT 9.63
Usain BOLT 9.69
Tyson GAY 9.69
Yohan BLAKE 9.69
Tyson GAY 9.71
Usain BOLT 9.72
Asafa POWELL 9.72
Asafa POWELL 9.74
Justin GATLIN 9.74
###########################################

Top 10 100 Metres Dash Athletes

It is common knowledge that Usain Bolt is the record holder; he holds four of the ten best times in the 100 meters dash. But who are the other top contenders? I remove duplicates so that each athlete appears once, with his best time. With that, the top 10 athletes are in the table below.

my_100_dash_data %>% 
        
        select(competitor, mark) %>% 
        
        group_by(competitor) %>% 
        
        arrange(mark) %>% 
        
        slice(1) %>% 
        
        ungroup() %>% 
        
        arrange(mark) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Top 10 100 Meters Athletes") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Top 10 100 Meters Athletes

competitor mark
Usain BOLT 9.58
Tyson GAY 9.69
Yohan BLAKE 9.69
Asafa POWELL 9.72
Justin GATLIN 9.74
Christian COLEMAN 9.76
Trayvon BROMELL 9.76
Ferdinand OMANYALA 9.77
Nesta CARTER 9.78
Maurice GREENE 9.79
################################################################################
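The "one entry per athlete" step amounts to taking the minimum mark within each athlete. A base R sketch of the same idea, using four marks from the tables above; `aggregate` here is a stand-in for the `group_by` and `slice` steps in the original pipeline.

```r
# Four entries taken from the top-ten table above
toy <- data.frame(
  competitor = c("Usain BOLT", "Usain BOLT", "Tyson GAY", "Tyson GAY"),
  mark       = c(9.58, 9.63, 9.69, 9.71)
)

# Keep each athlete's best (lowest) mark, then rank athletes by it
best <- aggregate(mark ~ competitor, data = toy, FUN = min)
best <- best[order(best$mark), ]

print(best)
```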

Athletes With the Most Appearances in the Fastest Athletes List

Here I examine which athlete appears most often in the list of elite performances. In other words, which athlete has run the most races that make it into this dataset of the fastest 100 meters times? Michael Rodgers from the USA leads the way, with 267 races in the dataset.

my_100_dash_data %>% 
        
        count(competitor, sort = TRUE) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Most Appearances (Races) in the Top Sprinters List") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Most Appearances (Races) in the Top Sprinters List

competitor n
Michael RODGERS 267
Kim COLLINS 224
Asafa POWELL 196
Dennis MITCHELL 192
Michael FRATER 182
Frank FREDERICKS 173
Justin GATLIN 162
Francis OBIKWELU 161
Linford CHRISTIE 161
Bruny SURIN 155
################################################################################

The Age Structure of Athletes

This section examines the ages of the athletes featured in the data. The mean age is 25.12 years and the median is 24.65 years, with a standard deviation of 3.84 years. The Nigerian sprinter Chinedu ORIALA is the youngest athlete in the dataset at about 15 years, while Kim Collins is the oldest at around 41 years. Age here refers to the athlete’s age on the day of the race, not the athlete’s current age.

my_100_dash_data %>% 
        
        skim_without_charts(age_years, mark) %>% 
        
        kbl(., booktabs = TRUE, caption = "Summary Statistics for Athletes Age and Mark") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Summary Statistics for Athletes Age and Mark

skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
numeric age_years 181 0.9919196 25.11868 3.8350342 15.25804 22.3655 24.65161 27.41958 41.295
numeric mark 0 1.0000000 10.18856 0.0965855 9.58000 10.1400 10.21000 10.26000 10.300
################################################################################
## Youngest athlete 

my_100_dash_data[which.min(my_100_dash_data$age_years), ] %>% 
        
        kbl(., booktabs = TRUE, caption = "Youngest Athlete in the Dataset") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Youngest Athlete in the Dataset

rank mark wind competitor dob nat pos venue date results_score age_days age_years venue_country_code venue_country_name
18050 10.28 0 Chinedu ORIALA 1981-12-17 NGR h Benin City (NGR) 1997-03-21 1112 5573 days 15.25804 NGR Nigeria
################################################################################
## Oldest athlete 

my_100_dash_data[which.max(my_100_dash_data$age_years), ] %>% 
        
        kbl(., booktabs = TRUE, caption = "Oldest Athlete in the Dataset") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Oldest Athlete in the Dataset

rank mark wind competitor dob nat pos venue date results_score age_days age_years venue_country_code venue_country_name
13439 10.24 -0.3 Kim COLLINS 1976-04-05 SKN 2sr1 Freeport (BAH) 2017-07-22 1126 15083 days 41.295 BAH Bahamas

The figure below shows how the age structure of the athletes in this dataset has changed over time. Note that age is the interval between an athlete’s date of birth and the date of the race. Here, we see a slight increase in both the mean and median ages of athletes after 1980. Afterwards, the mean and median ages stabilize between 25 and 26 years.

my_100_dash_data %>% 
        
        group_by(year(date)) %>% 
        
        summarise(mean_age = mean(age_years, na.rm = TRUE),
                  
                  median_age = median(age_years, na.rm = TRUE)) %>% 
        
        rename(year = `year(date)`) %>% 
        
        pivot_longer(-year, names_to = "metric", values_to = "age") %>% 
        
        ggplot(mapping = aes(x = year, y = age, col = metric)) + 
        
        geom_line() + 
        
        ggthemes::theme_economist() +
        
        scale_colour_manual(values = c("red", "blue"))

################################################################################

Next, I examine the overall age structure of the athletes using a histogram. The plot shows that most athletes are in their early to mid-20s when they post leading times. After 30 years of age, appearances in the list decline markedly, with few exceptions.

my_100_dash_data %>% 
        
        ggplot(mapping = aes(x = age_years)) + 
        
        geom_histogram(col = "black", fill = "skyblue", binwidth = 1) + 
        
        ggthemes::theme_economist() + 
        
        labs(x = "Age in Years", y = "Count", 
             
             title = "Histogram of Age of Top 100 Metres Male Sprinters")

###############################################

Slight Detour: A Focus on Usain Bolt

In this section, I delve deeper into the performance of Usain Bolt, the most successful 100 meters athlete of all time. First, I examine the times Bolt posted in each year. The figure below suggests that the years in which Bolt recorded his best times also exhibit a higher standard deviation. The table below confirms this observation.

my_100_dash_data %>% 
        
        filter(competitor == "Usain BOLT") %>% 
        
        ggplot(mapping = aes(x = factor(year(date)), y = mark)) + 
        
        geom_boxplot(mapping = aes(fill = factor(year(date)))) + 
        
        geom_point() + 
        
        ggthemes::theme_economist() +
        
        theme(legend.position = "none") +
        
        labs(x = "Year", y = "Mark/Time in Seconds", 
             
             title = "Usain Bolt 100 Meters Races History 2007-2017")

my_100_dash_data %>%
        
        filter(competitor == "Usain BOLT") %>% 
        
        group_by(year = lubridate::year(date)) %>% 
        
        summarise(median = median(mark, na.rm = TRUE), 
                  
                  sd = sd(mark, na.rm = TRUE)) %>% 
        
        arrange(desc(sd)) %>% 
        
        kbl(., booktabs = TRUE, 
            
            caption = "Median and Standard Deviation for USAIN BOLT") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Median and Standard Deviation for USAIN BOLT

year median sd
2009 9.910 0.1808214
2008 9.850 0.1618404
2012 9.860 0.1440139
2016 10.010 0.1188036
2010 9.860 0.1171324
2011 9.910 0.1165782
2013 9.940 0.1047681
2015 9.870 0.0717635
2017 10.005 0.0539135
2007 10.030 NA
2014 9.980 NA
################################################################################

Age, Number of Races per Year and the Performance of Athletes

In this section, I use regression analysis to examine whether age and the number of races an athlete participates in during a given year have a bearing on performance. The age histogram above indicates that age matters for 100 meters sprinters, with peak performance observed in the mid-20s.

Similarly, there appears to be a relationship between the number of races and the times posted by athletes. The next figure shows this relationship. As the number of races increases, athletes generally perform better, but this trend levels off beyond a point. Note, however, that these models could suffer from omitted-variable bias and can only serve as a basis for further analysis.

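## Summarise each athlete's races within each calendar year
races_time <- my_100_dash_data %>% 
        
        group_by(competitor, year = year(date)) %>% 
        
        ## age = age_years returns one value per race, so the result keeps one
        ## row per race and repeats the yearly summaries on each of those rows
        summarise(races = n(),
                  
                  age = age_years,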
                  
                  best_time = min(mark),
                  
                  median_time = median(mark),
                  
                  mean_time = mean(mark), 
                  
                  max_time = max(mark))

###############################################################################
races_time %>% pivot_longer(-c("competitor", "year", "races", "age"),
                     
                     names_to = "perf", values_to = "time") %>% 
        
        ggplot(mapping = aes(x = races, y = time)) + 
        
        geom_hex(alpha = 0.5) +
        
        scale_fill_gradient(low = "grey", high = "red") +
        
        geom_point(shape = ".") + 
        
        geom_density_2d() + 
        
        geom_smooth(col = "green", lty = "dashed") + 
        
        labs(x = "Races", y = "Time in Seconds", 
             
             title = "Number of Races per Year vs Times Posted",
             
             caption = "John Karuitha, 2021") + 
        
        facet_wrap(~ perf) + 
        
        ggthemes::theme_clean()

################################################################################
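For readers who want a strictly one-row-per-athlete-year version of the summary above, the base R sketch below shows the idea on toy data. It is an illustration under assumed inputs (athletes A and B with made-up marks), not the code used to produce the results in this article.

```r
# Toy data: athlete A ran twice in 2009 and once in 2010; B ran twice in 2009
toy <- data.frame(
  competitor = c("A", "A", "A", "B", "B"),
  year       = c(2009, 2009, 2010, 2009, 2009),
  mark       = c(9.90, 10.00, 9.95, 10.10, 10.20)
)

# Number of races and best time for each athlete-year
races     <- aggregate(mark ~ competitor + year, data = toy, FUN = length)
best_time <- aggregate(mark ~ competitor + year, data = toy, FUN = min)
names(races)[3]     <- "races"
names(best_time)[3] <- "best_time"

# One row per athlete-year
per_year <- merge(races, best_time, by = c("competitor", "year"))
print(per_year)
```

Whether to model at the race level or the athlete-year level is a design choice; the athlete-year version avoids counting the same yearly summary several times.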

The regression analysis shows that both age and number of races significantly correlate with times posted. However, we should take this result with a grain of salt due to the possibility of omitted-variable bias.

races_lm <- lm(best_time ~ age + races, 
               
               data = races_time)

broom::tidy(races_lm) %>% 
        
        kbl(., booktabs = TRUE, caption = "Linear Model") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Linear Model

term estimate std.error statistic p.value
(Intercept) 10.2379013 0.0037943 2698.201922 0
age -0.0008326 0.0001524 -5.463868 0
races -0.0145648 0.0000990 -147.179969 0
races_gam <- mgcv::gam(best_time ~ s(age) + s(races), 
                 
                 data = races_time,
                 
                 family = gaussian)


broom::tidy(races_gam) %>% 
        
        kbl(., booktabs = TRUE, caption = "The GAM") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

The GAM

term edf ref.df statistic p.value
s(age) 8.629390 8.957311 10.74528 0
s(races) 8.673532 8.951942 3225.65214 0
################################################################################
stargazer::stargazer(performance::compare_performance(races_lm, races_gam))

Overall, the GAM outperforms the linear model on all metrics.

plot(performance::compare_performance(races_lm, races_gam))

################################################################################
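To illustrate why a GAM can beat a straight line when a relationship levels off, the sketch below fits both models to simulated data and compares them by AIC. Everything here is an assumption made for illustration: the data are simulated, the saturating shape and noise level are invented, and AIC stands in for the fuller set of metrics reported by `performance::compare_performance`.

```r
library(mgcv)  # recommended package shipped with standard R installations

set.seed(42)
n     <- 500
races <- runif(n, 1, 30)

# Simulated times that improve quickly at first and then level off (assumed shape)
time <- 10.3 - 0.25 * (1 - exp(-races / 5)) + rnorm(n, sd = 0.02)
sim  <- data.frame(races = races, time = time)

fit_lm  <- lm(time ~ races, data = sim)      # straight line
fit_gam <- gam(time ~ s(races), data = sim)  # smooth term for races

# Lower AIC indicates the better trade-off between fit and complexity
print(AIC(fit_lm, fit_gam))
```

On data with this kind of curvature the smooth term captures the plateau that a single slope cannot, so the GAM should show the lower AIC.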

Conclusion

This article examined the performance of male 100 meters athletes using data from World Athletics. Results show a gradual improvement in the times posted by athletes. The peak performance age for athletes is the mid-20s. The United States has the highest number of athletes in the dataset, followed by Jamaica. The races are very close, with just 0.72 seconds separating Usain Bolt’s world record from the slowest time in the dataset. The number of races an athlete runs in a year has a non-linear relationship with performance.