Background

I use data from World Athletics on the best times posted by male athletes in the 100 metres sprint from 1958 to the present. The data is available at https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior?regionType=world&timing=electronic&windReading=regular&page=21&bestResultsOnly=false&firstDay=1900-01-01&lastDay=2021-09-20.

Objectives

I examine the following questions in this article.

Which countries have produced the most successful 100 meters male sprinters?
Of the elite 100 meters sprinters, which have run the most races?
Is there a relationship between the age of a sprinter and the best times posted in the 100 meters sprint?
Is there a relationship between the number of races in a year and the time posted?

Summary of Results

The United States has the most male athletes who have posted best times in the 100 meters sprint. Jamaica comes a distant second.
Usain Bolt has run the three fastest times in the 100 meters sprint.
Michael Rodgers from the United States has run the highest number of races while posting some of the best times.
The peak performance age for male 100 meters sprinters is in the mid-20s.
There appears to be a non-linear relationship between the number of races and times posted.
Regression analysis (Linear Model and Generalized Additive Model, GAM) confirms that both age and number of races are significant drivers of athlete performance. However, the result may be affected by omitted variables bias.

Exploring the Data

The raw data consists of the following variables:

rank: Starting from the athlete that has posted the best time to date.
mark: The time (in seconds) posted by the athlete.
wind: The wind assist. Negative speeds indicate the athlete was running against the wind.
competitor: The name of the athlete.
dob: Date of birth of the athlete.
nat: Nationality of the athlete.
pos: Position of the athlete in the given race.
venue: Venue of the race.
date: Date the race happened.
results_score: The athlete’s score in the race by World Athletics.

Note that many athletes appear multiple times, once for each of their times that made the list. For instance, Usain Bolt posted the top 3 best times in different races.

read_chunk("code/sprints.R")

Exploring the Data

As noted, the data is available on the World Athletics website. At the time of writing (2022-02-27), it spans two hundred and twenty-four (224) web pages. Copying and pasting that much data by hand would take ages, so the first exercise is to scrape it.

Please refer to my previous project on web scraping, available at https://rpubs.com/Karuitha/web_scrapping_1. I outline the steps for scraping the data next.

Web Scraping

The scraping process takes considerable time. For this reason, I have commented out the last two lines of code, which do the scraping.

NOTE: To repeat the scraping process, remove the # before the two lines of code.

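## Packages used below; harmless to reload if code/sprints.R already attaches them
library(tidyverse)  # %>%, map_dfr(), read_csv(), write_csv()
library(rvest)      # read_html(), html_nodes(), html_table()

pages <- 1:225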

url <- "https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior?regionType=world&timing=electronic&windReading=regular&page="

url_2 <- "&bestResultsOnly=false&firstDay=1900-01-01&lastDay=2021-09-20"

################################################################################
## Scraping function: fetch one results page and return its table(s)

scrapper <- function(x){
        
        ## Pause between requests so as not to overload the server
        Sys.sleep(2)
        
        read_html(paste0(url, x, url_2)) %>% 
                
                html_nodes("table") %>% 
                
                html_table()
        
}

################################################################################
## Below is the code for web scraping. 
## I have commented it out as it takes time to run.
## I have saved the data in a .csv file: my_100_dash_data.csv
## You can uncomment it to rerun the harvesting of data from the web.
################################################################################

# my_100_dash_data <- pages %>% map_dfr(~ scrapper(.x))

#write_csv(my_100_dash_data, "my_100_dash_data.csv")

################################################################################
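
A scrape that runs over a couple of hundred pages can fail partway through if a single request times out. The sketch below shows one possible guard, which is not the approach used above: it assembles each page URL from the same two fragments and wraps the fetch in tryCatch so that a failed page yields NULL instead of stopping the run. The helpers build_page_url and safe_fetch and the stand-in fetcher fake_fetch are hypothetical names introduced for illustration; in practice the fetcher would be the scrapper function.

```r
## Hypothetical helpers, base R only; not part of the original script.
base_url <- "https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior?regionType=world&timing=electronic&windReading=regular&page="
url_tail <- "&bestResultsOnly=false&firstDay=1900-01-01&lastDay=2021-09-20"

## Assemble the URL for one page number.
build_page_url <- function(page) paste0(base_url, page, url_tail)

## Call `fetcher` on one page; return NULL (with a message) if it errors.
safe_fetch <- function(page, fetcher) {
  tryCatch(
    fetcher(page),
    error = function(e) {
      message("Page ", page, " failed: ", conditionMessage(e))
      NULL
    }
  )
}

## Stand-in fetcher for illustration: page 2 "fails", others return one row.
fake_fetch <- function(page) {
  if (page == 2) stop("timeout")
  data.frame(page = page, url = build_page_url(page))
}

results <- lapply(1:3, safe_fetch, fetcher = fake_fetch)

## Record which pages failed so they can be retried later.
failed <- which(vapply(results, is.null, logical(1)))

## Bind only the pages that came back.
scraped <- do.call(rbind, Filter(Negate(is.null), results))
```

Keeping the list of failed pages makes it possible to retry only those pages rather than repeating the whole run.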

Feature Engineering

The resultant dataset has 22,400 rows and 14 columns. The data cleaning process involves converting the date of birth (dob) and the race date (date) to the date format. After this, I do feature engineering, adding the following variables.

Age of athletes at the time of the race in days.
Age of athletes at the time of the race in years. I divided the age in days by 365.25 to get years.
Venue country code: The code of the country where the race happened.
Venue country name: I used the countrycode package in R to convert the country codes into country names. Where missing, I used information available on https://www.olympiandatabase.com/index.php?id=1670&L=1 to fill in the country names.

my_100_dash_data <- read_csv("data/my_100_dash_data.csv") %>% 
        
        clean_names() %>% 
        
        select(-x8) %>% 
        
        mutate(dob = lubridate::dmy(dob), 
               
               date = lubridate::dmy(date), 
               
               age_days = (date - dob),
               
               age_years = as.numeric(age_days / 365.25),
               
               venue_country_code = str_extract_all(venue, "\\([A-Z]*\\)"), 
               
               venue_country_code = str_remove_all(venue_country_code, "\\(|\\)")) %>% 
        
        mutate(venue_country_name = countrycode(venue_country_code, 
                                                
                                                origin = "ioc", 
                                                
                                                destination = "country.name")) %>% 
        
        mutate(venue_country_name = case_when(venue_country_code == "AHO" ~ "Netherlands Antilles",
                                              
                                              venue_country_code == "FRG" ~ "Germany",
                                              
                                              venue_country_code == "GDR" ~ "Germany",
                                              
                                              venue_country_code == "MAC" ~ "Macau",
                                              
                                              venue_country_code == "TCH" ~ "Czechia",
                                              
                                              venue_country_code == "TKS" ~ "Turks and Caicos Islands",
                                              
                                              venue_country_code == "URS" ~ "Russia",
                                              
                                              TRUE ~ venue_country_name
                                              
                                              ))

################################################################################
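
As a quick sanity check on the derived variables, the base R sketch below reproduces the age calculation and the venue country code for one row of the dataset: Kim Collins, born 1976-04-05, racing at Freeport (BAH) on 2017-07-22. It uses regmatches instead of the stringr calls above purely to stay dependency-free; the variable names are introduced here for illustration.

```r
dob       <- as.Date("1976-04-05")
race_date <- as.Date("2017-07-22")
venue     <- "Freeport (BAH)"

## Age at the time of the race, first in days and then in years.
age_days  <- as.numeric(race_date - dob)   # 15083
age_years <- age_days / 365.25             # about 41.295; 365.25 allows for leap years

## Pull the bracketed upper-case code out of the venue string.
code_with_brackets <- regmatches(venue, regexpr("\\([A-Z]+\\)", venue))
venue_country_code <- gsub("[()]", "", code_with_brackets)   # "BAH"
```

The results agree with the age_days, age_years and venue_country_code values reported for this race in the dataset.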

Missing Data

Only five variables have missing data: wind, dob, age_days, age_years, and pos. However, the extent of missingness is small, as the table below shows; wind, the most affected variable, is missing in under 2% of rows.

Amelia::missmap(my_100_dash_data)

Visualisation of Missing Data

sapply(my_100_dash_data, is.na) %>% 
        
        colSums() %>% 
        
        tibble(variables = names(my_100_dash_data), missing = .) %>% 
        
        arrange(desc(missing)) %>% 
        
        mutate(prop_percent = missing / nrow(my_100_dash_data) * 100) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Missing Data") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Missing Data

variables	missing	prop_percent
wind	431	1.9241071
dob	181	0.8080357
age_days	181	0.8080357
age_years	181	0.8080357
pos	50	0.2232143
rank	0	0.0000000
mark	0	0.0000000
competitor	0	0.0000000
nat	0	0.0000000
venue	0	0.0000000

################################################################################
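
The percentage column is the count of missing values divided by the number of rows. A compact base R equivalent is colMeans(is.na(x)) * 100, sketched below on a small made-up data frame (toy is hypothetical and not drawn from the dataset).

```r
## Toy data frame, invented for illustration only.
toy <- data.frame(
  wind = c(0.3, NA, -0.1, NA),
  pos  = c("1", "2", NA, "1"),
  mark = c(9.98, 10.05, 10.12, 10.20)
)

missing      <- colSums(is.na(toy))          # count of NAs per column
prop_percent <- colMeans(is.na(toy)) * 100   # share of NAs per column, in %

## Assemble a summary table sorted by the number of missing values.
summary_tbl <- data.frame(variables = names(toy), missing, prop_percent,
                          row.names = NULL)
summary_tbl <- summary_tbl[order(-summary_tbl$missing), ]
```

Applied to the full dataset, the same arithmetic gives 431 / 22400 * 100, or about 1.92%, for wind.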

Exploratory Data Analysis

This section begins by examining the distribution of times posted by the athletes. Next, I discuss the countries with the most athletes who posted the best times in the 100 metres dash.

Distribution of 100 Meters Male Sprint Best Times

The graph below shows the distribution of the times posted by male athletes in the 100 metres sprint. It shows the razor-thin margins that separate world beaters like Usain Bolt from other athletes. The difference between the world record and the slowest time in the dataset is 0.72 seconds.

my_100_dash_data %>% 
        
        ggplot(mapping = aes(x = mark)) + 
        
        geom_histogram(col = "black", fill = "skyblue", binwidth = 0.01) + 
        
        ggthemes::theme_economist() + 
        
        labs(x = "Mark", y = "Count", 
             
             title = "Histogram of Best Times of Top 100 Metres Male Sprinters")

################################################################################

I also examine the trend in world-leading times. The figure below shows a gradual decline in the best times posted by male sprinters over 100 meters. There is no telling whether this trend will continue.

## Improvements over time- best times per year

my_100_dash_data %>% 
        
        group_by(year = year(date)) %>% 
        
        summarise(year_record = min(mark)) %>% 
        
        ggplot(mapping = aes(x = year, y = year_record)) +
        
        geom_line(col = "blue") +
        
        labs(x = "Year", y = "Mark/Best Time", 
             
             title = "Trend in 100 Meters Men Best Times") + 
        
        ggthemes::theme_economist()

Most Successful Countries

The first table below counts entries in the dataset by nationality, and the United States leads. However, as noted earlier, athletes are repeated, because one athlete may have posted several qualifying times. In the second table, I remove the duplicates to count the distinct athletes per country.

my_100_dash_data %>% 
        
        count(nat, sort = TRUE) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Number of Athletes by Nationality") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Number of Athletes by Nationality

nat	n
USA	7086
JAM	2426
GBR	1648
NGR	1076
CAN	887
JPN	736
TTO	729
FRA	600
RSA	555
BRA	499

############
my_100_dash_data %>% 
        
        select(competitor, nat) %>% 
        
        filter(!duplicated(.)) %>% 
        
        count(nat) %>% 
        
        arrange(desc(n)) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Number of Distinct Athletes by Nationality") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Number of Distinct Athletes by Nationality

nat	n
USA	691
JAM	132
GBR	90
JPN	90
NGR	71
CAN	55
RSA	52
CHN	47
FRA	44
GER	42

################################################################################
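
The gap between the two tables comes from counting rows versus counting distinct athletes. The base R sketch below illustrates the difference on a handful of invented entries (the athlete labels are placeholders, not names from the dataset).

```r
## Toy entries, invented for illustration: one row per race result.
toy <- data.frame(
  competitor = c("Athlete A", "Athlete A", "Athlete A", "Athlete B", "Athlete C"),
  nat        = c("USA", "USA", "USA", "USA", "JAM")
)

## Entries per country: Athlete A is counted three times.
entries_by_nat <- table(toy$nat)

## Distinct athletes per country: drop repeated (competitor, nat) pairs first.
distinct_pairs  <- unique(toy[, c("competitor", "nat")])
athletes_by_nat <- table(distinct_pairs$nat)
```

In this toy example the USA has four entries but only two distinct athletes, which is the same effect that shrinks the USA count from 7086 entries to 691 athletes in the tables above.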

World Record Holders

The table below shows the ten best times posted in the 100 metres dash. Note that Usain Bolt appears in this list four times.

my_100_dash_data %>% 
        
        select(competitor, mark) %>% 
        
        arrange(mark) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, 
            
            caption = "Top 10 Best Times in 100 Meters Dash") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Top 10 Best Times in 100 Meters Dash

competitor	mark
Usain BOLT	9.58
Usain BOLT	9.63
Usain BOLT	9.69
Tyson GAY	9.69
Yohan BLAKE	9.69
Tyson GAY	9.71
Usain BOLT	9.72
Asafa POWELL	9.72
Asafa POWELL	9.74
Justin GATLIN	9.74

###########################################

Top 10 100 Metres Dash Athletes

It is common knowledge that Usain Bolt is the record holder, and he holds four of the ten best times in the 100 meters dash. But who are the other top contenders? I remove duplicates so that we have one entry per athlete, keeping each athlete's best time. With that, the top 10 athletes are in the table below.

my_100_dash_data %>% 
        
        select(competitor, mark) %>% 
        
        group_by(competitor) %>% 
        
        arrange(mark) %>% 
        
        slice(1) %>% 
        
        ungroup() %>% 
        
        arrange(mark) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Top 10 100 Meters Athletes") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Top 10 100 Meters Athletes

competitor	mark
Usain BOLT	9.58
Tyson GAY	9.69
Yohan BLAKE	9.69
Asafa POWELL	9.72
Justin GATLIN	9.74
Christian COLEMAN	9.76
Trayvon BROMELL	9.76
Ferdinand OMANYALA	9.77
Nesta CARTER	9.78
Maurice GREENE	9.79

################################################################################
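
Keeping one row per athlete amounts to taking the minimum mark within each competitor. A base R sketch of the same idea, using seven rows copied from the table of the ten best times above and aggregate in place of the group_by and slice steps:

```r
## Seven of the ten best times from the table above.
top_marks <- data.frame(
  competitor = c("Usain BOLT", "Usain BOLT", "Usain BOLT", "Tyson GAY",
                 "Tyson GAY", "Asafa POWELL", "Asafa POWELL"),
  mark       = c(9.58, 9.63, 9.69, 9.69, 9.71, 9.72, 9.74)
)

## Personal best = minimum mark within each competitor.
personal_best <- aggregate(mark ~ competitor, data = top_marks, FUN = min)

## Sort from fastest to slowest.
personal_best <- personal_best[order(personal_best$mark), ]
```

The result has one row each for Bolt (9.58), Gay (9.69) and Powell (9.72), in that order.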

Athletes With the Most Appearances in the Fastest Athletes List

Which athlete has the most races in this dataset of the fastest 100 meters times? Michael Rodgers of the USA leads the way, with 267 races fast enough to appear in the dataset.

my_100_dash_data %>% 
        
        count(competitor, sort = TRUE) %>% 
        
        head(10) %>% 
        
        kbl(., booktabs = TRUE, caption = "Most Appearances (Races) in the Top Sprinters List") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Most Appearances (Races) in the Top Sprinters List

competitor	n
Michael RODGERS	267
Kim COLLINS	224
Asafa POWELL	196
Dennis MITCHELL	192
Michael FRATER	182
Frank FREDERICKS	173
Justin GATLIN	162
Francis OBIKWELU	161
Linford CHRISTIE	161
Bruny SURIN	155

################################################################################

The Age Structure of Athletes

This section examines the ages of the athletes featured in the data. The mean age is 25.119 years and the median is 24.65, with a standard deviation of 3.835. The Nigerian sprinter Chinedu ORIALA is the youngest athlete in the dataset at about 15 years, while Kim Collins is the oldest at around 41 years. Age here means the athlete's age on the day of the race, not the athlete's current age.

my_100_dash_data %>% 
        
        skim_without_charts(age_years, mark) %>% 
        
        kbl(., booktabs = TRUE, caption = "Summary Statistics for Athletes Age and Mark") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Summary Statistics for Athletes Age and Mark

skim_type	skim_variable	n_missing	complete_rate	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100
numeric	age_years	181	0.9919196	25.11868	3.8350342	15.25804	22.3655	24.65161	27.41958	41.295
numeric	mark	0	1.0000000	10.18856	0.0965855	9.58000	10.1400	10.21000	10.26000	10.300

################################################################################
## Youngest athlete 

my_100_dash_data[which.min(my_100_dash_data$age_years), ] %>% 
        
        kbl(., booktabs = TRUE, caption = "Youngest Athlete in the Dataset") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Youngest Athlete in the Dataset

rank	mark	wind	competitor	dob	nat	pos	venue	date	results_score	age_days	age_years	venue_country_code	venue_country_name
18050	10.28	0	Chinedu ORIALA	1981-12-17	NGR	h	Benin City (NGR)	1997-03-21	1112	5573 days	15.25804	NGR	Nigeria

################################################################################
## Oldest athlete 

my_100_dash_data[which.max(my_100_dash_data$age_years), ] %>% 
        
        kbl(., booktabs = TRUE, caption = "Oldest Athlete in the Dataset") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Oldest Athlete in the Dataset

rank	mark	wind	competitor	dob	nat	pos	venue	date	results_score	age_days	age_years	venue_country_code	venue_country_name
13439	10.24	-0.3	Kim COLLINS	1976-04-05	SKN	2sr1	Freeport (BAH)	2017-07-22	1126	15083 days	41.295	BAH	Bahamas

The figures below show the age structure of the athletes in this dataset. Age is the interval between an athlete's date of birth and the date of the race. Both the mean and the median ages rise slightly after 1980 and then stabilize between 25 and 26 years.

my_100_dash_data %>% 
        
        group_by(year(date)) %>% 
        
        summarise(mean_age = mean(age_years, na.rm = TRUE),
                  
                  median_age = median(age_years, na.rm = TRUE)) %>% 
        
        rename(year = `year(date)`) %>% 
        
        pivot_longer(-year, names_to = "metric", values_to = "age") %>% 
        
        ggplot(mapping = aes(x = year, y = age, col = metric)) + 
        
        geom_line() + 
        
        ggthemes::theme_economist() +
        
        scale_colour_manual(values = c("red", "blue"))

################################################################################

Next, I examine the overall age structure of the athletes using a histogram. The plot shows that most athletes are in their early to mid-20s when they post leading times. Beyond the age of 30, far fewer athletes post leading times, with a few exceptions.

my_100_dash_data %>% 
        
        ggplot(mapping = aes(x = age_years)) + 
        
        geom_histogram(col = "black", fill = "skyblue", binwidth = 1) + 
        
        ggthemes::theme_economist() + 
        
        labs(x = "Age in Years", y = "Count", 
             
             title = "Histogram of Age of Top 100 Metres Male Sprinters")

###############################################

Slight Detour: A Focus on Usain Bolt

In this section, I look more closely at the performance of Usain Bolt, the most successful 100 meters athlete of all time. First, I examine the times Bolt posted in each year. The figure below suggests that the years in which Bolt recorded his best times also exhibit a higher standard deviation. The table below confirms this observation.

my_100_dash_data %>% 
        
        filter(competitor == "Usain BOLT") %>% 
        
        ggplot(mapping = aes(x = factor(year(date)), y = mark)) + 
        
        geom_boxplot(mapping = aes(fill = factor(year(date)))) + 
        
        geom_point() + 
        
        ggthemes::theme_economist() +
        
        theme(legend.position = "none") +
        
        labs(x = "Year", y = "Mark/Time in Seconds", 
             
             title = "Usain Bolt 100 Meters Races History 2007-2017")

my_100_dash_data %>%
        
        filter(competitor == "Usain BOLT") %>% 
        
        group_by(year = lubridate::year(date)) %>% 
        
        summarise(median = median(mark, na.rm = TRUE), 
                  
                  sd = sd(mark, na.rm = TRUE)) %>% 
        
        arrange(desc(sd)) %>% 
        
        kbl(., booktabs = TRUE, 
            
            caption = "Median and Standard Deviation for USAIN BOLT") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Median and Standard Deviation for USAIN BOLT

year	median	sd
2009	9.910	0.1808214
2008	9.850	0.1618404
2012	9.860	0.1440139
2016	10.010	0.1188036
2010	9.860	0.1171324
2011	9.910	0.1165782
2013	9.940	0.1047681
2015	9.870	0.0717635
2017	10.005	0.0539135
2007	10.030	NA
2014	9.980	NA

################################################################################
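
The NA entries for 2007 and 2014 arise because each of those years contributes a single race to the dataset: the sample standard deviation divides by n - 1 and is undefined for one observation. A minimal illustration, using the 2007 mark from the table as the lone value and two made-up marks for contrast:

```r
single_race <- c(10.03)         # the only 2007 mark in the table above
two_races   <- c(9.90, 10.00)   # made-up marks, for contrast only

sd_single <- sd(single_race)    # NA: n - 1 = 0, so the sample sd is undefined
sd_two    <- sd(two_races)      # defined once there are at least two marks
```

For the same reason, the median in those two rows is simply the single recorded time.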

Age, Number of Races per Year and the Performance of Athletes

In this section, I use regression analysis to examine whether age and the number of races an athlete runs in a given year have a bearing on performance. The figure below indicates that age matters for 100 meters sprinters, with peak performance observed in the mid-20s.

Similarly, there appears to be a relationship between the number of races and the times posted by athletes. The next figure shows this relationship. As the number of races increases, generally, athletes perform better. But this trend stabilizes beyond a point. Note, however, that these models could suffer from omitted variables bias and can only serve as a basis for further analysis.

races_time <- my_100_dash_data %>% 
        
        group_by(competitor, year = year(date)) %>% 
        
        ## NOTE: age is not aggregated, so summarise() keeps one row per race
        ## and repeats the per-year summaries on each of those rows
        summarise(races = n(),
                  
                  age = age_years,
                  
                  best_time = min(mark),
                  
                  median_time = median(mark),
                  
                  mean_time = mean(mark), 
                  
                  max_time = max(mark))

###############################################################################

races_time %>% pivot_longer(-c("competitor", "year", "races", "age"),
                     
                     names_to = "perf", values_to = "time") %>% 
        
        ggplot(mapping = aes(x = races, y = time)) + 
        
        geom_hex(alpha = 0.5) +
        
        scale_fill_gradient(low = "grey", high = "red") +
        
        geom_point(shape = ".") + 
        
        geom_density_2d() + 
        
        geom_smooth(col = "green", lty = "dashed") + 
        
        labs(x = "Races", y = "Time (Seconds)", 
             
             title = "No of Races per Year vs Best, Max, Mean and Median Times",
             
             caption = "John Karuitha, 2021") + 
        
        facet_wrap(~ perf) + 
        
        ggthemes::theme_clean()

################################################################################

The regression analysis shows that both age and number of races significantly correlate with times posted. However, we should take this result with a grain of salt due to the possibility of omitted variables bias.

races_lm <- lm(best_time ~ age + races, 
               
               data = races_time)

broom::tidy(races_lm) %>% 
        
        kbl(., booktabs = TRUE, caption = "Linear Model") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

Linear Model

term	estimate	std.error	statistic
(Intercept)	10.2379013	0.0037943	2698.201922
age	-0.0008326	0.0001524	-5.463868
races	-0.0145648	0.0000990	-147.179969
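
To make the coefficients concrete, the sketch below plugs them into the fitted equation best_time = intercept + b_age * age + b_races * races. The inputs (age 25, 10 races in a year) are illustrative values chosen here, and the helper predict_best_time is a hypothetical name; the output describes an association in this dataset, not a causal effect.

```r
## Coefficients copied from the linear model table above.
b0      <- 10.2379013
b_age   <- -0.0008326
b_races <- -0.0145648

## Fitted best time (seconds) for a given age and number of races in a year.
predict_best_time <- function(age, races) b0 + b_age * age + b_races * races

fitted_25_10 <- predict_best_time(age = 25, races = 10)   # about 10.071 seconds

## The change from one extra race, holding age fixed, equals b_races.
one_more_race <- predict_best_time(25, 11) - predict_best_time(25, 10)
```

Under this model, each additional race in a year is associated with a best time roughly 0.015 seconds faster, while an extra year of age is associated with a change of under a thousandth of a second.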

races_gam <- mgcv::gam(best_time ~ s(age) + s(races), 
                 
                 data = races_time,
                 
                 family = gaussian)


broom::tidy(races_gam) %>% 
        
        kbl(., booktabs = TRUE, caption = "The GAM") %>% 
        
        kable_classic(full_width = FALSE, latex_option = "hold_position")

The GAM

term	edf	ref.df	statistic	p.value
s(age)	8.629390	8.957311	10.74528	0
s(races)	8.673532	8.951942	3225.65214	0

################################################################################

stargazer::stargazer(performance::compare_performance(races_lm, races_gam))


Overall, the GAM outperforms the linear model on all metrics.

plot(performance::compare_performance(races_lm, races_gam))

################################################################################
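
For a rough sense of what such a comparison measures, the sketch below fits a linear model and a GAM to simulated data with a curved relationship and compares them on AIC, used here as one example metric. The data are simulated and unrelated to the sprint dataset, and the object names are chosen for illustration; the sketch assumes the mgcv package is installed, as it is in standard R distributions.

```r
library(mgcv)  # provides gam() and s()

set.seed(1)

## Simulated data with a curved relationship; not the sprint data.
sim   <- data.frame(x = runif(300, 0, 10))
sim$y <- sin(sim$x) + rnorm(300, sd = 0.2)

## A straight-line fit and a smooth fit to the same data.
fit_lm  <- lm(y ~ x, data = sim)
fit_gam <- gam(y ~ s(x), data = sim, family = gaussian)

## Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(fit_lm, fit_gam)
```

When the true relationship is curved, the smooth term lets the GAM track it and the AIC falls well below that of the straight-line fit, which is the pattern a non-linear relationship between races and times would produce.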

Conclusion

This article examined the performance of male 100 meters athletes using data from World Athletics. Results show a gradual improvement in times posted by athletes. The peak performance age for athletes is the mid-20s. The United States has the highest number of athletes in the dataset, followed by Jamaica. The races are very close, with just 0.72 seconds separating Usain Bolt's world record from the slowest time in the dataset. The number of races an athlete runs in a year has a non-linear relationship with performance.

Who’s the Fastest of All? Analyzing the 100 Metres Men’s Sprint Data

100 Metres Sprint Best Times, 1958 - Present

John Karuitha

2022-02-27
