I use data from World Athletics on the best times posted by male athletes in the 100 metres sprint from 1958 to the present. The data is available at https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior?regionType=world&timing=electronic&windReading=regular&page=21&bestResultsOnly=false&firstDay=1900-01-01&lastDay=2021-09-20.
Objectives
I examine the following questions in this article.
Summary of Results
Exploring the Data
The raw data consists of the following variables:
rank: The rank of the result, starting from the best time posted to date.
mark: The time (in seconds) posted by the athlete.
wind: The wind assist. Negative speeds indicate the athlete was running against the wind.
competitor: The name of the athlete.
dob: Date of birth of the athlete.
nat: Nationality of the athlete.
pos: Position of the athlete in the given race.
venue: Venue of the race.
date: Date the race happened.
results_score: The athlete’s score in the race, as assigned by World Athletics.
Note that many athletes appear multiple times because each row is a single race result rather than a single athlete. For instance, Usain Bolt posted the top 3 best times, each in a different race.
read_chunk("code/sprints.R")
As noted, the data is available on the World Athletics website. It spans two hundred and twenty-four (224) web pages at the time of writing (2022-02-27). Copying and pasting that much data by hand would take far too long. Hence, the first exercise is to scrape the data.
Please refer to my previous project on web scraping, available at https://rpubs.com/Karuitha/web_scrapping_1. I outline the steps in scraping the data next.
The scraping process takes considerable time. For this reason, I have commented out the last two lines of code that do the scraping.
NOTE: To repeat the scraping process, remove the # before the two lines of code.
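## Packages used by the scraping code
library(rvest)  # read_html(), html_nodes(), html_table()
library(purrr)  # map_dfr()
library(readr)  # write_csv()
pages <- 1:225
url <- "https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior?regionType=world&timing=electronic&windReading=regular&page="
url_2 <- "&bestResultsOnly=false&firstDay=1900-01-01&lastDay=2021-09-20"
################################################################################
## Scraping function: downloads one results page and returns its tables
scrapper <- function(x){
  Sys.sleep(2)  # pause between requests to avoid overloading the server
  read_html(paste0(url, x, url_2)) %>%
    html_nodes("table") %>%
    html_table()
}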
################################################################################
## Below is the code for web scraping.
## I have commented it out as it takes time to run.
## I have saved the data in a .csv file: my_100_dash_data.csv
## You can uncomment it to rerun the harvesting of data from the web.
################################################################################
# my_100_dash_data <- pages %>% map_dfr(~ scrapper(.x))
#write_csv(my_100_dash_data, "my_100_dash_data.csv")
################################################################################
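Since the full scrape runs over a couple of hundred requests, a single failed request can abort the whole run. The following is a minimal sketch, not the code used for this article, of how the page URLs are assembled and how a fetch could be retried on error. The names build_page_url(), with_retry() and flaky_fetch() are hypothetical, and the demonstration uses a stand-in fetcher so that it runs without network access.

```r
# Hypothetical helpers: none of these names appear in the original scraping code.
base_url <- "https://www.worldathletics.org/records/all-time-toplists/sprints/100-metres/outdoor/men/senior?regionType=world&timing=electronic&windReading=regular&page="
url_tail <- "&bestResultsOnly=false&firstDay=1900-01-01&lastDay=2021-09-20"

# Build the full URL for one page number
build_page_url <- function(page) paste0(base_url, page, url_tail)

# Call f(x) up to `attempts` times, waiting `wait` seconds after each failure.
# Returns NULL if every attempt raises an error (or if f itself returns NULL).
with_retry <- function(f, x, attempts = 3, wait = 0) {
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(x), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(wait)
  }
  NULL
}

# Stand-in fetcher (assumption for the demo): fails once, then succeeds
calls <- 0
flaky_fetch <- function(u) {
  calls <<- calls + 1
  if (calls < 2) stop("temporary failure")
  nchar(u)
}

page_urls <- vapply(1:3, build_page_url, character(1))
demo_result <- with_retry(flaky_fetch, page_urls[1])
```

In a real run, the stand-in fetcher would be replaced by a function that reads and parses the page, and wait would be set to a few seconds.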
The resultant dataset has 22,400 rows and, after the feature engineering described below, 14 columns. The data cleaning process involves converting the date of birth (dob) and the race date (date) to the date format. After this, I did feature engineering, adding the following variables.
Age of athletes at the time of the race in days.
Age of athletes at the time of the race in years. I divided the age in days by 365.25 to get years.
Venue country code: The code of the country where the race happened.
Venue country name: I used the countrycode package in R to convert the country codes into country names. Where missing, I used information available on https://www.olympiandatabase.com/index.php?id=1670&L=1 to fill in the country names.
my_100_dash_data <- read_csv("data/my_100_dash_data.csv") %>%
clean_names() %>%
select(-x8) %>%
mutate(dob = lubridate::dmy(dob),
date = lubridate::dmy(date),
age_days = (date - dob),
age_years = as.numeric(age_days / 365.25),
venue_country_code = str_extract_all(venue, "\\([A-Z]*\\)"),
venue_country_code = str_remove_all(venue_country_code, "\\(|\\)")) %>%
mutate(venue_country_name = countrycode(venue_country_code,
origin = "ioc",
destination = "country.name")) %>%
mutate(venue_country_name = case_when(venue_country_code == "AHO" ~ "Netherlands Antilles",
venue_country_code == "FRG" ~ "Germany",
venue_country_code == "GDR" ~ "Germany",
venue_country_code == "MAC" ~ "Macau",
venue_country_code == "TCH" ~ "Czechia",
venue_country_code == "TKS" ~ "Turks and Caicos Islands",
venue_country_code == "URS" ~ "Russia",
TRUE ~ venue_country_name
))
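As a quick check on the age and country-code steps, here is a small base R sketch that works through a single row by hand. The dates and venue are those of the youngest athlete's entry reported later in this article. The variable names are illustrative only, and the regular expression is a simplified stand-in for the stringr calls above.

```r
# One worked row: dob 1981-12-17, race on 1997-03-21 at "Benin City (NGR)"
dob <- as.Date("1981-12-17")
race_date <- as.Date("1997-03-21")

# Subtracting two Dates gives a difftime in days; convert to a plain number
age_days <- as.numeric(race_date - dob)
age_years <- age_days / 365.25

# Extract the bracketed upper-case country code, then drop the brackets
venue <- "Benin City (NGR)"
bracketed <- regmatches(venue, regexpr("\\([A-Z]+\\)", venue))
venue_country_code <- gsub("[()]", "", bracketed)
```

Here age_days comes to 5573 and age_years to about 15.258, which agrees with the values in that row.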
################################################################################
Only five variables have missing data: wind, dob, age_days, age_years, and pos. However, the extent of missingness is small, as the proportions in the table below show.
Amelia::missmap(my_100_dash_data)
Visualisation of Missing Data
sapply(my_100_dash_data, is.na) %>%
colSums() %>%
tibble(variables = names(my_100_dash_data), missing = .) %>%
arrange(desc(missing)) %>%
mutate(prop_percent = missing / nrow(my_100_dash_data) * 100) %>%
head(10) %>%
kbl(., booktabs = TRUE, caption = "Missing Data") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Missing Data
variables | missing | prop_percent |
---|---|---|
wind | 431 | 1.9241071 |
dob | 181 | 0.8080357 |
age_days | 181 | 0.8080357 |
age_years | 181 | 0.8080357 |
pos | 50 | 0.2232143 |
rank | 0 | 0.0000000 |
mark | 0 | 0.0000000 |
competitor | 0 | 0.0000000 |
nat | 0 | 0.0000000 |
venue | 0 | 0.0000000 |
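The percentages in the table are simply the count of missing values divided by the number of rows. A compact base R alternative to the pipeline above, shown here on a made-up toy data frame (the values are invented for illustration), is to take column means of the is.na() matrix.

```r
# Toy data (invented values) with some missing entries
toy <- data.frame(
  wind = c(0.3, NA, -0.1, NA),
  pos  = c("1", "2", NA, "1"),
  mark = c(9.9, 10.1, 10.2, 10.3),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; the column means are the missing shares
missing_pct <- colMeans(is.na(toy)) * 100
missing_sorted <- sort(missing_pct, decreasing = TRUE)

# The same arithmetic applied to the wind row of the table above
wind_pct <- 431 / 22400 * 100
```

For the toy data this gives 50% for wind, 25% for pos and 0% for mark, and wind_pct reproduces the 1.92% reported for wind.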
################################################################################
This section begins by examining the distribution of times posted by the athletes. Next, I discuss the countries with the most athletes who posted the best times in the 100 metres dash.
The graph below shows the distribution of the times posted by male athletes in the 100 metres sprint. It shows the razor-thin margins that separate world beaters like Usain Bolt from other athletes. The difference between the world record (9.58 seconds) and the slowest time recorded in the dataset (10.30 seconds) is 0.72 seconds.
my_100_dash_data %>%
ggplot(mapping = aes(x = mark)) +
geom_histogram(col = "black", fill = "skyblue", binwidth = 0.01) +
ggthemes::theme_economist() +
labs(x = "Mark", y = "Count",
title = "Histogram of Best Times of Top 100 Metres Male Sprinters")
################################################################################
I also examine the trend in world-leading times. The figure below shows a gradual decline in the best times posted by male sprinters over 100 meters. There is no telling whether this trend will continue.
## Improvements over time- best times per year
my_100_dash_data %>%
group_by(year = year(date)) %>%
summarise(year_record = min(mark)) %>%
ggplot(mapping = aes(x = year, y = year_record)) +
geom_line(col = "blue") +
labs(x = "Year", y = "Mark/Best Time",
title = "Trend in 100 Meters Men Best Times") +
ggthemes::theme_economist()
The table below shows the countries with the most entries among the top times. The United States leads by a wide margin. However, as noted earlier, athletes are repeated in the data, given that one athlete may have posted several qualifying times. In the next table, I remove the duplicates to get the countries with the most distinct athletes.
my_100_dash_data %>%
count(nat, sort = TRUE) %>%
head(10) %>%
kbl(., booktabs = TRUE, caption = "Number of Athletes by Nationality") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Number of Athletes by Nationality
nat | n |
---|---|
USA | 7086 |
JAM | 2426 |
GBR | 1648 |
NGR | 1076 |
CAN | 887 |
JPN | 736 |
TTO | 729 |
FRA | 600 |
RSA | 555 |
BRA | 499 |
############
my_100_dash_data %>%
select(competitor, nat) %>%
filter(!duplicated(.)) %>%
count(nat) %>%
arrange(desc(n)) %>%
head(10) %>%
kbl(., booktabs = TRUE, caption = "Number of Distinct Athletes by Nationality") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Number of Distinct Athletes by Nationality
nat | n |
---|---|
USA | 691 |
JAM | 132 |
GBR | 90 |
JPN | 90 |
NGR | 71 |
CAN | 55 |
RSA | 52 |
CHN | 47 |
FRA | 44 |
GER | 42 |
################################################################################
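The difference between the two tables above is the difference between counting rows and counting distinct (competitor, nat) pairs. The following base R sketch illustrates this on made-up data. The athlete names are invented and the approach is a stand-in for the dplyr code, not a replacement for it.

```r
# Toy entries (invented names): one athlete appears twice
entries <- data.frame(
  competitor = c("A RUNNER", "A RUNNER", "B SPRINTER", "C DASHER"),
  nat        = c("USA", "USA", "USA", "JAM"),
  stringsAsFactors = FALSE
)

# Counting rows gives entries per country
entries_per_country <- table(entries$nat)

# Dropping duplicate (competitor, nat) pairs first gives athletes per country
athletes <- unique(entries[, c("competitor", "nat")])
athletes_per_country <- table(athletes$nat)
```

In the toy data the USA has three entries but only two distinct athletes, which mirrors the gap between 7086 entries and 691 athletes in the real tables.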
The table below shows the ten best times posted in the 100 metres dash. Note that Usain Bolt appears in this list four times.
my_100_dash_data %>%
select(competitor, mark) %>%
arrange(mark) %>%
head(10) %>%
kbl(., booktabs = TRUE,
caption = "Top 10 Best Times in 100 Meters Dash") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Top 10 Best Times in 100 Meters Dash
competitor | mark |
---|---|
Usain BOLT | 9.58 |
Usain BOLT | 9.63 |
Usain BOLT | 9.69 |
Tyson GAY | 9.69 |
Yohan BLAKE | 9.69 |
Tyson GAY | 9.71 |
Usain BOLT | 9.72 |
Asafa POWELL | 9.72 |
Asafa POWELL | 9.74 |
Justin GATLIN | 9.74 |
###########################################
It is common knowledge that Usain Bolt is the record holder. He also holds four of the ten best times in the 100 meters dash. But who are the other top contenders? I remove duplicates so that we have one entry per athlete, namely each athlete's best time. With that, the top 10 athletes are in the table below.
my_100_dash_data %>%
select(competitor, mark) %>%
group_by(competitor) %>%
arrange(mark) %>%
slice(1) %>%
ungroup() %>%
arrange(mark) %>%
head(10) %>%
kbl(., booktabs = TRUE, caption = "Top 10 100 Meters Athletes") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Top 10 100 Meters Athletes
competitor | mark |
---|---|
Usain BOLT | 9.58 |
Tyson GAY | 9.69 |
Yohan BLAKE | 9.69 |
Asafa POWELL | 9.72 |
Justin GATLIN | 9.74 |
Christian COLEMAN | 9.76 |
Trayvon BROMELL | 9.76 |
Ferdinand OMANYALA | 9.77 |
Nesta CARTER | 9.78 |
Maurice GREENE | 9.79 |
################################################################################
Next, I examine which athlete appears in the list of elite performances the most times. In other words, which athlete has had the most races that appear in the dataset of the fastest 100 meters times? Here, Michael Rodgers from the USA leads the way, with 267 races in which he posted times fast enough to appear in this dataset.
my_100_dash_data %>%
count(competitor, sort = TRUE) %>%
head(10) %>%
kbl(., booktabs = TRUE, caption = "Most Appearances (Races) in the Top Sprinters List") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Most Appearances (Races) in the Top Sprinters List
competitor | n |
---|---|
Michael RODGERS | 267 |
Kim COLLINS | 224 |
Asafa POWELL | 196 |
Dennis MITCHELL | 192 |
Michael FRATER | 182 |
Frank FREDERICKS | 173 |
Justin GATLIN | 162 |
Francis OBIKWELU | 161 |
Linford CHRISTIE | 161 |
Bruny SURIN | 155 |
################################################################################
This section examines the ages of the athletes featured in the data. The average age is 25.119 years, while the median is 24.65, with a standard deviation of 3.835. The Nigerian sprinter Chinedu Oriala is the youngest athlete in the dataset at around 15 years, while Kim Collins is the oldest at around 41 years. Age here refers to the athlete’s age when they participated in a race and not how old the athlete is currently.
my_100_dash_data %>%
skim_without_charts(age_years, mark) %>%
kbl(., booktabs = TRUE, caption = "Summary Statistics for Athletes Age and Mark") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Summary Statistics for Athletes Age and Mark
skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
---|---|---|---|---|---|---|---|---|---|---|
numeric | age_years | 181 | 0.9919196 | 25.11868 | 3.8350342 | 15.25804 | 22.3655 | 24.65161 | 27.41958 | 41.295 |
numeric | mark | 0 | 1.0000000 | 10.18856 | 0.0965855 | 9.58000 | 10.1400 | 10.21000 | 10.26000 | 10.300 |
################################################################################
## Youngest athlete
my_100_dash_data[which.min(my_100_dash_data$age_years), ] %>%
kbl(., booktabs = TRUE, caption = "Youngest Athlete in the Dataset") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Youngest Athlete in the Dataset
rank | mark | wind | competitor | dob | nat | pos | venue | date | results_score | age_days | age_years | venue_country_code | venue_country_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18050 | 10.28 | 0 | Chinedu ORIALA | 1981-12-17 | NGR | h | Benin City (NGR) | 1997-03-21 | 1112 | 5573 days | 15.25804 | NGR | Nigeria |
################################################################################
## Oldest athlete
my_100_dash_data[which.max(my_100_dash_data$age_years), ] %>%
kbl(., booktabs = TRUE, caption = "Oldest Athlete in the Dataset") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Oldest Athlete in the Dataset
rank | mark | wind | competitor | dob | nat | pos | venue | date | results_score | age_days | age_years | venue_country_code | venue_country_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13439 | 10.24 | -0.3 | Kim COLLINS | 1976-04-05 | SKN | 2sr1 | Freeport (BAH) | 2017-07-22 | 1126 | 15083 days | 41.295 | BAH | Bahamas |
The figure below shows the age structure of the athletes in this dataset over time. Note that age is measured as the time between an athlete's date of birth and the date of the race. Here, we see a slight increase in both the mean and median ages of athletes after 1980. Afterwards, the mean and median ages stabilize between 25 and 26 years.
my_100_dash_data %>%
group_by(year(date)) %>%
summarise(mean_age = mean(age_years, na.rm = TRUE),
median_age = median(age_years, na.rm = TRUE)) %>%
rename(year = `year(date)`) %>%
pivot_longer(-year, names_to = "metric", values_to = "age") %>%
ggplot(mapping = aes(x = year, y = age, col = metric)) +
geom_line() +
ggthemes::theme_economist() +
scale_colour_manual(values = c("red", "blue"))
################################################################################
Next, I examine the overall age structure of the athletes using a histogram. The plot shows that most athletes tend to be in their early to mid-20s when they post leading times. After 30 years of age, the number of leading times posted declines markedly, with minor exceptions.
my_100_dash_data %>%
ggplot(mapping = aes(x = age_years)) +
geom_histogram(col = "black", fill = "skyblue", binwidth = 1) +
ggthemes::theme_economist() +
labs(x = "Age in Years", y = "Count",
title = "Histogram of Age of Top 100 Metres Male Sprinters")
###############################################
In this section, I delve deeper into the performance of Usain Bolt, the most successful 100 meters athlete of all time. First, I examine the times Bolt posted in each year. The figure below shows that the years in which Bolt recorded his best times also exhibit a higher standard deviation. The table below confirms this observation.
my_100_dash_data %>%
filter(competitor == "Usain BOLT") %>%
ggplot(mapping = aes(x = factor(year(date)), y = mark)) +
geom_boxplot(mapping = aes(fill = factor(year(date)))) +
geom_point() +
ggthemes::theme_economist() +
theme(legend.position = "none") +
labs(x = "Year", y = "Mark/Time in Seconds",
title = "Usain Bolt 100 Meters Races History 2007-2017")
my_100_dash_data %>%
filter(competitor == "Usain BOLT") %>%
group_by(year = lubridate::year(date)) %>%
summarise(median = median(mark, na.rm = TRUE),
sd = sd(mark, na.rm = TRUE)) %>%
arrange(desc(sd)) %>%
kbl(., booktabs = TRUE,
caption = "Median and Standard Deviation for USAIN BOLT") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Median and Standard Deviation for USAIN BOLT
year | median | sd |
---|---|---|
2009 | 9.910 | 0.1808214 |
2008 | 9.850 | 0.1618404 |
2012 | 9.860 | 0.1440139 |
2016 | 10.010 | 0.1188036 |
2010 | 9.860 | 0.1171324 |
2011 | 9.910 | 0.1165782 |
2013 | 9.940 | 0.1047681 |
2015 | 9.870 | 0.0717635 |
2017 | 10.005 | 0.0539135 |
2007 | 10.030 | NA |
2014 | 9.980 | NA |
################################################################################
In this section, I use regression analysis to examine whether age and the number of races an athlete participates in during a given year have a bearing on their performance. The figure below indicates that age matters for 100 meters sprinters, with peak performance observed in the mid-20s.
Similarly, there appears to be a relationship between the number of races and the times posted by athletes. The next figure shows this relationship. As the number of races increases, generally, athletes perform better. But this trend stabilizes beyond a point. Note, however, that these models could suffer from omitted variables bias and can only serve as a basis for further analysis.
races_time <- my_100_dash_data %>%
group_by(competitor, year(date)) %>%
rename(year = `year(date)`) %>%
summarise(races = n(),
age = age_years,
best_time = min(mark),
median_time = median(mark),
mean_time = mean(mark),
max_time = max(mark))
###############################################################################
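One point to keep in mind when reading the models that follow: because summarise() above is given the non-scalar column age = age_years, some dplyr versions return one row per race rather than one row per athlete-year, with the yearly aggregates repeated on each row. The following is a hedged base R sketch of an alternative, not the code behind the tables in this article, that collapses invented toy data to exactly one row per athlete-year using the mean age in that year. All names and values are illustrative.

```r
# Toy race-level data (invented values)
races <- data.frame(
  competitor = c("A", "A", "A", "B", "B", "A"),
  year       = c(2010, 2010, 2010, 2010, 2010, 2011),
  age_years  = c(24.1, 24.3, 24.5, 30.0, 30.2, 25.2),
  mark       = c(10.05, 9.98, 10.10, 10.20, 10.25, 10.00),
  stringsAsFactors = FALSE
)

# One key per athlete-year; tapply() returns groups in sorted key order
key <- paste(races$competitor, races$year)
athlete_year <- data.frame(
  key       = sort(unique(key)),
  races     = as.vector(tapply(races$mark, key, length)),
  age       = as.vector(tapply(races$age_years, key, mean)),
  best_time = as.vector(tapply(races$mark, key, min)),
  stringsAsFactors = FALSE
)
```

A data frame shaped like athlete_year could then be passed to lm(best_time ~ age + races, data = athlete_year), so that each athlete-year contributes a single observation. The estimates would likely differ from those reported in this article.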
races_time %>% pivot_longer(-c("competitor", "year", "races", "age"),
names_to = "perf", values_to = "time") %>%
ggplot(mapping = aes(x = races, y = time)) +
geom_hex(alpha = 0.5) +
scale_fill_gradient(low = "grey", high = "red") +
geom_point(shape = ".") +
geom_density_2d() +
geom_smooth(col = "green", lty = "dashed") +
labs(x = "Races", y = "Time (seconds)",
title = "No of Races per Year vs Best, Max, Mean and Median Times",
caption = "John Karuitha, 2021") +
facet_wrap(~ perf) +
ggthemes::theme_clean()
################################################################################
The regression analysis shows that both age and number of races significantly correlate with times posted. However, we should take this result with a grain of salt due to the possibility of omitted variables bias.
races_lm <- lm(best_time ~ age + races,
data = races_time)
broom::tidy(races_lm) %>%
kbl(., booktabs = TRUE, caption = "Linear Model") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
Linear Model
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 10.2379013 | 0.0037943 | 2698.201922 | 0 |
age | -0.0008326 | 0.0001524 | -5.463868 | 0 |
races | -0.0145648 | 0.0000990 | -147.179969 | 0 |
races_gam <- mgcv::gam(best_time ~ s(age) + s(races),
data = races_time,
family = gaussian)
broom::tidy(races_gam) %>%
kbl(., booktabs = TRUE, caption = "The GAM") %>%
kable_classic(full_width = FALSE, latex_option = "hold_position")
The GAM
term | edf | ref.df | statistic | p.value |
---|---|---|---|---|
s(age) | 8.629390 | 8.957311 | 10.74528 | 0 |
s(races) | 8.673532 | 8.951942 | 3225.65214 | 0 |
################################################################################
stargazer::stargazer(performance::compare_performance(races_lm, races_gam))
Overall, the GAM outperforms the linear model on all metrics.
plot(performance::compare_performance(races_lm, races_gam))
################################################################################
This article examined the performance of male 100 meters athletes using data from World Athletics. Results show a gradual improvement in times posted by athletes. The peak performance age for athletes is the mid-20s. The United States has the highest number of athletes in the dataset, followed by Jamaica. The races are very close, with only 0.72 seconds separating Usain Bolt's world record from the slowest time in the dataset. The number of races an athlete runs in a year has a non-linear relationship with performance.