At the inaugural modern Olympic Games in 1896, 17 men raced from the Marathon Bridge to the Olympic Stadium in Athens. The race honored the legend of a Greek foot soldier, Pheidippides, who was sent from Marathon to Athens with news of the victory over the Persian army. The winner of the first marathon race was a 24-year-old Greek man named Spyridon Louis, with a time of two hours, fifty-eight minutes, and fifty seconds. This race inspired John Graham and Herbert H. Holton to recreate the event in the city of Boston.
The Boston Marathon was founded in 1897 and was organized by US Olympic Team manager John Graham and Boston businessman Herbert H. Holton. The two men decided on a 24.5-mile road course from Metcalf’s Mill in Ashland to the Irvington Oval in Boston. On April 19, 1897, 15 participants entered the first Boston Marathon. John J. McDermott won the race with a time of 2:55:10. More than 120 years later, the Boston Marathon is still a competitive event attracting more than twenty thousand participants every year.
The Boston Marathon course was lengthened in 1924 to meet the Olympic standard of 42.195 kilometers (26 miles, 385 yards). The male and female course record holders for the Boston Marathon are Geoffrey Mutai and Buzunesh Deba. On April 18, 2011, Geoffrey Mutai from Kenya set a new course record for men with a time of 2:03:02. On April 21, 2014, Buzunesh Deba from Ethiopia set a new course record for women with a time of 2:19:59.
I recently started running and have thought about entering a 5K race to test myself. The marathon is one of the ultimate challenges for distance runners. It is considered a great test of muscular and respiratory endurance.
What does it take to be an elite runner in the Boston Marathon? The term elite in this exercise means finishing in the fastest 25% of one’s gender, that is, with an official time at or below the 25th percentile. I will use data from runners who participated in the Boston Marathon between 2015 and 2017 to attempt to answer this question.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.0.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(countrycode)
## Warning: package 'countrycode' was built under R version 4.0.5
The 2015, 2016, and 2017 Boston Marathon datasets were acquired from https://www.kaggle.com/rojour/boston-results.
#set up work environment
setwd("C:/Users/mivul/OneDrive/Desktop/Data 110/Datasets")
#upload dataset csv files
boston17 <- read.csv("marathon_results_2017.csv")
boston16 <- read.csv("marathon_results_2016.csv")
boston15 <- read.csv("marathon_results_2015.csv")
The boston17 dataset is being used as an example to demonstrate the structure of the datasets. Each dataset has around twenty-six thousand observations. Each observation references a runner that participated in the Boston Marathon. The 2015 and 2017 datasets each contain 25 variables, while the 2016 dataset has 24 variables. The 2016 dataset doesn’t include the number of participants as a variable.
str(boston17)
## 'data.frame': 26410 obs. of 25 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Bib : chr "11" "17" "23" "21" ...
## $ Name : chr "Kirui, Geoffrey" "Rupp, Galen" "Osako, Suguru" "Biwott, Shadrack" ...
## $ Age : int 24 30 25 32 31 40 33 28 27 28 ...
## $ M.F : chr "M" "M" "M" "M" ...
## $ City : chr "Keringet" "Portland" "Machida-City" "Mammoth Lakes" ...
## $ State : chr "" "OR" "" "CA" ...
## $ Country : chr "KEN" "USA" "JPN" "USA" ...
## $ Citizen : chr "" "" "" "" ...
## $ X.1 : chr "" "" "" "" ...
## $ X5K : chr "0:15:25" "0:15:24" "0:15:25" "0:15:25" ...
## $ X10K : chr "0:30:28" "0:30:27" "0:30:29" "0:30:29" ...
## $ X15K : chr "0:45:44" "0:45:44" "0:45:44" "0:45:44" ...
## $ X20K : chr "1:01:15" "1:01:15" "1:01:16" "1:01:19" ...
## $ Half : chr "1:04:35" "1:04:35" "1:04:36" "1:04:45" ...
## $ X25K : chr "1:16:59" "1:16:59" "1:17:00" "1:17:00" ...
## $ X30K : chr "1:33:01" "1:33:01" "1:33:01" "1:33:01" ...
## $ X35K : chr "1:48:19" "1:48:19" "1:48:31" "1:48:58" ...
## $ X40K : chr "2:02:53" "2:03:14" "2:03:38" "2:04:35" ...
## $ Pace : chr "0:04:57" "0:04:58" "0:04:59" "0:05:03" ...
## $ Proj.Time : chr "-" "-" "-" "-" ...
## $ Official.Time: chr "2:09:37" "2:09:58" "2:10:28" "2:12:08" ...
## $ Overall : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Division : int 1 2 3 4 5 1 6 7 8 9 ...
X: Running count of participants (row index, starting at 0)
Bib: Assigned race number based on qualifying time. “F” could appear for female elites.
Name: Name of runner (Last, First)
Age: Age of runner on the day of the race
M.F: Runner’s gender (M or F)
Country: Runner’s country of residence
City: Runner’s city of residence
State: Runner’s state of residence (if applicable)
Citizen: Runner’s nationality
X.1: Unknown
X5K, X10K, X15K, X20K, Half, X25K, X30K, X35K, X40K: Elapsed time at each 5-kilometer checkpoint and at the halfway point
Proj.Time: Expected time of participant finishing the race
Pace: Average pace over the race (minutes per mile)
Official.Time: Final time in which the participant finished the race
Overall: Finishing place among all participants
Gender: Finishing place within the runner’s gender
Division: Finishing place within the runner’s age division
The three datasets will be merged to observe the overall top performances from the runners who participated in the marathon. Adjustments to the table were made for analytic purposes.
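Before the project code, here is a minimal sketch of the general idea on toy data: drop the columns that are not shared, check that the remaining column names agree, and only then stack the rows. The data frames `y17` and `y16` and the helper `drop_cols` are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: drop the named columns from a data frame
drop_cols <- function(df, cols) {
  df[, setdiff(names(df), cols), drop = FALSE]
}

# Toy stand-ins for two yearly files with different extra columns
y17 <- data.frame(X = 0:1, Name = c("A", "B"), X.1 = c("", ""), Age = c(30L, 41L))
y16 <- data.frame(Name = "C", X = "", Age = 52L)

# Remove the columns that differ between the files
parts <- list(drop_cols(y17, c("X", "X.1")), drop_cols(y16, "X"))

# Fail early if the remaining column names do not line up
same_names <- vapply(parts, function(p) identical(names(p), names(parts[[1]])), logical(1))
stopifnot(all(same_names))

# Stack the rows
merged <- do.call(rbind, parts)
print(merged)
```

Checking the names before calling `rbind` is a design choice: it turns a silent column mismatch into an immediate error.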
#Merge the datasets
nbos <- boston17 %>%
#Merge the dataset by stacking the rows on top of each other
rbind(boston15) %>%
#remove columns that do not match in order to merge dataset
select(!c(X.1, X)) %>%
#Merge the dataset by stacking the rows on top of each other
rbind(select(boston16, !X))
#Convert variables from characters to time variables
nbos$Official.Time<-as.POSIXct(x = nbos$Official.Time, format = "%H:%M:%S")
nbos$M.F <-as.factor(nbos$M.F)
nbos$X5K<-as.POSIXct(nbos$X5K, format="%H:%M:%S")
nbos$X10K<-as.POSIXct(nbos$X10K, format="%H:%M:%S")
nbos$X15K<-as.POSIXct(nbos$X15K, format="%H:%M:%S")
nbos$X20K<-as.POSIXct(nbos$X20K, format="%H:%M:%S")
nbos$X25K<-as.POSIXct(nbos$X25K, format="%H:%M:%S")
nbos$X30K<-as.POSIXct(nbos$X30K, format="%H:%M:%S")
nbos$X35K<-as.POSIXct(nbos$X35K, format="%H:%M:%S")
nbos$X40K<-as.POSIXct(nbos$X40K, format="%H:%M:%S")
#Convert Age variable from characters to numeric
nbos$Age <-as.numeric(nbos$Age)
#Create Age groups
setDT(nbos)[Age <1, agegroup := "0-1"]
nbos[Age >17 & Age <40, agegroup := "18-39"]
nbos[Age >39 & Age <45, agegroup := "40-44"]
nbos[Age >44 & Age <50, agegroup := "45-49"]
nbos[Age >49 & Age <55, agegroup := "50-54"]
nbos[Age >54 & Age <60, agegroup := "55-59"]
nbos[Age >59 & Age <65, agegroup := "60-64"]
nbos[Age >64 & Age <70, agegroup := "65-69"]
nbos[Age >69 & Age <75, agegroup := "70-74"]
nbos[Age >74 & Age <80, agegroup := "75-79"]
nbos[Age >79, agegroup := "80+"]
#Convert three letter country code to country name
nbos$Country<-countrycode(nbos$Country, "wb", "country.name", nomatch = NULL)
The filter function will be used on the new table to separate the participants by gender. The quantile function is used to find the official time at the 25th percentile for men and for women.
#filter nbos for male
m_nbos <-nbos %>%
filter(M.F == "M")
#filter nbos for female
f_nbos <- nbos %>%
filter(M.F == "F")
#Find the quantile for male and female official times
quantile(m_nbos$Official.Time, na.rm = TRUE)
## 0% 25% 50%
## "2021-04-21 02:09:17 EDT" "2021-04-21 03:13:10 EDT" "2021-04-21 03:35:17 EDT"
## 75% 100%
## "2021-04-21 04:05:31 EDT" "2021-04-21 08:25:09 EDT"
quantile(f_nbos$Official.Time)
## 0% 25% 50%
## "2021-04-21 02:21:52 EDT" "2021-04-21 03:37:46 EDT" "2021-04-21 03:55:38 EDT"
## 75% 100%
## "2021-04-21 04:23:30 EDT" "2021-04-21 10:30:23 EDT"
A new table is created for the 25th percentile. The table will be used for comparison analysis in the other sections.
#create new table that only has the participants in the 25th percentile
nbos_25 <- nbos %>%
#filter for times less than or equal to the threshold for men and women in the 25th percentile
filter(M.F == "M" & Official.Time <="2021-04-21 03:13:10" | M.F == "F" & Official.Time <="2021-04-21 03:37:46")
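The thresholds above carry a date because `as.POSIXct` attaches the day on which the script was run, so the hard-coded strings only work on that day. A minimal sketch of one way to avoid this, assuming times are stored as seconds instead: the helper `hms_to_seconds` and the toy data frame are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: convert "H:MM:SS" strings to seconds (NA if malformed)
hms_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 3) return(NA_real_)
    sum(as.numeric(p) * c(3600, 60, 1))
  }, numeric(1))
}

# Toy data: four men and four women
toy <- data.frame(
  M.F = c("M", "M", "M", "M", "F", "F", "F", "F"),
  Official.Time = c("2:10:00", "3:00:00", "3:30:00", "4:00:00",
                    "2:30:00", "3:20:00", "3:50:00", "4:30:00"),
  stringsAsFactors = FALSE
)
toy$secs <- hms_to_seconds(toy$Official.Time)

# 25th percentile cutoff computed separately for each gender
toy$cutoff <- ave(toy$secs, toy$M.F,
                  FUN = function(s) quantile(s, 0.25, na.rm = TRUE))

# Keep runners at or below their gender's cutoff
toy_elite <- toy[!is.na(toy$secs) & toy$secs <= toy$cutoff, ]
print(toy_elite)
```

Because the cutoff is computed from the data, no threshold needs to be typed in by hand.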
The complete participant table and the 25th percentile table will be used to conduct a comparison analysis of what makes an elite runner.
Observation of the distribution of all participants’ ages by gender.
#plot a histogram to view the distribution of ages for all participants
ggplot(nbos, aes(x = Age, fill = M.F)) +
#create histogram with bin width of five
geom_histogram(binwidth = 5, color = "black")+
#set theme
theme_classic()+
#set color palette
scale_fill_manual(values = c("yellow","blue")) +
#separate the graph by gender
facet_wrap( ~M.F) +
#set title
ggtitle("Age Distribution of All Participants")+
#remove legend
theme(legend.position = "none")
The histogram for all participants displays that women between the ages of 15 to 35 have higher levels of participation compared to men in the same age interval. The median age of male participants in the event is between 45 and 50 years of age. The median age of female participants in the event is between 40 and 45 years of age.
#calculate median age of all female participants
median(f_nbos$Age)
## [1] 40
#calculate median age of all male participants
median(m_nbos$Age)
## [1] 45
The median age calculation confirmed that the median age for all female and male participants is 40 and 45, respectively.
Next will be the observation of the distribution of only elite participants’ ages by gender.
#plot a histogram to view the distribution of ages for elite participants
ggplot(nbos_25, aes(x = Age, fill = M.F)) +
#create histogram with bin width of five
geom_histogram(binwidth = 5, color = "black")+
#set theme
theme_classic()+
#set color palette
scale_fill_manual(values = c("yellow", "blue")) +
#separate the graph by gender
facet_wrap(~M.F)+
#set title
ggtitle("Age Distribution of Elite Participants")+
#remove legend
theme(legend.position = "none")
The histogram for elite participants suggests that the median age of elite female athletes is between 35 and 40. The median age of elite male athletes also appears to be between 35 and 40.
#filter 25th percentile table to observe only men
m_nbos_25 <-nbos_25 %>%
filter(M.F == "M")
#filter 25th percentile table to observe only women
f_nbos_25 <-nbos_25 %>%
filter(M.F == "F")
#calculate median age of elite male participants
median(m_nbos_25$Age)
## [1] 36
#calculate median age of elite female participants
median(f_nbos_25$Age)
## [1] 34
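The median age calculation by gender shows that the median age for elite males and females is 36 and 34, respectively. The male value matches the histogram, while the female value falls just below the 35 to 40 range read from it.
In both groups, the median age is lower for women than for men. The gap between all participants and elite runners is also smaller for women (6 years) than for men (9 years).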
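The comparison of medians can also be put into a single table. The following is a minimal sketch on toy data, assuming a logical `elite` column that flags elite runners; that column and the toy values are hypothetical and not part of the original tables.

```r
# Toy data: gender, age, and an assumed flag for elite runners
toy <- data.frame(
  M.F = c("F", "F", "F", "M", "M", "M"),
  Age = c(30, 40, 50, 35, 45, 60),
  elite = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Median age by gender for all runners and for elite runners only
all_med <- aggregate(Age ~ M.F, data = toy, FUN = median)
elite_med <- aggregate(Age ~ M.F, data = toy[toy$elite, ], FUN = median)

# Join the two summaries and compute the gap in years
cmp <- merge(all_med, elite_med, by = "M.F", suffixes = c("_all", "_elite"))
cmp$gap <- cmp$Age_all - cmp$Age_elite
print(cmp)
```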
The next observation is of the twenty most common countries of residence among participants. The United States had the greatest number of participants in the event. It has been removed from the graphs in order to better visualize the other countries.
nbos %>%
#remove the United States
filter(Country != "United States")%>%
group_by(M.F) %>%
#count number of countries
count(Country)%>%
#filter for only male
filter(M.F == "M")%>%
#arrange in descending order
arrange(desc(n)) %>%
#get top 20
head(20) %>%
ggplot(aes(x=Country, y = n, fill = M.F)) +
#create bar graph
geom_bar(stat = "identity", color = "black") +
#rotate the graph
coord_flip() +
#set theme
theme_classic()+
#set color palette
scale_fill_manual(name = "Gender", values = c("blue"))+
#set title
ggtitle("Top 20 Countries of Male Runners") +
#set y axis label
ylab("Number of Runners")+
#remove legend
theme(legend.position = "none")
nbos %>%
#remove the United States
filter(Country != "United States")%>%
group_by(M.F) %>%
#count number of countries
count(Country)%>%
#filter for only female
filter(M.F == "F")%>%
#arrange in descending order
arrange(desc(n)) %>%
#get top 20
head(20) %>%
ggplot(aes(x=Country, y = n, fill = M.F)) +
#create bar graph
geom_bar(stat = "identity", color = "black") +
#rotate graph
coord_flip() +
#set theme
theme_classic()+
#set color palette
scale_fill_manual(name = "Gender", values = c("yellow"))+
#set title
ggtitle("Top 20 Countries of Female Runners") +
#set y axis label
ylab("Number of Runners")+
#remove legend
theme(legend.position = "none")
nbos_25 %>%
#remove the United States
filter(Country != "United States")%>%
group_by(M.F) %>%
#count number of countries
count(Country)%>%
#filter for only male
filter(M.F == "M")%>%
#arrange in descending order
arrange(desc(n)) %>%
#get top 20
head(20) %>%
ggplot(aes(x=Country, y = n, fill = M.F)) +
#create bar graph
geom_bar(stat = "identity", color = "black") +
#rotate graph
coord_flip() +
#set theme
theme_classic()+
#set color palette
scale_fill_manual(name = "Gender", values = c("blue"))+
#set title
ggtitle("Top 20 Countries of Elite Male Runners") +
#set y axis label
ylab("Number of Runners")+
#remove legend
theme(legend.position = "none")
nbos_25 %>%
#remove the United States
filter(Country != "United States")%>%
group_by(M.F) %>%
#count number of countries
count(Country)%>%
#filter for only female
filter(M.F == "F")%>%
#arrange in descending order
arrange(desc(n)) %>%
head(20) %>%
ggplot(aes(x=Country, y = n, fill = M.F)) +
geom_bar(stat = "identity", color = "black") +
coord_flip() +
#set theme
theme_classic()+
#set color palette
scale_fill_manual(name = "Gender", values = c("yellow"))+
#set title
ggtitle("Top 20 Countries of Elite Female Runners") +
#set y axis label
ylab("Number of Runners")+
#remove legend
theme(legend.position = "none")
According to the graphs, the top three countries of residence for all runners outside the United States are Canada, Mexico, and the United Kingdom.
#create table with top 20 countries of all participants
nbos_m_c<-m_nbos %>%
filter(Country != "United States")%>%
count(Country)%>%
arrange(desc(n)) %>%
head(20)
nbos_f_c<-f_nbos %>%
filter(Country != "United States")%>%
count(Country)%>%
arrange(desc(n)) %>%
head(20)
#create table with top 20 countries of elite participants
elite_m_c<-m_nbos_25 %>%
filter(Country != "United States")%>%
count(Country)%>%
arrange(desc(n)) %>%
head(20)
elite_f_c<- f_nbos_25 %>%
filter(Country != "United States")%>%
count(Country)%>%
arrange(desc(n)) %>%
head(20)
#conduct anti joins to identify countries only in the elite graph
elite_m_c %>%
anti_join(nbos_m_c, by = "Country")
## Country n
## 1: CRC 37
## 2: Poland 25
elite_f_c%>%
anti_join(nbos_f_c, by = "Country")
## Country n
## 1: Ethiopia 16
## 2: GUA 13
## 3: Kenya 11
## 4: Argentina 10
The United Kingdom, Canada, Mexico, and Brazil have the highest representation of male elite runners by country, outside the United States. The countries in the elite male graph that are not displayed in the all-male participant graph are Poland and Costa Rica (CRC).
The United Kingdom, Canada, Mexico, and Australia have the highest representation of female elite runners by country, outside the United States. The countries in the elite female graph that are not displayed in the all-female participant graph are Ethiopia, Guatemala (GUA), Kenya, and Argentina.
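The top-country tables above repeat the same steps four times. As a rough sketch, the counting could be wrapped in one function and the comparison done with `setdiff()`. The function `top_countries` and the toy data are hypothetical; this base R version is only an illustration of the idea, not the code used for the results above.

```r
# Hypothetical helper: the n most common countries, excluding some
top_countries <- function(df, n = 20, exclude = "United States") {
  keep <- df$Country[!is.na(df$Country) & !(df$Country %in% exclude)]
  counts <- sort(table(keep), decreasing = TRUE)
  out <- data.frame(Country = as.character(names(counts)),
                    n = as.integer(counts),
                    stringsAsFactors = FALSE)
  head(out, n)
}

# Toy data for all runners and for elite runners
all_runners <- data.frame(
  Country = c(rep("Canada", 5), rep("Mexico", 4), rep("Japan", 3),
              "Kenya", "United States")
)
elite_runners <- data.frame(
  Country = c(rep("Canada", 3), rep("Kenya", 2), "Mexico")
)

top_all <- top_countries(all_runners, n = 3)
top_elite <- top_countries(elite_runners, n = 3)

# Countries in the elite top list that are missing from the overall top list
only_in_elite <- setdiff(top_elite$Country, top_all$Country)
print(only_in_elite)
```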
The boxplot will be used to evaluate the distribution of official times by the age groups set by the event.
ggplot(nbos, aes( x= agegroup, y = Official.Time, fill = M.F)) +
#create boxplot
geom_boxplot( alpha = .5) +
#rotate graph
coord_flip()+
#set theme
theme_classic()+
#set color palette
scale_fill_manual(values = c("yellow","blue"))+
#separate graph by gender
facet_wrap(~M.F) +
#set y axis label
ylab ("Official Time (HH:MM)")+
#set x axis label
xlab("Age Group (Year)") +
ggtitle("All Participants")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
ggplot(nbos_25, aes( x= agegroup, y = Official.Time, fill = M.F)) +
#create boxplot
geom_boxplot( alpha = .5) +
#rotate graph
coord_flip()+
#set theme
theme_classic()+
#set color palette
scale_fill_manual(values = c("yellow","blue"))+
#separate graph by gender
facet_wrap(~M.F) +
#set y axis label
ylab ("Official Time (HH:MM)")+
#set x axis label
xlab("Age Group (Year)") +
ggtitle("Elite Participants")
The boxplot displays the distribution of official finishing times by age group. The median time tends to increase with age group for all participants. This is expected, since runners generally slow down as they get older.
The median official time for all runners appears to fall within an interval of 2 hours for both women and men across all age groups. The median official time for elite runners appears to fall within an interval of 15 minutes for both women and men across all age groups, with an official time between 3 hours and 3 and a half hours.
The elite runners must maintain a strong pace in order to keep the median official time within a small interval. A heatmap will be used to observe the pace of the runners, measured as the average time by age at each 5-kilometer checkpoint. First we will observe the pace of all the runners.
#Create a table of average checkpoint times by age
nbosm<- nbos %>%
#select the age and checkpoint columns for the new table
select(Age, X5K, X10K,X15K, X20K, X25K, X30K, X35K, X40K ) %>%
#group observations by age
group_by(Age) %>%
#average the time at each checkpoint within each age
summarise(across(.cols = c(X5K, X10K,X15K, X20K, X25K, X30K, X35K, X40K) , .fns = mean, na.rm = TRUE)) %>%
#arrange the observations in descending order by age
arrange(desc(Age))
#reorder the table in ascending order by age
nbosm <- nbosm[order(nbosm$Age),]
#Convert tibble to a Dataframe
nbosm <- as.data.frame(nbosm)
#Set the Age column as the row names
row.names(nbosm) <- nbosm$Age
#subset the table for the checkpoint columns that will go into the matrix
nbosm_subset <- nbosm[,2:9]
#Create Matrix
g3_matrix <- data.matrix(nbosm_subset)
# parameter for RowSideColors: interpolate the 9-color YlGnBu palette to one color per age
varcols = setNames(colorRampPalette(brewer.pal(9, "YlGnBu"))(nrow(g3_matrix)), rownames(g3_matrix))
#Create Heatmap
heatmap(g3_matrix, Rowv = NA,
Colv = NA,
col= colorRampPalette(brewer.pal(9, "YlGnBu"))(nrow(g3_matrix)),
scale="column",
margins=c(10,15),
main = "Average Pace by Age",
xlab ="Kilometer Intervals",
ylab="Age",
cexCol=1, cexRow =1, RowSideColors = varcols)
Checkpoint times tend to increase with age, meaning the pace slows, as expected. Runners at the age of 18 have a slower pace than other runners under the age of 58. This could be due to inexperience in running a marathon. The pace of runners above the age of 68 displays a lot of variation and will need further investigation.
Next we will observe the pace of the elite runners at the 5-kilometer checkpoints.
#Create a table of average checkpoint times by age for elite runners
nbosm_25<- nbos_25 %>%
#select the age and checkpoint columns for the new table
select(Age, X5K, X10K,X15K, X20K, X25K, X30K, X35K, X40K ) %>%
#group observations by age
group_by(Age) %>%
#average the time at each checkpoint within each age
summarise(across(.cols = c(X5K, X10K,X15K, X20K, X25K, X30K, X35K, X40K) , .fns = mean, na.rm = TRUE)) %>%
#arrange the observations in descending order by age
arrange(desc(Age))
#reorder the table in ascending order by age
nbosm_25 <- nbosm_25[order(nbosm_25$Age),]
#Convert tibble to a Dataframe
nbosm_25 <- as.data.frame(nbosm_25)
#Set the Age column as the row names
row.names(nbosm_25) <- nbosm_25$Age
#subset the table for the checkpoint columns that will go into the matrix
nbosm_25_subset <- nbosm_25[,2:9]
#Create Matrix
g2_matrix <- data.matrix(nbosm_25_subset)
# parameter for RowSideColors: interpolate the 9-color YlGnBu palette to one color per age
varcols = setNames(colorRampPalette(brewer.pal(9, "YlGnBu"))(nrow(g2_matrix)), rownames(g2_matrix))
#Create Heatmap
heatmap(g2_matrix, Rowv = NA,
Colv = NA,
col= colorRampPalette(brewer.pal(9, "YlGnBu"))(nrow(g2_matrix)),
scale="column",
margins=c(10,15),
main = "Elite Average Pace by Age",
xlab ="Kilometer Intervals",
ylab="Age",
cexCol=1, cexRow =1, RowSideColors = varcols)
The heatmap of elite runners displays that runners under the age of 34 seem to maintain a faster pace. Some participants around the age of 58 and 69 also display a fast pace throughout the race. The 18-year-old participants in the elite group all seem to have a slow time compared to the rest of the elite group. Unlike the heatmap covering all participants, the elite heatmap shows that all the elite runners are running around the same pace.
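The heatmaps are built from cumulative checkpoint times, so each column reflects the time elapsed since the start rather than the speed over that stretch. A possible refinement, sketched here on toy data, is to difference consecutive checkpoints to get a pace per 5-kilometer segment. The data frame `splits`, its values in seconds, and the segment column names are all hypothetical.

```r
# Toy cumulative checkpoint times in seconds for three runners
splits <- data.frame(
  Age = c(25, 25, 60),
  X5K = c(1200, 1320, 1500),
  X10K = c(2400, 2700, 3060),
  X15K = c(3660, 4140, 4740)
)
checkpoints <- c("X5K", "X10K", "X15K")
cum <- as.matrix(splits[, checkpoints])

# Time spent on each 5 km segment: first checkpoint, then successive differences
seg <- cbind(cum[, 1],
             cum[, -1, drop = FALSE] - cum[, -ncol(cum), drop = FALSE])
colnames(seg) <- c("km0_5", "km5_10", "km10_15")

# Convert seconds per 5 km into minutes per kilometer
seg_pace <- as.data.frame(seg / 5 / 60)

# Average segment pace by age
by_age <- aggregate(seg_pace, by = list(Age = splits$Age), FUN = mean)
print(by_age)
```

A matrix built from `by_age` could then be passed to `heatmap()` in the same way as the cumulative times.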
It is expected that a faster pace (fewer minutes per mile) corresponds to a lower official time. We will observe this relationship with a scatter plot of the participants.
nbos %>%
group_by(Division, M.F) %>%
slice_min(Official.Time)%>%
ggplot(aes(y= Official.Time, x = Pace, col= M.F))+
#create scatter plot
geom_point()+
#separate graph by gender
facet_wrap(~M.F)+
theme_dark()+
#remove y and x axis ticks and labels
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank())+
#set color palette
scale_color_manual(values = c("yellow","blue"))+
#set y axis label
ylab("Official Time (Hour)")+
#set x axis label
xlab("Pace (Minute)")+
#set title
ggtitle("Pace v Official Time")+
#remove legend
theme(legend.position = "none")
According to the graph, pace and official time have a positive correlation: a faster pace results in a faster official time, as expected. The points at the upper end of each graph suggest that the trend may begin to rise more steeply.
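The strength of this relationship can be quantified with a correlation coefficient once both columns are numeric. The following is a minimal sketch on toy values, assuming pace and official time are stored as "H:MM:SS" strings; the helper `to_seconds` and the four toy rows are hypothetical.

```r
# Hypothetical helper: convert "H:MM:SS" strings to seconds
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)),
         numeric(1))
}

# Toy runners: pace per mile and official time
toy <- data.frame(
  Pace = c("0:04:57", "0:06:52", "0:08:00", "0:09:10"),
  Official.Time = c("2:09:37", "3:00:00", "3:29:36", "4:00:10"),
  stringsAsFactors = FALSE
)

# Pearson correlation between pace and official time, both in seconds
r <- cor(to_seconds(toy$Pace), to_seconds(toy$Official.Time))
print(r)
```

Since average pace is the official time divided by the race distance, a correlation very close to 1 is what one would expect.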
We will bring the points discussed in the previous sections into one visualization to determine what makes an elite runner. In this visualization, we will focus on the outliers identified in the previous boxplot. We will assess the characteristics of these runners.
An interactive boxplot was created to get the lower fence time of each age group.
nbos_25 %>%
#filter only male
filter(M.F == "M") %>%
#set x and y axis and color variable
plot_ly( y =~ Official.Time, x =~ agegroup, color =~ M.F) %>%
#create boxplot with blue aesthetic
add_boxplot(colors = "blue") %>%
#set labels for x-axis, y-axis, and title in layout
layout(xaxis = list(title = "Age Group (Year)"),
yaxis = list(title = "Official Time (HH:MM)"),
title = "Male Runners")
nbos_25 %>%
#filter only female
filter(M.F == "F") %>%
#set x and y axis and color variable
plot_ly( y =~ Official.Time, x =~ agegroup, color =~ M.F) %>%
#create boxplot with yellow aesthetic
add_boxplot(colors = "yellow") %>%
#set labels for x-axis, y-axis, and title in layout
layout(xaxis = list(title = "Age Group (Year)"),
yaxis = list(title = "Official Time (HH:MM)"),
title = "Female Runners")
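The lower fences can also be computed directly instead of being read from the hover labels. A minimal sketch, assuming the common definition of the lower fence as the first quartile minus 1.5 times the interquartile range: the helper `lower_fence` and the toy times in seconds are hypothetical, and plotly may compute quartiles slightly differently, so the values need not match its labels exactly.

```r
# Hypothetical helper: Q1 - 1.5 * IQR
lower_fence <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[1] - 1.5 * (q[2] - q[1])
}

# Toy official times in seconds for two age groups
toy <- data.frame(
  agegroup = rep(c("18-39", "40-44"), each = 6),
  secs = c(8000, 10000, 10200, 10400, 10600, 10800,
           9000, 11000, 11100, 11200, 11300, 11400),
  stringsAsFactors = FALSE
)

# One fence per age group, then matched back to each runner
fences <- tapply(toy$secs, toy$agegroup, lower_fence)
toy$fence <- as.numeric(fences[as.character(toy$agegroup)])

# Runners faster than their group's lower fence
outliers <- toy[toy$secs < toy$fence, ]
print(outliers)
```

Computing the fences from the data avoids typing each threshold by hand for every age group.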
The lower fence and age group will be used to filter for the outliers to be displayed in the interactive scatter plot.
m_nbos_25 %>%
#filter for outliers
filter( agegroup == "18-39" & Official.Time < "2021-04-21 02:29:20" | agegroup == "40-44" & Official.Time < "2021-04-21 02:39:39" | agegroup == "45-49" & Official.Time < "2021-04-21 02:58:48" | agegroup == "50-54" & Official.Time < "2021-04-21 02:59:35" | agegroup == "60-64" & Official.Time < "2021-04-21 02:59:40") %>%
#Set axis and color variable
plot_ly(x =~ Official.Time, y=~ Age, color = ~agegroup, #Set hover information to text
hoverinfo = "text",
#set text information
text = ~paste("Name: ", Name,"<br>", "Country: ", Country,"<br>", "Pace: ", Pace, "<br>", "Official Time: ", Official.Time)) %>%
#create scatter plot
add_markers() %>%
layout(yaxis = list(title = "Age (Year)"),
xaxis = list(title = "Official Time (HH:MM)"),
title = "Outlier Male Runners")
f_nbos_25 %>%
#filter for outliers
filter(agegroup == "18-39" & Official.Time < "2021-04-21 02:56:26" |agegroup == "40-44" & Official.Time < "2021-04-21 03:06:30" |agegroup == "45-49" & Official.Time < "2021-04-21 03:10:41" | agegroup == "50-54" & Official.Time < "2021-04-21 03:05:09" |agegroup == "55-59" & Official.Time < "2021-04-21 03:13:56") %>%
plot_ly(x =~ Official.Time, y=~ Age, color = ~agegroup,
#Set hover information to text
hoverinfo = "text",
#set text information
text = ~paste("Name: ", Name,"<br>", "Country: ", Country,"<br>", "Pace: ", Pace, "<br>", "Official Time: ", Official.Time)) %>%
#create scatter plot
add_markers() %>%
layout(yaxis = list(title = "Age (Year)"),
xaxis = list(title = "Official Time (HH:MM)"),
title = "Outlier Female Runners")
The analysis revealed that age is a factor in performance, but older participants have shown the ability to be elite runners. A strong pace is required to achieve a fast official time in the race. The heatmap showed that an elite runner needs to maintain a strong pace throughout the whole race. A runner needs to aim for a time between 3 hours and 3 and a half hours to complete a marathon and meet the elite median standard. The analysis of the outliers showed that age alone does not prevent a runner from being a top competitor, but to win the marathon one will need to exceed the standards observed in the analysis.
Home: Boston Athletic Association. (n.d.). Retrieved April 18, 2021, from https://www.baa.org/
Rojour. (2017, April 29). Finishers Boston Marathon 2015, 2016 & 2017. Retrieved April 18, 2021, from https://www.kaggle.com/rojour/boston-results