3/10/2022

Description of Data

The dataset used in this analysis is MarathonData.csv from https://www.kaggle.com/girardi69/marathon-time-predictions.

This dataset contains information about 87 athletes who participated in the 2017 Prague Marathon.

Training data was collected with run-tracking apps such as Garmin and Strava.

The data includes finishing times of each athlete at the marathon, their average speed in km/h over the last 4 weeks of training, their average distance per week in km over the last 4 weeks of training, their age, and their gender.

Goal of Analysis

Attempt to find a linear equation that would allow athletes to accurately predict their marathon finishing time based on the average speed and distance of their training.

Having an accurate equation to predict finishing time would allow athletes to optimize training or set realistic goal times.

Cleaning Data

I noticed the average speed for one of the athletes was listed as 11,125 km/h, which was a clear outlier from the rest of the data. This outlier was removed, and the rest of the analysis will not include this data point.

library(dplyr)
# Drop the athlete whose average speed was recorded as 11,125 km/h
filtered = data %>%
  filter(sp4week < 10000)

Pie Chart of Finishing Times

The split between the four groups is close to equal, with the 3:00-3:20 group having the most athletes and the 3:40-4:00 group having the fewest.

Note that there are no times above 4:00, even though many marathon runners do not finish in under 4 hours, so the sample skews toward faster runners.
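
The grouping behind the pie chart could be reproduced along these lines. This is a minimal sketch, assuming MarathonTime is stored in decimal hours and that the four groups are under 3:00, 3:00-3:20, 3:20-3:40, and 3:40-4:00. Only the 3:00-3:20 and 3:40-4:00 labels appear above, so the other two labels, the break points, and the helper names time_group and group_counts are assumptions.

```r
# Hypothetical helper: bin finishing times (decimal hours) into four groups.
# The break points are assumed, not taken from the original analysis.
time_group <- function(hours) {
  cut(hours,
      breaks = c(-Inf, 3, 3 + 20 / 60, 3 + 40 / 60, 4),
      labels = c("under 3:00", "3:00-3:20", "3:20-3:40", "3:40-4:00"),
      right = FALSE,          # intervals are [lower, upper)
      include.lowest = TRUE)  # so exactly 4.0 falls in the last group
}

# Count athletes per group; works on any numeric vector of hours.
group_counts <- function(hours) table(time_group(hours))
```

With the cleaned data, pie(group_counts(filtered$MarathonTime)) would then draw the chart.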

3D Plot

It looks as if there is a possible linear relationship between speed and finishing time, as well as distance and finishing time.

Barchart of Ages

There is about an even split between men under 40 and men over 40.

There are only 4 women in the dataset, and all of them are under 40.

Plots of linear regression

Upon graphing the line of best fit for training distance per week vs finishing time and for training speed vs finishing time, neither seems to be a good fit.

R Code for analyzing linear regression

The lm function is used to find the line of best fit for distance vs finishing time. The R squared value and p value are stored for analysis.

The same code is used for finding the line of best fit for speed vs finishing time, but with a different variable used in the lm function.

fit = lm(MarathonTime ~ km4week, data = filtered)
rSquared = summary(fit)$r.squared
pValue = summary(fit)$coefficients[2, 4]  # p value of the km4week slope
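
For the speed model described above, the same three steps might look like the following. It is a sketch rather than the original code: the helper name simple_fit_stats is hypothetical, and it assumes the data frame has the MarathonTime, km4week, and sp4week columns used elsewhere in this analysis.

```r
# Hypothetical helper: fit MarathonTime against one predictor column and
# return the R squared value and the p value of that predictor's slope.
simple_fit_stats <- function(df, predictor) {
  model_formula <- reformulate(predictor, response = "MarathonTime")
  fit <- lm(model_formula, data = df)
  fit_summary <- summary(fit)
  list(
    rSquared = fit_summary$r.squared,
    pValue = fit_summary$coefficients[predictor, "Pr(>|t|)"]
  )
}
```

Calling simple_fit_stats(filtered, "sp4week") should then give the speed results, and simple_fit_stats(filtered, "km4week") the distance results.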

Linear regression using distance vs Finishing Time

rSquared
## [1] 0.3607888
pValue
## [1] 9.736639e-10
  • The regression on average training distance is statistically significant at the 5% level, but it accounts for only 36.08% of the variance in finishing time, so it is a weak fit.

Linear regression using speed vs Finishing time

rSquared
## [1] 0.3907496
pValue
## [1] 1.248621e-10
  • The regression on average training speed is statistically significant at the 5% level, but it accounts for only 39.07% of the variance in finishing time, so it is a weak fit.

Multiple Linear Regression using speed and distance vs Finishing Time

rSquared
## [1] 0.6321664
speedPValue
## [1] 1.447912e-11
distancePValue
## [1] 1.096832e-10

  • Both variables are statistically significant at the 5% level, and 63.2% of variance is accounted for.
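
The code for the two-predictor model is not shown, so the following is only a sketch of how these values might be obtained. The output names rSquared, speedPValue, and distancePValue come from the results above; the helper name multi_fit_stats is hypothetical, and the column names are assumed to match those used in the earlier code.

```r
# Sketch of the two-predictor model; not the original code.
multi_fit_stats <- function(df) {
  fit <- lm(MarathonTime ~ sp4week + km4week, data = df)
  coef_table <- summary(fit)$coefficients
  list(
    rSquared = summary(fit)$r.squared,
    speedPValue = coef_table["sp4week", "Pr(>|t|)"],
    distancePValue = coef_table["km4week", "Pr(>|t|)"],
    coefficients = coef(fit)  # intercept, speed slope, distance slope
  )
}
```

Indexing the coefficient table by row name rather than by position keeps each p value tied to the right predictor if the formula order changes.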

Conclusions

Using linear regression with only training speed or only training distance was not accurate: each model explained less than 40% of the variance in finishing time.

Using multiple linear regression with training speed and training distance was moderately accurate, with 63.22% of variance accounted for.

Although the multiple linear regression method wasn’t completely accurate, it could be used for a rough estimate of marathon finishing time.

Equation: Marathon time (hours) = 5.7405 - 0.1647 × (average training speed in km/h) - 0.0069 × (average training distance per week in km)
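
As a worked example, the equation can be wrapped in a small function. This is a sketch: the function names predict_marathon_hours and format_hours are hypothetical, and the inputs of 12 km/h and 60 km per week are illustrative values, not taken from the dataset.

```r
# Coefficients copied from the fitted equation above.
predict_marathon_hours <- function(speed_kmh, distance_km_per_week) {
  5.7405 - 0.1647 * speed_kmh - 0.0069 * distance_km_per_week
}

# Hypothetical helper: format decimal hours as h:mm.
format_hours <- function(hours) {
  total_minutes <- as.integer(round(hours * 60))
  sprintf("%d:%02d", total_minutes %/% 60L, total_minutes %% 60L)
}

# Illustrative inputs, not taken from the dataset.
predicted <- predict_marathon_hours(12, 60)
format_hours(predicted)
```

For an athlete averaging 12 km/h over 60 km per week, this gives 5.7405 - 1.9764 - 0.4140 = 3.3501 hours, or roughly 3:21. Because the model leaves over a third of the variance unexplained, the result is best read as a rough estimate.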