Antoinette G. Anastacio (s4137216)
Last updated: 29 May, 2026
Running continues to be one of the most accessible forms of exercise around the world. Every year, millions of runners join the most famous long-distance running events, the World Marathon Majors. These are 42-kilometre races held in six major cities around the globe: Boston, Tokyo, London, Berlin, Chicago, and New York.
Unfortunately, the COVID-19 pandemic, declared in March 2020, fundamentally disrupted daily life worldwide. As the pandemic spread, governments in different countries established public health restrictions, including stay-at-home orders, limits on outdoor activity, and venue closures. The level of restriction in a country is measured by its stringency index, with 0 being the lowest and 100 being the highest (Hale et al., 2021).
Since running is mostly an outdoor activity, these COVID-19 restrictions may have significantly affected athletes’ running behaviour. Running behaviour in Asia is particularly compelling to study, given that Asia was the first continent to identify the coronavirus (WHO, 2020).
Problem Statement
This report investigates the following research question:
Did government COVID-19 lockdown stringency significantly affect the weekly running distances of athletes in Asia during 2020, compared to their pre-pandemic running behaviour in 2019?
Methodology
To address this question, the following steps have been applied:
Exploratory Data Analysis: Descriptive statistics and visualisations are used to characterise the dataset and identify any possible trends or patterns in running behaviour from 2019 to 2020.
Hypothesis Testing: A two-sample t-test is conducted to determine whether the mean weekly distances of Asian athletes differed significantly between the pre-pandemic (2019) and pandemic (2020) periods.
Linear Regression: A multiple linear regression model is fitted to examine the extent to which government lockdown stringency, gender, age group, and country can predict the weekly running distance of Asian athletes in 2020.
Two datasets were used and merged for this report.
Main Datasets
Running data: Sourced from Kaggle — Long-Distance Running Dataset (Mexwell, 2023). This dataset contains 10.7 million training records of 36,412 athletes all around the world (129 countries), from 2019 to 2020. Data was sourced via web scraping of a major social network for athletes. Notably, all runners here have run at least 1 major marathon (42km run). For the purposes of this report, only athletes from Asia have been included in the analysis.
COVID Stringency Index: Sourced from Our World in Data (2021). This Oxford COVID-19 Government Response Stringency Index is a score from 0 (no restrictions) to 100 (strictest restrictions), reflecting the strictness of government responses including lockdowns, school/workplace closures, and travel restrictions. The stringency index has been aggregated to a weekly level to align with the weekly running data.
Data Integration
The two datasets were merged by matching each athlete’s country of origin with the corresponding stringency index value for the week of their run. All 2019 records were assigned a stringency value of NA, given that the stringency index only exists from 2020, during the COVID-19 pandemic.
Key Variables
The following variables were used throughout the report:
| Variable | Type | Description |
|---|---|---|
| year | Year | Year of the training activity (2019 or 2020) |
| datetime | Date | Start of the week |
| athlete | Numeric (int) | Computer-generated ID for the athlete |
| gender | Factor | ‘Male’ or ‘Female’ |
| age_group | Factor (ordered levels) | ‘18 - 34’, ‘35 - 54’, ‘55 +’ |
| country | Character | Athlete’s country of origin |
| distance | Numeric (km) | Weekly running distance covered by the athlete |
| stringency_index | Numeric (0–100) | Government restriction severity during this week |
| lockdown_level | Factor (ordered levels) | Level of lockdown based on stringency index |
First, the running dataset was loaded and preprocessed before being merged with the stringency dataset. Pre-processing was the most time-consuming part of the report, since the dataset contains millions of rows.
Load the Data
2019 and 2020 datasets were loaded, merged into one dataframe, and examined.
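#load the packages used in this report
library(tidyverse) #read_csv(), dplyr verbs, ggplot2
library(lubridate) #year(), isoweek()
library(countrycode) #countrycode()
library(car) #leveneTest()
library(patchwork) #side-by-side plots with p1 | p2
#get the files for running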
run_2019 <- read_csv("run_ww_2019_w.csv")
run_2020 <- read_csv("run_ww_2020_w.csv")
#combine to one dataframe
run_data <- bind_rows(run_2019, run_2020)
#examine rows
glimpse(run_data)
## Rows: 3,786,848
## Columns: 9
## $ ...1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ datetime <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01,…
## $ athlete <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ distance <dbl> 0.00, 5.27, 9.30, 103.13, 34.67, 69.85, 65.38, 0.00, 32.34, …
## $ duration <dbl> 0.00000, 30.20000, 98.00000, 453.40000, 185.65000, 353.93333…
## $ gender <chr> "F", "M", "M", "M", "M", "F", "M", "M", "M", "M", "M", "M", …
## $ age_group <chr> "18 - 34", "35 - 54", "35 - 54", "18 - 34", "35 - 54", "35 -…
## $ country <chr> "United States", "Germany", "United Kingdom", "United Kingdo…
## $ major <chr> "CHICAGO 2019", "BERLIN 2016", "LONDON 2018,LONDON 2019", "L…
Format the Data Columns & Values
#delete first column, not needed
run_data <- run_data %>%
select(-`...1`)
#format datetime as date
run_data$datetime <- as.Date(run_data$datetime)
#change gender column to ordered factors
run_data$gender<-factor(run_data$gender,
levels = c("F","M"),
labels=c("Female","Male"))
#change age_group column to ordered factors
unique(run_data$age_group)
## [1] "18 - 34" "35 - 54" "55 +"
run_data$age_group<-factor(run_data$age_group,
levels = c("18 - 34","35 - 54", "55 +"),
labels=c("18 - 34","35 - 54", "55 +"))
#quick skim to make sure there are no misspelled country names
unique(run_data$country)
## [1] "United States" "Germany" "United Kingdom"
## [4] "Australia" "Spain" "Canada"
## [7] "Colombia" "Japan" "Malaysia"
## [10] "Belarus" "Switzerland" "Italy"
## [13] "Norway" "Netherlands" "France"
## [16] "Mexico" "Brazil" "Taiwan"
## [19] "Peru" "Russia" "Luxembourg"
## [22] "Sweden" "Singapore" NA
## [25] "Slovenia" "Costa Rica" "Indonesia"
## [28] "Denmark" "Austria" "Poland"
## [31] "Chile" "South Africa" "Belgium"
## [34] "China" "Isle of Man" "Cayman Islands"
## [37] "Iceland" "Portugal" "Romania"
## [40] "Thailand" "Estonia" "Finland"
## [43] "Moldova" "South Korea" "Argentina"
## [46] "Czechia" "Ukraine" "Slovakia"
## [49] "Dominican Republic" "Israel" "Guatemala"
## [52] "Jersey" "Ireland" "Turkey"
## [55] "United Arab Emirates" "Uruguay" "New Zealand"
## [58] "Hungary" "Philippines" "Myanmar"
## [61] "Greece" "India" "Croatia"
## [64] "Panama" "Cyprus" "Vietnam"
## [67] "Guernsey" "Mongolia" "Lithuania"
## [70] "Bolivia" "Andorra" "El Salvador"
## [73] "Latvia" "Nicaragua" "Jordan"
## [76] "Ecuador" "Kazakhstan" "Kosovo"
## [79] "Bulgaria" "Malta" "Kenya"
## [82] "Venezuela" "Serbia" "Zimbabwe"
## [85] "Monaco" "Montenegro" "Suriname"
## [88] "Armenia" "Bahrain" "Honduras"
## [91] "Tunisia" "Nigeria" "Barbados"
## [94] "Ghana" "Azerbaijan" "Botswana"
## [97] "Liechtenstein" "Faroe Islands" "Saudi Arabia"
## [100] "Paraguay" "Senegal" "Angola"
## [103] "Mauritius" "Lebanon" "Bosnia and Herzegovina"
## [106] "Bermuda" "Gibraltar" "Uganda"
## [109] "Afghanistan" "Anguilla" "Morocco"
## [112] "Jamaica" "Belize" "Iran"
## [115] "Bahamas" "Uzbekistan" "Namibia"
## [118] "Trinidad and Tobago" "Ivory Coast" "San Marino"
## [121] "Sudan" "South Sudan" "Maldives"
## [124] "East Timor" "Kuwait" "Fiji"
## [127] "Laos" "Brunei" "Egypt"
## [130] "Cape Verde"
Missing Data
Upon checking, there are missing entries for country, so these were investigated further to determine how to handle them. The same check showed that the column major has no empty entries, meaning that all athletes in this dataset have run at least one major marathon.
#count missing values per column
colSums(is.na(run_data))
## datetime athlete distance duration gender age_group country major
## 0 0 0 0 0 0 34216 0
First, we check if the athletes may have other entries where their countries are present, so we could impute that value. However, doing a quick check of all unique athletes and a random athlete number showed that there is nowhere in the master dataset where the countries of those athletes can be found.
#get the athlete numbers with missing country names
missing_athletes <- run_data %>%
filter(is.na(country)) %>%
pull(athlete)
#check the athletes with missing countries in the master dataset if their country exists there
run_data %>%
filter(athlete %in% missing_athletes) %>%
select(datetime, athlete, country)
#check 1 sample athlete if any of their countries exists in the master dataset
run_data %>%
filter(athlete == 95) %>%
select(datetime, athlete, country)
#proportion of rows with a missing country
mean(is.na(run_data$country))
## [1] 0.009035483
Since rows with a missing country make up only 0.9% of the dataset, and the country of origin cannot be imputed from any existing data, these athletes were removed.
#remove the athletes with missing countries
run_data_clean <- run_data %>%
filter(!athlete %in% missing_athletes)
#confirm the deletion has been done
colSums(is.na(run_data_clean))
## datetime athlete distance duration gender age_group country major
## 0 0 0 0 0 0 0 0
Add Extra Columns
To assist with the analysis, extra columns (year, week, and pace) were created, and the continent of each country was identified.
run_data_clean <- run_data_clean %>%
mutate(
year = year(datetime),
week = isoweek(datetime),
pace = ifelse(distance > 0, duration / distance, 0)) # min/km
# change Kosovo to nearest bordering country, Serbia
run_data_clean <- run_data_clean %>%
mutate(country = case_when(
country == "Kosovo" ~ "Serbia",
TRUE ~ country))
# create a new column to identify the continent of each country
run_data_clean <- run_data_clean %>%
mutate(
continent = countrycode(country,
origin = "country.name",
destination = "continent"))
The COVID-19 stringency dataset was then loaded. Its columns were renamed to match those in the running dataset, and the daily values were aggregated to a mean weekly stringency index to align with the running dataset’s weekly time frame, before being merged with the running dataset.
#load covid stringency data
stringency_raw <- read_csv("covid-stringency-index.csv")
glimpse(stringency_raw)
## Rows: 74,056
## Columns: 4
## $ Entity <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghani…
## $ Code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG…
## $ Date <date> 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020…
## $ stringency_index <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#rename column names
stringency_raw <- stringency_raw %>%
rename(
country = Entity,
country_code = Code,
date = Date) %>%
mutate(date = as.Date(date))
#aggregate stringency values to weekly instead of daily
stringency_weekly <- stringency_raw %>%
mutate(
year = year(date),
week = isoweek(date)) %>%
group_by(country, year, week) %>%
summarise(stringency_index = mean(stringency_index, na.rm = TRUE),
.groups = "drop")
The running dataset and the stringency dataset were then merged into one full_data dataset. Another column, lockdown_level, was created to group the stringency index into pre-pandemic, minimal, low, and high levels.
Since the COVID-19 pandemic began in 2020, the stringency index did not exist in 2019. Therefore, training data from 2019 is labelled Pre-Pandemic (2019). Stringency indices below 20 (including no restrictions at all) are labelled Minimal Restrictions (<20), indices from 20 to 59 are labelled Low Stringency, and indices of 60 and above are labelled High Stringency. Records from 2020 with no matching stringency value are also labelled Minimal Restrictions (<20).
# Join stringency onto running data by country + year + week
full_data <- run_data_clean %>%
left_join(stringency_weekly, by = c("country", "year", "week"))
# create a case when to label the level of stringency based on stringency index
full_data <- full_data %>%
mutate(
lockdown_level = case_when(
year == 2019 ~ "Pre-Pandemic (2019)",
stringency_index >= 60 ~ "High Stringency (≥60)",
stringency_index >= 20 ~ "Low Stringency (20–59)",
stringency_index < 20 ~ "Minimal Restrictions (<20)",
is.na(stringency_index) ~ "Minimal Restrictions (<20)"
),
lockdown_level = factor(lockdown_level, levels = c(
"Pre-Pandemic (2019)",
"Minimal Restrictions (<20)",
"Low Stringency (20–59)",
"High Stringency (≥60)"
))
)
The table below shows the number of athletes and training records per continent. For the sake of simplicity, the report focuses on only one continent, Asia, which has 2,802 athletes and 291,408 rows of training data. Again, this is because Asia was one of the first continents to be exposed to the novel coronavirus, which may make for interesting insights and data patterns during the analysis phase.
full_data %>%
group_by(continent) %>%
summarise(
athletes = n_distinct(athlete),
records = n()
) %>%
arrange(desc(athletes)) %>%
knitr::kable(caption = "Athletes and Records per Continent")
| continent | athletes | records |
|---|---|---|
| Americas | 16712 | 1738048 |
| Europe | 15731 | 1636024 |
| Asia | 2802 | 291408 |
| Oceania | 640 | 66560 |
| Africa | 198 | 20592 |
Moving forward, the report focuses on one continent, Asia. The dataset is filtered to keep only data from Asian athletes and only the variables relevant to the report. Furthermore, since this report focuses on weekly running distance, only this metric is retained, and run duration and pace are dropped.
#select needed rows
full_data <- full_data %>%
filter(continent == "Asia") %>%
select(year, datetime, athlete, gender, age_group, country, distance, stringency_index, lockdown_level) %>%
mutate(year = as.character(year))
glimpse(full_data)
## Rows: 291,408
## Columns: 9
## $ year <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2019…
## $ datetime <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019…
## $ athlete <dbl> 18, 21, 60, 64, 77, 84, 87, 102, 163, 181, 208, 223, …
## $ gender <fct> Male, Male, Male, Female, Male, Female, Male, Male, M…
## $ age_group <fct> 35 - 54, 18 - 34, 35 - 54, 18 - 34, 35 - 54, 55 +, 35…
## $ country <chr> "Japan", "Malaysia", "Taiwan", "Taiwan", "Taiwan", "J…
## $ distance <dbl> 52.53, 58.47, 13.32, 0.00, 40.10, 16.51, 60.80, 22.62…
## $ stringency_index <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lockdown_level <fct> Pre-Pandemic (2019), Pre-Pandemic (2019), Pre-Pandemi…
To ensure that the entries included reflect realistic running performance, outliers were removed using the IQR method, informed by domain knowledge about running behaviour. A quick check of the values shows that the data is highly skewed to the right, so the IQR is used to determine the outliers. Z-scores were not used because they assume normally distributed data, which this dataset is not.
Remove Implausible Values
Weekly distances below 1 km are implausible as a training record and were removed prior to outlier detection. These entries may be mistakes, where an athlete accidentally recorded a run without actually running.
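The filtering step itself is not shown in this report. A minimal sketch of it is given below, using a small made-up data frame (`toy_data`, a hypothetical name) in place of the real data. The `>= 1` threshold is an assumption based on the description above; in the report the result would presumably be stored in the `full_data_clean` data frame used by the later code.

```r
# Toy stand-in for the Asia-only running data (values are invented)
toy_data <- data.frame(
  athlete  = c(18, 21, 60, 64, 77),
  distance = c(52.53, 0.00, 13.32, 0.45, 40.10)
)

# Keep only weekly distances of at least 1 km (assumed threshold rule)
toy_data_clean <- subset(toy_data, distance >= 1)

nrow(toy_data)        # rows before filtering
nrow(toy_data_clean)  # rows after filtering
```

Applying the same condition with `filter(distance >= 1)` in a dplyr pipeline would be equivalent.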
Examine Distribution of Distance
The histogram and boxplot of the distance variable show extreme outliers, confirming that distance is heavily skewed to the right, with the majority of Asian athletes running less than 80 km per week. Furthermore, the boxplot flags values beyond 100 km as outliers, justifying the use of the IQR method to remove them.
p1 <- ggplot(full_data_clean, aes(x = distance)) +
geom_histogram(bins = 10, fill = "#56B4E9", colour = "white") +
labs(title = "Distribution of Distance", x = "Distance (km)", y = "Count") +
theme_minimal()
p2 <- ggplot(full_data_clean, aes(x = distance)) +
geom_boxplot(fill = "#56B4E9", outlier.alpha = 0.1, outlier.size = 0.5) +
labs(title = "Distribution of Distance", x = "Distance (km)", y = NULL) + #a boxplot has no count axis
theme_minimal()
p1 | p2
Applying the IQR Method
The interquartile range (IQR) is used to define the range of values that is kept. More specifically, values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are identified as outliers and removed.
#distance ranges
Q1_dist <- quantile(full_data_clean$distance, 0.25)
Q3_dist <- quantile(full_data_clean$distance, 0.75)
IQR_dist <- Q3_dist - Q1_dist
#filter data based on the IQRs
full_data_clean <- full_data_clean %>%
filter(
distance >= Q1_dist - 1.5 * IQR_dist,
distance <= Q3_dist + 1.5 * IQR_dist)
Distribution After Outlier Removal
The histogram and boxplot below are now more interpretable. The histogram shows that most Asian athletes run a weekly distance of around 15-20 km. The boxplot still flags some weeks at around 100 km as outliers, but these distances fall within the IQR bounds applied above and represent legitimate high-volume training weeks.
p1 <- ggplot(full_data_clean, aes(x = distance)) +
geom_histogram(bins = 10, fill = "#56B4E9", colour = "white") +
labs(title = "Distribution of Distance", x = "Distance (km)", y = "Count") +
theme_minimal()
p2 <- ggplot(full_data_clean, aes(x = distance)) +
geom_boxplot(fill = "#56B4E9", outlier.alpha = 0.1, outlier.size = 0.5) +
labs(title = "Distribution of Distance", x = "Distance (km)", y = NULL) + #a boxplot has no count axis
theme_minimal()
p1 | p2
This segment explores key patterns and trends in the running behaviour of Asian athletes across 2019 and 2020. Data visualisations and summary statistics are used to identify any changes that may be attributable to the COVID-19 pandemic, as well as to characterise the distribution of weekly running distance.
Mean Weekly Running Distance by Country
The table below summarises the weekly running distance for each country in Asia, providing an initial overview of how training volume differs across countries. Mean weekly distance ranges from 10.93 km in Mongolia to 72.96 km in Lebanon.
table1 <- full_data_clean %>%
group_by(country) %>%
summarise(
Min = min(distance, na.rm = TRUE),
Mean = mean(distance, na.rm = TRUE),
Max = max(distance, na.rm = TRUE),
SD = sd(distance, na.rm = TRUE))
knitr::kable(table1, caption = "Weekly Running Distance by Country")
| country | Min | Mean | Max | SD |
|---|---|---|---|---|
| Afghanistan | 1.940 | 31.45095 | 81.550 | 18.369030 |
| Armenia | 1.240 | 26.43627 | 98.170 | 23.821443 |
| Azerbaijan | 1.600 | 31.82224 | 99.380 | 20.138312 |
| Bahrain | 2.410 | 42.16794 | 101.690 | 23.834114 |
| Brunei | 3.670 | 14.74541 | 44.980 | 8.360544 |
| China | 1.010 | 35.59482 | 102.290 | 24.036517 |
| Cyprus | 1.160 | 42.50952 | 101.360 | 24.489902 |
| East Timor | 2.220 | 20.11997 | 60.840 | 11.128419 |
| India | 1.010 | 36.32849 | 102.170 | 22.547213 |
| Indonesia | 1.010 | 28.40569 | 102.180 | 19.634022 |
| Iran | 3.440 | 23.35766 | 65.300 | 12.877387 |
| Israel | 1.150 | 35.66561 | 102.230 | 24.681073 |
| Japan | 1.010 | 33.32658 | 102.300 | 22.921029 |
| Jordan | 1.230 | 31.98363 | 96.380 | 17.859288 |
| Kazakhstan | 1.690 | 40.54369 | 102.179 | 27.582594 |
| Kuwait | 2.009 | 18.12398 | 59.270 | 12.402453 |
| Laos | 5.040 | 14.89624 | 36.889 | 8.017799 |
| Lebanon | 8.050 | 72.96030 | 101.510 | 23.187811 |
| Malaysia | 1.020 | 27.98087 | 102.110 | 20.938736 |
| Maldives | 3.000 | 35.70532 | 72.560 | 21.544160 |
| Mongolia | 1.170 | 10.92718 | 29.880 | 8.801318 |
| Myanmar | 1.720 | 47.71716 | 102.140 | 25.250906 |
| Philippines | 1.010 | 27.13469 | 102.019 | 19.286379 |
| Saudi Arabia | 1.010 | 28.91812 | 99.220 | 21.306816 |
| Singapore | 1.010 | 34.47856 | 102.260 | 23.564462 |
| South Korea | 1.010 | 34.87827 | 102.290 | 23.229860 |
| Taiwan | 1.010 | 35.31146 | 102.260 | 24.461123 |
| Thailand | 1.010 | 31.01599 | 102.218 | 21.553591 |
| Turkey | 1.130 | 36.70298 | 102.220 | 24.301293 |
| United Arab Emirates | 1.010 | 31.48995 | 102.170 | 21.195461 |
| Uzbekistan | 1.010 | 38.72967 | 89.990 | 28.760910 |
| Vietnam | 1.020 | 26.62017 | 82.220 | 17.279973 |
Distribution of Weekly Distance
The boxplot below shows the overall distribution of weekly distance across Asian runners from 2019 to 2020. While most athletes run moderate distances (approximately 20 to 50 km weekly), a smaller proportion of athletes train at considerably higher volumes, hence the outliers.
# Boxplot of weekly distance
full_data_clean %>%
ggplot(aes(y = distance)) +
geom_boxplot(fill = "#56B4E9", outlier.alpha = 0.1, outlier.size = 0.5) +
labs(
title = "Weekly Running Distance",
y = "Distance (km)") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
Mean Weekly Distance Over Time
The time series below plots the mean weekly running distance across all Asian athletes from January 2019 to December 2020. The red dashed line marks the date on which the WHO declared COVID-19 a global pandemic (11 March 2020).
Further, several notable patterns are seen in this time series:
Seasonal peaks and dips in 2019: Runners typically increase training volume (i.e. running distance) in preparation for marathons, followed by a sharp drop in volume during the recovery period after the marathon. For example, a steady rise in running distance is seen around the end of the first quarter (Tokyo in March, Boston and London in April), followed by another peak around the third quarter of the year (Berlin in September, Chicago in October, New York in November).
Early 2020 drop: A slight decline in distance begins at the start of 2020, even before the WHO declaration, likely coinciding with early COVID-19 outbreaks across Asia.
Post-COVID-19 declaration: Training volume increases but stabilises at a lower level than in 2019 for the remainder of 2020. This suggests a sustained impact of restrictions on running behaviour throughout the year.
# Aggregate mean weekly distance across all athletes
weekly_avg <- full_data_clean %>%
group_by(datetime) %>%
summarise(mean_distance = mean(distance, na.rm = TRUE), .groups = "drop")
ggplot(weekly_avg, aes(x = datetime, y = mean_distance)) +
geom_line(colour = "#009E73") +
geom_vline(xintercept = as.Date("2020-03-11"), #pandemic WHO declaration
colour = "red", linetype = "dashed", linewidth = 0.8) +
annotate("text", x = as.Date("2020-03-11"), y = Inf,
label = "WHO Pandemic",
hjust = -0.1, vjust = 1.5, size = 2.5, colour = "red") +
labs(
title = "Mean Weekly Running Distance - Asia (2019–2020)",
x = "Date",
y = "Mean Distance (km)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Mean Weekly Distance for 2019 and 2020
The bar chart below compares the mean weekly running distance between 2019 and 2020: 33.52 km in 2019 versus 32.84 km in 2020. There is a slight decline in the mean distance in 2020, providing initial visual evidence that the COVID-19 pandemic may have reduced training volume. This difference will be formally tested in the hypothesis test section.
table3 <- full_data_clean %>%
group_by(year) %>%
summarise(
Mean_Distance = round(mean(distance, na.rm = TRUE), 2),
SD_Distance = round(sd(distance, na.rm = TRUE), 2),
n = n()) %>%
mutate(year = factor(year))
ggplot(table3, aes(x = year, y = Mean_Distance, fill = year)) +
geom_col(position = "dodge") +
geom_text(aes(label = Mean_Distance),
vjust = -0.5,
size = 4,
fontface = "bold") +
scale_fill_manual(values = c("2019" = "#56B4E9",
"2020" = "#E69F00")) +
labs(
title = "Mean Weekly Running Distance - Asia: 2019 vs 2020",
x = "Year",
y = "Mean Distance (km)",
fill = "Year") +
theme_minimal()
Mean Weekly Distance by Lockdown Level
The mean running distance for 2020 is compared across all different lockdown levels. The pre-pandemic category in 2019 serves as the baseline (33.52 km) while the remaining categories reflect increasing levels of restriction severity during 2020.
The values below suggest that the mean weekly distance was broadly similar across restriction levels, even as stringency increased (low stringency mean distance was 32.55 km, and high stringency mean distance was 32.65 km). Notably, athletes under Minimal Restrictions (<20) appear to have run more than the pre-pandemic baseline, with a mean distance of 34.41 km. This may potentially reflect a period of adjustment by athletes where they have increased or maintained their training distance before stricter restrictions were imposed. This observation will be formally examined in the regression analysis segment of the report.
table4 <- full_data_clean %>%
group_by(lockdown_level) %>%
summarise(
Mean_Distance = round(mean(distance, na.rm = TRUE), 2),
SD_Distance = round(sd(distance, na.rm = TRUE), 2),
n = n())
ggplot(table4, aes(x = lockdown_level, y = Mean_Distance, fill = lockdown_level)) +
geom_col() +
geom_text(aes(label = Mean_Distance),
vjust = -0.5,
size = 4,
fontface = "bold") +
scale_fill_manual(values = c(
"Pre-Pandemic (2019)" = "#56B4E9",
"Minimal Restrictions (<20)" = "#009E73",
"Low Stringency (20–59)" = "#E69F00",
"High Stringency (≥60)" = "#D55E00")) +
labs(
title = "Mean Weekly Running Distance by Lockdown Level - Asia",
x = "Lockdown Level",
y = "Mean Distance (km)",
fill = "Lockdown Level") +
ylim(0, 40) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
Hypotheses
This section investigates whether the mean weekly distance run by Asian athletes in 2019 (pre-pandemic) was significantly different from that in 2020 (pandemic). A significance level of α = 0.05 is used throughout.
\[H_0: \mu_{2019} = \mu_{2020} \] The null hypothesis states that the mean weekly running distance of Asian athletes in 2019 is equal to that of 2020.
\[H_A: \mu_{2019} \ne \mu_{2020}\] The alternative hypothesis states that the mean weekly running distance of Asian athletes in 2019 is different from that of 2020.
Descriptive Statistics
First, the data is split into pre-pandemic (2019) and pandemic (2020). Then, prior to testing, some descriptive statistics are examined by year.
# Split data into 2019 and 2020
data_2019_clean <- full_data_clean %>% filter(year == "2019")
data_2020_clean <- full_data_clean %>% filter(year == "2020")
The table below summarises the distribution of distance for 2019 and 2020, and the boxplot provides a visual comparison between the two periods.
table_desc <- full_data_clean %>%
group_by(year) %>%
summarise(Min = min(distance, na.rm = TRUE),
Q1 = quantile(distance, probs = .25, na.rm = TRUE),
Median = median(distance, na.rm = TRUE),
Q3 = quantile(distance, probs = .75, na.rm = TRUE),
Max = max(distance, na.rm = TRUE),
Mean = mean(distance, na.rm = TRUE),
SD = sd(distance, na.rm = TRUE),
n = n())
knitr::kable(table_desc, caption = "Weekly Running Distance by Year")
| year | Min | Q1 | Median | Q3 | Max | Mean | SD | n |
|---|---|---|---|---|---|---|---|---|
| 2019 | 1.01 | 15.01 | 29.130 | 48.08000 | 102.29 | 33.51891 | 22.62318 | 106874 |
| 2020 | 1.01 | 13.72 | 27.435 | 47.46975 | 102.30 | 32.84393 | 23.23298 | 94074 |
The boxplot below suggests that weekly distances in 2020 are slightly lower than in 2019. However, a hypothesis test is needed to determine whether this decrease is statistically significant, rather than relying on visual inspection alone.
# Boxplot of distance by year
full_data_clean %>%
ggplot(aes(x = year, y = distance, fill = year)) +
geom_boxplot(outlier.alpha = 0.1, outlier.size = 0.3) +
scale_fill_manual(values = c("2019" = "#56B4E9", "2020" = "#E69F00")) +
labs(
title = "Weekly Running Distance: 2019 vs 2020 in Asia",
x = "Year",
y = "Distance (km)",
fill = "Year") +
theme_minimal()
Check Assumptions
Normality: Normality of the distance variable is assessed using QQ plots. As seen below, the tails deviate from the normal line, indicating that the raw data is not normally distributed. However, as each sample contains far more than 30 observations, the Central Limit Theorem (CLT) applies, and the sampling distribution of the mean will be approximately normal regardless of the shape of the raw data. The normality assumption is therefore considered satisfied.
Homogeneity of variances: Levene’s test is used to check whether the variances of distance are equal between 2019 and 2020. If they are not (p < 0.05), Welch’s test must be used instead of the standard two-sample t-test. As seen below, Levene’s test returned p < 0.05. Therefore, the variances for Asia are not equal between 2019 and 2020, and Welch’s t-test (var.equal = FALSE) is used.
#run Levene's test once and keep the p-value for the year factor
levene_p <- leveneTest(distance ~ factor(year),
data = full_data_clean)$`Pr(>F)`[1]
levene_results <- tibble(
levene_p = round(levene_p, 4), #round off
`Equal Variance?` = ifelse(levene_p > 0.05, "TRUE", "FALSE"))
knitr::kable(levene_results, caption = "Levene's Test for Homogeneity of Variances")
| levene_p | Equal Variance? |
|---|---|
| 0 | FALSE |
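The normality argument above rests on the Central Limit Theorem. As a rough illustration (not part of the original analysis), the simulation sketch below draws a right-skewed stand-in population and shows that the means of repeated samples are far less skewed than the raw values. The gamma distribution, the sample size of 500, and the `skewness()` helper are all assumptions chosen for illustration.

```r
set.seed(42)

# Right-skewed stand-in for weekly distances (simulated, not the real data)
population <- rgamma(100000, shape = 2, scale = 16)

# Draw many samples and store each sample's mean
sample_means <- replicate(2000, mean(sample(population, size = 500)))

# Simple moment-based skewness helper (hypothetical, defined here)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw   <- skewness(population)    # clearly positive for the raw values
skew_means <- skewness(sample_means)  # much closer to 0 for the sample means
```

With the yearly sample sizes in this report (roughly 100,000 records each), the approximation would be expected to hold at least as well as in this sketch.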
Run Hypothesis Test
First, the distances for each year (2019 and 2020) are extracted.
asia_2019 <- full_data_clean %>% filter(year == "2019") %>% pull(distance)
asia_2020 <- full_data_clean %>% filter(year == "2020") %>% pull(distance)
Then, the t-test is run. Since Levene’s test returned p < 0.05, Welch’s test is used (var.equal = FALSE).
The results are summarised in the table below. Between 2019 and 2020, the difference in the mean weekly distance run by Asian athletes is small. However, the t-test shows that this difference is statistically significant.
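The call that creates the `test_asia` object used in the results table is not shown. A minimal sketch of a Welch test with base R's `t.test()`, run on simulated vectors, might look like the following. The names `sim_2019`, `sim_2020` and `test_sketch` are hypothetical, and placing the 2019 vector first is an assumption, chosen because it matches the positive t statistic and confidence interval reported in the table.

```r
set.seed(123)

# Simulated stand-ins for the two yearly distance vectors (not the real data)
sim_2019 <- rnorm(500, mean = 33.5, sd = 22.6)
sim_2020 <- rnorm(500, mean = 32.8, sd = 23.2)

# Welch's two-sample t-test: var.equal = FALSE drops the equal-variance assumption
test_sketch <- t.test(sim_2019, sim_2020, var.equal = FALSE)

test_sketch$statistic  # t statistic
test_sketch$parameter  # Welch-adjusted degrees of freedom
test_sketch$p.value    # two-sided p-value
test_sketch$conf.int   # 95% CI for mean(sim_2019) - mean(sim_2020)
```

The same four components are the ones pulled into the results table.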
results_table <- data.frame(
Continent = "Asia",
Mean_2019 = round(mean(asia_2019), 2),
Mean_2020 = round(mean(asia_2020), 2),
Difference = round(mean(asia_2020) - mean(asia_2019), 2),
T_Stat = round(test_asia$statistic, 3),
DF = round(test_asia$parameter, 1),
P_Value = round(test_asia$p.value, 4),
CI_Lower = round(test_asia$conf.int[1], 2),
CI_Upper = round(test_asia$conf.int[2], 2),
Significant = ifelse(test_asia$p.value < 0.05, "Yes", "No")) #derive from the p-value instead of hard-coding
knitr::kable(results_table,
caption = "Two-Sample T-Test Results: 2019 vs 2020 Distance - Asia")
|  | Continent | Mean_2019 | Mean_2020 | Difference | T_Stat | DF | P_Value | CI_Lower | CI_Upper | Significant |
|---|---|---|---|---|---|---|---|---|---|---|
| t | Asia | 33.52 | 32.84 | -0.67 | 6.579 | 196282 | 0 | 0.47 | 0.88 | Yes |
Results
The p-value < 0.05. The 95% confidence interval does not capture 0. Therefore, we reject the null hypothesis.
Conclusion
Welch’s two-sample t-test for Asia was statistically significant (t = 6.579, df = 196,282, p < 0.001). The 95% confidence interval for the difference in means (2019 minus 2020), [0.47, 0.88], did not capture 0, so the null hypothesis is rejected. Mean weekly distance in 2020 (32.84 km) was significantly lower than in 2019 (33.52 km), with an observed difference of -0.67 km.
While statistically significant, the magnitude of this difference is modest, with a mean decrease of just 0.67 km per week. However, the direction of the effect (running distance decreasing in 2020) is consistent with the broader narrative that COVID-19 restrictions negatively affected the training behaviour of runners. The extent to which government stringency specifically predicts running distance is examined further in the next section.
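To put the "statistically significant but modest" result in numbers, the sketch below recomputes the Welch t statistic from the summary statistics in the "Weekly Running Distance by Year" table and adds a standardised effect size (Cohen's d). Cohen's d is not part of the original analysis and is included here only as an illustrative assumption about how the magnitude could be expressed.

```r
# Summary statistics copied from the "Weekly Running Distance by Year" table
mean_2019 <- 33.51891; sd_2019 <- 22.62318; n_2019 <- 106874
mean_2020 <- 32.84393; sd_2020 <- 23.23298; n_2020 <- 94074

diff_means <- mean_2019 - mean_2020

# Welch standard error and t statistic
se_welch <- sqrt(sd_2019^2 / n_2019 + sd_2020^2 / n_2020)
t_stat   <- diff_means / se_welch

# Pooled SD and Cohen's d (standardised mean difference)
sd_pooled <- sqrt(((n_2019 - 1) * sd_2019^2 + (n_2020 - 1) * sd_2020^2) /
                    (n_2019 + n_2020 - 2))
cohens_d  <- diff_means / sd_pooled

round(t_stat, 2)    # close to the reported t of 6.579
round(cohens_d, 2)  # roughly 0.03
```

A d of roughly 0.03 is far below the conventional threshold of 0.2 for a small effect, which is in line with the interpretation that the very large sample size, rather than a large change in behaviour, drives the significance.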
This section investigates if the government stringency level, gender, age group, and country can predict the weekly running distance of Asian athletes during 2020. A multiple linear regression model is fitted using only 2020 data, as the stringency index is only available for this period.
# Filter to 2020 only
data_2020 <- full_data_clean %>%
filter(year == "2020", !is.na(stringency_index)) #year is stored as character
Visualising the Relationship
The scatter plot below visualises the possible relationship between the stringency index and the weekly distances run by athletes. Since the dataset contains a large number of individual records, the mean weekly distance is aggregated for each stringency index value (rounded to a whole number) for clarity. This aggregation is for visualisation only; the regression model is still fitted on individual-level observations.
data_2020_avg <- data_2020 %>%
  mutate(stringency_bin = round(stringency_index)) %>% # round to whole numbers
  group_by(stringency_bin) %>%
  summarise(mean_distance = mean(distance, na.rm = TRUE),
            n = n(),
            .groups = "drop")
The scatter plot below shows a very slight negative slope: as stringency increased, athletes’ mean weekly distance tended to decrease.
While the difference in mean weekly distance between 2019 and 2020 was statistically significant (see the hypothesis test in the previous section), the shallow slope suggests that the stringency index itself may not be a strong predictor of distance run in 2020.
ggplot(data_2020_avg, aes(x = stringency_bin, y = mean_distance)) +
  geom_point(aes(size = n), alpha = 0.6) + # size by number of observations
  geom_smooth(method = "lm", se = TRUE, colour = "red", linewidth = 0.8) +
  scale_size_continuous(range = c(1, 6)) +
  labs(
    title = "Mean Weekly Running Distance vs Stringency Index in Asia (2020)",
    subtitle = "Point size reflects number of observations at each stringency level",
    x = "COVID-19 Stringency Index",
    y = "Mean Weekly Distance (km)") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))
Fitting the Model
A multiple linear regression model is fitted using stringency index, gender, age group, and country as predictors of weekly running distance.
The baseline reference group for the initial equation has these attributes:
* from Afghanistan (the first country alphabetically)
* Female
* 18 - 34 years old (the youngest age group)
* stringency index of 0
\[\text{distance} = \beta_0 + \beta_1(\text{stringency index}) + \beta_2(\text{gender}) + \beta_3(\text{age group}) + \beta_4(\text{country}) + \varepsilon\]
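The baseline group follows from how R codes categorical predictors: by default, the first level of each factor (alphabetical order for text values) becomes the reference category. The sketch below illustrates this on a small made-up data frame. The `toy` object and its values are hypothetical and are not the report's dataset.

```r
# Hypothetical toy data for illustration only (not the report's dataset)
toy <- data.frame(
  gender    = factor(c("Female", "Male", "Male", "Female")),
  age_group = factor(c("18 - 34", "35 - 54", "55 +", "18 - 34")),
  country   = factor(c("Afghanistan", "Japan", "China", "Japan"))
)

# The first level of each factor is the reference category used by lm()
ref_levels <- sapply(toy, function(f) levels(f)[1])
ref_levels

# The reference can be changed explicitly, e.g. to compare against Japan
toy$country <- relevel(toy$country, ref = "Japan")
levels(toy$country)[1]
```

Choosing a reference country with many observations can make the country coefficients easier to interpret, since each one is a contrast against that single group.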
model1 <- lm(distance ~ stringency_index + gender + age_group + country, data = data_2020)
model1 %>% summary()
##
## Call:
## lm(formula = distance ~ stringency_index + gender + age_group +
## country, data = data_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.573 -18.570 -5.115 14.539 75.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.797200 2.452808 12.964 < 2e-16 ***
## stringency_index -0.031724 0.003926 -8.080 6.57e-16 ***
## genderMale 3.895603 0.210370 18.518 < 2e-16 ***
## age_group35 - 54 0.517905 0.184991 2.800 0.005117 **
## age_group55 + -0.579058 0.336378 -1.721 0.085173 .
## countryAzerbaijan -2.757121 3.119053 -0.884 0.376719
## countryBahrain 2.309793 3.122008 0.740 0.459398
## countryBrunei -20.809586 4.230299 -4.919 8.71e-07 ***
## countryChina 2.944684 2.445936 1.204 0.228628
## countryCyprus 8.995736 2.621931 3.431 0.000602 ***
## countryIndia 1.367375 2.462072 0.555 0.578639
## countryIndonesia -5.869439 2.447839 -2.398 0.016496 *
## countryIran -3.285086 4.092766 -0.803 0.422175
## countryIsrael -0.478382 2.498194 -0.191 0.848141
## countryJapan -1.246258 2.437632 -0.511 0.609172
## countryJordan 0.451656 3.017194 0.150 0.881006
## countryKazakhstan 8.085302 2.680709 3.016 0.002561 **
## countryKuwait -15.046716 3.861164 -3.897 9.75e-05 ***
## countryLaos -19.188631 5.120728 -3.747 0.000179 ***
## countryLebanon 47.483096 4.679782 10.146 < 2e-16 ***
## countryMalaysia -6.984807 2.475604 -2.821 0.004782 **
## countryMongolia -20.112953 6.413080 -3.136 0.001712 **
## countryMyanmar 16.683408 3.254535 5.126 2.96e-07 ***
## countryPhilippines -7.622778 2.507979 -3.039 0.002371 **
## countrySaudi Arabia -4.981555 2.783052 -1.790 0.073463 .
## countrySingapore 2.113906 2.453099 0.862 0.388839
## countrySouth Korea -0.720911 2.473391 -0.291 0.770695
## countryTaiwan -0.271956 2.445368 -0.111 0.911448
## countryThailand -3.771169 2.444520 -1.543 0.122906
## countryTurkey 0.957620 2.498320 0.383 0.701494
## countryUnited Arab Emirates -2.854262 2.476420 -1.153 0.249087
## countryUzbekistan -20.426635 5.469381 -3.735 0.000188 ***
## countryVietnam -6.774907 2.690815 -2.518 0.011811 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.95 on 93924 degrees of freedom
## Multiple R-squared: 0.02457, Adjusted R-squared: 0.02424
## F-statistic: 73.95 on 32 and 93924 DF, p-value: < 2.2e-16
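The estimated regression equation is:
\[\widehat{\text{distance}} = 31.797 - 0.032(\text{stringency index}) + 3.896(\text{Male}) + 0.518(\text{Age 35–54}) - 0.579(\text{Age 55+}) + \hat{\beta}_4(\text{country})\]
where the last term is the fitted coefficient for the athlete’s country from the output above (0 for the reference country).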
Interpreting the Coefficients
Intercept (31.797): A female runner aged 18–34 from Afghanistan with a stringency index of 0 is predicted to run about 31.8 km per week. This serves as the baseline against which all other coefficients are interpreted.
Stringency Index (β = -0.032, p < 0.001): For every 1-point increase in the government stringency index, the predicted weekly running distance decreases by about 0.032 km, holding gender, age group, and country constant. This effect is statistically significant (p < 0.001), indicating that stricter restrictions were associated with reduced training volume.
Gender Male (β = 3.896, p < 0.001): Male runners are predicted to run about 3.90 km more per week than female runners, controlling for other variables. This is also a statistically significant finding (p < 0.001).
Age Group 35–54 (β = 0.518, p = 0.005): Athletes aged 35–54 are predicted to run about 0.52 km more per week than the reference age group (18–34), a statistically significant difference (p = 0.005).
Age Group 55+ (β = -0.579, p = 0.085): Athletes aged 55 and above did not show a statistically significant difference in weekly distance compared to the baseline at the 0.05 level. This suggests that, once other factors are accounted for, the oldest group’s training volume is not meaningfully different from that of the youngest group.
Country: Each country was compared against the reference country. The countries with the most statistically significant effects include Lebanon, Brunei, Myanmar, and Uzbekistan. This suggests that there may be regional differences in running behaviour across Asia that are not fully explained by stringency alone.
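To put the stringency coefficient in practical terms, the sketch below scales it over larger changes in the index and compares it with the gender coefficient. The coefficient values are copied from the model summary above, and the variable names are illustrative.

```r
# Coefficients copied from the model summary above
b_stringency <- -0.031724
b_male       <-  3.895603

# Predicted change in weekly distance (km) for 1-, 50- and 100-point
# increases in the stringency index, other predictors held constant
stringency_change <- c(1, 50, 100)
effect_km <- b_stringency * stringency_change
data.frame(stringency_change, effect_km)

# Stringency increase that would be needed to offset the male-female gap;
# the result exceeds 100, the top of the index scale
offset_points <- abs(b_male / b_stringency)
offset_points
```

Even a move across the full 0 to 100 range of the index corresponds to a predicted change of only about 3 km per week, which is smaller than the estimated gap between male and female runners.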
Model Fit
Although several predictors were statistically significant, the model explained only a small proportion of the variance in running distance. More specifically, the model produced an R² of 0.0246 (adjusted R² = 0.0242), which means it explains only about 2.46% of the variation in running distance.
This suggests that while stringency, gender, age group, and country are statistically associated with running distance, the majority of variability in individual training behaviour is driven by other unmeasured factors, such as personal fitness levels, individual training plans, access to proper running routes, or personal motivation.
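The fit statistics in the summary output are linked by standard formulas, so they can be cross-checked by hand. The sketch below recomputes the adjusted R² and the F-statistic from the reported R², the residual degrees of freedom, and the number of estimated coefficients (33, counting the intercept). All inputs are copied from the output above, and the variable names are illustrative.

```r
# Values copied from the model summary above
r2       <- 0.02457   # Multiple R-squared
df_resid <- 93924     # residual degrees of freedom
n_coef   <- 33        # estimated coefficients, including the intercept

n <- df_resid + n_coef   # number of observations used in the fit
p <- n_coef - 1          # number of predictors (excluding the intercept)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed in terms of R-squared
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

round(adj_r2, 5)   # close to the reported 0.02424
round(f_stat, 1)   # close to the reported 73.95 (small gap from rounding R-squared)
```

Because the sample is so large, the penalty for 32 predictors is negligible, which is why the adjusted and unadjusted R² values are almost identical.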
Check Assumptions
Independence: Each record represents a unique athlete-week combination, and is assumed to be independent of other records.
Linearity: The Residuals vs Fitted plot above shows that the red line is not perfectly flat, with some mild deviation. Furthermore, the residuals are not perfectly random, since their spread changes slightly across the fitted values. There is mild non-linearity, suggesting that the linearity assumption is only approximately satisfied.
Normality of Residuals: The QQ plot above shows that the residuals are skewed, because the points curve away from the reference line at both ends in an S-shape. However, given the large sample size, the Central Limit Theorem implies that the sampling distribution of the coefficient estimates is approximately normal regardless of the distribution of the residuals, so inference from the regression model should still be reliable.
Homoscedasticity: The Scale-Location plot above shows that the red line slopes upward slightly. This means that the variance of the residuals increases somewhat across fitted values. Therefore, there is a slight violation of the homoscedasticity assumption. However, given the large sample size, the impact on inference is expected to be minimal.
Influential Cases: The Residuals vs Leverage plot above shows no extreme points outside the Cook’s distance lines, which means that there are no highly influential cases.
Overall, while minor assumption violations were identified, the large sample size mitigates their impact on the reliability of the regression results.
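The four diagnostic plots discussed above can be produced with base R's `plot()` method for `lm` objects. The sketch below shows one way to do this on simulated stand-in data. The `sim` data frame, its coefficients, and the noise level are made up for illustration and are not the report's dataset or model.

```r
# Simulated stand-in data for illustration only (not the report's dataset)
set.seed(42)
n <- 500
sim <- data.frame(stringency_index = runif(n, 0, 100))
sim$distance <- 32 - 0.03 * sim$stringency_index + rnorm(n, sd = 20)

fit <- lm(distance ~ stringency_index, data = sim)

# 2 x 2 grid of base-R diagnostic plots:
#   1 = Residuals vs Fitted, 2 = Normal Q-Q,
#   3 = Scale-Location,      5 = Residuals vs Leverage
old_par <- par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
par(old_par)

# Numeric companion to the leverage plot: the largest Cook's distance
max_cooks <- max(cooks.distance(fit))
max_cooks
```

Replacing `fit` with the fitted model from this report would give the plots referred to in the assumption checks.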
Sample Predictions
To illustrate the practical application of the model, the predicted weekly running distance is estimated for a male athlete aged 35–54 from the Philippines during a period of moderate restriction (stringency index = 50).
prediction_data <- data.frame(
  stringency_index = 50,
  gender = "Male",
  age_group = "35 - 54",
  country = "Philippines")
predicted <- predict(model1,
                     newdata = prediction_data,
                     interval = "confidence")
predicted
##        fit      lwr      upr
## 1 27.00174 25.82261 28.18087
For a male athlete aged 35–54 from the Philippines during a period of moderate restriction (index = 50), the predicted weekly running distance was 27.0 km (95% CI: 25.8 to 28.2 km). This interval describes the uncertainty in the mean prediction, not the spread of individual runners. Since the model has an R² of only 0.0246, predictions for individual athletes may not be accurate and should be interpreted cautiously, as many other factors may influence running distance.
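The fitted value can also be reproduced by hand from the coefficient table, which makes clear how each term contributes. The sketch below adds up the intercept and the relevant coefficients. The values are copied from the model summary above, and the variable names are illustrative.

```r
# Coefficients copied from the model summary above
intercept     <- 31.797200
b_stringency  <- -0.031724
b_male        <-  3.895603
b_age_35_54   <-  0.517905
b_philippines <- -7.622778

# Male, aged 35 - 54, Philippines, stringency index = 50
manual_fit <- intercept +
  b_stringency * 50 +
  b_male +
  b_age_35_54 +
  b_philippines

round(manual_fit, 2)   # agrees with the fitted value of about 27.00 from predict()
```

The country term (about -7.6 km) moves the prediction far more than the stringency term (about -1.6 km at an index of 50), which is consistent with the small stringency coefficient discussed earlier.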
Conclusion
Overall, the regression model was statistically significant, F(32, 93924) = 73.95, p < 0.001. However, the model explained only about 2.46% of the variance in weekly running distance. Higher stringency index values were significantly associated with lower weekly running distance, suggesting a small but statistically significant negative relationship between COVID-19 restrictions and training behaviour. Male athletes and those aged 35–54 also ran significantly more than their respective baseline groups. All in all, these findings suggest that while COVID-19 restrictions did influence running behaviour, most of the variation in weekly training distance is driven by factors that were not captured in this model.
Summary of Findings
This report investigated the running behaviour of athletes in Asia using training data from 2019 to 2020, considering the Oxford COVID-19 Government Response Stringency Index.
Hypothesis Testing: Using Welch’s two-sample t-test, a statistically significant difference was found between the mean weekly distance run by athletes in 2019 and in 2020. This provides evidence that the COVID-19 pandemic negatively impacted the training behaviour of Asian runners.
Regression Analysis: The multiple linear regression model was statistically significant overall, indicating that stringency index, gender, age group, and country collectively predict weekly running distance. However, the model explained only about 2.46% of the variance in the distance run by athletes, indicating that these variables are not sufficient to explain an athlete’s weekly running distance. This also suggests that running is a highly individual activity, influenced by a wide range of factors that are challenging to measure, such as individual fitness levels, personal motivation, and access to proper running routes.
Strengths and Limitations
Strengths:
The dataset is massive, with hundreds of thousands of rows spanning two full years. This provides high statistical power, making it possible to detect even small changes in the data.
Using 2019 data provides a natural pre-pandemic baseline, which allows for a direct comparison with training behaviour in 2020.
The stringency index is a standardised, quantitative measure of pandemic restrictions, which gives the report an additional variable for quantifying the impact of the pandemic.
Limitations:
The dataset is sourced from a social network of athletes, which, despite its massive size, may not be representative of all types of runners (i.e., it excludes those without social media accounts to monitor their running progress).
The low R² suggests that the predictors used in the regression model are not enough to capture all factors driving individual training behaviour. Other important variables, which may have more impact on a runner’s weekly distance, were not measured.
All athletes in the dataset have completed at least one major marathon, which means that the findings may not generalise to all types of runners, especially those who are non-competitive or run for recreational purposes.
Future Investigations
For future investigations, the following can be considered:
Broader geographic scope: Given the size of the dataset, similar analyses could be conducted for runners from other continents. An additional deep dive can also be done to compare continents with each other.
Subgroup Analyses: Looking into running behaviour by gender or by age group could also reveal whether certain demographics responded differently to the COVID-19 restrictions.
Additional Performance Metrics: Future analyses could extend the metrics to also include the running duration and running pace, which may tell a richer story about changes in a runner’s training behaviour.
Non-marathon runners: Studying the running behaviour of athletes who have not completed a marathon may provide a more representative picture of general running behaviour, instead of considering only more experienced, competitive runners.
This report examined the impact of COVID-19 restrictions on the weekly running behaviour of Asian athletes across 2019 and 2020. The analysis focused on Asia, which was the first continent to experience COVID-19 outbreaks.
After performing exploratory data analysis, hypothesis testing, and multiple linear regression, the main finding is that COVID-19 restrictions were associated with a statistically significant reduction in weekly running distance among Asian athletes, but the effect was small in practical terms. The pandemic may be one of many factors that influenced an athlete’s training behaviour, but as supported by the low explanatory power of the regression model, running remains a deeply individual activity shaped by far more than government policy or the pandemic alone.
Acknowledgement of Gen AI Use
This assignment was completed with the assistance of Claude (Anthropic, 2026), particularly for coding with R, data visualisation, debugging, and exploring different approaches to analyse the results and understand the dataset. However, the direction of the analysis, interpretations, and conclusions are from the author.
Aside from Generative AI, the course modules and weekly exercises (Tafakori, 2026) also provided guidance throughout the report. The R files were used as a reference when organising the overall report, and identifying the flow of the hypothesis tests and regression analysis.
Reference List
Hale, T., Angrist, N., Goldszmidt, R., Kira, B., Petherick, A., Phillips, T., Webster, S., Cameron-Blake, E., Hallas, L., Majumdar, S., & Tatlow, H. (2021). A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nature Human Behaviour, 5, 529–538. https://doi.org/10.1038/s41562-021-01079-8
Mexwell. (2023). Long-distance running dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/mexwell/long-distance-running-dataset/data
Our World in Data. (2021). COVID-19 government response stringency index. https://ourworldindata.org/covid-stringency-index
Tafakori, L. (2026, May 10). Week 9 During Class Worksheet [R File]. Canvas. https://rmit.instructure.com/courses/158207/pages/week-9-during-class-2?module_item_id=8070977
Tafakori, L. (2026, May 17). Week 10 During Class Worksheet [R File]. Canvas. https://rmit.instructure.com/courses/158207/pages/week-10-during-class?module_item_id=8070982
Tafakori, L. (2026, May 24). Week 11 During Class Worksheet [R File]. Canvas. https://rmit.instructure.com/courses/158207/pages/week-11-during-class?module_item_id=8070987
World Health Organization. (2020). WHO timeline — COVID-19. https://www.who.int/news/item/27-04-2020-who-timeline---covid-19