Effect of COVID-19 Lockdown Stringency on Asian Athletes’ Running Behaviour

A Statistical Analysis

Antoinette G. Anastacio (s4137216)

Last updated: 29 May, 2026

RPubs

This assignment has been published in RPubs.

Link: here

Introduction

Running remains one of the most accessible forms of exercise worldwide. Every year, runners from around the world take part in the most famous long-distance running events, the World Marathon Majors. These are marathons (42.195 km) held in major cities around the globe, namely: Boston, Tokyo, London, Berlin, Chicago, and New York.

Unfortunately, the COVID-19 pandemic, declared in March 2020, fundamentally disrupted daily life worldwide. As the virus spread, governments established public health restrictions ranging from stay-at-home orders and limits on outdoor activity to venue closures. The overall level of restriction in a country is summarised by its stringency index, which ranges from 0 (least strict) to 100 (strictest) (Hale et al., 2021).

Since running is mostly an outdoor activity, these COVID-19 restrictions may have significantly affected athletes’ running behaviour. Running behaviour in Asia is particularly compelling to study, given that Asia was the first continent to identify the coronavirus (WHO, 2020).

Problem Statement

This report investigates the following research question:

Did government COVID-19 lockdown stringency significantly affect the weekly running distances of athletes in Asia during 2020, compared to their pre-pandemic running behaviour in 2019?

Methodology

To address this question, the following steps have been applied:

About the Data

Two datasets were used and merged for this report.

Main Datasets

Running data: Sourced from Kaggle — Long-Distance Running Dataset (Mexwell, 2023). This dataset contains 10.7 million training records of 36,412 athletes all around the world (129 countries), from 2019 to 2020. Data was sourced via web scraping of a major social network for athletes. Notably, all runners here have run at least 1 major marathon (42km run). For the purposes of this report, only athletes from Asia have been included in the analysis.

COVID Stringency Index: Sourced from Our World in Data (2021). This Oxford COVID-19 Government Response Stringency Index is a score from 0 (no restrictions) to 100 (strictest restrictions), reflecting the strictness of government responses including lockdowns, school/workplace closures, and travel restrictions. The stringency index has been aggregated to a weekly level to align with the weekly running data.

Data Integration

The two datasets were merged by matching each athlete’s country of origin with the corresponding stringency index value for the week of their run. All 2019 records were assigned a stringency value of NA, given that the stringency index only exists from 2020 onwards, during the COVID-19 pandemic.

Key Variables

The following variables were used throughout the report:

| Variable | Type | Description |
|---|---|---|
| year | Character | Year of the training activity (2019 or 2020) |
| datetime | Date | Start of the week |
| athlete | Numeric | Computer-generated ID for the athlete |
| gender | Factor | ‘Female’ or ‘Male’ |
| age_group | Factor (levels in ascending age) | ‘18 - 34’, ‘35 - 54’, ‘55 +’ |
| country | Character | Athlete’s country of origin |
| distance | Numeric (km) | Weekly running distance covered by the athlete |
| stringency_index | Numeric (0–100) | Government restriction severity during this week |
| lockdown_level | Factor (levels in ascending severity) | Level of lockdown based on stringency index |

Loading, Examining, & Pre-Processing Running Data

First, the running dataset was loaded and preprocessed, before being merged to the stringency dataset. Pre-processing took the most time out of the whole report since the dataset contains millions of rows of data.

Load the Data

2019 and 2020 datasets were loaded, merged into one dataframe, and examined.

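#load the packages used in this report
#(assumed setup: the rendered report does not show its library calls)
library(tidyverse)    #readr, dplyr, ggplot2
library(lubridate)    #year(), isoweek()
library(countrycode)  #continent lookup
library(patchwork)    #p1 | p2 plot layout
library(ggpubr)       #ggqqplot()
library(car)          #leveneTest()

#get the files for running
run_2019 <- read_csv("run_ww_2019_w.csv")
run_2020 <- read_csv("run_ww_2020_w.csv")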

#combine to one dataframe
run_data <- bind_rows(run_2019, run_2020)

#examine rows
glimpse(run_data)
## Rows: 3,786,848
## Columns: 9
## $ ...1      <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ datetime  <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01,…
## $ athlete   <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ distance  <dbl> 0.00, 5.27, 9.30, 103.13, 34.67, 69.85, 65.38, 0.00, 32.34, …
## $ duration  <dbl> 0.00000, 30.20000, 98.00000, 453.40000, 185.65000, 353.93333…
## $ gender    <chr> "F", "M", "M", "M", "M", "F", "M", "M", "M", "M", "M", "M", …
## $ age_group <chr> "18 - 34", "35 - 54", "35 - 54", "18 - 34", "35 - 54", "35 -…
## $ country   <chr> "United States", "Germany", "United Kingdom", "United Kingdo…
## $ major     <chr> "CHICAGO 2019", "BERLIN 2016", "LONDON 2018,LONDON 2019", "L…

Format the Data Columns & Values

#delete first column, not needed
run_data <- run_data %>%
  select(-`...1`)

#format datetime as date
run_data$datetime <- as.Date(run_data$datetime)

#convert gender to a factor with readable labels
run_data$gender <- factor(run_data$gender,
                          levels = c("F", "M"),
                          labels = c("Female", "Male"))

#check the age groups present, then convert to a factor (levels in ascending age)
unique(run_data$age_group)
## [1] "18 - 34" "35 - 54" "55 +"
run_data$age_group <- factor(run_data$age_group,
                             levels = c("18 - 34", "35 - 54", "55 +"))

#quick skim to make sure there are no misspelled country names
unique(run_data$country)
##   [1] "United States"          "Germany"                "United Kingdom"        
##   [4] "Australia"              "Spain"                  "Canada"                
##   [7] "Colombia"               "Japan"                  "Malaysia"              
##  [10] "Belarus"                "Switzerland"            "Italy"                 
##  [13] "Norway"                 "Netherlands"            "France"                
##  [16] "Mexico"                 "Brazil"                 "Taiwan"                
##  [19] "Peru"                   "Russia"                 "Luxembourg"            
##  [22] "Sweden"                 "Singapore"              NA                      
##  [25] "Slovenia"               "Costa Rica"             "Indonesia"             
##  [28] "Denmark"                "Austria"                "Poland"                
##  [31] "Chile"                  "South Africa"           "Belgium"               
##  [34] "China"                  "Isle of Man"            "Cayman Islands"        
##  [37] "Iceland"                "Portugal"               "Romania"               
##  [40] "Thailand"               "Estonia"                "Finland"               
##  [43] "Moldova"                "South Korea"            "Argentina"             
##  [46] "Czechia"                "Ukraine"                "Slovakia"              
##  [49] "Dominican Republic"     "Israel"                 "Guatemala"             
##  [52] "Jersey"                 "Ireland"                "Turkey"                
##  [55] "United Arab Emirates"   "Uruguay"                "New Zealand"           
##  [58] "Hungary"                "Philippines"            "Myanmar"               
##  [61] "Greece"                 "India"                  "Croatia"               
##  [64] "Panama"                 "Cyprus"                 "Vietnam"               
##  [67] "Guernsey"               "Mongolia"               "Lithuania"             
##  [70] "Bolivia"                "Andorra"                "El Salvador"           
##  [73] "Latvia"                 "Nicaragua"              "Jordan"                
##  [76] "Ecuador"                "Kazakhstan"             "Kosovo"                
##  [79] "Bulgaria"               "Malta"                  "Kenya"                 
##  [82] "Venezuela"              "Serbia"                 "Zimbabwe"              
##  [85] "Monaco"                 "Montenegro"             "Suriname"              
##  [88] "Armenia"                "Bahrain"                "Honduras"              
##  [91] "Tunisia"                "Nigeria"                "Barbados"              
##  [94] "Ghana"                  "Azerbaijan"             "Botswana"              
##  [97] "Liechtenstein"          "Faroe Islands"          "Saudi Arabia"          
## [100] "Paraguay"               "Senegal"                "Angola"                
## [103] "Mauritius"              "Lebanon"                "Bosnia and Herzegovina"
## [106] "Bermuda"                "Gibraltar"              "Uganda"                
## [109] "Afghanistan"            "Anguilla"               "Morocco"               
## [112] "Jamaica"                "Belize"                 "Iran"                  
## [115] "Bahamas"                "Uzbekistan"             "Namibia"               
## [118] "Trinidad and Tobago"    "Ivory Coast"            "San Marino"            
## [121] "Sudan"                  "South Sudan"            "Maldives"              
## [124] "East Timor"             "Kuwait"                 "Fiji"                  
## [127] "Laos"                   "Brunei"                 "Egypt"                 
## [130] "Cape Verde"

Pre-Processing Running Data (Cont.)

Missing Data

Upon checking, the country column has missing entries, so these were investigated further to decide how to handle them. The same check also showed that the major column has no missing entries, meaning that every athlete in this dataset has run at least one major marathon.

#check for missing data
colSums(is.na(run_data))
##  datetime   athlete  distance  duration    gender age_group   country     major 
##         0         0         0         0         0         0     34216         0

First, we check whether the affected athletes have other records where their country is present, so that the value could be imputed. However, a check across all affected athletes, plus a spot check of one athlete (ID 95), showed that no record in the master dataset contains a country for them.

#get the unique athlete IDs with missing country names
missing_athletes <- run_data %>% 
  filter(is.na(country)) %>%
  distinct(athlete) %>%
  pull(athlete)

#check whether any record for these athletes has a country in the master dataset
run_data %>%
  filter(athlete %in% missing_athletes, !is.na(country)) %>%
  select(datetime, athlete, country)
#spot-check one sample athlete to see if any of their records has a country
run_data %>%
  filter(athlete == 95) %>%
  select(datetime, athlete, country) 
#the country cannot be imputed, so check what share of records is affected
mean(is.na(run_data$country))
## [1] 0.009035483

Since records with a missing country make up only 0.9% of the dataset, and the country of origin cannot be imputed from any existing data, these athletes were removed.

#remove the athletes with missing countries
run_data_clean <- run_data %>%
  filter(!athlete %in% missing_athletes)

#confirm the deletion has been done 
colSums(is.na(run_data_clean))
##  datetime   athlete  distance  duration    gender age_group   country     major 
##         0         0         0         0         0         0         0         0
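The imputation check described above can also be expressed as a per-athlete summary rather than a visual scan of rows. The sketch below uses a small made-up data frame (the `toy` values are hypothetical, not taken from the running dataset) and base R only, so it runs without extra packages. It separates athletes whose country could be filled in from another record from those with no country anywhere.

```r
# Toy data (hypothetical): athlete 2 has a country on one row only,
# athlete 3 has no country on any row.
toy <- data.frame(
  athlete = c(1, 1, 2, 2, 3, 3),
  country = c("Japan", "Japan", NA, "India", NA, NA),
  stringsAsFactors = FALSE
)

# For each athlete: is a country present on at least one row?
has_country <- tapply(!is.na(toy$country), toy$athlete, any)

# For each athlete: is a country missing on at least one row?
some_missing <- tapply(is.na(toy$country), toy$athlete, any)

# Imputable: some rows missing, but a country exists elsewhere
imputable <- names(has_country)[has_country & some_missing]

# Unrecoverable: no row carries a country
unrecoverable <- names(has_country)[!has_country]

imputable
unrecoverable
```

In the report's data, the author's check found no imputable athletes, which is what justifies removing them outright.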

Add Extra Columns

To assist with the analysis, extra columns were created. Additionally, the continent of the countries was also identified.

run_data_clean <- run_data_clean %>%
  mutate(
    year = year(datetime),
    week = isoweek(datetime),
    # pace in min/km; undefined (NA) for weeks with no distance
    pace = ifelse(distance > 0, duration / distance, NA_real_))

# change Kosovo to nearest bordering country, Serbia
run_data_clean <- run_data_clean %>%
  mutate(country = case_when(
    country == "Kosovo" ~ "Serbia",  
    TRUE ~ country))

# create a new column to identify the continent of each country
run_data_clean <- run_data_clean %>%
  mutate(
    continent = countrycode(country, 
                            origin = "country.name", 
                            destination = "continent"))
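One detail worth checking when a calendar year is paired with an ISO week number is the year boundary, where the two can disagree. The sketch below uses base R date formats as stand-ins for the lubridate helpers (`%Y` for `year()`, `%G` for `isoyear()`, `%V` for `isoweek()`); the three dates are illustrative week starts, not values verified against the dataset. In this report the mismatch is probably harmless, since 2019 rows have no stringency records to match in any case, but that is an assumption rather than something tested here.

```r
# Illustrative week-start dates around the year boundary
d <- as.Date(c("2019-12-24", "2019-12-31", "2020-12-29"))

iso_check <- data.frame(
  date          = d,
  calendar_year = format(d, "%Y"),  # corresponds to lubridate::year()
  iso_year      = format(d, "%G"),  # corresponds to lubridate::isoyear()
  iso_week      = format(d, "%V"),  # corresponds to lubridate::isoweek()
  stringsAsFactors = FALSE
)

# 2019-12-31 falls in ISO week 01 of ISO year 2020,
# while its calendar year is still 2019
iso_check
```

If both datasets ever needed matching across the boundary, pairing `isoweek()` with `isoyear()` on both sides would keep the keys consistent.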

Loading, Examining, & Pre-Processing Stringency Data

The COVID-19 stringency dataset was also loaded. Its column names were changed to match those in the running dataset, and the daily values were aggregated to a mean weekly stringency index to align with the running dataset’s weekly time frame. This weekly data was then merged with the running dataset.

#load covid stringency data
stringency_raw <- read_csv("covid-stringency-index.csv")

glimpse(stringency_raw)
## Rows: 74,056
## Columns: 4
## $ Entity           <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghani…
## $ Code             <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG…
## $ Date             <date> 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020…
## $ stringency_index <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#rename column names
stringency_raw <- stringency_raw %>%
  rename(
    country = Entity,
    country_code = Code,
    date = Date) %>%
  mutate(date = as.Date(date))

#aggregate stringency values to weekly instead of daily
stringency_weekly <- stringency_raw %>%
  mutate(
    year = year(date),
    week = isoweek(date)) %>%
  group_by(country, year, week) %>%
  summarise(stringency_index = mean(stringency_index, na.rm = TRUE),
            .groups = "drop")

Merging Datasets

Both the running dataset and the stringency dataset were merged into one full_data dataset. Then, another column named lockdown_level was created to group the stringency index into minimal, low, or high levels.

Since the stringency index only exists from 2020, the training data for 2019 was labelled Pre-Pandemic (2019). Stringency indices below 20 (including no restrictions at all) were labelled Minimal Restrictions (<20). Entries labelled Low Stringency were those with indices from 20 to 59, and those of 60 and above were labelled High Stringency. In the code, 2020 records with no stringency value are also placed in the Minimal Restrictions group.

# Join stringency onto running data by country + year + week
full_data <- run_data_clean %>%
  left_join(stringency_weekly, by = c("country", "year", "week"))

# create a case when to label the level of stringency based on stringency index
full_data <- full_data %>%
  mutate(
    lockdown_level = case_when(
      year == 2019                  ~ "Pre-Pandemic (2019)",
      stringency_index >= 60        ~ "High Stringency (≥60)",
      stringency_index >= 20        ~ "Low Stringency (20–59)",
      stringency_index < 20         ~ "Minimal Restrictions (<20)",
      is.na(stringency_index)       ~ "Minimal Restrictions (<20)"  # 2020 weeks with no stringency record are treated as minimal
    ),
    lockdown_level = factor(lockdown_level, levels = c(
      "Pre-Pandemic (2019)",
      "Minimal Restrictions (<20)",
      "Low Stringency (20–59)",
      "High Stringency (≥60)"
    ))
  )
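Because the merge matches on country names, and unmatched 2020 rows end up in the Minimal Restrictions group, it can be useful to confirm how many rows failed to match. The sketch below is a base R illustration on toy stand-ins for `run_data_clean` and `stringency_weekly`; the spelling mismatch between "East Timor" and "Timor" is hypothetical and chosen only to show what the check reports.

```r
# Toy stand-ins for run_data_clean and stringency_weekly (hypothetical values)
run_toy <- data.frame(
  country = c("Japan", "East Timor", "Japan"),
  year    = c(2020, 2020, 2020),
  week    = c(10, 10, 11),
  stringsAsFactors = FALSE
)
stringency_toy <- data.frame(
  country          = c("Japan", "Japan", "Timor"),
  year             = c(2020, 2020, 2020),
  week             = c(10, 11, 10),
  stringency_index = c(25, 40, 30),
  stringsAsFactors = FALSE
)

# Countries in the running data that never appear in the stringency data
unmatched <- setdiff(unique(run_toy$country), unique(stringency_toy$country))
unmatched

# Share of rows left without a stringency value after a left join
joined <- merge(run_toy, stringency_toy,
                by = c("country", "year", "week"), all.x = TRUE)
share_missing <- mean(is.na(joined$stringency_index))
share_missing
```

Applied to the real data, a non-empty `unmatched` vector would point to country names that need recoding before the join.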

The table below shows the number of athletes and training records per continent. For simplicity, this report focuses on one continent, Asia, which has 2,802 athletes and 291,408 rows of training data. Again, this is because Asia was the first continent to be exposed to the novel coronavirus, which may reveal interesting patterns during the analysis phase.

full_data %>%
  group_by(continent) %>%
  summarise(
    athletes = n_distinct(athlete),
    records  = n()
  ) %>%
  arrange(desc(athletes)) %>%
  knitr::kable(caption = "Athletes and Records per Continent")
Athletes and Records per Continent

| continent | athletes | records |
|---|---|---|
| Americas | 16712 | 1738048 |
| Europe | 15731 | 1636024 |
| Asia | 2802 | 291408 |
| Oceania | 640 | 66560 |
| Africa | 198 | 20592 |

Moving forward, the report focuses on Asia. The dataset is filtered to keep only Asian athletes and only the variables relevant to the report. Since the report focuses on run distance, only this metric is kept, and run duration and pace are dropped.

#keep Asian athletes (rows) and the needed variables (columns)
full_data <- full_data %>%
  filter(continent == "Asia") %>%
  select(year, datetime, athlete, gender, age_group, country, distance, stringency_index, lockdown_level) %>%
  mutate(year = as.character(year))

glimpse(full_data)
## Rows: 291,408
## Columns: 9
## $ year             <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2019…
## $ datetime         <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019…
## $ athlete          <dbl> 18, 21, 60, 64, 77, 84, 87, 102, 163, 181, 208, 223, …
## $ gender           <fct> Male, Male, Male, Female, Male, Female, Male, Male, M…
## $ age_group        <fct> 35 - 54, 18 - 34, 35 - 54, 18 - 34, 35 - 54, 55 +, 35…
## $ country          <chr> "Japan", "Malaysia", "Taiwan", "Taiwan", "Taiwan", "J…
## $ distance         <dbl> 52.53, 58.47, 13.32, 0.00, 40.10, 16.51, 60.80, 22.62…
## $ stringency_index <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lockdown_level   <fct> Pre-Pandemic (2019), Pre-Pandemic (2019), Pre-Pandemi…

Removing Outliers

To ensure that the entries reflect realistic running performance, outliers are removed based on both domain knowledge about running behaviour and the interquartile range (IQR). A quick check of the values shows that the data is heavily right-skewed, so the IQR method is used to identify outliers. Z-scores were not used because their cut-offs assume an approximately normal distribution, which this dataset does not have.

Remove Implausible Values

Weekly distances of 1 km or less are implausible as a training record and were removed prior to outlier detection. These may be weeks with no recorded running, or runs that athletes recorded by accident without actually running. As a result, the analysis describes weeks in which athletes actually ran.

#remove weeks with a distance of 1 km or less (including zero-distance weeks)
full_data_clean <- full_data %>%
  filter(distance > 1)

Examine Distribution of Distance

The histogram and boxplot of the distance variable show extreme outliers and confirm that the distribution is heavily right-skewed, with the majority of Asian athletes running less than 80 km per week. Furthermore, the boxplot flags values beyond roughly 100 km as outliers, supporting the use of the IQR method to remove them.

p1 <- ggplot(full_data_clean, aes(x = distance)) +
  geom_histogram(bins = 10, fill = "#56B4E9", colour = "white") +
  labs(title = "Distribution of Distance", x = "Distance (km)", y = "Count") +
  theme_minimal()

p2 <- ggplot(full_data_clean, aes(x = distance)) +
  geom_boxplot(fill = "#56B4E9", outlier.alpha = 0.1, outlier.size = 0.5) +
  labs(title = "Boxplot of Distance", x = "Distance (km)", y = NULL) +
  theme_minimal()

p1 | p2

Applying the IQR Method

The interquartile range (IQR) is used to define the range of values that is kept. More specifically, values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are identified as outliers and removed, where Q1 and Q3 are the 1st and 3rd quartiles.

#distance ranges
Q1_dist <- quantile(full_data_clean$distance, 0.25)
Q3_dist <- quantile(full_data_clean$distance, 0.75)
IQR_dist <- Q3_dist - Q1_dist

#filter data based on the IQRs
full_data_clean <- full_data_clean %>%
  filter(
    distance >= Q1_dist - 1.5 * IQR_dist,
    distance <= Q3_dist + 1.5 * IQR_dist)
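As a small worked illustration of the fence rule, the base R sketch below applies it to ten made-up weekly distances (the values in `x` are hypothetical). It also shows a property that matters for right-skewed, strictly positive data such as distance: the lower fence can fall below zero, in which case only the upper fence removes anything.

```r
# Ten hypothetical weekly distances (km), one of them extreme
x <- c(5, 12, 18, 25, 31, 40, 48, 60, 75, 210)

q1  <- quantile(x, 0.25, names = FALSE)   # 19.75
q3  <- quantile(x, 0.75, names = FALSE)   # 57
iqr <- q3 - q1                            # 37.25

lower <- q1 - 1.5 * iqr                   # -36.125, below any possible distance
upper <- q3 + 1.5 * iqr                   # 112.875

# Keep values inside the fences; only 210 is dropped
kept <- x[x >= lower & x <= upper]
kept
```

The quartile values in the comments use R's default quantile method (type 7); other quantile definitions would shift the fences slightly.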

Distribution After Outlier Removal

With the outliers removed, the histogram and boxplot below are more interpretable. The histogram now shows that most Asian athletes run a weekly distance of around 15-20 km. The boxplot still flags some weeks at around 100 km, because its quartiles are recalculated on the trimmed data, but these distances fall within the IQR range applied above and represent legitimate high-volume training weeks.

p1 <- ggplot(full_data_clean, aes(x = distance)) +
  geom_histogram(bins = 10, fill = "#56B4E9", colour = "white") +
  labs(title = "Distribution of Distance", x = "Distance (km)", y = "Count") +
  theme_minimal()

p2 <- ggplot(full_data_clean, aes(x = distance)) +
  geom_boxplot(fill = "#56B4E9", outlier.alpha = 0.1, outlier.size = 0.5) +
  labs(title = "Boxplot of Distance", x = "Distance (km)", y = NULL) +
  theme_minimal()

p1 | p2

Descriptive Statistics and Visualisation

This segment explores key patterns and trends in the running behaviour of Asian athletes across 2019 and 2020. Data visualisations and summary statistics are used to identify any changes that may be attributable to the COVID-19 pandemic, and to characterise the distribution of weekly running distance.

Mean Weekly Running Distance by Country

The table below summarises the weekly running distance for each country in Asia, providing an initial overview of how training volume differs across countries. Mean weekly distance ranges from 10.93 km in Mongolia to 72.96 km in Lebanon.

table1 <- full_data_clean %>%
  group_by(country) %>%
  summarise(
    Min = min(distance, na.rm = TRUE),
    Mean = mean(distance, na.rm = TRUE),
    Max = max(distance, na.rm = TRUE),
    SD = sd(distance, na.rm = TRUE)) 

knitr::kable(table1, caption = "Weekly Running Distance by Country")
Weekly Running Distance by Country

| country | Min | Mean | Max | SD |
|---|---|---|---|---|
| Afghanistan | 1.940 | 31.45095 | 81.550 | 18.369030 |
| Armenia | 1.240 | 26.43627 | 98.170 | 23.821443 |
| Azerbaijan | 1.600 | 31.82224 | 99.380 | 20.138312 |
| Bahrain | 2.410 | 42.16794 | 101.690 | 23.834114 |
| Brunei | 3.670 | 14.74541 | 44.980 | 8.360544 |
| China | 1.010 | 35.59482 | 102.290 | 24.036517 |
| Cyprus | 1.160 | 42.50952 | 101.360 | 24.489902 |
| East Timor | 2.220 | 20.11997 | 60.840 | 11.128419 |
| India | 1.010 | 36.32849 | 102.170 | 22.547213 |
| Indonesia | 1.010 | 28.40569 | 102.180 | 19.634022 |
| Iran | 3.440 | 23.35766 | 65.300 | 12.877387 |
| Israel | 1.150 | 35.66561 | 102.230 | 24.681073 |
| Japan | 1.010 | 33.32658 | 102.300 | 22.921029 |
| Jordan | 1.230 | 31.98363 | 96.380 | 17.859288 |
| Kazakhstan | 1.690 | 40.54369 | 102.179 | 27.582594 |
| Kuwait | 2.009 | 18.12398 | 59.270 | 12.402453 |
| Laos | 5.040 | 14.89624 | 36.889 | 8.017799 |
| Lebanon | 8.050 | 72.96030 | 101.510 | 23.187811 |
| Malaysia | 1.020 | 27.98087 | 102.110 | 20.938736 |
| Maldives | 3.000 | 35.70532 | 72.560 | 21.544160 |
| Mongolia | 1.170 | 10.92718 | 29.880 | 8.801318 |
| Myanmar | 1.720 | 47.71716 | 102.140 | 25.250906 |
| Philippines | 1.010 | 27.13469 | 102.019 | 19.286379 |
| Saudi Arabia | 1.010 | 28.91812 | 99.220 | 21.306816 |
| Singapore | 1.010 | 34.47856 | 102.260 | 23.564462 |
| South Korea | 1.010 | 34.87827 | 102.290 | 23.229860 |
| Taiwan | 1.010 | 35.31146 | 102.260 | 24.461123 |
| Thailand | 1.010 | 31.01599 | 102.218 | 21.553591 |
| Turkey | 1.130 | 36.70298 | 102.220 | 24.301293 |
| United Arab Emirates | 1.010 | 31.48995 | 102.170 | 21.195461 |
| Uzbekistan | 1.010 | 38.72967 | 89.990 | 28.760910 |
| Vietnam | 1.020 | 26.62017 | 82.220 | 17.279973 |
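Country means at the extremes of a table like this can rest on very few athletes, so it may help to report group sizes next to the mean. Whether that applies to any country here has not been checked; the sketch below only shows the idea on made-up data (the `toy` countries, athletes, and distances are hypothetical), using base R. In the report's dplyr code the equivalent would be adding `n = n()` and `n_distinct(athlete)` inside `summarise()`.

```r
# Toy data (hypothetical): country B has a single record from one athlete
toy <- data.frame(
  country  = c("A", "A", "A", "A", "B"),
  athlete  = c(1, 1, 2, 2, 3),
  distance = c(30, 34, 28, 32, 73),
  stringsAsFactors = FALSE
)

# One summary row per country: mean distance, record count, athlete count
by_country <- do.call(rbind, lapply(split(toy, toy$country), function(g) {
  data.frame(
    country  = g$country[1],
    mean_km  = mean(g$distance),
    records  = nrow(g),
    athletes = length(unique(g$athlete))
  )
}))

by_country
```

A mean backed by one or two athletes describes those individuals more than it describes the country.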

Distribution of Weekly Distance

The boxplot below shows the overall distribution of weekly distance across Asian runners from 2019 to 2020. The middle half of weekly distances falls between roughly 14 and 48 km, while a smaller proportion of athletes train at considerably higher volumes, hence the outliers.

# Boxplot of weekly distance
full_data_clean %>%
  ggplot(aes(y = distance)) +
  geom_boxplot(fill = "#56B4E9", outlier.alpha = 0.1, outlier.size = 0.5) +
  labs(
    title = "Weekly Running Distance",
    y = "Distance (km)") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

Mean Weekly Distance Over Time

The time series below plots the mean weekly running distance across all Asian athletes from January 2019 to December 2020. The red dashed line marks the date when the WHO declared COVID-19 a global pandemic (March 11, 2020).

Further, several notable patterns are seen in this time series:

# Aggregate mean weekly distance across all athletes
weekly_avg <- full_data_clean %>%
  group_by(datetime) %>%
  summarise(mean_distance = mean(distance, na.rm = TRUE), .groups = "drop")

ggplot(weekly_avg, aes(x = datetime, y = mean_distance)) +
  geom_line(colour = "#009E73") +
  geom_vline(xintercept = as.Date("2020-03-11"), #pandemic WHO declaration
             colour = "red", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = as.Date("2020-03-11"), y = Inf,
           label = "WHO Pandemic", 
           hjust = -0.1, vjust = 1.5, size = 2.5, colour = "red") +
  labs(
    title  = "Mean Weekly Running Distance - Asia (2019–2020)",
    x = "Date",
    y = "Mean Distance (km)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Descriptive Statistics and Visualisation (Cont.)

Mean Weekly Distance for 2019 and 2020

The bar chart below compares the mean weekly running distance between 2019 and 2020: 33.52 km in 2019 against 32.84 km in 2020. This small decline in 2020 provides initial visual evidence that the COVID-19 pandemic may have reduced training volume. The difference is formally tested in the hypothesis testing section.

table3 <- full_data_clean %>%
  group_by(year) %>%
  summarise(
    Mean_Distance = round(mean(distance, na.rm = TRUE), 2),
    SD_Distance = round(sd(distance, na.rm = TRUE), 2),
    n = n()) %>%
  mutate(year = factor(year))

ggplot(table3, aes(x = year, y = Mean_Distance, fill = year)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = Mean_Distance), 
            vjust = -0.5,
            size  = 4,
            fontface = "bold") +
  scale_fill_manual(values = c("2019" = "#56B4E9",
                                "2020" = "#E69F00")) +
  labs(
    title = "Mean Weekly Running Distance - Asia: 2019 vs 2020",
    x = "Year",
    y = "Mean Distance (km)",
    fill = "Year") +
  theme_minimal()

Mean Weekly Distance by Lockdown Level

The mean running distance for 2020 is compared across all different lockdown levels. The pre-pandemic category in 2019 serves as the baseline (33.52 km) while the remaining categories reflect increasing levels of restriction severity during 2020.

The values below suggest that the mean weekly distance was broadly similar across restriction levels, even as stringency increased (the low stringency mean distance was 32.55 km, and the high stringency mean distance was 32.65 km). Notably, athletes under Minimal Restrictions (<20) appear to have run more than the pre-pandemic baseline, with a mean distance of 34.41 km. This may reflect a period in which athletes maintained or increased their training distance before stricter restrictions were imposed. This observation is formally examined in the regression analysis segment of the report.

table4 <- full_data_clean %>%
  group_by(lockdown_level) %>%
  summarise(
    Mean_Distance = round(mean(distance, na.rm = TRUE), 2),
    SD_Distance = round(sd(distance, na.rm = TRUE), 2),
    n = n())

ggplot(table4, aes(x = lockdown_level, y = Mean_Distance, fill = lockdown_level)) +
  geom_col() +
  geom_text(aes(label = Mean_Distance), 
            vjust = -0.5,
            size = 4,
            fontface = "bold") +
  scale_fill_manual(values = c(
    "Pre-Pandemic (2019)"        = "#56B4E9",
    "Minimal Restrictions (<20)" = "#009E73",
    "Low Stringency (20–59)"     = "#E69F00",
    "High Stringency (≥60)"      = "#D55E00")) +
  labs(
    title = "Mean Weekly Running Distance by Lockdown Level - Asia",
    x = "Lockdown Level",
    y = "Mean Distance (km)",
    fill = "Lockdown Level") +
  ylim(0, 40) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Hypothesis Testing

Hypotheses

This section investigates whether the mean weekly distance run by Asian athletes in 2019 (pre-pandemic) was significantly different from that in 2020 (pandemic). A significance level of α = 0.05 is used throughout.

\[H_0: \mu_{2019} = \mu_{2020}\]

The null hypothesis states that the mean weekly running distance of Asian athletes in 2019 is equal to that of 2020.

\[H_A: \mu_{2019} \ne \mu_{2020}\]

The alternative hypothesis states that the mean weekly running distance of Asian athletes in 2019 differs from that of 2020.

Descriptive Statistics

First, the data is split into pre-pandemic (2019) and pandemic (2020). Then, prior to testing, some descriptive statistics are examined by year.

# Split data into 2019 and 2020
data_2019_clean <- full_data_clean %>% filter(year == "2019")
data_2020_clean <- full_data_clean %>% filter(year == "2020")

The table below summarises the distribution of distance for 2019 and 2020, and the boxplot provides a visual comparison between the two periods.

table_desc <- full_data_clean %>% 
  group_by(year) %>% 
  summarise(Min = min(distance, na.rm = TRUE),
            Q1 = quantile(distance, probs = .25, na.rm = TRUE),
            Median = median(distance, na.rm = TRUE),
            Q3 = quantile(distance, probs = .75, na.rm = TRUE),
            Max = max(distance, na.rm = TRUE),
            Mean = mean(distance, na.rm = TRUE),
            SD = sd(distance, na.rm = TRUE),
            n = n()) 

knitr::kable(table_desc, caption = "Weekly Running Distance by Year")
Weekly Running Distance by Year

| year | Min | Q1 | Median | Q3 | Max | Mean | SD | n |
|---|---|---|---|---|---|---|---|---|
| 2019 | 1.01 | 15.01 | 29.130 | 48.08000 | 102.29 | 33.51891 | 22.62318 | 106874 |
| 2020 | 1.01 | 13.72 | 27.435 | 47.46975 | 102.30 | 32.84393 | 23.23298 | 94074 |

The boxplot below suggests that weekly distances in 2020 sit slightly lower than in 2019 (the median drops from about 29.1 km to about 27.4 km). However, a hypothesis test is needed to determine whether this decrease is statistically significant, rather than relying on visual inspection alone.

# Boxplot of distance by year
full_data_clean %>%
  ggplot(aes(x = year, y = distance, fill = year)) +
  geom_boxplot(outlier.alpha = 0.1, outlier.size = 0.3) +
  scale_fill_manual(values = c("2019" = "#56B4E9", "2020" = "#E69F00")) +
  labs(
    title = "Weekly Running Distance: 2019 vs 2020 in Asia",
    x = "Year",
    y = "Distance (km)",
    fill = "Year") +
  theme_minimal()

Hypothesis Testing (cont.)

Check Assumptions

Normality: Normality of the distance variable is assessed using QQ plots. As seen below, the tails deviate from the normal line, indicating that the raw data is not normally distributed. However, as each group contains far more than 30 observations, the Central Limit Theorem (CLT) applies, and the sampling distribution of the mean will be approximately normal regardless of the shape of the raw data. The normality assumption is therefore treated as satisfied.

ggqqplot(full_data_clean, x = "distance", color = "year")
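The CLT argument can be illustrated with a short simulation. The base R sketch below draws from an exponential distribution as a stand-in for right-skewed distances (an assumption made for illustration; it is not the actual distance distribution) and compares the skewness of the raw values with the skewness of repeated sample means. The `skewness()` helper is defined here and is not from any package.

```r
set.seed(42)

# Right-skewed stand-in population with a mean of about 30
population <- rexp(100000, rate = 1 / 30)

# Simple moment-based skewness (0 for a symmetric distribution)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Means of 2000 repeated samples of 500 observations each
sample_means <- replicate(2000, mean(sample(population, 500)))

skew_raw   <- skewness(population)    # strongly positive
skew_means <- skewness(sample_means)  # close to 0

c(raw = skew_raw, means = skew_means)
```

With group sizes near 100,000, as in this report, the sample means would be even closer to normal than in this simulation.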

Homogeneity of variances: Levene’s test is used to check whether the variances of 2019 and 2020 are equal. If they are not (p < 0.05), Welch’s test must be used instead of the standard two-sample t-test. As seen below, Levene’s test returned a p-value < 0.05. Therefore, the 2019 and 2020 distances do not have equal variances, and Welch’s t-test (var.equal = FALSE) is used.

# Run Levene's test once on the full dataset (avoids the deprecated cur_data())
levene_fit <- leveneTest(distance ~ factor(year), data = full_data_clean)

levene_results <- tibble(
  levene_p = round(levene_fit$`Pr(>F)`[1], 4), #round off
  `Equal Variance?` = ifelse(levene_p > 0.05, "TRUE", "FALSE"))

knitr::kable(levene_results, caption = "Levene's Test for Homogeneity of Variances")
Levene’s Test for Homogeneity of Variances

| levene_p | Equal Variance? |
|---|---|
| 0 | FALSE |
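For readers curious about what the test computes, Levene's test with a median centre (the default in `car::leveneTest`, also known as the Brown-Forsythe variant) is a one-way ANOVA on each observation's absolute deviation from its group median. The base R sketch below reproduces that logic on simulated data; the `toy` values are made up, with deliberately different spreads, and are not the report's data.

```r
set.seed(1)

# Two simulated groups with the same mean but different spread
toy <- data.frame(
  year     = factor(rep(c("2019", "2020"), each = 200)),
  distance = c(rnorm(200, mean = 33, sd = 5),
               rnorm(200, mean = 33, sd = 15))
)

# Absolute deviation of each value from its own group's median
group_median <- ave(toy$distance, toy$year, FUN = median)
toy$abs_dev  <- abs(toy$distance - group_median)

# One-way ANOVA on the deviations gives the Levene F statistic and p-value
levene_manual <- anova(lm(abs_dev ~ year, data = toy))
levene_p <- levene_manual$`Pr(>F)`[1]

levene_p < 0.05   # TRUE here: the spreads differ
```

With very large samples, even small differences in spread produce tiny p-values, which is one reason Welch's test is a reasonable default.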

Run Hypothesis Test

First, the distances for each year (2019 and 2020) are pulled.

asia_2019 <- full_data_clean %>% filter(year == "2019") %>% pull(distance)
asia_2020 <- full_data_clean %>% filter(year == "2020") %>% pull(distance)

Then, the t-test is run. Again, since the Levene’s test p-value is < 0.05, Welch’s test is used (var.equal = FALSE).

test_asia <- t.test(asia_2019, asia_2020, var.equal = FALSE, alternative = "two.sided")

The results are summarised in the table below. Between 2019 and 2020, the difference in Asian athletes’ mean weekly distance is small. However, the t-test shows that this difference is statistically significant.

results_table <- data.frame(
  Continent = "Asia",
  Mean_2019 = round(mean(asia_2019), 2),
  Mean_2020 = round(mean(asia_2020), 2),
  Difference = round(mean(asia_2020) - mean(asia_2019), 2), #2020 minus 2019
  T_Stat = round(test_asia$statistic, 3),                   #t and CI are for 2019 minus 2020
  DF = round(test_asia$parameter, 1),
  P_Value = round(test_asia$p.value, 4),
  CI_Lower = round(test_asia$conf.int[1], 2),
  CI_Upper = round(test_asia$conf.int[2], 2),
  Significant = ifelse(test_asia$p.value < 0.05, "Yes", "No")) #derived from the p-value

knitr::kable(results_table,
             caption = "Two-Sample T-Test Results: 2019 vs 2020 Distance - Asia")
Two-Sample T-Test Results: 2019 vs 2020 Distance - Asia
Continent Mean_2019 Mean_2020 Difference T_Stat DF P_Value CI_Lower CI_Upper Significant
t Asia 33.52 32.84 -0.67 6.579 196282 0 0.47 0.88 Yes

Hypothesis Testing Results

Results

The p-value is less than 0.05, and the 95% confidence interval does not contain 0. Therefore, we reject the null hypothesis.

Conclusion

Welch’s two-sample t-test for Asia was statistically significant (t = 6.579, df = 196,282, p < 0.001). The 95% confidence interval for the difference in means (2019 minus 2020), [0.47, 0.88] km, did not capture 0, indicating that the null hypothesis can be rejected. Mean weekly distance in 2020 (32.84 km) was significantly lower than in 2019 (33.52 km), with an observed difference of -0.67 km.

While statistically significant, the magnitude of this difference is modest, with a mean decrease of just 0.67 km per week. However, the direction of the effect (running distance decreasing in 2020) aligns with the bigger narrative that COVID-19 restrictions negatively affected the training behaviour of runners. The extent to which government stringency specifically predicts running distance is examined further in the next section.
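One common way to express how modest a difference is, independent of sample size, is a standardised effect size such as Cohen’s d. The sketch below is illustrative only: `cohens_d()` is a hypothetical helper, the vectors are simulated, and the standard deviation of 23 km is an assumption loosely based on the residual standard error reported later in the regression.

```r
# Minimal sketch: cohens_d() is a hypothetical helper, and the data are simulated
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled standard deviation across the two groups
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

set.seed(2020)
sim_2019 <- rnorm(5000, mean = 33.5, sd = 23)  # assumed spread, not from the data
sim_2020 <- rnorm(5000, mean = 32.8, sd = 23)

d <- cohens_d(sim_2019, sim_2020)
d  # values below 0.2 are conventionally described as a small effect
```

With the real vectors, the same helper could be called as `cohens_d(asia_2019, asia_2020)`. A mean gap of under 1 km against a spread of roughly 20 km or more would give a very small d, consistent with the interpretation above.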

Multiple Linear Regression

This section investigates if the government stringency level, gender, age group, and country can predict the weekly running distance of Asian athletes during 2020. A multiple linear regression model is fitted using only 2020 data, as the stringency index is only available for this period.

# Filter to 2020 only
data_2020 <- full_data_clean %>% 
  filter(year == 2020, !is.na(stringency_index))

Visualising the Relationship

The scatter plot below visualises the possible relationship between the stringency index and the distances run by athletes. Since the dataset contains a large number of individual records, the stringency index is rounded to whole numbers and the mean weekly distance is computed at each value, for visualisation purposes only. The regression model will still be fitted on individual-level observations.

data_2020_avg <- data_2020 %>%
  mutate(stringency_bin = round(stringency_index)) %>%  # round to whole numbers
  group_by(stringency_bin) %>%
  summarise(mean_distance = mean(distance, na.rm = TRUE),
            n = n(),
            .groups = "drop")

The scatter plot below shows a very slight negative slope: as stringency increased, athletes’ mean weekly distance decreased slightly.

While the difference in mean distance run in Asia between 2019 and 2020 was statistically significant (from the hypothesis test in the previous section), the stringency index itself may not be a strong predictor of distance run in 2020.

ggplot(data_2020_avg, aes(x = stringency_bin, y = mean_distance)) +
  geom_point(aes(size = n), alpha = 0.6) +   # size by number of observations
  geom_smooth(method = "lm", se = TRUE, colour = "red", linewidth = 0.8) +
  scale_size_continuous(range = c(1, 6)) +
  labs(
    title = "Mean Weekly Running Distance vs Stringency Index in Asia (2020)",
    subtitle = "Point size reflects number of observations at each stringency level",
    x = "COVID-19 Stringency Index",
    y = "Mean Weekly Distance (km)") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Fitting the Model

A multiple linear regression model is fitted using stringency index, gender, age group, and country as predictors of weekly running distance.

The intercept of the model corresponds to a baseline athlete with these attributes:

* from Afghanistan (the first country alphabetically)
* Female
* 18 - 34 years old (the youngest age group)
* stringency index of 0

\[\text{distance} = \beta_0 + \beta_1(\text{stringency index}) + \beta_2(\text{gender}) + \beta_3(\text{age group}) + \beta_4(\text{country}) + \varepsilon\]
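In practice, each categorical term in this equation is expanded into a set of 0/1 indicator columns, with the first factor level absorbed into the intercept. The sketch below shows this with a toy data frame. The column names mirror the report’s variables, but the rows are made up for illustration.

```r
# Toy data only: column names mirror the report's variables, rows are made up
toy <- data.frame(
  gender    = factor(c("Female", "Male", "Male")),
  age_group = factor(c("18 - 34", "35 - 54", "55 +")),
  country   = factor(c("Afghanistan", "Japan", "Philippines"))
)

# model.matrix() shows the 0/1 columns that lm() builds behind the scenes
X <- model.matrix(~ gender + age_group + country, data = toy)

colnames(X)  # no columns for Female, 18 - 34, or Afghanistan (reference levels)
X[1, ]       # baseline athlete: intercept is 1, every indicator is 0
```

This is why the coefficient table lists `genderMale` but no `genderFemale`: each coefficient is a difference from the reference level, holding the other predictors fixed.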

model1 <- lm(distance ~ stringency_index + gender + age_group + country, data = data_2020)
model1 %>% summary()
## 
## Call:
## lm(formula = distance ~ stringency_index + gender + age_group + 
##     country, data = data_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.573 -18.570  -5.115  14.539  75.540 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  31.797200   2.452808  12.964  < 2e-16 ***
## stringency_index             -0.031724   0.003926  -8.080 6.57e-16 ***
## genderMale                    3.895603   0.210370  18.518  < 2e-16 ***
## age_group35 - 54              0.517905   0.184991   2.800 0.005117 ** 
## age_group55 +                -0.579058   0.336378  -1.721 0.085173 .  
## countryAzerbaijan            -2.757121   3.119053  -0.884 0.376719    
## countryBahrain                2.309793   3.122008   0.740 0.459398    
## countryBrunei               -20.809586   4.230299  -4.919 8.71e-07 ***
## countryChina                  2.944684   2.445936   1.204 0.228628    
## countryCyprus                 8.995736   2.621931   3.431 0.000602 ***
## countryIndia                  1.367375   2.462072   0.555 0.578639    
## countryIndonesia             -5.869439   2.447839  -2.398 0.016496 *  
## countryIran                  -3.285086   4.092766  -0.803 0.422175    
## countryIsrael                -0.478382   2.498194  -0.191 0.848141    
## countryJapan                 -1.246258   2.437632  -0.511 0.609172    
## countryJordan                 0.451656   3.017194   0.150 0.881006    
## countryKazakhstan             8.085302   2.680709   3.016 0.002561 ** 
## countryKuwait               -15.046716   3.861164  -3.897 9.75e-05 ***
## countryLaos                 -19.188631   5.120728  -3.747 0.000179 ***
## countryLebanon               47.483096   4.679782  10.146  < 2e-16 ***
## countryMalaysia              -6.984807   2.475604  -2.821 0.004782 ** 
## countryMongolia             -20.112953   6.413080  -3.136 0.001712 ** 
## countryMyanmar               16.683408   3.254535   5.126 2.96e-07 ***
## countryPhilippines           -7.622778   2.507979  -3.039 0.002371 ** 
## countrySaudi Arabia          -4.981555   2.783052  -1.790 0.073463 .  
## countrySingapore              2.113906   2.453099   0.862 0.388839    
## countrySouth Korea           -0.720911   2.473391  -0.291 0.770695    
## countryTaiwan                -0.271956   2.445368  -0.111 0.911448    
## countryThailand              -3.771169   2.444520  -1.543 0.122906    
## countryTurkey                 0.957620   2.498320   0.383 0.701494    
## countryUnited Arab Emirates  -2.854262   2.476420  -1.153 0.249087    
## countryUzbekistan           -20.426635   5.469381  -3.735 0.000188 ***
## countryVietnam               -6.774907   2.690815  -2.518 0.011811 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.95 on 93924 degrees of freedom
## Multiple R-squared:  0.02457,    Adjusted R-squared:  0.02424 
## F-statistic: 73.95 on 32 and 93924 DF,  p-value: < 2.2e-16

The estimated regression equation is:

\[\widehat{\text{distance}} = 31.797 - 0.0317(\text{stringency index}) + 3.896(\text{Male}) + 0.518(\text{Age}_{35\text{-}54}) - 0.579(\text{Age}_{55+}) + \ldots(\text{country})\]

Interpreting the Coefficients

Model Fit

Although several predictors were statistically significant, the model explained only a small proportion of the variance in running distance. More specifically, the model produced an R² of 0.0246, meaning it explains only about 2.46% of the variation in running distance.

This suggests that while stringency, gender, age group, and country are statistically associated with running distance, the majority of variability in individual training behaviour is driven by other unmeasured factors. Examples of these factors could be personal fitness levels, individual training plans, access to proper running routes, or personal motivation.
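As a quick check on the reported fit statistics, the adjusted R² can be reproduced from the multiple R². Here \(SS_{res}\) and \(SS_{tot}\) are the residual and total sums of squares, \(p = 32\) is the number of predictors, and \(n = 93{,}957\) is inferred from the residual degrees of freedom in the model output (\(n - p - 1 = 93{,}924\)).

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \qquad R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.02457)\,\frac{93956}{93924} \approx 0.02424\]

The two values are nearly identical because the sample is very large relative to the number of predictors, so the penalty for the extra country terms is negligible.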

Check Assumptions

par(mfrow = c(2, 2))
plot(model1)

par(mfrow = c(1, 1))

Overall, while minor assumption violations were identified, the large sample size mitigates their impact on the reliability of the regression results.

Multiple Linear Regression Results

Sample Predictions

To illustrate the practical application of the model, the predicted weekly running distance is estimated for a male athlete aged 35–54 from the Philippines during a period of moderate restriction (stringency index = 50).

prediction_data <- data.frame(
  stringency_index = 50,
  gender = "Male",
  age_group = "35 - 54",
  country = "Philippines")

predicted <- predict(model1, 
                     newdata = prediction_data, 
                     interval = "confidence")

predicted
##        fit      lwr      upr
## 1 27.00174 25.82261 28.18087

For a male athlete aged 35–54 from the Philippines during a period of moderate restriction (stringency index = 50), the predicted mean weekly running distance was 27.0 km (95% CI: 25.8 to 28.2 km). Once again, since the model has an R² of only 0.0246, predictions for individual athletes may not be accurate and should be interpreted cautiously, as many other factors influence running distance.
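The fitted value can be reproduced by hand from the coefficient table: start at the intercept, add the stringency term at an index of 50, then add the coefficients for Male, age 35 - 54, and the Philippines. The coefficients are rounded to three or four decimals here, so small rounding differences from the `predict()` output are expected.

\[\widehat{\text{distance}} = 31.797 - 0.0317(50) + 3.896 + 0.518 - 7.623 \approx 27.0 \text{ km}\]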

Conclusion

Overall, the regression model was statistically significant, F(32, 93924) = 73.95, p < 0.001. However, the model explained only about 2.46% of the variance in weekly running distance. Higher stringency index values were significantly associated with lower weekly running distance, suggesting a small but statistically significant negative relationship between COVID-19 restrictions and training behaviour. Male athletes and those aged 35–54 also ran significantly more than their respective baseline groups. All in all, these findings suggest that while COVID-19 restrictions did influence running behaviour, most of the variation in weekly training distance was not captured by the variables used in this model.

Discussion

Summary of Findings

This report investigated the running behaviour of athletes in Asia using training data from 2019 to 2020, considering the Oxford COVID-19 Government Response Stringency Index.

Strengths and Limitations

Strengths:

Limitations:

Future Investigations

For future investigations, the following can be considered:

Conclusion

This report examined the impact of COVID-19 restrictions on the weekly running behaviour of Asian athletes between 2019 and 2020. The analysis focused on Asia, which was the first continent to experience COVID-19 outbreaks.

After performing exploratory data analysis, hypothesis testing, and multiple linear regression, the main finding is that COVID-19 restrictions did significantly reduce weekly running distance among Asian athletes, but the effect was small in practical terms. The pandemic may be one of many factors that influenced an athlete’s training behaviour, but as supported by the low explanatory power of the regression model, running remains a deeply individual activity shaped by far more than government policy or the pandemic alone.

References

Acknowledgement of Gen AI Use

This assignment was completed with the assistance of Claude (Anthropic, 2026), particularly for coding with R, data visualisation, debugging, and exploring different approaches to analyse the results and understand the dataset. However, the direction of the analysis, interpretations, and conclusions are from the author.

Aside from Generative AI, the course modules and weekly exercises (Tafakori, 2026) also provided guidance throughout the report. The R files were used as a reference when organising the overall report, and identifying the flow of the hypothesis tests and regression analysis.

Reference List

Hale, T., Angrist, N., Goldszmidt, R., Kira, B., Petherick, A., Phillips, T., Webster, S., Cameron-Blake, E., Hallas, L., Majumdar, S., & Tatlow, H. (2021). A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nature Human Behaviour, 5, 529–538. https://doi.org/10.1038/s41562-021-01079-8

Mexwell. (2023). Long-distance running dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/mexwell/long-distance-running-dataset/data

Our World in Data. (2021). COVID-19 government response stringency index. https://ourworldindata.org/covid-stringency-index

Tafakori, L. (2026, May 10). Week 9 During Class Worksheet [R File]. Canvas. https://rmit.instructure.com/courses/158207/pages/week-9-during-class-2?module_item_id=8070977

Tafakori, L. (2026, May 17). Week 10 During Class Worksheet [R File]. Canvas. https://rmit.instructure.com/courses/158207/pages/week-10-during-class?module_item_id=8070982

Tafakori, L. (2026, May 24). Week 11 During Class Worksheet [R File]. Canvas. https://rmit.instructure.com/courses/158207/pages/week-11-during-class?module_item_id=8070987

World Health Organization. (2020). WHO timeline — COVID-19. https://www.who.int/news/item/27-04-2020-who-timeline---covid-19