Introduction

Research Question: Do females, on average, have a higher run time than males in the Cherry Blossom Run of 2009?

The dataset I selected provides information on the Cherry Blossom Run of 2009, which, according to the source, is an “annual road race held in Washington, D.C.” There are 14,974 observations and 14 variables: place, time, net_time, pace, age, gender, first (first name), last (last name), city, state, country, div, div_place, and div_tot. Each row holds the information for one runner who participated in this run.

The variables I will be focusing on are time and gender. Gender is a categorical variable, where “F” stands for female and “M” stands for male. Time is a quantitative variable that records how long the runner took to complete the race. I will use these variables to test whether females have higher run times than males by conducting a two-sample t-test.

The link to the source: https://www.openintro.org/data/index.php?data=run09. This dataset was retrieved from the OpenIntro.org Data Sets page.

1. Importing my dataset and libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

blossom_run <- read_csv("C:/DATA101/run09.csv")
## Rows: 14974 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): gender, first, last, city, state, country
## dbl (8): place, time, net_time, pace, age, div, div_place, div_tot
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Exploring my data - EDA Functions

str(blossom_run)
## spc_tbl_ [14,974 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ place    : num [1:14974] 1 2 3 4 5 6 7 8 9 10 ...
##  $ time     : num [1:14974] 53.5 53.9 54 54.4 54.5 ...
##  $ net_time : num [1:14974] 53.5 53.9 54 54.4 54.5 ...
##  $ pace     : num [1:14974] 5.37 5.4 5.4 5.45 5.45 ...
##  $ age      : num [1:14974] 21 21 22 19 36 28 25 31 23 26 ...
##  $ gender   : chr [1:14974] "F" "F" "F" "F" ...
##  $ first    : chr [1:14974] "Lineth" "Belianesh Zemed" "Teyba" "Abebu" ...
##  $ last     : chr [1:14974] "Chepkurui" "Gebre" "Naser" "Gelan" ...
##  $ city     : chr [1:14974] "Kenya" "Ethiopia" "Ethiopia" "Ethiopia" ...
##  $ state    : chr [1:14974] "NR" "NR" "NR" "NR" ...
##  $ country  : chr [1:14974] "KEN" "ETH" "ETH" "ETH" ...
##  $ div      : num [1:14974] 2 2 2 1 5 3 3 4 2 3 ...
##  $ div_place: num [1:14974] 1 2 3 1 1 1 2 1 4 3 ...
##  $ div_tot  : num [1:14974] 953 953 953 71 1130 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   place = col_double(),
##   ..   time = col_double(),
##   ..   net_time = col_double(),
##   ..   pace = col_double(),
##   ..   age = col_double(),
##   ..   gender = col_character(),
##   ..   first = col_character(),
##   ..   last = col_character(),
##   ..   city = col_character(),
##   ..   state = col_character(),
##   ..   country = col_character(),
##   ..   div = col_double(),
##   ..   div_place = col_double(),
##   ..   div_tot = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(blossom_run)
## # A tibble: 6 × 14
##   place  time net_time  pace   age gender first  last  city  state country   div
##   <dbl> <dbl>    <dbl> <dbl> <dbl> <chr>  <chr>  <chr> <chr> <chr> <chr>   <dbl>
## 1     1  53.5     53.5  5.37    21 F      Lineth Chep… Kenya NR    KEN         2
## 2     2  53.9     53.9  5.4     21 F      Belia… Gebre Ethi… NR    ETH         2
## 3     3  54.0     54.0  5.4     22 F      Teyba  Naser Ethi… NR    ETH         2
## 4     4  54.4     54.4  5.45    19 F      Abebu  Gelan Ethi… NR    ETH         1
## 5     5  54.4     54.4  5.45    36 F      Cathe… Nder… Kenya NR    KEN         5
## 6     6  54.5     54.5  5.47    28 F      Olga   Roma… Russ… NR    RUS         3
## # ℹ 2 more variables: div_place <dbl>, div_tot <dbl>
summary(blossom_run)
##      place           time           net_time           pace       
##  Min.   :   0   Min.   : 45.93   Min.   : 45.93   Min.   : 4.600  
##  1st Qu.:1871   1st Qu.: 89.65   1st Qu.: 83.75   1st Qu.: 8.383  
##  Median :3743   Median :103.77   Median : 93.85   Median : 9.400  
##  Mean   :3790   Mean   :103.46   Mean   : 94.26   Mean   : 9.433  
##  3rd Qu.:5615   3rd Qu.:117.02   3rd Qu.:104.15   3rd Qu.:10.417  
##  Max.   :8323   Max.   :169.62   Max.   :154.37   Max.   :15.450  
##                                  NA's   :1                        
##       age           gender             first               last          
##  Min.   : 7.00   Length:14974       Length:14974       Length:14974      
##  1st Qu.:27.00   Class :character   Class :character   Class :character  
##  Median :32.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :34.99                                                           
##  3rd Qu.:41.00                                                           
##  Max.   :85.00                                                           
##  NA's   :1                                                               
##      city              state             country               div        
##  Length:14974       Length:14974       Length:14974       Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 4.000  
##                                                           Mean   : 4.594  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :14.000  
##                                                           NA's   :1       
##    div_place       div_tot    
##  Min.   :   0   Min.   :   0  
##  1st Qu.: 208   1st Qu.: 742  
##  Median : 490   Median :1130  
##  Mean   : 653   Mean   :1305  
##  3rd Qu.: 922   3rd Qu.:1678  
##  Max.   :2706   Max.   :2706  
##  NA's   :1      NA's   :1

3. Cleaning my dataset - EDA Functions

Analysis: I started by exploring my dataset, as shown in the code above. Afterward, I cleaned the dataset by changing the gender categories from F to Female and M to Male. I also checked for NAs to be sure there is no missing data in the variables I will be using. For the two-sample t-test, I separated the female times from the male times. Starting with the female run times, I selected the two variables I would be using, filtered for the “Female” rows, arranged them in ascending order, and then kept only the time variable. I repeated the same steps for the male times. I also found the average run time for each gender and used it to create a bar graph that makes the difference in mean run times easy to see.

#Changing F to Female and M to Male for easier reading
blossom_run$gender <- ifelse(blossom_run$gender == "F", "Female", "Male")

#Check for any NAs for the variables I will be using
colSums(is.na(blossom_run))
##     place      time  net_time      pace       age    gender     first      last 
##         0         0         1         0         1         0         0         0 
##      city     state   country       div div_place   div_tot 
##         0         0         0         1         1         1

Note: There are no NAs for gender or time, the two variables I will be using in my analysis.
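The recoding and NA check above can also be narrowed to just the two analysis variables. The following is a minimal sketch on made-up data: `toy_run` and its values are hypothetical stand-ins for `blossom_run`, and using `case_when()` in place of `ifelse()` is my own substitution, chosen so that any code other than "F" or "M" becomes NA instead of being labelled "Male".

```r
library(dplyr)

# Hypothetical toy data standing in for blossom_run (values are made up)
toy_run <- tibble(
  time   = c(53.5, 60.2, 45.9, NA, 71.0),
  gender = c("F", "F", "M", "M", NA)
)

# Recode explicitly; unmatched or missing codes are left as NA
toy_run <- toy_run |>
  mutate(gender = case_when(
    gender == "F" ~ "Female",
    gender == "M" ~ "Male"
  ))

# Count missing values in only the two analysis variables
na_counts <- toy_run |>
  summarise(across(c(time, gender), ~ sum(is.na(.x))))

# Tabulate the recoded categories to confirm nothing was mislabelled
gender_counts <- count(toy_run, gender)

na_counts
gender_counts
```

Tabulating the categories after recoding is a quick way to confirm that only the expected labels are present before any group comparison.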

4. Analyzing my data

Female Times

female_times <- blossom_run |>
  select(time, gender) |>
  filter(gender == "Female") |>
  arrange(time) |>
  select(time)

female_times
## # A tibble: 8,323 × 1
##     time
##    <dbl>
##  1  53.5
##  2  53.9
##  3  54.0
##  4  54.4
##  5  54.4
##  6  54.5
##  7  54.6
##  8  54.6
##  9  54.7
## 10  54.8
## # ℹ 8,313 more rows

Male Times

male_times <- blossom_run |>
  select(time, gender) |>
  filter(gender == "Male") |>
  arrange(time) |>
  select(time)

male_times
## # A tibble: 6,651 × 1
##     time
##    <dbl>
##  1  45.9
##  2  46.0
##  3  46.0
##  4  46  
##  5  46.1
##  6  46.1
##  7  46.5
##  8  47.0
##  9  47.0
## 10  47.7
## # ℹ 6,641 more rows

Note: Right away, it can be seen that the male run times start lower than the female run times. The lowest female run time is 53.53, while the lowest male run time is 45.93.

Average Times - This will be used for the visualization

avg_times_per_gender <- blossom_run |>
  select(time, gender) |>
  group_by(gender) |>
  summarise(avg_time = round(mean(time), digits = 2))

avg_times_per_gender
## # A tibble: 2 × 2
##   gender avg_time
##   <chr>     <dbl>
## 1 Female    109. 
## 2 Male       96.2
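The per-gender summary can be extended to report group size, minimum, and spread alongside the mean, which are the quantities a two-sample t-test relies on. This is a sketch on made-up data: `toy_run`, `toy_summary`, and the column names `n`, `min_time`, and `sd_time` are hypothetical. Note that a tibble prints about three significant digits by default, which is why the female average above displays as `109.` even though it was rounded to two decimals.

```r
library(dplyr)

# Hypothetical toy data (made-up values), same two columns as blossom_run
toy_run <- tibble(
  time   = c(100, 110, 120, 90, 95, 100),
  gender = c("Female", "Female", "Female", "Male", "Male", "Male")
)

# One row per gender: group size, fastest time, mean, and standard deviation
toy_summary <- toy_run |>
  group_by(gender) |>
  summarise(
    n        = n(),
    min_time = min(time),
    avg_time = mean(time),
    sd_time  = sd(time)
  )

# Convert to a plain data frame to print all digits instead of the tibble's abbreviated display
as.data.frame(toy_summary)
```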

5. Visualization

# Take the bar labels from the summary table so they always match the bar heights
barplot(avg_times_per_gender$avg_time,
        names.arg = avg_times_per_gender$gender,
        col = c("pink", "purple"),
        ylab = "Average Run Time", xlab = "Gender",
        main = "Average Run Times by Gender")
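A bar graph of means shows only one number per group. As a possible complement, side-by-side boxplots show the median, spread, and outliers for each gender. The sketch below uses made-up data, with `toy_run` as a hypothetical stand-in for `blossom_run`, and is not part of the original analysis.

```r
# Hypothetical toy data (made-up values), same two columns as blossom_run
toy_run <- data.frame(
  time   = c(100, 110, 120, 130, 90, 95, 100, 105),
  gender = rep(c("Female", "Male"), each = 4)
)

# Side-by-side boxplots; boxplot() also returns the plotted statistics invisibly
bp <- boxplot(time ~ gender, data = toy_run,
              col = c("pink", "purple"),
              xlab = "Gender", ylab = "Run Time",
              main = "Run Times by Gender")

# Group labels and group sizes used in the plot
bp$names
bp$n
```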

6. Two-Sample T-Test

Hypotheses

Null Hypothesis

\(H_0: \mu_1 = \mu_2\)

Alternative Hypothesis

\(H_a: \mu_1 > \mu_2\)

Where,

\(\mu_1\) = mean of the race time for females

\(\mu_2\) = mean of the race time for males

The significance level I will be using: α = 0.05
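For reference, R's t.test() uses the Welch form of the two-sample t-test by default (var.equal = FALSE), which does not assume equal variances in the two groups. The symbols \(\bar{x}_i\), \(s_i\), and \(n_i\) are introduced here for illustration and denote the sample mean, sample standard deviation, and sample size of group \(i\) (1 = females, 2 = males). The test statistic and its approximate degrees of freedom are:

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
\]

Under \(H_a: \mu_1 > \mu_2\), large positive values of \(t\) count as evidence against \(H_0\).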

t.test(female_times, male_times, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  female_times and male_times
## t = 43.041, df = 13637, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  12.51629      Inf
## sample estimates:
## mean of x mean of y 
## 109.23967  96.22603
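The call above passes two one-column tibbles to t.test(). An alternative that avoids building separate objects is the formula interface, sketched below on made-up data. `toy_run` is a hypothetical stand-in for `blossom_run`, and `toy_test`, `f`, `m`, and `t_by_hand` are illustrative names. The sketch also recomputes the Welch statistic by hand to show what the function reports.

```r
# Hypothetical toy data (made-up values), same two columns as blossom_run
toy_run <- data.frame(
  time   = c(100, 110, 120, 130, 90, 95, 100, 105),
  gender = rep(c("Female", "Male"), each = 4)
)

# Formula interface: groups are ordered alphabetically (Female, then Male),
# so "greater" tests mean(Female) - mean(Male) > 0
toy_test <- t.test(time ~ gender, data = toy_run, alternative = "greater")

# Welch statistic computed by hand for comparison
f <- toy_run$time[toy_run$gender == "Female"]
m <- toy_run$time[toy_run$gender == "Male"]
t_by_hand <- (mean(f) - mean(m)) /
  sqrt(var(f) / length(f) + var(m) / length(m))

toy_test$statistic
t_by_hand
```

With the formula interface, the direction of the one-sided test depends on the order of the group labels, so it is worth confirming that order before choosing "greater" or "less".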

Results: The p-value is < 2.2e-16, which is extremely small. Because the p-value is smaller than the significance level of α = 0.05, I reject the null hypothesis. There is strong evidence that females have a higher mean run time than males.

Additionally, the one-sided 95% confidence interval for the difference in means (female minus male) runs from 12.51629 to infinity. Since 0 lies outside the interval, the difference in means is statistically significant, showing that mean run times are higher for females.

Conclusion

In conclusion, I have found significant evidence that the average run time for females in the Cherry Blossom Run of 2009 is higher than the average run time for males, which directly addresses the original research question. When the female and male times were each sorted in ascending order, the first 10 observations already showed that the fastest males had lower run times than the fastest females. For females, the first 10 run times were all 53 or above, while for males they were all under 50 but greater than 45. The visualization showed the difference once again: in the bar graph, the male average run time was clearly lower than the female average run time.

This information can be helpful in several ways. It can be used to notice patterns in running performance between the genders, and it can inform studies of health and physical differences between them. Runners can also use it to understand their own bodies better and to adjust how they train for such runs. To make more in-depth studies and find clearer patterns, other factors can be added to help explain why run times differ between males and females. For one, the dataset provides information on countries and cities. Different countries have different environments, which can affect runners’ training and their running capabilities. The dataset also records the runners’ ages, which can have a significant impact on how fast or slow runners are and can further help with understanding run times between genders.

References: https://www.openintro.org/data/index.php?data=run09