Case study: The Cherry Blossom 10 mile run

The data provided show the finish times of 16,924 runners in the Cherry Blossom 10 mile run held in Washington, DC. Our goal today is to take a look at a famous distribution that describes the behavior of many numerical variables found in nature and in human activities.

  1. Open the Cherry Blossom dataset provided on Google Classroom. The time variable logs the number of minutes each runner took to complete the race. What type of variable is this? How do you know?
  2. Create a histogram of the time variable for males and another for females. Place each histogram in its own tab, and name the tabs maleDist and femaleDist respectively.
  3. What is the shape of the distribution? Does it exhibit any sort of skew?
  4. Use the AVERAGE function to find the mean finish time for males and females and record this parameter in the table provided.
  5. Use the STDEV function to calculate the standard deviation of runner finish times for males and females and record this information in the table provided.
  6. Compare the mean finish times of males and females. Which gender is faster on average, and by how much?
  7. Suppose a female runner finishes with a time of 80 minutes. A male runner finishes with a time of 79 minutes. Which is more unusual, given the gender of the runners?
  8. For each gender, create a new variable named Normalized: take each observation of time, subtract the mean of time from it, then divide the result by the standard deviation of time.
  9. What is the mean of the new variable you created? What is the standard deviation of the variable you created?
  10. Now that you have normalized variables for both males and females, return to question 7. Find a male runner with a time around 79 minutes and a female runner with a time around 80 minutes. Compare the Normalized values of these two observations. Which is smaller? What do you think this means?
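
For anyone who wants to check the spreadsheet work in code, the calculation in question 8 can be sketched in Python roughly as follows. This is only an illustration: the function name standardize and the five finish times are made up for the example and are not taken from the Cherry Blossom data. The stdev function from Python's statistics module computes the sample standard deviation, which is assumed here to match what the spreadsheet STDEV function returns.

```python
from statistics import mean, stdev


def standardize(times):
    """Subtract the mean from each time, then divide by the standard deviation."""
    center = mean(times)
    spread = stdev(times)  # sample standard deviation (divides by n - 1)
    return [(t - center) / spread for t in times]


# Hypothetical finish times in minutes, NOT from the race dataset.
sample_times = [70.0, 80.0, 90.0, 100.0, 110.0]
normalized = standardize(sample_times)

# Each value says how many standard deviations a runner is from the group mean.
for t, z in zip(sample_times, normalized):
    print(f"{t:6.1f} minutes -> {z:+.2f}")
```

To compare a male and a female runner as in question 10, the same function would be applied to the male times and to the female times separately, so that each runner is measured against the mean and standard deviation of their own group.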