Statistical Theory Simulation

Setup:

# Load standard libraries
library(tidyverse)

Problem 1: Triathlon Times

In triathlons, it is common for racers to be placed into age and gender groups. Fred and Catarina both completed the Hermosa Beach Triathlon, where Fred competed in the Men, Ages 30 - 34 group while Catarina competed in the Women, Ages 25 - 29 group. Fred completed the race in 1:22:28 (4948 seconds), while Catarina completed the race in 1:31:53 (5513 seconds). They are curious about how they did within their respective groups.

Here is some information on the performance of their groups:

The finishing times of the Men, Ages 30 - 34 group have a mean of 4313 seconds with a standard deviation of 583 seconds.
The finishing times of the Women, Ages 25 - 29 group have a mean of 5261 seconds with a standard deviation of 807 seconds.
The distributions of finishing times for both groups are approximately Normal.

Remember: a better performance corresponds to a faster finish.

(a) A short-hand for these two normal distributions.

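Let X denote the finishing time (in seconds) of a randomly chosen man aged 30 to 34. Then X ~ N(4313, 339889), where the second parameter is the variance, 583^2.
Let Y denote the finishing time (in seconds) of a randomly chosen woman aged 25 to 29. Then Y ~ N(5261, 651249), where the second parameter is the variance, 807^2.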
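The second parameter in the short-hand above is a variance, whereas R functions such as pnorm() and qnorm() expect a standard deviation. A minimal check of the two variances, with variable names chosen here purely for illustration:

```r
# Standard deviations given in the problem statement (seconds)
sd_men <- 583
sd_women <- 807

# The N(mean, variance) short-hand uses the squared standard deviation
var_men <- sd_men^2
var_women <- sd_women^2

cat("\nVariance for men =", var_men)
cat("\nVariance for women =", var_women)
```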

(b) What are the Z scores for Fred’s and Catarina’s finishing times? What do these Z scores tell you?

z_Fred <- (4948-4313)/583
z_Catarina <- (5513-5261)/807
cat("\nZ Score for Fred = ", z_Fred)

## 
## Z Score for Fred =  1.089194

cat("\nZ Score for Catarina = ", z_Catarina)

## 
## Z Score for Catarina =  0.3122677

A Z score is the number of standard deviations an observation lies above (positive) or below (negative) the mean. Therefore:
Fred’s finishing time is approx. 1.09 standard deviations above the mean finishing time for men aged 30 to 34, which is 4313 s.
Similarly, Catarina’s finishing time is approx. 0.31 standard deviations above the mean finishing time for women aged 25 to 29, which is 5261 s.
Since a larger time means a slower finish, both Fred and Catarina finished slower than the average of their respective groups.

(c) Did Fred or Catarina rank better in their respective groups? Explain your reasoning.

pnorm(4948,4313,583)

## [1] 0.8619658

pnorm(5513,5261,807)

## [1] 0.6225814

As per the cumulative distribution function of the men’s finishing times, Fred’s finishing time lies at about the 86th percentile, that is, his time was longer than that of about 86% of his group. Therefore, about 86% of the men in his group did better in the race (had shorter finishing times) than Fred.
As per the cumulative distribution function of the women’s finishing times, Catarina’s finishing time lies at about the 62nd percentile, that is, her time was longer than that of about 62% of her group. Therefore, about 62% of the women in her group did better in the race (had shorter finishing times) than Catarina.
Catarina therefore ranked better within her group: her Z score is lower (0.31 vs. 1.09), so a smaller share of her group finished ahead of her.
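The percentiles above can also be read directly off the Z scores from part (b), because the standard normal CDF evaluated at a Z score equals the original normal CDF evaluated at the raw time. A small sketch of that equivalence (the variable names are illustrative):

```r
# Z scores, as computed in part (b)
z_fred <- (4948 - 4313) / 583
z_catarina <- (5513 - 5261) / 807

# Standard normal CDF at the Z score
p_fred <- pnorm(z_fred)
p_catarina <- pnorm(z_catarina)

# Both comparisons should report TRUE: same value as pnorm() on the raw times
all.equal(p_fred, pnorm(4948, 4313, 583))
all.equal(p_catarina, pnorm(5513, 5261, 807))
```

Comparing Z scores is therefore enough to rank the two athletes: the lower Z score corresponds to the lower percentile of finishing times.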

(d) What percent of the triathletes did Fred finish faster than in his group?

1-pnorm(4948,4313,583)

## [1] 0.1380342

Fred finished faster than about 13.8% of the participants in his group. This is calculated by subtracting from one the probability that a randomly selected man from his group has a finishing time at or below Fred’s.

(e) What percent of the triathletes did Catarina finish faster than in her group?

1-pnorm(5513,5261,807)

## [1] 0.3774186

Catarina finished faster than about 37.7% of the participants in her group. This is calculated by subtracting from one the probability that a randomly selected woman from her group has a finishing time at or below Catarina’s.
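As an alternative to subtracting from one, pnorm() accepts lower.tail = FALSE, which returns the upper-tail probability directly. A sketch for parts (d) and (e), with illustrative variable names:

```r
# Share of each group with a longer (slower) finishing time
slower_than_fred <- pnorm(4948, mean = 4313, sd = 583, lower.tail = FALSE)
slower_than_catarina <- pnorm(5513, mean = 5261, sd = 807, lower.tail = FALSE)

cat("\nShare of group slower than Fred =", slower_than_fred)
cat("\nShare of group slower than Catarina =", slower_than_catarina)
```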

(f) If the distributions of finishing times are not nearly normal, would your answers to parts (b) - (e) change? Explain your reasoning.

Problem 2: Sampling with and without Replacement

In the following situations we assume that half of the specified population is male and the other half is female.

(a) Suppose you’re sampling from a room with 10 people. What is the probability of sampling two females in a row when sampling with replacement? What is the probability when sampling without replacement?

cat("\nProbability of sampling two females in a row when sampling with replacement =",(5/10)*(5/10))

## 
## Probability of sampling two females in a row when sampling with replacement = 0.25

cat("\nProbability of sampling two females in a row when sampling without replacement =",(5/10)*(4/9))

## 
## Probability of sampling two females in a row when sampling without replacement = 0.2222222

(b) Now suppose you’re sampling from a stadium with 10,000 people. What is the probability of sampling two females in a row when sampling with replacement? What is the probability when sampling without replacement?

cat("\nProbability of sampling two females in a row when sampling with replacement =",0.5*0.5)

## 
## Probability of sampling two females in a row when sampling with replacement = 0.25

cat("\nProbability of sampling two females in a row when sampling without replacement =",(0.5)*(4999/9999))

## 
## Probability of sampling two females in a row when sampling without replacement = 0.249975

(c) We often treat individuals who are sampled from a large population as independent. Using your findings from parts (a) and (b), explain whether or not this assumption is reasonable.

This assumption is reasonable, as the above parts demonstrate. When the population was small (10 people), the probability of getting two females in a row differed noticeably between sampling with replacement (0.25) and without replacement (about 0.222). However, when the population was large (10,000 people), the two probabilities were nearly the same (0.25 vs. 0.249975), because removing one person barely changes the composition of the remaining population.
Thus, it is reasonable to treat individuals who are sampled from a large population as independent for all practical purposes, provided the sample is small relative to the population.
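The same comparison can be sketched for several population sizes, together with a small simulation of the 10-person room. The helper p_two_females(), the chosen sizes, the seed, and the number of replications are all assumptions made for illustration:

```r
# Hypothetical helper: probability of drawing two females in a row
# WITHOUT replacement from a population of size n that is half female
p_two_females <- function(n) {
  females <- n / 2
  (females / n) * ((females - 1) / (n - 1))
}

# Gap to the with-replacement probability (0.25) shrinks as n grows
sizes <- c(10, 100, 1000, 10000)
gap <- 0.25 - p_two_females(sizes)
data.frame(population = sizes,
           without_replacement = p_two_females(sizes),
           gap = gap)

# Simulation for the room of 10: draw two people without replacement
set.seed(1)  # arbitrary seed, for reproducibility only
room <- rep(c("F", "M"), each = 5)
draws <- replicate(20000, all(sample(room, 2, replace = FALSE) == "F"))
mean(draws)  # should land close to the exact value 2/9
```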

Problem 3: Sample Means

You are given the following hypotheses: \(H_0: \mu = 34\), \(H_A: \mu > 34\). We know that the sample standard deviation is 10 and the sample size is 65. For what sample mean would the p-value be equal to 0.05? Assume that all conditions necessary for inference are satisfied.

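Assumption: The sampling distribution of the sample mean is approximately normal, so a Z statistic is used, with the sample standard deviation standing in for sigma.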

z <- qnorm(0.95) #since p-value = 0.05

# standard error: se = s / sqrt(n)
se <- 10/(65^0.5)

# z = (X-mu)/se
# => X = z*se + mu
x <- (z*se) + 34
x

## [1] 36.04019

Thus, a sample mean of about 36.04 gives a p-value of 0.05; any larger sample mean gives a p-value below 0.05.
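The result can be checked by plugging the sample mean back into the upper-tail probability. Since only the sample standard deviation is known, a t distribution with n - 1 degrees of freedom is a common alternative to the normal approximation; it is shown here only for comparison and is not the approach taken above. Variable names are illustrative:

```r
mu0 <- 34   # mean under the null hypothesis
s <- 10     # sample standard deviation
n <- 65     # sample size
se <- s / sqrt(n)

# Normal approximation, as used above
x_z <- mu0 + qnorm(0.95) * se
p_check <- pnorm(x_z, mean = mu0, sd = se, lower.tail = FALSE)
cat("\nSample mean (normal) =", x_z, " p-value =", p_check)

# Alternative (an assumption, not the method above): t distribution with n - 1 df
x_t <- mu0 + qt(0.95, df = n - 1) * se
cat("\nSample mean (t) =", x_t)
```

The t-based cutoff is slightly larger than the normal one because the t distribution has heavier tails, although with 64 degrees of freedom the difference is small.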