Theory I: Final Exam

Luz Melo

2022-12-13

Instructions
Question 1: Simulation
Question 2: Probability
Question 3: Discrete Distributions
Question 4: Continuous Distributions
Question 5: Expectation and Variance
Question 6: Transformations & Sampling Distributions
Question 7: Estimation of CDF and PDF from data
Question 8: Sample from an estimated distribution
Question 9: Inference

Instructions

The final exam will be a one-on-one oral exam with the instructor. Please prepare solutions to the following is a set of questions. During the oral exam, the instructor will ask a series of questions covering topics from the course and the questions. For example, the instructor may ask:

Please explain how you solved a particular question.
Please solve a new question (perhaps closely related to a question below).
Please explain course topic X.

You will be graded on both the accuracy of your responses and the clarity with which you explain course concepts and solutions to questions.

The final exam should represent your own work. Do not consult with or collaborate in any way with anyone other than the instructor.

Prior to meeting with the instructor, you should:

Create a folder in your Github repo; call it 99-final-exam.
Compile, save, and push your solutions to your GitHub repository

Question 1: Simulation

The Monte Hall problem is a classic game show. Contestants on the show where shown three doors. Behind one randomly selected door was a sportscar; behind the other doors were goats.

At the start of the game, contestants would select a door, say door A. Then, the host would open either door B or C to reveal a goat. At that point in the game, the host would ask the contestant if she would like to change her door selection. Once a contestant decided to stay or change, the host would open the chosen door to reveal the game prize, either a goat or a car.

In this problem, consider a modified version of the Monte Hall problem in which the number of doors is variable. Rather than 3 doors, consider a game with 4 or 5 or 50 doors. In the modified version of the game, a contestant would select an initial door, say door A. Then, the host would open one of the remaining doors to reveal a goat. At that point in the game, the host would ask the contestant if she would like to change her door selection. Once a contestant decided to stay or change, the host would open the chosen door to reveal the game prize, either a goat or a car.

Consider two strategies:

Always stay with the first door selected.
Always switch to the unopened door.

C. The function game below plays a single game of Monte Hall. The function returns a vector of length two, the first element is the prize under strategy 1 and the second element is the prize under strategy 2. The function has a single input parameter, N, which is the number of doors in the game.

Use the game function to estimate the probability that both strategies simultaneously result in a goat. Let N=4.

require(magrittr)

## Loading required package: magrittr

require(dplyr)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

game <- function(N){
  if(N<3) stop("Must have at least 3 doors")
  prize <- sample(c(rep("goat",N-1),"car"), N)
  guess <- sample(1:N,1)
  game <- data.frame(door = 1:N, prize = prize, stringsAsFactors = FALSE) %>% 
    mutate(first_guess = case_when(
      door == guess ~ 1
      , TRUE ~ 0
    )) %>% 
    mutate(potential_reveal = case_when(
        first_guess == 1 ~ 0
      , prize == "car" ~ 0
      , TRUE ~ 1
    )) %>% 
    mutate(reveal = 1*(rank(potential_reveal, ties.method = "random") == 3)) %>% 
    mutate(potential_switch = case_when(
      first_guess == 1 ~ 0
      , reveal == 1 ~ 0
      , TRUE ~ 1
    )) %>% 
    mutate(switch = 1*(rank(potential_switch, ties.method = "random") == 3))
  c(game$prize[game$first_guess == 1], game$prize[game$switch == 1])
}

The probability that both strategies simultaneously result in a goat is

R <- 1000

g1 <- replicate(R,game(N=4))
prop <- mean(g1[1,]=="goat" & g1[2,]=="goat")
prop

## [1] 0.369

B. Communicate the precision of your simulated probability in part C by calculating a 99% confidence interval.

The R functions prop.test() can be used as follow:

prop.test(x, n, p = NULL, alternative = “two.sided”, correct = TRUE)

x: a vector of counts of successes n: a vector of count trials alternative: a character string specifying the alternative hypothesis (“two.sided”,“less”, or “greater”) correct: a logical indicating whether Yates’ continuity correction should be applied where possible

R <- 10000

prop.test(sum(g1[1,]=="goat" & g1[2,]=="goat"),R,conf.level = 0.99)$conf.int

## [1] 0.03229630 0.04212413
## attr(,"conf.level")
## [1] 0.99

This interval means that the probability that both strategies simultaneously result in a goat is likely to lie between our confidence interval 99 percent of the time.

A. Let D(N) be the difference between the difference in probabilities between strategy 2 and strategy 1.

\[ D(N) = P(\text{win strategy 2}|\text{N doors}) - P(\text{win strategy 1}|\text{N doors}) \]

Create a plot that shows how D changes as N increases. Put N on the x-axis, ranging from 3 to 10. Put D on the y-axis.

# Optimized
R <- 1000
N <- 3:10
D <- numeric(length(N))

for (i in seq_along(N)) {
  n <- N[i]
  results <- replicate(R, game(N=n))
  D[i] <- mean(results[2, ] == "car") - mean(results[1, ] == "car")
}

#R <- 100
#N <- 3:10
#D <- NA*N

#for (n in N){
#  D[n-2]<- table(replicate(R, game(N=n)[2]))[["car"]]/R - 
#    table(replicate(R, game(N=n)[1]))[["car"]]/R
#}

#plot(N,D)

# Create a data frame for ggplot2
df <- data.frame(N = N, D = D)
# Plot using ggplot2
ggplot(df, aes(x = N, y = D)) +
  geom_line(color = "red", size = 1.5) +    # Line plot with thicker line
  geom_point(color = "black", size = 3) +    # Add points for emphasis
  labs(title = "Difference in Proportion of Car Wins vs Goat Wins",  # Title
       subtitle = "Comparison of First and Second Choice in the Game",
       x = "Number of Doors (N)",  # X-axis label
       y = "Difference in Proportion") +  # Y-axis label
  theme_minimal() +  # Use a clean theme
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),  # Title style
        plot.subtitle = element_text(hjust = 0.5, size = 12),  # Subtitle style
        axis.title.x = element_text(size = 12, face = "bold"),  # X-axis label style
        axis.title.y = element_text(size = 12, face = "bold"))  # Y-axis label style

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Question 2: Probability

Consider a test for a rare genetic condition. Let T+ denote a test result that indicates the condition is present, while T- denotes absence. Let D+ and D- denote the true status of the disease.

C. Fill-in the probability table using the following information:

P(T+|D+) = .85, and
P(T-|D-) = .95, and
P(D+) = 0.001

	D+	D-	Row Total
T+	.001*.85=.00085 .01671583 P(T+\|D+) = .85	0.999-.949=0.05 .98328417 .05	P(T+) = .05085
T-	.001-.00085=.00015 .000158 .15	.999*.95=.949 P(D-\|T-)= .99984 P(T-\|D-) = .95	P(T-) = .94915
Column Total	P(D+) = .001	P(D-) = .999	1

T+

.001*.85=.00085

.01671583

P(T+|D+) = .85

0.999-.949=0.05

.98328417

.05

P(T+) = .05085

T-

.001-.00085=.00015

.000158

.15

.999*.95=.949

P(D-|T-)= .99984

P(T-|D-) = .95

P(T-) = .94915

Column Total

P(D+) = .001

P(D-) = .999

Rules:

Sum of all cell probabilities equals to 1.
Law of total probability:

Cell probabilities on the same row/column sum to the row/column marginal probability, respectively.

Sum of rows/columns probabilities in the same row/column equals to 1.
Sum of the marginal probabilities sum to 1.
Row/column probability is equal to cell probability divided by row/column marginal, respectively.
Cell probability is equal to the row/column probability times the row/column marginal, respectively.

Steps:

Find P(D+ & T+) and P(D- & T-). (6)
Find P(D+ & T-) and P(D- & T+). (3)
Find P(T+) and P(T-). (2)
Find all row/column probabilities. (5)

B. Calculate the negative predictive value of the test, P(D-|T-).

P(D-|T-)= .99984

A Create a plot that shows how the positive predictive value as a function of the prevalence of disease, P(D+).

P(T+|D+) = .85 and
P(T-|D-) = .95

Let

P(D+) = p

Goal: Get P(D+|T+) in terms of p.

	D+	D-	Row Total
T+	p(0.85) p(0.85)/P(T+) P(T+\|D+) = .85	(1-p)- [(1-p)(.95)] = (1-p)[1-.95] = (1-p)(.05)	P(T+) = p(.85)+ (1-p)(.05)
T-		(1-p)(.95) P(T-\|D-) = .95
Column Total	P(D+) = p	P(D-) = 1-p	1

T+

p(0.85)

p(0.85)/P(T+)

P(T+|D+) = .85

(1-p)- [(1-p)(.95)] = (1-p)[1-.95] = (1-p)(.05)

P(T+) = p(.85)+ (1-p)(.05)

T-

(1-p)(.95)

P(T-|D-) = .95

Column Total

P(D+) = p

P(D-) = 1-p

pp <- function(p) p*(0.85)/(p*(0.85)+(1-p)*(0.05))
curve(pp)

# Check:
# pp(0.001) = P(D+|T+) as in part A.
pp(0.001)

## [1] 0.01673228

Question 3: Discrete Distributions

Suppose the yearly hospital charges (in thousands of dollars) for a randomly selected UVA student is a mixture distribution.

For 60% of students, the hospital charges will be $0.
For the remaining 40% of students, the hospital charges are a random variable described by a gamma distribution with shape = 2 and scale = 2. (Again, in thousands of dollars.)

C. What is the 90th percentile for yearly hospital charges for a randomly selected UVA student? (Hint: Which approach might be easier: analytical or simulation?)

The nth percentile of a dataset is the value that cuts off the first n percent of the data values when all of the values are sorted from least to greatest.
Calculating a percentile (or a p-quantile) is equivalent to finding the inverse of a CDF.

N <- 1000

hc <- function(n) 0*rbinom(n,1,0.6) + .4*rgamma(n, shape = 2, scale = 2)
quantile(hc(N), 0.9)

##      90% 
## 3.055283

The 90th percentile for yearly hospital charges for a randomly selected UVA student is approximately 3000 dollars. In other words, 90% of UVA students are charged less than 3000 dollars for yearly hospital charges; 10% are charged more than 3000 dollars for yearly hospital charges.

B. Consider the class average yearly hospital charge for the students in a class of size 30. Plot the density function or a simulated histogram of the class average yearly hospital charge.

N <- 1000

hc <- function(n) 0*rbinom(n,1,0.6) + .4*rgamma(n, shape = 2, scale = 2)

charge <- replicate(N, mean(hc(30)))

par(mfrow = c(1,2))
plot(ecdf(charge))
hist(charge, freq = FALSE, breaks = 50)

A. What is the probability that a randomly selected class of size 30 students will have less than 10 students with zero yearly hospital charges?

N <- 1000

d1 <- replicate(N,sum(hc(30)==0))
mean(d1<10)

## [1] 1

Question 4: Continuous Distributions

C. Suppose diastolic blood pressure (DBP) follows a normal distribution with mean 80 mmHg and SD 15 mmHg. What is the probability that a randomly sampled person’s DBP exceeds 104 mmHg?

We want to find P(DBP > 104 | mean = 80, sd = 15):

1 - pnorm(104, mean = 80, sd = 15)

## [1] 0.05479929

The probability that a randomly sampled person’s DBP exceeds 104 is approximately 5.5 %.

B. Suppose a human femur was discovered that is 37 cm long. Also suppose that using the NHANES data, researchers believe the distribution of femur bones, by sex, are distributed as follows:

Female adult femur $\sim N(36, 3.3)$
Male adult femur $\sim N(40, 3.4)$

Under the assumption that male and females are equally likely, what is the probability that the discovered femur was from a male?

Given:

P(male) = P(female)

Goal: P(male| female=37)

Bayes Rule:

\[ P(male| length=37) = \frac{density(length=37|male)*P(male)}{density(length=37|male)*P(male) + density(length=37|female)*P(female)} \]

mu_f = 36
sigma_f = 3.3

mu_m = 40
sigma_m = 3.4

P <- function(x) 1/(1+dnorm(x,mu_f, sigma_f)/dnorm(x,mu_m,sigma_m))
curve(P(x),22,52)

P(37)

## [1] 0.407765

There is a 41% probability that the discovered femur was from a male.

A. Continuing part B, at what value of $x$ is P(femur from male | femur length = x) = P(femur from female | femur length = x). (Hint: a figure might help.)

We want to find x such that

\[ P(male| length=x) = P(female | length=x) \]

This is the same as

\[ \frac{1}{1 + \frac{density(length=x|female)}{density(length=x|male)}} = \frac{1}{1 + \frac{density(length=x|male)}{density(length=x|female)}} \]

mu_f <- 36      # Mean femur length for females
sigma_f <- 3.3  # Standard deviation for females
mu_m <- 40      # Mean femur length for males
sigma_m <- 3.4  # Standard deviation for males

P1 <- function(x) 1 / (1 + dnorm(x, mu_f, sigma_f) / dnorm(x, mu_m, sigma_m)) # Male
P2 <- function(x) 1 / (1 + dnorm(x, mu_m, sigma_m) / dnorm(x, mu_f, sigma_f)) # Female

x <- seq(22, 52, by = 0.01)

equal_point <- which(abs(P1(x) - P2(x)) < 1e-3)  # Check for approximate equality

equal_point

## [1] 1606

x[equal_point]

## [1] 38.05

Question 5: Expectation and Variance

Let us revisit the yearly hospital charges distribution from a previous section.

Recall: The yearly hospital charges (in thousands of dollars) for a randomly selected UVA student is a mixture distribution. For 60% of students, the hospital charges will be $0. For the remaining 40% of students, the hospital charges are a random variable described by a gamma distribution with shape = 2 and scale = 2. (Again, in thousands of dollars.)

We say that the expected value of X is the long-run average value of X, as we repeat the experiment indefinitely.

C. What is E[yearly hospital charges]?

N <- 100000

hc <- function(n) 0*rbinom(n,1,0.6) + .4*rgamma(n, shape = 2, scale = 2)

EX1 = mean(hc(N))

# Built in function: weighted.mean(vals, probs)

EX2 = weighted.mean(0*rbinom(N,1,0.6) + rgamma(N, shape = 2, scale = 2)*(1-rbinom(N,1,0.6)),rbinom(N,1,0.6))

EX1

## [1] 1.602883

EX2

## [1] 1.60327

B. Suppose UVA implements a cap of $1,500 on yearly student hospital charges. What is the mean yearly hospital charge under the new policy (E(X))?

# 1500/1000 = 1.5
# yearly student hospital charges, hc(N), cannot exceed 1.5

N <- 10000
y = hc(N)

for (c in seq_along(y)){
  if (y[c] > 1.5){
    y[c] <- 1.5
  }
}

EY = mean(y)
EY

## [1] 1.118663

A. What is the variance of yearly hospital charge under the new policy?

The variance is the sum of the squared deviations from the mean weighted by their probabilities:

var_f <- function(n) {
  mean((hc(n) - mean(hc(n)))^2)
}

var(hc(N))

## [1] 1.264598

var_f(N)

## [1] 1.224501

Question 6: Transformations & Sampling Distributions

C. Consider the log normal distribution. If X is a log normal random variable, then log(X) is a normal random variable. One way to create pseudo-random draws from the log normal distribution is to generate draws from a normal distribution and then to transform the draws by exponentiating. The parameters of the log normal distribution are the parameters of the underlying normal distribution, $\mu$ and $\sigma$ (or $\sigma^2$).

Log normal data are prevalent is biological systems and econometrics.

Suppose a blood chemistry measure has a log normal distribution with $\mu$ = 0 and $\sigma$ = 1. Generate an histogram or density curve for the sampling distribution of the median when the sample size is 101.

N <- 10000

rqc <- function(n,q){
  x<-exp(rnorm(n,mean = 0,sd = 1))
  quantile(x,q)
}

par(mfrow = c(1,2))
d1 <- replicate(N,rqc(101,0.5)) # Median is the 0.5 percentile
hist(d1,breaks = 50, main = "Histogram of Median")

d2 <- replicate(N,median(rlnorm(101,0,1)))
hist(d2,breaks = 50, main = "Histogram of Median")

B. Below is the CDF function for the kth order statistic when the underlying distribution is log normal with $\mu$ = 0 and $\sigma$ = 1. Create a plot of the ECDF of the simulated sampling distribution generated in C and overlay the CDF using the function below.

Fk <- function(x,k,n){
  pbinom(k-1, n, plnorm(x), lower.tail = FALSE)
}

plot(ecdf(d2),lwd = 5)
curve(Fk(x,51,101),add = TRUE, col = "red")

A. Of the 25th, 50th, and 75th quantiles of the distribution from B, which will have the tightest 95% CI? (Show the sampling distribution of each.)

Notice that the histogram for the 50th distribution is given in part C.

N <- 10000

par(mfrow = c(1,2))
d3 <- replicate(N,rqc(101,0.25)) # 25th quantile
d4 <- replicate(N,rqc(101,0.75)) # 75th quantile
hist(d3, breaks = 50, main = "Histogram of 25th Q")
hist(d4, breaks =50, main = "Histogram of 75th Q")

N <- 10000

d1 <- data.frame(replicate(N,rqc(101,0.25)))
d2 <- data.frame(replicate(N,rqc(101,0.5)))
d3 <- data.frame(replicate(N,rqc(101,0.75)))

t.test(replicate(N,rqc(101,0.25))~1, data = d1, conf.level = 0.95)$conf.int

## [1] 0.5178585 0.5206152
## attr(,"conf.level")
## [1] 0.95

t.test(replicate(N,rqc(101,0.5))~1, data = d2, conf.level = 0.95)$conf.int

## [1] 1.004346 1.009216
## attr(,"conf.level")
## [1] 0.95

t.test(replicate(N,rqc(101,0.75))~1, data = d3, conf.level = 0.95)$conf.int

## [1] 1.958190 1.968684
## attr(,"conf.level")
## [1] 0.95

The 25th percentile has the tightest 95% confidence interval.

Question 7: Estimation of CDF and PDF from data

The following code will load the NHANES data and select the first 500 rows.

Hmisc::getHdata(nhgh)
d1 <- nhgh[1:500,]

C. Estimate the distribution of standing height for adult (age > 18) males using the MLE method with a normal distribution. Create a plot of the estimated density function.

a1 <- subset(d1, age > 18 & sex == "male")
ht <- a1$ht

require(stats4)

## Loading required package: stats4

nLL <- function(mean, sd){
  fs <- dnorm(
        x = ht
      , mean = mean
      , sd = sd
      , log = TRUE
    ) 
  -sum(fs)
}
fit <- mle(
    nLL
  , start = list(mean = 1, sd = 1)
  , method = "L-BFGS-B"
  , lower = c(0, 0.01)
)

par(mfrow = c(1,2))
hist(ht, freq = FALSE, main = "")
curve(
    dnorm(x, coef(fit)[1], coef(fit)[2])
  , add = TRUE
  , col = "red"
  , lwd = 2
)
plot(ecdf(ht), main = "")
curve(
    pnorm(x, coef(fit)[1], coef(fit)[2])
  , add = TRUE
  , col = "red"
  , lwd = 2
)

B. Estimate the distribution of BMI for adult (age > 18) females using using the method of moment method with the gamma distribution. Create a plot of the estimated density function.

a2 <- subset(d1, age > 18 & sex == "female")
bmi <- a2$bmi

Steps:

1a. Choose a parametric distribution, FX(x|θ)

X ∼ gamma(shape,scale)

1b. Identify the distribution parameters, θ

θ =(shape, scale)

2. Calculate/find distribution moments (need as many moments as distribution parameters)

3. Calculate sample moments.

xbar <- mean(bmi)
s2 <- var(bmi)

4. Create system of equations equating distribution moments to sample moments.

shape∗scale=xbar shape∗scale2=s2

5. Solve system of equations for distribution parameters. (In this example, we use the plug-in method.)

θ_hat =(shape_hat, scale_hat)

shape_hat <- xbar^2/s2
scale_hat <- s2/xbar

6. F_X_hat=F_X(x|θ=θ_hat)

par(mfrow = c(1,2))

fbmi <- function(x){dgamma(x, shape = shape_hat, scale = scale_hat)}
hist(bmi, freq = FALSE, main = "BMI")
curve(fbmi(x), add = TRUE, col = "red", lwd = 2, main = "BMI")

Fbmi <- function(x){pgamma(x, shape = shape_hat, scale = scale_hat)}
plot(ecdf(bmi))
curve(Fbmi(x), add = TRUE, col = "red", lwd = 2, main = "BMI")

A. Estimate the distribution of creatinine (SCr) for adults (age > 18) using the kernel density method with a Gaussian kernel. Create a plot of the estimated density function.

Key idea: A kernel density estimate with the gaussian kernel has a well defined q function.

d2 <- subset(d1, age > 18)

# Filter out missing values from the SCr (creatinine) column and extract the SCr values
SCr <- d2 %>% filter(!is.na(SCr)) %>% `[[`("SCr")

# Compute the kernel density estimate for the SCr data
den1 <- density(SCr)

# Get the number of SCr values (the length of the SCr vector)
N_SCr <- length(SCr)

# Define a custom function ecdfstar that computes a kernel-smoothed ECDF
# 't' is the vector of points where the ECDF is evaluated
# 'data' is the SCr data
# 'smooth' is the smoothing parameter (bandwidth)
ecdfstar <- function(t, data, smooth){
  outer(t, data, function(a,b){ pnorm(a, b, smooth) }) %>% rowMeans
}

# Set up a 1x2 plotting layout
par(mfrow = c(1,2))

# Plot the histogram (EPDF) of SCr with density estimation overlaid
hist(SCr, main = "EPDF", freq = FALSE)  # Histogram of SCr
lines(density(SCr, adjust = 5), lwd = 2, col = "red")  # Add kernel density estimate to the plot

# Plot the empirical cumulative distribution function (ECDF) of SCr
plot(ecdf(SCr), main = "ECDF")

# Overlay the kernel-smoothed ECDF using the custom ecdfstar function
curve(ecdfstar(x, SCr, den1$bw), add = TRUE, lwd = 2, col = "red")

Question 8: Sample from an estimated distribution

The following code will load the low birth weight data from the MASS package. The description of the variables in the dataset can be found in the birthwt documentation with the command ?MASS::birthwt.

bwt <- MASS::birthwt
b_s <- bwt %>% filter(smoke == 1) %>% pull(bwt) # Smokers
b_ns <- bwt %>% filter(smoke == 0) %>% pull(bwt) # Non-smokers

C. Generate a 95% confidence interval for the mean birthweight of infants whose mothers did smoke during pregnancy using the bootstrap.

# Using Replicate Command:
R <- 10000
# Sample size the same size of original data
one_rep <- function(){
idx <- sample.int(length(b_s), length(b_s), replace = TRUE)#b_ns[idx] # Bootstraps replicate
mean(b_s[idx])
}
mean_s<- replicate(R, one_rep())
quantile(mean_s, c(0.025, 0.975))

##     2.5%    97.5% 
## 2620.770 2916.546

# Using For Loop:
means <- rep(NA, R)
for(i in 1:R){
  s <- sample(b_s, length(b_s), replace = TRUE)
  means[i] <- mean(s, na.rm = TRUE)
}
alpha <- 0.05
means %>% quantile(c(alpha/2, 1-alpha/2))

##     2.5%    97.5% 
## 2622.864 2923.054

B. Generate a 95% confidence interval for the mean birthweight of infants whose mothers did smoke during pregnancy using the Central Limit Theorem shortcut.

t1 <- data.frame(b_s)
t.test(b_s~1, data = t1, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  b_s
## t = 36.149, df = 73, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2619.094 2924.744
## sample estimates:
## mean of x 
##  2771.919

A. Let $\mu_s$ be the mean birthweight of infants whose mothers smoked during pregnancy. Let $\mu_{ns}$ be the mean for the non-smoking group. Use simulation to calculate the 95% confidence interval for $\mu_s/\mu_{ns}$.

R <- 10000
means <- rep(NA, R)

# Smokers
mean_s <- mean(b_s, na.rm = TRUE)
sd_s <- sd(b_s, na.rm = TRUE) 

# Non-smokers
mean_ns <- mean(b_ns, na.rm = TRUE)
sd_ns <- sd(b_ns, na.rm = TRUE)

for(i in 1:R){
  s <- rnorm(length(b_s), mean = mean_s, sd = sd_s)
  ns <- rnorm(length(b_ns), mean = mean_ns, sd = sd_ns)
  means[i] <- mean(s)/mean(ns)
}
alpha <- 0.05
t0 <- quantile(means, c(alpha/2, 1-alpha/2))
t0

##      2.5%     97.5% 
## 0.8451132 0.9734206

Question 9: Inference

C. Suppose two studies were performed looking at the risk of mild complication after hernia repair using open and laparoscopic surgical approaches. The study results are below. Using the data from each study individually, perform the hypothesis test that the risk of complication between open and laparoscopic repairs are the same under the usual point null. What is the p-value from each study? What do you conclude from each study?

H_0: The null hypothesis states that the risk of complication between open and laparoscopic surgical approaches are the same.

H_1: The alternative hypothesis states that open and laparoscopic surgical approaches do have an effect on the risk of complication.

Study 1	Comp	No comp	Row Total
Open	30	70	100
Lap	35	65	100
Column Total	65	135	200

z <- function(n,p1,p2) (p1 - p2)/(sqrt(p2*(1-p1))/n)
z2 <- z(200,30/100,35/100)
z2

## [1] -20.20305

2*pnorm(q=z2, lower.tail=FALSE)

## [1] 2

prop.test(x = c(30, 35), n = c(100, 100))$p.value

## [1] 0.5459227

Study 2	Comp	No comp	Row Total
Open	600	1400	2000
Lap	619	1381	2000
Column Total	1219	2781	2000

prop.test(x = c(600, 619), n = c(2000, 2000))$p.value

## [1] 0.5363768

The p-value of both studies are greater than the significance level $\alpha$ = 0.05. Thus, we fail to reject the null hypothesis and conclude that the risk of complication for both open and laparoscopic surgical approaches is the same.

B. Suppose that prior to the studies, the researchers established an equivalence threshold of 6 percentage points. Using the confidence intervals, which studies (if any) showed a conclusive similarity between surgical approaches for the complication rate. Explain why.

To discover conclusive differences:

A conclusive similarity is identified when the confidence interval for the difference falls within the equivalence threshold.
A conclusive difference is identified when the confidence interval for the difference falls outside the equivalence threshold.
An inconclusive result occurs when the confidence interval for the difference straddles the equivalence threshold.

# 6/100 = 0.06
prop.test(35, 100)$conf.int - prop.test(30, 100)$conf.int

## [1] 0.04458092 0.05149560
## attr(,"conf.level")
## [1] 0.95

prop.test(619, 2000)$conf.int - prop.test(600, 2000)$conf.int

## [1] 0.009305443 0.009657601
## attr(,"conf.level")
## [1] 0.95

The confidence interval for study 1 and study 2 is given above, respectively. The 6% points lies outside the differences in the confidence intervals of the open and laparoscopic surgical approaches for both studies. This means that both studies did not showed a conclusive similarity between surgical approaches the the complication rate.

A. If the data from the studies were combined, what is the smallest equivalence threshold that would identify a conclusive similarity between the surgical approaches?

Study 1 & 2	Comp	No comp	Row Total
Open	630	1470	2100
Lap	654	1446	2100
Column Total	1284	2916	4200

prop.test(654, 2100)$conf.int - prop.test(630, 2100)$conf.int

## [1] 0.01120185 0.01161296
## attr(,"conf.level")
## [1] 0.95

The smallest equivalence threshold that would identify a conclusive similarity between the surgical approaches would be 1.12% points.