Getting Started

Case Study

People providing an organ for donation sometimes seek the help of a special “medical consultant”. These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant’s clients.

One consultant tried to attract patients by noting that the average complication rate for liver donor surgeries in the US is about 10%, but her clients have only had 3 complications in the 62 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).

Learning goals

  • Work on our visualization skills in R
  • Investigate what a simulation in R looks like
  • Conducting a hypothesis test for proportions

Packages and Data

We’ll use the tidyverse for visualization, so will load it into the console by running the following:

library(tidyverse) 

The data we need for this lab is found on Canvas in the file organ.csv. To load this data into R, we will need download the file and save it to the same place that we have saved the Lab 3 template. When we load data into R it is important to remember that the file name should be in quotation marks and that we need to include the file extension (e.g., .csv) as well.

Take a look back at Lab 1 for specific instructions and information on errors messages that may occur when you try and load data into R.

organ_donor <- read_csv("organ.csv")

Instructions

  • Except for the code needed to load the tidyverse, and the data, delete all the text prior to Exercises
  • After Exercises the only text that should appear are the questions and your answers to them
  • Delete all the comments that have been left for you in the code chunks
  • Include all code and figures in the final submission

Exercises

Excercise 1: Exploring the data

The first step in the analysis of a new data set is getting acquainted with the data. Use at least one of the tools we have explored in previous labs (e.g., str, glimpse) to answer the following questions:

# Below type a line of code to look at the data
  1. What is the sample size?

  2. How many variables are in the data set?

  3. What are the variables in the data set?

Exercise 2: Defining our hypothesis

  1. Which of the following is the correct null and alternative or hypotheses? Delete the wrong answers so that only the correct one remains in the submitted assignment.
  1. \(H_0: \pi = 0.10\); \(H_A: \pi \ne 0.10\)

  2. \(H_0: \pi = 0.10\); \(H_A: \pi > 0.10\)

  3. \(H_0: \pi = 0.10\); \(H_A: \pi < 0.10\)

  4. \(H_0: \hat{p} = 0.10\); \(H_A: \hat{p} \ne 0.10\)

  5. \(H_0: \hat{p} = 0.10\); \(H_A: \hat{p} > 0.10\)

  6. \(H_0: \hat{p} = 0.10\); \(H_A: \hat{p} < 0.10\)

  1. Have the validity conditions for a theory-based approach to hypothesis testing been met? Explain.

Exercise 3: Simulation for Hypothesis Testing

Set a seed!

When we perform simulations we generate random samples. Every time we run R we therefore get slightly different results, since our simulations will be different every time. However, when working in groups we may want a way to ensure that everyone’s code should produce identical results. This can be done with the command set.seed in R. The input to the function can be any number you like, from \(-\infty\) to \(\infty\), what matters is that everyone use the same number.

One simulation

When taking a simulation based approach to hypothesis testing we want to replicate our data collection over and over again under the assumption that the null hypothesis is true, calculating a value for the statistic each time.

So for our organ donor example our data collection consisted of recording how many of the consultants 62 patients had complications in their liver transplant surgery. When we simulate this data collection we will assume that the null hypothesis is true, so the complication rate would be equal to 10%. Then, once we have a replicate of the data collection we would calculate the percentage of the patients, out of the 62, that were observed to have complications.

So our steps for this simulation are:

  1. simulate 62 liver donors, where 10% had complications
  2. calculate the proportion of liver donors who had complications (\(p_{simulated}\))
  3. repeat the two steps above a large number of times
  4. plot the distribution of simulated proportions centered on \(\pi\)
  5. count the number of simulated proportions that were less than or equal to the observed proportion (\(p_{simulated} \leq -\hat{p}\)), where the observed proportion is 3/62 for our case study
  6. divide the count from the step 6 by the total number of simulations that were done to get the p-value

Let’s start by looking at just one simulation. Here we are going to set a seed so that everyone in the class gets the same \(p_{simulated}\).

  1. What is the value you calculated for sim.prop?

Step 3: We don’t want to have to manually repeat steps one over and over and over again for Step 3, so instead we can automate the process. We can do this by making a small change to the R function rbinom in the above code. This function draws n random values from a binomial distribution, each with a sample size of size and a probability of success equal to prob. Therefore we can completely generalize our code to simulate a proportion to work with any number of simulations (n), any possible sample size (size), and any probability of success between zero and one (prob)

Note Do not run the code below (it won’t work), and delete it from the final submission.

simulation <- rbinom(n,size,prob)
sim.prop   <- simulation/size

In the space for the code chunk below recreate the objects simulation and sim.prop. Now, however, the objectsimulation will hold 10000 simulations from a binomial distribution with a sample size equal to 62 and \(\pi=0.10\), and sim.prop will have the associated 10000 simulated values of the proportion of complications (\(p_{simulated}\)).

If we want the results of the simulation to be the same every time we run the code (or if different people run the code), then we should include the function set.seed before the line of code creating the object simulation.

NOTE Delete eval=FALSE before knitting the document

# Step 3: Repeat steps 1 and 2 many, many times
# Do the following in the code below:
#  Change n to the number of simulations (10000)
#  Change size to the sample size from our case study
#  Change prob to the hypothesized true proportion from our case study

simulation <- rbinom(n,size,prob)
sim.prop   <- simulation/size

Step 4: For Step 4 we want to visualize the simulated data. We will do this using ggplot, creating a histogram of the simulated null distribution, including a vertical line in the color of your choice at the value of the observed value for \(\hat{p}\). However, ggplot requires that any data it plots be in a data frame. Let’s look at what type of object sim.prop is:

# Run this code without making any changes
typeof(sim.prop)
  1. What type of object is sim.prop? Is this a data frame, a number or a character?

We can change sim.prop to a data frame using the function data.frame

NOTE Delete eval=FALSE before knitting the document

# Run this code without making any changes to it
sim.prop <- data.frame(sim.prop)

Now we are ready to create the histogram.

# Step 4: Visualization
# In the code below:
#   Choose the geom to make a histogram
#   Add an informative title and x-axis label
#   Choose a color for the vertical line (name of the color should be in quotation marks)
ggplot(sim.prop, aes(sim.prop))+
   geom_()+
   labs()+
   geom_vline(xintercept=3/62,col=)

Notice that when we create the histogram, we get an error message from R:

This is telling us that the default number of bins (i.e., the number of bars on the histogram), which is 30, doesn’t make for a good figure. We can modify how many bins are used by a histogram using the bins argument inside the geom function. bins should equal the number of bars on a histogram that we want R to use.

NOTE Delete eval=FALSE before knitting the document

# Step 4: Visualization
# In this code chunk the argument bins has been added
#  The value can be left at 17
#  It is worth playing around with different values to see what effect it has
# In the code below:
#   Choose the geom to make a histogram
#   Add an informative title and x-axis label
#   Choose a color for the vertical line (name of the color should be in quotation marks)
# The same answers from the previous code chunk can be used here
ggplot(sim.prop, aes(sim.prop))+
   geom_(bins=17)+
   labs()+
   geom_vline(xintercept=3/62,col=)

Step 5: We are doing a one-tailed test where the alternative hypothesis is that the consultant’s complication rate is less than 10%. Therefore, for this simulation we only want to count the number of simulated statistics that less than or equal to the one we observed. We can do this using some of the logical operators we learned in Lab 1, specifically, <=

NOTE Delete eval=FALSE before knitting the document

# Step 5: Counting
# In the code below to the following:
#   Replace [less than or equal to] with the correct logical operator
#   Replace [p-hat] with the value for the observed proportion

p_sim <- sum(sim.prop [less than or equal to]  [p-hat])

Step 6: To calculate the p-value we want to divide the number of simulated statistics that were at least as extreme as the one we observed by the total number of simulations.

Note Delete eval=FALSE before knitting the document

# Step 6: p-value
# Divide p_sim by the total number of simulations
p_sim/
  1. What p-value was calculated for the hypothesis test?

  2. At a significance level of \(\alpha=0.05\) what decision do you make with regards to the null hypothesis? Is the consultant’s complication rate for liver donation surgeries consistent with the national average?

Exercise 4: Theory-based Approach

In class we have learned how to use a z-test to take a theory-based approach to hypothesis testing. In addition to all it’s functions, R can also be used as a calculator, helping us to calculate the standardized statistic. Some arithmetic operators in R are:

  • + - addition
  • - - subtraction
  • * - multiplication
  • / - division
  • ^ - raise to the power (e.g., 3^2 will calculate 3 to the power of 2)
  • sqrt() - square-root (e.g., sqrt(4) will calculate the square-root of 4)
  • exp() - exponent (e.g., exp(0) will calculate the exponent of 0, also known as e to the power of 0)
  • log() - log base 10 (i.e., natural log) (e.g. log(6) will calculate the natural log of 6)

Now we will use these operators to calculated the standard error and z-statistic. The relevant equations are:

Something that can be helpful when working in R is to calculate the different components of an equation separately and assign them to object using <-. This makes it easier to keep track of parentheses and order of operations.

NOTE Delete eval=FALSE before knitting the document

# Use this code chunk to:
#  Calculate the pi*(1-pi) and assign it to the object mult
#  Calculate mult divided by the sample size and assign it to the value frac
#  Calculate the standard error by taking the square-root of frac and assign it to the 
#    object SE
#  Calculate the difference between the observed statistic and the hypothesized statistic 
#    and assign it to the object diff
#  Calculate the z-statistic and assign it to the object z
mult <- 
frac <-
SE   <- 
diff <- 
z    <- 
z
  1. What is our z-statistic? What conclusions would you draw regarding the strength of evidence against the null hypothesis based on this value?

While all theory-based approaches make assumptions, they don’t all make the same assumptions. In R we can using a function called prop.test (proportion testing) to do a significance test for proportions. It uses a different approach, a \(\chi^2\) test (pronounced chi-squared test), to find the standardized statistic (\(\chi^2\)-statistic in this case) and p-value.

To use the function we need to know its arguments, which are the inputs it require to run. Sometimes arguments have defaults, which mean we don’t need to do anything about them, so long as the default value is the one we want to use. In other cases we will be required to give a function specific types of information in order to do the analysis. For prop.test, we are interested in the following four arguments, although the test itself does have more than that:

  • x: the number of successes
  • n: the sample size
  • p: the hypothesized probability of success
  • alternative: whether we want to do a two-sided test (two.sided) or a one-sided test (either less or greater). The name of the direction must always be in quotation marks.

NOTE Delete eval=FALSE before knitting the document

# In the code below:
#  Replace x with the obeserved number of complications
#  Replace n with the total number of patients
#  Replace p with the hypothesized true rate of complications
#  Indicate that we are testing the alternative hypothesis "less"
prop.test(x,n,p,alternative=)
  1. What is the p-value?

  2. At a significance level of \(\alpha=0.05\) what decision do you make with regards to the null hypothesis? Is the consultant’s complication rate for liver donation surgeries consistent with the national average?

  3. Which approach to assessing the consultant’s claims (the simulation, z-test or prop.test) should we trust the most? Why?