This lab is concerned with data on almost 8000 patients treated for kidney stones using one of two treatments. The data consist of a data frame containing the following variables.

Variable    Description
size        Size of the kidney stone: small, medium or large
treatment   Treatment used: A or B
outcome     Outcome of treatment: successful or unsuccessful

All three variables are categorical. What we’re interested in is whether the success rate differs between treatments, both overall and within each kidney stone size.

To do this, we’ll be generating some tables so we can compute the proportion of successes for each treatment, and then perform a hypothesis test to see if the proportion is the same (i.e. the difference between success rates is zero).

We’ll start by reading the data in and doing a summary.

library(ggplot2)
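# stringsAsFactors = TRUE so that summary() reports counts per level (the default is FALSE from R 4.0 onwards)
kidney = read.csv("http://www.massey.ac.nz/~jcmarsha/227212/data/kidney.csv", stringsAsFactors = TRUE)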
summary(kidney)
##      size      treatment         outcome    
##  large :2744   A:3998    successful  :6446  
##  medium:2394   B:3996    unsuccessful:1548  
##  small :2856

Next, we’ll do a plot of treatment by outcome.

ggplot(kidney) + geom_bar(aes(x=treatment, fill=outcome))

Which treatment is more successful overall? Write some notes here.

Treatment B appears slightly more successful overall.

Next, we’ll present the same information in tabular form. This allows us to see the actual numbers:

tab = table(kidney$treatment, kidney$outcome)
tab
##    
##     successful unsuccessful
##   A       3180          818
##   B       3266          730

You might find it hard to tell here that the success rate for B is a bit better than A. It’s easier if we present the table as proportions though. prop.table can do this for us (compute proportions on a table). We specify margin=1 so that it uses the rows (margin=2 would do the proportions down the columns instead).

prop.table(tab, margin = 1)
##    
##     successful unsuccessful
##   A  0.7953977    0.2046023
##   B  0.8173173    0.1826827
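
To see what the margin argument changes, the sketch below rebuilds the table from the counts printed above (so it runs without downloading the data) and computes both row and column proportions. The object names counts, row_props and col_props are only for illustration.

```r
# Rebuild the treatment-by-outcome table from the counts shown above
counts = matrix(c(3180, 818,
                  3266, 730),
                nrow = 2, byrow = TRUE,
                dimnames = list(treatment = c("A", "B"),
                                outcome = c("successful", "unsuccessful")))

# margin = 1: proportions within each row (success rate for each treatment)
row_props = prop.table(counts, margin = 1)
row_props

# margin = 2: proportions within each column (share of each outcome by treatment)
col_props = prop.table(counts, margin = 2)
col_props

# Rows sum to 1 in the first case, columns in the second
rowSums(row_props)
colSums(col_props)
```

For comparing treatments, the row proportions are the ones we want: they answer "of the patients given this treatment, what fraction succeeded?".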

From here you should see that B is more successful. Next, we’ll perform a hypothesis test to see whether what we see in this sample might hold in the general population. We do this using prop.test (a test to compare proportions).

tab = table(kidney$treatment, kidney$outcome)
prop.test(tab)
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  tab
## X-squared = 6.0099, df = 1, p-value = 0.01423
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.03948753 -0.00435171
## sample estimates:
##    prop 1    prop 2 
## 0.7953977 0.8173173
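
It can help to see where the confidence interval comes from. The sketch below recomputes it by hand from the counts in the table, assuming the usual normal approximation for a difference of two proportions plus the continuity correction that prop.test applies by default, and then compares the result with prop.test. The names successes, totals and manual_ci are illustrative.

```r
# Counts from the table above: successes and totals for treatments A and B
successes = c(A = 3180, B = 3266)
totals    = c(A = 3998, B = 3996)

p = successes / totals               # sample proportions
diff = unname(p["A"] - p["B"])       # estimated difference (A minus B)

# Standard error of the difference between two independent proportions
se = sqrt(sum(p * (1 - p) / totals))

# Continuity correction (prop.test caps this when the difference is tiny,
# which is not the case here)
cc = 0.5 * sum(1 / totals)

# 95% interval: estimate plus or minus (normal quantile * se + correction)
half_width = qnorm(0.975) * se + cc
manual_ci = c(diff - half_width, diff + half_width)
manual_ci

# prop.test also accepts successes and totals directly
test = prop.test(successes, totals)
test$conf.int
test$p.value
```

The interval is for the proportion in the first group minus the proportion in the second, so a negative interval means the second group (B) has the higher success rate.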

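Write some notes on what hypothesis is being tested and what your conclusion is from both the P-value and the confidence interval.

The null hypothesis is that the success proportions for treatments A and B are equal, i.e. the difference between them is zero; the alternative is that they differ. The P-value of 0.014 is below 0.05, so we have evidence against the null hypothesis at the 5% level. The 95% confidence interval for the difference (A minus B) runs from about -0.039 to -0.004. It excludes zero and is entirely negative, so B’s success rate is estimated to be between roughly 0.4 and 3.9 percentage points higher than A’s: a real but small difference.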

Next, we’ll break the data down a little and assess what effect the size of the kidney stone has on treatment. In the code block below we reproduce the plot from before. Modify this by adding on facet_wrap so that you get a different plot for each size of kidney stone. You might also want to explore setting position='fill' in the geom_bar function.

ggplot(kidney) + geom_bar(aes(x=treatment, fill=outcome), position='fill') + facet_wrap(~size)
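
The faceted plot can be backed up with numbers. The sketch below is a base R version that enters the counts for each size, treatment and outcome by hand (they match the per-size tables produced later in this lab) and computes the success rate in each size and treatment combination. With the full data frame, the same prop.table call could be applied to table(kidney$size, kidney$treatment, kidney$outcome); the array here is just a stand-in so the example runs on its own.

```r
# Counts indexed by size, treatment and outcome.
# Arrays fill with the first index (size) varying fastest.
counts = array(
  c(1536, 996, 648,    # A, successful:   large, medium, small
    440,  954, 1872,   # B, successful
    568,  202, 48,     # A, unsuccessful
    200,  242, 288),   # B, unsuccessful
  dim = c(3, 2, 2),
  dimnames = list(size = c("large", "medium", "small"),
                  treatment = c("A", "B"),
                  outcome = c("successful", "unsuccessful")))

# Proportions of each outcome within every size/treatment combination
props = prop.table(counts, margin = c(1, 2))

# Keep only the success rates: one row per size, one column per treatment
rates = props[, , "successful"]
round(rates, 3)
```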

What is your conclusion from these? Is it unexpected?

Within each stone size, treatment A now seems to be more successful than treatment B. This is unexpected, since B looked more successful overall.

Next, we’ll see if what we see in our data also holds in the population by performing a hypothesis test similar to the one we did before, but this time for only small stones. This is a simple way to control for the effect of stone size. First we take only the small stones, and look at the table we get for them:

small = subset(kidney, size == "small")
small_tab = table(small$treatment, small$outcome)
small_tab
##    
##     successful unsuccessful
##   A        648           48
##   B       1872          288

Now, add another code block to perform the hypothesis test like we did before, this time using small_tab in place of tab. What is your conclusion?

# Test whether the success proportions for A and B are equal among small stones
small = subset(kidney, size == "small")
small_tab = table(small$treatment, small$outcome)
prop.test(small_tab)

Finally, do the same tests for medium stones and large stones. You should find that treatment A is more successful than treatment B for all stone sizes. How can this be, when above we found that treatment B was better overall? Discuss this with those around you and see if you can figure out what is going on.

# Medium stones only
medium = subset(kidney, size == "medium")
medium_tab = table(medium$treatment, medium$outcome)
medium_tab
##    
##     successful unsuccessful
##   A        996          202
##   B        954          242

# Large stones only
large = subset(kidney, size == "large")
large_tab = table(large$treatment, large$outcome)
large_tab
##    
##     successful unsuccessful
##   A       1536          568
##   B        440          200
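
One possible way to run the same test for all three sizes at once is sketched below. It works from the counts in the tables above rather than the data frame, so it runs on its own; the list names and the results matrix are illustrative and not part of the lab's code.

```r
# Successes and totals per treatment, by stone size (from the tables above)
successes = list(small  = c(A = 648,  B = 1872),
                 medium = c(A = 996,  B = 954),
                 large  = c(A = 1536, B = 440))
totals    = list(small  = c(A = 696,  B = 2160),
                 medium = c(A = 1198, B = 1196),
                 large  = c(A = 2104, B = 640))

# Run the two-sample proportion test once per size and keep the key numbers
results = t(sapply(names(successes), function(s) {
  test = prop.test(successes[[s]], totals[[s]])
  c(prop_A  = unname(test$estimate[1]),
    prop_B  = unname(test$estimate[2]),
    p_value = test$p.value)
}))
round(results, 4)
```

Each row gives the two sample proportions and the P-value for one stone size, which makes it easy to compare the direction of the difference with the overall test.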

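Looking at the proportions within each stone size, treatment A is more successful than treatment B for small, medium and large stones. Treatment B still has the higher overall success rate because the treatments were not used evenly across sizes: B was used mostly on small stones (2160 of its 3996 patients), which have high success rates under either treatment, while A was used mostly on large stones (2104 of its 3998 patients), which are harder to treat. Stone size is therefore a confounder, and pooling across sizes reverses the comparison. This is known as Simpson’s paradox.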
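
The apparent contradiction can be checked numerically. The sketch below, again using counts copied from the tables above with illustrative object names, shows how each treatment's patients are spread across stone sizes and confirms that the overall success rate is the average of the within-size rates weighted by that spread.

```r
# Patients per stone size for each treatment (row totals of the tables above)
n = rbind(A = c(small = 696,  medium = 1198, large = 2104),
          B = c(small = 2160, medium = 1196, large = 640))

# Successful outcomes in the same layout
s = rbind(A = c(small = 648,  medium = 996, large = 1536),
          B = c(small = 1872, medium = 954, large = 440))

# Share of each treatment's patients in each size group (rows sum to 1)
mix = prop.table(n, margin = 1)
round(mix, 3)

# Success rate within each size group
rates = s / n
round(rates, 3)

# Overall rate = within-size rates weighted by the share of patients in each size
overall = rowSums(mix * rates)
overall
```

Treatment A has the higher rate in every column of rates, yet its overall figure is lower because most of its weight sits on the large stones, where success is least likely.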