Assessing the Effectiveness of Kidney Treatment Procedures

Introduction

In 1986, a group of urologists in London published a research paper in The British Medical Journal that compared the effectiveness of two methods for removing kidney stones. Treatment A was open surgery (invasive), and treatment B was percutaneous nephrolithotomy (less invasive). Across all 700 patients, treatment B had the higher success rate. However, within each subgroup of patients defined by kidney stone size, treatment A had the better success rate. This well-known statistical phenomenon is called Simpson’s paradox: a trend that appears within subgroups disappears or reverses when the subgroups are combined. The goal of this study is to determine whether treatment A is superior to treatment B after accounting for the severity of the kidney stones.

Methods

A. Data

The dataset contains 700 observations, each representing one treatment session for a different patient.

B. Variables

Response Variable: success is a binary categorical response variable with two levels (“Success”, coded as 1, and “Failure”, coded as 0). It indicates the outcome of the treatment.

Predictor Variables:

1. treatment is a categorical variable with two levels, A and B, indicating the type of treatment.
2. stone_size is a categorical variable with two levels, coded as ‘small’ or ‘large’.

C. Statistical Methods

This study models the odds of a successful treatment outcome using a multiple logistic regression with stone size and treatment type as predictors.

Exploratory Data Analysis (EDA)

I compare the success and failure rates of each treatment overall and by kidney stone size. To do this, I calculate the proportion of each outcome, first ignoring kidney stone size and then within each stone-size group.

library(dplyr)  # group_by(), summarize(), mutate(), n()

# Calculate the number and frequency of success and failure of each treatment
summary_data <- data |>
  group_by(treatment, success) |>
  summarize(N = n()) |>
  mutate(Freq = round(N / sum(N), 3))

## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.

# Calculate the number and frequency of success and failure of each treatment
# within each kidney stone size group. `success` is grouped last so that Freq
# is the success/failure rate within each treatment-by-stone-size cell.
sum_data <- data |>
  group_by(treatment, stone_size, success) |>
  summarize(N = n(), .groups = "drop_last") |>
  mutate(Freq = round(N / sum(N), 3))

Based on the overall distribution, treatment B appears to be slightly more successful than treatment A, with success rates of 82.6% and 78% respectively. Within each stone-size group, however, the ordering reverses: treatment A is more successful than treatment B for both small and large kidney stones.
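
The overall and stratified rates can also be checked directly from cell counts. The sketch below is a minimal, self-contained illustration in base R. It assumes the dataset matches the success counts published in the 1986 study; the `counts` data frame and its column names are hypothetical and are not part of the original analysis.

```r
# Assumed cell counts, taken from the published 1986 kidney stone study.
# Replace them with counts computed from `data` if they differ.
counts <- data.frame(
  treatment  = c("A", "A", "B", "B"),
  stone_size = c("small", "large", "small", "large"),
  successes  = c(81, 192, 234, 55),
  patients   = c(87, 263, 270, 80)
)

# Success rate within each treatment-by-stone-size group
counts$rate <- round(counts$successes / counts$patients, 3)

# Overall success rate per treatment, ignoring stone size
overall <- aggregate(cbind(successes, patients) ~ treatment,
                     data = counts, FUN = sum)
overall$rate <- round(overall$successes / overall$patients, 3)

print(counts)
print(overall)
```

Printing the two tables side by side shows the group-level rates next to the pooled rates, which is where the reversal described in the introduction would be visible.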

Testing

A Chi-squared test of independence was conducted to examine the relationship between treatment type and kidney stone size. The results revealed a statistically significant association between the two variables (p < .001). This indicates that the choice of treatment is not independent of stone size: each treatment was more likely to be used for one stone size than for the other. Stone size therefore needs to be considered when comparing treatment outcomes, because it is related to both treatment selection and success rates.

library(broom)  # provides tidy()

# Run a Chi-squared test
res <- chisq.test(data$treatment, data$stone_size)
tidy(res)

## # A tibble: 1 × 4
##   statistic  p.value parameter method                                           
##       <dbl>    <dbl>     <int> <chr>                                            
## 1      189. 4.40e-43         1 Pearson's Chi-squared test with Yates' continuit…
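
To make the test concrete, the following sketch builds the treatment-by-stone-size contingency table by hand and passes it to the same base R function, `chisq.test()`. The patient counts are an assumption (taken from the published 1986 study rather than read from `data`), and `tab`, `expected`, and `res_tab` are hypothetical names.

```r
# Assumed patient counts per treatment and stone size (published 1986 study)
tab <- matrix(
  c(87, 263,    # treatment A: small, large
    270, 80),   # treatment B: small, large
  nrow = 2, byrow = TRUE,
  dimnames = list(treatment = c("A", "B"), stone_size = c("small", "large"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson's Chi-squared test; Yates' continuity correction is applied
# by default for 2 x 2 tables
res_tab <- chisq.test(tab)

print(expected)
print(res_tab)
```

Comparing `tab` with `expected` shows which treatment was used more often than independence would predict for each stone size.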

Reporting the final model:

\(\pi = \text{Probability of Successful Outcome}\)

\[\log\left(\frac{\widehat{\pi}}{1-\widehat{\pi}}\right) = 1.0332140 + 1.2605654\,(\text{StoneSizeSmall}) - 0.3572287\,(\text{TreatmentB})\]

# Run a multiple logistic regression
data$success <- as.numeric(data$success)
logistic_model <- glm(data = data, success ~ stone_size + treatment, 
                      family ='binomial')
tidy(logistic_model)

## # A tibble: 3 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        1.03      0.134      7.68 1.55e-14
## 2 stone_sizesmall    1.26      0.239      5.27 1.33e- 7
## 3 treatmentB        -0.357     0.229     -1.56 1.19e- 1

Model interpretation

A logistic regression model was fitted to estimate the probability of treatment success based on kidney stone size and treatment type. Having a small stone is associated with significantly higher odds of success (coefficient = 1.26 on the log-odds scale, p < 0.001), while treatment B has a negative but non-significant coefficient relative to treatment A (coefficient = -0.36, p = 0.12).
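
Exponentiating a coefficient converts it from the log-odds scale to an odds ratio, which is often easier to read. As a worked illustration using the coefficients reported above (the rounded values are approximate):

\[\widehat{OR}_{\text{small vs. large}} = e^{1.2605654} \approx 3.53, \qquad \widehat{OR}_{\text{B vs. A}} = e^{-0.3572287} \approx 0.70\]

Under this model, the estimated odds of success for small stones are about 3.5 times those for large stones with the treatment held fixed, and the estimated odds of success under treatment B are about 0.70 times those under treatment A with stone size held fixed.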

Predicted probabilities illustrate these effects clearly. For patients with large stones, the predicted probability of success is 73.8% with treatment A and 66.3% with treatment B. For small stones, it rises to 90.8% with treatment A and 87.4% with treatment B. These results indicate that stone size is a strong predictor of treatment success. Treatment A is predicted to perform slightly better in both groups, and the gap is narrower for small stones (about 3.4 percentage points, versus about 7.5 for large stones) because the same difference in log-odds translates into a smaller difference in probability when success rates are already high.
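
These probabilities follow from applying the inverse logit to the fitted equation. The following is a minimal sketch in base R that assumes only the coefficients reported above; the names `coefs`, `grid`, and `log_odds` are illustrative and not part of the original analysis.

```r
# Coefficients copied from the reported model (log-odds scale)
coefs <- c(intercept = 1.0332140, small = 1.2605654, treatment_b = -0.3572287)

# All four combinations of stone size and treatment
grid <- expand.grid(stone_size = c("large", "small"),
                    treatment  = c("A", "B"),
                    stringsAsFactors = FALSE)

# Linear predictor: intercept plus the indicator terms that apply to each row
log_odds <- coefs[["intercept"]] +
  coefs[["small"]] * (grid$stone_size == "small") +
  coefs[["treatment_b"]] * (grid$treatment == "B")

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
grid$prob_success <- round(plogis(log_odds), 3)

print(grid)
```

With the fitted model object available, `predict(logistic_model, newdata = grid, type = "response")` should give the same values up to rounding, provided `grid` uses the same column names and levels as the training data.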

Conclusions & Discussion

This analysis investigated the relationship between treatment type, kidney stone size, and the probability of treatment success using logistic regression and Chi-squared testing. The Chi-squared test revealed a statistically significant association between treatment choice and stone size (p < 0.001), suggesting that treatment selection is not independent of the size of the kidney stone.

The logistic regression model further demonstrated that stone size is a strong and statistically significant predictor of treatment success. Patients with small stones had substantially higher odds of success compared to those with large stones. Although treatment B was associated with slightly lower odds of success compared to treatment A, this effect was not statistically significant (p = 0.12), so the data do not provide strong evidence that treatment A is superior to treatment B once stone size is accounted for.

The predicted probabilities reinforce these findings: success rates differ far more between stone sizes than between treatments. From a clinical or policy perspective, these results underscore the importance of stratifying patients by stone size when evaluating treatment options. Further analysis with larger samples or additional patient-level data could help clarify any subtle differences between treatments and guide personalized medical decisions.