Required Sample Size

2026-06-07

Topic

Find how many samples are needed to detect a statistically significant difference between the proportions of two independent populations.

Example Study Description

Study question: Does a chosen police officer have a rate of resisted arrests that is different from that of other police officers?

Population probabilities

\(p_{1}\): Probability of resisted arrest for target officer
\(p_{2}\): Probability of resisted arrest for other officers

Statistical hypotheses

\(H_{0}: p_{1}=p_{2}\)
\(H_{1}: p_{1}\neq p_{2}\)

Dataset

This project uses police arrest data from the police department's Record Management System (RMS) in a chosen city.

library(dplyr)

# Load the arrest records and keep only the columns used below
df = read.csv("Police_Arrests.csv")

df = df %>% 
  select(ArrestNum, StatuteCode, OfficerSerialNum)
head(df)

##     ArrestNum  StatuteCode OfficerSerialNum
## 1 2026-000104     28-693A              1273
## 2 2026-000104 28-701.02A2              1273
## 3 2026-000104     28-708A              1273
## 4 2026-000113    28-3473A              1722
## 5 2026-000106    13-1201A              1663
## 6 2026-000106     28-693A              1663

Data Pre-Processing:

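library(stringr)

# Statute codes treated as resisting arrest
resistCodes <- c("13-2508A1", "13-2508A2", "13-2508A3")

df <- df %>%
  mutate(across(where(is.character), str_trim)) %>%  # strip stray whitespace
  group_by(ArrestNum) %>%
  # collapse to one row per arrest, keeping all of its statute codes
  summarise(StatuteCode = list(StatuteCode), 
            OfficerSerialNum = first(OfficerSerialNum)) %>%
  # flag arrests that include any of the resisting-arrest codes
  mutate(resArrest = sapply(StatuteCode, function(x) any(resistCodes %in% x)))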
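The flagging logic can be checked on a few made-up rows. The sketch below uses base R (`trimws` and `tapply`) rather than the dplyr pipeline above, and the arrest numbers and officer IDs are hypothetical; only the three `13-2508` statute codes come from the pre-processing step.

```r
# Toy records (hypothetical): one row per charge, several charges per arrest
toy <- data.frame(
  ArrestNum        = c("A-1", "A-1", "A-2", "A-3", "A-3"),
  StatuteCode      = c("28-693A", "13-2508A1", "28-708A", "13-2508A3 ", "13-1201A"),
  OfficerSerialNum = c(1, 1, 2, 2, 2)
)

# Statute codes treated as resisting arrest (taken from the step above)
resist_codes <- c("13-2508A1", "13-2508A2", "13-2508A3")

# Trim whitespace so that "13-2508A3 " still matches
toy$StatuteCode <- trimws(toy$StatuteCode)

# One logical per arrest: TRUE if any charge is a resisting-arrest code
res_by_arrest <- tapply(toy$StatuteCode, toy$ArrestNum,
                        function(codes) any(codes %in% resist_codes))
res_by_arrest
```

Arrests `A-1` and `A-3` contain a resisting-arrest code and are flagged; `A-2` is not.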

Formulas

For a proportion hypothesis test, the Z-statistic is used: \[z={\hat{p}_{1}-\hat{p}_{2} \over \sqrt{\hat{p}\left(1-\hat{p}\right)\left[{1 \over n_{1}} + {1 \over n_{2}}\right]}}\]

Where:

\(\hat{p}_{1}, \hat{p}_{2}\) are the sample proportions for populations 1 and 2
\(n_{1}, n_{2}\) are the sample sizes for populations 1 and 2
\(\hat{p}={n_{1}\hat{p}_{1} + n_{2}\hat{p}_{2} \over n_{1} + n_{2}}\) is the pooled sample proportion
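As a minimal sketch, the statistic can be computed directly from event counts. The helper name `two_prop_z` and the counts in the example call are hypothetical and not taken from the dataset; the two-sided p-value uses the standard normal distribution.

```r
# Hypothetical helper: pooled two-proportion z-test from event counts
two_prop_z <- function(x1, n1, x2, n2) {
  p1_hat <- x1 / n1
  p2_hat <- x2 / n2
  # Pooled proportion; equals (n1*p1_hat + n2*p2_hat) / (n1 + n2)
  p_hat <- (x1 + x2) / (n1 + n2)
  se <- sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
  z <- (p1_hat - p2_hat) / se
  # Two-sided p-value for H1: p1 != p2
  c(z = z, p_value = 2 * pnorm(-abs(z)))
}

# Made-up counts for illustration only
res <- two_prop_z(x1 = 20, n1 = 131, x2 = 94, n2 = 1172)
res
```

The squared statistic should agree with the chi-squared statistic that `prop.test()` reports when called with `correct = FALSE`, which gives a quick cross-check.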

Solving for sample size

Code:

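bsamsize <- function(p1, p2, fraction, alpha, power)
{
  # fraction = n1/(n1+n2), the share of all samples drawn from population 1
  z.alpha <- qnorm(1-alpha/2)   # two-sided critical value
  z.beta  <- qnorm(power)       # quantile for the desired power

  ratio <- (1-fraction)/fraction      # n2/n1
  p <- fraction*p1+(1-fraction)*p2    # weighted average proportion

  n1 <- (z.alpha * sqrt((ratio+1) * p * (1-p)) +
         z.beta * sqrt(ratio * p1 * (1-p1) + p2 * (1 - p2))
        )^2/ratio/((p1-p2)^2)
  
  n2 <- ratio*n1
  c(n1=n1, n2=n2)
}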
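
Written out, the function above evaluates the expression below. The symbols \(f\) (the `fraction` argument), \(r=(1-f)/f=n_{2}/n_{1}\) (the `ratio` variable) and \(\bar{p}=f\,p_{1}+(1-f)\,p_{2}\) (the variable `p`) are notation introduced here for readability, and \(z_{q}\) denotes the standard normal quantile `qnorm(q)`:

\[n_{1}={\left(z_{1-\alpha/2}\sqrt{(r+1)\,\bar{p}\left(1-\bar{p}\right)}+z_{\text{power}}\sqrt{r\,p_{1}\left(1-p_{1}\right)+p_{2}\left(1-p_{2}\right)}\right)^{2} \over r\,\left(p_{1}-p_{2}\right)^{2}},\qquad n_{2}=r\,n_{1}\]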

Parameter Selection

The necessary sample size decreases as:

power decreases
\({n_{2} \over n_{1}}\to 1.0\)
\(|p_{1}-p_{2}|\) increases
\(\alpha\) increases
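
These trends can be checked numerically. The sketch below repeats a compact copy of the `bsamsize()` function defined above so that it runs on its own, then compares the total sample size \(n_{1}+n_{2}\) across scenarios; the helper `total_n` and the scenario values are chosen for illustration only.

```r
# Compact copy of bsamsize() from above, so this block is self-contained
bsamsize <- function(p1, p2, fraction, alpha, power) {
  z.alpha <- qnorm(1 - alpha / 2)
  z.beta  <- qnorm(power)
  ratio   <- (1 - fraction) / fraction
  p       <- fraction * p1 + (1 - fraction) * p2
  n1 <- (z.alpha * sqrt((ratio + 1) * p * (1 - p)) +
         z.beta * sqrt(ratio * p1 * (1 - p1) + p2 * (1 - p2)))^2 /
        ratio / ((p1 - p2)^2)
  c(n1 = n1, n2 = ratio * n1)
}

# Hypothetical helper: total sample size n1 + n2 for one scenario
total_n <- function(...) sum(bsamsize(...))

# Change one parameter at a time relative to the baseline scenario
scenarios <- c(
  baseline          = total_n(0.15, 0.08, fraction = 0.1, alpha = 0.1, power = 0.8),
  lower_power       = total_n(0.15, 0.08, fraction = 0.1, alpha = 0.1, power = 0.7),
  balanced_groups   = total_n(0.15, 0.08, fraction = 0.5, alpha = 0.1, power = 0.8),
  larger_difference = total_n(0.15, 0.05, fraction = 0.1, alpha = 0.1, power = 0.8),
  larger_alpha      = total_n(0.15, 0.08, fraction = 0.1, alpha = 0.2, power = 0.8)
)
round(scenarios, 1)
```

Each alternative scenario should require fewer total samples than the baseline.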

Example Calculation

How many samples are needed to detect a change of 7 percentage points in the probability of an arrest being resisted, if the baseline probability is around the overall rate in the dataset?

mean(df$resArrest)

## [1] 0.02464282

For a fixed difference, the necessary sample size is greatest when the proportions are near 0.5. Given the overall estimate above (about 0.025), we conservatively assume \(p_{1}\) is at most 0.15.

p1=.15

Use a power of 0.8, a type-I error probability of 0.1, and a sample fraction of \(n_{1}/(n_{1}+n_{2})=0.1\), i.e. \(n_{2}=9\,n_{1}\).

n_calc=bsamsize(p1,p2=p1-.07,fraction=.1, alpha=.1, power=.8)
n_calc

##        n1        n2 
##  130.1751 1171.5761
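
Sample sizes must be whole numbers, so the fractional results are normally rounded up. A minimal sketch, with the printed values re-entered by hand (the names `n_calc_printed` and `n_needed` are introduced here for illustration):

```r
# Values copied from the printed output above
n_calc_printed <- c(n1 = 130.1751, n2 = 1171.5761)

# Round up: rounding down would fall short of the target power
n_needed <- ceiling(n_calc_printed)
n_needed
```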

Therefore we restrict the dataset to officers with at least \(n_{1}\) arrests (131 after rounding up).

df = df %>% group_by(OfficerSerialNum) %>%
    summarise(resArrestTot=sum(resArrest),
              resArrestProp=mean(resArrest),
              n = n()) %>%
    filter(n>= n_calc[[1]])