Find how many samples are necessary to detect significantly different proportion values between two independent populations
2026-06-07
Study question: Does a chosen police officer have a rate of resisted arrests that is different from that of other police officers?
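Population probabilities
\(p_{1}\): the probability that an arrest made by the chosen officer is resisted.
\(p_{2}\): the probability that an arrest made by the other officers is resisted.
Statistical hypotheses
\[H_{0}: p_{1} = p_{2} \qquad H_{a}: p_{1} \neq p_{2}\]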
This project uses arrest data from the Record Management System (RMS) of the police department in a chosen city.
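library(dplyr)    # select, mutate, group_by, summarise, filter
library(stringr)  # str_trim
# Load the arrest records; each row is one charge (statute) within an arrest
df <- read.csv("Police_Arrests.csv")
# Keep only the columns needed for the analysis
df <- df %>%
  select(ArrestNum, StatuteCode, OfficerSerialNum)
head(df)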
##     ArrestNum StatuteCode OfficerSerialNum
## 1 2026-000104     28-693A             1273
## 2 2026-000104 28-701.02A2             1273
## 3 2026-000104     28-708A             1273
## 4 2026-000113    28-3473A             1722
## 5 2026-000106    13-1201A             1663
## 6 2026-000106     28-693A             1663
Data Pre-Processing:
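# Collapse to one row per arrest and flag arrests that include a
# resisting-arrest charge (statute codes 13-2508A1, A2, or A3)
df <- df %>%
  mutate(across(where(is.character), str_trim)) %>%  # strip stray whitespace
  group_by(ArrestNum) %>%
  summarise(StatuteCode = list(StatuteCode),
            OfficerSerialNum = first(OfficerSerialNum)) %>%
  mutate(resArrest = sapply(StatuteCode, function(x)
    any(c("13-2508A3", "13-2508A1", "13-2508A2") %in% x)))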
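The flagging step can be illustrated on a few made-up rows. This is only a sketch in base R: the toy data frame and its values are hypothetical, and `tapply` stands in for the `group_by`/`summarise` pipeline used above.

```r
# Toy charge-level data (hypothetical values): one row per charge
toy <- data.frame(
  ArrestNum   = c("A1", "A1", "A2", "A3", "A3"),
  StatuteCode = c("28-693A", "13-2508A1", "28-708A", "13-2508A3", "13-1201A"),
  stringsAsFactors = FALSE
)
resist_codes <- c("13-2508A3", "13-2508A1", "13-2508A2")

# For each arrest, TRUE if any of its charges is a resisting-arrest code
res_by_arrest <- tapply(toy$StatuteCode, toy$ArrestNum,
                        function(x) any(resist_codes %in% x))
res_by_arrest

# Proportion of toy arrests that were resisted
mean(res_by_arrest)
```

Arrests A1 and A3 each contain one of the three codes, so they are flagged; A2 is not.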
For a proportion hypothesis test, the Z-statistic is used: \[z = \frac{\hat{p}_{1}-\hat{p}_{2}}{\sqrt{\hat{p}\left(1-\hat{p}\right)\left[\frac{1}{n_{1}} + \frac{1}{n_{2}}\right]}}\]
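Where:
\(\hat{p}_{1}\) and \(\hat{p}_{2}\) are the sample proportions of resisted arrests in the two groups,
\(n_{1}\) and \(n_{2}\) are the sample sizes,
\(\hat{p} = (x_{1}+x_{2})/(n_{1}+n_{2})\) is the pooled proportion, with \(x_{i}\) the number of resisted arrests in sample \(i\).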
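As a rough illustration, the statistic can be computed directly. The helper name `two_prop_z` and the counts below are hypothetical and are not taken from the arrest data. The cross-check relies on the fact that `prop.test` without continuity correction reports the square of this z value.

```r
# Pooled two-proportion z-test; x = number of resisted arrests, n = sample size
two_prop_z <- function(x1, n1, x2, n2) {
  p1_hat <- x1 / n1
  p2_hat <- x2 / n2
  p_hat  <- (x1 + x2) / (n1 + n2)          # pooled proportion
  se     <- sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
  z      <- (p1_hat - p2_hat) / se
  c(z = z, p.value = 2 * pnorm(-abs(z)))   # two-sided p-value
}

# Hypothetical counts: 20 of 131 arrests resisted vs. 94 of 1172
res <- two_prop_z(20, 131, 94, 1172)
res

# Cross-check against prop.test (X-squared should equal z^2)
chk <- prop.test(c(20, 94), c(131, 1172), correct = FALSE)
all.equal(unname(res["z"]^2), unname(chk$statistic))
```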
Code:
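# Sample sizes for a two-sided test comparing two proportions.
# fraction = n1 / (n1 + n2), the share of the combined sample in group 1.
bsamsize <- function(p1, p2, fraction, alpha, power)
{
  z.alpha <- qnorm(1 - alpha/2)               # two-sided critical value
  z.beta  <- qnorm(power)
  ratio   <- (1 - fraction)/fraction          # n2 / n1
  p       <- fraction*p1 + (1 - fraction)*p2  # weighted average proportion
  n1 <- (z.alpha * sqrt((ratio + 1) * p * (1 - p)) +
         z.beta * sqrt(ratio * p1 * (1 - p1) + p2 * (1 - p2))
        )^2/ratio/((p1 - p2)^2)
  n2 <- ratio*n1
  c(n1 = n1, n2 = n2)
}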
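The necessary sample size decreases as the difference \(|p_{1}-p_{2}|\) to be detected increases, as the type-I error probability \(\alpha\) increases, as the required power decreases, and as the proportions move away from .5.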
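A short sketch of how \(n_{1}\) responds to the inputs. The helper `n1_needed` is a hypothetical name that repeats the bsamsize formula and returns only \(n_{1}\). The grids of values are chosen for illustration.

```r
# Same formula as bsamsize, returning only n1 (helper name is hypothetical)
n1_needed <- function(p1, p2, fraction, alpha, power) {
  ratio <- (1 - fraction) / fraction
  p <- fraction * p1 + (1 - fraction) * p2
  (qnorm(1 - alpha / 2) * sqrt((ratio + 1) * p * (1 - p)) +
     qnorm(power) * sqrt(ratio * p1 * (1 - p1) + p2 * (1 - p2)))^2 /
    ratio / (p1 - p2)^2
}

# Larger difference to detect -> smaller n1
by_diff <- sapply(c(.05, .07, .10), function(d)
  n1_needed(.15, .15 - d, fraction = .1, alpha = .1, power = .8))

# Larger type-I error probability -> smaller n1
by_alpha <- sapply(c(.01, .05, .10), function(a)
  n1_needed(.15, .08, fraction = .1, alpha = a, power = .8))

# Lower required power -> smaller n1
by_power <- sapply(c(.9, .8, .7), function(pw)
  n1_needed(.15, .08, fraction = .1, alpha = .1, power = pw))

round(rbind(by_diff, by_alpha, by_power), 1)
```

Each row of the printed matrix should decrease from left to right.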
How many samples are needed to detect a difference of 7 percentage points in the probability of an arrest being resisted, if the baseline probability is around the overall proportion of resisted arrests in the data?
mean(df$resArrest)
## [1] 0.02464282
The necessary sample size is greatest when \(p_{1}\) is near .5, so a higher assumed \(p_{1}\) gives a more conservative (larger) sample size. Based on the baseline estimate above, we assume \(p_{1}\) could be as high as .15:
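The reason is the variance term \(p(1-p)\) in the sample size formula, which peaks at \(p = .5\). A quick check over a few illustrative values (the grid is an assumption, chosen to include the baseline and the assumed maximum):

```r
# Variance term p * (1 - p) for a few illustrative proportions
p <- c(.025, .08, .15, .30, .50)
tab <- data.frame(p = p, variance_term = p * (1 - p))
tab
```

The term grows steadily as \(p\) approaches .5, which is why assuming .15 rather than the baseline of about .025 leads to a larger required sample.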
p1=.15
Use a power of 0.8, a type-I error probability of 0.1, and a sample fraction of 0.1, meaning the chosen officer's arrests make up 10% of the combined sample (\(n_{2} = 9 n_{1}\)).
n_calc <- bsamsize(p1, p2 = p1 - .07, fraction = .1, alpha = .1, power = .8)
n_calc
##        n1        n2 
##  130.1751 1171.5761
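One way to sanity-check these numbers is a small simulation. The sketch below assumes the sample sizes rounded up (131 and 1172), the proportions .15 and .08, and the pooled z-test at \(\alpha = .1\). The helper name `sim_power`, the seed, and the number of replications are arbitrary choices.

```r
# Estimate power by simulating the pooled two-proportion z-test
sim_power <- function(n1, n2, p1, p2, alpha, reps = 20000) {
  x1 <- rbinom(reps, n1, p1)   # resisted arrests, group 1
  x2 <- rbinom(reps, n2, p2)   # resisted arrests, group 2
  p_hat <- (x1 + x2) / (n1 + n2)
  z <- (x1 / n1 - x2 / n2) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
  mean(abs(z) > qnorm(1 - alpha / 2))   # share of two-sided rejections
}

set.seed(2026)
pw <- sim_power(n1 = 131, n2 = 1172, p1 = .15, p2 = .08, alpha = .1)
pw
```

The estimate should land near the target power of 0.8, with some deviation expected because the formula rests on a normal approximation.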
Therefore, we restrict the dataset to officers with at least \(n_{1}\) arrests (131, rounding up).
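# Summarise arrests per officer and keep officers with at least n1 arrests
df <- df %>%
  group_by(OfficerSerialNum) %>%
  summarise(resArrestTot = sum(resArrest),     # number of resisted arrests
            resArrestProp = mean(resArrest),   # proportion of arrests resisted
            n = n()) %>%                       # total arrests
  filter(n >= n_calc[["n1"]])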
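With a table of this shape, each remaining officer could then be compared against all other officers combined. The sketch below is not the author's analysis: the counts and department totals are hypothetical, and `prop.test` without continuity correction is used as the two-sided two-proportion test.

```r
# Hypothetical per-officer summary in the same shape as the filtered table
officers <- data.frame(
  OfficerSerialNum = c(1273, 1663, 1722),
  resArrestTot     = c(7, 21, 9),
  n                = c(150, 140, 200)
)
# Hypothetical totals over all arrests in the department
total_res <- 240
total_n   <- 4800

# Two-sided test of each officer against all other officers combined
officers$p.value <- mapply(function(x, n) {
  prop.test(c(x, total_res - x), c(n, total_n - n), correct = FALSE)$p.value
}, officers$resArrestTot, officers$n)
officers
```

In this toy table only officer 1663, with 21 of 140 arrests resisted, falls below the 0.1 threshold used in the sample size calculation.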