Note on notation: it is common to write q for 1 - p, so npq means np(1 - p).

You can change p and n below to see how that affects the Normality of the sample proportions (the p-hats).

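p <- .2          # population proportion of "YES"
n <- 10          # sample size
npq <- p*(1-p)*n # quantity checked against the Normality criterion
# Build a population of 10,000 labels in which a proportion p are "YES"
yes <- rep("YES",round(p*10000))
no <- rep("NO",10000 - round(p*10000))
c <- c(yes,no)
counts <-table(c)
# Bar chart of the population counts (bars appear in alphabetical order: NO, YES)
barplot(counts, main = "Population of Qualitative Data",
                xlab = "Yes or No", 
                col  = c("red","darkgreen")
        ) 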

We can see the population has p = \(0.2\), and with n = \(10\) our npq = \(1.6\).

The claim is that \(\hat{P_n} \sim \mathcal{N}\left(p,\sqrt{p(1-p)/n}\right)\) if \(np(1-p)\ge10\).
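As a small worked example, the sketch below evaluates the two parameters in the claim for the values used on this page (p = 0.2, n = 10) and checks the criterion. The names `claimed_mean`, `claimed_sd`, and `meets_criterion` are made up here for illustration.

```r
# Values used on this page
p <- 0.2
n <- 10

# Parameters of the claimed Normal distribution of the sample proportion
claimed_mean <- p                      # centre of the distribution
claimed_sd   <- sqrt(p * (1 - p) / n)  # standard deviation, about 0.1265

# The criterion attached to the claim: np(1-p) >= 10
npq <- n * p * (1 - p)
meets_criterion <- npq >= 10

print(c(mean = claimed_mean, sd = claimed_sd, npq = npq))
print(meets_criterion)
```

With these values the criterion is not met, so the claim does not promise a Normal-looking shape.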

We think of \(\hat{P_n}\) as the collection of all possible sample proportions from samples of size n.

The claim states that the relative frequency distribution of all of these possible \(\hat{p_n}\) should look Normal.

We can test the claim by having the computer repeatedly take samples of size n, build a relative frequency histogram of the resulting sample proportions, and see whether it in fact looks Normal.

We are using a sample size of \(10\), so \(np(1-p) = 1.6\). Since this is well below 10, we expect the relative frequency histogram not to look Normal.

You will notice that the criterion \(np(1-p)\ge10\) is pretty conservative: for values of npq as small as 3.5, we seem to get something pretty “Normal” looking.
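To try this out, it helps to know which sample size gives a particular npq. The sketch below solves np(1-p) = target for n at p = 0.2, rounding up to a whole sample size. The names `target_npq`, `n_for_target`, and `actual_npq` are illustrative only.

```r
p <- 0.2

# npq values mentioned in the text: the "looks Normal" value and the formal cutoff
target_npq <- c(3.5, 10)

# Smallest whole n with n * p * (1 - p) at or above each target
n_for_target <- ceiling(target_npq / (p * (1 - p)))

# npq actually obtained with those whole-number sample sizes
actual_npq <- n_for_target * p * (1 - p)

print(data.frame(target_npq, n = n_for_target, actual_npq))
```

For p = 0.2 this suggests n = 22 for an npq near 3.5 and n = 63 to meet the criterion. Setting n to either value in the simulation code lets you compare the histograms.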

Note that we still haven’t answered why this is true; for now we are just setting the stage.

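# Draw 100,000 samples of size n (without replacement) and record each sample proportion
npq2.1 <-  replicate(100000, {
            s <- sample(c, size = n)
            sum(s=="YES")/n
        })
# Relative frequency (density) histogram of the simulated sample proportions
hist(npq2.1,main = paste("npq =",npq), probability = TRUE)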
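One way to judge "looks Normal" more directly is to draw the claimed Normal density over the histogram and compare the simulated mean and standard deviation with the claimed ones. The following is a self-contained sketch under a few assumptions: it rebuilds the population under the name `population`, uses a fixed seed, and uses 10,000 repetitions rather than 100,000 to keep it quick. None of these choices comes from the original code.

```r
set.seed(1)  # assumption: fixed seed so the run is reproducible

p <- 0.2
n <- 10

# Population of 10,000 labels with a proportion p of "YES"
population <- c(rep("YES", round(p * 10000)),
                rep("NO", 10000 - round(p * 10000)))

# Simulated sample proportions (fewer repetitions than above, for speed)
p_hats <- replicate(10000, {
  s <- sample(population, size = n)
  mean(s == "YES")
})

# Density histogram with the claimed Normal curve drawn on top
hist(p_hats, probability = TRUE, main = paste("npq =", n * p * (1 - p)))
curve(dnorm(x, mean = p, sd = sqrt(p * (1 - p) / n)),
      add = TRUE, col = "red", lwd = 2)

# Compare the simulated centre and spread with the claimed values
sim_mean <- mean(p_hats)
sim_sd   <- sd(p_hats)
print(c(sim_mean = sim_mean, claimed_mean = p,
        sim_sd = sim_sd, claimed_sd = sqrt(p * (1 - p) / n)))
```

The mean and standard deviation should land close to the claimed values even when npq is small. It is the shape of the histogram relative to the red curve that shows whether the Normal approximation is reasonable.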