MATH239-3A

Lecture 3A: Categorical Data Analysis and Randomization

In this lecture we will connect our previous understanding of simple categorical data analysis with the commonly taught normal approximation to simulation based methods.

Example 1:

In the national debate on same-sex marriage, it is commonly stated that half of all Americans favor same-sex marriage.
In 2014, Pew Research conducted a poll of millennials and found that 67% answered yes when asked: Do you support same-sex marriage?
The poll was a random sample of 75 millennials. 50 were in support.
Does this poll provide convincing evidence that the support from millennials for same-sex marriage is higher than that of the larger general population of Americans?

Exact Binomial Test

The since these data inherently come from a binomial distribution we can calculate the exact probability that 50 or more millennials would say they support same-sex marriage out of a sample of 75, assuming that the underlying probability is .5.

# USING THE PBINOM FUNCTION
# p = sum of probabilities in some interval
# defaults to lower tail probabilities 
pbinom(49, 75, .5, lower.tail = FALSE)

## [1] 0.002614122

# note that we don't want to include 50
1-pbinom(49, 75, .5)

## [1] 0.002614122

# COOKING FROM SCRATCH
# WRITING OUR OWN LOOP
tot<-0
for(i in 50:75){
  tot<-tot+dbinom(i, 75, .5)
}
tot

## [1] 0.002614122

# COOKING FROM A MIX (i.e. a built in function)
x_obs<-50
n<-75
p0<-0.5

# EXACT BINOMIAL 
binom.test(x_obs, n, p0, alternative="greater")

## 
##  Exact binomial test
## 
## data:  x_obs and n
## number of successes = 50, number of trials = 75, p-value =
## 0.002614
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.5665713 1.0000000
## sample estimates:
## probability of success 
##              0.6666667

Binomial Simulation

If we had to calculate the probability from the distribution function by hand that would be a lot of hard work. So instead, we can simulate from the null distribution. Recall that the null distribution is the distribution that assume the null is true. In this case that means we are using a binomial distribution with n=75 and p=.5. After obtaining random observations from this distribution we can calculate the p-value, which the probability of observing something as or more extreme.

# SIMULATIONED BINOMIAL
nsim=10000
nullDist<-rbinom(nsim, n, p0)

head(nullDist)

## [1] 34 36 38 38 41 39

# make a histograme
hist(nullDist)
# add a line to illustrate where the observed value is
abline(v=50, col="red")

head(nullDist>=x_obs)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

mean(nullDist>=x_obs)

## [1] 0.0034

Normal Approximations

For one sample proportion:

This is the most common method taught in introductory statistics courses (probably because we often using Z-tables). This should only be thought of as an approximation because, essentially, we are estimating a discrete distribution with a continuous distribution.

# NORMAL DISTR
# LARGE SAMPLE APPROX 
# ONE SAMPLE PROPORTION 

# HYPOTHESIS TEST
# check the conditions first
n*p0

## [1] 37.5

n*(1-p0)

## [1] 37.5

SE0<-sqrt(p0*(1-p0)/n)
SE0

## [1] 0.05773503

p_hat<-x_obs/n
p_hat

## [1] 0.6666667

test_stat<-(p_hat-p0)/SE0
test_stat

## [1] 2.886751

pnorm(test_stat, lower.tail=FALSE)

## [1] 0.001946209

# CONFIDENCE INTERVAL 
critVal<-qnorm(0.975)
SE<-sqrt(p_hat*(1-p_hat)/n)

p_hat+c(-1,1)*critVal*SE

## [1] 0.5599797 0.7733536

Example 2:

48 male supervisors given the same personnel file.
Asked: promote or not
Files were identical except gender.
Files were randomly assigned to the supervisors.
Question: Are females unfairly discriminated against?
Data:
- Males: 21 promoted (out of 24)
- Females: 14 promoted (out of 24)

For two sample proportions:

# TWO SAMPLE PROPORTIONS

# HYPOTHESIS TEST
x1<-21
n1<-24
x2<-14
n2<-24

p_hat1<-x1/n1
p_hat2<-x2/n2

pooled<-(x1+x2)/(n1+n2)

test_stat<-(p_hat1-p_hat2)/sqrt(pooled*(1-pooled)*(1/n1+1/n2))

# two-tailed 
pnorm(test_stat, lower.tail = FALSE)*2

## [1] 0.02299039

# CONFIDENCE INTERVAL
critVal<-qnorm(0.975)
SE<-sqrt(p_hat1*(1-p_hat1)/n1+p_hat2*(1-p_hat2)/n2)

(p_hat1-p_hat2)+ c(-1,1)*critVal*SE

## [1] 0.05415812 0.52917522

RANDOMIZATION / PERMUTATION

In this method we mix up the order of the observations and reassign them to groups.

# RANDOMIZATION / PERMUTATION
nsim=10000
gender<-c(rep("M", 24), rep("F", 24))
promote<-c(rep("Yes", 35), rep("No", 13))
d<-rep(NA, nsim)

for(i in 1:nsim){
  permuteGen<-sample(gender)
  tab<-table(permuteGen, promote)
  d[i]<-diff(tab[,2]/24)
}

permuteGen

##  [1] "F" "F" "M" "F" "M" "M" "F" "M" "F" "M" "F" "F" "M" "M" "M" "F" "M"
## [18] "M" "M" "F" "F" "M" "M" "F" "M" "M" "M" "M" "M" "F" "F" "F" "F" "M"
## [35] "F" "F" "F" "M" "F" "F" "M" "F" "F" "F" "M" "M" "M" "F"

tab

##           promote
## permuteGen No Yes
##          F  8  16
##          M  5  19

hist(d)
abline(v=0.292, col="blue", lwd=2, lty=2)

head(d)

## [1] -0.04166667 -0.04166667 -0.04166667 -0.20833333  0.12500000  0.12500000

head(d>=.292)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

# one sided upper alternative 
# discrimination against women
mean(d>=.292)

## [1] 0.0039

# two sided alternative 
# gender discrimination
mean(d>=.292)*2

## [1] 0.0078