Week 4 Discussion - Proportion Test

Discussion Question

Using the train.csv file from the Titanic dataset, build a \(100(1-\alpha)\%\) confidence interval for the survival of females and then a separate confidence interval for the entire population. Choose the value of \(\alpha\) yourself. Using this same confidence level, test the hypothesis that the female survival rate is higher than the survival rate of the entire population. Provide your R code in the discussion.

Import and Filter Data

First, I import the data and build a table with the pertinent information (the number of males and females who survived or died).

titanic = read.csv("D:\\eric\\Boston College\\1. ADEC7310.02 - Data Analysis\\Week 1\\train.csv")
survival = table(titanic$Survived, titanic$Sex)
rownames(survival) = c("Died", "Survived")
colnames(survival) = c("Female", "Male")
survival

##           
##            Female Male
##   Died         81  468
##   Survived    233  109

Survival Rate Confidence Interval

Female Survival Rate Interval

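I calculated the confidence interval for the female survival rate twice: once manually, using the normal-approximation (Wald) formula \(\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}\) with \(\hat{p}=0.742\), and once with the function prop.test(). The two intervals differ slightly because prop.test() reports a Wilson score interval with continuity correction rather than the Wald interval. The default null value \(p=0.5\) drives the test statistic and p-value here, not the interval.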

total_female = sum(survival[,"Female"])
female_survived = survival["Survived", "Female"]
prop_female = female_survived / total_female
# Manual Way
z = 1.96
se = sqrt((prop_female*(1-prop_female))/total_female)
me = z * se
cat("95% Confidence Interval = [", prop_female - me,", ", prop_female + me, 
    "]\n", sep="")

## 95% Confidence Interval = [0.6936453, 0.7904312]

# Using prop.test
prop.test(female_survived, total_female)

## 
##  1-sample proportions test with continuity correction
## 
## data:  female_survived out of total_female, null probability 0.5
## X-squared = 72.615, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.6892571 0.7887777
## sample estimates:
##         p 
## 0.7420382

rm(se, me)
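To check where the gap between the manual interval and the prop.test() interval comes from, the sketch below recomputes both from the counts in the table (233 survivors out of 314 females). The second null value of 0.7 is an arbitrary choice made only for this comparison, and the variable names are illustrative.

```r
# Counts taken from the survival table above
x <- 233   # females who survived
n <- 314   # total females

p_hat <- x / n
z <- qnorm(0.975)   # two-sided 95% critical value (about 1.96)

# Wald (normal-approximation) interval, as in the manual calculation
wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# prop.test() intervals under two different null values
# (0.7 is an arbitrary value chosen only for this comparison)
ci_null_50 <- as.numeric(prop.test(x, n, p = 0.5)$conf.int)
ci_null_70 <- as.numeric(prop.test(x, n, p = 0.7)$conf.int)

# Same score interval, but without the continuity correction
ci_no_cc <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

print(rbind(wald, ci_null_50, ci_null_70, ci_no_cc))
```

If the two prop.test() rows agree while the Wald row differs, the discrepancy is due to the interval method and the continuity correction rather than the null value.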

Total Survival Rate Interval

As with the female survival rate, I calculate the interval for the overall survival rate both manually and with prop.test().

total_survived = sum(survival["Survived",])
total = (total_survived + sum(survival["Died",]))
prop_survive = total_survived / total
# Manual Way
se = sqrt((prop_survive*(1-prop_survive))/total)
me = z * se
cat("95% Confidence Interval = [", prop_survive - me, ", ", prop_survive + me,
    "]\n", sep="")

## 95% Confidence Interval = [0.3519055, 0.4157713]

# Using prop.test
prop.test(total_survived, total)

## 
##  1-sample proportions test with continuity correction
## 
## data:  total_survived out of total, null probability 0.5
## X-squared = 47.627, df = 1, p-value = 5.154e-12
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.3519194 0.4167722
## sample estimates:
##         p 
## 0.3838384

rm(z, se, me)
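Because the manual calculation is repeated for both groups, it could be wrapped in a small helper that also derives the critical value from the chosen \(\alpha\) instead of hard-coding 1.96. The function name wald_ci() is hypothetical and not part of the original analysis; this is a minimal sketch using the counts from the table.

```r
# Hypothetical helper (not in the original post): Wald interval for a proportion
wald_ci <- function(x, n, conf.level = 0.95) {
  alpha <- 1 - conf.level
  z <- qnorm(1 - alpha / 2)                  # critical value for the chosen alpha
  p_hat <- x / n                             # sample proportion
  me <- z * sqrt(p_hat * (1 - p_hat) / n)    # margin of error
  c(lower = p_hat - me, upper = p_hat + me)
}

# Counts taken from the survival table above
female_ci  <- wald_ci(233, 314)   # female survivors out of all females
overall_ci <- wald_ci(342, 891)   # all survivors out of all passengers

print(female_ci)
print(overall_ci)
```

Since qnorm(0.975) is about 1.959964 rather than exactly 1.96, the results can differ from the hand-computed intervals in the later decimal places.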

Hypothesis Test

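Now we want to test whether the female survival rate is higher than the population survival rate. The null hypothesis is that the two rates are equal, and the alternative of interest is that the female rate is higher. I use prop.test() with a significance level of \(\alpha=0.05\); note that the call below runs the function's default two-sided test.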

# Since we already have the variables set, run the test
prop.test(c(female_survived, total_survived), c(total_female, total))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(female_survived, total_survived) out of c(total_female, total)
## X-squared = 117.98, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.2980682 0.4183315
## sample estimates:
##    prop 1    prop 2 
## 0.7420382 0.3838384

rm(female_survived, prop_female, prop_survive, survival, total, total_female,
   total_survived, titanic)

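The p-value is less than 2.2e-16, far below the significance level of \(\alpha=0.05\), so we reject the null hypothesis that the two survival rates are equal. prop.test() ran its default two-sided test here; because the sample female survival rate (0.742) exceeds the overall rate (0.384), the corresponding one-sided p-value is half the two-sided value and is also below \(\alpha\). There is therefore enough evidence to conclude that the female survival rate on the Titanic was higher than the overall survival rate.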
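Since the question asks specifically whether the female rate is higher, a one-sided test can also be run directly. The sketch below is one possible formulation, not the analysis above: it assumes the overall survival rate (342 out of 891) can be treated as a fixed reference value and tests the female proportion against it with alternative = "greater". This sidesteps the fact that the females are a subset of the full passenger list, so the two samples in the two-sample call are not independent.

```r
# Counts taken from the survival table above
x_female  <- 233          # females who survived
n_female  <- 314          # total females
p_overall <- 342 / 891    # overall survival rate, treated as fixed (an assumption)

# One-sample, one-sided test: H0: p_female = p_overall vs. Ha: p_female > p_overall
one_sided <- prop.test(x_female, n_female,
                       p = p_overall,
                       alternative = "greater")
print(one_sided)

# Decision at the chosen significance level
alpha <- 0.05
reject_null <- one_sided$p.value < alpha
print(reject_null)
```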