In this project, students will demonstrate their understanding of inference on categorical data. Inference on numerical data is also mixed in, and students are expected to identify the proper type of inference to apply based on the variable type. Unless a significance level is specifically mentioned, assume \(\alpha = 0.05\).
The project will use two datasets from the internet – atheism and nscc_student_data. Store the atheism dataset in your environment by running the following R chunk, then do some exploratory analysis using the str() function and by viewing the dataframe. None of this exploration will be graded; it is just something for you to do on your own.
# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")
# Load dataset into environment
load("atheism.RData")
Load the “nscc_student_data.csv” file in the R chunk below and refamiliarize yourself with this dataset as well.
# Read the NSCC dataset and store it
nscc_students <- read.csv("../nscc_student_data.csv")
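A quick structural look works as a refresher here too (output not shown):
# Refamiliarize: inspect the NSCC data's variables and types
str(nscc_students)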
In the 2010 playoffs, the National Football League (NFL) changed its overtime rules amid concerns that whichever team won an overtime coin toss (by luck) had a significant advantage to win the game. Nicholas Gorgievski et al. published research in 2010 stating that, of the 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team that wins the coin flip wins more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)
Write hypotheses and determine tails of the test
If the outcome of the coin toss has no effect on overtime wins, we expect the proportion of games won by the team that wins the coin toss to be 0.5. It’s a one-tailed test because the claim is that the coin-toss-winning team wins more often than its opponent. The hypotheses are as follows:
\(H_o: p_{win} = 0.5\)
\(H_a: p_{win} \gt 0.5\)
one-tailed test
Find p-value of sample data and use to make decision to reject H0 or fail to reject H0
# Calculate sample proportion of games won by coin flip team
#First assign variables
x <- 235
n <- 414
p <- 0.5
#Calculate the proportion of games won by coin flip winner and output the result
(phat <- x/n)
## [1] 0.5676329
The proportion of overtime games won by teams who win the coin toss is 0.5676, which is greater than the 0.5 expected if the coin toss confers no advantage. However, this result could come about by chance. We will calculate the p-value to find out how likely that is.
# Calculate SE with the null value (p), not phat
se <- sqrt(p*(1-p)/n)
# Calculate the test statistic and output it
(ts <- (phat-p)/se)
## [1] 2.75225
# p-value
#Because the test statistic is positive, I won't include the absolute value in the argument for pnorm below.
(pnorm(ts,lower.tail = FALSE))
## [1] 0.002959367
The p-value is 0.002959. Since this p-value is less than \(\alpha = 0.05\), we reject the null hypothesis in favor of the alternative hypothesis.
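As a cross-check (an addition here; this project uses prop.test the same way later on), the equivalent test can be run in one call. Without the continuity correction, prop.test inverts the same z statistic, so its p-value should match the hand calculation above.
# Equivalent one-sided test via prop.test (no continuity correction)
prop.test(x = 235, n = 414, p = 0.5, alternative = "greater", correct = FALSE)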
For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?
# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)
Write hypotheses and determine tails of the test
\(H_o: p_{Sp2005} - p_{Sp2012} = 0\)
\(H_a: p_{Sp2005} - p_{Sp2012} \ne 0\)
two-tailed test
Find p-value and use to make decision to reject H0 or fail to reject H0
# Use table() to get x1, x2, n1, and n2
table(spain2005$response)
##
## atheist non-atheist
## 115 1031
table(spain2012$response)
##
## atheist non-atheist
## 103 1042
# Store values
x1 <- 115
x2 <- 103
n1 <- 115+1031
n2 <- 103+1042
#Let's output the proportions
x1/n1
## [1] 0.100349
x2/n2
## [1] 0.08995633
\(\hat{p}_{Sp2005} = 0.1003\)
\(\hat{p}_{Sp2012} = 0.0900\)
# p-pool
ppool <- (x1 + x2)/(n1 + n2)
# SE
se_Sp <- sqrt(ppool*(1-ppool)/n1 + ppool*(1-ppool)/n2)
# Test statistic
ts_Sp <- (x1/n1 - x2/n2)/se_Sp
# p-value
2*pnorm(abs(ts_Sp), lower.tail = FALSE)
## [1] 0.3966418
Because the p-value (0.3966) is greater than \(\alpha = 0.05\), we fail to reject the null hypothesis; these data do not provide convincing evidence that Spain’s atheism index changed from 2005 to 2012.
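As with the USA analysis below, prop.test offers a quick cross-check; without the continuity correction its chi-squared statistic is the square of the pooled z statistic, so the p-value should match the 0.3966 found by hand.
# Cross-check: two-sample test on the Spain counts (x1, x2, n1, n2 from above)
prop.test(c(x1, x2), c(n1, n2), correct = FALSE)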
Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?
# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)
Write hypotheses and determine tails of the test
\(H_o: p_{USA2005} - p_{USA2012} = 0\)
\(H_a: p_{USA2005} - p_{USA2012} \ne 0\)
two-tailed test
Find p-value and use to make decision to reject H0 or fail to reject H0
# Use table() to get x1, x2, n1, and n2
table(USA2005$response)
##
## atheist non-atheist
## 10 992
table(USA2012$response)
##
## atheist non-atheist
## 50 952
# Store values
x1 <- 10
x2 <- 50
n1 <- 10+992
n2 <- 50+952
#Let's output the proportions
x1/n1
## [1] 0.00998004
x2/n2
## [1] 0.0499002
\(\hat{p}_{USA2005} = 0.0100\)
\(\hat{p}_{USA2012} = 0.0499\)
#Calculate the p-value
# p-pool
ppool <- (x1 + x2)/(n1 + n2)
# SE
se_USA <- sqrt(ppool*(1-ppool)/n1 + ppool*(1-ppool)/n2)
# Test statistic
(ts_USA <- (x1/n1 - x2/n2)/se_USA)
## [1] -5.243063
# p-value
2*pnorm(abs(ts_USA), lower.tail = FALSE)
## [1] 1.579324e-07
Because the p-value is less than 0.05, we reject the null hypothesis in favor of the alternative hypothesis.
State conclusion
The sample data provide convincing evidence that the proportion of atheists in the United States changed between 2005 and 2012.
Let’s try this again with prop.test:
# Create vectors for input
USA_atheists <- c(table(USA2005$response)[[1]], table(USA2012$response)[[1]])
USA_respondents <- c(sum(table(USA2005$response)), sum(table(USA2012$response)))
USA_atheists
## [1] 10 50
USA_respondents
## [1] 1002 1002
#Use prop.test
prop.test(USA_atheists,USA_respondents, correct = FALSE)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: USA_atheists out of USA_respondents
## X-squared = 27.49, df = 1, p-value = 1.579e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.05474042 -0.02509990
## sample estimates:
## prop 1 prop 2
## 0.00998004 0.04990020
This is the same p-value and same proportions. The two methods agree. Woo hoo!
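That agreement is no coincidence: for a two-sample comparison without continuity correction, the chi-squared statistic reported by prop.test is the square of the pooled z statistic.
# X-squared from prop.test is the square of our test statistic
ts_USA^2 # approximately 27.49, matching X-squared above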
Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 2%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?
We use the formula for the margin of error (E) to solve for the sample size (n).
\[n = \frac{z_{\alpha/2}^{2}\,p\,q}{E^{2}}\]
For a 95% confidence level, \(z_{\alpha/2}\) = 1.96.
When we have no estimate for p, the safest value to use is p = 0.5; this maximizes the product \(pq\) and therefore gives the most conservative (largest) sample size.
For this problem, E is set at 0.02.
# With p = q = 0.5, the formula simplifies to (z*0.5/E)^2
(1.96*0.5/0.02)^2
## [1] 2401
The minimum number of people to sample is 2401.
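As a sketch, the calculation generalizes to a small helper that computes \(z_{\alpha/2}\) with qnorm instead of hardcoding 1.96 (the function name sample_size_prop is my own):
# Minimum sample size for estimating a proportion to within margin of error E
sample_size_prop <- function(conf = 0.95, E = 0.02, p = 0.5) {
  z <- qnorm(1 - (1 - conf)/2)     # z_{alpha/2} for the given confidence level
  ceiling(z^2 * p * (1 - p) / E^2) # round up: can't sample part of a person
}
sample_size_prop() # 2401, matching the hand calculation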
Use the NSCC Student Dataset for Questions 5-8.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.
#Let's look at a table of the voter registration data.
table(nscc_students$VoterReg)
##
## No Yes
## 9 31
Strictly speaking, this sampling distribution doesn’t meet the “success-failure condition” because the number of failures (the no’s) is less than 10. I’ll charge ahead with the caveat that the analysis may not hold, since one of the criteria for the sampling distribution to be nearly normal is not met.
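A quick check of the condition using the counts above (variable names here are my own):
# Success-failure check: need at least 10 successes and 10 failures
n_voter <- 40
phat_voter_check <- 31/40
n_voter * phat_voter_check       # successes: 31 (meets the cutoff)
n_voter * (1 - phat_voter_check) # failures: 9 (just under 10)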
#Calculate the sample proportion of students that are registered to vote.
phat_voter <- 31/40
#To find the confidence interval, first calculate the standard error
se_voter <- sqrt(phat_voter*(1-phat_voter)/40)
#Find the lower bound
round(phat_voter - 1.96*se_voter, 3)
## [1] 0.646
#Find the upper bound
round(phat_voter + 1.96*se_voter, 3)
## [1] 0.904
Based on this sample and the assumption that the sampling distribution is nearly normal, we are 95% confident that the percentage of all NSCC students that are registered to vote is between 64.6% and 90.4%.
Let’s see what happens when we use prop.test for the same analysis. I’ll use it without the correction and with the correction.
prop.test(x = 31, n = 40, correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 31 out of 40, null probability 0.5
## X-squared = 12.1, df = 1, p-value = 0.0005042
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.6249690 0.8768391
## sample estimates:
## p
## 0.775
prop.test(x = 31, n = 40, correct = TRUE)
##
## 1-sample proportions test with continuity correction
##
## data: 31 out of 40, null probability 0.5
## X-squared = 11.025, df = 1, p-value = 0.0008989
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.6114495 0.8859920
## sample estimates:
## p
## 0.775
Overall, the three methods result in similar values for the 95% confidence interval:
Calculation by hand: (64.6%, 90.4%)
prop.test w/o correction: (62.5%, 87.7%)
prop.test w/ correction: (61.1%, 88.6%)
I’m not sure what the best practice is in this situation. The criteria for the sample proportion being nearly normal aren’t met, but it seems like a waste not to analyze the data. I’m going with the hand-calculated answer, a 95% confidence interval of (64.6%, 90.4%), because I understand that method better than the prop.test methods. As for why the prop.test intervals are not symmetric about the estimate \(\hat{p} = 0.775\): prop.test builds its interval by inverting the score test (a Wilson interval), which is centered at an adjusted estimate rather than at \(\hat{p}\).
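One option (an addition on my part, not required by the assignment) is an exact interval: binom.test computes a Clopper-Pearson confidence interval directly from the binomial distribution, so the normal approximation is never needed.
# Exact (Clopper-Pearson) 95% confidence interval; output not shown
binom.test(x = 31, n = 40)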
Construct a 95% confidence interval of the average height of all NSCC students.
#The height data is numerical, rather than categorical, so we will calculate a sample mean and margin of error.
#Let's look at a table of the data.
table(nscc_students$Height)
##
## 6 60 60.4 61 62 63 65 66 67 68 69.5 70
## 1 3 1 4 4 3 2 3 3 5 1 2
## 70.8 71.75 72 73 75 76
## 1 1 2 1 1 1
I’m suspicious of the value at 6". That seems impossibly short for a living human, especially one enrolled in college. Let’s do the calculation including the suspect point and then remove it to see how much difference it makes.
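A numeric summary flags the problem immediately, since the minimum sits far below every other value (output not shown):
# The minimum of the Height column exposes the implausible value
summary(nscc_students$Height)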
#First subset the height data, removing rows with missing Height values.
nscc_height <- subset(nscc_students, !is.na(Height))
#Calculate the sample mean.
mn_height <- mean(nscc_height$Height)
#To find the standard error, first calculate the sample standard deviation. Then, divide by the square root of the number of observations.
se_height <- sd(nscc_height$Height)/sqrt(nrow(nscc_height))
#Find the lower bound
mn_height - abs(qt(p = 0.025, df = nrow(nscc_height)-1))*se_height
## [1] 61.08699
#Find the upper bound
mn_height + abs(qt(p = 0.025, df = nrow(nscc_height)-1))*se_height
## [1] 67.96173
#What happens if I use t.test instead?
t.test(nscc_height$Height)
##
## One Sample t-test
##
## data: nscc_height$Height
## t = 38.001, df = 38, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 61.08699 67.96173
## sample estimates:
## mean of x
## 64.52436
Based on this sample of 39 students (including the suspect data point), we are 95% confident that the mean height of all NSCC students is between 61.1" and 68.0".
Let’s go through the same analysis removing the data point at 6".
#Subset the height data, removing any data points that are less than 10.
nscc_height_2 <- subset(nscc_height, Height >= 10)
#Use t.test
t.test(nscc_height_2$Height)
##
## One Sample t-test
##
## data: nscc_height_2$Height
## t = 90.002, df = 37, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 64.57719 67.55176
## sample estimates:
## mean of x
## 66.06447
Based on this sample of 38 students, we are 95% confident that the mean height of all NSCC students is between 64.6" and 67.6".
Removing the bad data point moved the mean height from 64.5" to 66.1", and the lower bound went from 61.1" to 64.6"; the upper bound changed much less. I think the data point at 6" is not a valid measurement and should be removed, giving a 95% confidence interval of (64.6", 67.6").
Starbucks is considering opening a coffee shop on the NSCC Danvers campus if it believes that more NSCC students drink coffee than the national proportion. A 2015 Gallup poll found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine whether a higher proportion of NSCC students drink coffee than Americans overall.
a.) Write hypotheses and determine tails of the test
\(H_o: p_{NSCC} = 0.64\)
\(H_a: p_{NSCC} \gt 0.64\)
one-tailed test
b.) Calculate sample statistics
#First look at a table of the coffee data
table(nscc_students$Coffee)
##
## No Yes
## 10 30
No NA values here.
#Use prop.test to find the sample stats.
prop.test(x = 30, n = 40, p = 0.64, alternative = "greater", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 30 out of 40, null probability 0.64
## X-squared = 2.1007, df = 1, p-value = 0.07362
## alternative hypothesis: true p is greater than 0.64
## 95 percent confidence interval:
## 0.6240271 1.0000000
## sample estimates:
## p
## 0.75
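As a cross-check in the style of question 1 (this comparison is an addition; the variable names are my own), the same test by hand gives a matching p-value:
# By-hand z test; the SE uses the null value 0.64, not phat
phat_c <- 30/40
se_c <- sqrt(0.64 * (1 - 0.64)/40)
(z_c <- (phat_c - 0.64)/se_c) # about 1.449; note z_c^2 = 2.1007 = X-squared
pnorm(z_c, lower.tail = FALSE) # about 0.0736, matching prop.test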
c.) Determine probability of getting sample data by chance and use that to reject Ho or fail to reject Ho
The probability of getting a sample proportion of 75% or more when the true population proportion is 64% is 0.0736. Since this p-value is larger than \(\alpha\) = 0.05, we fail to reject the null hypothesis.
d.) Conclusion
Based on these sample data, we cannot conclude that the proportion of NSCC students who drink coffee is larger than the 2015 national proportion of 64%.