Purpose

In this project, students will demonstrate their understanding of inference on categorical data. Inference on numerical data is mixed in, and students are expected to identify the proper type of inference to apply for each variable type. Unless stated otherwise, students should assume a significance level of 0.05.


Preparation

The project will use two datasets from the internet: atheism and nscc_student_data. Store the atheism dataset in your environment by using the following R chunk. Do some exploratory analysis using the str() function and by viewing the dataframe. None of this will be graded; it is just something for you to do on your own.

# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")

# Load dataset into environment
load("atheism.RData")

str(atheism)
## 'data.frame':    88032 obs. of  3 variables:
##  $ nationality: Factor w/ 57 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ response   : Factor w/ 2 levels "atheist","non-atheist": 2 2 2 2 2 2 2 2 2 2 ...
##  $ year       : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...

Load the “nscc_student_data.csv” file in the R chunk below and refamiliarize yourself with this dataset as well.

#Load the nscc_student_data dataset.

nscc_student_data <- read.csv("nscc_student_data.csv")

Question 1 - Single Proportion Hypothesis Test

In the 2010 playoffs, the National Football League (NFL) changed its overtime rules amid concerns that whichever team won an overtime coin toss (by luck) had a significant advantage to win the game. Nicholas Gorgievski et al. published research in 2010 stating that, of 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team which wins the coin flip wins more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)

  1. Write hypotheses and determine tails of the test

\(H_0 : p_{cf} = 0.5\)

\(H_A : p_{cf} > 0.5\)

Our Null hypothesis is that the proportion of games won by teams who win the coin toss is the same as the proportion of the games won by the teams who lost the coin toss. The alternative hypothesis is that the teams who win the coin toss win more games than the teams who lose the coin toss. The test is one-tailed (upper tail).

  2. Find p-value of sample data and use to make decision to reject H0 or fail to reject H0
# Calculate sample proportion of games won by teams who won coin toss.
235/414
## [1] 0.5676329

The sample proportion of overtime games won by the team that won the coin toss is 0.5676 (or 56.76%).

# SE
(SE_nfl <- sqrt(0.5676 * (1 - 0.5676)/414))
## [1] 0.02434803
# Test statistic
(ts_nfl <- (0.5676 - 0.5)/SE_nfl)
## [1] 2.776405
# p-value
pnorm(ts_nfl, lower.tail = FALSE)
## [1] 0.002748184

The p-value of 0.0027 is less than the significance level of 0.05; therefore, we reject the Null hypothesis.
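The calculation above estimates the standard error from the sample proportion. A common convention for a one-proportion hypothesis test is to compute the standard error under the null value instead, since the test assumes H0 is true. The following is a minimal sketch of that variant, using only the counts given in the question; the variable names (`p0`, `se_null`, `z_null`) are illustrative and not part of the original project.

```r
# Sketch: one-proportion z-test with the standard error computed under H0.
x  <- 235   # overtime games won by the coin-toss winner (from the question)
n  <- 414   # total overtime games considered (from the question)
p0 <- 0.5   # proportion assumed under the null hypothesis

p_hat   <- x / n                          # sample proportion
se_null <- sqrt(p0 * (1 - p0) / n)        # SE uses p0, not p_hat
z_null  <- (p_hat - p0) / se_null         # test statistic
p_value <- pnorm(z_null, lower.tail = FALSE)  # upper-tail p-value

round(c(z = z_null, p_value = p_value), 4)
```

Under these assumptions the test statistic is close to 2.75, slightly smaller than the 2.78 found above, and the p-value remains well below 0.05, so the decision to reject the Null hypothesis does not change.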

  3. State conclusion

The evidence shows that NFL teams that win the overtime coin toss win the game more often than their opponents.

Question 2 - Two Independent Proportions Hypothesis Test

For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?

# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)
  1. Write hypotheses and determine tails of the test

\(H_0: p_{SP2005} = p_{SP2012}\)

\(H_A: p_{SP2005} \neq p_{SP2012}\)

According to the Null hypothesis, the atheism index is the same in 2012 as it was in 2005. The alternative hypothesis is that the atheism index had changed in 2012 compared to 2005. The test is two-tailed.

  2. Find p-value and use to make decision to reject H0 or fail to reject H0
#Store number of atheists and number of observations from subsets spain2005 and spain2012.
table(spain2005$response)
## 
##     atheist non-atheist 
##         115        1031
table(spain2012$response)
## 
##     atheist non-atheist 
##         103        1042
x1 <- 115
x2 <- 103
n1 <- 1146
n2 <- 1145

#Find the pooled proportion.

p_pool <- (x1+x2)/(n1+n2)

#Find the standard error.

SE <- sqrt(p_pool*(1-p_pool)/n1 + p_pool*(1-p_pool)/n2)

#Find the point estimate.

point_est <- x1/n1 - x2/n2

#Find the test statistic.

ts <- point_est/SE

#Find the p-value
pnorm(abs(ts), lower.tail = FALSE)*2
## [1] 0.3966418

The p-value of 0.3966 is greater than the significance level of 0.05; therefore, we fail to reject the Null hypothesis.
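As a cross-check on the manual steps above, the same pooled two-proportion test can be reproduced with base R's `prop.test()`. With `correct = FALSE` (no continuity correction), its chi-squared statistic is the square of the pooled z statistic, so the two-sided p-values should agree. This is a sketch that re-enters the counts from the tables above; the names `x_spain`, `n_spain`, and `res_spain` are illustrative.

```r
# Sketch: compare the manual pooled z-test with prop.test() for Spain.
x_spain <- c(115, 103)     # atheists in 2005 and 2012 (from the tables above)
n_spain <- c(1146, 1145)   # respondents in 2005 and 2012

# Manual pooled two-proportion z-test
p_pool_spain <- sum(x_spain) / sum(n_spain)
se_spain     <- sqrt(p_pool_spain * (1 - p_pool_spain) * sum(1 / n_spain))
z_spain      <- (x_spain[1] / n_spain[1] - x_spain[2] / n_spain[2]) / se_spain
p_manual     <- 2 * pnorm(abs(z_spain), lower.tail = FALSE)

# Built-in equivalent without continuity correction
res_spain <- prop.test(x_spain, n_spain, correct = FALSE)

c(manual = p_manual, prop_test = res_spain$p.value)
```

Leaving the default `correct = TRUE` in place would apply Yates' continuity correction and give a slightly larger p-value than the manual calculation.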

  3. State conclusion

There is not enough evidence to conclude that the atheism index in Spain changed between 2005 and 2012.

Question 3 - Two Independent Proportions Hypothesis Test

Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?

# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)
  1. Write hypotheses and determine tails of the test

\(H_0: p_{US2005} = p_{US2012}\)

\(H_A: p_{US2005} \neq p_{US2012}\)

According to the Null hypothesis, the atheism index didn’t change in the US in 2012 compared to 2005. The alternative hypothesis states that the atheism index in the US changed in 2012 compared to 2005. The test is two-tailed.

  2. Find p-value and use to make decision to reject H0 or fail to reject H0
#Store number of atheists and number of observations from subsets USA2005 and USA2012.
table(USA2005$response)
## 
##     atheist non-atheist 
##          10         992
table(USA2012$response)
## 
##     atheist non-atheist 
##          50         952
x1US <- 10
x2US <- 50
n1US <- 1002
n2US <- 1002

#Find the pooled proportion.

p_pool_US <- (x1US+x2US)/(n1US+n2US)

#Find the standard error.

SE_US <- sqrt(p_pool_US*(1-p_pool_US)/n1US + p_pool_US*(1-p_pool_US)/n2US)

#Find the point estimate.

point_est_US <- x1US/n1US - x2US/n2US

#Find the test statistic.

ts_US <- point_est_US/SE_US

#Find the p-value
pnorm(abs(ts_US), lower.tail = FALSE)*2
## [1] 1.579324e-07

Since the p-value is smaller than the significance level of 0.05, we reject the Null hypothesis.
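The normal approximation behind this test relies on the success-failure condition, which is worth checking here because only 10 atheists were observed in 2005. For a pooled test, the expected counts are usually computed with the pooled proportion and compared to a rule-of-thumb threshold of 10. The sketch below does this; `check_success_failure()` is a hypothetical helper written for illustration, not a function from the project or from any package.

```r
# Sketch: success-failure check for a pooled two-proportion test.
# check_success_failure() is a hypothetical helper for illustration.
check_success_failure <- function(x1, n1, x2, n2, threshold = 10) {
  p_pool <- (x1 + x2) / (n1 + n2)
  expected <- c(
    successes_1 = n1 * p_pool,
    failures_1  = n1 * (1 - p_pool),
    successes_2 = n2 * p_pool,
    failures_2  = n2 * (1 - p_pool)
  )
  list(expected = expected, ok = all(expected >= threshold))
}

# Counts for the United States in 2005 and 2012 (from the tables above)
us_check <- check_success_failure(10, 1002, 50, 1002)
us_check$expected
us_check$ok
```

With the pooled proportion of 60/2004, each sample is expected to contain about 30 atheists, so the condition holds under this rule of thumb even though the observed 2005 count is exactly 10.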

  3. State conclusion

The evidence shows that the atheism index in the United States in 2012 was different from the atheism index in 2005.

Question 4 - Minimum Sample Size

Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 3%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?

#First we would need to find the z score.
Z <- qnorm(0.025, lower.tail = FALSE)

#The proportion p and its complement q are unknown, so we use 0.5 for both, which gives the largest possible value of p*q. Now we can calculate the minimum sample size required for a confidence level of 95% and a margin of error no greater than 3%.

Z^2*0.5*0.5/0.03^2
## [1] 1067.072

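The calculation gives 1067.072, and because rounding down would leave the margin of error slightly above 3%, we round up. The minimum sample size required to estimate the proportion of residents in my state who attend a religious service on a weekly basis is 1068, assuming a confidence level of 95% and a margin of error no greater than 3%.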
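The sample size calculation can be wrapped in a small function that always rounds up to the next whole person, since a fractional result means the smaller integer would not quite meet the margin of error. This is a sketch; `min_sample_size()` and its argument names are hypothetical and chosen only for illustration.

```r
# Sketch: minimum sample size for estimating a proportion.
# min_sample_size() is a hypothetical helper for illustration.
min_sample_size <- function(margin, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)      # two-sided critical value
  ceiling(z^2 * p * (1 - p) / margin^2)     # always round up
}

n_required <- min_sample_size(margin = 0.03)
n_required
```

Using `p = 0.5` is the conservative choice when nothing is known about the true proportion, because it maximizes `p * (1 - p)` and therefore the required sample size.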

Question 5

Use the NSCC Student Dataset for Questions 5-7.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.

#Find how many students in the sample nscc_student_data are registered voters.
table(nscc_student_data$VoterReg)
## 
##  No Yes 
##   9  31

Out of 40 students in the dataset nscc_student_data, 31 are registered voters.

#First we need to find the proportion of the students who are registered voters (or the p hat).
phat <- 31/40

#Now we need to find the standard error.
SE_nscc_vote <- sqrt(phat*(1-phat)/40)

#Now we can construct the 95% confidence interval of the true proportion of all NSCC students that are registered voters.

phat - 1.96*SE_nscc_vote
## [1] 0.6455899
phat + 1.96*SE_nscc_vote
## [1] 0.9044101

We can be 95% confident that the true proportion of all the NSCC students who are registered voters lies between 0.6456 and 0.9044 (or between 64.56% and 90.44%).
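The interval above can also be computed with a small reusable function that takes the critical value from `qnorm()` rather than the rounded 1.96. The sketch below re-enters the counts from the table above; `prop_ci()` is a hypothetical helper, not part of the project.

```r
# Sketch: normal-approximation confidence interval for one proportion.
# prop_ci() is a hypothetical helper for illustration.
prop_ci <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n
  z     <- qnorm(1 - (1 - conf_level) / 2)   # about 1.96 for 95%
  se    <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

ci_voters <- prop_ci(x = 31, n = 40)   # 31 registered voters out of 40
round(ci_voters, 4)
```

One caveat: only 9 students in the sample are not registered, which falls just short of the usual rule of thumb of at least 10 successes and 10 failures, so the normal-approximation interval should be read as approximate.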

Question 6

Construct a 95% confidence interval of the average height of all NSCC students.

# We need to find the standard deviation and the sample size to calculate the standard error.
sd(nscc_student_data$Height, na.rm = TRUE)
## [1] 10.60386
table(!is.na(nscc_student_data$Height))
## 
## FALSE  TRUE 
##     1    39

The standard deviation is 10.6, and the sample size is 39.

#Now we can find the standard error.
10.6/sqrt(39)
## [1] 1.697358

The standard error is 1.697.

#The last step is to find the 95% confidence interval.

mean(nscc_student_data$Height, na.rm = TRUE) - 1.96*1.697
## [1] 61.19824
mean(nscc_student_data$Height, na.rm = TRUE) + 1.96*1.697
## [1] 67.85048

We can be 95% confident that the true mean height of all NSCC students lies between 61.20 and 67.85 inches.
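Because the population standard deviation is unknown and is estimated from 39 observations, intervals for a mean are usually built with a t critical value (38 degrees of freedom here) rather than 1.96. The sketch below works from the summary statistics reported above so that it runs without the data file; the sample mean is not printed in the output above, so it is recovered here as the midpoint of the reported interval, which is an assumption of this sketch.

```r
# Sketch: t-based 95% confidence interval from summary statistics.
n_height    <- 39                         # non-missing heights (from above)
sd_height   <- 10.60386                   # sample standard deviation (from above)
mean_height <- (61.19824 + 67.85048) / 2  # assumed: midpoint of reported interval

se_height <- sd_height / sqrt(n_height)
t_crit    <- qt(0.975, df = n_height - 1)  # t critical value, 38 df

ci_t <- mean_height + c(-1, 1) * t_crit * se_height
round(ci_t, 2)

# With the data loaded, t.test(nscc_student_data$Height)$conf.int
# should give the same kind of interval directly.
```

The t critical value is a little larger than 1.96 (roughly 2.02), so the resulting interval is slightly wider than the one computed above.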

Question 7

Starbucks is considering opening a coffee shop on NSCC Danvers campus if they believe that more NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine if more NSCC students drink coffee than other Americans.

a.) Write hypotheses and determine tails of the test

\(H_0: p_{students} = 0.64\)

\(H_A: p_{students} > 0.64\)

Our Null hypothesis is that the proportion of the NSCC students who drink coffee is the same as the national proportion of coffee drinkers, which is 0.64 (or 64%). The alternative hypothesis is that the proportion of the NSCC students who drink coffee is greater than the national proportion of 0.64 (or 64%). The test is one-tailed (upper tail).

b.) Calculate sample statistics

#Find the phat, or proportion of the NSCC students sample who drink coffee.
table(nscc_student_data$Coffee)
## 
##  No Yes 
##  10  30
30/40
## [1] 0.75

The sample proportion of students at the NSCC who drink coffee is 0.75 (or 75%).

c.) Determine probability of getting sample data by chance and use that to reject H0 or fail to reject H0

#Now we have to find the standard error and the test statistic.

(se <- sqrt(0.75*(1-0.75)/40))
## [1] 0.06846532
(0.75 - 0.64)/se
## [1] 1.606653

The standard error is 0.068, and the test statistic is 1.61.

#Find the p-value to reject or fail to reject the Null hypothesis.
pnorm(1.61, lower.tail = FALSE)
## [1] 0.05369893

The p-value of 0.0537 is greater than the significance level of 0.05; therefore, we fail to reject the Null hypothesis.
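Since this p-value is close to 0.05, it is worth checking how sensitive the decision is to the choice of standard error. The calculation above uses the sample proportion 0.75 in the standard error, whereas the conventional one-proportion test uses the null value 0.64. Base R's `prop.test()` with `correct = FALSE` uses the null value, so it offers a quick comparison. This is a sketch using the counts from the table above; the variable names are illustrative.

```r
# Sketch: one-sided one-proportion test with the SE computed under H0.
x_coffee  <- 30     # students who drink coffee (from the table above)
n_coffee  <- 40     # students in the sample
p0_coffee <- 0.64   # national proportion under the null hypothesis

res_coffee <- prop.test(x_coffee, n_coffee, p = p0_coffee,
                        alternative = "greater", correct = FALSE)

# The chi-squared statistic is z^2; the sample proportion exceeds p0,
# so the positive square root is the z statistic.
z_coffee <- sqrt(unname(res_coffee$statistic))

round(c(z = z_coffee, p_value = res_coffee$p.value), 4)
```

With the null-based standard error the test statistic is about 1.45 and the p-value is roughly 0.07, which is still above 0.05, so the decision not to reject the Null hypothesis is the same under either approach.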

d.) Conclusion

We don’t have enough evidence to conclude that the true proportion of coffee drinkers at NSCC is greater than the national proportion of 0.64 (or 64%).