Purpose

In this project, students will demonstrate their understanding of inference on categorical data. Some questions involve inference on numerical data instead, and students are expected to identify the proper type of inference for each variable type. Unless stated otherwise, students should assume a significance level of 0.05.


Preparation

The project will use two datasets from the internet: atheism and nscc_student_data. Store the atheism dataset in your environment by using the following R chunk and do some exploratory analysis of the dataframe. None of this will be graded; it is just something for you to do on your own.

# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")

# Load dataset into environment
load("atheism.RData")

Load the “nscc_student_data.csv” file in the following R chunk and refamiliarize yourself with this dataset as well.

#Load nscc_student_data.csv dataset

nscc_student_data <- read.csv("~/Desktop/stats/nscc_student_data.csv")
View(nscc_student_data)

Question 1 - Single Proportion Hypothesis Test

In the 2010 playoffs, the National Football League (NFL) changed its overtime rules amid concerns that whichever team won an overtime coin toss (by luck) had a significant advantage to win the game. Nicholas Gorgievski et al. published research in 2010 that stated that out of 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team which wins the coin flip wins games more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)

  1. Write hypotheses and determine tails of the test

\(H_0: p_1 = 0.5\)
\(H_A: p_1 > 0.5\)

  2. Find p-value of sample data occurring by chance
# Calculate sample proportion of games won by coin flip team
phat <- 235/414

Teams that won the overtime coin toss went on to win 56.8% of these games.

# Standard error under the null hypothesis (uses p, not phat)
p <- 0.5
se <- sqrt(p*(1-p)/414)

# Test statistic
z <- (phat-p)/se

# p-value
pnorm(z, lower.tail = FALSE)
# z is about 2.75, so the p-value is about 0.003
  3. Decision
    The p-value is less than 0.05; therefore, we reject \(H_0\).

  4. State conclusion
    The data provide convincing evidence that the team that wins the overtime coin toss wins the game more often than its opponent.
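
As a cross-check on the manual steps above, base R's `prop.test()` can run a one-proportion test on the same counts. This is only a sketch: the variable names `wins`, `games`, and `check` are made up for illustration, and `correct = FALSE` is assumed so that the continuity correction is switched off and the result corresponds to a plain z-test whose standard error uses the null value 0.5.

```r
# Cross-check of the one-proportion z-test with base R's prop.test().
# correct = FALSE turns off the continuity correction, so the test
# corresponds to a z-test with the null value in the standard error.
wins <- 235    # overtime games won by the coin-toss winner
games <- 414   # total overtime games

check <- prop.test(wins, games, p = 0.5,
                   alternative = "greater", correct = FALSE)

check$estimate                  # sample proportion
sqrt(unname(check$statistic))   # z, since X-squared equals z^2 here
check$p.value                   # one-sided (upper-tail) p-value
```

Because the sample proportion is above 0.5, the square root of the reported X-squared statistic is the positive z statistic for the upper-tail test.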

Question 2 - Two Independent Proportions Hypothesis Test

For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?

# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)
  1. Write hypotheses and determine tails of the test

\(H_0: p_1 = p_2\)
\(H_A: p_1 \neq p_2\)

  2. Find p-value of sample data occurring by chance
#Count respondents in Spain identifying as atheist and non-atheist in 2005 and 2012
table(spain2005$response)
## 
##     atheist non-atheist 
##         115        1031
table(spain2012$response)
## 
##     atheist non-atheist 
##         103        1042
#Store Values
x1 <- 115
x2 <- 103
n1 <- 115 + 1031
n2 <- 103 + 1042

#Find pooled proportion.
ppool <- (x1+x2)/(n1+n2)

#SE

se <- sqrt((ppool*(1-ppool)/n1)+(ppool*(1-ppool)/n2))

#Test statistic
p1 <- x1/n1
p2 <- x2/n2
z <- (p1-p2)/se
z
## [1] 0.8476341
#P-value (two-tailed)
2*pnorm(abs(z), lower.tail = FALSE)
## [1] 0.3966418
  3. Decision
    The p-value is above 0.05, so we fail to reject the null hypothesis.

  4. State conclusion
    There is not sufficient evidence to conclude that Spain has seen a change in its atheism index from 2005 to 2012.
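
The pooled two-proportion steps above can be wrapped in a small function so the same calculation is reusable for other countries. This is a minimal sketch rather than part of the assignment: the function name `two_prop_z_test` and its arguments are hypothetical, and it assumes a two-tailed test with a pooled standard error, as in the hand calculation.

```r
# Hypothetical helper: pooled two-proportion z-test (two-tailed).
two_prop_z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  # Pooled proportion under H0: p1 = p2
  ppool <- (x1 + x2) / (n1 + n2)
  se <- sqrt(ppool * (1 - ppool) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se
  # Two-tailed p-value; abs() makes the sign of z irrelevant
  p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
  list(p1 = p1, p2 = p2, z = z, p_value = p_value)
}

# Spain: 115 of 1146 atheist in 2005, 103 of 1145 in 2012
spain <- two_prop_z_test(115, 1146, 103, 1145)
spain$z
spain$p_value
```

Taking `abs(z)` before calling `pnorm()` means the p-value does not depend on which year is labelled group 1.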

Question 3 - Two Independent Proportions Hypothesis Test

Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?

# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)
  1. Write hypotheses and determine tails of the test

\(H_0: p_1 = p_2\)
\(H_A: p_1 \neq p_2\)

  2. Find p-value of sample data occurring by chance
#Count Americans identifying as atheist and non-atheist in 2005 and 2012
table(USA2005$response)
## 
##     atheist non-atheist 
##          10         992
table(USA2012$response)
## 
##     atheist non-atheist 
##          50         952
#Store Values
x1 <- 10
x2 <- 50
n1 <- 10 + 992
n2 <- 50 + 952

#Find pooled proportion.
ppool <- (x1+x2)/(n1+n2)

#SE
se <- sqrt((ppool*(1-ppool)/n1)+(ppool*(1-ppool)/n2))

#Test statistic
p1 <- x1/n1
p2 <- x2/n2
z <- (p1-p2)/se

#P-value (two-tailed; -abs(z) gives the correct tail whatever the sign of z)
2*pnorm(-abs(z))
## [1] 1.579324e-07
  3. Decision
    The p-value is less than 0.05; therefore, we reject the null hypothesis.

  4. State conclusion
    The data provide convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012.
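
The same comparison can also be checked with base R's `prop.test()`, which accepts a vector of success counts and a vector of sample sizes. The sketch below assumes `correct = FALSE` so that no continuity correction is applied; the names `atheists`, `totals`, and `usa_check` are illustrative only.

```r
# Counts from the tables above: atheists and sample sizes, 2005 then 2012
atheists <- c(10, 50)
totals <- c(10 + 992, 50 + 952)

# Two-sided test of equal proportions without continuity correction
usa_check <- prop.test(atheists, totals, correct = FALSE)

usa_check$estimate  # the two sample proportions
usa_check$p.value   # two-sided p-value
```

Without the continuity correction, the chi-squared statistic reported by `prop.test()` is the square of the pooled z statistic, so the two approaches should agree.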

Question 4 - Minimum Sample Size

Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 3%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?

#The z-score for a 95% confidence interval is 1.96
#We don't know p or its complement q, so we use the most conservative values: 0.5 for p and 0.5 for q.

1.96^2*0.5*0.5/0.03^2
## [1] 1067.111

Rounding up, the sample would need at least 1068 people to estimate the proportion of residents in my state who attend a religious service weekly to within 3% at 95% confidence.
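
The calculation above can be written as a small function that always rounds up, since rounding down would leave the margin of error slightly above the target. This is a sketch under stated assumptions: `min_sample_size` and its arguments are hypothetical names, and it uses `qnorm()` for the critical value instead of the rounded 1.96.

```r
# Hypothetical helper: minimum n so that a proportion's margin of error
# is at most `moe` at the given confidence level.
min_sample_size <- function(moe, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # critical value, about 1.96 for 95%
  ceiling(z^2 * p * (1 - p) / moe^2)    # always round up
}

min_sample_size(0.03)  # 1068
```

Using `p = 0.5` as the default gives the largest possible value of `p * (1 - p)`, so the resulting sample size is enough whatever the true proportion turns out to be.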

Question 5

Use the NSCC Student Dataset for Questions 5-7.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.

#Find sample proportion of NSCC students registered to vote
table(nscc_student_data$VoterReg)
## 
##  No Yes 
##   9  31
#Sample proportion and sample size
phat <- 31/40
n <- 40

#Standard error
se <- sqrt(phat*(1-phat)/n)

# Lower Bound of CI
phat - 1.96*se
## [1] 0.6455899
# Upper Bound of CI
phat + 1.96*se
## [1] 0.9044101

We can be 95% confident that the true proportion of all NSCC students who are registered to vote is between 0.6456 and 0.9044, or between about 65% and 90%.
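
With only 9 "No" responses, the usual rule of thumb of at least 10 successes and 10 failures is borderline here, so it may be worth comparing the interval above with one that behaves better in small samples. The sketch below recomputes the same normal-approximation (Wald) interval and, as an assumption beyond the assignment, compares it with the Wilson score interval that `prop.test()` reports when `correct = FALSE`. The variable names are illustrative.

```r
yes <- 31   # registered voters in the sample
n <- 40     # sample size
phat <- yes / n
z <- qnorm(0.975)   # about 1.96

# Wald interval: the formula used in the hand calculation
wald <- phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)

# Wilson score interval, as returned by prop.test() without correction
wilson <- as.numeric(prop.test(yes, n, correct = FALSE)$conf.int)

wald
wilson
```

If the two intervals differ noticeably, that is a sign the normal approximation is being stretched by the small count of "No" responses.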

Question 6

Construct a 95% confidence interval of the average height of all NSCC students.

#Find mean, standard deviation, and sample size of the non-missing heights
xbar <- mean(nscc_student_data$Height, na.rm = TRUE)
s <- sd(nscc_student_data$Height, na.rm = TRUE)

table(is.na(nscc_student_data$Height))
## 
## FALSE  TRUE 
##    39     1
#One height is missing, so the sample size is 39, not 40
n <- 39

#Lower bound of 95% confidence interval
xbar - 1.96*s/sqrt(n)
# approximately 61.20
#Upper bound of 95% confidence interval
xbar + 1.96*s/sqrt(n)
# approximately 67.85

We can be 95% confident that the average height of all NSCC students is between about 61.2 inches and 67.9 inches.
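
Because the population standard deviation is estimated from a sample of fewer than 40 heights, a t-based interval is a common alternative to the 1.96 multiplier. The sketch below is an illustration, not the assignment's required method: `mean_ci` is a hypothetical helper, and the `heights` vector is made-up data, not the NSCC dataset.

```r
# Hypothetical helper: t-based confidence interval for a mean, ignoring NAs.
mean_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]                 # drop missing values first
  n <- length(x)                    # sample size after removing NAs
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  mean(x) + c(-1, 1) * t_star * sd(x) / sqrt(n)
}

# Made-up heights (inches) for illustration only, not the NSCC data
heights <- c(62, 65, NA, 70, 68, 64, 66, 71, 63, 67)

mean_ci(heights)
as.numeric(t.test(heights)$conf.int)  # built-in equivalent, should agree
```

On the real data, the corresponding call would be `mean_ci(nscc_student_data$Height)`, which counts only the non-missing heights when computing the sample size.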

Question 7

Starbucks is considering opening a coffee shop on NSCC Danvers campus if they believe that more NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine if more NSCC students drink coffee than other Americans.

a.) Write hypotheses and determine tails of the test

\(H_0: p_1 = 0.64\)
\(H_A: p_1 > 0.64\)

The test is one-tailed (upper tail).

b.) Calculate sample statistics

table(nscc_student_data$Coffee)
## 
##  No Yes 
##  10  30

c.) Determine probability of getting sample data by chance

#Sample proportion, null proportion, and sample size
phat <- 30/40
p <- 0.64
n <- 40

# Standard error under the null hypothesis (uses p, not phat)
se <- sqrt(p*(1-p)/n)

# Test statistic
z <- (phat-p)/se

# p-value
pnorm(z, lower.tail = FALSE)
# z is about 1.45, so the p-value is about 0.07

d.) Decision
The p-value is greater than 0.05; therefore, we fail to reject the null hypothesis.

e.) Conclusion

There is not sufficient evidence to conclude that a greater proportion of NSCC students drink coffee than Americans in general.
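
With a sample of only 40 students, an exact binomial test is a reasonable cross-check on the normal approximation. The sketch below uses base R's `binom.test()` with the counts from the table above; the variable names are illustrative, and the exact test is offered as an alternative, not as the method the assignment asks for.

```r
coffee_yes <- 30   # students who drink coffee
n <- 40            # students surveyed

# Exact one-sided test of H0: p = 0.64 against HA: p > 0.64
exact <- binom.test(coffee_yes, n, p = 0.64, alternative = "greater")
exact$p.value

# Same quantity computed directly: P(X >= 30) when X ~ Binomial(40, 0.64)
pbinom(coffee_yes - 1, n, 0.64, lower.tail = FALSE)
```

The exact p-value is the probability of seeing 30 or more coffee drinkers out of 40 if the true proportion were 0.64, with no reliance on the normal approximation.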