Project #6 - Inference on Categorical Data

Purpose

In this project, students will demonstrate their understanding of inference on categorical data. Inference on numerical data is also mixed in, and students are expected to identify the proper type of inference to apply depending on the variable type. Unless otherwise stated, students should assume a significance level of 0.05.

Preparation

The project will use two datasets from the internet: atheism and nscc_student_data. Store the atheism dataset in your environment by using the following R chunk. Do some exploratory analysis using the str() function and by viewing the dataframe. None of this will be graded; it is just something for you to do on your own.

# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")

# Load dataset into environment
load("atheism.RData")

Load the “nscc_student_data.csv” file in the following R chunk and refamiliarize yourself with this dataset as well.

nscc_student_data <- read.csv("nscc_student_data.csv")

Question 1 - Single Proportion Hypothesis Test

In the 2010 playoffs, the National Football League (NFL) changed its overtime rules amid concerns that whichever team won the overtime coin toss (by luck) had a significant advantage in winning the game. Nicholas Gorgievski et al. published research in 2010 stating that, of the 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team that wins the coin flip wins more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)

Write hypotheses and determine tails of the test
\[H_0: p = \frac{1}{2}\] \[H_A: p > \frac{1}{2}\] where \(p\) is the probability that the team which wins the coin flip wins the game.
This is an upper-tailed test.
Find p-value of sample data and use to make decision to reject H0 or fail to reject H0

paste("Hypothesised Proportion:", p_hyp <- 0.5)

## [1] "Hypothesised Proportion: 0.5"

paste("Number of Games Recorded:", n <- 414)

## [1] "Number of Games Recorded: 414"

paste("Sample Proportion:", round(p_hat <- 235 / 414, 4))

## [1] "Sample Proportion: 0.5676"

paste("Standard Error:", round(
  se <- sqrt(p_hyp*(1-p_hyp)/n)
,4))

## [1] "Standard Error: 0.0246"

paste("Test Statistic:", round(
  t_stat <- (p_hat - p_hyp) / se
,4))

## [1] "Test Statistic: 2.7522"

paste("P-Value:", round(
  pnorm(t_stat, lower.tail = FALSE) # upper-tail area, matching H_A: p > 1/2
,4))

## [1] "P-Value: 0.003"

Our p-value of 0.003 is below the 0.05 significance level, so we reject the null hypothesis in favor of the alternative.

State conclusion

There is sufficient evidence to suggest that the team that wins the coin flip in an NFL overtime game has an advantage over its opponent.
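
As a cross-check, the same one-sided test can be run with base R's `prop.test()`. This is only a sketch using the counts quoted in the question. The continuity correction is switched off (`correct = FALSE`) so that the result lines up with a hand-computed z-test; that setting is a choice made here, not something the assignment specifies.

```r
# Counts from the question: 235 of 414 overtime games were won by the coin-toss winner
wins <- 235
games <- 414

# Manual one-sided z-test, using the null proportion in the standard error
z <- (wins / games - 0.5) / sqrt(0.5 * 0.5 / games)
p_manual <- pnorm(z, lower.tail = FALSE)

# Same test via prop.test(); correct = FALSE disables the continuity correction
res <- prop.test(wins, games, p = 0.5, alternative = "greater", correct = FALSE)

# prop.test() reports a chi-squared statistic, which is the square of z
all.equal(unname(res$statistic), z^2)
all.equal(as.numeric(res$p.value), p_manual)
```

With the correction left on (the default), `prop.test()` would report a slightly larger p-value.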

Question 2 - Two Independent Proportions Hypothesis Test

For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?

# Create subsets for Spain 2005 and 2012
spain05 <- subset(atheism, nationality == "Spain" & year == 2005)
spain12 <- subset(atheism, nationality == "Spain" & year == 2012)

Write hypotheses and determine tails of the test
\[H_0: A_{2005} = A_{2012}\] \[H_A: A_{2005} \neq A_{2012}\] where \(A_{n}\) is Spain's atheism index in year \(n\).
This is a two-tailed test.
Find p-value and use to make decision to reject H0 or fail to reject H0

paste("Responses in 2005:",
      n05 <- length(spain05$response)
)

## [1] "Responses in 2005: 1146"

paste("Responses in 2012:",
      n12 <- length(spain12$response)
)

## [1] "Responses in 2012: 1145"

paste("Atheists in 2005:",
      x05 <- table(spain05$response)[["atheist"]]
)

## [1] "Atheists in 2005: 115"

paste("Atheists in 2012:",
      x12 <- table(spain12$response)[["atheist"]]
)

## [1] "Atheists in 2012: 103"

paste("Pooled Proportion:", round(
  pp <- (x05+x12)/(n05+n12)
,4))

## [1] "Pooled Proportion: 0.0952"

paste("Complement of Pooled Proportion:", round(
  pq <- 1-pp
,4))

## [1] "Complement of Pooled Proportion: 0.9048"

paste("Standard Error:", round(
  se <- sqrt((pp*pq/n05)+(pp*pq/n12))
,4))

## [1] "Standard Error: 0.0123"

paste("Point Estimate:", round(
  pe <- (x05/n05)-(x12/n12)
,4))

## [1] "Point Estimate: 0.0104"

paste("Test Statistic:", round(
  t_stat <- pe / se
,4))

## [1] "Test Statistic: 0.8476"

paste("P-Value:", round(
  pnorm(abs(t_stat), lower.tail = FALSE)*2
,4))

## [1] "P-Value: 0.3966"

As our p-value is not sufficiently low, we cannot reject the null hypothesis.

State conclusion

There is not sufficient evidence to support the claim that the proportion of atheists in Spain changed between 2005 and 2012.
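
The pooled two-proportion calculation above is repeated in Question 3, so it can be wrapped in a small function. The sketch below uses a hypothetical helper name, `two_prop_z_test()`, which is not part of the original project, and plugs in the Spain counts printed above. The final line compares the result with base R's `prop.test()` with the continuity correction turned off.

```r
# Hypothetical helper (not from the original project): pooled two-proportion z-test
two_prop_z_test <- function(x1, n1, x2, n2) {
  pooled <- (x1 + x2) / (n1 + n2)                       # pooled proportion under H0
  se <- sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) # pooled standard error
  z <- (x1 / n1 - x2 / n2) / se
  list(z = z, p_value = 2 * pnorm(abs(z), lower.tail = FALSE)) # two-tailed p-value
}

# Spain counts as reported in the output above
spain <- two_prop_z_test(115, 1146, 103, 1145)
round(spain$z, 4)        # 0.8476
round(spain$p_value, 4)  # 0.3966

# Cross-check against prop.test() without continuity correction
builtin <- prop.test(c(115, 103), c(1146, 1145), correct = FALSE)
all.equal(as.numeric(builtin$p.value), spain$p_value)
```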

Question 3 - Two Independent Proportions Hypothesis Test

Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?

# Create subsets for USA 2005 and 2012
USA05 <- subset(atheism, nationality == "United States" & year == 2005)
USA12 <- subset(atheism, nationality == "United States" & year == 2012)

Write hypotheses and determine tails of the test
\[H_0: A_{2005} = A_{2012}\] \[H_A: A_{2005} \neq A_{2012}\] where \(A_{n}\) is the United States' atheism index in year \(n\).
This is a two-tailed test.
Find p-value and use to make decision to reject H0 or fail to reject H0

paste("Responses in 2005:",
      n05 <- length(USA05$response))

## [1] "Responses in 2005: 1002"

paste("Responses in 2012:",
      n12 <- length(USA12$response))

## [1] "Responses in 2012: 1002"

paste("Atheists in 2005:",
      x05 <- table(USA05$response)[["atheist"]]
)

## [1] "Atheists in 2005: 10"

paste("Atheists in 2012:",
      x12 <- table(USA12$response)[["atheist"]]
)

## [1] "Atheists in 2012: 50"

paste("Pooled Proportion:", round(
  pp <- (x05+x12)/(n05+n12)
,4))

## [1] "Pooled Proportion: 0.0299"

paste("Complement of Pooled Proportion:", round(
  pq <- 1-pp
,4))

## [1] "Complement of Pooled Proportion: 0.9701"

paste("Standard Error:", round(
  se <- sqrt((pp*pq/n05)+(pp*pq/n12))
,4))

## [1] "Standard Error: 0.0076"

paste("Point Estimate:", round(
  pe <- (x05/n05)-(x12/n12)
,4))

## [1] "Point Estimate: -0.0399"

paste("Test Statistic:", round(
  t_stat <- pe / se
,4))

## [1] "Test Statistic: -5.2431"

paste("P-Value:", round(
  pnorm(abs(t_stat), lower.tail = FALSE)*2
,8))

## [1] "P-Value: 1.6e-07"

As our p-value is sufficiently low, we may reject the null hypothesis in favor of the alternative.

State conclusion

There is sufficient evidence to suggest that the proportion of atheists in the United States of America has changed between 2005 and 2012.
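
The z-tests in Questions 2 and 3 rely on the normal approximation, which is usually justified by a success-failure check. The sketch below assumes the common rule of thumb of at least 10 expected successes and failures in each group under the pooled proportion. Both the threshold and the helper name `success_failure_ok()` are assumptions for illustration, not part of the assignment.

```r
# Hypothetical helper: checks expected successes and failures under the pooled proportion
success_failure_ok <- function(x1, n1, x2, n2, threshold = 10) {
  pooled <- (x1 + x2) / (n1 + n2)
  expected <- c(n1 * pooled, n1 * (1 - pooled),
                n2 * pooled, n2 * (1 - pooled))
  all(expected >= threshold)
}

success_failure_ok(10, 1002, 50, 1002)    # United States counts from the output above
success_failure_ok(115, 1146, 103, 1145)  # Spain counts from Question 2
```

Note that the check uses the pooled proportion rather than each sample's own proportion, which matters for the United States in 2005, where only 10 atheists were observed.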

Question 4 - Minimum Sample Size

Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 3%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?

paste("The minimum number of people required to fit the given criteria is", ceiling(
  qnorm(0.975)^2*0.5*0.5/0.03^2 # Formula for number of samples given the criteria
))

## [1] "The minimum number of people required to fit the given criteria is 1068"
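
The calculation above solves the margin-of-error formula for \(n\), giving \(n = z^{*2}\,p(1-p)/E^2\), with \(p = 0.5\) as the most conservative guess. A minimal sketch of the same formula as a reusable function follows; `min_sample_size()` is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: minimum n for a proportion CI with a given margin of error
min_sample_size <- function(margin, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2) # critical value for the confidence level
  ceiling(z^2 * p * (1 - p) / margin^2) # always round up to a whole person
}

min_sample_size(0.03)           # 1068, matching the result above
min_sample_size(0.03, p = 0.2)  # smaller n if a prior estimate of p were available
```

Using \(p = 0.5\) maximizes \(p(1-p)\), so the resulting sample size is sufficient whatever the true proportion turns out to be.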

Question 5

Use the NSCC student dataset for Questions 5-8.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.

paste("Number of Students Polled:",
  n <- length(nscc_student_data$VoterReg)
)

## [1] "Number of Students Polled: 40"

paste("Number of Students Registered:",
  x <- table(nscc_student_data$VoterReg)[["Yes"]]
)

## [1] "Number of Students Registered: 31"

paste("Sample Proportion:", round(
  p_hat <- x/n
,4))

## [1] "Sample Proportion: 0.775"

paste("Standard Error:", round(
  se <- sqrt(p_hat*(1-p_hat)/n)
,4))

## [1] "Standard Error: 0.066"

paste("Margin of Error:", round(
  me <- qnorm(0.975)*se
,4))

## [1] "Margin of Error: 0.1294"

paste("We can be 95% confident that the true proportion of students registered to vote is between ",
      round((p_hat-me)*100, 1),
      "% and ",
      round((p_hat+me)*100, 1),
      "%", sep=""
)

## [1] "We can be 95% confident that the true proportion of students registered to vote is between 64.6% and 90.4%"
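
The interval above is the standard Wald interval. With only 40 students it can be worth comparing against the Wilson score interval, which is what base R's `prop.test()` reports in `conf.int`. The sketch below uses the counts printed above (31 of 40); choosing Wilson as the comparison is an addition made here, not something the assignment asks for.

```r
x <- 31  # registered voters, from the output above
n <- 40  # students polled
p_hat <- x / n

# Wald interval, as computed step by step above
wald <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

# Wilson score interval from prop.test(), without continuity correction
wilson <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

round(wald, 3)
round(wilson, 3)
```

The Wilson interval is not centered on the sample proportion, so its endpoints differ somewhat from the Wald endpoints at this sample size.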

Question 6

Construct a 95% confidence interval of the average height of all NSCC students.

# Extract heights, dropping NAs and one implausible entry (25 inches or less).
sample_data <- nscc_student_data[!is.na(nscc_student_data$Height) & nscc_student_data$Height > 25,]$Height

paste("Number of Samples:",
  n <- length(sample_data)
)

## [1] "Number of Samples: 38"

paste("Average Sample Height:", round(
  x <- mean(sample_data)
,4))

## [1] "Average Sample Height: 66.0645"

paste("Standard Deviation:", round(
  sd <- sd(sample_data)
,4))

## [1] "Standard Deviation: 4.5249"

paste("Margin of Error:", round(
  me <- qt(0.975, df = n-1)*sd/sqrt(n) # t critical value times the standard error of the mean
,4))

## [1] "Margin of Error: 1.4873"

paste("We can be 95% confident that the true mean height of the students is between",
      round(x-me, 1),
      "and",
      round(x+me, 1),
      "inches"
)

## [1] "We can be 95% confident that the true mean height of the students is between 64.6 and 67.6 inches"
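
A confidence interval for a mean is built from the standard error \(s/\sqrt{n}\) and a t critical value with \(n-1\) degrees of freedom. The sketch below shows that construction on simulated heights and checks it against base R's `t.test()`. The `heights` vector is made up for illustration and is not the NSCC data.

```r
set.seed(1)
heights <- rnorm(38, mean = 66, sd = 4.5) # hypothetical data, NOT the NSCC sample

n <- length(heights)
se <- sd(heights) / sqrt(n)               # standard error of the sample mean
me <- qt(0.975, df = n - 1) * se          # margin of error from the t distribution
manual <- mean(heights) + c(-1, 1) * me

# t.test() computes the same interval
builtin <- as.numeric(t.test(heights, conf.level = 0.95)$conf.int)
all.equal(manual, builtin)
```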

Question 7

Starbucks is considering opening a coffee shop on the NSCC Danvers campus if it believes that a higher proportion of NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine whether a higher proportion of NSCC students drink coffee than Americans overall.

a.) Write hypotheses and determine tails of the test
\[H_0: p = 64\%\] \[H_A: p > 64\%\] where \(p\) is the proportion of NSCC students who drink coffee.
This is a one-tailed (upper-tail) test.

b.) Calculate sample statistics

paste("Number of Students Polled:",
  n <- length(nscc_student_data$Coffee)
)

## [1] "Number of Students Polled: 40"

paste("Number of Students Who Drink Coffee:",
  x <- table(nscc_student_data$Coffee)[["Yes"]]
)

## [1] "Number of Students Who Drink Coffee: 30"

paste("Sample Proportion:", round(
  p_hat <- x/n
,4))

## [1] "Sample Proportion: 0.75"

paste("Standard Error:", round(
  se <- sqrt(0.64*(1-0.64)/n) # use the null proportion, as in Question 1
,4))

## [1] "Standard Error: 0.0759"

paste("Test Statistic:", round(
  t_stat <- (p_hat - 0.64) / se
,4))

## [1] "Test Statistic: 1.4494"

c.) Determine probability of getting sample data by chance and use that to reject H0 or fail to reject H0

paste("P-Value:", round(
  pnorm(t_stat, lower.tail = FALSE) # upper-tail area, matching H_A: p > 64%
,4))

## [1] "P-Value: 0.0736"

The p-value is not sufficiently low, therefore we cannot reject the null hypothesis.

d.) Conclusion

There is not sufficient evidence to suggest that the proportion of coffee drinkers at NSCC is higher than the national proportion of 64%.
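
Because the sample is small (40 students), an exact binomial test is a reasonable cross-check on the normal approximation. The sketch below uses base R's `binom.test()` with the counts printed above (30 of 40). Running an exact test is an addition made here and is not required by the assignment.

```r
# Exact one-sided test of H0: p = 0.64 against HA: p > 0.64
exact <- binom.test(30, 40, p = 0.64, alternative = "greater")
exact$p.value

# Equivalent direct calculation: P(X >= 30) when X ~ Binomial(40, 0.64)
direct <- pbinom(29, size = 40, prob = 0.64, lower.tail = FALSE)
all.equal(exact$p.value, direct)
```

The exact p-value is also above 0.05, so the decision not to reject the null hypothesis is unchanged.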