Project #6 - Inference on Categorical Data

Preparation

The project will use two datasets from the internet – atheism and nscc_student_data. Store the atheism dataset in your environment by using the following R chunk. Do some exploratory analysis using the str() function and viewing the dataframe. None of this will be graded; it is just something for you to do on your own.

# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")

# Load dataset into environment
load("atheism.RData")

Load the “nscc_student_data.csv” file in the following R chunk and refamiliarize yourself with this dataset as well.

nscc_student_data <- read.csv("~/Downloads/nscc_student_data.csv")

Question 1 - Single Proportion Hypothesis Test

In the 2010 playoffs, the National Football League (NFL) changed their overtime rules amid concerns that whichever team won an overtime coin toss (by luck) had a significant advantage to win the game. Nicholas Gorgievski, et al, published research in 2010 that stated that out of 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team which wins the coin flip wins more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)

Write hypotheses and determine tails of the test

H0: p = 0.5

HA: p > 0.5

Right-tailed, because the claim is that the coin-toss winner wins more often than its opponent.

Find p-value of sample data and use to make decision to reject H0 or fail to reject H0

# Calculate sample proportion of games won by coin flip team
pHat <- 235/414

# Standard error under H0 (p = 0.5)
p <- 0.5
se <- sqrt(p*(1-p)/414)

# Test statistic

(pHat-p)/se

## [1] 2.75225

# p-value (upper tail only, since HA is p > 0.5)

pnorm(2.75225, lower.tail=FALSE)

## [1] 0.002959366

State conclusion
Because the p-value is well below 0.05, we reject H0. The data suggest that the team that wins the overtime coin toss wins significantly more often than its opponent.
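
As a cross-check, the same test can be run with R's built-in prop.test(). The sketch below assumes the right-tailed alternative that matches the claim "wins more often" and turns off the continuity correction so the result lines up with the hand calculation. The variable names are illustrative only.

```r
# Counts from the question: 235 of 414 overtime games won by the coin-toss winner
wins  <- 235
games <- 414

# One-sample proportion test against p = 0.5, right-tailed, no continuity correction
res <- prop.test(x = wins, n = games, p = 0.5,
                 alternative = "greater", correct = FALSE)

# prop.test reports X-squared; its square root is the magnitude of the z statistic
z_from_chisq <- sqrt(unname(res$statistic))
z_by_hand    <- (wins / games - 0.5) / sqrt(0.5 * 0.5 / games)

z_from_chisq
z_by_hand
res$p.value
```

The two z values should agree, and res$p.value is the upper-tail area beyond that z.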

Question 2 - Two Independent Proportions Hypothesis Test

For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?

# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)

Write hypotheses and determine tails of the test

H0: p1 = p2

HA: p1 ≠ p2

Two-tailed, where p1 and p2 are the proportions of atheists in Spain in 2005 and 2012.

Find p-value and use to make decision to reject H0 or fail to reject H0

# Use table() to get x1, x2, n1, and n2
table(spain2005$response)

## 
##     atheist non-atheist 
##         115        1031

table(spain2012$response)

## 
##     atheist non-atheist 
##         103        1042

# Store values
x1 <- 115
x2 <- 103
n1 <- 115+1031
n2 <- 103+1042

# p-pool
ppool <- (x1+x2)/(n1+n2)

# SE
se <- sqrt((ppool*(1-ppool)/n1)+(ppool*(1-ppool)/n2))

# Test statistic
p1 <- x1/n1
p2 <- x2/n2
(p1-p2)/se

## [1] 0.8476341

# p-value
pnorm(-0.8476341)*2

## [1] 0.3966418

State conclusion
Because the p-value (0.397) is greater than 0.05, we fail to reject H0. There is not enough evidence to conclude that Spain's atheism index changed between 2005 and 2012.
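
The pooled two-proportion test can also be checked with prop.test(), which accepts vectors of counts and sample sizes. This is a minimal sketch that assumes the counts shown in the tables above and disables the continuity correction so it corresponds to the hand calculation; the vector names are illustrative.

```r
# Atheist counts and sample sizes for Spain, taken from the tables above
atheists <- c(115, 103)                 # 2005, 2012
totals   <- c(115 + 1031, 103 + 1042)   # 2005, 2012

# Two-sample test of equal proportions, two-sided, no continuity correction
res <- prop.test(x = atheists, n = totals, correct = FALSE)

sqrt(unname(res$statistic))  # magnitude of the pooled z statistic
res$p.value                  # two-sided p-value
```

For two groups, the X-squared statistic reported by prop.test() is the square of the pooled z statistic, so its square root and p-value can be compared directly with the values computed by hand.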

Question 3 - Two Independent Proportions Hypothesis Test

Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?

# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)

Write hypotheses and determine tails of the test

H0: p1 − p2 = 0

HA: p1 − p2 ≠ 0

Two-tailed

Find p-value and use to make decision to reject H0 or fail to reject H0

table(USA2005$response)

## 
##     atheist non-atheist 
##          10         992

table(USA2012$response)

## 
##     atheist non-atheist 
##          50         952

x1 <- 10
x2 <- 50
n1 <- 10+992
n2 <- 50+952

ppool <- (x1+x2)/(n1+n2)

se <- sqrt((ppool*(1-ppool)/n1)+(ppool*(1-ppool)/n2)) 

p1 <- x1/n1
p2 <- x2/n2
(p1-p2)/se

## [1] -5.243063

pnorm(-5.243063)*2

## [1] 1.579326e-07

State conclusion

Because the p-value (about 1.58e-07) is far below 0.05, we reject H0. The data suggest that the U.S. atheism index changed between 2005 and 2012.
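
Since Questions 2 and 3 repeat the same steps, they could be wrapped in a small function. The sketch below is one possible way to do that; two_prop_z_test is a hypothetical helper name, not part of base R, and it implements the pooled, two-sided test used above.

```r
# Hypothetical helper: pooled two-proportion z-test (two-sided)
two_prop_z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)                        # pooled proportion under H0
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # pooled standard error
  z <- (p1 - p2) / se
  list(z = z, p_value = 2 * pnorm(-abs(z)))              # both tails
}

# United States, 2005 vs 2012 (counts from the tables above)
us <- two_prop_z_test(x1 = 10, n1 = 10 + 992, x2 = 50, n2 = 50 + 952)
us$z
us$p_value
```

Using -abs(z) inside pnorm() keeps the p-value correct whether the test statistic comes out positive or negative.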

Question 4 - Minimum Sample Size

Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 2%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?

1.96^2*0.5*0.5/0.02^2

## [1] 2401

You would need to sample at least 2,401 people.
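
The calculation uses p = 0.5 because that value maximizes p(1 − p) and therefore gives the largest (safest) required sample size when nothing is known about the true proportion. A minimal sketch of the same formula as a reusable function follows; min_sample_size is a hypothetical name, and it uses qnorm() for the critical value rather than the rounded 1.96.

```r
# Hypothetical helper: minimum n for a proportion CI with a given margin of error
min_sample_size <- function(margin, conf_level = 0.95, p_guess = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)               # critical value, about 1.96 for 95%
  ceiling(z^2 * p_guess * (1 - p_guess) / margin^2)  # always round up to a whole person
}

min_sample_size(margin = 0.02)
```

Rounding up with ceiling() matters because rounding down would leave the margin of error slightly larger than requested.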

Question 5

Use the NSCC Student Dataset for Questions 5-8.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.

table(nscc_student_data$VoterReg)

## 
##  No Yes 
##   9  31

registered <- 31/40

ss <- 40

se <- sqrt((registered)*(1-registered)/ss)

registered+1.96*se

## [1] 0.9044101

registered-1.96*se

## [1] 0.6455899

We are 95% confident that between 64.6% and 90.4% of all NSCC students are registered voters.
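
With only 9 "No" responses, the usual success-failure guideline of at least 10 observations in each group is not quite met, so the normal approximation is somewhat rough here. The sketch below repeats the interval as a function and prints an exact interval from binom.test() for comparison; wald_ci is a hypothetical helper name.

```r
# Hypothetical helper: Wald (normal-approximation) CI for one proportion
wald_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  z <- qnorm(1 - (1 - conf_level) / 2)
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Counts from table(nscc_student_data$VoterReg)
yes <- 31
no  <- 9

# Success-failure check: both counts are usually expected to be at least 10
c(yes = yes, no = no) >= 10

wald_ci(yes, yes + no)                          # normal-approximation interval
as.numeric(binom.test(yes, yes + no)$conf.int)  # exact (Clopper-Pearson) interval
```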

Question 6

Construct a 95% confidence interval of the average height of all NSCC students.

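# Sample mean and standard deviation of height (missing values dropped)
height_mean <- mean(nscc_student_data$Height, na.rm = TRUE)

height_sd <- sd(nscc_student_data$Height, na.rm = TRUE)

# Bounds using the normal critical value 1.96; n = 40 assumes no heights are missing
height_mean + 1.96 * height_sd / sqrt(40)

## [1] 67.81053

height_mean - 1.96 * height_sd / sqrt(40)

## [1] 61.23819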

We are 95% confident that the average height of all NSCC students is between 61.2 and 67.8 inches.
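
For a mean with an unknown population standard deviation, a t critical value is the more common choice than 1.96, and the sample size should count only non-missing values. The sketch below shows one way to do both; mean_ci is a hypothetical helper, and the heights vector is made-up data used only so the example runs on its own.

```r
# Hypothetical helper: t-based CI for a mean, counting only non-missing values
mean_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  margin <- t_crit * sd(x) / sqrt(n)
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Made-up heights in inches (NOT the NSCC data), with one missing value
set.seed(1)
heights <- c(round(rnorm(39, mean = 65, sd = 4), 1), NA)

mean_ci(heights)
as.numeric(t.test(heights)$conf.int)   # built-in equivalent
```

With the real data loaded, the same call would be mean_ci(nscc_student_data$Height).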

Question 7

Starbucks is considering opening a coffee shop on NSCC Danvers campus if they believe that more NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine if more NSCC students drink coffee than other Americans.

a.) Write hypotheses and determine tails of the test

H0: p = 0.64

HA: p > 0.64

Right-tailed

b.) Calculate sample statistics

table(nscc_student_data$Coffee)

## 
##  No Yes 
##  10  30

c.) Determine probability of getting sample data by chance and use that to reject Ho or fail to reject Ho

prop.test(x=30, n=40, p=0.64, alternative="greater", correct=FALSE)

## 
##  1-sample proportions test without continuity correction
## 
## data:  30 out of 40, null probability 0.64
## X-squared = 2.1007, df = 1, p-value = 0.07362
## alternative hypothesis: true p is greater than 0.64
## 95 percent confidence interval:
##  0.6240271 1.0000000
## sample estimates:
##    p 
## 0.75

The p-value is 0.07362, which is greater than 0.05, so we fail to reject the null hypothesis.

d.) Conclusion

The data do not provide sufficient evidence that a greater proportion of NSCC students drink coffee than the national proportion of 64%.
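
The prop.test() result can be reproduced by hand with the same z-test steps used in Question 1. This is a minimal sketch using the counts from the table above; the variable names are illustrative.

```r
# Counts from table(nscc_student_data$Coffee): 30 "Yes" out of 40
p_hat <- 30 / 40
p0    <- 0.64
n     <- 40

# Standard error uses the null value p0, not p_hat
se <- sqrt(p0 * (1 - p0) / n)
z  <- (p_hat - p0) / se

z^2                            # corresponds to X-squared from prop.test
pnorm(z, lower.tail = FALSE)   # right-tailed p-value
```

Squaring z gives the X-squared statistic that prop.test() reports, and the upper-tail area gives its p-value.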