Project #6 - Inference on Categorical Data

Purpose

In this project, students will demonstrate their understanding of inference on categorical data. Inference on numerical data is mixed in as well, and students are expected to identify the proper type of inference to apply depending on the variable type. Unless stated otherwise, students should assume a significance level of 0.05.

Preparation

The project will use two datasets from the internet: atheism and nscc_student_data. Store the atheism dataset in your environment by using the following R chunk. Do some exploratory analysis using the str() function and by viewing the dataframe. None of this will be graded; it is just something for you to do on your own.

# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")

# Load dataset into environment
load("atheism.RData")

str(atheism)

## 'data.frame':    88032 obs. of  3 variables:
##  $ nationality: Factor w/ 57 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ response   : Factor w/ 2 levels "atheist","non-atheist": 2 2 2 2 2 2 2 2 2 2 ...
##  $ year       : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...

Load the “nscc_student_data.csv” file in the R chunk below and refamiliarize yourself with this dataset as well.

#Loading the nscc_student_data file
getwd()
## [1] "/Users/ryanduggan/Downloads"
nscc_student_data <- read.csv("nscc_student_data-2.csv")

Question 1 - Single Proportion Hypothesis Test

In the 2010 playoffs, the National Football League (NFL) changed their overtime rules amid concerns that whichever team won an overtime coin toss (by luck) had a significant advantage to win the game. Nicholas Gorgievski, et al, published research in 2010 that stated that out of 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team which wins the coin flip wins more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)

Write hypotheses and determine tails of the test

\(H_0: p = 0.5\)

\(H_A: p > 0.5\)

This is a one-tailed test, looking at the upper tail.

Find p-value of sample data and use to make decision to reject H0 or fail to reject H0

# Calculate sample proportion of games won by coin flip team
(p_hat <- 235/414)

## [1] 0.5676329

# SE under the null hypothesis (p0 = 0.5), with n = 414 games
p0 <- 0.5
SE <- sqrt((p0*(1-p0))/414)

# Test statistic
z <- (p_hat-p0)/SE
round(z, 2)
## [1] 2.75

# p-value
round(pnorm(z, lower.tail = FALSE), 3)
## [1] 0.003

The p-value (0.003) is less than alpha (0.05), and we therefore reject the null hypothesis in favor of the alternative.

State conclusion
The data suggest that the team which wins the coin flip wins more often than its opponent.
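
As a cross-check, the same one-proportion z-test can be wrapped in a small function and compared against base R's prop.test(). This is only a sketch: one_prop_ztest is a hypothetical helper name that is not part of the project, and it assumes the standard error is computed under the null value p0 = 0.5 with n = 414 games.

```r
# Hypothetical helper (not part of the original project): one-proportion z-test
one_prop_ztest <- function(x, n, p0,
                           alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  p_hat <- x / n
  se0 <- sqrt(p0 * (1 - p0) / n)  # standard error under H0
  z <- (p_hat - p0) / se0
  p_value <- switch(alternative,
    greater   = pnorm(z, lower.tail = FALSE),
    less      = pnorm(z),
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE)
  )
  list(p_hat = p_hat, z = z, p_value = p_value)
}

# 235 of 414 overtime games were won by the coin-toss winner
nfl <- one_prop_ztest(x = 235, n = 414, p0 = 0.5, alternative = "greater")
round(nfl$z, 2)        # about 2.75
round(nfl$p_value, 3)  # about 0.003

# Without the continuity correction, prop.test() reports X-squared = z^2
check <- prop.test(235, 414, p = 0.5, alternative = "greater", correct = FALSE)
all.equal(unname(check$statistic), nfl$z^2)
all.equal(check$p.value, nfl$p_value)
```

Either route leads to the same decision at the 0.05 level.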

Question 2 - Two Independent Proportions Hypothesis Test

For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?

# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)

Write hypotheses and determine tails of the test

\(H_0: p_1 = p_2\)

\(H_A: p_1 \neq p_2\)

This is a two-tailed test.

Find p-value and use to make decision to reject H0 or fail to reject H0

# Use table() to get x1, x2, n1, and n2
table(spain2005$response)
## 
##     atheist non-atheist 
##         115        1031
table(spain2012$response)
## 
##     atheist non-atheist 
##         103        1042

# Store values
x1 <- 115
x2 <- 103
n1 <- 115+1031
n2 <- 103+1042

# p-pool
(ppool <- (x1+x2)/(n1+n2))
## [1] 0.09515495

#Storing q
(q <- 1 - ppool)
## [1] 0.904845

# SE
SE <- sqrt(((ppool*q)/n1)+((ppool*q)/n2))

# Test statistic
p1 <- x1/n1
p2 <- x2/n2

((p1-p2)-0)/SE
## [1] 0.8476341

# p-value
pnorm( 0.8476341, lower.tail = FALSE)*2
## [1] 0.3966418

The p-value (0.397) is greater than alpha (0.05), therefore we fail to reject the null hypothesis.

State conclusion
There is not sufficient evidence to say that Spain has seen a change in its atheism index between 2005 and 2012.
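
The pooled two-proportion calculation above can be collected into one reusable function. The sketch below assumes the counts read from the Spain tables (115 of 1146 in 2005, 103 of 1145 in 2012); two_prop_ztest is a hypothetical helper name, not something provided by the course or by R.

```r
# Hypothetical helper: two-sided z-test for two independent proportions,
# using the pooled proportion in the standard error
two_prop_ztest <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se
  list(p1 = p1, p2 = p2, z = z,
       p_value = 2 * pnorm(abs(z), lower.tail = FALSE))
}

spain <- two_prop_ztest(x1 = 115, n1 = 1146, x2 = 103, n2 = 1145)
spain$z        # matches the test statistic above (about 0.848)
spain$p_value  # matches the p-value above (about 0.397)

# Base R cross-check: with correct = FALSE, X-squared equals z^2
spain_check <- prop.test(c(115, 103), c(1146, 1145), correct = FALSE)
all.equal(unname(spain_check$statistic), spain$z^2)
```

Using abs(z) in the p-value keeps the function correct whichever year has the larger sample proportion.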

Question 3 - Two Independent Proportions Hypothesis Test

Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?

# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)

Write hypotheses and determine tails of the test

\(H_0: p_1 - p_2 = 0\)

\(H_A: p_1 - p_2 \neq 0\)

This is a two-tailed test.

Find p-value and use to make decision to reject H0 or fail to reject H0

#Using a table to find x1, x2, n1, and n2
table(USA2005$response)
## 
##     atheist non-atheist 
##          10         992
table(USA2012$response)
## 
##     atheist non-atheist 
##          50         952

#Storing values
x1 <- 10
x2 <- 50
n1 <- 10 + 992
n2 <- 50 + 952

ppool <- (x1+x2)/(n1+n2)
q <- 1-ppool

#Finding the standard error
se <- sqrt(((ppool*q)/n1)+((ppool*q)/n2))

#Test statistic
p1 <- x1/n1
p2 <- x2/n2

(((p1-p2)-0)/se)
## [1] -5.243063

#p-value
pnorm(-5.243063)*2
## [1] 1.579326e-07

The p-value (about 1.58e-07) is less than alpha (0.05), and we therefore reject the null hypothesis in favor of the alternative.

State conclusion
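The data suggest that the United States has seen a change in its atheism index from 2005 to 2012. In the samples, the proportion of atheists rose from about 1% (10 of 1002) to about 5% (50 of 1002), so the change is an increase.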
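
Because the test is two-sided, a confidence interval for the difference helps show the direction and size of the change. The sketch below is one way to do this, assuming the counts from the tables above and the usual unpooled standard error for an interval; the variable names are illustrative only.

```r
# Counts from the US tables: atheists and sample sizes by year
a05 <- 10; n05 <- 10 + 992   # 2005
a12 <- 50; n12 <- 50 + 952   # 2012

p05 <- a05 / n05
p12 <- a12 / n12

# Unpooled standard error, since the interval does not assume p1 = p2
se_diff <- sqrt(p05 * (1 - p05) / n05 + p12 * (1 - p12) / n12)
z_crit <- qnorm(0.975)  # critical value for 95% confidence

diff_ci <- c(lower = (p12 - p05) - z_crit * se_diff,
             upper = (p12 - p05) + z_crit * se_diff)
round(diff_ci, 3)  # roughly 0.025 to 0.055
```

The whole interval lies above zero, which is consistent with an increase of roughly 2.5 to 5.5 percentage points.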

Question 4 - Minimum Sample Size

Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 2%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?

#Calculating minimum sample size
(1.96^2*.5*.5)/.02^2
## [1] 2401

At least 2401 people would have to be sampled to stay within the specified margin of error at the 95% confidence level.
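
The sample size formula n = z^2 * p * (1 - p) / ME^2 can be written as a small function that always rounds up. This is a sketch with a hypothetical function name (min_sample_size); it uses qnorm() for the critical value rather than the rounded 1.96, and defaults to p = 0.5 as the most conservative guess.

```r
# Hypothetical helper: minimum n for a proportion CI with a given margin of error
min_sample_size <- function(margin, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # critical value, about 1.96 for 95%
  ceiling(z^2 * p * (1 - p) / margin^2) # always round up to a whole person
}

min_sample_size(0.02)  # 2401
```

Rounding up matters because rounding down would leave the margin of error slightly above the target.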

Question 5

Use the NSCC Student Dataset for Questions 5-8.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.

#Creating a table of students in the sample to find the number of registered and unregistered voters
table(nscc_student_data$VoterReg)
## 
##  No Yes 
##   9  31

#Finding the standard error
(SE <- sqrt(((31/40)*(9/40))/40))
## [1] 0.06602556

#Calculating the lower and upper bounds of the confidence interval
(31/40) - 1.96*0.06602556
## [1] 0.6455899
(31/40) + 1.96*0.06602556
## [1] 0.9044101

We can be 95% confident that between 64.56% and 90.44% of all NSCC students are registered voters.
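
The interval above relies on the normal approximation, which is usually checked with the success-failure condition (at least 10 successes and 10 failures). The sketch below recomputes the interval and runs that check; prop_ci is a hypothetical helper name, and the exact interval from binom.test() is included only for comparison.

```r
# Hypothetical helper: normal-approximation CI for a proportion,
# plus a success-failure check
prop_ci <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - conf_level) / 2)
  se <- sqrt(p_hat * (1 - p_hat) / n)
  list(
    interval = c(lower = p_hat - z * se, upper = p_hat + z * se),
    success_failure_ok = min(x, n - x) >= 10
  )
}

voters <- prop_ci(x = 31, n = 40)
voters$interval            # close to the 0.6456 and 0.9044 found above
voters$success_failure_ok  # FALSE: only 9 "No" responses

# Exact (Clopper-Pearson) interval from base R, for comparison
binom.test(31, 40)$conf.int
```

With only 9 unregistered students in the sample, the condition is narrowly missed, so the normal-approximation interval should be read with some caution.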

Question 6

Construct a 95% confidence interval of the average height of all NSCC students.

#Storing the mean and standard deviation for the height of students
mean_height <- mean(nscc_student_data$Height, na.rm=TRUE)
sd_height <- sd(nscc_student_data$Height, na.rm=TRUE)

#Counting the NA values in the height variable to find the correct sample size (n = 39, so df = 38)
table(is.na(nscc_student_data$Height))
## 
## FALSE  TRUE 
##    39     1

#Finding the t critical value
t_height <- abs(qt(p=0.025, df=38))

#Calculating standard error
SE_height <- sd_height/sqrt(39)

#Calculating the lower and upper bounds of the confidence interval
mean_height - t_height*SE_height
## [1] 61.08699
mean_height + t_height*SE_height
## [1] 67.96173

We can be 95% confident that the average height of all NSCC students is between 61.09 and 67.96.
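
The same t interval can be computed without hard-coding the sample size, and checked against t.test(). In this sketch, t_ci is a hypothetical helper name and the heights vector is made up purely for illustration; it is not the NSCC data.

```r
# Hypothetical helper: t-based CI for a mean, dropping NA values first
t_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  se <- sd(x) / sqrt(n)
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# Made-up example values (NOT the NSCC data), including one NA
heights <- c(64, 70, NA, 62, 68, 66, 71, 59)

manual <- t_ci(heights)
builtin <- t.test(heights)$conf.int  # t.test() also drops NA values

all.equal(unname(manual), as.numeric(builtin))
```

On the real data, t_ci(nscc_student_data$Height) should reproduce the bounds shown above.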

Question 7

Starbucks is considering opening a coffee shop on NSCC Danvers campus if they believe that more NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine if more NSCC students drink coffee than other Americans.

a.) Write hypotheses and determine tails of the test

\(H_0: p = 0.64\)

\(H_A: p > 0.64\)

This is a one-tailed test, looking at the upper tail.

b.) Calculate sample statistics

#Creating a table to see how many NSCC students do and do not drink coffee
table(nscc_student_data$Coffee)
## 
##  No Yes 
##  10  30

#Storing values
p <- 30/40
q <- 10/40
n <- 40

c.) Determine probability of getting sample data by chance and use that to reject Ho or fail to reject Ho

#Calculating the standard error under the null hypothesis (p0 = 0.64)
p0 <- 0.64
se_coffee <- sqrt((p0*(1-p0))/n)

#Calculating the test statistic
z_coffee <- (p-p0)/se_coffee
round(z_coffee, 2)
## [1] 1.45

#Finding the p-value
round(pnorm(z_coffee, lower.tail = FALSE), 3)
## [1] 0.074

The p-value (0.074) is greater than alpha (0.05), and we therefore fail to reject the null hypothesis.

#Finding the minimum sample size for accurate data analysis of coffee drinkers at NSCC
(1.96^2*p*q)/.05^2

## [1] 288.12

For a 95% confidence interval with a margin of error of 5%, at least 289 people would need to be surveyed. Only 40 students were surveyed in this dataset, so the sample proportion is an imprecise estimate. A larger sample would represent the student body more accurately and could lead to a different result.

d.) Conclusion
There is not sufficient evidence to say that a larger proportion of NSCC students drink coffee than Americans overall. However, only 40 students were involved in this study, and they cannot fully reflect the whole student body of NSCC. If more students were surveyed, the evidence might or might not support the claim that NSCC students are more likely to drink coffee than the average American.
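
With a sample this small, an exact binomial test is a reasonable cross-check on the normal approximation. The sketch below uses base R's binom.test() with the counts from the coffee table (30 of 40) and the Gallup value 0.64 as the null proportion; it is offered as a supplementary check, not as the method the project asks for.

```r
# Exact one-sided binomial test: H0: p = 0.64 vs HA: p > 0.64
coffee_exact <- binom.test(x = 30, n = 40, p = 0.64, alternative = "greater")

coffee_exact$p.value         # exact probability of 30 or more "Yes" out of 40 under H0
coffee_exact$p.value > 0.05  # TRUE means we again fail to reject H0
```

The exact test does not depend on the success-failure condition, so it remains valid for small samples.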