In this project, students will demonstrate their understanding of inference on categorical data. Inference on numerical data is mixed in as well, and students are expected to identify the proper type of inference to apply depending on the variable type. Unless otherwise stated, students should assume a significance level of 0.05.
The project will use two datasets from the internet: atheism and nscc_student_data. Store the atheism dataset in your environment by using the following R chunk and do some exploratory analysis of the dataframe. This exploration will not be graded; it is just something for you to do on your own.
# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")
# Load dataset into environment
load("atheism.RData")
Load the “nscc_student_data.csv” file in the R chunk below and refamiliarize yourself with this dataset as well.
#Loading and storing NSCC student data in environment
nscc_student_data <- read.csv("C:/Users/jessi/Music/Statistics/nscc_student_data.csv")
In the 2010 playoffs, the National Football League (NFL) changed its overtime rules amid concerns that whichever team won the overtime coin toss (by luck) had a significant advantage in winning the game. Nicholas Gorgievski et al. published research in 2010 stating that out of 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team which wins the coin flip wins games more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)
\(H_0: p = 0.5\)
\(H_A: p > 0.5\)
Right-tailed
# Calculate sample proportion of games won by coin flip team
phat_NFL <- 235/414
#storing n and p
n_NFL <- 414
p_NFL <- 0.5
# SE
se_NFL <- sqrt(p_NFL*(1-p_NFL) / n_NFL)
# Test statistic
ts_NFL <- ((phat_NFL - p_NFL) / se_NFL)
# p-value
(pnorm(ts_NFL, lower.tail=FALSE))
## [1] 0.002959367
The p-value is 0.002959, which is smaller than the significance level of 0.05. We therefore reject the null hypothesis in favor of the alternative hypothesis.
There is significant evidence that the team that wins the overtime coin flip ultimately wins the game more often and thus has an advantage over its opponent.
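As a cross-check on the hand calculation above, the same one-proportion z-test can be sketched with base R's `prop.test()`. This is only a sanity check, not part of the required workflow; the name `nfl_check` is introduced here for illustration, and `correct = FALSE` is used so that the result matches the normal approximation without a continuity correction.

```r
# Cross-check of the one-proportion z-test (235 wins out of 414 games)
# nfl_check is an illustrative name, not part of the original project
nfl_check <- prop.test(x = 235, n = 414, p = 0.5,
                       alternative = "greater", correct = FALSE)

# One-sided p-value; should agree with the pnorm() result (about 0.00296)
nfl_check$p.value

# prop.test() reports a chi-squared statistic, which is the square of z
sqrt(unname(nfl_check$statistic))
```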
For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?
# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)
\(H_0: p_1 = p_2\)
\(H_A: p_1 \neq p_2\)
Two-tailed
# Using table function to calculate x1, x2, n1, n2
table(spain2005$response)
##
## atheist non-atheist
## 115 1031
table(spain2012$response)
##
## atheist non-atheist
## 103 1042
#Storing values
x1_sp <- 115
n1_sp <- 115+1031
x2_sp <- 103
n2_sp <- 103+1042
#Calculating p-pool
ppool_sp <- (x1_sp + x2_sp) / (n1_sp + n2_sp)
qpool_sp <- 1-ppool_sp
#standard error
se_sp <- sqrt( ((ppool_sp*qpool_sp)/n1_sp) + ((ppool_sp*qpool_sp)/n2_sp))
#test statistic
phat1_sp <- (x1_sp / n1_sp)
phat2_sp <- (x2_sp / n2_sp)
ts_sp <- (phat1_sp - phat2_sp) / se_sp
#p-value
# Two-tailed: double the tail area beyond |test statistic|
2 * pnorm(abs(ts_sp), lower.tail = FALSE)
## [1] 0.3966418
Our p-value of 0.3966 is greater than 0.05. We therefore fail to reject the null hypothesis.
There is not significant evidence that Spain has seen a change in its atheism index from 2005 to 2012.
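The pooled two-proportion test above can also be checked with `prop.test()`, which accepts vectors of success counts and sample sizes. The sketch below hard-codes the counts from the tables so that it runs without the dataset; `spain_check` is an illustrative name, and `correct = FALSE` again disables the continuity correction so the result lines up with the manual z-test.

```r
# Cross-check of the two-proportion test for Spain (2005 vs. 2012)
# Counts are copied from the table() output; spain_check is an illustrative name
spain_check <- prop.test(x = c(115, 103),
                         n = c(115 + 1031, 103 + 1042),
                         correct = FALSE)

# Two-sided p-value; should agree with the manual result (about 0.3966)
spain_check$p.value
```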
Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?
# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)
\(H_0: p_1 = p_2\)
\(H_A: p_1 \neq p_2\)
Two-tailed
# Using table function to calculate x1, x2, n1, n2
table(USA2005$response)
##
## atheist non-atheist
## 10 992
table(USA2012$response)
##
## atheist non-atheist
## 50 952
#Storing values
x1_USA <- 10
n1_USA <- 10+992
x2_USA <- 50
n2_USA <- 50+952
#Calculating p-pool
ppool_USA <- (x1_USA + x2_USA) / (n1_USA + n2_USA)
qpool_USA <- 1-ppool_USA
#Calculating standard error
se_USA <- sqrt( ((ppool_USA*qpool_USA)/n1_USA) + ((ppool_USA*qpool_USA)/n2_USA))
#Calculating test statistic
phat1_USA <- (x1_USA / n1_USA)
phat2_USA <- (x2_USA / n2_USA)
ts_USA <- (phat1_USA - phat2_USA) / se_USA
#Calculating p-value
# Two-tailed: double the tail area beyond |test statistic|
2 * pnorm(abs(ts_USA), lower.tail = FALSE)
## [1] 1.579324e-07
The p-value of about 0.00000016 is far smaller than 0.05. We therefore reject the null hypothesis in favor of the alternative hypothesis.
There is significant evidence that the United States has seen a change in its atheism index from 2005 to 2012.
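Questions 2 and 3 repeat the same steps, so they could be wrapped in a small helper. The function below is a hypothetical sketch (the name `two_prop_z_test` and its return format are assumptions, not part of the project); it takes the absolute value of the test statistic so the two-tailed p-value is correct whichever sample has the larger proportion.

```r
# Hypothetical helper: pooled two-proportion z-test, two-tailed
two_prop_z_test <- function(x1, n1, x2, n2) {
  p_pool <- (x1 + x2) / (n1 + n2)                         # pooled proportion
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))   # pooled standard error
  z <- (x1 / n1 - x2 / n2) / se                           # test statistic
  p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)        # two-tailed p-value
  list(z = z, p_value = p_value)
}

# Counts copied from the table() output for each country
spain <- two_prop_z_test(115, 1146, 103, 1145)
usa   <- two_prop_z_test(10, 1002, 50, 1002)

spain$p_value   # about 0.3966
usa$p_value     # about 1.58e-07
```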
Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 3%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?
#Calculating minimum sample size using a z-score of 1.96 and margin of error of 0.03
((1.96)^2 * 0.5 * 0.5) / (0.03)^2
## [1] 1067.111
Because a sample size must be a whole number and rounding down would exceed the margin of error, we round 1067.111 up. The survey would need a sample of at least 1068 people.
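The calculation above can be generalized in a short sketch. The function name `min_sample_size` and its arguments are assumptions made for illustration; it uses `qnorm()` for the critical value rather than the rounded 1.96, uses the conservative guess of 0.5 when nothing is known about \(\hat{p}\), and rounds up with `ceiling()`.

```r
# Hypothetical helper: minimum n for a proportion CI with a given margin of error
min_sample_size <- function(margin, conf_level = 0.95, p_guess = 0.5) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)   # critical value, about 1.96 for 95%
  ceiling(z_star^2 * p_guess * (1 - p_guess) / margin^2)
}

min_sample_size(0.03)   # 1068
```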
Use the NSCC student dataset for Questions 5-7.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.
#Finding sample proportion and storing
table(nscc_student_data$VoterReg)
##
## No Yes
## 9 31
phat_vote <- 31/(9+31)
qhat_vote <- 1-phat_vote
n_vote <- 9+31
#Calculating SE
se_vote <- sqrt((phat_vote*qhat_vote) / n_vote)
#Lower bound of confidence interval
(phat_vote - (1.96 * se_vote))
## [1] 0.6455899
#Upper bound of confidence interval
(phat_vote + (1.96 * se_vote))
## [1] 0.9044101
We can be 95% confident that between 64.6% and 90.4% of all NSCC students are registered to vote.
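The interval above relies on the normal approximation, which is usually considered reasonable when the sample has at least 10 successes and 10 failures. The sketch below recomputes the interval from the table counts and checks that condition; the variable names are illustrative. With only 9 "No" responses, the condition is narrowly missed, so the interval is best read as approximate.

```r
# Counts copied from the table() output for VoterReg
yes <- 31
no  <- 9
n   <- yes + no
phat <- yes / n

# 95% interval using the normal approximation
z_star <- qnorm(0.975)
wald_ci <- phat + c(-1, 1) * z_star * sqrt(phat * (1 - phat) / n)
wald_ci   # about 0.646 to 0.904

# Success-failure condition: both counts should be at least 10
c(successes = yes, failures = no) >= 10
```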
Construct a 95% confidence interval of the average height of all NSCC students.
#Finding mean, standard deviation, and sample size
mean_height <- mean(nscc_student_data$Height, na.rm = TRUE)
sd_height <- sd(nscc_student_data$Height, na.rm = TRUE)
table(is.na(nscc_student_data$Height))
##
## FALSE TRUE
## 39 1
n_height <- 39
#Lower bound of confidence interval
mean_height - 1.96 * (sd_height / sqrt(n_height))
## [1] 61.19633
#Upper bound of confidence interval
mean_height + 1.96 * (sd_height / sqrt(n_height))
## [1] 67.85239
We are 95% confident that the average height of all NSCC students is between 61.20 and 67.85 inches.
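The interval above uses the normal critical value 1.96. Because the population standard deviation is estimated from a sample of 39 heights, a t critical value with 38 degrees of freedom is the more conventional choice and gives a slightly wider interval. The sketch below only compares the two critical values, so it runs without the dataset; `t_star` and `z_star` are illustrative names.

```r
n_height <- 39

# Critical values for a 95% interval
t_star <- qt(0.975, df = n_height - 1)   # about 2.02
z_star <- qnorm(0.975)                   # about 1.96
c(t_star = t_star, z_star = z_star)

# Ratio of interval widths: the t-based interval is a few percent wider
t_star / z_star

# With the data loaded, t.test(nscc_student_data$Height)$conf.int
# would return the t-based interval directly (missing values are dropped).
```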
Starbucks is considering opening a coffee shop on the NSCC Danvers campus if it believes that a greater proportion of NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine whether the proportion of NSCC students who drink coffee is greater than the national proportion.
a.) Write hypotheses and determine tails of the test
\(H_0: p = 0.64\)
\(H_A: p > 0.64\)
Right-tailed
b.) Calculate sample statistics
#Finding and storing sample proportion data of NSCC coffee drinkers
table(nscc_student_data$Coffee)
##
## No Yes
## 10 30
# Calculate sample proportion of NSCC students who drink coffee
phat_cof <- 30/40
#storing n and p
n_cof <- 40
p_cof <- 0.64
c.) Determine probability of getting sample data by chance
# Standard error
se_cof <- sqrt(p_cof*(1-p_cof) / n_cof)
# Test statistic
ts_cof <- ((phat_cof - p_cof) / se_cof)
# p-value
(pnorm(ts_cof, lower.tail=FALSE))
## [1] 0.07361613
d.) Decision
Our p-value of 0.0736 is greater than 0.05 (though it is close). We therefore fail to reject the null hypothesis.
e.) Conclusion
The data do not provide sufficient evidence that the proportion of NSCC students who drink coffee is greater than the national proportion of 64%.
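With a sample of only 40 students, it may be worth confirming that the normal approximation is reasonable and comparing it with an exact test. The sketch below checks the expected counts under the null hypothesis and then runs base R's `binom.test()` as an alternative; the variable names are illustrative, and the exact test is offered as a cross-check rather than as the method the project requires.

```r
n_cof <- 40
p0    <- 0.64

# Expected successes and failures under H0; both should be at least 10
expected <- c(successes = n_cof * p0, failures = n_cof * (1 - p0))
expected
all(expected >= 10)

# Exact binomial test of H0: p = 0.64 against HA: p > 0.64 (30 of 40 drink coffee)
exact <- binom.test(x = 30, n = n_cof, p = p0, alternative = "greater")
exact$p.value   # also above 0.05, so the decision is unchanged
```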