In this project, students will demonstrate their understanding of inference on categorical data. Some questions call for inference on numerical data instead, and students are expected to identify the proper type of inference for each variable type. Unless stated otherwise, assume a significance level of 0.05.
The project uses two datasets from the internet: atheism and nscc_student_data. Store the atheism dataset in your environment by running the following R chunk, then do some exploratory analysis of the data frame. None of this will be graded; it is for your own exploration.
# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")
# Load dataset into environment
load("atheism.RData")
Load the “nscc_student_data.csv” file in the R chunk below and refamiliarize yourself with this dataset as well.
nscc_student_data <- read.csv("C:/Users/Guard/Desktop/nscc_student_data.csv")
In the 2010 playoffs, the National Football League (NFL) changed its overtime rules amid concerns that whichever team won the overtime coin toss (by luck) had a significant advantage in winning the game. Nicholas Gorgievski et al. published research in 2010 stating that, out of 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team that wins the coin flip wins the game more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)
\(H_0: p = 0.5\)
\(H_A: p > 0.5\)
One-tailed test
# Calculate sample proportion of games won by coin flip team
(phat<-235/414)
## [1] 0.5676329
The sample proportion of games won by the coin-toss winner is 56.8%.
# Standard error (computed here from the sample proportion)
(se<-sqrt((235/414)*(179/414)/414))
## [1] 0.02434781
# Store p as 0.5
p<-0.5
# Test statistic
(phat-p)/se
## [1] 2.777779
# p-value
pnorm(2.777779, lower.tail = FALSE)
## [1] 0.002736591
The standard error is about 0.024, the test statistic is about 2.78, and the p-value is about 0.003.
Since the p-value of 0.003 is less than the significance level of 0.05, we reject the null hypothesis.
There is sufficient evidence that the team that wins the overtime coin toss wins the game more often than its opponent.
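The test above builds the standard error from the sample proportion. Many introductory texts instead compute it from the null value, since the test assumes the null hypothesis is true. A minimal sketch of that variant follows; the variable names (`wins`, `games`, `p0`, `se_null`, `z_null`) are assumptions chosen for illustration and are not part of the original chunks.

```r
# Counts reported in the question
wins  <- 235   # overtime games won by the coin-toss winner
games <- 414   # overtime games with a winner
p0    <- 0.5   # proportion expected if the coin toss gives no advantage

phat_check <- wins / games
se_null    <- sqrt(p0 * (1 - p0) / games)   # standard error under H0
z_null     <- (phat_check - p0) / se_null   # test statistic
p_value    <- pnorm(z_null, lower.tail = FALSE)  # one-tailed (upper) p-value

round(c(z = z_null, p_value = p_value), 4)
```

The test statistic comes out slightly smaller than with the sample-based standard error, but the p-value stays well below 0.05, so the decision does not change.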
For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?
# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)
\(H_0: p_{\text{Spain, 2005}} = p_{\text{Spain, 2012}}\)
\(H_A: p_{\text{Spain, 2005}} \neq p_{\text{Spain, 2012}}\)
Two-tailed test
# Find number of Spain responses in 2005 and 2012
table(spain2005$response)
##
## atheist non-atheist
## 115 1031
table(spain2012$response)
##
## atheist non-atheist
## 103 1042
# Store variables
x1<-115
x2<-103
n1<-1146
n2<-1145
# Find and store pooled proportion
(Pooled<-(x1+x2)/(n1+n2))
## [1] 0.09515495
# Find and store point estimate
(PE<-x1/n1 - x2/n2)
## [1] 0.01039271
# Find and store standard error
(SE<-sqrt(Pooled*(1-Pooled)/n1 + Pooled*(1-Pooled)/n2))
## [1] 0.01226084
# Find test statistic
(PE-0)/SE
## [1] 0.8476341
# Find p-value
pnorm(abs(0.8476341), lower.tail = FALSE)*2
## [1] 0.3966418
Since the p-value of 0.40 is greater than the significance level of 0.05, we fail to reject the null hypothesis.
There is not sufficient evidence that Spain’s atheism index changed from 2005 to 2012.
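As a cross-check on the hand calculation, base R's `prop.test()` runs the same pooled two-proportion test when the continuity correction is turned off. This is a sketch using the counts from the tables above; the object names (`atheists`, `totals`, `spain_test`) are assumptions for illustration.

```r
# Atheist counts and sample sizes for Spain (2005, 2012), from the tables above
atheists <- c(115, 103)
totals   <- c(1146, 1145)

# correct = FALSE disables the continuity correction so the result
# lines up with the manual pooled z-test
spain_test <- prop.test(atheists, totals, correct = FALSE)

sqrt(unname(spain_test$statistic))  # magnitude of the z statistic
spain_test$p.value                  # two-sided p-value
```

`prop.test()` reports a chi-squared statistic, which for two groups is the square of the pooled z statistic, so taking the square root recovers the value computed by hand.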
Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?
# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)
\(H_0: p_{\text{US, 2005}} = p_{\text{US, 2012}}\)
\(H_A: p_{\text{US, 2005}} \neq p_{\text{US, 2012}}\)
Two-tailed test
# Find number of US responses in 2005 and 2012
table(USA2005$response)
##
## atheist non-atheist
## 10 992
table(USA2012$response)
##
## atheist non-atheist
## 50 952
# Store variables
X1_us<-10
X2_us<-50
n1_us<-1002
n2_us<-1002
# Find pooled proportion
(X1_us+X2_us)/(n1_us+n2_us)
## [1] 0.02994012
# Find and store point estimate
(PE_us<-X1_us/n1_us - X2_us/n2_us)
## [1] -0.03992016
# Find and store the standard error
(SE_us<-sqrt(0.02994012*(1-0.02994012)/n1_us + 0.02994012*(1-0.02994012)/n2_us))
## [1] 0.0076139
# Find test statistic
(PE_us-0)/SE_us
## [1] -5.243063
# Find p-value
pnorm(abs(-5.243063), lower.tail = FALSE)*2
## [1] 1.579326e-07
Since the p-value (about 1.6e-07) is far smaller than the significance level of 0.05, we reject the null hypothesis in favor of the alternative hypothesis.
There is sufficient evidence that the United States’ atheism index changed from 2005 to 2012.
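The normal approximation behind this test is usually considered reasonable only when each sample would be expected to contain at least 10 successes and 10 failures under the pooled proportion. Because only 10 atheists were observed in 2005, the condition is worth checking. A minimal sketch, with object names (`x`, `n`, `pooled`, `expected`) assumed for illustration:

```r
x <- c(10, 50)       # atheist responses in 2005 and 2012
n <- c(1002, 1002)   # total responses in each year

pooled <- sum(x) / sum(n)   # pooled proportion under H0

# Expected successes and failures in each sample under H0
expected <- rbind(atheist = n * pooled, non_atheist = n * (1 - pooled))
colnames(expected) <- c("2005", "2012")
expected

all(expected >= 10)  # TRUE means the success-failure condition is met
```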
Suppose you’re hired by the local government to estimate the proportion of residents in your state who attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 3%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?
# Find the minimum sample size (margin of error = 0.03, worst-case p = 0.5)
(1.96^2*0.5*0.5)/0.03^2
## [1] 1067.111
To be 95% confident with a margin of error no greater than 3%, at least 1,068 people need to be sampled (rounding up to the next whole person).
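The sample-size formula \(n = z^2 p(1-p)/ME^2\) can be wrapped in a small helper so the margin of error and confidence level are explicit. This is a sketch only; `min_sample_size` and its argument names are hypothetical, and it uses `qnorm()` rather than the rounded 1.96.

```r
# Minimum sample size for estimating a proportion.
# margin:     desired margin of error (as a proportion, e.g. 0.03)
# conf_level: confidence level
# p:          planning value for the proportion; 0.5 is the worst case
min_sample_size <- function(margin, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value
  ceiling(z^2 * p * (1 - p) / margin^2)  # always round up
}

min_sample_size(0.03)  # 3% margin of error from the question
```

Using \(p = 0.5\) maximizes \(p(1-p)\), so the resulting sample size is large enough whatever the true proportion turns out to be.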
Use the NSCC Student Dataset for Questions 5-7.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.
# Count registered and unregistered voters in the sample
table(nscc_student_data$VoterReg)
##
## No Yes
## 9 31
# Find and store the standard error
(SE_voter<-sqrt(((31/40)*(9/40))/40))
## [1] 0.06602556
# Calculate the lower bound of a 95% C.I.
(31/40) - 1.96*SE_voter
## [1] 0.6455899
# Calculate the upper bound of a 95% C.I.
(31/40) + 1.96*SE_voter
## [1] 0.9044101
We can be 95% confident that between 64.6% and 90.4% of all NSCC students are registered to vote.
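With only 9 students in the "No" group, the sample falls just short of the usual guideline of at least 10 successes and 10 failures for a normal-approximation interval. One possible alternative, sketched here rather than required by the project, is the exact (Clopper-Pearson) interval from base R's `binom.test()`. The names `yes`, `n_voter`, and `exact_ci` are assumptions for illustration.

```r
yes     <- 31   # registered voters in the sample
n_voter <- 40   # students who answered the question

# Success-failure check: the failures entry is FALSE because 9 < 10
c(successes = yes, failures = n_voter - yes) >= 10

# Exact binomial 95% confidence interval for the proportion
exact_ci <- binom.test(yes, n_voter, conf.level = 0.95)$conf.int
round(as.numeric(exact_ci), 3)
```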
Construct a 95% confidence interval of the average height of all NSCC students.
# Count non-missing (FALSE) and missing (TRUE) height values
table(is.na(nscc_student_data$Height))
##
## FALSE TRUE
## 39 1
# Find and store the mean of NSCC students' heights
(heightmean<-mean(nscc_student_data$Height, na.rm = TRUE))
## [1] 64.52436
# Find and store the standard deviation
(heightsd<-sd(nscc_student_data$Height, na.rm = TRUE))
## [1] 10.60386
# Find and store the standard error
(SE_height<-heightsd/sqrt(39))
## [1] 1.697976
# Calculate the lower bound of a 95% C.I.
heightmean - 1.96*SE_height
## [1] 61.19633
# Calculate the upper bound of a 95% C.I.
heightmean + 1.96*SE_height
## [1] 67.85239
Using the normal critical value of 1.96, we can be 95% confident that the average height of all NSCC students is between 61.2 and 67.9 inches.
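Because height is a numerical variable and the population standard deviation is estimated from 39 observations, a t critical value with 38 degrees of freedom is normally used in place of 1.96. The sketch below reuses the summary statistics printed above; the names `n_height`, `mean_height`, `sd_height`, and `t_crit` are assumptions for illustration.

```r
# Summary statistics copied from the output above
n_height    <- 39
mean_height <- 64.52436
sd_height   <- 10.60386

t_crit <- qt(0.975, df = n_height - 1)    # t critical value for 95% confidence
se_t   <- sd_height / sqrt(n_height)      # standard error of the mean

mean_height + c(-1, 1) * t_crit * se_t    # lower and upper bounds

# With the raw data loaded, t.test(nscc_student_data$Height)$conf.int
# should give the same kind of interval directly.
```

The t critical value is a little larger than 1.96, so this interval is slightly wider than the one computed with the normal critical value.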
Starbucks is considering opening a coffee shop on the NSCC Danvers campus if it believes that a larger proportion of NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine whether more NSCC students drink coffee than other Americans.
a.) Write hypotheses and determine tails of the test
\(H_0: p_{\text{students}} = 0.64\)
\(H_A: p_{\text{students}} > 0.64\)
One-tailed test
b.) Calculate sample statistics
# Count students who do and do not drink coffee
table(nscc_student_data$Coffee)
##
## No Yes
## 10 30
# Store variables
p_coffee<-30/40
q_coffee<-10/40
n_coffee<-40
Out of 40 responses, 30 NSCC students stated that they drink coffee and 10 stated that they do not.
c.) Determine probability of getting sample data by chance
# Find and store the standard error
(SE_coffee<-sqrt((p_coffee*q_coffee)/n_coffee))
## [1] 0.06846532
# Find the test statistic
(p_coffee - 0.64)/SE_coffee
## [1] 1.606653
# Find the p-value
pnorm(1.606653, lower.tail = FALSE)
## [1] 0.05406525
d.) Decision
Since the p-value of 0.054 is greater than the significance level of 0.05, we fail to reject the null hypothesis.
e.) Conclusion
There is not sufficient evidence that a larger proportion of NSCC students drink coffee than the national proportion of 64%.
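The hand calculation above uses the sample proportion in the standard error. Base R's `prop.test()` instead uses the null value of 0.64, which is the more common convention for a one-proportion test. A minimal sketch with the counts from the table above, where the object name `coffee_test` is an assumption for illustration:

```r
# One-sided test of H0: p = 0.64 against HA: p > 0.64
# correct = FALSE disables the continuity correction
coffee_test <- prop.test(x = 30, n = 40, p = 0.64,
                         alternative = "greater", correct = FALSE)

sqrt(unname(coffee_test$statistic))  # z statistic based on the null SE
coffee_test$p.value                  # one-sided p-value
```

The p-value from this version is somewhat larger than 0.054, so the decision to fail to reject the null hypothesis is the same under either standard error.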