In this project, students will demonstrate their understanding of inference on categorical data. Inference on numerical data is mixed in, and students are expected to identify the proper type of inference for each variable type. Unless stated otherwise, assume a significance level of 0.05.
The project uses two datasets from the internet: atheism and nscc_student_data. Store the atheism dataset in your environment by running the following R chunk. Explore it with the str() function and by viewing the data frame. This exploration is not graded; it is for your own familiarity with the data.
# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")
# Load dataset into environment
load("atheism.RData")
Load the “nscc_student_data.csv” file in the R chunk below and refamiliarize yourself with this dataset as well.
# Load dataset into environment
nscc_data <- read.csv("C:/Users/selma/Desktop/Stats/nscc_data.csv")
In the 2010 playoffs, the National Football League (NFL) changed their overtime rules amid concerns that whichever team won an overtime coin toss (by luck) had a significant advantage to win the game. Nicholas Gorgievski, et al, published research in 2010 that stated that out of 414 games won in overtime up to that point, 235 were won by the team that won the coin toss. Test the claim that the team which wins the coin flip wins more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to win if they do not win any more or less than their opponent.)
\(H_0: p_1 = 0.5\)
\(H_A: p_1 > 0.5\)
# Calculate sample proportion of games won by coin flip teams.
234/414
## [1] 0.5652174
The sample proportion of games won by the team that won the coin toss is about 57%.
p <- 0.5
q <- 1 - 0.5
n <- 414
# SE
se <- sqrt(p*q/n)
# Test statistic
z <- (0.5652 - p)/se
# p-value
1- pnorm(z)
## [1] 0.00398607
The p-value (about 0.004) is smaller than 0.05, so we reject \(H_0\).
The data support the claim that the team that wins the coin toss wins more often than its opponent.
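The problem statement reports 235 overtime wins by the coin-toss winner, while the chunk above divides 234 by 414. As a minimal sketch, the same upper-tailed one-proportion z-test can be wrapped in a small function and run with the 235 count. The name `one_prop_z` is a hypothetical helper, not part of the original project.

```r
# Hypothetical helper: upper-tailed one-sample z-test for a proportion
one_prop_z <- function(x, n, p0) {
  p_hat <- x / n                           # sample proportion
  se <- sqrt(p0 * (1 - p0) / n)            # standard error under H0
  z <- (p_hat - p0) / se                   # test statistic
  p_value <- pnorm(z, lower.tail = FALSE)  # P(Z > z)
  list(p_hat = p_hat, z = z, p_value = p_value)
}

# 235 wins out of 414 overtime games, null proportion 0.5
nfl <- one_prop_z(x = 235, n = 414, p0 = 0.5)
nfl$p_value < 0.05   # TRUE means H0 is rejected at the 0.05 level
```

Computing the sample proportion inside the function avoids retyping a rounded value such as 0.5652 in the test statistic.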
For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?
# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)
\(H_0: p_1 = p_2\)
\(H_A: p_1 \neq p_2\)
# Use table() to get x1, x2, n1, and n2
table(spain2005$response)
##
## atheist non-atheist
## 115 1031
table(spain2012$response)
##
## atheist non-atheist
## 103 1042
# Store values
x1 <- 115
x2 <- 103
n1 <- 115+ 1031
n2 <- 103+ 1042
p1 <- x1/n1
p2 <- x2/n2
# p-pool
ppool <- (x1 + x2)/(n1 +n2)
The pooled proportion is about 0.095.
qpool <- 1 - ppool
# Calculate SE
se2 <- sqrt(((ppool*qpool)/n1)+((ppool*qpool)/n2))
# Test statistic
Test_st = ((p1 -p2) - 0 )/ se2
# p-value
(1-pnorm(Test_st))*2
## [1] 0.3966418
The p-value is 0.3966, which is greater than 0.05, so we fail to reject \(H_0\). There is not convincing evidence that Spain's atheism index changed from 2005 to 2012.
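The pooled two-proportion steps used for Spain can be collected into one reusable function. This is a sketch only: `two_prop_z` is a hypothetical name, and the counts are copied from the table() output above.

```r
# Hypothetical helper: two-sided pooled z-test for two proportions
two_prop_z <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)                        # pooled proportion
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # pooled SE
  z <- (p1 - p2) / se                                    # test statistic
  p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)       # two-sided p-value
  list(z = z, p_value = p_value)
}

# Spain: 115 of 1146 atheist in 2005, 103 of 1145 in 2012
spain <- two_prop_z(115, 1146, 103, 1145)
spain$p_value
```

Taking `abs(z)` before the tail lookup keeps the two-sided p-value correct whichever group has the larger sample proportion.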
Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?
# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)
\(H_0: p_1 = p_2\)
\(H_A: p_1 \neq p_2\)
# Use table() to get x1, x2, n1, and n2
table(USA2005$response)
##
## atheist non-atheist
## 10 992
table(USA2012$response)
##
## atheist non-atheist
## 50 952
# Pooled Proportion (x1 + x2)/(n1 + n2)
x1_3 <- 10
x2_3 <- 50
n1_3 <-10+ 992
n2_3 <- 50 + 952
p1_3 <- x1_3/n1_3
p2_3 <- x2_3/n2_3
p_pool_3 <- (x1_3 + x2_3)/(n1_3 +n2_3)
qpool <- 1 - p_pool_3
# Calculate SE
se3 <- sqrt((p_pool_3*qpool)/n1_3 + (p_pool_3*qpool)/n2_3)
# Test statistic
Test_st3 = ((p1_3 - p2_3) - 0) / se3
# Two-sided p-value (the test statistic is negative here, so use the lower tail)
2*pnorm(-abs(Test_st3))
The p-value is far below 0.05 (on the order of \(10^{-7}\)), so we reject \(H_0\). The data support the claim that the atheism index in the United States changed between 2005 and 2012.
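One way to cross-check a hand-computed two-proportion test is base R's `prop.test()`. With `correct = FALSE` (no continuity correction) its chi-squared statistic equals the square of the pooled z statistic. The sketch below uses the United States counts from the table() output above.

```r
# United States: 10 of 1002 atheist in 2005, 50 of 1002 in 2012
usa <- prop.test(x = c(10, 50), n = c(1002, 1002),
                 alternative = "two.sided", correct = FALSE)

usa$p.value                   # two-sided p-value
sqrt(unname(usa$statistic))   # |z|, since X-squared = z^2 without correction
```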
Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 2%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?
# Calculate sample size.
n=((1.96^2* 0.5* 0.5)/ 0.02^2)
To keep the margin of error at or below 2% with 95% confidence, at least 2,401 residents must be sampled.
Use the NSCC Student Dataset for Questions 5-8.
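The sample-size formula \(n = z^2 \, p(1-p) / E^2\) can be sketched as a small function that always rounds up, since a fractional respondent must become a whole one. The name `sample_size_prop` is hypothetical. It uses `qnorm()` for the critical value rather than the rounded 1.96, and defaults to the conservative choice p = 0.5.

```r
# Hypothetical helper: minimum sample size for estimating a proportion
sample_size_prop <- function(moe, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)        # critical value (about 1.96 for 95%)
  ceiling(z^2 * p * (1 - p) / moe^2)    # round up to a whole person
}

sample_size_prop(moe = 0.02)   # 2% margin of error, 95% confidence
```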
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.
# sample data
table(nscc_data$VoterReg)
##
## No Yes
## 9 31
# Calculate p_Hat
p_Hat <- 31/40
q_Hat <- 1-p_Hat
# Standard error of the sample proportion
SE <- sqrt(p_Hat*q_Hat/40)
# Lower Bound of CI
p_Hat - 1.96*SE
## [1] 0.6455899
# Upper Bound of CI
p_Hat + 1.96*SE
## [1] 0.9044101
We are 95% confident that the true proportion of NSCC students who are registered voters is between 64.6% and 90.4%.
## Question 6
Construct a 95% confidence interval of the average height of all NSCC students.
# Sample data
mn<- mean(nscc_data$Height,na.rm =TRUE)
std <- sd(nscc_data$Height,na.rm =TRUE)
# Find the sample size
table(is.na(nscc_data$Height))
##
## FALSE TRUE
## 39 1
# Find the 95% critical value for df = 38
t <- abs(qt(p = 0.025, df = 38))
# Margin of error
E = t *(std/sqrt(39))
# Find lower bound interval
mn - E
## [1] 61.08699
# Find upper bound interval
mn + E
## [1] 67.96173
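The t-interval steps above can also be written so that the sample size and degrees of freedom are derived from the data rather than typed in by hand. This is a sketch under assumptions: `mean_ci` is a hypothetical helper, and `demo_heights` is a made-up vector used only for illustration, not the NSCC data.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, conf = 0.95) {
  x <- x[!is.na(x)]                             # drop missing values
  n <- length(x)                                # usable sample size
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)  # critical t value
  moe <- t_crit * sd(x) / sqrt(n)               # margin of error
  c(lower = mean(x) - moe, upper = mean(x) + moe)
}

# Made-up heights in inches, for illustration only
demo_heights <- c(62, 65, NA, 70, 68, 64, 66)
mean_ci(demo_heights)
# The result should agree with t.test(demo_heights)$conf.int
```

With the project data loaded, the same call would presumably be `mean_ci(nscc_data$Height)`.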
We are 95% confident that the average height of all NSCC students is between 61.087 and 67.962 inches.
## Question 7
Starbucks is considering opening a coffee shop on the NSCC Danvers campus if they believe that more NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine if more NSCC students drink coffee than other Americans.
a.) Write hypotheses and determine tails of the test
\(H_0: p_1 = 0.64\)
\(H_A: p_1 > 0.64\)
b.) Calculate sample statistics
# Find the sample proportion of NSCC students who drink coffee
table(nscc_data$Coffee)
##
## No Yes
## 10 30
# Sample proportion
phat <- 30/40
qhat <- 1 - phat
# SE under the null hypothesis (p = 0.64)
se <- sqrt((0.64*0.36)/40)
# Test statistic
Testst <- (phat - 0.64)/se
c.) Determine probability of getting sample data by chance and use that to reject Ho or fail to reject Ho
# calculate p-value.
1-pnorm(Testst)
## [1] 0.07361613
The p-value is 0.0736, which is greater than 0.05, so we fail to reject \(H_0\).
Conclusion:
The data do not provide enough evidence that a greater proportion of NSCC students drink coffee than Americans overall (64%).
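Because the sample has only 40 students, it is worth confirming that the normal approximation behind the z-test is reasonable. A common rule of thumb (an assumption here, not stated in the project) is that the expected numbers of successes and failures under \(H_0\) should both be at least 10. The variable names below are hypothetical.

```r
# Success-failure check for the one-proportion z-test
n_coffee <- 40      # sample size from the table() output above
p0_coffee <- 0.64   # null proportion from the Gallup poll

expected_yes <- n_coffee * p0_coffee        # expected coffee drinkers under H0
expected_no <- n_coffee * (1 - p0_coffee)   # expected non-drinkers under H0

c(expected_yes = expected_yes, expected_no = expected_no)
all(c(expected_yes, expected_no) >= 10)     # TRUE means the condition holds
```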